Web Magazine for Information Professionals

Displaying SGML Documents on the World Wide Web

David Houghton discusses a method by which documents marked up using Standard Generalised Markup Language (SGML) can be used to generate a database for use in conjunction with the World Wide Web.

This article discusses a method by which documents marked up using Standard Generalised Markup Language (SGML) can be used to generate a database for use in conjunction with the World Wide Web. The tools discussed in this article, and those used in the experiments, are all public domain or shareware packages. This demonstrates that the power and flexibility of SGML can be utilised by the Internet community at little or no cost. The motivation for this work stems from the lack of standardisation of display techniques for SGML presentation.


Ever since the World Wide Web came into being in the early 1990s, the SGML community has been excited about the possibility of realising the potential of such a markup method on a global scale. The concept of an SGML WWW browser became a real possibility. SGML was suddenly recognised as the parent of Hypertext Markup Language (HTML), and as such it could be used to develop the next generation of Web browsers.

Sadly, the initial enthusiasm of the SGML community has been dampened by the software industry's failure to pick up on the concept of SGML use, despite the efforts of such people as C. M. Sperberg-McQueen and Robert F. Goldstein [1]. The reasons for this failure are open to dispute, but they include the unresolved question of how SGML should be presented. The lack of a suitable standard in this area has led SGML product manufacturers to develop their own methods of presentation. The dominance of the Netscape WWW browser has similarly complicated the standardisation of HTML.

Given this state of affairs, how can SGML users harness both the power of the Internet and the flexibility of their data? The solutions presented in this report use public domain and shareware products to provide a mechanism for using SGML-based documents in a WWW environment. There are no doubt many commercial alternatives but, as we shall discuss shortly, the presentation of SGML remains the major problem.

Presentation issues for SGML

Those companies and institutes that use SGML on a regular basis will no doubt be aware of the issue of presentation. Unlike other markup methods such as LaTeX and Word, SGML documents themselves carry little or no presentation markup. This is because they are written using logical markup rather than presentational markup: SGML emphasises the structure of the document rather than how it appears. This makes it possible to construct documents that are independent of the system used to display them.

For those readers who are not familiar with this principle it is suggested that references [3] and [5] should be consulted. SGML's power lies in the fact that logical documents can be manipulated and used in a wide range of applications such as databases, without the overheads that relate to presentation aspects.

So how are SGML documents presented? There is no easy answer to this, as the method of presentation depends on the software product used to 'view' the documents. There is at present a whole range of products, from an ever-increasing number of vendors, that attempt to provide an easy-to-use and flexible presentation method. The list of public and commercial products provided in [6] illustrates the huge range of packages available. The reader will soon discover that the common element of all these packages is the lack of standardisation of presentation method.

In order to overcome this lack of standardisation, a great deal of effort has gone into producing ISO/IEC DIS 10179.2, the Document Style Semantics and Specification Language (DSSSL). Unfortunately, at the time of writing, no software manufacturer has implemented this standard.

The pragmatic approach taken in this report is to accept the lack of a presentation standard and to accept that products such as Netscape provide sufficient flexibility to serve as an 'acceptable' viewing platform. Of course, Netscape uses HTML as its markup language and includes vendor-specific features. This fact too will require some digestion.

Background for this study

The motivation for this study has arisen from work carried out as part of the Electronic SGML Applications (ELSA) [14] project at the IIELR, De Montfort University. The ELSA project is concerned with the investigation of the use of SGML as a method of delivering scientific journal articles for use in an electronic library environment. The SGML articles were provided by Elsevier Science.

The work carried out for the ELSA project was intended to demonstrate that an on-line journal article service could be set up easily and efficiently. A prototype system using HTML versions of the SGML articles was set up as described in the following sections. The primary goal of the prototype system was to assess the user interface and to gain valuable information on the users' reaction to such a system.

The details

The problem of converting documents marked up in SGML into some form of HTML has been solved by numerous methods. Indeed, the DSSSL standard mentioned above, when implemented, may remove the need for even this step, as DSSSL includes a transformation process (the SGML Tree Transformation Process, STTP). This study uses the Copenhagen SGML Tool (CoST), a publicly available product developed by several people and available from [8]. The conversion methods used in this study are based on UNIX as the operating platform. This is because the majority of public domain products used in the study are only available for UNIX, and because the processing power associated with computers running UNIX is required for dealing with large numbers of documents.

The CoST converter

CoST (Copenhagen SGML Tool) is a structure-controlled SGML application programming tool that uses TCL (Tool Command Language) as its programming language and is built on the SGMLS parser written by James Clark. Details of this converter can be found in the documentation accompanying [8].

CoST enables the user to 'map' SGML elements and entities to a corresponding target markup format. The target format need not be SGML but can be any format that can be written in an ASCII format. The example in Appendix 1 shows how CoST can be used to translate SGML into LaTeX.

Using CoST it has been possible to map a given DTD into a corresponding HTML equivalent. It should be pointed out that the limitations of HTML for presentation markup require compromises to be made and lead inevitably to a reduction in 'richness'. The transformation cannot be a one-to-one mapping for the majority of DTDs, and so acceptable compromises must be sought. Appendix 2 shows a sample of the CoST conversion mapping.
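To make the idea of an element mapping concrete, the following is a minimal Python sketch of the same approach. It is not CoST itself: the event format, the HTML_MAP table and the convert function are assumptions made for illustration, with the tag pairs taken from the mapping in Appendix 2. Elements that have no entry in the table are silently dropped, which is one way in which the loss of 'richness' shows up in practice.

```python
from html import escape

# Hypothetical mapping: SGML element name -> (text at start tag, text at end tag).
# The tag pairs follow the Elsevier-to-HTML spec in Appendix 2.
HTML_MAP = {
    "ATL": ("<H1>", "</H1><P>"),
    "P": ("<P>", ""),
    "BF": ("<B>", "</B>"),
    "ST": ("<BR><H3>", "</H3><BR>"),
}


def convert(events):
    """Turn a stream of (kind, value) parser events into an HTML string.

    kind is "start", "end" or "data"; this event format is an assumption.
    """
    out = []
    for kind, value in events:
        if kind == "start":
            out.append(HTML_MAP.get(value, ("", ""))[0])
        elif kind == "end":
            out.append(HTML_MAP.get(value, ("", ""))[1])
        else:
            # Character data: escape the characters that are special in HTML.
            out.append(escape(value, quote=False))
    return "".join(out)


# XYZ stands for an element with no HTML equivalent; its tags are dropped.
events = [
    ("start", "ATL"), ("data", "Ions & atoms"), ("end", "ATL"),
    ("start", "P"), ("start", "XYZ"), ("data", "text"), ("end", "XYZ"),
    ("end", "P"),
]
print(convert(events))
```

Keeping the mapping in a single table mirrors the way a CoST specification lists one entry per element, so compromises for a particular DTD can be reviewed in one place.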

Having made the SGML to HTML conversion, there are a number of other factors that now need to be addressed. The presentation method of some SGML viewers requires features such as external figures or pictures to be treated in a particular way. Some viewers require figures to be treated as external entities that need defining in the DTD, while others use the <LINK> feature of SGML. It should be borne in mind that the original SGML documents may have features that HTML browsers cannot handle. It is therefore recommended that HTML be studied in detail before any SGML transformation process is undertaken.

Additional HTML browser features such as BASE HREF and BGCOLOR will need to be added to the target HTML by using UNIX scripts.
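As an illustration of this post-processing step, here is a small sketch written in Python rather than as a UNIX shell script. The function name wrap_html, its parameters and the default colour are hypothetical; the sketch only shows where a BASE HREF and a BGCOLOR attribute would be placed around the converted article body.

```python
def wrap_html(body, title, base_href, bgcolor="#FFFFFF"):
    """Wrap a converted article body in a page with BASE HREF and BGCOLOR set."""
    return (
        "<HTML>\n<HEAD>\n"
        f"<TITLE>{title}</TITLE>\n"
        f'<BASE HREF="{base_href}">\n'
        "</HEAD>\n"
        f'<BODY BGCOLOR="{bgcolor}">\n'
        f"{body}\n"
        "</BODY>\n</HTML>\n"
    )


# The body string stands in for the output of the SGML to HTML conversion.
page = wrap_html("<H1>Title</H1><P>", "An article",
                 "http://www.elsa.dmu.ac.uk/~elsa/")
print(page)
```

Setting the base URL once in the page head means that relative references to figures in the converted body do not need to be rewritten individually.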

Translating the images

Image formats that are supported in Netscape include GIF and JPEG. If the source images are not in this format then it will be necessary to perform conversions. If the amount of data is large, a UNIX script will be required and tools such as convert from ImageMagick will need to be employed. Appendix 3 shows a sample of such a UNIX conversion script.

Often it will be necessary to change the size of images and/or make the images transparent. Again a UNIX script is ideal for this and an example of image size conversion is illustrated in Appendix 3.

The database

Once information has been transformed to HTML, the majority of users will require a new database system to be set up. The SGML data may well already be held in a database that is structured around the SGML fields. Although HTML will be used for presenting documents, there is no reason why the original SGML database cannot be used, as long as it is accessible from the WWW. Proprietary database systems may present access problems, in which case a new database will need to be set up. A typical SGML database is described in Appendix 4 and involves several thousand journal articles. The articles are arranged in directories that correspond to the journals in which they appear. Each journal is associated with a subject area. No attempt has been made to cross-reference material in this database. Journals that belong to more than one subject area appear to be duplicated, but this is achieved through UNIX symbolic links.
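The symbolic link arrangement can be sketched as follows. This is a hedged Python illustration, not the script used in the study: the function add_journal, the subject area names SA0 and SA4, the journal code cs and the article file name are all assumptions chosen to resemble the layout described in Appendix 4.

```python
import os
import tempfile


def add_journal(root, journal, subject_areas):
    """Create the journal directory under its first subject area and
    symbolic links to it under any further subject areas."""
    home_area = subject_areas[0]
    journal_dir = os.path.join(root, home_area, journal)
    os.makedirs(journal_dir, exist_ok=True)
    for area in subject_areas[1:]:
        os.makedirs(os.path.join(root, area), exist_ok=True)
        link = os.path.join(root, area, journal)
        if not os.path.lexists(link):
            # Relative target, so the whole tree can be moved as one unit.
            os.symlink(os.path.join("..", home_area, journal), link)
    return journal_dir


with tempfile.TemporaryDirectory() as root:
    cs_dir = add_journal(root, "cs", ["SA0", "SA4"])
    with open(os.path.join(cs_dir, "12345.html"), "w") as f:
        f.write("<H1>Example article</H1>")
    # The article is stored once but is visible under both subject areas.
    print(os.path.islink(os.path.join(root, "SA4", "cs")))
    print(os.listdir(os.path.join(root, "SA4", "cs")))
```

Because only one physical copy of each article exists, an indexer that follows the links would see the article twice; whether that is desirable depends on how the index is built.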

The WWW and Internet communities have adopted a number of database technologies; amongst them freeWAIS-SF [9] stands out as being the most powerful and flexible. Other database systems exist that may be of more use to the reader. These include Glimpse [10], ICE [11] and Harvest [12]. This study used the freeWAIS-SF package on a DEC Alpha OSF platform, which allowed the database to be indexed and queried via a client-server architecture. The actual techniques adopted are discussed in [9], and an example of indexing and querying is shown in Appendix 5.

The user interface

The user interface to the system described in this study is a set of HTML front end pages that enable users to browse and assemble queries. The input data from the query page is converted by a CGI script to interface with the database search mechanism in use.


In order to provide an HTML front end page for the HTML documents, the kidofwais.pl Perl script written by Mike Grady [13] was modified to support HTML forms. Users enter information into this search form, and the parameters for the search are extracted and sent to the server in the form of a WAIS query command.
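The parameter extraction step can be sketched in Python as shown below. This is not the modified kidofwais.pl script, which is written in Perl and is not reproduced here. The field names term1, term2, term3 and MaximumHits are taken from the search form in Appendix 6; the function build_query, the default of 40 hits and the joining of terms with a lower-case "and" are assumptions for illustration.

```python
from urllib.parse import parse_qs


def build_query(form_data, operator="and"):
    """Extract the search terms from URL-encoded form data and join them
    into a single Boolean query string, plus the requested hit limit."""
    fields = parse_qs(form_data)  # blank fields are dropped by default
    terms = []
    for name in ("term1", "term2", "term3"):
        value = fields.get(name, [""])[0].strip()
        if value:
            terms.append(value)
    max_hits = int(fields.get("MaximumHits", ["40"])[0])
    return f" {operator} ".join(terms), max_hits


query, max_hits = build_query(
    "term1=astro*&term2=&term3=galaxy&MaximumHits=60")
print(query, max_hits)
```

Empty boxes are skipped so that a user who fills in only one term still produces a valid query.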

The ability to browse the database is provided by a subject area selection page and a journal list page. The latter allows the user to select information via a journal cover thumbnail image or an articles image. The thumbnail is linked to the journal information held on the journal's home server, while the articles image is linked to a list of the articles held in the current database for that journal.

The front end pages for browsing and searching are shown in Appendix 6.

The results

A test bed of 5000 SGML files has been set up on a Digital Alpha workstation and indexed using freeWAIS-SF. Access to this database is via an HTML form that allows users to browse and search. Although the database is potentially available to the whole WWW user community, access is restricted to the De Montfort University domain by means of the htaccess mechanism [15]. The freeWAIS-SF search engine is sufficiently fast for local use, although response times across the Internet are, of course, unpredictable because of network congestion. However, this matter is outside the scope of this project.

Complex search queries are still not possible with freeWAIS-SF, but Boolean operators are supported and provide sufficient functionality to make the system useful.

In order to evaluate the system a questionnaire has been designed and will be used on-line in trials on the university campus-wide network. The system will provide a method of delivering online electronic journal texts originally marked up in SGML using the publicly available Elsevier 2.0.3 DTD.


The study has shown that by using public domain software it is possible to provide a powerful and useful database system that allows full text search and retrieval. Whilst the data has been transformed from SGML into its HTML equivalent form and hence lost an element of its 'richness', it has been converted into a form that is more accessible by a larger community. It is worth noting that by attracting a larger audience the cost of production remains constant while circulation has effectively and significantly increased.


The author wishes to thank Elsevier Science for its support and permission to reproduce for this study extracts of its set of scientific journals provided to De Montfort University via its collaboration in Project ELSA.

The author is also grateful for the support and understanding of his colleagues Owen Williams and Anil Sharma, and for their help in setting up the experiments mentioned in this report.

The author would also like to thank the people who have dedicated long hours to the production of the publicly available software packages that have been used in this project.

References


  1. C. M. Sperberg-McQueen and Robert Goldstein. HTML to the Max : A Manifesto for Adding SGML Intelligence to the World-Wide Web,
    http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Autools/sperberg-mcqueen/sperberg.html
  2. Berners-Lee, Tim, and Daniel Connolly. Hypertext Markup Language: A Representation of Textual Information and Meta information for Retrieval and Interchange. (Draft, expired 14 January 1994.)
  3. ISO (International Organisation for Standardisation). ISO 8879-1986 (E). Information processing -- Text and Office Systems -- Standard Generalised Markup Language (SGML). First edition -- 1986-10-15. [Geneva]: ISO, 1986.
  4. Schatz, Bruce R., and Joseph B. Hardin. NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet. Science 265 (12 August 1994): 895-901.
  5. SoftQuad SGML Primer. http://www.sq.com/sgmlinfo/primbody.html
  6. Robin Cover's SGML page. http://www.sil.org/sgml/sgml.html
  7. Document Style Semantics and Specification Language (DSSSL) Lite standard, ISO/IEC DIS 10179.2, ISO, Paris (1995).
  8. Copenhagen SGML Tool, ftp://ftp.crl.com/users/ro/jenglish/cost-B4.tar.Z
  9. Ulrich Pfeifer. FreeWAIS Edition 0.4 for freeWAIS-sf 1.2 June 1995
  10. Glimpse, http://gd.tuwien.ac.at/infosys/indexers/glimpse/ghindex.html
  11. ICE, http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html
  12. Harvest, http://harvest.cs.colorado.edu/
  13. Mike Grady. kidofwais.pl, contact m-grady@uiuc.edu
  14. Project ELSA, http://www2.echo.lu/libraries/en/lib-link.html
  15. Mosaic User Authentication Tutorial, http://hoohoo.ncsa.uiuc.edu/docs-1.5/tutorials/user.html

Appendix 1 : TCL spec for SGML to LaTeX

# Each element maps to the LaTeX emitted at its start and end tags.
element ART {
  start {puts stdout "\\documentstyle\{article\}\\begin\{document\}"}
  end   {puts stdout "\\end\{document\}"}
}

element TITLE {
  start {puts stdout "\\begin\{center\}\{\\LARGE \\bf" }
  end   {puts stdout "\}\\end\{center\}"}
}

element PAR {
  start {puts stdout "\\vspace\{0.25in\}" }
}

element REF {
  start {puts stdout "\{\\it" }
  end   {puts stdout "\}\\\\" }
}

element ADDRESS {
  start { puts stdout "\{\\it" }
  end   { puts stdout "\}" }
}

element SECTION {
  start { puts stdout "\\section*\{" }
  end   { puts stdout "\}" }
}

Appendix 2 : TCL spec for Elsevier DTD to HTML

element ATL {
  start { puts stdout "<H1>" }
  end   { puts stdout "</H1><P>" }
}

element IT {
  start { puts stdout "<I>" }
  end   { puts stdout "</I>" }
}

element P {
  start { puts stdout "<P>" }
}

element BB {
  end   { puts stdout "<BR>" }
}

element BIBL {
  start { puts stdout "<P><H2>Bibliography</H2><BR><ADDRESS>" }
}

element AU {
  start { puts stdout "<H2>" }
  end   { puts stdout "</H2>" }
}

element SNM {
  start { puts stdout " " }
}

element COR {
  start { puts stdout "<BR><ADDRESS>" }
  end   { puts stdout "</ADDRESS><BR>" }
}

element AFF {
  start { puts stdout "<BR><H3>Affiliation</H3><ADDRESS>" }
  end   { puts stdout "</ADDRESS><BR>" }
}

element RV {
  start { puts stdout " " }
}

element ABS {
  start { puts stdout "<BR><H3>Abstract</H3> <ADDRESS> " }
  end   { puts stdout "</ADDRESS>" }
}

element BDY {
}

element BF {
  start { puts stdout "<B>" }
  end   { puts stdout "</B>" }
}

element ST {
  start { puts stdout "<BR><H3>" }
  end   { puts stdout "</H3><BR>" }
}

element KWDG {
  start { puts stdout "<BR><H4>Keywords : </H4>" }
}

element KWD {
  start { puts stdout " " }
}

element SUP {
  start { puts stdout "\^" }
}

element TBL {
  start {
    if {[attrValue ID]=="table_1"} {
      puts stdout "<IMG SRC=table_1.gif>"
    }
  }
}

element FIG {
  start {
    if {[isImplicit ID]} {
      puts stdout "No Fig ID"
    } else {
      if {[attrValue ID]=="1"} {
        puts stdout "<IMG SRC=fig1.gif>"
      }
      if {[attrValue ID]=="2"} {
        puts stdout "<IMG SRC=fig2.gif>"
      }
      if {[attrValue ID]=="scheme_1"} {
        puts stdout "<IMG SRC=scheme_1.gif>"
      }
    }
  }
  end {
    puts stdout "<P>"
  }
}

  case $data in {
    {&tilde;}  {set out \~}
    {&lt;}     {set out \{lt\}}
    {&szlig;}  {set out \{szlig\}}
    {&deg;}    {set out \{deg\}}
    {&minus;}  {set out -}
    {&quot;}   {set out \"}
    {&macr;}   {set out \^}
    {&pi;}     {set out \{pi\}}
    {&int;}    {set out \{int\}}
    {&rho;}    {set out \{rho\}}
    {&lambda;} {set out \{lambda\}}
    {&mu;}     {set out \{mu\}}
    {&beta;}   {set out \{beta\}}
    {&alpha;}  {set out \{alpha\}}
    {&gamma;}  {set out \{gamma\}}
    {&prop;}   {set out \{prop\}}
    {&prime;}  {set out \{prime\}}
    {&times;}  {set out *}
    {&plusmn;} {set out \{plusmn\}}
    {&ndash;}  {set out -}
    {&bull;}   {set out \{bull\}}
    {&cir;}    {set out \{cir\}}
    {&amp;}    {set out \& }
    {&squ;}    {set out \{squ\}}
    default    {set out ? }
  }
  puts stdout $out nonewline

Appendix 3 : UNIX script used to convert images

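set -x

# TOP is assumed to be set in the environment to the root of the article tree.
JOURNALS="jpms ns cs corel infman lrp aca mlblue"

for journal in $JOURNALS
do
  cd $TOP/$journal

  for article in *
  do
    cd $TOP/$journal/$article

    for tif in *.tif
    do
      [ -f "$tif" ] || continue        # skip when no TIFF files are present
      file=`basename $tif .tif`        # file name without the .tif extension
      if [ ! -f ${file}.gif ]
      then
        # Resize and reduce to two colours, then write an interlaced GIF
        convert -geometry 600 -colors 2 tif:${file}.tif \
                tif:${file}.new.tif
        convert -interlace LINE tif:${file}.new.tif \
                gif:${file}.gif
        # Make colour index 1 transparent
        giftrans -t 1 ${file}.gif > ${file}.tmp.gif
        rm -f ${file}.new.tif
        mv ${file}.tmp.gif ${file}.gif
      fi
    done
  done
done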

Appendix 4 : Database directory structure


The above diagram shows the hierarchy of directories for the experimental SGML/HTML database.

The SA numbers reflect a Subject Area division defined by the data authors.

The fourth column represents the Journal ID codes, for example, cs could be Computer Science.

Each article is uniquely defined by an article ID number which is used to name the corresponding HTML file.

Appendix 5 : Typical freeWAIS-SF indexing and query

Indexing using freeWAIS-SF

cat listoffiles | waisindex -t URL /usr/local/elsa/public_html \
                                   http://www.elsa.dmu.ac.uk/~elsa -d \
                                   /elsa3/elsa-waisindex/GASS -t fields \

where listoffiles is something of the format



waisindex -t URL
Basically, this means that we are dealing with HTML documents to be served on the WWW. With this, you need to specify the bit to chop off the file path, and the bit to add to the file path, in order to make the URL to the document, as follows:
      /usr/local/elsa/public_html       is the bit to chop off

      http://www.elsa.dmu.ac.uk/~elsa   is the bit to add
-d /elsa3/elsa-waisindex/GASS
Is the directory and filename in which to create the index, i.e. in dir /elsa3/elsa-waisindex/ create GASS.fmt, GASS etc.
-t fields
Means that you are creating an index with fields in it. Without this you don't need the GASS.fmt file (the one with the region: and so forth in).
The list of files is taken from stdin, instead of as command line arguments.
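The chop-off and add rule for building document URLs amounts to a simple prefix substitution. The Python sketch below illustrates the rule only and is not part of waisindex; the function name path_to_url and the example article path are hypothetical.

```python
def path_to_url(path, strip_prefix, add_prefix):
    """Replace the local directory prefix of a file path with a URL prefix."""
    if not path.startswith(strip_prefix):
        raise ValueError(f"{path} is not under {strip_prefix}")
    return add_prefix + path[len(strip_prefix):]


# The two prefixes are those given to waisindex -t URL above;
# the article path is a made-up example.
print(path_to_url(
    "/usr/local/elsa/public_html/TheJournals/SA0/cs/12345.html",
    "/usr/local/elsa/public_html",
    "http://www.elsa.dmu.ac.uk/~elsa",
))
```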

Querying using freeWAIS-SF

SFgate, the freeWAIS-SF gateway, is very customizable and can be set up to do all sorts of interesting things. The Search HTML source code for GASS is provided in Appendix 6.

When compiling SFgate, you must give it an application directory, something like elsa/public_html/SFgate. This is where things like headers and footers are kept.

If you want headers and footers, then you add the tag <INPUT TYPE="hidden" NAME="application" VALUE="FILEPREFIX">. A FILEPREFIX of GASS would tell SFgate to insert the header and footer GASS_header and GASS_footer from the application directory (compiled in to be /usr/local/elsa/public_html/SFgate).

The maxhits is set to 40.

The database field is used to specify the database to use: local/GASS means use the local GASS database. The location of local database files is compiled in to be /elsa3/elsa-waisindex/.

SFgate can also be used to search databases over the WWW. The hidden type is used so that the field doesn't show up as a button. A menu could be used to select different databases to be searched.

With SFgate, you can either have a text field that is dedicated solely to one type of search (e.g. an author search), or you can tell it that the text field searches one of a specified list of fields. We needed the second, more complex, way:

<INPUT SIZE=35 NAME="fieldsel_0_content">
<SELECT NAME="fieldsel_0_name">
        <OPTION VALUE="ke" SELECTED>Keyword
        <OPTION VALUE="ti">Title
        <OPTION VALUE="au">Author
        <OPTION VALUE="ab">Abstract
        <OPTION VALUE="bi">Bibliographic
        <OPTION VALUE="text">Default
</SELECT>

As you can see, there are two parts to it: the text field <INPUT SIZE=35 NAME="fieldsel_0_content"> and the selector <SELECT NAME="fieldsel_0_name">.

To bind the two together, both must have the identical prefix, here fieldsel_0_. (Also note that fieldsel_0_ is the value given to group_1 earlier.)

content is the word that you will be looking for; name is the name of the database field that will be searched. ke, ti and so forth are the database fields, as specified in the GASS.fmt file, which is used during indexing.

<SELECT NAME="group_2_tie">
        <OPTION VALUE="and">AND
</SELECT>

Basically, this links the first field search to the second; it attaches itself to the second field search via group_2_tie -> group2 -> fieldsel_1 (the name of the second field).

Appendix 6 : HTML Browse and Search pages

<TITLE>Search the Elsa GASS Collection</TITLE>
<H2><IMG SRC="/~elsa/images/tree.gif" ALIGN=MIDDLE> Search the 
GASS Database</H2>
<FORM METHOD="POST" ACTION="/cgi-bin/GASStest.pl">
<B>Search Term 1:</B>
 <INPUT NAME="term1" SIZE="40">
<B>Search Term 2:</B>
 <INPUT NAME="term2" SIZE="40">
<B>Search Term 3: </B>
<INPUT NAME="term3" SIZE="40">
<B>Maximum Number Of Hits: </B>
<INPUT NAME="MaximumHits" SIZE="3" VALUE=60>
</FORM>
The <I>Elsa GASS Database</I> can be searched using the interface 
shown above. In order to search the database, search terms can be 
entered into any of the <I>Search Term Boxes</I> provided. A search 
term consists of any word the user wishes to search for.<P>
By using the <I>Boolean Operators</I>, <B>AND</B> and 
<B>OR</B>, the 
user can restrict or broaden their query so as to receive as much 
useful information as they require.
As well as providing support for Boolean Searches, the Elsa GASS 
Search interface also provides support for <B>Right-hand truncation
</B>. This feature allows a user to enter a search term, such as 
'<I>astro*</I>' and the search will return documents containing the 
words '<I>astrophysics</I>' and '<I>astronomy</I>'. 

The Right-hand truncation operator is the asterisk (<B>*</B>).

<A HREF="/~elsa/GASS/Search/help.html"><B>Examples 
can be found by following this link.</B></A><P>
This page is maintained by <A HREF="http://www.elsa.dmu.ac.uk/~djh"> 
D.Houghton</A>, and it was last modified on Apr 12, 1996.<P>

<A HREF ="/~elsa/GASS/Search">
<IMG SRC="/~elsa/Search.gif" ALT="[Search]"></A>
<A HREF ="/~elsa/GASS/TheJournals">
<IMG SRC="/~elsa/elsa-brws.gif" ALT="[Browse]"></A>
<A HREF ="/~elsa/GASS">
<IMG SRC="/~elsa/elsa-back.gif" ALT="[Home]"></A>
<A HREF ="/~djh/ELSA/research/feedback.html">
<IMG SRC="/~elsa/elsa-fdbk.gif" ALT="[Feedback]"></A>
<A HREF = "/~elsa/GASScopyright.html">
<IMG SRC="/~elsa/copyright.gif" ALT="[Copyright]"></A>

<TITLE>Browse the ELSA Database</TITLE>
<H2><IMG SRC="/~elsa/images/tree.gif" ALIGN=MIDDLE> Browse the ELSA 
Database</H2>
<UL>
<LI> <A HREF="/~elsa/TheJournals/SA0/">Multidiscipline</A>
<LI> <A HREF="/~elsa/TheJournals/SA1/">Life and Medical Sciences</A>
<LI> <A HREF="/~elsa/TheJournals/SA2/">Physical &amp; Environmental</A>
<LI> <A HREF="/~elsa/TheJournals/SA3/">Materials Science</A>
<LI> <A HREF="/~elsa/TheJournals/SA4/">Engineering</A>
<LI> <A HREF="/~elsa/TheJournals/SA5/">Social / Behavioral Sciences and</A>
<LI> <A HREF="/~elsa/TheJournals/all.html">All Journals</A>
</UL>

<A HREF="/~elsa/Search">
<IMG SRC="/~elsa/elsa-srch.gif" ALT="[Search]"></A> 
<A HREF="/~elsa/TheJournals">
<IMG SRC="/~elsa/elsa-brws.gif" ALT="[Browse]"></A> 
<A HREF="/~elsa">
<IMG SRC="/~elsa/elsa-back.gif" ALT="[Home]"></A> 
<A HREF="http://www.cms.dmu.ac.uk/~djh/ELSA/research/feedback.html">
<IMG SRC="/~elsa/elsa-fdbk.gif" ALT="[Feedback]"></A>
<A HREF="http://www.cms.dmu.ac.uk/~djh/ELSA/research/copyright.html">
<IMG SRC="/~elsa/elsa-copy.gif" ALT="[Copyright]"></A>



