Web Magazine for Information Professionals

Metadata: E-print Services and Long-term Access to the Record of Scholarly and Scientific Research

Michael Day looks at the long-term preservation implications of one of the OAI protocol's potential applications - e-print services.

In the April 2001 issue of D-Lib Magazine, Peter Hirtle produced an editorial highlighting the potential for confusion between the standards being developed by the Open Archives Initiative (OAI) [1] and the draft Reference Model for an Open Archival Information System (OAIS) [2]. He noted the frustration that can ensue when words that have a clearly understood meaning in one domain begin to be used by others in a different way. Hirtle ended his editorial with a suggestion for an OAIS-compatible OAI system that would offer “assurances of long-term accessibility, reliability, and integrity” [3]. While acknowledging the potential value of such a system, this column will confine itself to a preliminary investigation of the long-term preservation implications of just one of the OAI protocol’s potential applications, i.e. e-print services (or archives).

As the OAI Frequently Asked Questions (FAQ) Web page states, the use of the term ‘archive’ “reflects the origins of the OAI – in the E-Prints community where the term archive is generally accepted as a synonym for [a] repository of scholarly papers” [4]. This appears to be a development of the way computing scientists have used the term to refer to the creation of secure backup copies for a fixed period of time - a usage included in the current edition of the Oxford English Dictionary [5]:

In computing … to transfer to a store containing infrequently used files, or to a lower level in the hierarchy of memories.

The main problem with this usage is not just that it excludes connotations of “long-term value, statutory authorization and institutional policy” (to quote the OAI FAQ again) but that it might encourage complacency, subconsciously implying that all long-term digital preservation issues have been resolved.

E-print Services

E-prints are seen as a catalyst for the freeing of the scholarly and scientific literature from the cost barriers imposed by journal publishers. Supporters of the self-archiving concept (as it is sometimes known) argue that the easiest, fastest and cheapest way for authors to make their papers available is to store electronic copies of these (both pre-prints and reprints) on e-print servers. The successes of the arXiv e-print server [6] based at Los Alamos National Laboratory (LANL) and other e-print services are cited as exemplars of how authors’ distribution of e-prints can start a revolution in scholarly and scientific communication [7]. The LANL-based service initially gave access to e-prints in the domain of high-energy physics [8], but has since expanded to cover other areas of physics, mathematics and computer science. Stevan Harnad has recently noted that the Los Alamos physics service holds over 150,000 papers (and is currently growing by about 30,000 papers per year), is mirrored on over 14 sites world-wide and gets about 160,000 user hits each weekday at its US site alone [9].

The arXiv e-print server and services like CogPrints (the Cognitive Sciences Eprint Archive) [10] and WoPEc (Working Papers in Economics) [11] are examples of centralised discipline-based services. Harnad initially suggested an alternative approach based on authors self-archiving their papers on their home pages or on institutional servers. The core of Harnad’s original ‘subversive proposal’ was the establishment by authors of publicly accessible ‘archives’ of their own papers [12]:

It is a simple subversive proposal that we would make to all scholars and scientists right now: if from this day forward, everyone were to make available on the Net, in publicly accessible archives on the World Wide Web, the texts of all their current papers (and whichever past ones are still sitting on their word processors’ disks) then the transition to the PostGutenberg era would happen virtually overnight.

The main problem with this model was that papers distributed across many Web pages or FTP archives could not be collectively searched or retrieved. Back in 1994, Paul Ginsparg argued that creating a distributed ‘database’ was technically possible, but not at that time logistically feasible [13]. The standards under development by the OAI, however, mean that the distributed model can now be made to work. Harnad has argued that widespread implementation of the OAI standards would enable e-prints stored on both distributed institution-based servers and centralised services like arXiv.org to be harvested into a single virtual archive [14]:

The new breakthrough is agreement on metadata tagging standards that make the contents of distributed archives interoperable, hence harvestable into one global virtual archive, all papers searchable and retrievable by everyone for free.

OAI-compatible software is available from initiatives like eprints.org [15].
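The harvesting model described above can be illustrated with a short sketch. It assumes that each archive returns its metadata as an XML response in the style of the OAI metadata harvesting protocol, carrying unqualified Dublin Core. The namespace URIs follow version 2.0 of that protocol and are an assumption here, and the sample records, archive names and function names are all invented for illustration. No network requests are made: the responses are built in memory so that only the merging step is shown.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: OAI-PMH 2.0 envelope and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def sample_response(identifier, title, creator):
    """Build a hypothetical one-record ListRecords response (illustration only)."""
    return f"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>{identifier}</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>{title}</dc:title>
          <dc:creator>{creator}</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text, source):
    """Extract identifier, title and creators from one archive's response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "source": source,
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text or "" for c in rec.iterfind(".//dc:creator", NS)],
        })
    return records


def build_virtual_archive(responses):
    """Merge the metadata of several archives into one searchable list."""
    index = []
    for source, xml_text in responses.items():
        index.extend(parse_list_records(xml_text, source))
    return index


def search(index, term):
    """Case-insensitive title search across every harvested archive."""
    needle = term.lower()
    return [r for r in index if needle in r["title"].lower()]


if __name__ == "__main__":
    responses = {
        "archive-a": sample_response("oai:a:1", "Quantum widgets", "A. Author"),
        "archive-b": sample_response("oai:b:7", "Widgets in economics", "B. Writer"),
    }
    for hit in search(build_virtual_archive(responses), "widgets"):
        print(hit["source"], hit["identifier"], hit["title"])
```

Only metadata is merged in this sketch; the papers themselves stay on the originating servers, which is why the preservation of the content remains a separate question.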

The broad concept of e-print services received a boost in 1999 with the US National Institutes of Health’s proposal for a service called PubMed Central that would give free online access to published material in the biomedical sciences [16]. The original proposal suggested the creation of two separate services: one that would publish papers peer-reviewed by the editorial boards of participating journals, and a second for papers that had not been refereed. The proposed service received considerable criticism, partly based on the possibility of adverse impacts on public health or medical practice from papers published in the non-peer-reviewed section [17], but also motivated by suspicion of a government-controlled monopoly on scientific publishing [18]. In the event, the initial version of PubMed Central, officially launched in January 2000, only contained the peer-reviewed section of the proposed service [19]. It is being developed by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). Journals currently participating in PubMed Central include the Proceedings of the National Academy of Sciences of the United States of America and the BMJ. A separate service, tentatively known as PubMed Express, is planned to allow biomedical researchers to publish non-peer-reviewed work after some preliminary screening [20].

Following the establishment of the PubMed Central service, some authors have stepped up their campaign for the creation of services that give free access to all published papers. A number of scientists have appealed to journal publishers in the life sciences to co-operate with initiatives like PubMed Central by making their content available to publicly accessible e-print services a set time after publication [21]. In order to help bring this about, a group known as the “Public Library of Science” has been inviting scholars to sign an open letter. Signatories pledge that, from September 2001, they will publish in, undertake peer review for and serve on the editorial boards of only those serials that will make papers freely available six months after publication. This initiative moves far beyond general support for author self-archiving. It is using authors to put pressure on journal publishers to “give away” their content to publicly funded e-print initiatives. The Public Library of Science group argue not only that this will help to facilitate free access to the scientific literature but also that open e-print archives could continue the historical role of research libraries with regard to preservation. They strongly argue that this should not be the function of publishers [22]:

We believe … that the permanent, archival record of scientific research and ideas should neither be owned nor controlled by publishers, but should belong to the public, and should be freely available through an international online public library.

What the letter does not say is how this preservation role would be undertaken, and how it would be co-ordinated.

Occasionally, proponents of the self-archiving concept imply that digital preservation is just a technical issue with few organisational or economic implications. Harnad once said that nothing was simpler and more natural than to arrange for the continuous “systematic uploading and upgrading pari passu with ongoing developments in the medium” [23]. The harsh reality is that, while technical strategies for digital preservation do exist, there is less certainty about who should be organisationally responsible for managing the preservation process and what the likely cost of undertaking that role would be. This is especially true of e-print services, whose models tend to be concerned more with the rapid dissemination of current research than with ensuring continuing access to the record of scholarship [24]. There is nothing necessarily wrong with this emphasis, but the proponents of e-print services need to be seen to be addressing the preservation issue seriously. This is not always the case. With reference to PubMed Central, the editor of the journal Academic Medicine noted that many of those who had supported the move to electronic publishing had “not recognised the expense and long-term difficulty of assuming the role previously played by libraries as the science community’s archivists” [25].

A recent paper on the arXiv service has demonstrated a growing awareness of some preservation issues. For example, it notes the importance of having multiple mirror sites around the world. The service also date-stamps all of the different versions of each paper deposited in the repository and ensures that all of these can be retrieved. There is also a hint at an awareness of the challenge and potential cost of migration strategies [26]:

… there are clearly cost issues surrounding the issue of constantly migrating technology and content formats … merely preserving the article itself cannot capture the value of an electronic article. Rather the value is in the associated contextual links, associated graphics, multi-media and connecting databases that have become intrinsic parts of modern scientific literature.
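The date-stamping of successive versions mentioned above can be sketched in a few lines. This is a minimal illustration and not a description of how arXiv itself is implemented: the class names, the identifier and the `as_of` lookup are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    number: int          # 1 for the first deposit, 2 for the first revision, ...
    deposited: datetime  # date stamp recorded at deposit time
    content: bytes       # the deposited file, kept unchanged


@dataclass
class EprintRecord:
    identifier: str
    versions: list = field(default_factory=list)

    def deposit(self, content, when=None):
        """Add a new version; earlier versions are never overwritten."""
        stamp = when or datetime.now(timezone.utc)
        version = Version(len(self.versions) + 1, stamp, content)
        self.versions.append(version)
        return version

    def get(self, number=None):
        """Return the numbered version, or the latest one if no number is given."""
        if not self.versions:
            raise LookupError("no versions deposited")
        if number is None:
            return self.versions[-1]
        if not 1 <= number <= len(self.versions):
            raise LookupError(f"no version {number} of {self.identifier}")
        return self.versions[number - 1]

    def as_of(self, moment):
        """Return the latest version deposited at or before the given moment."""
        candidates = [v for v in self.versions if v.deposited <= moment]
        if not candidates:
            raise LookupError("nothing deposited by that date")
        return max(candidates, key=lambda v: v.deposited)
```

A citation to a paper can then name either a version number or a date, and the service can return exactly the text that was available at the time.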

In theory, it would be easier to develop preservation strategies for centralised services like arXiv than to ensure the preservation of a ‘virtual database’ of e-prints stored on a large number of distributed Web servers. Initiatives like the OAI may need to investigate the production of tools that are able to harvest the content (rather than just metadata) of distributed e-print services into centralised repositories for preservation. This would, however, raise complex questions about who would have permission to do this and how the intellectual property rights of both authors and institutions would be respected.

Some Digital Preservation Issues

In short, digital preservation raises several issues that need to be considered by the proponents of e-print services. First, there is the general issue of who should be responsible for preserving the record of scholarly and scientific research. The Public Library of Science group assumes that this is a role best carried out by publicly funded initiatives like PubMed Central rather than by publishers. Some publishers disagree. For example, Elsevier Science declare in their licences their intention to maintain the digital files of the ScienceDirect service in perpetuity, and commit themselves to transferring them to another depository if they find that they are unable to do so [27]. Others maintain that libraries will still have an important role in ensuring the preservation of the scholarly and scientific record, including e-prints. This may be facilitated, for example, by developments in some nations’ legal deposit legislation or through co-operation between research libraries and particular e-print services.

Another preservation issue is that of ensuring the continuing authenticity of the scholarly and scientific record. It will be important that users can be sure that a paper is what it claims to be and has not been accidentally or deliberately changed. In the digital world, it is possible to update papers frequently in order to take account of new data, more recent research or the comments of other scientists or scholars. This is one advantage of digital publication but, as Clifford Lynch notes, it is culturally opposed to the traditional view of the scholarly record as comprising “a series of discrete, permanently fixed contributions of readily attributable authorship” [28]. The date-stamping mechanisms used by services like arXiv may help with the identification of particular versions, but ‘proving’ authenticity may need to depend more upon the consistent deployment of cryptographic technology [29].
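One simple form of the cryptographic technology referred to above is a message digest recorded when a version is deposited. The sketch below assumes SHA-256 and a plain dictionary as the stored record; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the SHA-256 digest of a deposited file as a hex string."""
    return hashlib.sha256(content).hexdigest()


def register(content: bytes, datestamp: str) -> dict:
    """Record the date stamp and digest of a version at deposit time."""
    return {"datestamp": datestamp, "sha256": digest(content)}


def is_unchanged(content: bytes, record: dict) -> bool:
    """Check a retrieved file against the digest recorded at deposit."""
    return digest(content) == record["sha256"]


if __name__ == "__main__":
    record = register(b"Text of the deposited paper", "2001-06-01")
    print(is_unchanged(b"Text of the deposited paper", record))
    print(is_unchanged(b"Text of the altered paper", record))
```

A digest of this kind only shows that a file has not changed since the digest was recorded, and only if the digest itself is kept somewhere trustworthy. Establishing who produced a version would need something further, such as digital signatures, which this sketch does not attempt.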

Conclusions

There is no space in this column for a complete assessment of the digital preservation implications of the growing use of e-print services. It is hoped, however, that this contribution will help fuel a wider debate about e-print services and the long-term preservation of access to the scholarly and scientific record embodied in them. Whether it will be possible to develop an OAI-based system compatible with the OAIS model remains to be seen.

References

  1. OAI: http://www.openarchives.org/
  2. Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-R-1 (1999). http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html
  3. Peter Hirtle, “OAI and OAIS: what’s in a name?” D-Lib Magazine, vol. 7, no. 4, April 2001. http://www.dlib.org/dlib/april01/04editorial.html
  4. OAI Frequently Asked Questions: http://www.openarchives.org/faq.htm
  5. Oxford English Dictionary, 2nd ed., Oxford: Clarendon Press, 1989, vol. 1, pp. 614-615.
  6. arXiv.org e-print server: http://www.arxiv.org/
  7. Stevan Harnad and Matt Hemus, “All or none: no stable hybrid or half-way solutions for launching the learned periodical literature into the post-Gutenberg galaxy,” in: Ian Butterworth, ed., The impact of electronic publishing on the academic community, London: Portland Press, 1998. http://tiepac.portlandpress.co.uk/books/online/tiepac/session1/ch5.htm
  8. Paul Ginsparg, “First steps towards electronic research communication,” Computers in Physics, vol. 8, no. 4, 1994, pp. 390-396.
  9. Stevan Harnad, “The self-archiving initiative,” Nature, vol. 410, 26 April 2001, pp. 1024-1025. http://www.cogsci.soton.ac.uk/~harnad/Tp/nature4.htm
  10. CogPrints: http://cogprints.soton.ac.uk/
  11. WoPEc: http://netec.mcc.ac.uk/WoPEc/
  12. Stevan Harnad and Jessie Hey, “Esoteric knowledge: the scholar and scholarly publishing on the Net,” in: Lorcan Dempsey, Derek Law and Ian Mowat, eds., Networking and the future of libraries 2: managing the intellectual record, London: Library Association Publishing, 1995, pp. 110-116, here p. 114.
  13. Paul Ginsparg, “Who is Responsible?” in: Ann Okerson and James O’Donnell, eds., Scholarly journals at the crossroads: a subversive proposal for electronic publishing, Washington, D.C.: Association of Research Libraries, 1995. http://www.arl.org/scomm/subversive/sub03.html
  14. Stevan Harnad, “The self-archiving initiative,” p. 1025.
  15. eprints.org: http://www.eprints.org/
  16. Harold Varmus, PubMed Central: an NIH-operated site for electronic distribution of life sciences research reports, Bethesda, Md.: National Institutes of Health, August 30, 1999. http://www.nih.gov/welcome/director/pubmedcentral/pubmedcentral.htm
  17. Arnold S. Relman, “The NIH ‘E-biomed’ Proposal - a Potential Threat to the Evaluation and Orderly Dissemination of New Clinical Studies,” New England Journal of Medicine, vol. 340, no. 23, 10 June 1999, pp. 1828-1829.
  18. Floyd E. Bloom, “Just a minute, please,” Science, vol. 285, 9 July 1999, p. 197.
  19. PubMed Central: http://pubmedcentral.nih.gov/
  20. Richard Horton, “The refiguration of medical thought,” The Lancet, vol. 356, 1 July 2000, pp. 2-4.
  21. Richard J. Roberts, Harold E. Varmus, Michael Ashburner, Patrick O. Brown, Michael B. Eisen, Chaitan Khosla, Marc Kirchner, Roel Nusse, Matthew Scott and Barbara Wold, “Building a ‘GenBank’ of the published literature,” Science, vol. 291, 23 March 2001, pp. 2318-2319.
  22. Public Library of Science: http://www.publiclibraryofscience.org/
  23. Stevan Harnad, “On-line journals and financial fire walls,” Nature, vol. 395, 10 September 1998, pp. 127-128. http://www.cogsci.soton.ac.uk/~harnad/nature.html
  24. Christine L. Borgman, From Gutenberg to the global information infrastructure: access to information in the networked world, Cambridge, Mass.: MIT Press, 2000, p. 91.
  25. Addeane S. Caelleigh, “PubMed Central and the new publishing landscape: shifts and tradeoffs,” Academic Medicine, vol. 75, no. 1, January 2000, pp. 4-10, here p. 9.
  26. Richard E. Luce, “E-prints intersect the digital library: inside the Los Alamos arXiv,” Issues in Science and Technology Librarianship, no. 29, Winter 2001. http://www.library.ucsb.edu/istl/01-winter/article3.html
  27. Karen Hunter, “Digital archiving,” Serials Review, vol. 26, no. 3, 2000, pp. 62-64.
  28. Clifford A. Lynch, “Integrity issues in electronic publishing,” in: Robin P. Peek and Gregory B. Newby, eds., Scholarly publishing: the electronic frontier, Cambridge, Mass.: MIT Press, 1996, pp. 133-145.
  29. Peter S. Graham, “Long-term intellectual preservation,” in: Nancy E. Elkington, ed., Digital imaging technology for preservation, Mountain View, Calif.: Research Libraries Group, 1994, pp. 41-57.

Author Details

Michael Day
Research Officer
UKOLN: the UK Office for Library and Information Networking

E-mail: m.day@ukoln.ac.uk