ERPANET Seminar on Persistent Identifiers

Monica Duke

Monica Duke reports on a two-day training seminar on persistent identifiers held by ERPANET in Cork, Ireland over 17-18 June 2004.

Day One

Introduction
Welcome and Keynote
Overview of Persistent Identifier initiatives
URN
OpenURL - The Rough Guide
Info URIs
The DCMI Persistent Identifier Working Group
The CENDI Report
ARK
PURLs
Overview of the Handle System
DOI

Day Two

Identifiers at the Coal Face
EPICUR
The National Digital Data Archive (NDA)
NBN:URN Generator and Resolver
DIVA
The Publisher’s Perspective
Digital Object Identifiers for Publishing and the e-Learning Communities
Publication and Citation of Scientific and Primary Data
Information and the Government of Canada
Conclusion

This event, organised by ERPANET [1], brought together around 40 key players with an interest in the topic of persistent identifiers in order to synthesize the current state of play, debate the issues and consider what lies on the horizon in this field of activity. Participants included leaders from many of the main identifier and resolver initiatives such as DOI, OpenURL, info URIs and Handles (to name a few), together with potential users of persistent identifiers from government organisations and libraries, representatives from publishing, and implementors of systems using various identifier and resolver strategies. As the event unfolded, it became clear that the concerns of the participants were not limited to debating the pros and cons of various technical approaches, but also displayed a practical awareness of the social dimension that is inherent in all of the ongoing activities. This event was supported by University College Cork (UCC), MINERVA and the Digital Curation Centre.

University College Cork

Day One - Welcome and Keynote

We were welcomed on a sunny June day by Peter McKinney, ERPANET Coordinator, and Carol Quinn of University College Cork, which hosted the meeting.

The keynote address was given by Norman Paskin of the International DOI Foundation [2], who set the scene for the event by reviewing the development of persistent identifiers and asking the audience to reflect on the meaning of ‘persistent identifiers’. He drew attention to two landmarks in the evolution of persistent identifiers. The first was the automation of the supply chain, in which ISBNs started to be used as unique and non-changeable identifiers, and the second was the automation of information sharing brought about by the World Wide Web. Various ISO standards were developed following the first use of ISBNs; similarly, the thinking about naming on the Web also went through a series of changes, reflected in the various interpretations of the URN acronym from Universal to Uniform to Unique.

What exactly is being identified tends to shift according to the context. Technology often focuses on identifying ‘bags of bits’; libraries, on the other hand, place importance on intellectual content. Citations of a resource will refer to the work being cited, whereas in the context of purchasing, the manifestation may need to be specified (e.g. the PDF or HTML version).

The speaker made a number of other observations, emphasising the scope of the identification problem, which goes beyond the HTML Web (since the Web encompasses all objects transported by HTTP and identified by URLs), and extends to digital content outside the Web. The discussion on standards is ongoing and widespread, extending across a number of different standards bodies and beyond. Finally, the role of metadata in helping to define what is being identified was highlighted.

Overview of Persistent Identifier Initiatives

Following the keynote, the rest of the first day covered background information on the development of various persistent identifiers initiatives, while the following day was dedicated to sharing experiences with implementation of persistent identifier systems.

URN

Kathrin Schroeder of Die Deutsche Bibliothek (DDB) [3] gave an overview of the URN system, covering its syntax, functional requirements, URN namespace definition and registration, and resolution. On the following day this presentation was complemented by a report from a project funded by the German department of research, which chose URNs as the basic identifier of digital objects and examined their resolution based on the DNS (see EPICUR for details).

Developments on the URN front include an ongoing joint W3C/IETF planning group, which aims to resolve the confusion in the current ‘contemporary’ view over the distinctions and relationships between URIs, URNs and URLs; a draft report was submitted as recently as April 2004. To date, 40 URI schemes have been registered.

The syntax of a URN is made up of distinct parts (the ‘urn:’ prefix, a namespace identifier and a namespace-specific string), which provides the flexibility to incorporate existing identifier schemes, such as ISBNs, as URN ‘namespaces’. The URN-namespace registration set-up assumes that the assignment of names within a namespace is a managed process, although there is allowance for experimental, formal and informal namespaces. Browsers currently do not support the resolution process for URNs - resolution has to be enabled through plug-ins (or other technical solutions).
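As a rough illustration of the URN syntax described above, the following Python sketch splits a URN into its namespace identifier and namespace-specific string. It assumes the basic `urn:<NID>:<NSS>` layout; the function name `parse_urn`, the simplified pattern and the sample ISBN value are illustrative assumptions and were not part of the presentation.

```python
import re

# Basic URN layout: "urn:" prefix, a namespace identifier (NID), then a
# namespace-specific string (NSS). This pattern is a simplification and does
# not enforce every character rule of the URN syntax specification.
URN_PATTERN = re.compile(r"^urn:([a-z0-9][a-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_urn(urn):
    """Return (namespace identifier, namespace-specific string) for a URN."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError("not a URN: %r" % urn)
    nid, nss = match.groups()
    # The "urn" prefix and the NID are case-insensitive, so normalise the NID.
    return nid.lower(), nss


if __name__ == "__main__":
    # Illustrative value only: an ISBN carried inside the "isbn" namespace.
    print(parse_urn("urn:isbn:0306406152"))
```

Parsing only recovers the parts of the name; turning the name into a location still needs a resolver, which is why browser support is raised as an issue.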

OpenURL - The Rough Guide

Tony Hammond, who is a member of the ANSI/NISO AX committee that is standardising the OpenURL Framework, delivered an overview of OpenURLs.

OpenURLs [4] came out of the Digital Library communities. They work with persistent identifiers to address the location of scholarly bibliographic information. The user is seen as an active participant in the linking process and the context of the user can be taken into account in the process of resolution. OpenURLs come in two flavours, the de facto version, implemented first in SFX (but also since implemented by other commercial vendors) and a draft standard which is evolving within the NISO process, started in 2001, and now nearing completion (predicted to be signed off within 2004). The draft standard contains the evolved version of OpenURL, which is no longer simply a URL, but a framework with Context Objects and a Registry.

After reviewing the history of OpenURLs, the speaker gave an overview of Context Objects (which are containers of 7 entities such as referrer, requester and resolver), and Formats, for which two representations have been elaborated (Key Encoded Values, and XML).
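To make the key/encoded-value representation more concrete, here is a minimal Python sketch that serialises a small Context Object as a query string on a resolver URL. The key names (`url_ver`, `rft_val_fmt`, `rfr_id` and the `rft.` prefix) follow the KEV format of the OpenURL Framework as later standardised in Z39.88-2004 and should be checked against the specification; the resolver address, the function name and the citation values are hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical resolver address, used only for illustration.
RESOLVER_BASE = "http://resolver.example.org/openurl"


def build_openurl(referent, referrer_id):
    """Serialise a small Context Object as key/encoded-value pairs on a URL."""
    pairs = [
        ("url_ver", "Z39.88-2004"),
        # Metadata format of the referent (here: a journal article).
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        # The referrer entity: the service that produced the link.
        ("rfr_id", referrer_id),
    ]
    # Metadata keys describing the referent carry the "rft." prefix.
    pairs.extend(("rft." + key, value) for key, value in sorted(referent.items()))
    return RESOLVER_BASE + "?" + urlencode(pairs)


if __name__ == "__main__":
    # Made-up citation values.
    print(build_openurl({"atitle": "A sample article", "issn": "1234-5678"},
                        "info:sid/example.org:demo"))
```

The same entities could equally be written out in the XML representation; the KEV form is shown here only because it fits on a single URL.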

Info URIs

Whilst introducing Info URIs [5], Stu Weibel, consulting research scientist at OCLC, related them to the larger context by identifying the different layers that need to be considered when discussing identifiers: social layer, business layer, policy layer, technology layer, functionality layer. Questions that need to be asked about identifiers could be organised around these layers. For example, in the functional layer questions about the longevity of persistence could be asked; the policy layer addresses questions such as the right to assign identifiers, governance models and identifier management (for example, can identifiers be recycled?). The social layer examines the issues of trust and guarantees of persistence.

Info URI is an Internet draft co-authored by Herbert Van de Sompel, Tony Hammond and Eamonn Neylon. It separates resolution from identity, which is a controversial topic. The info URI first specifies the info namespace, followed by a namespace token, followed by anything at the discretion of the namespace managing authority. Resolution is not inherent in the standard, but is expected to emerge; adoption and use will ultimately determine the future of info URIs.
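The three-part structure can be sketched in a few lines of Python. The sketch assumes that the namespace token is separated from the rest of the identifier by a slash, as in `info:lccn/2002022641`; the function name `parse_info_uri` is hypothetical, and the sample value is used purely for illustration.

```python
def parse_info_uri(uri):
    """Split an info URI into (namespace token, identifier)."""
    scheme, colon, rest = uri.partition(":")
    # URI scheme names are case-insensitive.
    if colon != ":" or scheme.lower() != "info":
        raise ValueError("not an info URI: %r" % uri)
    # Assumed layout: namespace token, a slash, then whatever the
    # namespace managing authority chooses to put after it.
    namespace, slash, identifier = rest.partition("/")
    if not namespace or slash != "/" or not identifier:
        raise ValueError("malformed info URI: %r" % uri)
    return namespace, identifier


if __name__ == "__main__":
    print(parse_info_uri("info:lccn/2002022641"))
```

Nothing here resolves the identifier: in keeping with the separation of identity from resolution, the function only names the parts.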

The DCMI Persistent Identifier Working Group

Robin Wilson, from The Stationery Office (TSO), reported on the activity of the DCMI WG (Dublin Core Metadata Initiative Working Group) [6], which was set up to meet a perceived need, but is not intended to function as a separate isolated effort. The focus of the working group is:

Broader understanding of the functional characteristics of identifiers
Explicit motivations for selection of identifiers
Clearer understanding of the available choices
Clearer understanding of the costs and benefits of assigning and maintaining identifiers

Currently the WG is carrying out an information gathering exercise, finding out what guidance is most needed. The collaboration of all present was invited in helping to define the priorities of the WG; for example the DCMI WG could provide a focus for gathering use cases on the use of identifiers. The working group is aiming to have formed a plan by the next Dublin Core conference in Shanghai, China [7].

The CENDI Report

Larry Lannom, of the Corporation for National Research Initiatives (CNRI), presented the CENDI report. CENDI is a loosely federated, but growing, interagency group of information managers from various US federal government agencies (such as the National Archives and Records Administration and NASA). The aim of the report was to bring the issue of persistent identifiers onto the agenda of higher policy makers. The report, which was made available to the attendees of this ERPANET event, reviews a number of persistent identifier applications (PURLs and Handles) and makes a number of recommendations, chiefly the establishment of an e-government interagency committee to study the implementation issues, analyse costs and present recommendations in the US.

ARK

The California Digital Library’s efforts on the Archival Resource Key (ARK) [8] were presented by Mary Heath, Access Services Manager. ARK is a unique actionable identifier, intended to meet a specific list of needs, namely:

A simple system with a low overhead for maintenance
Non-proprietary
Long-term and directly actionable identifier
Policy separation between the service provider and the assigner of the identifier
Creation of IDs that are unique in the world
Support for attached metadata and persistence level

Name assigning authorities (of which 10 are currently registered) provide the key to the persistence of ARKs. They control the assignment of object names, which are constructed according to a set of rules. The name assigning authority establishes the long-term association between identifiers and objects. The third entity is a name mapping authority, which provides the resolving mechanism. Together, the name mapping authority, name assigning authority and object name make up a URL. HTTP is used as the basis to provide the actionability of ARKs.
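The way the three entities combine into a URL can be sketched as follows. The layout (name mapping authority host, the `ark:/` label, a name assigning authority number, then the object name) reflects the ARK specification as commonly described and is an assumption here, not something shown in the talk; the host name, number and object name are made-up values, and both function names are hypothetical.

```python
ARK_LABEL = "/ark:/"


def build_ark_url(mapping_authority_host, assigning_authority_number, object_name):
    """Combine the three ARK entities into an actionable HTTP URL."""
    return "http://%s%s%s/%s" % (
        mapping_authority_host, ARK_LABEL, assigning_authority_number, object_name)


def split_ark_url(url):
    """Recover (mapping authority host, assigning authority number, object name)."""
    prefix, label, ark = url.partition(ARK_LABEL)
    if not label:
        raise ValueError("no ARK label found in %r" % url)
    host = prefix.split("://", 1)[-1]
    number, slash, name = ark.partition("/")
    if not slash or not name:
        raise ValueError("incomplete ARK in %r" % url)
    return host, number, name


if __name__ == "__main__":
    # All three values are invented for illustration.
    print(build_ark_url("example.org", "12025", "654xz321"))
```

Because the name mapping authority is only the host part, the same assigning authority number and object name can be served from a different host later without the core identifier changing.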

ARKs demand and reveal an organisational commitment to persistence, and the actionability of the ARK ties it to three things: the digital object, the metadata describing the object, and a persistence commitment statement. The commitment statement is intended to reveal the degree of permanence of an object that is guaranteed in terms of availability and stability, e.g. content may be stated to be unchanging or dynamic, guaranteed to be available, or non-guaranteed.

ARK works with other systems: it can be referenced in an OpenURL, it can contain other identifiers (e.g. ISBN), and OAI harvesters can collect metadata from ARK repositories. The costs of ARK implementation include any costs associated with the required metadata creation and the organisational time taken to develop a persistence policy.

PURLs

Stu Weibel then took the stage again to review PURLs (Persistent Uniform Resource Locators) [9]. PURLs are simply URLs which use no new protocols, but do make use of a set of tools that help to maintain URLs with a commitment to persistence. PURLs use the inherent redirection facility of the HTTP protocol and provide persistence not of the resource but of the name. A PURL server links a symbolic URL to a network location. OCLC maintains a free server which manages creation and redirection of PURLs. On contacting the server with an HTTP GET request for a PURL, the client is redirected to the networked resource. The software for setting up a PURL server, which performs a function similar to the OCLC one, is also free to download from the PURL Web site.
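The redirection idea can be modelled with a small in-memory sketch in Python. This is not the OCLC PURL software: the class `PurlServer`, its methods and the use of a 302 status with a `Location` header are assumptions made for illustration, and the paths and URLs are invented.

```python
class PurlServer:
    """Toy model of a PURL server: a table from symbolic paths to locations."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path, target_url):
        # Registering an existing path again simply updates its target,
        # which is how the name stays stable while the resource moves.
        self._targets[purl_path] = target_url

    def handle_get(self, purl_path):
        """Answer an HTTP GET for a PURL with (status code, headers)."""
        target = self._targets.get(purl_path)
        if target is None:
            return 404, {}
        # Assumed redirect status; the client follows the Location header.
        return 302, {"Location": target}


if __name__ == "__main__":
    server = PurlServer()
    server.register("/net/demo", "http://old.example.org/report.html")
    print(server.handle_get("/net/demo"))
    # The resource moves; only the table entry changes, not the PURL.
    server.register("/net/demo", "http://new.example.org/docs/report.html")
    print(server.handle_get("/net/demo"))
```

The sketch makes the stated limitation visible: what persists is the name held in the table, and the resource stays reachable only as long as someone keeps the entry up to date.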

By May 2004, over 500,000 PURLs had been registered and more than 86 million had been resolved. Overall, PURLs were summarised as remaining in niche usage by a small number of organisations (albeit with high usage).

Overview of the Handle System

The Handle system [10] was developed within a larger project for managing digital (information) objects. Larry Lannom summarised its function. The Handle system is a resolution service that resolves a name to attributes. It is used by a number of different organisations - including the IDF, DSpace, ADL SCORM, and various digital library production and research projects.

Handles resolve to typed data. The handle itself has no inherent semantics, and no semantics are bound to the resolution (although a handle could incorporate a semantic identifier, e.g. an ISBN). Handle queries fall into two categories:

those that return all data associated with a handle
those that return all data of a specific type (e.g. type URL)

The Handle resolution system has a root service as well as a number of local services distributed in layers, which can propagate, to deal with scale.

Clients first discover which handle service can resolve a handle through a global registry service. The client then retrieves the handle data from the local site holding the information. Proxies or plug-ins can handle the resolution for Web clients.
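The two-step resolution and the two kinds of query can be sketched with a toy in-memory model. This is not the Handle System protocol or its client library: the class names, the `resolve` helper, the assumption that the part of a handle before the first slash selects the local service, and the sample handle `12345/demo` are all hypothetical.

```python
class LocalHandleService:
    """Holds the typed values for the handles it is responsible for."""

    def __init__(self):
        self._records = {}

    def add_value(self, handle, value_type, data):
        self._records.setdefault(handle, []).append((value_type, data))

    def lookup(self, handle, value_type=None):
        values = self._records.get(handle, [])
        if value_type is None:
            # First kind of query: all data associated with the handle.
            return list(values)
        # Second kind of query: only the values of one type (e.g. "URL").
        return [value for value in values if value[0] == value_type]


class GlobalHandleRegistry:
    """Tells clients which local service can resolve a given handle."""

    def __init__(self):
        self._services = {}

    def register_prefix(self, prefix, service):
        self._services[prefix] = service

    def find_service(self, handle):
        # Assumption: the part before the first slash selects the service.
        prefix = handle.split("/", 1)[0]
        return self._services[prefix]


def resolve(registry, handle, value_type=None):
    service = registry.find_service(handle)   # step 1: global registry
    return service.lookup(handle, value_type) # step 2: local service


if __name__ == "__main__":
    registry = GlobalHandleRegistry()
    local = LocalHandleService()
    registry.register_prefix("12345", local)
    local.add_value("12345/demo", "URL", "http://example.org/demo")
    local.add_value("12345/demo", "EMAIL", "admin@example.org")
    print(resolve(registry, "12345/demo"))
    print(resolve(registry, "12345/demo", "URL"))
```

Keeping the registry separate from the services that hold the data is what lets local services be added in layers as the system grows.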

The handle specification is itself open; however its only known implementation is licensed. 12 million resolutions were carried out in the month of May. There is interest in embedding the handle system in the Globus Toolkit (subject to resolving licensing issues).

DOI

Norman Paskin took the stand again to give an overview of DOI [2]. The DOI aims to deliver more than simply a label or an actionable label: it is a complete system made up of four components: a numbering scheme, Internet resolution, tools for description by metadata, and policies.

The DOI syntax is an ISO standard and it is to be noted that an instance of a DOI does not necessarily encode semantics. The resolution of DOIs adopts the Handle system, and therefore a DOI can resolve to a number of things, such as a location URL, metadata or services. The level of use of DOIs can be gauged by the number of resolutions: five million per month arising from business users.
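A small Python sketch may help to show what the syntax looks like in practice and how a DOI can be made actionable in a browser. It assumes the usual prefix/suffix layout with a prefix beginning `10.`, and it assumes the public proxy at `https://doi.org/` (the proxy address in use at the time of the seminar may have differed); the function names and the sample DOI are illustrative.

```python
from urllib.parse import quote

# Assumed address of the public DOI proxy server.
DOI_PROXY = "https://doi.org/"


def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, slash, suffix = doi.partition("/")
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError("not a DOI: %r" % doi)
    return prefix, suffix


def doi_to_url(doi):
    """Turn a DOI into a URL that a proxy can redirect to the resource."""
    split_doi(doi)  # validate the basic shape first
    # Percent-encode characters that are not safe in a URL path.
    return DOI_PROXY + quote(doi, safe="/")


if __name__ == "__main__":
    print(split_doi("10.1000/182"))
    print(doi_to_url("10.1000/182"))
```

The suffix is treated as an opaque string throughout, in line with the point that a DOI does not necessarily encode semantics.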

Metadata allows us to know what is being identified, and the DOI uses the indecs Data Dictionary to provide interoperability, although any other metadata scheme could be used.

Costs are associated with number and metadata registration, providing the infrastructure and the governance. Currently such costs are borne by the assigner of the DOI but other business models could be used. Registration agencies ensure a commitment to basic standards.

The next step for DOI will be to move the complete framework to an ISO standard. Adopters of DOI include The Stationery Office in the UK, which will use DOIs for UK government publications, the EC office of publications and the German National Library of Science (using DOIs to identify data).

Graduates mingling

Day Two - Identifiers at the Coal Face

The second day provided an opportunity to learn about the varied experiences from implementors of systems employing identifiers in a wide range of projects and initiatives.

EPICUR

Kathrin Schroeder gave an overview of the EPICUR project [11], which built on the experience of URN management for online dissertation identification gained within the CARMEN-AP4 Project. The DDB aims to have a corporate strategy for URNs to guarantee high technical availability of digital objects. The strategy includes assignment, registration, administration and resolution of URNs. URNs are assigned to items that are intended for long-term preservation, for example the objects archived by the DDB itself or those archived by certified repositories (e.g. DINI). The DDB functions as the naming authority (delegations are also possible). URN-URL mappings must be forwarded to the DDB so that the resolution service can be updated. URNs can be mapped to multiple URLs so that different copies of the same object can be retrieved. The naming authority and the institution applying URNs have a contract: the URN-URL relations must be registered and the URLs must be kept up to date. An administration interface is available to manage the URN-URL relationships. The system is currently used by over 60 university libraries as well as some digital library tools. Usage statistics are available [12].

The National Digital Data Archive (NDA) of Hungary

Andras Micsik from SZTAKI at the Hungarian Academy of Sciences was the next to speak. NDA is an initiative of the Hungarian government to deliver the content infrastructure for open access to cultural assets available in digital form. It supports the digitisation of radio shows, museum objects and manuscripts, creation of metadata records, and availability via OAI-PMH. An ontology server and metadata schema server are planned. Persistent identifiers are needed not only for digital content objects and their metadata records but also events, persons (e.g. famous poets) and organisations. Locations are also candidates for being allocated an identifier since the location name (e.g. the street name) changes over time for historical reasons. The identifiers will need to handle high volumes, integrate with the OAI infrastructure and must be based on open standards. The identifier solutions are being studied and a decision has not yet been made.

NBN:URN Generator and Resolver

Adam Horvath from the National Szechenyi Library in Hungary introduced another URN resolver system, which has been implemented for its simplicity and reliability. It is targeted only at Hungarian users and can only be applied to HTML documents. Owners of an HTML document requiring a URN make an HTTP request to a server (Adam cited an example [13]). After some checks the server returns a URN, which is incorporated into the head of the HTML document. Resolution of URNs also relies on HTTP requests, and the results of resolution are returned as a list of URLs encoded in an HTML page. Digital objects which are not HTML documents need to be given an HTML cover page.

Some guidelines are issued to users for consistency - copies of the same object should be assigned the same URN; different versions (e.g. Word and HTML) should be assigned different URNs and a new URN should be assigned when the intellectual content of an object changes. A Web interface (in Hungarian) is available to invoke the URN assignment and resolution functions.

DIVA

Eva Muller, director of the Electronic Publishing Centre at the Uppsala University Library, Sweden, introduced a publishing system used by 10 universities in 3 countries (mainly in Sweden, but including one in Denmark and one in Norway). The aim of the centre is to support full-text publishing and storage, and to ensure future access. An automated and low-cost workflow was sought to support the long-term access strategy. Persistent identifiers (PIDs) were recognised as an important part of that strategy and a non-proprietary solution was required. The PID must be able to connect to a preservation copy because of the focus on guaranteeing long-term access. URN:NBNs were chosen as the identification system. The National Library of Sweden is the assigning authority and has delegated assignment to Uppsala University. The same PID is intended to be used for different manifestations of the same content. An item is a single publication, with no consideration of format.

URN:NBNs are not the only identifiers used in the system - the metadata also incorporates other identifier systems such as ISSN. PIDs are also used to identify other components in the infrastructure, for example in file format registries. DIVA [14] hopes to interact with file format registry initiatives in other countries in the long term as a strategy for guaranteed long-term access. The file format registry identifier would be used as a value for the manifestation information in the content metadata.

In the future, the co-ordination of electronic academic publishing at Swedish Universities, including long-term access and preservation, will be central to activity and the resolution service will be further developed on an international level.

The Publisher’s Perspective

Cliff Morgan from John Wiley and Sons offered the publisher’s take on the issue of identifiers by asking: What identifiers do publishers care about? In a nutshell, ISBN (“obviously!” said the speaker), ISSN and DOI were singled out as the most important. ISTC may become important in the future (though it is not well established yet), and ISWC, ISAN and V-ISAN may take on importance if multimedia becomes central. URLs are thought of mostly as locators. There has not been much uptake of PURL/POI (PURL-based Object Identifier), nor is there awareness of ARK or XRI in the publishing community.

The ISBN is in revision (bringing the number up to 13 digits) due to some pressure on some number blocks and in view of harmonisation of product identifiers in 2007 [15]. Some discussions are still ongoing on e-versions, assignment to component parts and print on demand. ISSNs are also being revised but the process is at a much earlier stage; it involves several stakeholders in the discussion of controversial issues such as what constitutes a serial: are blogs serials? Is the ISSN a work, manifestation or expression identifier?
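The move from 10 to 13 digits can be illustrated with a short worked example in Python. The sketch assumes the commonly described conversion (prefix the first nine digits with 978 and recompute the check digit using alternating weights of 1 and 3); the function name is hypothetical, and the code does not verify the check digit of the original 10-digit number.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-digit ISBN to its 13-digit form (assumed 978 prefix)."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected 10 characters, got %r" % isbn10)
    # Drop the old check character (which may be 'X') and add the prefix.
    core = "978" + digits[:9]
    if not core.isdigit():
        raise ValueError("non-digit characters in %r" % isbn10)
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(ch) * (1 if index % 2 == 0 else 3)
                for index, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


if __name__ == "__main__":
    print(isbn10_to_isbn13("0-306-40615-2"))
```

For the sample value the weighted sum of the first twelve digits is 93, so the new check digit is 7. The old check digit (2) is discarded rather than carried over, because the two schemes calculate it differently.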

ISTCs (the equivalent of the music ISWC but for textual works) may become the main standard depending on the progress of its development, which is currently stalled pending the appointment of a registration authority. ISTC is designed to be a work identifier and has clear application to rights issues.

DOIs have seen phenomenal take-up by publishers, with CrossRef [16] providing the impetus for this success. The registration of DOIs with CrossRef as a main registration authority shows a rare instance of publisher collaboration happening internationally. Citation linking is an important aspect of DOI/CrossRef in use. The success of DOI was attributed to its well-established status with publishers, ease of implementation, reliability and ability to deliver extra or more targeted services.

ONIX and ONIX for serials were singled out as the main metadata standards of importance, with interest in Dublin Core to an extent; OAI-PMH and OpenURLs, PRISM, IEEE LOM/SCORM and some of the rights metadata could become important. ONIX is overwhelmingly important because it is a trading standard, and therefore metadata sets easily mapped to ONIX are more likely to be of interest. Moreover, metadata that can be used to drive revenue, by improving reach, profile or brand, would be seen as worth pursuing.

Digital Object Identifiers for Publishing and the e-Learning Communities

Robin Wilson summarised an activity carried out by The Stationery Office (TSO) to produce a report for JISC to help guide and assist the development of a JISC digital identifier policy for use in UK Higher and Further Education. The report works around the idea of ‘Just in Time Learning’ as the future of e-learning, i.e. creating a mix-and-match culture individualised for learners. This implies content that can flow through, i.e. that a learning object is a collection of references to information objects. In the ‘just in time’ framework the persistence of identifiers is paramount along the whole life cycle. Several communities are involved in this model of e-learning, e.g. learning technologists, content creators etc. Resource discovery involves not only the object discovery but also the checking of rights management, syndication of content, and federation. There are a number of scenarios where identifiers are applied: identifying a citable reference, identifying a metadata description, and identifying a Web deliverable resource.

The report, which can be downloaded from the JISC [17] and the TSO [18] sites, reviews identifiers and systems, makes a number of recommendations to JISC and identifies a number of concerns that require further investigation, e.g. the nature of metadata to be used alongside adopted identifiers, management of resource persistence over time, and management of ID persistence over time.

Publication and Citation of Scientific and Primary Data

Michael Lautenschlager introduced a two-year pilot implementation project funded by the DFG, which followed earlier activity that had produced a report for CODATA on the citation of Scientific Primary Data. This pilot implementation has as its application area climate data, whose data sources have traditionally been known only within a small community and are often archived without context. A global identifier with a resolution mechanism was proposed for data archiving and context referencing. The aim is to close the gap between the scientific literature and related data sources. Citable data publications encourage inter-disciplinary data utilisation. However, there is currently not enough motivation for the individual scientist, since the extra work for data publication is not acknowledged.

The presentation covered the metadata that would be captured for primary data, including the assigning of a DOI identifier. Persistent identifier allocation comes with added responsibilities, for example ensuring long-term availability of the data, quality assurance of the data, and the condition that data must stay unchanged (like published articles). A stable connection between the identifier and the data must be ensured.

Some usage scenarios for the use of identifiers with climate data were then described, including submission of data and citation. The DOI can be resolved in three different ways, therefore citation can rely on any of the three resolution methods. Two of the methods involve the use of actionable URLs which can be directly used in a browser, whilst the third method requires a plug-in for browsers. A handle server [19] is made available as part of the project. Further details can be found on the project Web page [20].

Information in the Government of Canada

Cecil Somerton, Information Management Analyst with the Treasury Board of Canada, gave us the perspective of the Canadian Government.

Governments operate within a legislative context of legal requirements for access to information and protection of privacy. In Canada the Access to Information Act and Privacy Act are cases in point. Other acts such as the Library and Archives Act contain inherent messages about the importance of persistence, although an official policy on persistent access to information has not yet been articulated by government. For governments, the issue of persistence must address the question of access and accountability over time, authenticity and authority, and trusted communications. A report on persistent locators was presented to the Canadian government in 2002 and more recently a discussion paper on XML Namespace Management became available. Policy must address multi-channel delivery to the public, including Internet presence.

Conclusion

Participants were in agreement that, whilst having met its introductory aims, this event could be but a milestone on a journey that has scope for greater and better co-ordinated collaboration. Ideas were mooted for follow-on activities by means of on-line discussion fora, or mailing lists. The Digital Curation Centre was mentioned as a possible focus of activity; the DCMI Working Group [6] is also available for those wishing to participate in a collaborative activity. In collaboration with the speakers and the participants, ERPANET will be publishing a report on the seminar. The official report will appear on the ERPANET Web site [1] later in the year. The ERPANET Web site also contains a good collection of links related to the topic (follow the links to products, erpatraining, persistent identifiers, background information and reading lists). A mailing list has been set up following the workshop; interested parties can find out the instructions for this list by sending a “QUERY CORK-WORKGROUP” command to LISTSERV@JISCMAIL.AC.UK. ERPANET also hosts a discussion forum related to the training seminar [21]. Fabio Simeoni has also written a report on this event [22].

Acknowledgements

My thanks to Stu Weibel for kindly supplying some images of the conference location.

References

1. The ERPANET Web site http://www.erpanet.org/
2. The International DOI Foundation http://www.doi.org/
3. Die Deutsche Bibliothek http://www.ddb.de/index_txt.htm
4. OpenURL http://library.caltech.edu/openurl/
5. Info URI http://info-uri.info
6. The DCMI Working Group on Persistent Identifiers http://www.dublincore.org/groups/pid
7. The DC-2004 Conference Web site http://dc2004.library.sh.cn/
8. ARK http://www.cdlib.org/inside/diglib/ark/
9. PURL http://purl.org
10. The Handle System http://www.handle.net
11. EPICUR http://www.persistent-identifier.de/
12. Usage statistics http://www.persistent-identifier.de/?link=540
13. Example: URN:NBN service in the NSZL http://nbn.oszk.hu/
14. DIVA http://www.diva-portal.se/about.xsql
15. Editor’s note: This subject of ISBN-13 will be addressed by Ann Chapman of UKOLN in an article in the next issue (41) of Ariadne.
16. CrossRef http://www.crossref.org/
17. The Joint Information Systems Committee (JISC) http://www.jisc.ac.uk
18. The Stationery Office (TSO) http://www.tso.co.uk/
19. Handle System Proxy Server http://doi.tib-hannover.de:8000/
20. Citation of Scientific Data http://www.std-doi.de/
21. ERPANET FORUMS Fora on Digital Preservation and Access http://www.erpanet.org/www/workgroup/Forums/
22. Simeoni, Fabio. A report on the ERPANET Seminar on Persistent Identifiers, 17-18 June 2004, Cork, Ireland http://hairst.cdlr.strath.ac.uk/documents/Erpanet%20Training%20Seminar%20on%20Persistent%20Identifiers.pdf

Author Details

Monica Duke
UKOLN
University of Bath, UK

Email: M.Duke@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/
