Metasearch: Building a Shared, Metadata-driven Knowledge Base System

Terry Reese


Terry Reese discusses the creation of a shared knowledge base system within OSU's open-source metasearch development.

A survey of the current landscape of metasearch tools turns up surprisingly few non-commercial implementations. This is especially striking given that the library community has cultivated a vibrant open source community over the past ten or so years. One wonders, then, why this particular service has been ceded to commercial vendors. One can speculate that the creation and management of a metasearch knowledge base has played a large role [1]. It could be argued that creating and maintaining a knowledge base is simply too big a task for a single organisation. In purchasing a metasearch tool, one is really purchasing the accompanying knowledge base in the hope that a set of pre-defined resources will reduce the overall cost of ownership [2]. And while it is true that vendor-based products do simplify the support and maintenance of a metasearch knowledge base, this ease of use need not be restricted to the vendor model.

At Oregon State University (OSU), we hope to show that a metasearch tool developed outside the vendor-based model can indeed be successful. With the assistance of an LSTA (Library Services and Technology Act) grant, OSU is currently developing an open source metasearch tool for the state of Oregon. However, in developing such a tool, questions relating to the knowledge base and its management need to be addressed. Moreover, given the relatively low level of technical expertise in many of the smaller public libraries in Oregon, the tool needs to be developed in a way that limits technical barriers to implementation. This means that the metasearch tool cannot be developed using the traditional code-based connector structure employed in many vendor tools such as Innovative Interfaces' MetaFind or WebFeat's metasearch offerings. Rather, OSU's metasearch tool needs to be designed in a way that limits exposure to connection protocols such as Z39.50, OAI, etc. while still allowing users to configure specific resources for search. Likewise, the tool should be designed to allow for the collaborative management of a system's knowledge base, while allowing for local customisations as well. The result of this vision was the development of a shared, metadata-driven knowledge base that uses an abstract connector architecture - the net goal being the separation of the knowledge base from the underlying components and protocols that execute requests within the tool. Moreover, this resource has been designed to be collaborative, in that knowledge base repositories can be configured to allow shared knowledge base management between groups of users.

Metasearch Knowledge Base Systems

While metasearch tools themselves have been around for years [3] (note the Federal Geographic Data Committee's use of metasearch across FGDC GIS data nodes), their importance and usage within the library community is a relatively new phenomenon. As libraries see more and more of their physical collections made available in digital formats, they are turning to metasearch tools as a strategy for bridging the gap between traditional print and electronic services. And while many recognise the inherent limitations of metasearch services, this has not prevented libraries from looking to these services as a tool that can bring together their many heterogeneous collections and present them as part of a coherent whole. This is the primary goal of metasearch software: to present the user with a unified discovery experience [4]. But if the primary goal of metasearch software is shared, the methodology for creating these linkages is not. The vendor community has offered two competing methods for creating the knowledge base needed to interact with the various targets within a metasearch utility: code-based connector systems, used by tools such as Innovative Interfaces' MetaFind and other resources employing the Muse Global connection repositories, and metadata-driven knowledge base systems, used by tools like ExLibris's MetaSearch and OSU's metasearch application.

Code-driven Systems

The two different approaches to knowledge base management are characterised by the degrees of separation of the knowledge base from the underlying components that interact with a set of metasearch target resources. For example, within a code-driven knowledge base system, a small code-based connector, or widget, must be created in order for the metasearch tool to interact with a target resource.

Figure 1: Search progress displayed by OSU's MetaFind

The above image is taken from OSU's MetaFind implementation, which uses a code-driven knowledge base system. Through simple observation, it is easy to see how MetaFind is using individual code-based widgets to connect to each of the target resources. One can see how each target query opens a unique connector class (i.e., NewsBank.jar, EBSCOASE.jar, etc.) within the MetaFind application. This means that the metadata structure and format tend to be hard-coded directly into the resource connector, making these connectors susceptible to failure as target resources modify or change their metadata structures. In fact, allowing MetaFind to complete a search on the five resources above results in the following error:

Figure 2: Error displayed by OSU's MetaFind

Since these connectors are code-driven, maintenance must take place at the code level. For this reason, most metasearch tools that employ a code-driven knowledge base restrict access to it, frequently providing sites with hosted solutions that are managed at the vendor level. Services like WebFeat, ProQuest, Serials Solutions, etc. all use the hosted solution approach, limiting to some degree the customisation and integration options available. OSU's MetaFind implementation, for example, is completely managed through Innovative Interfaces, meaning that broken connectors or even user interface (UI) changes must be handled through the vendor. Moreover, changes to the UI tend to be strictly cosmetic, and functionality is limited to what is provided by the vendor. Furthermore, since adding resources may require the creation of a code-driven connector, adding local or non-traditional resources to this type of system can often be expensive and time-consuming.

Metadata-driven Systems

Metadata-driven metasearch tools have a number of distinct advantages over their code-driven counterparts. Firstly, there is an inherent separation between the defined resource and the underlying connection/formatting components in the system. Unlike a code-based system, which has failure points associated with both protocol and metadata format within a given connector, metadata-driven systems separate those two distinct operations, better isolating failures at the component level. This separation of connection and formatting makes it much easier for a metadata-driven system to open up connector management at the user level, since modifications to the knowledge base do not take place at the code level. In practical terms, this allows metadata-driven systems to be more nimble, as organisations have more flexibility to add, edit or remove resources from their knowledge base without having to go through the mediation of a third party. At the same time, this greater level of control requires users of metadata-driven systems to take a more active role in their knowledge base management. Unlike users of code-driven systems, who traditionally rely on a third party for knowledge base management, users of metadata-driven systems must define for themselves how the metasearch tool should communicate with a specific target.

OSU Metasearch Tool Knowledge Base Architecture

As noted above, OSU's metasearch tool uses an enhanced metadata-driven knowledge base system designed specifically around the issues of ease of use. Within the current application architecture, knowledge base development and management has been completely divorced from the underlying connection objects. Below is a rudimentary diagram of the tool's current infrastructure:

Figure 3: Infrastructure of OSU Metasearch Tool

Within the current application design, requests are passed via the API (Application Programming Interface) layer, which interacts with the knowledge base to gather information about the resources being queried. This information includes both the communication protocol information and a metadata profile of the resource to be queried. The metadata profile itself stores a number of pieces of information, including the target metadata schema, the character-set encoding of the target schema and instructions relating to the interpretation of the target schema. Once this information has been gathered, the API layer uses it to initiate the abstract protocol connectors and retrieve the raw metadata from the targets. Within this infrastructure, the use of abstract protocol classes is paramount, since no connection class can be tied to any particular resource or metadata type. This also means that OSU's metasearch tool will only be able to query targets that make their resources available using a standard query protocol or XML gateway interface. Targets available only through traditional HTTP requests are not available within OSU's tool, since these resources would require code connectors to screen scrape content, thus violating the abstract connector model. Moreover, the architecture requires that the connection class retain no information regarding the handling of supported metadata types. This is handled instead through a specialised translation service. Metadata is filtered through this translation service, which is configured using the target's metadata profile retrieved from the knowledge base, to normalise the metadata into an internal XML format. At this point, the items are resolved through an OpenURL resolver (to establish direct linking to resources) and passed through a custom relevancy-ranking algorithm before the enriched metadata is returned to the calling process.
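The request flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OSU's implementation: the class and function names, the pipe-delimited raw record format, the example.org hosts, the OpenURL parameter names and the toy ranking function are all hypothetical, and the connector returns canned records instead of making a network call.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlencode


class ProtocolConnector(ABC):
    """Abstract connector: knows a protocol, never a specific resource."""

    @abstractmethod
    def fetch(self, host, query):
        """Return a list of raw metadata records for the query."""


class StubConnector(ProtocolConnector):
    # Stand-in for a real protocol client; returns canned raw records.
    def fetch(self, host, query):
        return [f"title=Hiking {query}|issn=1040-6182",
                f"title=Field notes on {query}|issn="]


# Hypothetical knowledge base entry: protocol details plus a metadata profile.
KNOWLEDGE_BASE = {
    "example-target": {
        "protocol": "stub",
        "host": "target.example.org",
        "profile": {"schema": "pipe-delimited", "encoding": "utf-8",
                    "field_map": {"title": "title", "issn": "issn"}},
    }
}
CONNECTORS = {"stub": StubConnector()}


def translate(raw, profile):
    """Translation service: normalise one raw record using the profile."""
    pairs = dict(part.split("=", 1) for part in raw.split("|"))
    return {internal: pairs.get(source, "")
            for internal, source in profile["field_map"].items()}


def resolve(record, base="https://resolver.example.org/openurl"):
    """Attach an OpenURL-style link (parameter names are illustrative)."""
    params = {"rft.title": record["title"], "rft.issn": record["issn"]}
    return {**record, "link": f"{base}?{urlencode(params)}"}


def rank(records, query):
    """Toy relevancy ranking: count query-term occurrences in the title."""
    terms = query.lower().split()

    def score(record):
        return sum(record["title"].lower().count(t) for t in terms)

    return sorted(records, key=score, reverse=True)


def metasearch(target_id, query):
    entry = KNOWLEDGE_BASE[target_id]             # 1. consult the knowledge base
    connector = CONNECTORS[entry["protocol"]]     # 2. pick the abstract connector
    raw_records = connector.fetch(entry["host"], query)
    normalised = [translate(r, entry["profile"]) for r in raw_records]  # 3. translate
    return rank([resolve(r) for r in normalised], query)  # 4. resolve, 5. rank


results = metasearch("example-target", "oregon")
```

Note that nothing in the connector refers to a metadata schema and nothing in the translation step refers to a protocol; everything resource-specific lives in the knowledge base entry.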

Since the knowledge base plays such a critical role in ensuring the functionality of the application, a great deal of time was spent creating a knowledge base structure that would facilitate the maintenance and growth of the application. In thinking about how to structure the knowledge base, a number of issues were considered [5]:

Within a knowledge base, targets often share many properties. This is especially true of aggregate resources like EBSCOHost. For example, a user could have access to 14 different databases through EBSCOHost, but for the purposes of the knowledge base, many of the resource attributes will be shared between these 14 databases. The connection host and metadata format, for instance, would quite likely be identical between connection resources, meaning that the knowledge base should be able to share attributes between targets.
Targets may share logistical groupings that may need to be preserved to facilitate group searching of targets.
Some resources may provide multiple access points to resources outside the traditional metadata profile.

Of the 160 items or so initially targeted for metasearch at OSU, it was found that all of these resources existed within some 33 parent items. A parent item in this context is defined as the principal aggregator from which multiple resource databases could be accessed. An example of a parent item within this relationship is EBSCOHost, an aggregator that hosts access to a number of subject-specific databases. Using this information, the following knowledge base structure was developed:

Figure 4: OSU Metasearch Knowledge Base Structure

Within this knowledge base structure, there exists a top-level grouping element. This element is defined at the parent object level and inherited by the child. This grouping element allows users to create virtual collections, or sets within the knowledge base that can then be acted upon without knowing what elements are actually a part of the group. Therefore, if a request was made by the API to query all known image targets within the knowledge base, the application would simply have to pull all targets found in the image group object. Moreover, since the group attribute is inherited through the parent, child objects of the parent are automatically placed into the group. However, the child element can also be explicitly taken out of the group by overriding the appropriate attributes.
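As a rough illustration of the grouping behaviour just described, the following Python sketch resolves group membership through the parent while letting a child opt out. The record layout, the `exclude_groups` key and all resource names other than EBSCOHost are assumptions made for the example; the article does not describe how the tool stores these attributes.

```python
# Hypothetical records: groups are declared on the parent object, and a child
# may explicitly list groups it should be taken out of.
PARENTS = {
    "ImageAggregator": {"groups": {"images"}},
    "EBSCOHost": {"groups": {"articles"}},
}
CHILDREN = [
    {"name": "Image Database A", "parent": "ImageAggregator"},
    {"name": "Image Database B", "parent": "ImageAggregator",
     "exclude_groups": {"images"}},
    {"name": "Article Database A", "parent": "EBSCOHost"},
]


def effective_groups(child):
    """Groups inherited from the parent, minus any the child opts out of."""
    inherited = PARENTS[child["parent"]]["groups"]
    return inherited - child.get("exclude_groups", set())


def targets_in_group(group):
    """Everything a caller needs in order to 'query all image targets'."""
    return [c["name"] for c in CHILDREN if group in effective_groups(c)]


image_targets = targets_in_group("images")
```

The caller asks only for a group name; which targets belong to it is decided entirely by the knowledge base.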

As defined, the parent-child object structure allows metadata to be defined at both the parent and child level. Parent objects act as primitives, defining the default object properties that are then inherited by their child objects. However, the option exists to override each value inherited from the parent, allowing custom properties to be set when the child and parent objects differ. So using EBSCOHost as an example, a parent object could be configured for all EBSCOHost resources. Once the parent resource has been constructed, one simply needs to generate the child objects defining each specific resource available within the parent. As noted above, each child element inherits all the common attributes of the parent, so only the information that differs needs to be overridden at the child level.
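One minimal way to model this kind of inheritance with overrides is Python's `collections.ChainMap`, which looks a key up in the child's own values first and falls back to the parent. The attribute names, host names and database identifiers below are hypothetical, and this is a sketch of the idea, not of how OSU's tool stores its knowledge base.

```python
from collections import ChainMap

# Hypothetical parent object holding the defaults shared by its children.
ebscohost_parent = {
    "host": "ebsco.example.org",
    "protocol": "z3950",
    "metadata_schema": "marc21",
    "encoding": "utf-8",
}


def make_child(parent, **overrides):
    # Lookups hit the child's own values first, then fall back to the parent.
    return ChainMap(overrides, parent)


db_a = make_child(ebscohost_parent, database="database-a")
db_b = make_child(ebscohost_parent, database="database-b",
                  encoding="iso-8859-1")  # this child overrides one attribute

# A change made once on the parent is seen by every child that has not
# overridden that attribute.
ebscohost_parent["host"] = "ebsco-new.example.org"
```

After the last line, both children report the new host, while `db_b` keeps its own encoding.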

Figure 5: Parent-Child inheritance diagram

Ideally, this parent-child relationship should have the effect of simplifying ongoing maintenance of a resource. Since the metadata profiles, hosts and authentication information of child objects would rarely differ from those of their parent, the job of the knowledge base administrator becomes much simpler. Moreover, if a change to the metadata profile, host or connection information did occur, one would only need to modify the parent object so that all its children could inherit those changes. Using OSU as a test case, the implementation of parent-child objects reduces the number of primary maintenance targets (those that define metadata profiles, connection or authentication information) from some 160 resources to around 33, that is, (160 - 33) / 160, or roughly a 79% reduction in targets needing active maintenance.

Collaborative Knowledge Base Design

One of the shared disadvantages to any knowledge base is the need to develop a metadata profile for each particular target. Within a code-driven system, these profiles remain hidden from the collection administrator. On the other hand, a metadata-driven system remains open to collection administrators largely due to the fact that administrators are responsible for the creation and maintenance of these resources [4]. While protocols and metadata formats remain standard, how these protocols or metadata formats are implemented may vary widely between organisations [1]. Users are likely to encounter repositories that:

Do not delineate their metadata
Example:
(=773 \\$aQuaternary International May2006, Vol. 148 Issue 1, p113 25p, 1040-6182 vs. =773 \\$tQuaternary International$gMay2006, Vol. 148 Issue 1, p113$h25p$x1040-6182)
Use invalid XML structures:
Example:

<?xml version="1.0" encoding="ISO-8859-1"?>
  <!DOCTYPE dublin-core-simple>
  <record-list>
  <dc-record>
    <type></type>
    <title>100 hikes in Oregon : °bMount Hood, Crater Lake, Columbia Gorge, Eagle
    Cap Wilderness, Steens Mountain, Three Sisters Wilderness /°cRhonda &amp; George
    Ostertag ; [maps maker, George Ostertag ; photographer, George Ostertag].</title>

     <creator>Ostertag, Rhonda, °d1957-</creator>
     <creator>Ostertag, George, °d1957-</creator>

Use varied or custom implementation of metadata schemas like Dublin Core, MODS, etc.
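The first case in the list above shows why a metadata profile needs interpretation instructions and not only a schema name. The Python sketch below parses the two 773 examples given earlier; the parsing helper and the mapping of subfield codes to part names are written for this illustration and are not taken from OSU's tool.

```python
def parse_field(line):
    # Split a mnemonic-style MARC line into its tag and (code, value) subfields.
    tag = line[1:4]
    _indicators, _, data = line[4:].strip().partition("$")
    subfields = [(chunk[0], chunk[1:]) for chunk in data.split("$") if chunk]
    return tag, subfields


def citation_parts(line):
    # Map 773 subfields to named parts; anything not delineated stays unknown.
    names = {"t": "title", "g": "related_parts", "h": "extent", "x": "issn"}
    _tag, subfields = parse_field(line)
    found = {names[code]: value for code, value in subfields if code in names}
    return {name: found.get(name) for name in names.values()}


flat = citation_parts(
    r"=773 \\$aQuaternary International May2006, Vol. 148 Issue 1, p113 25p, 1040-6182")
delineated = citation_parts(
    r"=773 \\$tQuaternary International$gMay2006, Vol. 148 Issue 1, p113$h25p$x1040-6182")
```

For the delineated record the ISSN and title come straight out of the subfields; for the flat record every part is `None`, so the profile for that target would have to describe how to pull the values out of the single string.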

Fortunately, though metadata profiles will vary significantly between targets, these profiles should not vary between institutions. This means that once a profile has been created for one institution, it should be valid for other institutions using the same resource. Code-driven repositories use this principle, creating a single code-based connector for a particular database and then making that connector available to all users. Metadata connectors have always been a bit trickier given the varied nature of the metadata that could be captured for a particular resource and the desired level of information granularity set to be captured from a target resource. However, those issues aside, a good argument can be made for creating a sharable metadata knowledge base.

As part of the LSTA grant requirements, a number of Oregon public libraries have been selected to implement this tool upon completion. Since the tool is still in active development at OSU, this part of the project has yet to be realised, though it should occur sometime in the late summer of 2006. Keeping in mind the wide variety of organisations that will eventually make use of the OSU metasearch tool, a hierarchical knowledge base management system has been folded into the application framework to aid organisations in managing their metadata knowledge bases. The framework works as follows:

Figure 6: Application framework to support management of metadata knowledge base

The metasearch application has the ability to configure a clustered knowledge base management structure. An institution would set up a knowledge base repository as the master node, but would run a separate implementation of the metasearch tool as a child node within the cluster. In this way, the master knowledge base is not directly associated with any active metasearch tool implementation. The master repository's position within the cluster is special, in that its purpose is to notify each registered node on the cluster when resources have been added or changed. Likewise, the master repository acts as a clearinghouse of definitions for the nodes within the cluster. Knowledge base entries are resolved nightly, with changes being propagated throughout the registered nodes either automatically or via notification. What is more, the system uses an opt-out methodology, in that registered nodes can 'localise' instances of a metadata profile for their own node without contributing the changes to the central repository and the rest of the cluster.
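The cluster behaviour described above (a master repository, registered nodes, nightly resolution, automatic update or notification, and local opt-out) can be sketched as follows. Every class, method and attribute name here is hypothetical, as is the use of simple version counters; the article does not specify how the tool tracks changes.

```python
class Node:
    """A registered metasearch installation within the cluster."""

    def __init__(self, name, auto_update=True):
        self.name = name
        self.auto_update = auto_update
        self.profiles = {}       # profile id -> (version, definition)
        self.localised = set()   # profile ids this node manages itself
        self.notifications = []

    def localise(self, profile_id, definition):
        # Local customisation: kept on this node, never sent to the master.
        version = self.profiles.get(profile_id, (0, None))[0]
        self.profiles[profile_id] = (version, definition)
        self.localised.add(profile_id)


class MasterRepository:
    """Clearinghouse of profile definitions; runs no metasearch tool itself."""

    def __init__(self):
        self.profiles = {}
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)

    def publish(self, profile_id, definition):
        version = self.profiles.get(profile_id, (0, None))[0] + 1
        self.profiles[profile_id] = (version, definition)

    def nightly_sync(self):
        for node in self.nodes:
            for pid, (version, definition) in self.profiles.items():
                if pid in node.localised:
                    continue  # node has opted out for this profile
                if node.profiles.get(pid, (0, None))[0] >= version:
                    continue  # node is already up to date
                if node.auto_update:
                    node.profiles[pid] = (version, definition)
                else:
                    node.notifications.append(f"{pid} changed to v{version}")


master = MasterRepository()
library_a = Node("library-a")                     # accepts changes automatically
library_b = Node("library-b", auto_update=False)  # prefers to be notified
master.register(library_a)
master.register(library_b)

master.publish("ebscohost", {"schema": "marc21"})
master.nightly_sync()
library_a.localise("ebscohost", {"schema": "marc21", "encoding": "iso-8859-1"})
master.publish("ebscohost", {"schema": "marcxml"})
master.nightly_sync()
```

After the second sync, `library_a` still holds its localised profile, `library_b` has two pending notifications, and the master has never seen the local change. How conflicts between such local profiles and later master changes should be reconciled is left open here, as it is in the project itself.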

At this point, OSU's metasearch tool has yet to be installed outside OSU, so we have yet to see how this collaborative approach to knowledge base management will work in the real world. But this is obviously the next logical step. As outside organisations adopt the tool, we will be able to examine the feasibility of this model and its problems more fully. For example, how conflicts between custom local metadata profiles are resolved is still up in the air and needs to be discussed. Moreover, how master repositories link and share metadata is an issue that has yet to be addressed. What we hope to find is that this collaborative approach to knowledge base management will lower barriers for organisations that might not otherwise have had the resources to use a metasearch application, as well as substantially reducing the maintenance burden for each organisation within the cluster.

Conclusion

At this point, the OSU metasearch tool is still very much a research project on many fronts. And as part of that research, this metasearch tool will be an exploration in collaborative knowledge base maintenance. As additional user nodes are added to the cluster, it is our hope that this shared metadata infrastructure will help to keep maintenance costs down for each institution, allowing staff time to focus on building exciting new tools and services upon their metasearch platforms.

References

1. Brogan, Martha. A Survey of Digital Library Aggregation Services. Washington, DC: The Digital Library Foundation and the Council on Library and Information Resources, 2003.
http://www.diglib.org/pubs/brogan/
2. Mischo, William H. Digital Libraries: Challenges and Influential Work. July/August 2005, D-Lib Magazine
http://www.dlib.org/dlib/july05/mischo/07mischo.html
3. National Information Standards Organization. NISO MetaSearch Initiative. NISO. Viewed March 21, 2006.
http://www.niso.org/committees/MS_initiative.html
4. Dempsey, Lorcan. The (Digital) Library Environment: Ten Years After. February 2006, Ariadne Issue 46
http://www.ariadne.ac.uk/issue46/dempsey/
5. Christenson, Heather and Roy Tennant. I. Oakland, CA: California Digital Library, August 15, 2005.
http://www.cdlib.org/inside/projects/metasearch/nsdl/nsdl_report2.pdf

Author Details

Terry Reese
Cataloguer for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR 97331
USA

Email: terry.reese@oregonstate.edu
Web site: http://oregonstate.edu/~reeset
