[LIS-Forum] Metadata Harvesting-Interview

10 Sep 2003

Date: Tue, 9 Sep 2003 19:01:59 +0530
From: Aman Jha <aman.jha@ciionline.org>

Dear friends

The following interview is for those who don't read the OCLC Newsletter. It concerns the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and related new protocols that enable users to search across multiple repositories and to view and use works from different digital collections from a single workstation. A link to further reading on OAI-PMH is included at the end.

Aman Kumar Jha
Librarian
Confederation of Indian Industries (CII)
New Delhi

Interview: Herbert Van de Sompel, Developing new protocols to support and connect digital libraries

Herbert Van de Sompel is a digital library research pioneer known for his cutting-edge work in linking technologies and metadata harvesting. He received a master's degree in mathematics in 1979, a master's degree in computer science in 1981 and a doctorate in communication science in 2000, all from Ghent University, Belgium. He was Head of Library Automation at Ghent University from 1981-1998. From 2000-2001, he served as Visiting Professor in the Computer Science Department of Cornell University, where he was part of the Digital Library Research Group and taught computing methods for digital libraries. Prior to accepting his current position as Digital Library Researcher at the Research Library at Los Alamos National Laboratory (LANL), he served as Director of e-Strategy and Programmes at the British Library. The 2003 recipient of the Frederick G. Kilgour Award for Research in Library and Information Technology, Dr. Van de Sompel has played a major role in creating the Open Archives Initiative Protocol for Metadata Harvesting, the OpenURL Framework for Context-Sensitive Services and the SFX linking server. He is also a member of the OCLC Research Advisory Committee.

What do you do at Los Alamos National Laboratory?

I lead the Digital Library Research and Prototyping Team, a group of highly qualified individuals who are well established in the digital library research community. Our team explores architectures/solutions for next generation digital library services in the context of the Los Alamos Research Library and beyond.

You recently won the Frederick G. Kilgour Award for Research in Library and Information Technology for your work in linking technologies and metadata harvesting. How important are these to libraries?

I think this work is significantly changing the nature of library services. Probably the OpenURL/SFX work on linking is the one that currently has the highest impact on libraries. There are quite a few academic and research libraries across the world operating an OpenURL-compliant linking server. Libraries now have a say in the nature and target of links, and users enjoy a consistent linking experience across resources. Interestingly, the linking work is not very well known in the digital library research arena. There, the impact of metadata harvesting protocols is more visible. Maybe that's because metadata harvesting has its origins in an effort to transform scholarly publishing. Initially, libraries had little involvement in those efforts as they were mostly initiated from within a given research community. As libraries start taking a more active role in that transformation, for example, through institutional repositories, they will be adopting a protocol to become part of a global, interoperable network of scholarly repositories. So, eventually, the impact of the OAI-PMH might become more profound.

What is the Open Archives Initiative (OAI)? Can you talk a little about the philosophy and driving force behind it?

Initially, OAI was about transforming scholarly communication. The idea was to increase the impact of communication through preprints by making repositories interoperable. For example, if we could make it easier to find preprints than electronic published papers, preprints might become a viable (and cheaper) alternative, which would change the dynamics of scholarly publishing. Later, mostly through the influence of the Digital Library Federation (DLF) and the Coalition for Networked Information (CNI), our work became more generic. Preprint repositories were not the only islands on the Web. So were digitized library collections. There was a clear need for interoperability at the resource-discovery level for all kinds of materials, not only for preprints. Supported by DLF and CNI, Carl Lagoze, Senior Researcher, Information Science Program, Cornell University and I embarked on a mission to create a generic protocol for metadata harvesting. The modus operandi is the creation of simple, generic, high-quality, vetted specifications that can enhance the experience of dealing with the networked information environment we live in.

What do you see OAI accomplishing for libraries?

There are a zillion things libraries could do with the OAI Protocol for Metadata Harvesting (OAI-PMH) because it is so generic. In Los Alamos, we use it for things as diverse as a read-only Repository Access Protocol, as a protocol to synchronize bibliographic data between different components in our infrastructure, and as a means to share digital library usage logs with our research collaborators, Johan Bollen (Old Dominion University) and Luis Rocha (LANL), who will be using them for creation of a recommendation system. I have been hoping that libraries would put pressure on publishers to make their metadata harvestable via the OAI-PMH. Unfortunately, so far, I haven't seen a lot of action. Not sure why that is. Probably academic libraries are too busy worrying about integration with educational systems.
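To make the harvesting idea concrete, here is a minimal sketch (not part of the original interview) of an OAI-PMH ListRecords exchange. The base URL, record identifiers and resumption token are invented for illustration, and the sample response is heavily abbreviated; only the verb, the `metadataPrefix` and `from` arguments, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request; from_date enables incremental harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default=None, namespaces=OAI_NS)
    return identifiers, (token or None)


# Hypothetical, abbreviated response; a real one carries more elements.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2003-07-02</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2003-07-05</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

# Hypothetical repository address, used only to show the URL shape.
url = list_records_url("http://repository.example.org/oai", from_date="2003-07-01")
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
print(url)
print(identifiers, token)
```

A harvester would repeat the request with the returned resumption token until none is returned, and later re-harvest with a newer `from` date to stay synchronized.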

How's the reception been to OAI-PMH around the world?

Beyond what I had expected or hoped for. The protocol is all over the place. Whichever meeting or conference I attend, the OAI-PMH comes up. As Cliff Lynch, Executive Director, CNI, mentioned, the protocol has become a part of our information infrastructure. I feel that the timeframe in which this has happened (less than 4 years) is quite surprising and impressive. It is really difficult to give quantitative measures about the adoption of the protocol because there is no requirement to publish usage, and in many cases usage actually occurs in an Intranet context. Recently, we have seen an increasing interest in the protocol from industry. Dot.coms and dot.nets that I had never heard of are playing around with the OAI-PMH.

What were some of the challenges in developing OAI and OAI-PMH?

With the OAI, things have been rather straightforward. Initially, in 1999, my ideas to move to action in the preprint realm were supported by Paul Ginsparg, Professor of Physics and Computing and Information Science, Cornell University; Rick Luce, Research Library Director at Los Alamos National Laboratory; Cliff Lynch; Don Waters, Program Officer, Scholarly Communications, the Andrew W. Mellon Foundation; Deanna Marcum, Associate Librarian for Library Services, Library of Congress; and Rick Johnson, Enterprise Director, Scholarly Publishing and Academic Resources Coalition (SPARC). This support was crucial for assembling a group of experts to discuss how to make preprint repositories interoperable. In Michael Nelson, Assistant Professor, Department of Computer Science, Old Dominion University and Thomas Krichel, Assistant Professor, Palmer School of Library and Information Science, Long Island University, I found the perfect partners to demonstrate a possible solution: metadata harvesting. At the expert meeting, my connection with Carl Lagoze was established. Initially our collaboration was quite informal. Later, the DLF and CNI expressed interest in generalizing the protocol, and generously provided us with some funds. With that came the creation of an advisory board, and Carl and I became the OAI executives, leading the effort to deliver a protocol. Things have been very smooth. The funding expired at the end of 2002, and ever since then OAI has faced organizational challenges that are well known to anyone who has been involved in defining infrastructural components.

Developing the OAI-PMH has not been a challenge but rather great fun. It has probably been the best experience in my professional life. Carl and I have been able to assemble two generations of the OAI Technical Committee with people who excelled through their talents, insights and determination to get the job done. If I had to name a single challenge, it would be remaining faithful to the fundamental principles of sticking to scope and maintaining simplicity.

How much input from librarians went into OAI-PMH? How many libraries are involved with the project, and what are their roles?

We had people on the OAI Technical Committee from the Library of Congress, the California Digital Library, the CERN library, the British Library, the Engineering Library at the University of Illinois at Urbana-Champaign and OCLC. And we had some more people with library backgrounds on the OAI Steering Committee.

How can OCLC help advance open access content systems?

I think OCLC can play a very important role through the creation of tools that can easily be deployed in a variety of institutional contexts, through educating its constituency about the better kind of place we could live in if they would collectively move to action to initiate some real change in scholarly communication, and through using their longstanding and valuable experience in novel ways that help in making such action successful.

How do you see OAI evolving? What are the future plans?

We have several pieces of work under consideration or in the pipeline. An important one is an upcoming collaboration with the JISC RoMEO project, in the realm of expressing rights statements about metadata and content in the OAI framework. We will soon set up a new technical committee for that work. Another effort on which we have already done significant work is the so-called OAI Static Repository. It's a file-based solution to lower the barrier for people to share metadata through the OAI-PMH; they don't need to run any special server, only put a file of a specified format on a Web server. We are also thinking about creating a SOAP version of the OAI-PMH. Our good friend, Hussein Suleman (University of Cape Town), is currently doing research in that realm, the results of which will probably become the starting point of an OAI effort. And there is more great stuff we have in mind. Just trust us and wire the money so we can get going.

What other communities are working on linking technologies, and how is the library community connecting with them?

The problem that comes closest to the library's "appropriate copy" problem is the one in e-commerce whereby companies want to offer Web visitors a choice of online vendors of their products, but instead of providing that information themselves, they rely on third party services to do so. Interestingly, in their world, offering more choices is better than offering the "appropriate" choice. But it's a similar thing, and companies like Channel Intelligence are providing services in that realm. I am not sure about what exactly their solution is but I can see that it could be addressed by using OpenURL and an OpenURL Resolver. Then there's Microsoft's Smart Tags and related ideas that have led to some helper applications that let a user consult an authoritative database on a specific topic by selecting a word, or by clicking links overlaid by those applications. Again, this feels very similar to OpenURL ideas. As far as I know the library community is the first one to devise a standard in this realm of linking; we seem to be leading the way. The hope is that other communities will adopt the upcoming NISO OpenURL Standard to address their linking needs; by all means, the standard is written in such a generic manner that they could.
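As a rough illustration of the "appropriate copy" idea (not part of the original interview), the sketch below sends the same citation to two different institutional resolvers simply by changing the base URL. The resolver addresses and citation values are hypothetical, and the key names (`genre`, `issn`, `volume`, `spage`, `date`) are assumed to follow the key/value style of the early OpenURL drafts rather than the final NISO standard.

```python
from urllib.parse import urlencode


def make_openurl(resolver_base, citation):
    """Attach citation metadata as a query string to a resolver's base URL."""
    return resolver_base + "?" + urlencode(citation)


# Hypothetical citation metadata; the values are placeholders.
citation = {
    "genre": "article",
    "issn": "1234-5678",
    "volume": "9",
    "spage": "1",
    "date": "2003",
}

# Each library runs its own resolver, so the same citation yields
# a link that resolves to that library's own holdings.
link_a = make_openurl("http://resolver.example.edu/menu", citation)
link_b = make_openurl("http://links.example.ac.uk/sfx", citation)
print(link_a)
print(link_b)
```

The citation travels unchanged; only the resolver differs, which is what lets each library decide where its users' links lead.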

Tim Berners-Lee, the father of the Web, describes the next generation of the Web as the "Semantic Web," when people create many programs that collect Web content from diverse sources, process the information and exchange the results with other programs. How close are we to the Semantic Web that Berners-Lee describes?

Yes, and my American garage door talks to my Belgian toaster, and they agree I am hungry. Great idea. I think it will take a long time to realize, and that we will go through several generations of enabling technologies before we find ones that are suitable to actually get the job done. "Suitable" in this case means both powerful and simple. Like the technologies that led to the emergence of the Web. Also, the Semantic Web will probably happen very gradually, with some sample implementations in niche areas illustrating the capabilities, and by doing so, slowly convincing the world at large. Quite a different deployment than the Web itself. While I have become increasingly receptive to the Semantic Web ideas in general, I remain puzzled as to why it takes so long for a killer illustration to emerge. My personal take on this is that too many people feel uncomfortable with the technologies at hand, or the lack thereof. While I keep hearing that RDF is simple, I remain to be convinced that the technology is straightforward. Also, maybe we are too focused on seeing a garage-door/toaster illustration of the Semantic Web concepts. Probably we should settle for an application that is less exotic, more useful and hence more convincing.

Herbert Van de Sompel; Jeff Young, OCLC Consulting Software Engineer; and Thom Hickey, OCLC Chief Scientist, wrote an article entitled, "Using the OAI-PMH...differently" for D-Lib Magazine. You can read it at http://dlib.org/dlib/july03/young/07young.html.

Source: OCLC Newsletter, July 2003
