The impact of OAI-based search on access to research journal papers

STEVE HITCHCOCK, TIM BRODY, CHRISTOPHER GUTTERIDGE, LES CARR AND STEVAN HARNAD
Intelligence, Agents, Multimedia Group, School of Electronics and Computer Science, University of Southampton

Intuitively, if a product is useful and has both a priced and a free version, its total usage rate would be expected to be higher than if there is only a priced version. Evidence is emerging that this is true for online research journal papers. Authors need accessible online sites in which to deposit their published papers, and users need a means of discovering and evaluating those papers. The Open Archives Initiative (OAI) has now produced free software packages for building OAI-compliant institutional archives and OAI search services, including a citation-ranked search and impact discovery service. New data from this service shows that higher usage of free papers leads directly to a higher number of citations and thus greater research impact. Institutional archives need far more papers to be deposited, and one way of bringing this about is to implement institutional and national policies mandating the self-archiving of all funded research output in open access archives. This paper outlines why such policies are beneficial to researchers, their institutions, funders, and to research itself.


Introduction
Alert web publishers will have noticed a fundamental shift in the way users access information in a networked information environment. Instead of navigating web sites, users start with interfaces that allow them to perform particular tasks such as search and select. The most successful example is, currently, Google.
Electronic journals exist not just in a post-Gutenberg world, but a post-Google world too. The ability to locate a specified item of information precisely and instantly among the mass of information available on the web has profound implications. In the electronic environment the search engine has become the de facto interface to information, in place of the fragmented packages that have migrated from the print world.
Journal articles will also be accessed directly by search, but while Google's success has been based on an extension of the established scholarly practice of citation ranking, treating web links as citations, Google rankings do not make use of actual bibliographic citations within a paper. Further, most journal papers remain invisible to Google.
Recognising the importance to research of navigating citation space, the Open Citation Project has created a citation-ranked search and impact discovery service, Citebase (http://citebase.eprints.org/), for 'open access' (Suber 2003 [1]) journal articles (i.e. accessible for free on the web). Citebase was designed to take advantage of the growing prevalence of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) for describing the contents of distributed digital libraries. It extracts and indexes citations from published research papers stored in the larger OAI disciplinary archives, currently arXiv, CogPrints and BioMed Central, and is soon to include PubMed Central.
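The kind of harvesting on which an OAI-based service depends can be illustrated with a short Python sketch. It builds an OAI-PMH ListRecords request and reads Dublin Core titles from a response. The verb, the oai_dc metadata prefix and the XML namespaces are those defined by OAI-PMH 2.0; the base URL, the sample record and the function names are hypothetical, and the sketch makes no claim about how Citebase itself is implemented.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # optional selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hypothetical response, embedded so that the sketch runs without a network.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:archive.example.org:1</identifier>
        <datestamp>2003-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://archive.example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also have to follow resumption tokens for large result sets and then extract reference lists from the full texts, neither of which is attempted here.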
Citebase is now fully operational and is a featured service of the arXiv (http://arxiv.org) physics archives. It is more than a search engine, however. The data it collects offers new and compelling evidence that will induce a change in the way researchers access published papers, a change that will be every bit as profound as the one induced by Google on the global web.

Citebase: measuring citation impact against usage
Citebase has been described by Hitchcock et al. (2002) [2]. A large-scale evaluation of Citebase concluded that web-based citation indexing of open-access e-print archives is closer to a state of readiness for serious use than had previously been realised (Hitchcock et al. 2003 [3]). The evaluation showed that Citebase can be used simply and reliably by users regardless of their background. This suggests that, despite the bias of the current service towards physics, this powerful new functionality can be extended to all other disciplines as well.
According to the evaluation, Citebase compares favourably with other bibliographic services, such as ISI's Web of Science, even though its content size and range are still much smaller. Citebase can also provide earlier predictors and measures of impact, at the preprint phase of research.
Citation indexing has had some unexpected consequences: ISI's Science Citation Index has become a career development tool (Guédon 2001 [4]). Authors publish for impact, which is classically measured by citations. In choosing a publication authors typically seek, however inexactly, to maximise the impact of their work.
Citebase can provide some of the established scientometric measures of research impact: citation counts for the article; citation counts for the researcher; and co-citation (and eventually co-text) analyses. It can also provide some new measures of impact: citation counts for the preprint phase; usage measures ('hits', webmetrics) for preprints and postprints; and time-course analyses, early predictors, etc.

Usage/citation correlators and predictors
For the first time, pre- and post-publication citations for individual papers can be measured against usage, i.e. web downloads. According to Kurtz et al. (2003) [5]: "Perhaps the most important new information to become available for bibliometric studies is the per article readership information." Records in Citebase plot usage and citations against time for each arXiv paper indexed, as shown in Figure 1 for a highly-cited example paper. The citations are from all other papers deposited in arXiv. (The usage data ('Hits') are based only on downloads from the arXiv UK mirror server since August 1999, possibly underestimating usage by a factor of 18 across the worldwide network of arXiv mirror sites.) These charts suggest the following cycle of user actions: the preprint or postprint appears; it is downloaded (and sometimes read); eventually citations may follow (for more important papers); this generates more downloads, etc.

Correlation Generator
With the advent of new online tools authors will not only have greater scope to measure impact, they will quickly recognise the critical factor in enhancing impact, which is to make their published papers openly accessible. Lawrence (2001) [6] showed "an average of 336% more citations to (free) online articles compared to offline articles published in the same venue" for papers in computer science, i.e. free online access improves impact by a factor of over three. Kurtz et al. (2003) [5] reached a similar conclusion for astrophysics, that access increases impact. They measured the impact of the Astrophysics Data System (ADS), a comprehensive fee-based collection of journal papers that, because it is available to almost all researchers in this field, effectively replicates open access. In this case impact was measured in a novel way: "We [...]
These startling findings can now be supplemented using the remarkable Correlation Generator based on Citebase data (http://citebase.eprints.org/analysis/correlation.php). This realtime Java tool, which plots the latest data based on user-set criteria, shows that usage impact is correlated with citation impact, i.e. the more often a paper is downloaded the more likely it is to be cited. This correlation is highest for high-citation papers and authors. Results obtained with the Correlation Generator are shown in Table 1, where the correlation coefficient (r) indicates how closely the number of downloads of a paper tracks the number of citations it receives. It can be seen that r is higher for high-energy physics (hep), the largest of the sub-archives, than for arXiv as a whole, and higher for papers in the higher impact quartiles.
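A coefficient of this kind can be illustrated with a short calculation. The Python sketch below assumes that r is the ordinary Pearson coefficient, which the text does not state explicitly; under that assumption r lies between -1 and 1 and measures the strength of the linear association between downloads and citations. The records, the field names and the function names are hypothetical and are not Citebase data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def top_quartile_by_citations(papers):
    """Keep the quarter of the papers with the most citations (at least two)."""
    ranked = sorted(papers, key=lambda p: p["citations"], reverse=True)
    return ranked[: max(2, len(ranked) // 4)]


def downloads_citations_r(papers):
    """Correlation between download counts and citation counts."""
    return pearson_r(
        [p["downloads"] for p in papers],
        [p["citations"] for p in papers],
    )


# Hypothetical records, for illustration only.
PAPERS = [
    {"downloads": d, "citations": c}
    for d, c in [
        (5, 0), (8, 0), (12, 1), (15, 1), (20, 2), (22, 1),
        (30, 3), (35, 4), (50, 5), (64, 9), (80, 12), (120, 20),
    ]
]

if __name__ == "__main__":
    print("all papers:  r =", round(downloads_citations_r(PAPERS), 3))
    top = top_quartile_by_citations(PAPERS)
    print("top quartile: r =", round(downloads_citations_r(top), 3))
```

Restricting the calculation to a citation quartile, as in the second call, mirrors the user-set criteria described above, although the filters offered by the Correlation Generator itself may differ.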
The dramatic conclusion from the studies so far is that open access increases usage relative to fee-based and offline access, and that this increased usage feeds directly into increased impact for authors. Had he measured usage as well, Lawrence would presumably have found an increase in both usage and citations for free online articles compared with offline articles.
Work is ongoing to substantiate and quantify these results across other disciplines, by comparing citation rates for open-access papers with paired control papers: fee-access papers published in the same journal issue and volume, but not yet made openly accessible through self-archiving by their authors.
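The statistic to be used in such a paired comparison is not specified here. One simple possibility, sketched below in Python under that caveat, is the mean percentage by which each open-access paper out-cites its control from the same issue. The function name and the example pairs are hypothetical.

```python
def paired_citation_advantage(pairs):
    """Mean percentage citation advantage of open-access papers over controls.

    Each pair is (open_access_citations, control_citations) for two papers
    from the same journal issue. Pairs whose control has no citations are
    skipped, because a percentage increase is undefined for them.
    """
    increases = [
        100.0 * (open_access - control) / control
        for open_access, control in pairs
        if control > 0
    ]
    if not increases:
        raise ValueError("no pair has a control paper with citations")
    return sum(increases) / len(increases)


if __name__ == "__main__":
    # Hypothetical citation counts, for illustration only.
    example_pairs = [(4, 2), (3, 3), (9, 0)]
    print(paired_citation_advantage(example_pairs))
```

Skipping pairs with an uncited control is only one design choice; a study could instead compare mean citation counts across the two groups, which avoids the undefined cases at the cost of losing the pairing.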

Growth of Eprints.org and institutional archives
As results confirming the striking correlation between access and impact become more widely known, a change in the way authors make their papers available can be anticipated. As most journals are not open access, authors will have two options.
Wherever a suitable open-access journal already exists for the subject matter of their article (about 500 such open-access journals exist so far, http://www.doaj.org/), authors can choose to publish in one of these. But even according to the most optimistic estimates, fewer than 5% of the refereed-journal articles published annually today (at least 2.5 million, in 24,000 journals) as yet have an open-access journal in which they can be published (Harnad 2003 [7]).
Most authors will continue to publish in established fee-access journals, but they can in addition self-archive their papers in their own institution's open-access e-print archives. An analysis of publisher-author agreements shows that almost 55% (54.6%) of journal titles from the publishers surveyed already "explicitly left proprietary rights with the author" (Gadd et al. 2003 [8]). In other words, authors of papers in these journals can officially self-archive these papers. For the remaining papers not covered by such agreements, many of the journals will agree to self-archiving if asked.
There are several free software packages that institutions can use to create archives for their research output. The most widely used archive software is Eprints.org (http://software.eprints.org/), now running over 100 archives worldwide, both institutional archives and disciplinary archives. Eprints.org software generates archives that are compliant with the OAI-PMH and, in conjunction with the OAI, Eprints.org has been a primary motivator for new institutional archives of research journal papers.
While the number of archives and self-archived papers is growing, the absolute number of papers accessible in these archives is still small relative to the 2.5 million papers estimated to be published annually in peer-reviewed journals. In new institutional archives in particular, after an initial burst of activity, the number of deposits tends to tail off (Figure 2). These curves need to become convex upward if archive growth is to become fast enough to attain the degree of open access that is already within researchers' reach. This data is from the presentation The Research Impact Cycle, which contains further key data on the growth of open access through the self-archiving of institutional (peer-reviewed) research (http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving.ppt). Institutional archives may be the foundation for an expansion of open access to research papers, but this data shows that creating archives alone is not enough. Archive creation needs to be coupled with systematic institutional and national policies focusing on the causal connection between access, usage and impact, to ensure immediate, rapid and substantial growth in self-archived content across all research sectors in institutions.

Institutional and national policies for open access
Current data on the growth of institutional archives and their contents may not look encouraging, but this impression is misleading. The concept of open access and its effects on research impact is still new. The research community has not yet absorbed the implications of the findings on the access/usage/impact correlation. It is not only authors who benefit from maximising impact, but their institutions too, and the agencies that fund them.
Within institutions, departments are probably the best placed to implement self-archiving, through local policies, practices, and peer influences. Archive management might be best done either by the department or the institutional library. A sample policy has been formulated for the School of Electronics and Computer Science (ECS) at Southampton University and might serve as a suitable model for other institutions as well: 'All research output is to be self-archived in the departmental E-print archive. This archive forms the official record of the Department's research publications; all publication lists required for administration or promotion will be generated from this source.' (From ECS Research Self-Archiving Policy, http://www.ecs.soton.ac.uk/~lac/archpol.html) Such policies, with institutional backing, should form the core of all institutional open access and research archiving policies.
In such a scenario the funding agencies are the remaining missing link, because they complete the virtuous circle of funding-research-evaluation-funding. Decisions on what research and researchers to fund, or fund again, are informed and guided by the track record of both the research and the researchers. Track records are in turn based largely on measures of research impact, both past impact and potential impact. So if impact is in turn dependent on access and usage, it stands to reason that whatever improves the impact of research and researchers, and also makes it more measurable, is also beneficial to research assessors and funders. It allows them to decide where to make their funding investment, and helps in evaluating the return on the investment. It also levels the playing field for researchers and their institutions. Maximising the visibility and accessibility of a piece of research will not guarantee that it will be more widely used and cited, since that also depends on the quality of the research. But open access does guarantee that potential impact will no longer be lost because would-be users could not access the research.
In the UK the primary target of research evaluation is the Research Assessment (RA) exercise by the Research Councils. Harnad et al.

Figure 2. Latency of additions of records to new Eprints.org archives (broken line: new records in latency period; solid line: mean new records per archive)

Table 1. Correlation coefficient (r) between downloads and