Open Bibliography for Science, Technology, and Medicine



The concept of Open Bibliography in science, technology and medicine (STM) is introduced as a combination of Open Source tools, Open specifications and Open bibliographic data. An Openly searchable and navigable network of bibliographic information and associated knowledge representations, a Bibliographic Knowledge Network, across all branches of Science, Technology and Medicine, has been designed and initiated. For this large scale endeavour, the engagement and cooperation of the multiple stakeholders in STM publishing - authors, librarians, publishers and administrators - is sought.

BibJSON, a simple structured text data format (informed by BibTeX, Dublin Core, PRISM and JSON) suitable for both serialisation and storage of large quantities of bibliographic data, is presented. BibJSON and the companion bibliographic software systems BibServer and OpenBiblio promote the quantity and quality of Openly available bibliographic data, and encourage the development of improved algorithms and services for processing the wealth of information and knowledge embedded in bibliographic data across all fields of scholarship.

Major providers of bibliographic information have joined in promoting the concept of Open Bibliography and in working together to create prototype nodes for the Bibliographic Knowledge Network. These contributions include large-scale content from PubMed and arXiv, data available from Open Access publishers, and bibliographic collections generated by the members of the project. The concept of a distributed bibliography (BibSoup) is explored.


This paper "eats its own dog food" by using the technologies described in the text. All bibliographic entry references and bibliographic entries were managed in BibJSON then included in the HTML document following the Scholarly HTML convention. The document itself is formally consistent with these specifications and can be read as a normal HTML document. It would alternatively be possible to embed bibliographic records in the document directly from BibJSON via JavaScript. The "flat HTML" should be taken as the definitive version, and can be re-purposed into other formats (PDF, DOCX, ODT).


We introduce the concept of Open Bibliography as a combination of Open Source tools, Open specifications and Open bibliographic data. Our Open Bibliography project is an umbrella of several other initiatives, most prominently the Open Knowledge Foundation’s Bibliographic Working Group [1], the JISC-funded JISC-OpenBib project at the University of Cambridge [2], and the NSF-funded Bibliographic Knowledge Network project [3]. These projects have all addressed the totality of Open bibliographic resources including design of systems, implementation of software, licenses for use and re-use, and the collection and hosting of substantial bibliographic datasets. In this article we shall concentrate on bibliographic data for articles in the Science, Technology and Medicine (STM) fields, but we introduce the reader to the wider elements of bibliography before the main results. We stress that the tools and formats exemplified here have a particularly simple modular form in STM article publishing; however, these tools and formats are designed to be both flexible and extensible, and are also capable of managing library and personal collections, monographs, multiple versions etc.

The atoms in the bibliographic universe are the records traditionally held as 3 X 5 cards in a library card catalogue, and more recently represented in structured text languages such as BibTeX, RIS, XML, JSON [4] [5] [6] [7]. This data has been commoditised, and is subject to a large scale cycle of use (publication / collection / abstracting / indexing / searching / citation) which involves all participants in scientific publication: authors, librarians, publishers and administrators.

The bibliographic data cycle is reminiscent of the water cycle, whereby water stored in the atmosphere rains down onto the land, where it may be collected in containers on various scales (drops, thimbles, buckets, ponds, and lakes), from which it flows down rivers and pipes to the ocean, thence back to the atmosphere by evaporation to close the cycle.

Similar features of the bibliographic data cycle are the variety of containers of different sizes that can be used to enclose it (individual records, publication lists, departmental collections, subject-specific repositories, and the databases behind large scale indexing services), the use of the internet to pipe data from one place to another, the way bibliographic items are recycled in the process of bibliographic citation, and the way bibliographic data is packaged and sold to universities. But there is a key difference which undermines the prevailing business model of bibliographic data as a commodity controlled by bibliographic service providers and sold to universities at a substantial profit: unlike volumes of a liquid, bibliographic data is not subject to any conservation law. Rather, bibliographic data is subject to a process of continual creation and replication. The elements of bibliographic data are facts, which in most jurisdictions cannot be copyrighted. There are few technical and legal obstacles to widespread replication of bibliographic records on a massive scale. The main limitations on such activity are social: whether individuals and organisations are adequately motivated to create and maintain open bibliographic resources. But the dynamics of creation and replication of bibliographic records have been irreversibly changed by recent technological and social developments, most notably the emergence of:

Bibliographic data has long been understood to contain important information about the influence and impact of various authors and journals on scientific disciplines [18] [19]. Now, however, instead of a relatively small number of privileged data owners being able to manage and control large bibliographic data stores [20], an individual researcher could browse millions of records, view collaboration graphs, submit complex queries, make selections and analyses of data – all on their laptop while commuting to work, provided they have adequate tools to do so.

The tools for such easy processing are not yet adequately developed, so our aim is to provide Open tools and services to make the wealth of bibliographic data available to the widest possible audience, and to promote increased understanding of science and technology, especially in interdisciplinary areas.


Traditionally, bibliography has been regarded as the study of library holdings and catalogues, and, more recently, catalogues of material published by formal publishers, repositories and other collections. We wish therefore to explain the importance of bibliography to scientists, and to argue the merits of Open Bibliography, by which we mean systematic efforts to create and maintain stores of Openly accessible [21], machine-readable bibliographic data.

The unit of bibliography is the bibliographic record, which consists of the information necessary to locate and/or identify a scholarly publication (and, increasingly, other resources besides textual material, such as authors, images and scientific datasets). The term "bibliography" is also often used to represent a personal collection of bibliographic records (and in some cases is synonymous with citation lists). We refer to such a collection here as a "bibliographic dataset".

We use the term "citation" to mean a reference to a bibliographic record within the body of a document. A citation may also be called simply a "reference". We will not say much more about citations, except to point out that an improved approach to bibliography should also be of value to citation management and analysis.

Despite the importance of bibliography, including the widespread sale of bibliographic records, there is no single syntax or agreed semantics for the publication and exchange of STM bibliographic records; scientists use whatever representation is provided by their tools, the most Open and common being BibTeX from the LaTeX [22] authoring system; the publishing community commonly uses PRISM, although some publishers have their own representation of their bibliographic data, which often consists of a mixture of Dublin Core [23], PRISM, BibTeX and their own markup approach.

By "Open Bibliography", we do not imply that there must be a single central authoritative resource. We expect that the first stage, at least, will be the identification of bibliographic collections which are Open and where the collectors can offer them with the appropriate common technology. The aggregation of such distributed collections, with some cooperation about formats and interfaces, is what we mean by the Bibliographic Knowledge Network.

We are aware of other groups with explicit commitments to use of open licenses such as K4All [24] and also Wikipedia [25] which has a large implicit bibliography. There are also a substantial number of large bibliographic repositories which are operationally fairly Open, even if the data is not explicitly declared to be so, or available only through an API rather than in bulk, such as arXiv, RePEc, BibSonomy, PhilPapers, DBLP, CiteULike, Connotea, Zotero, Mendeley etc. [26] [27] [28] [29] [30] [31] [32] [33] [34]. We have worked closely with two major Open Access publishers (International Union of Crystallography (IUCr) and BioMed Central (BMC)) [35] [36], and have collaborated with PubMed [37], with Thomas Krichel’s AuthorClaim [38] and 3Lib [39] projects, and with the Sciplore team [40].

Because this article is limited to exploring bibliography for STM, we have taken a pragmatically simple approach. Systems such as FRBR [41] and BIBO [42] make provision for complex aspects of bibliography such as multiple manifestations and representations of works and multiple versions. While these are relevant to STM bibliography in some areas [43], for the most part we do not need the complexity of these and other RDF [44] approaches, although our tools and software should be capable of leveraging them if necessary.

We report a number of prototypes in both tools and collections, and also propose that STM bibliography can be adequately represented for most immediate purposes using BibJSON [45]. Due to the intense interest in Open Bibliography, we are now very actively working on future versions of BibJSON, but the examples given in this article are fully supported by current software. In the spirit of the "perpetual beta" approach on the web, we intend to release early and often in public view so that a broad community becomes intimately involved in the design of specifications. As a first example, the references in this document are stored in a prototype of BibJSON and can be rendered into the content via JavaScript.


We present in this section a list of reasons and use cases which motivate our commitment to Open Bibliography.

  1. Access to Information. There is currently no single place where a user can obtain a definitive statement of the identity and public domain components of a bibliographic record in STM publications. There are a number of organisations, many commercial, which supply bibliographic records but almost all of these are covered by licences which limit their re-use. This means, for example, that users cannot easily compare records from different suppliers, nor can these records be integrated into a single definitive resource. By contrast the idea of Open Bibliography is to empower and encourage individuals and organisations of various sizes to contribute, edit, improve, link to and enhance the value of public domain bibliographic records.
  2. Error detection and correction. We expect that, as for resources like Wikipedia and OpenStreetMap [46], the community supporting the practice of Open Bibliography will rapidly add adequate means of checking and validating the quality of openly accessible bibliographic data. Errors in bibliographic data are common, and an Open approach allows for crowd-sourcing detection and correction of errors. In some cases this may be done by individuals (e.g. in OpenStreetMap or ChemSpider [47]) and in other cases may be through organisations which appreciate Open Bibliography and contribute updates to it.
  3. Publication of small bibliographic datasets. It is common for individuals, departments and organisations to provide definitive lists of bibliographic records. Examples of these are reading lists produced by a lecturer, study lists created by students in the course of their studies, publications lists created by researchers and departmental and institutional lists reflecting on the work published from those organisations. The practice of Open Bibliography encourages individuals and small organisations to make such lists available as a shared, machine-readable resource. These lists then contribute to the quality of the open bibliographic aggregation, and reduce the effort of aggregating agents in compiling lists across a number of individuals or departments. RePEc provides the leading example of feasibility of this sort of bibliographic aggregation for a subject community.
  4. Merging bibliographic collections. We show here that Open Bibliographic collections will come from different subject groups e.g. bioscience, crystallography and mathematics. Sometimes there will be large overlap and sometimes the resources will be largely independent. In the next period of work we intend to create a merged bibliographic resource, BibSoup [48], as an aggregation of Open collections which can be readily queried to return basic bibliographic information in machine-readable format suitable for further processing. We do not expect an Open BibSoup to replace massive central search systems such as Google Scholar and Microsoft Academic Search which require considerable infrastructure to host and maintain. Rather, we expect that once a result set has been obtained from these or other search services, it will be possible to forward or link the result set to one or more simple services over the BibSoup. These services should return further information about results, especially community-validated machine-readable metadata for further use and processing, something currently unavailable from any large scale search service. Thus BibSoup implementations could take advantage of the work already done by Google [49], Microsoft and other search providers, to increase discoverability by improving the quality and ranking of search results.
  5. A bibliographic node in the Linked Open Data cloud. There are many reasons why the world may wish to discover STM bibliography and to link to it. For example, many Wikipedia articles cite STM publications and it would be valuable to know whether these exist, to obtain complete bibliographic metadata for referencing, and to know whether they can be read and re-used without permission. Communities can add their own linked and annotated bibliographic material to an LOD cloud [50].
  6. Collaboration with other bibliographic organisations. Many resources in academia are collected by and supplied by commercial organisations on a service basis. We expect this to continue and we offer the products of Open Bibliography as resources against which these suppliers can validate and compare their offerings. Examples of such organisations are reference manager suppliers (Zotero, Mendeley), reference and identifier systems such as CrossRef [51] [52], and academic libraries and library organisations.
  7. Mapping scholarly research and activity. Bibliographic records (including citations) are now frequently used as a means of assessing the value of individuals and institutions. Open Bibliography can provide definitive records against which these assessments can be collated. For example it allows us to create patterns of collaboration and to identify geographical locations in which work is performed. For researchers in this area, we expect the type of analysis shown in our geospatial examples to be of broad interest (even though the citations and abstracts may not always be Open).
  8. An Open catalogue of Open scholarship. Since the bibliographic record for an article is Open, it can be annotated to show the Openness of the article itself. Bibliographic data can thus be Openly enhanced to show whether a paper is fully Open (e.g. CC-BY) or freely available (as in beer), the website where it was discovered, and any associated non-textual objects such as datasets, multimedia and other resources. Open bibliographic data can also include syntactic metadata such as the format, size and technical accessibility of the resources. Beyond this, we believe that a large number of hitherto unpublished applications can be made on top of an Open bibliographic framework.
  9. Cataloguing diverse materials related to bibliographic records. We see the opportunity to list databases, websites, review articles and other information which the community may find valuable, and to associate such lists with open bibliographic records.
  10. Use and development of machine learning methods for bibliographic data processing. Widespread availability of Open Bibliographic data in machine-readable formats should rapidly promote the use and development of machine learning algorithms, allowing machines to largely automate tasks such as matching, de-duplication and classification of bibliographic records, and to make Open Source versions of these algorithms widely available for use by managers of Open Bibliographic data stores.
  11. Promotion of community information services. Widespread availability of Open Bibliographic web services will make it easier for those interested in promoting the development of scientific communities to develop and maintain subject specific community information services, featuring searchable lists of books, articles and web resources of interest to a community of practice. Every such service may be thought of as a node in the Bibliographic Knowledge Network, a node which acquires, refines and organises data from the larger BibSoup environment, and publishes this data Openly back to the network.


Bibliography, and bibliographic data, is sometimes regarded as referring to everything that is not part of the "full text" and "images" in an article. This can be problematic because some people and organisations regard material such as abstracts, annotations and citation lists as "copyrightable" and therefore not by default Open. In this article, we do not debate the ethics and legality of asserting ownership over certain types of bibliographic data, and our understanding of the agreed law and practice is that what we define as "core bibliographic data" below can be made Open by default.

By "core bibliographic data" we mean that data which is necessary to identify and / or discover a publication. It is generally held that such bibliographic data is NOT copyrightable and this has been confirmed by the Association of STM Publishers in a public reply to one of the authors [53].

It is difficult to get authoritative statements as to whether other fields are Open by default. But we would expect, for example, that the format of the work and the rights associated with it were by default Open, while the abstract and images were not. Traditionally collections of STM bibliographic data have been expensive to produce and most of these are therefore currently available only under licenses that restrict re-use. Because it is now technically possible to create large amounts of Open Bibliographic data, this opens the possibility of collections created from the start as Open and distributed for community re-use.

The following "core bibliographic data", as described by the Open Bibliography Principles [54], will be the subject of this article:

A number of ways of creating Open Bibliographic data may be identified:

  1. Contributions from a publishing agency under an Open license such as CC0 or PDDL (effectively putting the material into the public domain) or CC-BY which allows use of the data in exchange for links back to the source (especially suitable for data elements such as abstracts)
  2. Collections from Open Access digital repositories
  3. Collections developed by spidering the web and extracting public domain bibliographic data components from publication lists, in the manner of CiteSeer [55]
  4. Donations of data by individual researchers, departments and universities
  5. Donations from publishers of collected scientific information such as Medline [56]

Using these and other mechanisms, we believe that it is cost effective to create and maintain an Open bibliographic network of information about STM publications. These need not necessarily be electronic publications, but the stress of this article will be on the collection of bibliographic data that refers to electronic journals, web pages, technical reports, theses, and documents available on the web, meaning the data that is required to locate and identify a document on the web, whether or not the full text of that document is openly available. As an example, it is possible to extract bibliographic data for all the publications in the BMC collection of journals. The web has been crawled for many years and the technology for doing this is standard. It is polite, but not legally required, to agree large-scale crawling with a publisher or to create web-server-friendly robots which do not impose undue stress.
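As an illustration of what a "web-server-friendly" robot might check before requesting a page, the following minimal sketch uses Python's standard `urllib.robotparser`. The robots.txt content, the host name `journal.example.org` and the agent name `openbib-sketch` are hypothetical and chosen only for this example; this is not the crawler used in the project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the publisher's site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /search
"""


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = "openbib-sketch") -> bool:
    """Return True if the site's robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


policy = make_policy(ROBOTS_TXT)
# Seconds the site asks robots to wait between requests (None if unspecified).
delay = policy.crawl_delay("openbib-sketch")
```

A crawler built along these lines would skip any URL for which `may_fetch` returns `False` and sleep for `delay` seconds between requests to the same host.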


Most scientists require a single bibliographic record per publication. In other words, most scientists do not distinguish between a print version, an electronic version or a manuscript on an author’s web page or in their institutional repository. Scientists have the implicit model of a single platonic bibliographic record for an article. Our approach is based on this and while there may be occasional complexities that cannot be represented, we believe it is powerful enough to create a useful sustainable Open STM bibliography.

5.1 Vocabulary

The vocabulary terms used by publishers and other bibliography creators, often drawn from Dublin Core, PRISM, Medline or home grown element sets, are fairly, but not completely, interchangeable. For example, dc:creator might be used for authors in one source and editors or publishers in other sources, but usage is normally consistent within a given source. As a first step we propose to honour the terms used by the collectors rather than attempt to align and normalise them algorithmically. We are exploring whether there is a pragmatic "flattening" of the main concepts and whether it is possible to manage "most" STM bibliographic records with a small number of central terms; most STM articles in journals can be described with a very small subset of these vocabularies:

Types of entities

| Name | Element Set(s) | Description |
|---|---|---|
| Agent | FOAF, dcterms | A resource that acts or has the power to act. |
| Person | FOAF | A person |
| Organisation | FOAF | An organisation |
| Document, Bibliographic Resource | FOAF, BIBO, dcterms | A document of some sort |
| Article | BIBO | An article, typically in a Journal |
| Issue | BIBO, BibTeX | A journal issue or volume (expressed as a property in BibTeX, linked with dcterms:isPartOf in BIBO) |
| Journal | BIBO, BibTeX | A journal (expressed as a property in BibTeX, linked with dcterms:isPartOf in BIBO) |

Properties or predicates

| Name | Element Set(s) | Description |
|---|---|---|
| author, creator, contributor, editor | BibTeX, dcterms | Person or organisation creatively responsible for some document |
| identifier | BibTeX, BIBO, dcterms | Identifier of an entity such as an article or journal (including refinements such as ISSN, DOI, ISBN, etc. which are common BibTeX extensions) |
| institution | BibTeX | The institution involved in publishing |
| journal | BibTeX | A journal (see Journal in classes above) |
| month, year, published | BibTeX, dcterms | The date of publication |
| name, label | FOAF, RDFS, SKOS [57] | A name or label for a thing such as a person or organisation. |
| pages, extent | BibTeX, BIBO, dcterms | Page numbers |
| publisher | BibTeX, dcterms | A publisher |
| title | BibTeX, dcterms | The title of the work |
| volume | BibTeX | The volume of a journal (see Issue in classes above) |

We could equally well have included the relevant fields from MARC21 and more [58] [59] in the above table. What these representations (MARC21, BibTeX and BIBO+dcterms) have in common is a flat representation of a bibliographic record. Contrast this with the WEMI (Work, Expression, Manifestation, Item) model of FRBR, where a single bibliographic record for an article would be separated into three or four related entities (at least Work, Expression and Manifestation) according to complicated cataloguing rules, before even considering the added dimensions of journals, authors and publishers. This flat representation is a core feature of our conceptual model.

5.2 Identifiers

Identifiers are critically important. They are necessary (but obviously not sufficient) to enable tasks like de-duplication - in order to identify duplicates, we need to be able to identify the things that are duplicated. They also make it possible to refer to entities outside of the current dataset; one might refer to the author of an article by their Wikipedia page, for example. This is not necessary, but it opens up many interesting possibilities for interlinking and correlating amongst datasets. Using a URI as an identifier where feasible is therefore a desirable feature [60].
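To make the role of identifiers in de-duplication concrete, here is a minimal sketch that groups records by a normalised DOI. The record layout (a dictionary with an optional `doi` key), the list of prefixes stripped, and the DOIs in the usage note are assumptions made for illustration; real collections represent identifiers in many more ways.

```python
def normalise_doi(raw: str) -> str:
    """Reduce common spellings of a DOI to a bare, lower-case form."""
    doi = raw.strip().lower()
    # Assumed set of prefixes; real data may use others.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def group_by_doi(records):
    """Map each normalised DOI to the list of records that carry it."""
    groups = {}
    for record in records:
        doi = record.get("doi")
        if doi:  # records without an identifier cannot be grouped this way
            groups.setdefault(normalise_doi(doi), []).append(record)
    return groups


def duplicates(records):
    """Return only the DOIs that occur on more than one record."""
    return {doi: recs for doi, recs in group_by_doi(records).items() if len(recs) > 1}
```

For example, records carrying the made-up DOIs `doi:10.1234/ABC.1` and `https://doi.org/10.1234/abc.1` end up in the same group, while a record with no DOI is left out entirely. This is one reason identifiers are necessary but not sufficient: records that lack them still need some other matching strategy.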

Where a single, sustainable resource manages bibliographic data, it makes sense for it to generate its own unique identifiers, even if there is already a well-defined identifier system for some of the information. Thus, in working with the British Library [61] on the British National Bibliography [62], we have created a set of identifiers for their records. However, where collections come from several sources it is very difficult to create a global unique identifier system without a curating organisation. We therefore expect that each collection will create its own identifiers. We expect that different collections will contain bibliographic data for the same object and here we will create a mapping between the collections rather than trying to create a single global index.
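The idea of mapping between collections, rather than minting one global index, can be sketched as follows. Each collection keeps its own local identifiers, and a link is recorded whenever two records in different collections share an external identifier. The collection names, local identifiers and the use of a `doi` key as the shared identifier are all hypothetical.

```python
from itertools import combinations


def cross_links(collections):
    """Link records in different collections that share the same DOI.

    `collections` maps a collection name to {local_id: record}.
    Returns a list of ((collection, local_id), (collection, local_id)) pairs.
    """
    seen = {}
    for name in sorted(collections):
        for local_id, record in sorted(collections[name].items()):
            doi = record.get("doi")
            if doi:
                seen.setdefault(doi.lower(), []).append((name, local_id))

    links = []
    for members in seen.values():
        for left, right in combinations(members, 2):
            if left[0] != right[0]:  # only link across collections
                links.append((left, right))
    return links


example = {
    "collection_a": {"a1": {"doi": "10.1234/abc.1"}, "a2": {"doi": "10.1234/only-a"}},
    "collection_b": {"b7": {"doi": "10.1234/ABC.1"}},
}
links = cross_links(example)
```

Under this scheme no collection has to change its own identifiers. The mapping is a separate, rebuildable dataset that can be extended as further collections join.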

5.3 Datasets

We have also worked with the following datasets (see also Section 6) and found that the records can be well represented by the concepts above.

  1. Bibliography extracted from the masthead (splash page) of 8000 Open Access articles from the IUCr. These already contain bibliographic information in PRISM and Dublin Core, together with some submitted by authors (e.g. email and addresses).
  2. The Open Access subset of PubMed Central (PMC). There are about 250,000 fulltext papers which contain bibliographic data but which vary due to the publishers’ syntax and semantics. These have been normalised so that the information is in a uniform schema, but publisher variation still exists in terms of metadata quality and how key information such as DOIs and other identifiers is represented.
  3. Recently we have obtained the full bibliographic records for 20 million Medline articles with metadata defined in the National Library of Medicine (NLM) Medline DTD.
  4. Personal bibliographies of about a hundred researchers in the fields of mathematics and statistics, including all Mathematics Faculty at U. C. Berkeley [63].
  5. Various lists of authors in mathematics, statistics and related fields [64].

5.4 Serialisation

With this conceptual model in hand, we can turn our attention to exchanging information between systems that have similar or at least compatible models. For pragmatic reasons we propose to use JSON to exchange this information. JSON is widely implemented, simple to parse and easy to create either with a computer program or by hand in a text editor. A JSON-based format which uses dictionaries or associative arrays is also extensible since adding a new key to such a dictionary should not break any existing implementations which may not understand the meaning of the new key.

By using a JSON-based format designed for representation of bibliographic data - meaning data about documents of various kinds, and about the people, organizations and subjects connected to those documents - we can include guidance for creating records, linkages to existing ontologies, vocabularies and schema, and schema definitions, covering a wide range of bibliographic needs and drawing from a number of bibliographic metadata sources (BIBO, BibTeX, DC). If desired, a creator of a bibliographic dataset may add more information (e.g. language, format, editors, etc.). A consumer of this dataset may or may not read and understand this.
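The following sketch shows what a flat, JSON-serialised record of this kind might look like, and how a consumer can ignore keys it does not understand. The key names are drawn loosely from the vocabulary table in Section 5.1 and are illustrative only, not a normative statement of the BibJSON specification. The article, authors and DOI are invented.

```python
import json

# An invented record using a small set of flat keys. "language" stands in
# for an extra key that a dataset creator has chosen to add.
record = {
    "type": "article",
    "title": "An example article title",
    "author": [{"name": "A. Researcher"}, {"name": "B. Colleague"}],
    "journal": {"name": "Journal of Examples"},
    "year": "2011",
    "volume": "1",
    "pages": "1-10",
    "identifier": [{"type": "doi", "id": "10.1234/example.1"}],
    "language": "en",
}


def summarise(serialised: str) -> str:
    """A consumer that reads only the keys it knows and ignores the rest."""
    data = json.loads(serialised)
    names = ", ".join(author["name"] for author in data.get("author", []))
    return f'{names} ({data.get("year", "n.d.")}). {data.get("title", "")}'


serialised = json.dumps(record, indent=2, sort_keys=True)
summary = summarise(serialised)
```

Because `summarise` looks up only `author`, `year` and `title`, adding or removing other keys does not affect it. This is the extensibility property that motivates a dictionary-based format.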


This paper is strongly informed by the work done in the JISC-OpenBib project, a collaboration between the University of Cambridge, the OKF, the British Library, Cambridge University Library and the IUCr. With the help of these partners, the high-level goal was to take exemplary bibliographic datasets and show that the principles of Open Bibliography, coupled with the formalisation and tools reported here, would be of great value to the scientific and informatics communities. We report a number of successful prototypes in this section where we have been able to acquire or collect an Open dataset, transform it to BibJSON or equivalent and re-purpose it, often with an interactive tool. The emphasis in this paper is on STM bibliographic resources but for completeness we also report on other collections.

6.1 Bibliographica

The OKF has developed a bibliographic management system (Bibliographica) which was functional at the start of this project. Although general, it has been primarily aimed at non-STM resources such as library collections and personal bibliographies. During the project, the British Library released under a CC0 licence the British National Bibliography (a collection of about 3 million records for monographs created in their role as deposit library). These have been converted to queryable RDF using our Open Source OpenBiblio software [65] [66]. This provides an example of using the software to make bibliographic metadata available as RDF where required; this and other instantiations of OpenBiblio then act as resources for building bibliographic collections.

6.2 Medline

The largest collection of STM bibliographic data is provided by the NLM from the National Institutes of Health (NIH) of the USA. This is provided freely, and the records refer to both Open and non-Open publications. The Open publications are referred to as the "Open Access subset (OAS)" (the terminology is complex and our project has explained it [67]). The OAS (ca. 250,000 records) contains full text and full reference lists (citations), and is of a very tractable size for carrying out prototypic work on bibliography and citations. The full Medline collection has about 20 million articles and in collaboration with the NLM we have obtained these records and converted them to RDF using a straightforward BIBO+dcterms representation. For the full record set we have been careful to include only those components of bibliography which are agreed to be Open (i.e. we have omitted abstracts and editorial annotation). Nevertheless, this collection is a major new resource in Open Scholarship.

We have converted both sets to RDF, and found that while the Open subset is tractable with a wide range of common tools, the full record set has problems of scale. It produced over 1 billion RDF statements, and the resources required to query these in an RDF store are beyond our current scope. However, a sample record is appended, and further information along with the full content is Openly available [68] [69]. For the full record set we are now using a BibJSON-like approach and storing the records in a NoSQL database (CouchDB). This gives good performance for the sorts of queries that most people will initially wish to make.

Although the abstracts and full text are not Openly available, the Medline bibliographic dataset still has enormous value, particularly when used with new ways of navigation and display.

Although citations (bibliographic entry reference lists) are outside the scope of Open Bibliography, the OAS provides an opportunity to work on citations. This is less easy because the reporting of citations is poorly formalised (a major motivation for Open Bibliography and BibJSON), and they contain a large number of errors, including non-existent bibliographic objects. However, the potential is large and we display an analysis of citations related to a retracted paper [70].

Figure 1: A citation map of papers recursively referencing Wakefield's paper on the adverse effects of MMR vaccination. A full analysis requires not just the act of citation but the sentiment, and initial inspection shows that the immediate papers had a negative sentiment i.e. were critical of the paper. Wakefield's paper was eventually withdrawn but the other papers in the map still exist. It should be noted that recursive citation can often build a false sense of value for a distantly-cited object.
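
The recursive structure behind such a map can be sketched as a breadth-first traversal of "cited by" links. This is a minimal sketch, not the analysis used to produce the figure: the function name `recursive_citers` and the toy paper labels are hypothetical, and it records only the act of citation, not its sentiment.

```python
from collections import deque


def recursive_citers(cited_by, target):
    """Return {paper: depth} for every paper that cites `target` directly
    (depth 1) or through a chain of citations (depth > 1)."""
    depth = {}
    queue = deque([(target, 0)])
    while queue:
        paper, d = queue.popleft()
        for citer in cited_by.get(paper, ()):
            # Skip papers already reached by a shorter or equal chain.
            if citer != target and citer not in depth:
                depth[citer] = d + 1
                queue.append((citer, d + 1))
    return depth


# Hypothetical toy data: each key is cited by the papers in its list.
cited_by = {
    "W": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "D": ["A"],
}

citers = recursive_citers(cited_by, "W")
print(citers)
```

Recording the depth makes it easy to separate the immediate citing papers from distantly connected ones, which matters because a long chain says little about the value of the original object.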

6.3 Visualisations

Traditionally, bibliographic records have been seen as a management tool for physical and electronic collections, whether institutional or personal. In bulk, however, they are much richer than that because they can be linked, without violation of rights, to a variety of other information. The primary objective axes are:

  1. Authors. As well as using individual authors as nodes in a bibliographic map, we can create co-occurrence of authors (collaborations).
  2. Authors' affiliation. Most bibliographic references will now allow direct or indirect identification of the authors' affiliation, especially the employing institution. We can use heuristics to determine where the bulk of the work might have been done (e.g. first authorship, commonality of themes in related papers etc.). Disambiguation of institutions is generally much easier than for authors, as there is a smaller number and there are also high-quality sites on the web (e.g. Wikipedia for universities). In general, therefore, we can geo-locate all the components of a bibliographic record.
  3. Time. The time of publication is well-recorded and although this may not always indicate when the work was done, the pressure of modern science indicates that in many cases bibliography provides a fairly accurate snapshot of current research (i.e. with a delay of perhaps one year).
  4. Subject. Although we cannot rely on access to abstracts (most are closed), the title is Open and in many subjects gives high precision and recall. Currently, our best examples are in infectious diseases, where terms such as malaria, plasmodium etc. are regularly and consistently used.
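
The author axis above can be illustrated by counting how often pairs of authors appear on the same record. This is a minimal sketch under a strong assumption: author strings are treated as identifiers, with no disambiguation. The function name `coauthor_pairs` and the second toy record are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(records):
    """Count co-occurrences of author name strings across records."""
    counts = Counter()
    for rec in records:
        # Sort so that each pair has one canonical ordering.
        authors = sorted(set(rec.get("author", [])))
        counts.update(combinations(authors, 2))
    return counts


records = [
    {"author": ["Ortmeyer, H K", "Bodkin, N L", "Hansen, B C"]},
    {"author": ["Hansen, B C", "Bodkin, N L"]},  # hypothetical second record
]

pairs = coauthor_pairs(records)
print(pairs.most_common(1))
```

The resulting counts can serve as edge weights in a collaboration graph whose nodes are authors.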

With these components, it is possible to create a living map of scholarship, and we show two examples carried out with our bibliographic sets.

Figure 2: This is a geo-temporal bibliographic map for crystallography. The IUCr's Open Access articles are an excellent resource as their bibliography is well-defined and the authors and affiliations well-identified. The records are plotted here on an interactive map where a slider determines the current timeslice and plots each week's publications on a map of the world. Each publication is linked back to the original article. (The full interactive resource is available at

Figure 3: This is a geo-temporal bibliography from the full Medline dataset. Bibliographic records have been extracted by year and geo-spatial co-ordinates located on a grid. The frequency of publications in each grid square is represented by vertical bars. (Note: Only a proportion of the entries in the full dataset have been used and readers should not draw serious conclusions from this prototype). (A demonstration screencast is available at; the full interactive resource is accessible with Firefox 4 or Google Chrome, at

These visualisations show independent publications, but when the semantic facets on the data have been extracted it will be straightforward to aggregate by region, by date and to create linkages between locations.
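
The grid aggregation described for Figure 3 can be sketched as binning records by year and by latitude/longitude cell. This is an illustration only: the cell size, the field names `year`, `lat` and `lon`, and the coordinates are all assumptions, not the values used for the figure.

```python
import math
from collections import Counter


def grid_counts(records, cell_degrees=10.0):
    """Count publications per (year, grid row, grid column)."""
    counts = Counter()
    for rec in records:
        row = math.floor(rec["lat"] / cell_degrees)
        col = math.floor(rec["lon"] / cell_degrees)
        counts[(rec["year"], row, col)] += 1
    return counts


# Hypothetical geo-located records.
records = [
    {"year": 2010, "lat": 52.2, "lon": 0.1},
    {"year": 2010, "lat": 51.5, "lon": 0.3},
    {"year": 2011, "lat": 40.7, "lon": -74.0},
]

counts = grid_counts(records)
print(counts)
```

Each count corresponds to the height of one vertical bar for a given year; summing over years or over neighbouring cells gives the regional and temporal aggregations mentioned above.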

6.4 Mashups

Bibliographic data are particularly valuable for mashups (i.e. the combination of data components that share one or more common values or identifiers). Thus, for example, it should be possible to link Open Bibliography to bibliographic entry references in Wikipedia. More generally, Open Bibliography is available for any author or organisation who wishes a definitive identification of the bibliographic entry references in a document.

Our mashups demonstrate that when data is openly available, it enables serendipitous and relatively quick development of useful tools. For example, we created a Wikipedia bookmarklet, a personal collections tool on Bibliographica, and a relevant reading list generator for the Edinburgh International Science Festival [71] [72] [73].
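
The core of such a mashup is a join on a shared identifier. The sketch below matches references found in some document against a bibliographic collection by DOI. The function names, the record layout and the reference list are hypothetical; the two DOIs are those of the appended sample records.

```python
def normalise_doi(value):
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    doi = value.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def join_on_doi(bibliography, references):
    """Pair each reference with the bibliographic record sharing its DOI."""
    index = {normalise_doi(r["doi"]): r for r in bibliography if r.get("doi")}
    matched = []
    for ref in references:
        doi = ref.get("doi")
        if doi and normalise_doi(doi) in index:
            matched.append((ref, index[normalise_doi(doi)]))
    return matched


bibliography = [
    {"doi": "10.5194/acp-11-4679-2011",
     "journal": "Atmospheric Chemistry and Physics"},
    {"doi": "10.1107/S1600536811007148",
     "journal": "Acta Crystallographica Section E: Structure Reports Online"},
]
# Hypothetical references extracted from another document.
references = [
    {"label": "ref-1", "doi": "doi:10.1107/s1600536811007148"},
    {"label": "ref-2"},  # no identifier, so it cannot be joined
]

matches = join_on_doi(bibliography, references)
print(matches)
```

References without a shared identifier are simply left unmatched here; linking them would need fuzzier matching on title, authors and year.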

6.5 BibJSON-based collections and systems

In order to manage larger bibliographic datasets in a simpler format, we are now collaborating with the Bibliographic Knowledge Network to develop the BibJSON format for representing bibliographic records. It is sufficient for most current purposes for basic STM articles, adequate also for all basic BibTeX types including monographs, and is extensible so it can easily support records for authors, journals, etc. The main virtues are:

We have created sample BibJSON conversions using our software (examples appended), and will continue to perform these conversions on the datasets now available. BibJSON can perform a similar function for communities wishing to share bibliographic data as GeoJSON [74] does for those sharing geospatial information.
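
As a sketch of what such a conversion involves, the following turns BibTeX-style `KEY = {value}` fields into a BibJSON-like record, splitting the author field into a list. It is not the project's converter: the function name is hypothetical, and it assumes one field per line and an entry type supplied by the caller. The input fields are those of the appended Atmospheric Chemistry and Physics example.

```python
import re

# One "KEY = {value}" field per line, with an optional trailing comma.
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\}\s*,?\s*$")


def bibtex_fields_to_bibjson(text, entry_type="article"):
    """Convert BibTeX-style fields into a BibJSON-like dict."""
    record = {"type": entry_type}
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # skip anything that is not a simple field line
        key, value = match.group(1).lower(), match.group(2).strip()
        if key == "author":
            # BibTeX separates authors with the word "and".
            record["author"] = [a.strip() for a in value.split(" and ")]
        else:
            record[key] = value
    return record


fields = """
AUTHOR = {Murphy, D. M. and Chow, J. C. and Leibensperger, E. M. and Malm, W. C. and Pitchford, M. and Schichtel, B. A. and Watson, J. G. and White, W. H.},
TITLE = {Decreases in elemental carbon and fine particle mass in the United States},
JOURNAL = {Atmospheric Chemistry and Physics},
VOLUME = {11},
YEAR = {2011},
NUMBER = {10},
PAGES = {4679--4686},
DOI = {10.5194/acp-11-4679-2011}
"""

bibjson = bibtex_fields_to_bibjson(fields)
print(bibjson["author"])
```

Because the default BibJSON keys follow BibTeX, the mapping is almost one-to-one; only the author list needs restructuring.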

6.6 BibSoup

The use of lightweight technology has allowed us to create a radically new approach towards the collection of Open bibliography. Conventional wisdom would suggest that all bibliographic records should be normalised and validated by a central authority. In BibJSON, we take the view that any Open bibliographic record (with its provenance) is potentially valuable, even though there may be duplicates referring to "the same bibliographic object". The question of determining whether two records relate to "the same object" is difficult and controversial and BibSoup deliberately avoids this. It consists of a number of collections of bibliography (initially in STM areas) united by a common syntax. It is left to humans and machines to develop annotations and equalities between the components of these collections. Thus, for example, "the same paper" will be reported in arXiv, DBLP and possibly even Medline.

The BibSoup approach encourages the contribution of Open bibliography without the overhead of de-duplication at contribution time. We expect that, as it grows, services will develop that help users and maintainers to manage the information. De-duplication into a central repository may be one solution (with the presumed platonic identity of STM bibliographic entries), but we also expect that software based on RDF will allow tools to manage alternative representations of bibliographic data, leaving the choice of strategy to the user. In short, current STM bibliography is a distributed mess. BibSoup takes this as a starting point and, where the political will and financial support are available, offers methods for tidying this up.
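
The idea can be sketched as follows: records from several collections are kept as contributed, each with its provenance, and possible equalities are computed afterwards as annotations rather than by merging. The title-matching heuristic, the function names and the collection contents are assumptions made for illustration; real matching is much harder, as noted above.

```python
import re
from collections import defaultdict


def title_key(title):
    """Crude comparison key: lower-cased alphanumeric words only."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def add_collection(soup, source, records):
    """Append records unchanged, tagging each with where it came from."""
    for rec in records:
        soup.append({"provenance": source, "record": rec})


def candidate_equalities(soup):
    """Group entry positions whose titles look alike; nothing is merged."""
    groups = defaultdict(list)
    for position, entry in enumerate(soup):
        groups[title_key(entry["record"].get("title", ""))].append(position)
    return [ids for key, ids in groups.items() if key and len(ids) > 1]


soup = []
# Hypothetical contributions from two collections.
add_collection(soup, "collection-a", [
    {"title": "ChemicalTagger: A tool for semantic text-mining in chemistry"},
    {"title": "Decreases in elemental carbon and fine particle mass in the United States"},
])
add_collection(soup, "collection-b", [
    {"title": "ChemicalTagger: a tool for semantic text-mining in chemistry."},
])

equalities = candidate_equalities(soup)
print(equalities)
```

Since the soup itself is never rewritten, different users can apply different equality rules to the same contributed records.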


Via collaboration with the Scholarly HTML (ScHTML) [75] community we intend to follow conventions for embedding bibliographic metadata within HTML documents whilst also enabling collection of such embedded records into BibJSON and BibSoup, thus allowing embedded metadata whilst also providing additional functionality such as search. We are also working towards ensuring compatibility between ScHTML and [76], affording greater relevance and usability of ScHTML data.

We are continuing development of BibServer [77] along with the BibJSON specification as a way for individuals - or departments or research groups - to easily manage, present, and search their own bibliographic collections. Collections can be stored in BibTex files, in JSON files or a JSON database such as CouchDB, or in an OpenBiblio instance, or managed directly by the software. The key to the architectural design is that it will be possible for other interested parties to develop their own plugins both for ingest and storage, allowing flexibility in implementation.
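
One way a plugin-based ingest layer could be organised is sketched below. This is purely illustrative and is not BibServer's actual API: the registry, the decorator and the plugin names are all hypothetical.

```python
import json

# Hypothetical registry: maps a format name to a function that turns raw
# text into a list of BibJSON-like dicts. Not BibServer's actual API.
INGEST_PLUGINS = {}


def ingest_plugin(fmt):
    """Decorator that registers an ingest function under a format name."""
    def register(func):
        INGEST_PLUGINS[fmt] = func
        return func
    return register


@ingest_plugin("json")
def ingest_json(text):
    """Accept either a single JSON record or a JSON list of records."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


@ingest_plugin("titles")
def ingest_titles(text):
    """Toy third-party plugin: one title per non-empty line."""
    return [{"type": "misc", "title": line.strip()}
            for line in text.splitlines() if line.strip()]


def ingest(fmt, text):
    """Dispatch to whichever plugin is registered for the given format."""
    try:
        plugin = INGEST_PLUGINS[fmt]
    except KeyError:
        raise ValueError(f"no ingest plugin registered for {fmt!r}") from None
    return plugin(text)


print(ingest("titles", "First title\nSecond title\n"))
```

A storage layer could follow the same pattern, with each backend registered under a name and selected by configuration.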

These ongoing efforts to develop OpenBiblio, BibJSON and BibServer will enable us to support large-scale Open Bibliographic data: the BibSoup. We hope to attract further collaborations from other groups which realise the importance of Open Source code, Open Data and Open Knowledge to the future of scholarship.


8.1 Bibliographic records represented in BibJSON

The following examples demonstrate conversions of typical bibliographic records into BibJSON. Although BibJSON is not a complete standard, the aim is to demonstrate the simplicity with which we can represent this data in a JSON object, using namespaces to extend keys as necessary. The default namespace for BibJSON keys is essentially BibTex plus a few keys required to support BibJSON, such as “namespaces”; anything beyond the scope of BibTex should be added by using a namespace.
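
The namespacing convention can be sketched for HTML meta tags of the kind used in the IUCr example: keys outside the default namespace are carried over with a prefix, and repeated creator tags are collected into the default `author` list. The prefix table, the class and function names, and the decision to skip empty values are assumptions for illustration, not the mapping used by our software.

```python
from html.parser import HTMLParser

# Assumed mapping from meta-tag schemes to BibJSON namespace prefixes.
PREFIXES = {"DC": "dc", "DCTERMS": "dcterms", "prism": "prism"}


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags with non-empty values."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") and attributes.get("content"):
                self.pairs.append((attributes["name"], attributes["content"]))


def meta_to_bibjson(html):
    """Build a BibJSON-like record with namespaced keys from meta tags."""
    collector = MetaCollector()
    collector.feed(html)
    record = {"type": "metadata", "author": []}
    for name, content in collector.pairs:
        prefix, _, local = name.partition(".")
        if name == "DC.creator":
            record["author"].append(content)  # default-namespace key
        elif prefix in PREFIXES and local:
            record[f"{PREFIXES[prefix]}:{local}"] = content
    return record


html = """
<meta name="DC.creator" content="Zheng, L." />
<meta name="DC.creator" content="Hu, F." />
<meta name="DC.rights" content="" />
<meta name="DC.language" content="en" />
<meta name="prism.volume" content="67" />
<meta name="keywords" lang="en" content="" />
"""

record = meta_to_bibjson(html)
print(record)
```

Tags with no recognised scheme, such as `keywords`, are ignored in this sketch; a fuller converter would decide case by case which of them belong in the default namespace.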

IUCr raw bibliography:

<link rel="schema.DC" href="" />
<link rel="schema.DCTERMS" href="" />
<link rel="schema.prism" href="" />
<meta name="DC.source" content="urn:issn:1600-5368" />
<meta name="DC.rights" content="" />
<meta name="DC.creator" content="Zheng, L." />
<meta name="DC.creator" content="Hu, F." />
<meta name="DC.creator" content="Zeng, X.C." />
<meta name="DC.creator" content="Li, K.P." />
<meta name="" content="2011-04-01" />
<meta name="DC.identifier" content="doi:10.1107/S1600536811007148" />
<meta name="DC.publisher" content="International Union of Crystallography" />
<meta name="" content="" />
<meta name="DC.language" content="en" />
<meta name="DC.description" content="The title compound, C11H14N2O5, was synthesized by condensation of (RS)-2-aminosuccinic acid dimethyl ester with 2-trichloroacetylpyrrole at room temperature. The amide group is twisted by 7.4 (1)degrees from the plane of the pyrrole ring. In the crystal, molecules are linked by intermolecular N-H...O hydrogen bonds into chains extending along the c axis." />
<meta name="DC.type" content="text" />
<meta name="DC.title" content="rac-Dimethyl 2-(1H-pyrrole-2-carboxamido)butanedioate" />
<meta name="DCTERMS.abstract" content="The title compound, C11H14N2O5, was synthesized by condensation of (RS)-2-aminosuccinic acid dimethyl ester with 2-trichloroacetylpyrrole at room temperature. The amide group is twisted by 7.4 (1)degrees from the plane of the pyrrole ring. In the crystal, molecules are linked by intermolecular N-H...O hydrogen bonds into chains extending along the c axis." />
<meta name="prism.number" content="4" />
<meta name="prism.volume" content="67" />
<meta name="prism.publicationDate" content="2011-04-01" />
<meta name="prism.publicationName" content="Acta Crystallographica Section E: Structure Reports Online" />
<meta name="prism.issn" content="1600-5368" />
<meta name="prism.section" content="organic compounds" />
<meta name="prism.startingPage" content="752" />
<meta name="prism.rightsAgent" content="" />
<meta name="prism.endingPage" content="752" />
<meta name="prism.eissn" content="1600-5368" />
<meta name="keywords" lang="en" content="" />
<meta name="ROBOTS" content="NOARCHIVE,NOINDEX" />

IUCr BibJSON:

{
    "type" : "metadata",
    "namespaces" : {
        "dc" : "",
        "prism" : "",
        "bibo" : ""
    },
    "url" : "",
    "author" : [
        "Zheng, L.",
        "Hu, F.",
        "Zeng, X.C.",
        "Li, K.P."
    ],
    "abstract" : "The title compound, C11H14N2O5....",
    "journal" : "Acta Crystallographica Section E: Structure Reports Online",
    "bibo:issn" : "1600-5368",
    "bibo:doi" : "10.1107/S1600536811007148",
    "dc:rights" : "",
    "dc:date" : "2011-04-01",
    "dc:publisher" : "International Union of Crystallography",
    "dc:language" : "en",
    "dc:description" : "The title compound, C11H14N2O5...",
    "dc:title" : "rac-Dimethyl 2-(1H-pyrrole-2-carboxamido)butanedioate",
    "prism:number" : "4",
    "prism:volume" : "67",
    "prism:section" : "organic compounds",
    "prism:startingPage" : "752",
    "prism:endingPage" : "752",
    "prism:publicationDate" : "2011-04-01",
    "prism:eissn" : "1600-5368",
    "prism:rightsAgent" : ""
}

Atmospheric Chemistry and Physics BibTex:

AUTHOR = {Murphy, D. M. and Chow, J. C. and Leibensperger, E. M. and Malm, W. C. and Pitchford, M. and Schichtel, B. A. and Watson, J. G. and White, W. H.},
TITLE = {Decreases in elemental carbon and fine particle mass in the United States},
JOURNAL = {Atmospheric Chemistry and Physics},
VOLUME = {11},
YEAR = {2011},
NUMBER = {10},
PAGES = {4679--4686},
URL = {},
DOI = {10.5194/acp-11-4679-2011}

Atmospheric Chemistry and Physics BibJSON:

{
    "type" : "article",
    "author" : [
        "Murphy, D. M.",
        "Chow, J. C.",
        "Leibensperger, E. M.",
        "Malm, W. C.",
        "Pitchford, M.",
        "Schichtel, B. A.",
        "Watson, J. G.",
        "White, W. H."
    ],
    "title" : "Decreases in elemental carbon and fine particle mass in the United States",
    "journal" : "Atmospheric Chemistry and Physics",
    "volume" : "11",
    "year" : "2011",
    "number" : "10",
    "pages" : "4679--4686",
    "url" : "",
    "doi" : "10.5194/acp-11-4679-2011"
}

J.ChemInf BibJSON (based on JChemInf RDF):

{
    "type" : "metadata",
    "namespaces" : {
        "dc" : "",
        "dcterms" : "",
        "prism" : ""
    },
    "url" : "",
    "bibjson:fulltext" : "",
    "abstract" : "",
    "title" : "ChemicalTagger: A tool for semantic text-mining in chemistry",
    "author" : [
        "Lezan Hawizy",
        "David Jessop",
        "Nico Adams",
        "Peter Murray-Rust"
    ],
    "journal" : "Journal of Cheminformatics 2011 3:17",
    "dc:date" : "2011-5-16",
    "dc:identifier" : "",
    "dc:publisher" : "Chemistry Central Ltd",
    "dc:rights" : "",
    "dc:language" : "en",
    "dc:format" : "text/html",
    "prism:publicationName" : "Journal of Cheminformatics",
    "prism:issn" : "1758-2946",
    "prism:publicationDate" : "2011-5-16",
    "prism:volume" : "3",
    "prism:number" : "1",
    "prism:startingPage" : "17",
    "prism:copyright" : "2011 Hawizy et al;",
    "prism:rightsAgent" : ""
}

8.2 Medline sample record

## namespace prefixes used:
@prefix rdf:  .
@prefix rdfs:  .
@prefix owl:  .
@prefix dc:  .
@prefix dcat:  .
@prefix void:  .
@prefix bibo:  .
@prefix cito:  .
@prefix foaf:  .
@prefix skos:  .
@prefix opmv:  .
@prefix time:  .
@prefix xsd:  .

## metadata about the medline dataset the current record is in:
 a void:Dataset, dcat:Dataset ;
   ## license terms of this (RDF) dataset
   dc:license  ;

   ## the RDF generation finished at this time
   dc:modified "2011-05-08T15:23:45Z"^^xsd:dateTime ;

   ## it came from this medline XML file
   dc:source "medline11n0421" ;

   ## which can be obtained here (in theory, not yet)
   void:dataDump  ;

   ## another way of expressing where it can be obtained
   dcat:distribution [
       dc:description "bzip2 compressed N-Quads" ;
       a dcat:Distribution ;
   ] .

## information about the medline record in question:
 a foaf:Document;

   ## link back to the dataset
   dc:isPartOf  ;
   ## license terms of this metadata
   dc:license  ;

   ## various timestamps relating to
   dc:created "2000-06-28"^^xsd:dateTime ;
   dc:issued "2000-06-28"^^xsd:dateTime ;
   dc:modified "2004-11-17"^^xsd:dateTime ;

   ## provenance information
   dc:source "MEDLINE" ;
   opmv:wasGeneratedBy [
       a opmv:Process ;

       ## we used version 1.3 of the medline software, and the
       ## indicated source dataset
       opmv:used , [
           rdfs:label "medline11n0421"
       ] ;

       ## it was me that did this conversion
       opmv:wasControlledBy  ;

       ## at this time
       opmv:wasPerformedAt [
           a time:Instant ;
           time:inXSDDateTime "2011-05-08T15:23:45Z"^^xsd:dateTime
       ] ;
   ] ;

   ## this metadata record is about this article
   foaf:primaryTopic  .

## some further information about me, the person who did the conversion
 a foaf:Agent ;
   foaf:mbox  ;
   foaf:name "William Waites" .

## some further information about the software that was used

   dc:version "1.3" ;
   rdfs:label "Go Medline 1.3" .

## this is the journal that the article was published in
 a bibo:Journal ;
   dc:identifier "0077-8923", "7506858" ;
   dc:title "Annals of the New York Academy of Sciences" ;
   bibo:issn "0077-8923" ;

   ## we can use the ISO abbreviation to build up a concept
   ## of standard names for journals
   skos:prefLabel "Ann. N. Y. Acad. Sci." .

## information about the actual article
 a bibo:AcademicArticle ;
   dc:title "Paradoxical phosphorylation of skeletal muscle glycogen
             synthase by in vivo insulin in very lean young adult rhesus
             monkeys." ;
   dc:published "1999-18" ;
   dc:language "eng" ;

   ## this article was published in this issue...
   dc:isPartOf [
       a bibo:Issue ;
       bibo:volume "892" ;
       dc:published "1999-18" ;
       dc:spatial "UNITED STATES" ;
       ## .. which is part of this journal
       dc:isPartOf  ;
   ] ;
   ## .. on these pages
   dc:extent "247-60" ;

   ## and here are the authors
   dc:creator [
       a foaf:Person ;
       foaf:familyName "Ortmeyer" ;
       foaf:givenName "H K" ;
       foaf:name "Ortmeyer, H K"
   ], [
       a foaf:Person ;
       foaf:familyName "Bodkin" ;
       foaf:givenName "N L" ;
       foaf:name "Bodkin, N L"
   ], [
       a foaf:Person ;
       foaf:familyName "Hansen" ;
       foaf:givenName "B C" ;
       foaf:name "Hansen, B C"
   ] .


[1] The Open Knowledge Foundation.
[2] JISC Open Bibliography project. Funded by the Joint Information Systems Committee. Thanks to the JISC Open Bibliography team: Peter Murray-Rust (University of Cambridge), Rufus Pollock (Open Knowledge Foundation, University of Cambridge), Ben O'Steen (Cottage Labs), David Flanders (JISC Program Manager).
[3] Bibliographic Knowledge Network is supported by funding from the National Science Foundation (Award #0835851).
[4] BibTex bibliographic format.
[5] RIS file format.
[6] XML specification.
[7] JavaScript Object Notation format.
[8] Fielding, Roy Thomas. 2000. Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation. University of California Irvine.
[9] NoSQL database definition.
[10] Apache CouchDB project.
[11] Apache Lucene project.
[12] Apache SOLR project.
[13] Elastic Search project.
[14] The CERN Library publishes its book catalogue as Open Data.
[15] Library of Congress Subject Headings, published as Linked Open Data.
[16] UK government provides open data.
[17] Libraries in Cologne open up bibliographic data.
[18] Garfield, Eugene. 1986. Essays of an information scientist. ISI Press, Philadelphia, PA.
[19] E. Garfield. Using the impact factor. Current Contents, July 18 1994.
[20] Richard K Belew. 2005. Scientific impact quantity and quality: Analysis of two sources of bibliographic data.
[21] The open definition.
[22] The LaTeX project.
[23] The Dublin Core Metadata Initiative.
[24] Knowledge 4 All project.
[25] The Wikipedia project.
[26] e-print archive.
[27] Research papers in Economics.
[28] BibSonomy social bookmark and publication system.
[29] PhilPapers: online research in philosophy.
[30] The DBLP Computer Science Bibliography.
[31] CiteULike service for managing and discovering scholarly references.
[32] Connotea: free online reference management for all researchers, clinicians and scientists.
[33] Zotero tool to collect, organize, cite and share research sources.
[34] Mendeley academic reference management software for researchers.
[35] International Union of Crystallography.
[36] BioMed Central: The Open Access publisher.
[37] PubMed: U.S. National Library of Medicine, National Institutes of Health.
[38] AuthorClaim registration service.
[39] The Freelib project.
[40] SciPlore exploring science service.
[41] K.G. Saur. 1998. Functional Requirements for Bibliographic Records, Final Report / IFLA Study Group on the Functional Requirements for Bibliographic Records. UBCIM Publications, New Series; v. 19.
[42] The Bibliographic Ontology.
[43] Karen Coyle. Understanding the Semantic Web: Bibliographic Data and Metadata.
[44] D. Beckett. 2004. RDF/XML syntax specification (revised).
[45] Jim Pitman, Nitin Borwankar. 2009. BibJSON Bibliographic Record Specification.
[46] The OpenStreetMap project.
[47] ChemSpider: The free chemical database.
[48] The BibSOUP project.
[49] L. Page, S. Brin, R. Motwani, T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web.
[50] Richard Cyganiak. 2007. The Linked Open Data Cloud.
[51] The CrossRef DOI resolver.
[52] G. Bilder. 2011. Content negotiation for crossref DOIs.
[53] Peter Murray-Rust. Bibliographic data is open.
[54] Peter Murray-Rust, Karen Coyle, Mark MacGillivray, Ben O'Steen, Jim Pitman, Adrian Pohl, Rufus Pollock, William Waites. The Open Bibliographic principles.
[55] CiteSeer Scientific Literature Digital Library and Search engine.
[56] U.S. National Library of Medicine MEDLINE factsheet.
[57] SKOS - Simple Knowledge Organization System.
[58] 2000. MARC21 specifications for record structure, character sets, and exchange media. Library of Congress Network Development and MARC Standards Office.
[59] Ellen Gredley, Alan Hopkinson. 1990. Exchanging bibliographic data: MARC and other international formats. Library Association Publishing, London. ISBN 888022581.
[60] T. Berners-Lee, R. Fielding, L. Masinter. 1998. Uniform resource identifiers (URI): Generic syntax. IETF RFC 2396.
[61] The British Library.
[62] 2010. British Library to share millions of catalogue records.
[63] U.C. Berkeley Mathematical faculty list.
[64] Bibliographic Knowledge Network: People.
[65] Bibliographica.
[66] The OpenBiblio software repository.
[67] Mark MacGillivray. 2011. Getting open bibliographic data from PMC.
[68] William Waites. 2011. Medline RDF.
[69] Medline dataset available on CKAN.
[70] Wakefield et al. 1998. RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, Volume 351, Issue 9103, Pages 637-641.
[71] Tatiana De La O. 2011. Bibliographica gadget in Wikipedia.
[72] Tatiana De La O. 2011. Collections in Bibliographica.
[73] Mark MacGillivray. 2011. Bibliographica and Edinburgh International Science Festival.
[74] The GeoJSON specification.
[75] Scholarly HTML.
[76]
[77] The BibServer project.