The power of persistent and unique identifiers

The ability to uniquely and persistently connect a string of numbers or characters to an object or entity confers great power. Unique identifiers enable users to aggregate, link, filter, assign and otherwise use digital objects and their associated data in new and valuable ways. In recent years, considerable progress has been made in developing the architecture of unique identifiers within scholarly publishing, which is proving a boon not only for the organization of objects – such as the thousands of articles published each year – but also for the development of new tools and services.

Unique identifiers are not a new concept. For hundreds of years, biology has used binomial nomenclature to identify, track and communicate about different species. Unique identifying systems are common in many areas, including commerce (e.g. bar codes and bank identifier codes, or BICs), the US military and chemistry. And of course the uniform resource locator, or URL, of this webpage is a form of unique identifier.

In publishing, unique identifiers have been used for at least 40 years: international standard book numbers (ISBNs) were introduced in 1970 to identify different book editions and formats, and international standard serial numbers (ISSNs) were introduced soon afterwards to identify journals and periodicals. To be effective, a unique identifier must be easily generated by authorized users; be independent of format (e.g. print or online); accommodate many types of publication; serve only to identify uniquely; and be compatible with existing standards. Identifiers often incorporate a check digit to help catch human input errors. For example, the eighth and final character of the 8-digit ISSN is a check digit calculated from the first 7 digits with a modulus 11 algorithm. It can be any number from 0 to 9 or an ‘X’, denoting 10.
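The ISSN check digit can be illustrated with a short Python sketch. It assumes the standard modulus 11 scheme in which the first seven digits are weighted 8 down to 2; the function names are made up for this illustration, and the digit string 2434561 in the usage note is simply a convenient test value, not a reference to any particular journal.

```python
def issn_check_digit(first_seven: str) -> str:
    """Return the check character for the first seven digits of an ISSN."""
    if len(first_seven) != 7 or not (first_seven.isascii() and first_seven.isdigit()):
        raise ValueError("expected exactly seven digits")
    # Weight the digits 8, 7, 6, 5, 4, 3, 2 from left to right and add up.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    # The check value tops the weighted sum up to a multiple of 11.
    check = (11 - total % 11) % 11
    # A check value of 10 is written as 'X'.
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Check an ISSN written with or without its hyphen, e.g. '0378-5955'."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8:
        return False
    body, check = compact[:7], compact[7]
    if not (body.isascii() and body.isdigit()):
        return False
    return issn_check_digit(body) == check
```

For instance, issn_check_digit("0378595") returns "5", so 0378-5955 passes the check, whereas a mistyped 0378-5956 fails it. The digits 2434561 produce a check value of 10, which is written as "X".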

The most entrenched and widely accepted unique identifier in publishing is the digital object identifier (DOI), which is used to identify journal articles. More recently, however, unique identifiers for people, data, institutions and funders have begun to gain traction. Despite their similar purposes, their creation and application are quite different.

Articles and other published outputs
Various efforts have been made to persistently identify individual articles, but one has prevailed: the DOI. DOIs were developed in the early 2000s by a group of publishers, who later formed Crossref, a not-for-profit membership organization funded mainly by publishers but also by governments, universities and archives. Crossref is the main DOI registration agency, but there are others. DOIs can be assigned at more granular levels than the article; for instance, they can identify figures and tables, either within a document or as separate entities.

Recent changes in how DOIs are used – such as assigning more than one DOI to a particular object (which can happen, for example, when articles are re-published in anthologies) and allowing preprints to be assigned a DOI (possible from August 2016) – have led Crossref to update its rules for assigning DOIs.

Construction: DOIs have a prefix and a suffix, separated by a slash. The prefix comprises the directory indicator, a full stop, and a number signifying the publisher or organization registering the DOI. The suffix is assigned by the registrant to each document following its own protocol. Many journals use a combination of letters denoting the journal name, followed by a number that resembles an article ID.

Example: doi:10.2151/jmsj.2016-010

The ‘10’ indicates that the character string is a DOI, ‘2151’ refers to the Meteorological Society of Japan and ‘jmsj.2016-010’ identifies the article within the Journal of the Meteorological Society of Japan. Adding ‘https://doi.org/’ (or the older ‘http://dx.doi.org/’, which still works) before a DOI creates a hyperlink that links straight to the article (note that some browsers allow direct resolution of a DOI).
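The construction described above can be sketched in a few lines of Python. The names DoiParts, split_doi and doi_url are hypothetical, chosen only for this illustration, and the default resolver address https://doi.org/ is an assumption that can be swapped for the dx.doi.org form. The sketch does not percent-encode unusual characters in the suffix, which a production tool would need to do.

```python
from typing import NamedTuple


class DoiParts(NamedTuple):
    directory_indicator: str  # '10' marks the string as a DOI
    registrant_code: str      # identifies the registering organization
    suffix: str               # assigned by the registrant


def split_doi(doi: str) -> DoiParts:
    """Split a DOI such as 'doi:10.2151/jmsj.2016-010' into its parts."""
    text = doi.strip()
    if text.lower().startswith("doi:"):
        text = text[4:]
    # The first slash separates prefix from suffix; the suffix may contain more.
    prefix, slash, suffix = text.partition("/")
    # The first full stop separates the directory indicator from the registrant.
    indicator, dot, registrant = prefix.partition(".")
    if not (slash and dot and indicator and registrant and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return DoiParts(indicator, registrant, suffix)


def doi_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Build a resolvable link by placing the resolver address before the DOI."""
    parts = split_doi(doi)
    return f"{resolver}{parts.directory_indicator}.{parts.registrant_code}/{parts.suffix}"
```

Applied to the example, split_doi("doi:10.2151/jmsj.2016-010") yields the parts "10", "2151" and "jmsj.2016-010", and doi_url returns "https://doi.org/10.2151/jmsj.2016-010".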

People
The ambiguity of author names has long been recognized as a challenge in scholarly publishing. Name changes, language barriers, differing regional name conventions, and the variation in how authors use their own names and initials can make it difficult or even impossible to identify individuals – and their work – correctly. And of course, different people can have the same name.

To address some of these challenges, the Open Researcher and Contributor ID (ORCID) was introduced in 2012. ORCID is not the first project to address name disambiguation (others include ResearcherID and Scopus Author ID), but it has two vital advantages: it is not owned by a single entity, and it has already attracted broad community support (more than 2.2 million IDs have been issued to date). ORCID was developed on top of previous efforts, including ResearcherID and the international standard name identifier (ISNI) framework.

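Construction: The ORCID identifier, like the ISNI, is 16 characters long, usually written in four hyphen-separated groups of four. The first 15 characters are digits and the last is a modulus check digit (0 to 9, or ‘X’ denoting 10).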

Example: orcid.org/0000-0002-8432-5341
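As a minimal sketch, the check digit of the example above can be verified in Python. It assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers; the function names are hypothetical and exist only for this illustration.

```python
def orcid_check_digit(first_fifteen: str) -> str:
    """Return the check character for the first 15 digits (ISO 7064 MOD 11-2)."""
    if len(first_fifteen) != 15 or not (first_fifteen.isascii() and first_fifteen.isdigit()):
        raise ValueError("expected exactly fifteen digits")
    total = 0
    for digit in first_fifteen:
        # Add each digit to the running total, then double the result.
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    # A check value of 10 is written as 'X'.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check an identifier given bare or with an 'orcid.org/' address in front."""
    text = orcid.strip()
    if "orcid.org/" in text:
        text = text.split("orcid.org/", 1)[1]
    compact = text.replace("-", "").upper()
    if len(compact) != 16:
        return False
    body, check = compact[:15], compact[15]
    if not (body.isascii() and body.isdigit()):
        return False
    return orcid_check_digit(body) == check
```

The example identifier passes: is_valid_orcid("orcid.org/0000-0002-8432-5341") is True, while changing the final character to 2 makes the check fail.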

Data, institutions and funders
In response to the need to identify data, data sets and related objects, a number of mostly European and American universities, libraries and information centres launched DataCite, a member-funded organization, in 2009. DataCite uses the DOI system to identify data sets.

Unique identification at the institutional level has been driven by requirements to track things like funding inputs and article outputs by universities and research organizations. The emerging standard appears to be the Ringgold Identifier, which is owned and assigned by Ringgold, Inc. Ringgold IDs, which are 4-6 digits long, are assigned in simple numerical order to institutions involved in any aspect of scholarly publishing. Unlike many other unique identifier systems, which are open and independent, the Ringgold system is a closed, paid service, and registration is required even for simple searches of its database. Other systems, like the open ‘Global Research Identifier Database’ (GRID), are available, although GRID is also controlled by a commercial company.

Like institutions, funders need to track the inputs and outputs related to scholarly publishing. Crossref’s Open Funder Registry is a simple spreadsheet, published as public domain material, that contains a common taxonomy for over 11,000 funders. Publishers use these identifiers by embedding them in their manuscript submission systems and production workflows.

Use and integration of unique identifiers
Unique identifiers form an underlying infrastructure that supports other useful products and services. For example, the reference links in HTML and PDF versions of manuscripts are based on Crossref’s DOI resolution tool. New services such as ImpactStory, Kudos and Plum Analytics, which help authors and publishers to track and measure the impact of research outputs, rely on DOIs and ORCIDs. Other new businesses, such as Figshare, offer DOI assignment as a product feature. Interestingly, several organizations in the UK’s building industry have recently joined together to investigate the feasibility of using DOIs to identify construction products and documentation. The power of good infrastructure is that it often has applications well beyond its original purpose.

Author: Dugald McGlashan