BIOGUID.org

Frequently Asked Questions

General

So wait... what is the purpose of BioGUID.org, again?
Biodiversity data is being digitized at a rapidly increasing rate. Hundreds of databases of taxon names, specimens, literature, and a broad array of other related data are springing up all over the internet, and many large data aggregators (like GBIF, Catalogue of Life, Encyclopedia of Life, iDigBio, and many others) are developing tools to help organize all this information. Most of the databases and aggregators assign unique identifiers to the objects they manage. Despite a series of workshops, whitepapers, and various other earnest efforts to coordinate identifiers for biodiversity data, the reality is that most biodiversity data objects have multiple identifiers assigned to them. Shared items such as taxon names and literature records have many different identifiers because many different databases need to reference the same object (e.g., the literature record for Linnaeus' 1758 Systema Naturae has many different identifiers assigned to it, as does the bluefish, Pomatomus saltatrix). Even non-shared items, like museum specimens, may have multiple identifiers assigned to them, in the form of Catalog Numbers, Accession Numbers, and identifiers assigned by GBIF, iDigBio, and others. The purpose of BioGUID.org is to index and cross-link all of these different identifiers within the field of biodiversity.

What do you mean by "Indexing" Biodiversity identifiers?
The primary function of BioGUID.org is to index identifiers that have been applied to data objects of relevance to biodiversity research. By "indexing", we mean that BioGUID.org includes a copy of the identifier itself, with core information about the nature of the identifier (e.g., what kind of identifier it is), who or what issued or assigned the identifier, and any web services that perform some sort of action on the identifier. Just having this basic information in one place for all the zillions of identifiers will allow us to start capturing the scope of identifiers that exist in the biodiversity sphere.

What do you mean by "cross-linking" biodiversity identifiers?
While simply indexing the world's biodiversity identifiers is an important step towards mobilizing biodiversity data, the real power comes when biodiversity identifiers are cross-linked to each other. For example, we may know that the identifier "urn:lsid:zoobank.org:pub:2C6327E1-5560-4DB4-B9CA-76A0FA03D975" refers to a Reference in ZooBank, and that 542 is a Title ID in the Biodiversity Heritage Library. But it's especially helpful to know that both identifiers refer to the same data object (in this case, Linnaeus' 10th Edition of Systema Naturae, published in 1758). And it's even more helpful to know that nearly two dozen other identifiers have been assigned to this same Reference citation.

How does "Cross-Linking" work?
The simplest way to cross-link identifiers is to express a direct relationship between them. For example, we might have a table that has the value "urn:lsid:zoobank.org:pub:2C6327E1-5560-4DB4-B9CA-76A0FA03D975" in a field called "resourceID" and the value "10.5962/bhl.title.542" in a field called "relatedResourceID", with some expression of how the two are related (e.g., "sameAs") in a field called "relationshipOfResource". However, with 22 different identifiers all assigned to the same object, there would be 231 records (22 x 21 / 2) just to represent each pair-wise relationship between all of the different identifiers (and that assumes the relationships are only expressed in one direction; double this number for bidirectional representation of each relationship). So, in order to keep things a bit more manageable, we chose a different strategy for cross-linking identifiers to each other, by anchoring all 22 different identifiers to the same conceptual Object. In this way, the same table that stores the identifier itself can also store the link to the conceptual object to which it refers. In this case, all 22 identifiers would link to the same Object in the BioGUID index. Thus, with the addition of one extra field to the table of identifiers, we avoid creating 231 or more rows in a separate table of cross-links.
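To make the comparison concrete, here is a minimal Python sketch of the two strategies. It is an illustration only, not BioGUID.org's actual schema: the function names, the 20 placeholder identifiers, and the internal key value 1 are all made up, and only the two identifiers and the three field names quoted above come from the text. For n identifiers, the pair-wise approach needs n(n-1)/2 rows in one direction, while the anchored approach needs n rows.

```python
from itertools import combinations


def pairwise_links(identifiers, relationship="sameAs"):
    """One row per unordered pair, shaped like
    (resourceID, relationshipOfResource, relatedResourceID)."""
    return [(a, relationship, b) for a, b in combinations(identifiers, 2)]


def anchored_rows(identifiers, object_key):
    """One row per identifier, each carrying a link to the shared Object."""
    return [(identifier, object_key) for identifier in identifiers]


def cross_links(rows, identifier):
    """All other identifiers anchored to the same Object as `identifier`."""
    key_of = dict(rows)
    target = key_of[identifier]
    return [other for other, key in rows if key == target and other != identifier]


# Two identifiers from the text, plus 20 placeholders standing in for the rest.
identifiers = [
    "urn:lsid:zoobank.org:pub:2C6327E1-5560-4DB4-B9CA-76A0FA03D975",
    "10.5962/bhl.title.542",
] + [f"placeholder-{i}" for i in range(20)]

pairs = pairwise_links(identifiers)
rows = anchored_rows(identifiers, object_key=1)  # 1 is a hypothetical internal key

print(len(pairs), "pair-wise rows versus", len(rows), "anchored rows")
print(len(cross_links(rows, identifiers[0])), "identifiers cross-linked to the first one")
```

With the anchored layout, finding every identifier for the same thing is a lookup on one extra field rather than a walk through a separate relationship table.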

Okay.... so what is a BioGUID.org "Object", then?
Within the BioGUID.org index, an "Object" is a place-holder for a conceptual "thing". No, that's not a very helpful explanation... so an elaboration is in order. Most data objects in biodiversity informatics refer to conceptual objects. For example, when we wish to reference the publication Linnaeus 1758 (Systema Naturae), say as the publication in which a taxon name was originally established, we don't mean a particular physical copy of that book (e.g., the copy in the Smithsonian Institution Library); rather, we mean the conceptual representation of that published work. The physical books themselves are effectively just instances of this conceptual object; but it's the conceptual object to which we really want to refer, and it's the conceptual object to which most identifiers are assigned. Even objects that are very "physical" in nature, like specimens, are really represented as conceptual objects within biodiversity informatics. For example, the core record in Darwin Core for representing a specimen (and the one indexed by GBIF) is the "Occurrence". In other words, it's not really the physical specimen per se, but the aggregate set of metadata associated with the collecting event (another conceptual object) when the organism (another conceptual object) was extracted from nature. In BioGUID.org, we generate a unique record to represent every unique conceptual object to which an identifier has been assigned. These are the "Object" records to which the identifiers are linked, allowing us to cross-link an unlimited number of identifiers to the same conceptual object.

Isn't BioGUID.org compounding the problem by minting even more identifiers to things that already have identifiers?
Actually, no. The original data model for BioGUID.org did assign a persistent UUID for each conceptual object, but the model was changed to eliminate that aspect. There is an internal primary key established for each Object, but this is only used for internal linking purposes, and is never exposed outside of BioGUID.org (just like the primary key values of many other database systems). The conceptual objects within BioGUID.org are identified by the one or more external identifiers that are linked to them. So, BioGUID.org is not adding to the soup of existing identifiers; its only job is to index and cross-link existing identifiers.

Identifier

Why do I get so many results when searching for an ISSN?

ISSNs (International Standard Serial Numbers; typically assigned to Journals) are tricky for a couple of reasons. First of all, they contain a dash (-) — one of the characters that BioGUID.org ignores when searching for matching identifiers. Because they are (mostly) numeric, stripping the dash makes the ISSN look like an integer, which means that it matches all kinds of other numeric identifiers. We have some ideas on how to solve that particular issue, but so far it doesn't seem to represent too big of a problem (except for ISSNs with “0s” (zeros) in several of the first few positions).
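The collision can be sketched in a few lines of Python. This is not BioGUID.org's matching code. It assumes, purely for illustration, that the dash is dropped and that all-digit strings are then compared as integers, which is one way leading zeros could widen the set of matches. The function names and every identifier other than the ISSN 0374-5481 are hypothetical.

```python
def match_key(identifier):
    """Build the comparison key used by this sketch (an assumed rule set)."""
    stripped = identifier.replace("-", "")  # assumption: the dash is ignored
    if stripped.isascii() and stripped.isdigit():
        return str(int(stripped))  # assumption: all-digit strings compare as integers
    return stripped.lower()


def search(index, query):
    """Return every indexed identifier whose key equals the query's key."""
    wanted = match_key(query)
    return [identifier for identifier in index if match_key(identifier) == wanted]


index = [
    "0374-5481",  # the ISSN mentioned in this FAQ
    "3745481",    # hypothetical numeric catalog number
    "03745481",   # hypothetical numeric identifier from some other domain
    "1234-567X",  # hypothetical ISSN ending in X, so never treated as an integer
]

hits = search(index, "0374-5481")
print(hits)
```

Under these assumptions, a search for the one ISSN also returns the two unrelated numeric identifiers, while the ISSN ending in X is left alone.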

Another problem with ISSNs is that, in many cases, the ISSN is embedded within Digital Object Identifiers (DOIs) assigned to individual articles within the Journal — which means that a search for the ISSN will also yield results for many of the individual articles within the journal (because the search picks up the ISSN within the DOIs).
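A small Python sketch shows why this happens when a search matches on substrings. The ISSN and DOIs below are invented for the example (the 10.9999 prefix and the journal are hypothetical), and the function name is ours, not part of any BioGUID.org service.

```python
def contains_search(index, query):
    """Split hits into exact matches and identifiers that merely embed the query."""
    needle = query.lower()
    exact, embedded = [], []
    for identifier in index:
        candidate = identifier.lower()
        if candidate == needle:
            exact.append(identifier)
        elif needle in candidate:
            embedded.append(identifier)
    return exact, embedded


index = [
    "1234-5679",                         # hypothetical journal ISSN
    "10.9999/j.1234-5679.2010.00001.x",  # hypothetical article DOI embedding that ISSN
    "10.9999/j.1234-5679.2010.00002.x",  # another hypothetical article DOI
    "10.9999/other.2010.00003",          # hypothetical DOI without the ISSN
]

exact, embedded = contains_search(index, "1234-5679")
print(len(exact), "exact match;", len(embedded), "DOIs that embed the ISSN")
```

Separating exact matches from embedded ones, as this sketch does, is one possible way to present such results; whether BioGUID.org does anything similar is not stated here.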

Yet another problem with ISSNs is that they are not always precisely mapped to Journals in the same way that other identifier systems apply identifiers to journals. An extreme example is Annals and Magazine of the American Museum of Natural History (ISSN: 0374-5481). Only one ISSN has been assigned to all sixteen series of this Journal; but other bibliographic databases treat each series as though it were a separate "Journal" (i.e., a distinct conceptual object). The question for BioGUID.org is: should it merge all of these different Series of this journal into a single data object (in which case the object will be associated with sixteen different identifiers for each of the Identifier Domains that regard each series as a distinct data object)? Or, should BioGUID.org maintain each Series as a distinct data object (in which case there will be sixteen different results for the same search on an ISSN)? For now, we're sticking with the latter; but this may change.

Finally, unlike DOIs, there is no universal Dereference Service for ISSNs (at least none that aren't behind a pay-wall). In fact, there are numerous online ISSN Dereferencing Services, but in general, they serve a specific subset of Journals. Whenever you search for an ISSN you'll see a string of little icons next to each result record (i.e., the contents of the returned AlternateDereferenceServices column). Many of these may not have any information on any one particular ISSN, so it's not obvious which should be designated as the PreferredDereferenceService for the ISSN Identifier Domain in BioGUID.org. For now, we've settled on ResearchGate as the preferred Dereference Service for ISSNs (as they seem to be among the most complete), but that may change.

IdentifierDomain

What is an "IdentifierDomain"?
An IdentifierDomain is a set of Identifiers that share certain properties, and are unique within the set (exceptions to uniqueness exist, and are accommodated by BioGUID.org, but are recognized as errors). For example, Digital Object Identifiers (DOIs) are an IdentifierDomain. And so are ISSNs. In both cases, the identifiers are consistent in form, are (or at least are intended to be) globally unique within the Domain, and are applied to a certain class of objects. It's not always clear where to draw lines between IdentifierDomains. For example, catalog numbers assigned to specimens in a Museum collection could be treated as belonging to a collection-level IdentifierDomain (e.g., an integer catalog number series that is unique within a particular collection), or if all catalog numbers are combined with a "Collection Code" prefix, then a single institution with multiple collections could be thought of as a single institutional IdentifierDomain (where the combination of collection code and catalog number together represent unique identifiers across the entire institution). If one considers that institution codes are (essentially) unique on a global scale, then the combination of institution code plus collection code plus catalog number could be thought of as a globally unique set of identifiers, and as such one could think of all of these so-called "Darwin Core Triplets" as representing a single IdentifierDomain. In most cases, the boundaries between IdentifierDomains are fairly obvious.
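The catalog number example can be sketched in Python by checking uniqueness at each of the three scopes. The institution codes, collection codes, and catalog numbers below are invented, and the colon separator is just one common way of writing a triplet; none of this reflects how BioGUID.org itself stores these values.

```python
from collections import Counter
from typing import NamedTuple


class SpecimenRecord(NamedTuple):
    institution_code: str
    collection_code: str
    catalog_number: str


def triplet(record, separator=":"):
    """Join institution code, collection code, and catalog number into one string."""
    return separator.join(record)


def duplicates(values):
    """Values that occur more than once, i.e. are not unique in the candidate domain."""
    return sorted(value for value, count in Counter(values).items() if count > 1)


# Hypothetical specimens that all happen to share catalog number 1001.
records = [
    SpecimenRecord("INST-A", "FISH", "1001"),
    SpecimenRecord("INST-A", "BIRD", "1001"),
    SpecimenRecord("INST-B", "FISH", "1001"),
]

catalog_only = [r.catalog_number for r in records]
collection_scoped = [f"{r.collection_code}:{r.catalog_number}" for r in records]
triplets = [triplet(r) for r in records]

print("catalog number alone:", duplicates(catalog_only))
print("collection code + catalog number:", duplicates(collection_scoped))
print("full triplet:", duplicates(triplets))
```

In this made-up data, the bare catalog number is only unique inside one collection, the collection-scoped form is only unique inside one institution, and the full triplet is unique across all three records.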

All content within the BioGUID.org site is available under the Creative Commons Zero license (Public Domain).