Recently, Rob Whitton and I were visiting my mentor, Jack Randall, at his home in Kaneohe. Jack was telling us about his current project: a revision of the fish genus Pempheris in the western Indian Ocean. Jack mentioned to us that he had a specimen from the Natural History Museum in London that he was examining, which had incomplete label data. The catalog number on the label was “BMNH 1901.1.30.61-67”, and a Google search yielded no results. Rob and I had just finished setting up BioGUID.org, and the first round of data import and indexing had just completed. So... we looked at each other and thought, “What the hell...why not give it a try?”. I booted up my laptop and we fired up Chrome and opened the BioGUID.org website. We figured the odds were vanishingly small that a completely random catalog number (on the jar of a fish specimen that just happened to be on Jack Randall's home office counter when Rob and I were visiting him) would already be indexed in our little system (that was less than a couple days old and hadn't even been properly tested yet). But nothing ventured, nothing gained, so I entered the catalog number into the search box, clicked the search button, and...
...it just WORKED! Not only did it work, it was lightening fast. Before my finger was even off the mouse button, the results were already on my laptop display. Not only did it find the catalog number, it found the cross-links to two separate GBIF Identifiers, three different NHM identifiers, and a second representation from FishBase (see: here).
Thus, the use-case for looking up a random identifier having something to do with Biodiversity was off to a good start!
UPDATE (9 April 2015): What a DELIGHTFUL surprise! I was helping Barb Kennedy (our collections manager for the Herbarium here at Bishop Museum) with a local database issue just now, and we were discussing how to deal with the fact that we have both sheet numbers and barcode numbers assigned to all new plant specimens coming in. I decided to show her BioGUID.org, and how it can be used as a tool to help deal with multiple identifiers assigned to the same ‘thing’. To help explain it, I had her read this use-case. But when I went to show her how Google couldn't find anything related to 1901.1.30.61-67, I was stunned to see that unlike a couple weeks ago (when there were no results at all), there are now six results. My first reaction was “Damn! Now people will read this Use-Case and wonder why I had said there were no results...” My next thought was, “Huh.... what a coincidence that Google just happened to index the NHM portal right after I searched for a catalog number.” But then it suddenly hit me: “HOLY CRAP!! That’s ME!!” Among the six results in the Google search, two are links to the BioGUID.org site, two are the same two links that a search on BioGUID.org finds for the NHM Portal, and the other two are the same two links that a search on BioGUID.org finds for the GBIF Portal. Obviously, Google’s search bots stumbled on BioGUID.org (I have no idea how — I guess via the GBIF Challenge site?), crawled this Use-Cases page, and found the link (and associated results) to the search results on BioGUID.org ... thereby adding all six links to the Google search index. Of course, this only applies to this particular identifier because it’s featured on the BioGUID.org home page and this Use-Case page. However, now I'm encouraged to post a file somewhere with ALL of the cross-links in such a way that Google can discover the whole set of a billion+ identifiers and their corresponding web links. To say that I'm stoked would be a VAST understatement!
A common issue with GBIF Occurrence data is that the same occurrence may be represented multiple times. There are several reasons why this can happen, mostly involving inconsistent use of institutionCode or collectionCode values; or the same records provided by different datasets; or both (among other reasons). Although GBIF has tools that attempt to minimize this duplication of records, there are many records that slip through. Moreover, GBIF currently does not have an internal mechanism for flagging these duplicates when they are discovered.
BioGUID.org is capable of discovering and flagging these records, because Identifiers are anchored to conceptual objects. Thus, when more than one identifier issued by the same Identifier Domain is linked to the same object record in BioGUID.org, it reveals redundant identifiers (i.e., multiple identifiers from the same Identifier Domain assigned to the same data object). Here's how it works:
When we imported records from GBIF, we cross-linked the GBIF internal identifiers (gbifID) to the corresponding “Darwin Core Triplet” (institutionCode + collectionCode + catalogNumber), such that both identifiers were anchored to the same conceptual object in BioGUID.org. Then we looked for cases where the same DarwinCore triplet was represented by more than one BioGUID.org object. When we are confident that the two records represent the same occurrence instance (i.e., same conceptual object), we merge the two BioGUID.org objects into one. That way, all other identifiers anchored to the same duplicate objects are thereby cross-linked to each other as well.
After completing this conceptual object merge, it's straightforward to discover cases of multiple identifiers assigned to the same object by the same Identifier Domain. That is, cases where a single Identifier Domain has assigned more than one of its identifiers to the same conceptual object. In the case of GBIF identifiers for occurrences, there are more than half a billion identifiers, so it takes a while to discover all of the redundant identifiers. However so far we have found the following:
The Index of multiple identifiers assigned by the same Identifier Domain to the same data object can be downloaded here. For an explanation of the download file and columns, see the description under “Export Data Structure” on the API page. We are currently running scripts to identify additional examples of multiple identifiers from within the same Identifier Domain, and we will be adding more Identifier Domains as we go; so the contents of this file will expand over time.
|All content within the BioGUID.org site is available under the Creative Commons Zero license (Public Domain).|