Oct 12 2015 8:38AM (UTC) It goes without saying that I am EXTREMELY GRATEFUL! I am deeply grateful to the Ebbe Nielsen Challenge jury members, and especially to GBIF for its efforts, including the establishment of this Challenge, to help the world gain access to biodiversity data. And of course, congratulations to Peter Desmet, Bart Aelterman and Nicolas Noé for their work on Datafable, which very much deserves the first-place award!
Oct 9 2015 10:39PM (UTC) If you look really closely, you might have noticed a very subtle change to the BioGUID logo (see the upper banner, and the bottom of the API Page; you might need to refresh your browser). The change involves a slight shift in the hue of the blue and green parts, plus the addition of a third (yellow) link in the chain between "Bio" and "GUID". Also, the background is now white (transparent for the png files). Why did I do this? Well, as I hinted in an earlier post, BioGUID is now formally part of the Global Names Architecture (GNA), and to represent this association, the BioGUID logo color scheme has been updated to match that of the GNA logo. Also, BioGUID's GitHub space has been moved to the GNA GitHub project. The old GitHub site will remain live, but only consists of a note re-directing visitors to the new BioGUID GitHub site. Over the next few weeks, I will gradually transfer all the documentation and source code over to the new GitHub repository.
Oct 8 2015 6:39PM (UTC) I discovered a bug that prevents uploading identifiers with no value entered for "RelationshipType" in the uploaded CSV. It should be easy to fix, but for now, the work-around is to make sure you include a value in the "RelationshipType" column of the uploaded file. If you're only submitting raw identifiers (with no related identifiers), just use "Congruent" for RelationshipType.
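A minimal sketch of the work-around described above: fill in any blank "RelationshipType" cell with "Congruent" before uploading. Only the "RelationshipType" column name and the "Congruent" value come from the post; the function name and the "Identifier" column in the usage note are hypothetical, and the real upload format may require other columns.

```python
import csv
import io

# Value suggested in the post for raw identifiers with no related identifiers.
DEFAULT_RELATIONSHIP = "Congruent"


def fill_relationship_type(csv_text: str) -> str:
    """Return CSV text in which blank RelationshipType cells hold the default."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    # Add the column if the file does not have it at all.
    if "RelationshipType" not in fieldnames:
        fieldnames.append("RelationshipType")

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Treat a missing, empty, or whitespace-only cell as blank.
        if not (row.get("RelationshipType") or "").strip():
            row["RelationshipType"] = DEFAULT_RELATIONSHIP
        writer.writerow(row)
    return out.getvalue()
```

For example, passing the text "Identifier,RelationshipType\nabc,\n" returns the same header followed by the row "abc,Congruent".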
Oct 8 2015 11:55AM (UTC) Another late night...sigh... I wanted to do some performance tweaking on the bulk data import process, so I set up a series of import dataset CSV files, one containing 10 records, one containing 100 records, one containing 1,000 records, one containing 10,000 records, one containing 100,000 records, and one containing 475,574 records. Unfortunately, my import routine crashed on the first dataset (10 records). Surely the problem must be related to the new performance monitoring code I added to the routine. Right? Nope. SIX hours of frustrating hair-pulling later, I finally sleuthed out the bug, which was related to a rare circumstance that just HAPPENED to be represented in those first ten random records. (Note to self: no one wins when you try to find an obscure bug on dense code when you're sleep deprived. No one.) In any case, the 100,000 record batch took about 2 minutes to complete, and the 475,574 record batch took about 15 minutes to complete. Not bad, but not great. More performance tuning is in order, methinks.
Oct 8 2015 6:42AM (UTC) I noticed that earlier today someone batch-uploaded a dataset that contained several email addresses as identifiers mapped to Agents. The import was successful, and the identifiers were correctly incorporated into the BioGUID.org index. However, email addresses are one of the few Identifier Domains that are classified as "Hidden". This means that the results are not displayed in the search output. For obvious reasons, we don't want to expose email addresses on a web service such as this. However, we might want to use BioGUID as a way of locating people when you already have an email address. For example, if I search for '8C466CBE-3F7D-4DC9-8CBD-26DD3F57E212' (my ZooBank/GNUB UUID), I don't want my email address included in the results. However, if I search BioGUID.org for "firstname.lastname@example.org", I see no reason why I shouldn't display the results of other identifiers mapped to me (including my ZooBank/GNUB UUID). I'll give this some thought.
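One way the behaviour under consideration could work, as a sketch only: a hidden identifier may be used as the search term, but it is never returned in the output. The record class, its field names, and the function name are all hypothetical, and this is not how BioGUID.org is known to implement hidden domains.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class IdentifierRecord:
    """Hypothetical shape of one indexed identifier."""
    object_id: int   # the data object (e.g. an Agent) the identifier maps to
    domain: str      # name of the Identifier Domain
    value: str       # the identifier itself
    hidden: bool     # True when the domain is classified as "Hidden"


def visible_results(search_term: str,
                    records: Sequence[IdentifierRecord]) -> List[IdentifierRecord]:
    """Match on any identifier, hidden or not, but return only non-hidden ones."""
    # Objects that carry the search term as one of their identifiers.
    matched_objects = {r.object_id for r in records if r.value == search_term}
    # Every identifier of those objects, minus the ones in hidden domains.
    return [r for r in records
            if r.object_id in matched_objects and not r.hidden]
```

Under this rule a search by UUID never reveals an email address, while a search by email address still returns the UUID mapped to the same person.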
Oct 7 2015 9:52PM (UTC) This week, Dima Mozzherin and Alexander Myltsev are visiting me to work on the Global Names Architecture, and they are very kindly giving me a crash course in using GitHub correctly. I'll report more on our meeting soon (and the relevance of our GNA discussions to BioGUID.org), but over the next week or so, I will be moving more and more content and source code from BioGUID.org to GitHub. Watch this space!
Oct 7 2015 5:28PM (UTC) OK, it hasn't been 24 hours, but I'm satisfied now that the memory leak issue really has been solved. So I'm going to start playing with bulk import performance.
Oct 7 2015 9:29AM (UTC) The screen-scraping madness finally stopped! I hope I can now get an actual test on the memory leak issue. As soon as that's done, I've got a bunch of new identifiers to start importing!
Oct 6 2015 12:34PM (UTC) Oh, and by the way... while I was waiting for the Tomcat thing to sort itself out, I spent some time creating a new documentation page on the data architecture. It's not finished yet, but it's important to try to get as much documentation online as possible. Nowhere near as much fun as writing code, but important nonetheless. Oh, and another by the way: I updated the content on GitHub to include SQL scripts for generating all the objects in the four databases used to run BioGUID. You can access them directly here.
Oct 6 2015 12:29PM (UTC) Hokay... so I was able to track down the source of the runaway Tomcat situation. It turned out to be a screen-scraping exercise to harvest identifier cross-links against another website on the same server (yeah, I'm looking at you, Global Phylogeny of Birds people.... ahem). Actually, I have no problem with them doing it; it just messed up my memory leak test; so I'm going for another 24 hours to see if I really did fix the problem. In any case, all that screen-scraping thing really tells me is that I need to finish making BioGUID services much more functional, and much more visible, so people can just come and get the cross-linked identifiers, without having to ping them one at a time at a rate of about 20 per second...
Oct 6 2015 6:24AM (UTC) Well.... there's good news, and there's bad news. The good news is that, after some 20 hours (ish), the memory leak problem seems to be solved! (Woo-hoo!) The bad news is that I can't be sure if it's really solved, because I have a new (separate) issue with Tomcat eating up all my server CPU. SQL's also pegged out (as it can be at times when there is heavy usage), but Tomcat is saturating the CPU usage (which almost never happens), so something else seems to be happening now. I don't think my fixes of the memory leak issue touched Tomcat, so I suspect it's an unrelated issue. The problem is: I don't know whether I really solved the memory leak problem, or if the runaway Tomcat thing has just masked it (i.e., prevented it from happening by saturating the server). If only I were a real computer programmer, instead of a taxonomist, I wouldn't feel like such a noob all the time. Sigh. OK, it will probably be a long night (again)....
Oct 5 2015 7:39AM (UTC) OK, except for taking a break to watch "The Martian" with my son (very good, but the book is better), I spent most of today fixing the orphan Identifiers object, dithering away on documentation for BioGUID.org, and tracking down the source of the memory leak. I think I figured it out, but I need to leave the system alone for 24 hours to make sure (I don't want to confound performance stats and memory usage with my efforts to monkey around with the underlying system). So, I'll leave the BioGUID.org site alone tomorrow to get some "clean" stats, and instead crank out some more documentation (data model, etc.). If all goes well, I'll be back to tweaking performance and increasing the robustness of the bulk import process, while simultaneously adding more content!
Oct 5 2015 3:55AM (UTC) By the way, I forgot to mention earlier that I decided to expand the categories of search result matches to three. I used to split "Exact Matches" from "Close Matches", but "Exact" matches were not really exact. So, now "Exact" matches literally mean exact matches, where the search term exactly equals the identifier. "Close" matches are now what "Exact" matches used to be: that is, if the search term is found verbatim within the Identifier, but it's not an exact match, then it's considered a "Close" match. Finally, other matches that somehow relate to the search term, but not closely, are displayed in the new category, "Other Matches". I hope that makes sense...
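The three categories can be sketched as a small classifier. This assumes a plain, case-sensitive string comparison with no normalization (the post does not say how case or whitespace are handled), and the function name is hypothetical. How BioGUID.org decides that a result "somehow relates" to the search term is not described, so the sketch simply labels every remaining candidate as "Other".

```python
def classify_match(search_term: str, identifier: str) -> str:
    """Label a candidate identifier as an Exact, Close, or Other match."""
    # "Exact": the search term exactly equals the identifier.
    if search_term == identifier:
        return "Exact"
    # "Close": the search term appears verbatim inside the identifier.
    # An empty term is excluded because it is a substring of every string.
    if search_term and search_term in identifier:
        return "Close"
    # "Other": the candidate was returned for some other reason.
    return "Other"
```

With the search term "12345", the identifier "12345" is "Exact", "BPBM ent 12345" is "Close", and "BPBM ent 54321" is "Other".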
Oct 5 2015 2:29AM (UTC) The missing object records for the orphan Identifiers have now been completely generated. There were over 150 thousand of them, but because of the way SQL works to find such orphans (i.e., via an outer join with a missing link, or via a "NOT IN" WHERE clause), it actually takes a long time to find them all. At first, I was able to find them in batches of about 10,000 in 10 minutes (also allowing the server an additional half hour between each batch to update full-text indexing across the billion identifiers and half-billion objects). However, the fewer orphans there are, the longer it takes SQL to find them. The last three missing records took 3 hours and 42 minutes to find! Once the missing object records were created, it took an additional 2 hours to apply the referential integrity constraints on the database (again, a billion identifiers cross-linked to a half-billion object records) to prevent any more such orphan identifiers in the future. The site was slowed to a crawl during this time, but the dust now finally seems to be settled, so BioGUID should be back up and running again at normal speeds.
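The two ways of finding orphans mentioned above (an outer join with a missing link, or a "NOT IN" clause) can be illustrated on a toy database. This sketch uses Python's built-in sqlite3 module as a stand-in for the real database server, and the table and column names (Object, Identifier, ObjectID, IdentifierID, Value) are assumptions rather than the actual BioGUID schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Object (ObjectID INTEGER PRIMARY KEY);
    CREATE TABLE Identifier (
        IdentifierID INTEGER PRIMARY KEY,
        ObjectID     INTEGER NOT NULL,
        Value        TEXT NOT NULL
    );
    INSERT INTO Object (ObjectID) VALUES (1), (2);
    -- Identifiers 12 and 13 point at objects 3 and 4, which do not exist.
    INSERT INTO Identifier (IdentifierID, ObjectID, Value) VALUES
        (10, 1, 'a'), (11, 2, 'b'), (12, 3, 'c'), (13, 4, 'd');
""")

# Approach 1: outer join, keeping rows where the Object side is missing.
ORPHANS_OUTER_JOIN = """
    SELECT i.IdentifierID
    FROM Identifier AS i
    LEFT JOIN Object AS o ON o.ObjectID = i.ObjectID
    WHERE o.ObjectID IS NULL
    ORDER BY i.IdentifierID
"""

# Approach 2: NOT IN against the full list of existing objects.
ORPHANS_NOT_IN = """
    SELECT IdentifierID
    FROM Identifier
    WHERE ObjectID NOT IN (SELECT ObjectID FROM Object)
    ORDER BY IdentifierID
"""

via_join = [row[0] for row in conn.execute(ORPHANS_OUTER_JOIN)]
via_not_in = [row[0] for row in conn.execute(ORPHANS_NOT_IN)]

# Repair step: create the missing Object records for the orphans.
conn.execute("""
    INSERT INTO Object (ObjectID)
    SELECT DISTINCT i.ObjectID
    FROM Identifier AS i
    LEFT JOIN Object AS o ON o.ObjectID = i.ObjectID
    WHERE o.ObjectID IS NULL
""")
remaining = conn.execute(ORPHANS_OUTER_JOIN).fetchall()
```

Both queries return the same orphans here, and the list is empty after the repair step. A foreign-key constraint from Identifier.ObjectID to Object.ObjectID, declared once the data are clean, is the usual way to keep new orphans from appearing.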
Oct 4 2015 7:19PM (UTC) Oops! Well... I uploaded the files for the new data export services, but forgot to update the Data page with the user-interface to allow people to play with the new export services through the website. Fixed now.
Oct 4 2015 7:10PM (UTC) OK, after a brief nap, I've got a routine working now to fix all the orphaned identifiers (i.e., identifier records linked to Object records that don't yet exist in BioGUID.org). As soon as that's done, I'll alter the database to prevent it from happening again (rookie mistake on my part). Meanwhile, I just discovered a glitch that was causing the "Exact Matches" vs. "Close Matches" to give funky results when searching for identifiers. It should be fine now.
Oct 4 2015 1:48PM (UTC) YIKES! OK, I just discovered an issue where Identifiers might be linked to records that don't exist in the Object table! This was a serious oversight on my part in establishing the core database infrastructure. It's nearly 4am here in Hawaii, so I'll have to deal with this tomorrow. I don't know how extensive the problem is, but the server may be intermittently offline for the next day or two -- depending on how many records are affected.
Oct 4 2015 1:41PM (UTC) After returning from our expedition to the land of ample bandwidth (and putting out all manner of fires and catching up on some badly needed sleep), I was finally able to upload the rest of the code for BioGUID.org. The good news is that the full functionality of the site is finally working! The bad news is that there still seems to be a memory leak associated with bulk uploading datasets. I will continue to research this for the remainder of the weekend. I also need to update all the documentation, and get all the source code loaded onto GitHub. I hope to have this done within the next couple of days.
Sep 28 2015 10:44PM (UTC) For what it's worth, it took me several minutes just to upload the text of the previous news item, so you can imagine how difficult it is to manage a server through a remote connection at this speed!
Sep 28 2015 10:41PM (UTC) Well... my hopes of having better internet access from the ship have not panned out, so I am still unable to finalize the code on the BioGUID server. There have been some issues with the server itself, but it seems to be working most of the time. The good news is that I will be back in the land of high-speed internet in two days, so I hope to have everything uploaded by October 1.
Sep 21 2015 11:59AM (UTC) So... I forgot to mention: until I can figure out where the memory leak is coming from, the site will likely be intermittently unavailable and/or slow. Sorry about that! Maybe I'll get lucky and the problem will just go away on its own. Yeah, right....
Sep 21 2015 11:57AM (UTC) After eight consecutive days of deep diving, plus several days of no internet access (not to mention a temporarily broken keyboard on my very expensive laptop...), I'm finally now able to get back on the BioGUID server to finish transferring code for the new data services. Unfortunately, when I logged in to the server, I discovered there has been a pretty serious memory leak issue. This was not evident on the development server, and it might just go away when I get the rest of the code ported (oh, to be that lucky!). However, I'm bracing myself for several tedious hours of troubleshooting through what feels like the equivalent of a 9600-baud modem connection to the internet. *sigh* On a brighter note, I was able to finish up a major NSF proposal due on Tuesday, and just finished uploading it to the NSF website. Of course, I need to do this at 1am local time to avoid competition for the precious few data bits I can send and receive to the internet...
Sep 11 2015 3:59AM (UTC) The server has been acting strangely all day long, and I'm not sure why. I've had to restart it several times, and it seems OK for a while, but then it grinds to a halt. I can't help but think that I might have broken something yesterday during my marathon session to port the new code over to the site via the ship's internet, but I don't see any obvious issues, and there are no smoking guns in any of the activity logs. I'll keep an eye on it and see if the problem continues.
Sep 10 2015 12:50PM (UTC) Damn! I just discovered that the background processing of identifiers associated with new objects (i.e., data objects not already in BioGUID.org) is extremely processor intensive, due to the cache updating and subsequent generation of full-text indexes. Not only does this slow the data import process for batch uploads, but it turns out that it brings the server to its knees. Obviously, we'll need to re-architect the batch import process (perhaps dividing large batches into smaller batches), but that will have to wait until we return to Hawaii. Until then, expect BioGUID.org to be slow or non-responsive while large datasets (more than a few thousand records) are imported.
Sep 10 2015 12:45PM (UTC) I've just run a test of the Batch Import feature and it seems to be working correctly. Currently, it requires careful formatting of the submitted CSV file, but when we return to Hawaii we plan to implement a system similar to the IPT system, where content providers can create a metadata file describing the structure of the batch file, and then generate updates on a web server that can be regularly updated and harvested by BioGUID.org. It had been my intention to have most of this implemented before I got on the NOAA ship, but I let it slide because I figured I could finish it on the ship. Little did I realize how painstaking that process would be! To give you an idea how tedious it has been, each keystroke and mouse click takes about 3 seconds through the ship's internet!
Sep 10 2015 10:58AM (UTC) More Progress! It took the better part of the day, but we finally got most of the code transferred to the production server through the ship's internet. HUGE thanks to Rob Whitton! We didn't get everything transferred (the code for two more data export services needs to be ported), but we at least got the new API file updated, and the Batch Import system seems to be fully transferred and running. Read the API page for more details, or visit the Data page to try it out. We didn't spend much time making it look pretty, but we'll try to clean it up a bit over the next few days. And we'll definitely do a lot of clean-up work after we return to Hawaii at the end of September. Meanwhile, please report any issues to me at deepreef [at] bishopmuseum.org. The internet problems are mostly related to uploading content. Downloading (viewing web pages and receiving email) seems to be working a bit better on the ship.
Sep 10 2015 4:22AM (UTC) Progress! We seem to have the Domains Page functional, so you can now create new Identifier Domains through the live website. More coming soon!
Sep 9 2015 7:11PM (UTC) Making slow but steady progress transferring the code files. Unfortunately, last night I tried to transfer some files but the internet on the ship cut out when the power cut out, and I ended up breaking the BioGUID home page! It's fixed now, but we still have more files to transfer, so there may be some wonkiness on the site over the course of the day.
Sep 5 2015 11:40PM (UTC) Crap! The ship's internet sort of works for email and web pages, but heavier lifting (like transferring code to the BioGUID server) seems to be blocked. So.... some of the pages got transferred, but a lot of the back-end code did not. I'm hoping to get this solved within the next few days. My apologies!
Sep 4 2015 7:03PM (UTC) To give you an idea of the internet situation, it took me more than a minute just to transfer that last news item text.... yikes!
Sep 4 2015 7:02PM (UTC) Oh, and by the way... I'm on a NOAA ship en route to the remote Northwestern Hawaiian Islands (amid a few hurricanes), trying to upload two large video files through a very slow and tenuous satellite internet connection. Obviously, it would have been smart to have transferred the files before getting on the ship; but alas, the days leading up to this cruise were filled with preparations for the cruise!
Sep 4 2015 6:58PM (UTC) No, this site is not dead! (Despite what it might seem based on the nearly 5-month hiatus from updating this newsfeed.) I can't say I've been working on BioGUID.org continuously for the past five months (there was a major expedition and some other travelling, some grant proposals and reports to write, some major issues at the Museum, my daughter home for the summer, etc.). However, I can also say that an absence of newsfeed posts does NOT reflect an absence of progress! I will be reporting on the progress over the next few days, but for right now I need to transfer the latest code to the production site in time for the GBIF Ebbe Nielsen Challenge (Round 2) deadline.
Apr 16 2015 10:26AM (UTC) I'm still running batch processes behind the scenes, so I sincerely apologize for the inconsistent performance. Almost all of these resource-intensive processes are related to initializing datasets and generating additional indexes. Once the dust settles, routine content enrichment should be far less resource-intensive, and will have much less of an impact on the overall performance of the system. Until then, thanks for bearing with me!
Apr 15 2015 2:11AM (UTC) DAMN! I guess I picked the wrong time to run a bunch of background processes! To everyone visiting this site for the first time: I promise that the search results are normally very fast (1-3 seconds)! But I'm moving around 220GB of data right now (alongside the 550GB BioGUID database), so the system is much slower than normal! I'll try to get it back to the normal stable/fast performance as soon as possible!
Apr 15 2015 2:01AM (UTC) Wow!! I am deeply honored, humbled, and excited that BioGUID.org has been selected as one of the finalists for the GBIF Ebbe Nielsen Challenge! A hearty THANK YOU to the jury; this has definitely re-invigorated me in keeping this project moving forward! More to come soon!
Apr 15 2015 1:56AM (UTC) The FTP transfer of the new GBIF download is complete, and I'm running some batch processing in the background. Over the next few days there will be periods of time when the server is much slower than normal (including right now). I hope to have this done within the next couple of days, and I will try to confine the processing to the middle of the day in Hawaii (night-time everywhere else in the world!)
Apr 14 2015 5:33AM (UTC) Once again, an absence of posts to this news-feed does not reflect an absence of progress. Last week I downloaded a new cut of the GBIF dataset (identifier fields only — thanks again to Tim!), and I spent this weekend processing and indexing it. This time I did the processing on a different server so that I wouldn't impact performance on the production server. The bad news is that there are still 11 hours left on a 36-hour FTP transfer of the processed file to this server. Sigh...
Apr 9 2015 10:13PM (UTC) HOLY CRAP! It's WORKING!! No, I don't just mean that the BioGUID.org website and data service are working; I mean the BioGUID.org CONCEPT is working! Check out the “Update” to Use-Case #1...
Apr 8 2015 11:30AM (UTC) So... I noticed a lot of queries for filenames ending with ‘.jsp’ (obviously robot probes looking for security holes). Out of curiosity, I did a search on ‘.jsp’, and came up with three objects, one of which had a zillion linked identifiers. At first I thought it was an error, but it's actually correct (all are identifiers associated with the Black-headed Gull, Chroicocephalus ridibundus). However, at least two of the links were broken. The IUCN identifier had been linked to ‘106003240’ (should be ‘22694420’), and the OBIS identifier had been linked to ‘834981’ (should be ‘460266’). I corrected both records, but my concern is: how many other identifiers out there really aren't so persistent? And... should BioGUID.org track historical identifiers (with re-direction to the correct ones)? Food for thought.
Apr 8 2015 11:04AM (UTC) I was experimenting with building a full-text index on one of our import databases, and I inadvertently filled the hard drive to the point where queries slowed to a crawl. I've solved the problem, but for part of today the search results were very slow.
Apr 7 2015 12:47AM (UTC) I just realized that the Journal ZooKeys has its own internal identifiers assigned to articles, with its own Dereference Service. I indexed one record, but will ask Pensoft for the full set of identifiers.
Apr 7 2015 12:38AM (UTC) I was happy to see that our new paper on Identifiers was published in ZooKeys. I was EXTREMELY happy to see that the various identifiers for this paper were already in BioGUID.org! (...automatically imported from GNUB)
Apr 5 2015 8:36PM (UTC) Oops! I just discovered that I wasn't properly clearing the search buffer table, and it had grown to nearly 9 million records! I've purged it now, so I hope that search performance will improve. NOW I'll spend some time with family!
Apr 5 2015 9:26AM (UTC) I'm going to leave the system alone for the next couple of days (while I spend time with family), and monitor performance. Next week I've got a few million more identifiers to import, then it's all about developing new services.
Apr 5 2015 9:24AM (UTC) Over the past few days as I've been running scripts in the background, the search performance slowed to roughly 10 seconds per search (with some searches taking 30-90 seconds). I've finished this for now, so searches are back down to 1-2 seconds.
Apr 5 2015 9:20AM (UTC) I've now completed the script to refresh new identifiers from ZooBank/GNUB. I'll continue testing it over the next few days, then implement it as a general service so any issuer of identifiers can maintain current identifiers indexed in BioGUID.org.
Apr 4 2015 8:54AM (UTC) One of the things I'm working on is a service to allow automatic updating of identifier lists from external sources. I spent today testing it with GNUB/ZooBank identifiers (still ongoing).
Apr 4 2015 8:44AM (UTC) After a week of working on other projects, I returned to development on BioGUID.org. The system is currently slow because of some background processing and indexing. More to report tomorrow.
Apr 1 2015 5:55PM (UTC) I added a feature that allows identifiers to be searched directly by placing them after “http://bioguid.org/” (e.g., http://bioguid.org/BPBM ent 12345)
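A sketch of building such a direct-search URL in code. The base URL and the example identifier come from the entry above; the function name is hypothetical, and percent-encoding the identifier is an assumption, since the entry shows the identifier with raw spaces and does not say how the server treats encoded characters (including slashes).

```python
from urllib.parse import quote

BASE_URL = "http://bioguid.org/"


def direct_search_url(identifier: str) -> str:
    """Append an identifier to the base URL, percent-encoding unsafe characters."""
    # safe="" also encodes "/" so the identifier stays a single path segment.
    # Whether BioGUID.org expects that for identifiers containing slashes is unknown.
    return BASE_URL + quote(identifier, safe="")


url = direct_search_url("BPBM ent 12345")
# url is "http://bioguid.org/BPBM%20ent%2012345"
```

Browsers perform the same space-to-%20 encoding automatically when the raw form is typed into the address bar.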
Mar 30 2015 7:32AM (UTC) One last thing before calling it a night: it looks like almost all recent searches took less than a second! Fingers crossed that this speedy behavior continues!
Mar 30 2015 7:25AM (UTC) I decided to post my on-going “To-Do” List on the Home Page, so it's more apparent what has already been done, and what's planned for the near-term future. If nothing else, it will serve as a reminder to me!
Mar 30 2015 6:59AM (UTC) I added support on the identifier Search Results page for additional Dereference Services. The Preferred Dereference Service is the first icon after the identifier, and any other Dereference Services are shown as additional icons.
Mar 29 2015 11:11PM (UTC) I added a new output column to the Search Identifier service, called ‘AlternateDereferenceServices’. See the API page for more details.
Mar 29 2015 11:09PM (UTC) BioGUID.org now has a Logo! See the left end of the banner. OK, so I'm not an artist... but it's reasonable, I think.
Mar 29 2015 5:40PM (UTC) I changed the column name "Logo" to "IdentifierDomainLogo" in the existing APIs. I did this because I now am adding support for DereferenceServiceLogo as well.
Mar 29 2015 8:06AM (UTC) I want to leave the system alone for a few days, so it will actually work the way it's supposed to. However, next week I'll start looking into the next big batch of identifiers to index — probably the values of occurrenceID in the GBIF dataset.
Mar 29 2015 8:02AM (UTC) OK, things seem to be working properly now. Searches are still slow (averaging about 10 seconds) — but that's much better than it was before (several minutes). No more batch processing for a while...
Mar 29 2015 6:59AM (UTC) Residual indexing is still ongoing, so searches will continue to be slow. If this lasts more than another day, I'll restart the server. Apologies for the inconvenience!
Mar 28 2015 5:29PM (UTC) In case anyone is interested, the specific aspect of the server that is maxed out is disk I/O (not CPU, RAM or Network). This is why I think it's the internal SQL index refresh that is causing the problem.
Mar 28 2015 5:25PM (UTC) Apparently, simply stopping the batch processing did not unlock the server (most likely due to lingering indexing). For now, I'll wait to see if it eventually finishes; but for the time being, searches will be slow.
Mar 28 2015 4:59PM (UTC) Well... something went awry with my batch process to merge data objects, and it essentially locked up the server. My sincere apologies to anyone who tried to use BioGUID during the past few hours. I've stopped the batch processing for now.
Mar 28 2015 6:58AM (UTC) The system is now fairly stable. I'm continuing to run processes to locate and merge data objects based on certain identifier values. These are run in batches of 50,000 identifiers every 30 minutes or so, and result in momentary slow search performance.
Mar 27 2015 12:35PM (UTC) One final note before I call it a night: I'm still running scripts in the background to discover duplicate identifiers, so the search feature may be intermittently slow. When the scripts are not running, a search should only take 2-3 seconds at most.
Mar 27 2015 12:33PM (UTC) Just a quick follow-up to the previous post: the download files are intended to be examples of what BioGUID.org can do. Most of these will be developed as dynamic data services, rather than static download files.
Mar 27 2015 12:30PM (UTC) The first downloadable file index includes examples of multiple identifiers issued by the same Identifier Domain for the same data object (see Use Case 2). Others will follow.
Mar 27 2015 12:28PM (UTC) I added more information on the API page, describing a new “Export Data Structure” download file format that allows cross-linking identifiers to each other.
Mar 27 2015 12:17PM (UTC) I created a new Use Cases page that helps explain the value and function of BioGUID.org through two specific use cases. There are MANY other use cases that will be added in the near future, so check back often!
Mar 27 2015 12:14PM (UTC) The absence of new News Items in recent days is not reflective of an absence of progress! The indexing finished a couple of days ago, and I have spent the past two days reviewing content and further developing the website. Details in a moment!
Mar 24 2015 11:42PM (UTC) Oops! I guess the indexes are not quite done yet. I discovered a glitch that affects the search feature using the index, which is currently updating. I'll make sure it's really ready before my next post (within the next few days).
Mar 24 2015 8:10PM (UTC) The first new dataset was just successfully imported! The entire import process took only 540 milliseconds. It only involved 28 new identifiers; but the important thing is the import code works! About a half-million more identifiers in the pipeline...
Mar 24 2015 7:59PM (UTC) It's READY! After two weeks of trying to get the indexing of over a billion identifiers for more than half a billion data objects correctly implemented, BioGUID.org is now fully functional! More to follow throughout the rest of today...
Mar 23 2015 1:16PM (UTC) The indexing is finally complete! But.... it's after 3am here in Hawaii, so time for some sleep. More in a few hours...
Mar 22 2015 8:32AM (UTC) We seem to be making steady progress on indexing records at a rate of 10 million objects every hour and 20 minutes or so. When we reach 530 million objects, the indexing will be complete and I'll start adding new content.
Mar 19 2015 10:55PM (UTC) After nearly a week, the indexes were still being built. I restarted the process in batches of 10 million records at a time. Based on the current batch rate, I expect the process to be complete in about two days. Fingers crossed!
Mar 18 2015 5:41AM (UTC) The indexing process is STILL ongoing (sigh...). There's not much I can do with the database itself until the indexing is complete, but I am preparing some new datasets for import as soon as it's done.
Mar 17 2015 2:34AM (UTC) Also, searching will be slow and results will be incomplete until the indexing has finished.
Mar 17 2015 2:34AM (UTC) The full-text indexing of identifiers is still ongoing. I don't know how much longer it will take, but I'm holding off adding more identifier records until after it completes. I'll post an update when that happens.
Mar 14 2015 4:57PM (UTC) As of this morning (Hawaii time), the indexes are still being rebuilt, so I apologize for the slow response of the server. There's no easy way to predict how long this will take, but I'm hoping it will be done by the end of the weekend.
Mar 13 2015 9:36PM (UTC) All GBIF-issued Occurrence identifiers (gbifID) and their corresponding DarwinCore "Triplet" (institutionCode+collectionCode+catalogNumber) identifiers have now been imported to BioGUID.org. Later we will also import values of occurrenceID.
Mar 13 2015 8:18PM (UTC) We just surpassed one BILLION indexed identifiers in BioGUID.org! It will be another day or so before the internal indexes are fully re-built, and the services functioning as they should. Then the real fun begins!
Mar 13 2015 6:51AM (UTC) I need to rebuild some indexes on the large tables, which means two things: 1) The rest of the GBIF identifiers won't be imported for another couple of days; and 2) the web services and web site will probably be extremely slow for a while.
Mar 13 2015 5:26AM (UTC) I tried importing 100 batches of records with one million records per batch (instead of ten batches with ten million records each), and that proved to be a mistake (the server bogged down after only 31 million records). I need to allow the server some time to breathe, then will finish importing the GBIF records in batches of 10 million records.
Mar 12 2015 4:24PM (UTC) Another hundred million records imported while I slept! And it only took three and a half hours. I need to cap the memory used by the database; otherwise it pegs out and brings the server to its knees. About to kick off another hundred million records...
Mar 12 2015 5:20AM (UTC) The time I spent revising the data model paid off — I just imported 60 million GBIF records in less than two hours. The optimum ratio seems to be batches of 10 million records separated by 10 minutes (to allow indexing to catch up).
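The batch-then-pause pattern described in this entry can be sketched roughly as follows. This is an illustrative outline, not the actual BioGUID import routine: `import_batch` is a hypothetical callback standing in for whatever loads one batch into the database, and the defaults simply mirror the 10 million records and 10 minutes mentioned above.

```python
import time
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def import_in_batches(records, import_batch, batch_size=10_000_000,
                      pause_seconds=600, sleep=time.sleep):
    """Import records batch by batch, pausing between batches.

    import_batch is a hypothetical callable that loads one batch.
    The pause gives indexing time to catch up before the next batch.
    """
    total = 0
    for index, batch in enumerate(batches(records, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # no pause before the first batch
        import_batch(batch)
        total += len(batch)
    return total
```

Passing `sleep` as a parameter makes it easy to try the loop on a small dataset without actually waiting ten minutes between batches.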
Mar 11 2015 5:30PM (UTC) I've slightly revised the data model to improve import performance, and also to make the model more scalable for large numbers of data objects (>> 2 billion). This also simplified the model and eliminated the superfluous UUIDs generated by BioGUID.org.
Mar 11 2015 8:58AM (UTC) After some experimentation, I've decided to import GBIF Occurrence records in batches of 1 million records, at an interval of one batch every 30 minutes. At this rate, it will take some time to get all 528 million GBIF records imported, but fortunately this only needs to be done once!
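For a sense of scale, the schedule in this entry works out to roughly eleven days. The snippet below is back-of-the-envelope arithmetic only, assuming exactly 528 million records, one batch every 30 minutes, and no downtime.

```python
import math


def estimate_import_days(total_records, batch_size, minutes_per_batch):
    """Estimate elapsed days when one batch starts every minutes_per_batch."""
    batch_count = math.ceil(total_records / batch_size)
    total_hours = batch_count * minutes_per_batch / 60
    return total_hours / 24


# 528 batches x 30 minutes = 264 hours
print(estimate_import_days(528_000_000, 1_000_000, 30))  # prints 11.0
```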
Mar 10 2015 7:08PM (UTC) I can now process 100,000 GBIF records in about 20 seconds. I tried processing a million at a time but it was less efficient, so batches of 100,000 seem to be about the best performance. Later today I'll figure out how close together I can process the batches.
Mar 10 2015 10:15AM (UTC) The first 200,000 GBIF Occurrence identifiers were successfully imported into BioGUID! Unfortunately, the current process takes about 7 minutes per 100,000 records. I'll work on improving this performance.
Mar 10 2015 5:49AM (UTC) Unfortunately, I'm bumping up against some server limits in processing GBIF identifiers. While I get that sorted, check out the new FAQ page. It's still under development, but I'll add more content later this week.
Mar 10 2015 3:22AM (UTC) The first 1 million GBIF records took only about a minute to process. However, as the indexes grew, each successive million records slowed down. I'm now planning to do them in smaller batches.
Mar 10 2015 3:19AM (UTC) I was a bit ambitious and tried to process the first 100 million GBIF records in one go. After 7 hours and 45 minutes of processing, the server was brought to its knees, and needed to be restarted.
Mar 9 2015 8:27PM (UTC) I just discovered a glitch in the indexing search routine that failed to find matches for identifiers without any associated DereferenceServices (e.g., ISSNs). This has been fixed, so searches on ISSNs (and other non-service-based identifiers) should be working fine now.
Mar 9 2015 6:04PM (UTC) And now the real work begins! After downloading 528 million+ records from GBIF (thanks to Tim Robertson for creating an identifier field dump for me!), I'm now processing the records for importing to BioGUID.org.
Mar 9 2015 8:20AM (UTC) Imported 62,960 new identifiers (mostly ISSNs and DOIs), representing 39,326 Journals from the CrossRef Title List.
Mar 2 2015 3:14PM (UTC) I discovered an error in my batch import script that caused an excess of identifiers to be imported (I wasn't trapping for logical duplicates). The current totals stand at 1,298,749 identifiers linked to 448,502 unique objects, yielding an identifier-to-object ratio of about 2.9.
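As a rough illustration of what trapping for logical duplicates could look like, the sketch below treats two rows as the same if they share an identifier domain, identifier value, and object. That definition of "logical duplicate" and the sample rows are assumptions for illustration; the actual rule used by the BioGUID import script is not described here. The last line checks the ratio quoted above.

```python
def drop_logical_duplicates(rows):
    """Keep the first occurrence of each (domain, value, object_id) row.

    Assumes a row is a logical duplicate when all three fields match
    after trimming whitespace from the identifier value.
    """
    seen = set()
    unique = []
    for domain, value, object_id in rows:
        key = (domain, value.strip(), object_id)
        if key not in seen:
            seen.add(key)
            unique.append((domain, value.strip(), object_id))
    return unique


# Hypothetical sample rows: the second is a duplicate of the first.
rows = [
    ("ISSN", "1234-5678", 1),
    ("ISSN", "1234-5678 ", 1),
    ("DOI", "10.1234/example", 1),
]
print(len(drop_logical_duplicates(rows)))  # prints 2

# Identifier-to-object ratio from the totals in the entry above.
print(round(1_298_749 / 448_502, 1))  # prints 2.9
```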
Mar 2 2015 1:03PM (UTC) It's submitted! With literally one minute to spare, no less! (Phew!) I still need to update the About page and write descriptions of the APIs. And we still need to make the IdentifierDomain submission form work. But we're getting there!
Mar 2 2015 12:11PM (UTC) OK, the website is functional (still tweaking a bit...). Time to generate the video file describing the site. Powerpoint, don't fail me now!
Mar 2 2015 8:52AM (UTC) Only a few hours to go before the deadline, and we're still tweaking the site. This is going to be close...
Mar 2 2015 2:16AM (UTC) More than 1.5 million identifiers assigned to over 800K objects are now in the system, and so far the search process is wicked fast! Kudos to Full-Text Searching!
Feb 25 2015 12:18PM (UTC) Now testing the batch import routines. We're going to use the existing internal identifiers (UUIDs) and cross-linked external identifiers within the Global Names Usage Bank database to seed the BioGUID system and test performance on the data services.
Feb 9 2015 12:18PM (UTC) Most of the basic stored procedures and functions are written and have been tested. Time to start building the web interface and web services!
Jan 27 2015 4:28AM (UTC) The data model behind the BioGUID indexing service is nearly complete. The next step is to write the stored procedures to create, edit, and search on the key data objects.
Dec 2 2014 8:06AM (UTC) The GBIF Ebbe Nielsen Challenge has just been announced! This is the final impetus we need to develop and implement our concept of a biodiversity identifier indexing and cross-linking service. Watch this space!
All content within the BioGUID.org site is available under the Creative Commons Zero license (Public Domain).