From January to May 2017, the Data Refuge project was underway and we helped volunteers at 50 data rescue events all over the country collect over 400 datasets from 33 federal agencies. This post is the story of those datasets – how they were collected and what is happening to them now.
The workflow for harvesting and preparing data for datarefuge.org was developed mostly by our own Laurie Allen, as well as Delphine Khanna and Rachel Appel from Temple University Libraries, and Justin Schell from University of Michigan Libraries. Many others at DataRescue Philly also contributed. Trust was very important to us in archiving this data. While our workflow may not be the best possible method, it has multiple validation points and a documented chain of custody.
The workflow ran as follows. URLs identified as having data that wouldn’t be picked up by the End of Term Harvest webcrawlers were added to an app developed by a member of the Environmental Data Governance Initiative, with which we worked closely in supporting data rescue events. In the Research phase, data rescuers would select URLs and document as much information as possible about what the data were. The URL would then move to the Harvest phase, when rescuers would go in and find ways to extract the data. Next, in the Check phase, people with some familiarity with scientific or government data would compare the harvested data against the original and sign off on what was there. Finally, datasets would be “Bagged” into BagIt files (packages that bundle the data with checksum manifests, stored here as zip files), added to the datarefuge.org repository and given some metadata.
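To make the “Bagged” step a little more concrete, here is a minimal sketch of what creating a BagIt package involves: the harvested files go into a `data/` folder, and a manifest records a checksum for each one so that later readers can confirm nothing changed. This is not the tool used at data rescue events. The function names `make_bag` and `sha256_of`, the choice of SHA-256, and the BagIt version string are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy source_dir into a new bag at bag_dir (hypothetical helper).

    bag_dir must not already contain a data/ folder.
    """
    bag = Path(bag_dir)
    payload = bag / "data"
    # BagIt keeps the actual files (the "payload") under data/.
    shutil.copytree(Path(source_dir), payload)

    # The bag declaration; version 0.97 is assumed here for illustration.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path relative to bag>".
    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            relative = path.relative_to(bag).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

The resulting folder can then be zipped for upload. A maintained BagIt library would handle more of the specification than this sketch does, so treat it as an explanation of the idea rather than a replacement for such a tool.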
Since this workflow involved putting everything into these zipped files, it’s hard to know what we actually have or how useful it is. To find out, and to make whatever is there more useful, we hired some student workers to unzip the data. Yixin Wu was one of our first student workers on this, and he unpacked about 60 EPA datasets. Francisco Saldana and Liu Liu are working this summer on NASA and NOAA datasets. None of this would get done without these great students.
The process for unpacking these data is deeply tedious. First, they check the original link to the record to verify metadata. After fixing any errors, they download the zip file and extract the documents. They inspect the files to make sure they are what they should be and are in usable formats. Then they upload the individual files back into the record.
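Part of the inspection step can be automated. The sketch below unzips a downloaded file, locates the bag inside it, and compares each file against the checksum in the bag’s manifest. It assumes the zip contains a BagIt package with at least one `manifest-<algorithm>.txt` file; the function names (`unpack_zip`, `find_bag_root`, `check_manifest`) are hypothetical, and this is not the procedure our student workers follow by hand.

```python
import hashlib
import zipfile
from pathlib import Path


def unpack_zip(zip_path: str, dest_dir: str) -> Path:
    """Extract the downloaded zip file into dest_dir."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def find_bag_root(extracted: Path) -> Path:
    """Return the first folder (in sorted order) that contains bagit.txt."""
    for candidate in sorted(extracted.rglob("bagit.txt")):
        return candidate.parent
    raise FileNotFoundError("no bagit.txt found under " + str(extracted))


def check_manifest(bag_root: Path) -> dict:
    """Map each manifest path to 'ok', 'missing', or 'changed'."""
    results = {}
    for manifest in sorted(bag_root.glob("manifest-*.txt")):
        # manifest-sha256.txt -> "sha256", manifest-md5.txt -> "md5"
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, relative = line.split(maxsplit=1)
            target = bag_root / relative
            if not target.is_file():
                status = "missing"
            else:
                # Reads the whole file into memory; fine for a sketch.
                actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
                status = "ok" if actual == expected.lower() else "changed"
            # Keep the first problem found if several manifests list a file.
            if results.get(relative, "ok") == "ok":
                results[relative] = status
    return results
```

A checksum match only shows that the files are the ones that were bagged. Whether they are the right files, and in a usable format, still takes a person opening them.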
Next, they gather some more contextual information to help users of the data. They look up the original URL in the Internet Archive and include a link to the archived version of the site where the data originated in the record. They also go into the app used by most data rescue events and gather the information about how the data was copied. That information might include methods used to capture the data, any problems the participant encountered, or other information that might help someone using the data from datarefuge.org understand the copied data. That information is entered into a PDF that is also added to the item record. Here’s an example of a completed, unpacked dataset from NOAA – take a look at those PNG files!
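The Internet Archive lookup can also be scripted. The sketch below uses the Wayback Machine’s availability endpoint to find the archived snapshot closest to a given date. Our student workers may well do this lookup by hand in the browser; the function names (`availability_url`, `closest_snapshot`, `lookup`) and the example URL in the usage note are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(original_url, timestamp=None):
    """Build the query URL; timestamp is digits like YYYYMMDD (optional)."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a parsed response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(original_url, timestamp=None):
    """Ask the Wayback Machine for the closest snapshot (needs network)."""
    query = availability_url(original_url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

For example, `lookup("example.gov/data", "20170120")` would return the address of the snapshot nearest to January 20, 2017, or `None` if the page was never captured. That address is what goes into the item record as the link to the archived site.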
So far, our student workers have unpacked about 100 datasets. They’ve uncovered just a few duplicate datasets, datasets that were too messy to be usable, or datasets that were no longer available at their original location. We do not have metrics on how many, if any, of these datasets are being used. We only know about data that goes missing on government websites if someone reports it to us or the media (EDGI is doing some great work on monitoring changes to websites though). What we’ve heard has gone missing has most often been moved to an archival site through the work of the National Archives and Records Administration or by some other means. Other missing data was retained by the Internet Archive. There are of course a few things we know have been taken down or are planned to be removed and are unfortunately not in our repository or rescued by any other group we know of.
A couple of universities have approached us about mirroring the rescued data at datarefuge.org. This is still in process and we’re working through the logistics, but these will be great partnerships to safeguard this work.
Additionally, the Data Refuge Stories project has really taken off and is doing great work – check out the revamped website for updates!