BagIt File Packaging Format
http://www.cdlib.org/inside/diglib/bagit/bagitspec.html
BagIt is a hierarchical file packaging format designed to support disk-based or network-based storage and transfer of generalized digital content. A bag consists of a "payload" and "tags". The content of the payload is the custodial focus of the bag and is treated as semantically opaque. The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method [ENCDEP] (Tabata, K., “A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method,” 2005.), sometimes referred to as "bag it and tag it".
Implementors of BagIt tools should consider interoperability between different platforms, operating systems, toolsets, and languages. In particular, differences in path separators, newline characters, reserved file names, and maximum path lengths are all possible barriers to successfully moving bags between different systems. Discussion of these issues may be found in the Interoperability section of this document.
Sample Bag File
mysecondbag/
|
| manifest-md5.txt
| (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt)
| (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt)
| (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
| (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
|
| fetch.txt
| (http://foo.example.com/gov-06-2006/gov-20060601-050019.arc.gz
| 26583985 data/gov-20060601-050019.arc.gz)
| (http://foo.example.com/gov-06-2006/gov-20060601-100002.arc.gz
| 99509720 data/gov-20060601-100002.arc.gz)
|
| bag-info.txt
| (Source-organization: California Digital Library)
| (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612)
| (Contact-name: A. E. Newman)
| (Contact-phone: +1 510-555-1234)
| (Contact-email: alfred@ucop.edu)
| (External-Description: The collection "Local Davis Flood Control
| Collection" includes captured California State and local
| websites containing information on flood control resources for
| the Davis and Sacramento area. Sites were captured by UC Davis
| curator Wrigley Spyder using the Web Archiving Service in
| February 2007 and October 2007.)
| (Bag-date: 2008.04.15)
| (External-identifier: ark:/13030/fk4jm2bcp)
| (Bag-size: about 22Gb)
| (Payload-Oxum: 21836794142.4)
| (Internal-sender-identifier: UCDL)
| (Internal-sender-description: UC Davis Libraries)
|
| bagit.txt
| (BagIt-version: 0.96)
| (Tag-File-Character-Encoding: UTF-8)
|
\--- data/
|
| Collection Overview.txt
| (... narrative description ...)
|
| Seed List.txt
| (... list of crawler starting point URLs ...)
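The manifest-md5.txt listing in the sample pairs each checksum with a payload path, which suggests a simple verification pass. Here is a minimal sketch in Python, assuming one "checksum path" entry per line separated by whitespace. The names md5_of and verify_manifest are hypothetical helpers for illustration, not part of any BagIt tool.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return a list of (path, problem) pairs; an empty list means all entries verified.

    Assumption: each manifest line is "<checksum> <path>", where the path is
    relative to the bag directory and may itself contain spaces.
    """
    bag = Path(bag_dir)
    problems = []
    for line in (bag / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split only on the first run of whitespace so paths keep their spaces.
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif md5_of(target) != expected.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

In the sample bag, the two .arc.gz files are listed in fetch.txt and are not yet present under data/, so a check like this would report them as missing until they have been fetched.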
Thoughts
The concept is good. A few things crossed my mind while reading the spec: Git, JSON, and PAR files. First, the payload is stored in a "data" subdirectory, which implicitly makes the data secondary. From a single-bag perspective, I can see that this makes sense. From a long-term digital preservation standpoint, however, it makes sense to change the payload as little as possible, and that includes where its files are placed. This brings Git into the picture.
Git
A git project normally consists of a working directory with a ".git" subdirectory at the top level. The .git directory contains, among other things, a compressed object database representing the complete history of the project, an "index" file which links that history to the current contents of the working tree, and named pointers into that history such as tags and branch heads. (http://www.kernel.org/pub/software/scm/git/docs/)
Sample git project
gitproject/
|
|
\--- data.mov
\--- abc.txt
\--- .git/
|
|
|-- HEAD # pointer to your current branch
|-- config # your configuration preferences
|-- description # description of your project
|-- hooks/ # pre/post action hooks
|-- index # index file (see next section)
|-- logs/ # a history of where your branches have been
|-- objects/ # your objects (commits, trees, blobs, tags)
`-- refs/
Git and other versioning tools treat the payload (data) as the primary content and hide their management files in a hidden folder (e.g. ".git"). This keeps the data as data and the metadata as metadata, and in effect subscribes to the principles of progressive enhancement.
This started me thinking about using Git as a packaging format. The following benefits are gained from using Git.
- A Version Control system for the content.
- If ever there are changes to the data, they can be tracked.
- Each Git package is self-contained
- Git uses SHA-1 to name and identify objects within its database (a built-in payload manifest, similar to the payload manifest in BagIt)
- All the benefits of Git
- Easy copying of Git projects ($ git clone my_project my_website)
- Git supports a richer set of repository sources, including network names, for naming the repository to be cloned. A clone is a copy of a repository. A clone contains all the objects from the original; as a result, each clone is an independent and autonomous repository and a true, symmetric peer of the original. (Version Control with Git)
- Clone over a local file system
- Clone over an SSH connection
- Clone over HTTP and HTTPS URLs
- Clone over the rsync protocol
- All these capabilities make for an easy backup and preservation strategy
Git manages change. Although change is rare from a preservation standpoint, it does happen, and when it does, Git is there to manage it, with a full history of every change.
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
Sample JSON file
{
  "Source-organization": "Computer History Museum",
  "Organization-address": "1401 N. Shoreline Blvd, Mountain View, CA 94087",
  "Contact-name": "Ton Luong",
  "Contact-phone": "650-810-1010",
  "Contact-email": "info@computerhistory.org",
  "checksum-md5": {
    "data.mov": "1c3fa74f1178f0eb493c04e5ee191fac",
    "abc.txt": "900150983cd24fb0d6963f7d28e17f72"
  },
  "checksum-sha1": {
    "data.mov": "a17c9aaa61e80a1bf71d0d850af4e5baa9800bbd",
    "abc.txt": "a9993e364706816aba3e25717850c26c9cd0d89d"
  }
}
Why JSON? JSON hits the sweet spot between plain text and XML: it gives data structure without the overhead of XML. Metadata stored as JSON can be read easily by both humans and machines.
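A metadata file shaped like the sample can be generated rather than written by hand. The sketch below is one possible approach in Python: it assumes a flat payload directory and reuses the "checksum-md5" and "checksum-sha1" keys from the sample. The function names build_metadata and metadata_to_json are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def build_metadata(payload_dir, info):
    """Combine descriptive fields with per-file MD5 and SHA-1 checksums."""
    metadata = dict(info)
    metadata["checksum-md5"] = {}
    metadata["checksum-sha1"] = {}
    for path in sorted(Path(payload_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories such as .git
        # Reading the whole file is fine for a sketch; large files
        # would be better hashed in chunks.
        data = path.read_bytes()
        metadata["checksum-md5"][path.name] = hashlib.md5(data).hexdigest()
        metadata["checksum-sha1"][path.name] = hashlib.sha1(data).hexdigest()
    return metadata


def metadata_to_json(metadata):
    """Serialize the metadata as indented, human-readable JSON text."""
    return json.dumps(metadata, indent=2)
```

If the resulting JSON file is saved inside the payload directory itself, it would be picked up on the next run, so in practice it would need to be excluded or stored elsewhere.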
This brings up the question of where to store metadata. Metadata can be stored in a database, stored in a file alongside the data, or embedded within the payload. The main problem with a database is that the metadata is not directly connected to the data. It is more robust if metadata stays with the data, and this is the route BagIt takes with its bag-info.txt file. Embedding is not a solution for many payloads because it involves modifying the actual data.
The use of JSON provides one huge benefit when coupled with a document database like CouchDB (http://couchdb.apache.org/). With CouchDB, we have a one-to-one correspondence between the metadata file and the database entry. In the long run, simple is better than complex: a one-to-one correspondence makes life easier and provides greater flexibility.
By default, a CouchDB data dump results in a JSON file, and adding content to CouchDB is as simple as executing an HTTP PUT with the content of a JSON metadata file.
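As a rough illustration of that PUT, the sketch below builds the request with Python's standard library. CouchDB stores a document with a PUT to /<database>/<document id> and a JSON body. The function name build_couchdb_put is hypothetical, and the base URL, database name, and document ID in the usage note are assumptions for illustration only.

```python
import json
import urllib.parse
import urllib.request


def build_couchdb_put(base_url, database, doc_id, metadata):
    """Build, without sending, a PUT request that stores metadata as one document."""
    url = "{}/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(database, safe=""),
        urllib.parse.quote(doc_id, safe=""),
    )
    body = json.dumps(metadata).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

For example, build_couchdb_put("http://localhost:5984", "bags", "gitproject", metadata) prepares a request for a local server, which urllib.request.urlopen could then send. Actually sending it requires a running CouchDB instance, an existing database, and possibly credentials, none of which are covered here.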
Parchive
http://parchive.sourceforge.net/ | http://en.wikipedia.org/wiki/Parchive
The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal. Our new goal with version 2.0 of the specification is to improve. It extends the idea of version 1.0 and takes the recovery process beyond the file-level barrier. This allows for more effective protection with less recovery data, and removes some previous limitations on the number of recoverable parts.
Parchive works by creating a set of parity files for the payload. Compared to a checksum (MD5, SHA-1, etc.), which provides only data verification, a payload with parity files has both data verification and data recovery capabilities.
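To show why parity data allows recovery and not just detection, here is a deliberately simplified sketch using byte-wise XOR parity, the simplest RAID-like scheme. This is not how Parchive itself works: PAR2 uses Reed-Solomon coding and can repair several damaged blocks, while this example can rebuild only one missing block. The function names are hypothetical.

```python
def xor_parity(blocks):
    """Return a parity block: the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the bytes of the missing block.
    return xor_parity(list(surviving_blocks) + [parity])
```

A checksum on the lost block could only confirm that it is gone or damaged. The parity block, stored alongside the payload, is what makes it possible to reconstruct the missing bytes.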