Biting Off More Than You Can Chew
The POSIX filesystem is poorly suited to data management tasks that are important to get right in scientific computing. Protecting against accidental data loss or corruption when handling multi-terabyte data sets requires a different approach.
Recently, a scientist in Denmark reported an interesting issue to the GNU "coreutils" mailing list. This researcher was trying to use /bin/cp to copy 39TB of data from one disk to another and ran up against some resource constraints that surprised him: cp's in-memory bookkeeping slowed the process to a crawl.
Experienced cluster administrators will recognize right away that cp was the wrong tool for this task to begin with. cp is a very robust tool — it's one of the oldest and most heavily used Unix or Linux tools — but it may have never before been exercised on a 39TB filesystem consisting of 400 million files. It's not entirely surprising that cp manifests such strange failure modes when run on such an extreme edge case. But it’s not just cp’s fault; the real problem runs deeper than that.
Guaranteeing correctness of data is obviously important in any discipline. It can be especially tricky for scientific computation, where data sets get analyzed over and over again in slightly different ways. It’s entirely too easy to smash a critical result without even noticing, by cp'ing or rm'ing the wrong file at the wrong time. The coreutils discussion illustrates just one way in which POSIX filesystems are fundamentally ill-equipped to ensure safe data reproduction at this scale: synchronizing thousands or millions of files means visiting each one individually, and often requires tracking metadata like hardlinks and permissions over the full duration of the operation. All of the traditional tools for addressing problems like these — cp, rsync, tar, and so on — suffer significant scaling problems in this situation.
So how do you copy millions of files spanning dozens of terabytes, quickly and safely, from one system to another?
In Arvados, we do it with content-addressed storage. This is a storage model that’s been used to great advantage by tools like git and Camlistore. In content-addressed storage, an object can be retrieved using a hash of its contents as a lookup key. When you can guarantee that each object on the system has exactly one, unique name, and the name is derived from the object’s content, suddenly you’re guaranteed of several other things:
- You haven’t duplicated any objects already on the system.
- The new object didn’t accidentally overwrite some other one.
- You haven’t accidentally stored an object under the wrong name (e.g. a typo).
- The object hasn’t been corrupted. If you address it by checksum at each stage of the computation, you can double-check at each point that the content still matches the checksum.
Git uses content-addressed storage for commits and files to guarantee that the same patch can't accidentally be applied twice to the same branch, and that it doesn’t accidentally overwrite a different patch that's already been applied. With Arvados, it ensures that a new 5TB data set hasn’t been uploaded over an important 20TB data set that was already there, and it provides a simple mechanism to copy data safely and reliably between systems.
An Arvados data collection is stored in Keep, our content-addressed storage system. Each collection is divided into 64MB data blocks. Each block is stored under its MD5 checksum. If a 40TB collection needs to be copied from one machine to another, it’s incredibly simple to be sure you’ve done it right: copy one block at a time, checking the MD5 sum of each block as it’s copied to ensure that data isn’t corrupted in the copying process. On top of that, the list of blocks that make up the collection (its “manifest”) is itself a string of data that gets hashed and verified in Keep, providing an additional level of protection against accidentally losing blocks in the copy. Since blocks can be copied asynchronously, and the checksums can be verified both during and after the copy is performed, clients gain a great deal of flexibility without losing reliability.
This is the right workflow for using enormous data sets and still getting reproducible results: a platform that guarantees your data won't be overwritten, won't be deleted by accident, and can be shared with collaborators via data federation. Preserving data provenance, eliminating accidental data loss, ensuring reproducibility: that's what next-generation scientific computing demands.