Happy New Year, everyone! We're poking our heads up through the snow here in New England to bring you our latest engineering updates!
One of the new features to come out of our most recent sprint is a revised Collection API (issue #4823) that more closely resembles the classic POSIX filesystem API. Keep, the Arvados content-addressed storage system, doesn't natively offer a POSIX interface for accessing data. A few sprints ago, in 7bf8f6c701, we released an interface that presents Arvados data streams as "file-like objects", but the overall collection API still wasn't very similar to the POSIX calls that Unix and Linux programmers are so familiar with.
Our new release brings us much closer to that goal, offering users a single API for both reading and writing collections, with familiar methods for addressing files like remove() and so on. We anticipate that new users will find it much easier to get into the flow of using Arvados with these patterns, and existing users will find it more convenient to port pipelines to Arvados.
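To give a feel for the pattern, here is a tiny in-memory mock with the kind of open()/remove() interface described above. This is an illustration only, not the Arvados SDK; the class and its internals are hypothetical stand-ins:

```python
# Toy in-memory stand-in for a writable collection of named files.
# Illustrates the POSIX-style open()/remove() pattern only; this is
# NOT the Arvados SDK.
import io
from contextlib import contextmanager


class Collection:
    def __init__(self):
        self._files = {}

    @contextmanager
    def open(self, path, mode="r"):
        # "w" creates or overwrites the named file; the default reads it.
        if "w" in mode:
            buf = io.StringIO()
            try:
                yield buf
            finally:
                self._files[path] = buf.getvalue()
        else:
            yield io.StringIO(self._files[path])

    def remove(self, path):
        del self._files[path]


c = Collection()
with c.open("sample.txt", "w") as f:
    f.write("GATTACA\n")
with c.open("sample.txt") as f:
    for line in f:
        print(line.strip())  # prints GATTACA
c.remove("sample.txt")
```

The point is the shape of the calls: the same object supports reading, writing and deleting with idioms any Python programmer already knows.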
Crunch failure reporting
Another useful new tool is a Crunch job failure report (issue #4598). At present, while Crunch reports every job failure and success into Arvados, identifying the underlying causes of failed jobs can be a little tedious. Moreover, a breakdown report that gives visibility into why jobs have failed can be a huge boon to debugging.
With this tool, an administrator can now set up a nightly report breaking down job failures over the last 24 hours -- or get a report over any time period at all. We've already used this tool in development to help quickly debug knotty job failures, and expect that Arvados administrators everywhere will find it a great help.
New User Interfaces
We're also experimenting with newer ways of writing more responsive interfaces. Our primary interface to Arvados, Workbench, is written as a Rails application that uses the Arvados API instead of a local database. This makes it easy for Rails developers to work on, but also means that every request to Arvados goes through two software layers, which doubles the latency for every action a user takes.
Thanks to everyone for your ongoing input. Please give us a shout if you have any questions or ideas for us!
As we start to wind down the year, we have some of our most exciting features yet to offer. Our Thanksgiving sprint was a very productive one: we found and fixed 29 bugs and implemented 7 new features.
One of our most exciting new features is a browser-based collection upload tool. If you have data sets already on your workstation to upload into Arvados, you can do so right from your browser:
Being able to upload collections directly from your desktop makes it a snap to get started using Arvados. Try it out yourself!
In addition to nifty web tools for making Arvados easier, we've been working hard on other projects.
- Curoverse has been an active participant in the Common Workflow Language working group. Our senior engineer Peter Amstutz contributed substantially to drafting the reference implementation. A tool for expressing bioinformatics workflows in a consistent, portable way across different systems is an important link in promoting collaboration between researchers on different projects. We see the Common Workflow Language as a critical component of modern bioinformatics platforms.
- Pipeline authors can now specify a particular SDK version in their pipeline computations, offering better control over reproducibility.
On top of that, we've smoothed out a bunch of niggling little bugs: improved SSH key upload, more consistent handling of file selections in collections, fixes to pipeline rendering and Firefox SSL certificate issues, and much, much more!
We've just wrapped up another engineering sprint at Curoverse -- 35 bugs fixed and over a dozen major new features -- and are pretty excited to show you some of the new things you can do with Arvados.
Some of our most exciting new user interface features include:
- Real-time CPU and I/O graphs for running jobs. Now, while you're watching the status of a running job, you can also see a graph of the job's CPU and I/O activity update in real time.
- The new arv-run command provides a convenient shell-like syntax for composing and launching Arvados pipelines. Creating a new pipeline to run the same command in parallel on hundreds or thousands of input files is as simple as:
arv-run grep -H -n ATTGGAGGAAAGATGAGTGAC -- *.fastq
- A file-like I/O interface for collections in the Python SDK. Within a pipeline, opening and reading files in a collection now uses a pattern that will feel very natural and familiar to Python programmers:
c = arvados.CollectionReader(collection_id)
with c.open(input_path) as infile:
    for line in infile:
        ...
- Infinite scroll for the pipeline view page
- As-you-type filename filtering in the collection view
- Improved formatting for the provenance graph.
Under the hood, we've added loads of internal improvements to make the Arvados site administrator's job easier:
- Rendezvous hashing for Keep ensures that Keep clients and proxies store blocks evenly across Keep servers, and permits adding new Keep servers to an existing cluster without substantially degrading performance.
- Better timeout handling in the Python SDK improves latency for Keep requests.
- Consistent log formatting for the Keep server logs for easier automated analysis.
- Many improvements to our new Node Manager:
- Google Compute Engine support to give you more options for running compute nodes in the cloud.
- Administrators can specify a minimum number of compute nodes to keep alive at all times
- New nodes can be brought up automatically when all existing ones are busy.
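To illustrate the rendezvous hashing mentioned above: each client independently scores every server against a block's hash and picks the highest scorers, so no central lookup table is needed, and adding a server only claims the blocks it newly wins. A minimal sketch, with illustrative server names and MD5-based scoring rather than Keep's exact scheme:

```python
# Minimal sketch of rendezvous (highest-random-weight) hashing.
# Every server gets a score derived from hashing its ID together with
# the block's hash; the block goes to the top-scoring servers. Because
# scores are computed per server, adding a new server only moves the
# blocks that now score highest on it, so most data stays put.
# Server names and MD5 scoring here are illustrative, not Keep's code.
import hashlib


def rendezvous_order(servers, block_hash):
    """Return servers sorted by descending score for this block."""
    def score(server):
        return hashlib.md5((server + block_hash).encode()).hexdigest()
    return sorted(servers, key=score, reverse=True)


servers = ["keep0", "keep1", "keep2", "keep3"]
block = hashlib.md5(b"example data").hexdigest()
# The first N entries are the servers that should hold N replicas.
print(rendezvous_order(servers, block)[:2])
```

A useful property falls out directly: growing the server list never reshuffles the relative ranking of the existing servers, which is why expansion doesn't substantially degrade performance.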
We've already launched our next engineering sprint and are fast adding new features to Arvados, but we're looking forward to your feedback! As always, please feel free to check out Arvados from GitHub if you want to try it out, or get in touch with us by email or on IRC!
It has been a while since we've blogged about our development progress. Since the previous Arvados Update, we've completed 5 development sprints. We resolved 678 issues and tasks: 73 feature stories, 148 bugs and 12 support issues. These issues and stories were further broken down into 445 tasks.
Here are some of the new feature highlights:
- New cli commands:
- arv edit can be used to edit Arvados objects from the command line. Arv edit opens up the editor of your choice (set the EDITOR environment variable) with the JSON description of the object. Saving the JSON will update the Arvados object on the API server.
- arv create can be used to create Arvados objects from the command line. Arv create opens up the editor of your choice (set the EDITOR environment variable) and allows you to type or paste a JSON description. When saved, the object will be created on the API server if it passes validation.
- arv copy can be used to copy a pipeline instance, template or collection from one Arvados instance to another. It takes care of copying the object and all its dependencies.
- Arvados Node Manager: this component manages Arvados compute nodes in a cloud computing environment, automatically spinning up new nodes and shutting down excess ones as needed. It supports Amazon Web Services, and Google Cloud support is in the works.
- Crunch: we added run-command, a generic "command line wrapper" crunch script.
- Workbench: workbench remains a significant focus area. Here are a couple of the more important new workbench features:
- switched to a project-based interface
- added a 'Home' project
- upgraded the dashboard to show the current state of the cluster - how many (idle) compute nodes are available, what jobs and pipelines are running, etc. The jobs and pipeline information shown to the user is subject to the permission model.
- added a search feature
- added a 'manage account' page where users can view the virtual machines and repositories they have access to, as well as manage tokens and ssh keys.
- overhauled the 'show pipeline instance' page
- added a persistent top nav bar
- added a diagnostic suite feature, which allows for automated running of diagnostic pipelines
- added a project sharing feature
- Installation: we added a 'binary' installation method for local evaluation and testing. This installation method downloads prebuilt docker images from the Docker Registry and spins them up locally.
- Testing: we have added a large number of tests all over the codebase, which are run automatically by our Jenkins server as part of our build pipeline.
In addition to those new features, the last three sprints have been centered around improving the user experience. We continue to put a lot of effort into finding and fixing bugs, and we are actively focusing on making Arvados easier to use.
If you want to try out Arvados, head over to curoverse.com and hit the "Log In" button. After you log in with a Google account, you can try out a cloud-hosted copy of Arvados - we're in open beta, so it's free!
Alternatively, have a look at the installation instructions to install Arvados yourself.
Let us know how it goes on the Arvados mailing list or the IRC channel!
Recently, a scientist in Denmark reported an interesting issue to the GNU "coreutils" mailing list. This researcher was trying to use /bin/cp to copy 39TB of data from one disk to another and ran up against some resource constraints that surprised him: cp's in-memory bookkeeping slowed the process to a crawl.
Experienced cluster administrators will recognize right away that cp was the wrong tool for this task to begin with. cp is a very robust tool — it's one of the oldest and most heavily used Unix or Linux tools — but it may have never before been exercised on a 39TB filesystem consisting of 400 million files. It's not entirely surprising that cp manifests such strange failure modes when run on such an extreme edge case. But it’s not just cp’s fault; the real problem runs deeper than that.
Guaranteeing correctness of data is obviously important in any discipline. It can be especially tricky for scientific computation, where data sets get analyzed over and over again in slightly different ways. It’s entirely too easy to smash a critical result without even noticing, by cp'ing or rm'ing the wrong file at the wrong time. The coreutils discussion illustrates just one way in which POSIX filesystems are fundamentally ill-equipped to ensure safe data reproduction at this scale: synchronizing thousands or millions of files means visiting each one individually, and often requires tracking metadata like hardlinks and permissions over the full duration of the operation. All of the traditional tools for addressing problems like these — cp, rsync, tar, and so on — suffer significant scaling problems in this situation.
So how do you copy millions of files spanning dozens of terabytes, quickly and safely, from one system to another?
In Arvados, we do it with content-addressed storage. This is a storage model that’s been used to great advantage by tools like git and Camlistore. In content-addressed storage, an object can be retrieved using a hash of its contents as a lookup key. When you can guarantee that each object on the system has exactly one unique name, and the name is derived from the object’s content, suddenly you’re guaranteed several other things:
- You haven’t duplicated any objects already on the system.
- The new object didn’t accidentally overwrite some other one.
- You haven’t accidentally stored an object under the wrong name (e.g. a typo).
- The object hasn’t been corrupted. If you address it by checksum at each stage of the computation, you can double-check at each point that the content still matches the checksum.
Git uses content-addressed storage for commits and files to guarantee that the same patch can't accidentally be applied twice to the same branch, and that it doesn’t accidentally overwrite a different patch that's already been applied. With Arvados, it ensures that a new 5TB data set hasn’t been uploaded over an important 20TB data set that was already there, and it provides a simple mechanism to copy data safely and reliably between systems.
An Arvados data collection is stored in Keep, our content-addressed storage system. Each collection is divided into 64MB data blocks. Each block is stored under its MD5 checksum. If a 40TB collection needs to be copied from one machine to another, it’s incredibly simple to be sure you’ve done it right: copy one block at a time, checking the MD5 sum of each block as it’s copied to ensure that data isn’t corrupted in the copying process. On top of that, the list of blocks that make up the collection (its “manifest”) is itself a string of data that gets hashed and verified in Keep, providing an additional level of protection against accidentally losing blocks in the copy. Since blocks can be copied asynchronously, and the checksums can be verified both during and after the copy is performed, clients gain a great deal of flexibility without losing reliability.
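The copy-and-verify scheme can be sketched end to end. The helper names below are illustrative; the 64 MiB block size and MD5 hashing match the description above:

```python
# Sketch of the verified-copy scheme: split a collection into blocks,
# name each block by its MD5, and record the ordered block list in a
# manifest that is itself hashed. A copy is correct iff every block
# matches its hash and the manifest hash matches. Helper names are
# illustrative; the 64 MiB block size matches Keep's.
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB, as in Keep


def split_into_blocks(data: bytes, block_size=BLOCK_SIZE):
    """Return {md5: block} plus the ordered manifest of block hashes."""
    blocks = {}
    manifest = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        h = hashlib.md5(block).hexdigest()
        blocks[h] = block
        manifest.append(h)
    return blocks, manifest


def verified_copy(blocks, manifest, dest):
    """Copy blocks into dest, checking each one against its hash."""
    for h in manifest:
        block = blocks[h]
        if hashlib.md5(block).hexdigest() != h:
            raise IOError("block %s corrupted in transit" % h)
        dest[h] = block
    # The manifest itself is content-addressed too, so a manifest hash
    # mismatch reveals a missing or reordered block.
    return hashlib.md5("".join(manifest).encode()).hexdigest()
```

Because each block copy is independently verifiable, the copies can run in any order, or in parallel, and be re-checked at any later time.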
This is the right workflow for using enormous data sets and still getting reproducible results: a platform that guarantees your data won't be overwritten, won't be deleted by accident, and can be shared with collaborators via data federation. Preserving data provenance, eliminating accidental data loss, ensuring reproducibility: that's what next-generation scientific computing demands.
BarCamp Boston was held October 11 and 12 at the Microsoft NERD Center in Cambridge! About 300 nerds and geeks from all over New England came on a rainy October weekend to share ideas, tips and tricks. It was a great and exciting weekend. Curoverse co-founder Jonathan Sheffi and I attended to check out the talks and share our work with other local hackers.
An unconference has a great kind of improvised, manic energy that makes it an excellent place to discuss new, interesting and nutty off-the-wall ideas. Which, of course, is ideal for us. :-)
On Saturday, I presented the work we've been doing on Keep. In this session, I explained why content-addressed storage helps solve critical problems of data provenance in scientific computing, presented an overview of Keep's architecture, and reviewed our team's experience porting a large Perl application to Go. The Keep presentation is published on Slideshare: Keep: Open Source Content-Addressed Storage (CC-BY-SA)
Conferences (both the formal and informal kinds) give us a very important opportunity: the chance to have our work criticized by others. That might not seem obvious at first, but we think it's a crucial part of the software engineering process for free software. Every software system needs to be able to stand up to criticism -- if it can't, that's likely to indicate a flaw that needs to be addressed. Conversely, if the software doesn't get thoroughly criticized, any significant flaws in the design are likely to go overlooked. This principle is often expressed as "given enough eyeballs, all bugs are shallow." It's one of the fundamental advantages that free software has over proprietary software.
But it's important to be mindful that open source isn't a panacea for debugging. Just because the source code is available doesn't mean that it's actively being audited. Engineers and project leaders have to be proactive in seeking out reviews that will help uncover hidden flaws in their systems.
That's why presenting our work in public is important to us. It's not just about telling you how awesome it is. (Although we really think it is!) It's also about finding the flaws that we haven't been able to find ourselves. It's about tearing the work to shreds -- because that's also how we make it even better.
In the coming months, we're looking forward to bringing our work to more conferences for you to review. We'd love to hear your thoughts on Arvados and scientific computing -- both what's great and what's not. And we particularly want to hear about the latter!
The Bioinformatics Open Source Conference (BOSC) came to Boston this year, which was really fortunate for Arvados. Everyone at the event believes that open source software can improve the reliability and reproducibility of biology research, just like us. Since Boston is home base for the Curoverse team, it was easy for us to decide to attend.
At the conference, I presented a brief architectural overview of Arvados. The talk describes how the software keeps track of data and analysis to help biomedical researchers better collaborate and iterate to discoveries faster. Several components work in concert to help this happen, so if you've been looking for a big picture overview of the system, this talk will be right up your alley.
As fun as the talk was, the best part of any conference is the other people. It was exciting to see the breadth of other open source work in bioinformatics, and compare notes on different problems and solutions. And with my background in software development, it helped me to hear more about the obstacles some scientists face when they use free software in their research. My thanks to the BOSC organizers—it's clear they put a lot of effort into hosting a productive conference.
Dear Arvados users:
As we approach a 1.0 release for Arvados, we've been improving key features for data provenance.
Building on our success integrating Docker into Arvados, we've implemented some incredibly valuable provenance features around Docker. A new command-line tool uploads Docker images to Keep, and Arvados records the full system image used to run a job. This makes it possible to reproduce the entire operating environment used to produce a particular result -- not just your code for that job, but all of the system libraries and tools that were installed along with it -- and helps improve reproducibility for complicated pipelines.
On that note, we've also delivered a longstanding goal: informaticians can now specify minimum resource requirements for the jobs they run. When running a computationally intensive job, or one that requires a lot of scratch disk space, it's very frustrating to launch a pipeline only to watch jobs run slowly or fail because of CPU or disk limitations on compute nodes. It's now possible to specify runtime job constraints, including minimum amounts of disk space, RAM or CPU cores per compute node, and Arvados will ensure that your jobs run only on nodes that are sufficiently powerful to accommodate them.
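Concretely, a job's constraints can be expressed as a small JSON object in its pipeline component definition. The key names below follow the Arvados runtime_constraints convention as we understand it, and the values are purely illustrative:

```json
{
  "runtime_constraints": {
    "min_nodes": 1,
    "min_cores_per_node": 8,
    "min_ram_mb_per_node": 16384,
    "min_scratch_mb_per_node": 512000
  }
}
```

The scheduler then skips any compute node that cannot satisfy every listed minimum.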
Here at Curoverse we're very enthusiastic about the Go programming language. We've rewritten the Keep file server in Go and added a Keep proxy server. Now we've added a Go SDK to the Arvados toolkit, so you can write Arvados pipelines in Go as well! We hope you'll give it a try and see how much fun writing Go is!

In addition to those features, we have lots of other goodies for you, including:
- Significantly improved Workbench display performance
- Better interface for picking collections and pipeline templates
- Real-time log display via CLI
Dear Arvadans near and far:
Our May development sprint yielded lots of new features for you! This month, we focused on enhancing the Arvados pipeline factory. Pipelines are at the heart of the Arvados workflow, and we know how important it is for users to have an intuitive, natural process for managing both pipelines and the data collections that they're run on. To that end, we're really excited with the new pipeline and collection tools we have to offer you:
Names and Projects. You can now give names to pipelines, collections, templates and specimens, and group them into projects. This way, you can choose your own organizational scheme for finding and managing data, and can easily keep track of large amounts of data, even hundreds of data sets and dozens of projects.
Deleting pipeline instances. It's important to have a way of managing pipeline instances that are no longer needed. Now, once you're done with a pipeline instance, you can delete it -- without harming the provenance graph for data generated by that instance. You still have full data provenance for all of your results.
Streamlined data upload. Uploading data to Keep has always been a little cumbersome, requiring that it be uploaded to a staging server before copying it into Keep. For uploading gigabytes and terabytes of data, we knew we would need something better, and now we have it: a Keep upload proxy which receives uploads directly from your workstation or local machine, and stores them on a Keep server (with multiple replicas, if desired).
One of the other complications with large-scale data upload is how to handle interrupted uploads. To that end, the arv-put command now keeps track of the state of an upload in progress, and if restarted after an interruption, it will resume from where it left off. You no longer need to worry about having to start an 18-hour upload from scratch!

Improved collections view. We have added lots of information to the collections view page, including:
- Metadata and provenance
- Links from a job result to the job that produced it
- Permission and sharing information: who's allowed to read a collection
- Files in a collection displayed in a tree
Real-time log updates. You can now watch the log output from a pipeline while the pipeline is still executing.
We're charging ahead with the next milestone and are incredibly excited about bringing you Arvados 1.0 later this year. In the meantime, please always feel free to leave us your feedback via e-mail or IRC!
Greetings from Curoverse HQ! We bring you tidings from the trenches of Arvados development, where we've just finished a really productive engineering sprint and are excited to see the product that's emerging.
For the last few months, each of our engineers has been working independently on different Arvados components. It's been an extremely productive period, but until now we haven't combined the different tools we've worked so hard to build. Now that we're putting those pieces together, it's incredibly satisfying and exciting to watch a unified, seamless application emerge.
Among the features that have come out of our May 7 engineering sprint, "Storing and Organizing Data":
We knuckled down and revamped the whole Workbench interface from end to end, resulting in one beautiful user experience with a sleek, clean, intuitive UI for managing your data and pipelines.
We know some of you have been really eager for a Java SDK, so we're delighted to announce the Java SDK for Arvados. Informaticians who work with Java can immediately write Arvados pipelines in that language.
A new event manager allows Arvados components to signal events to each other quickly and securely, reducing latency between back-end systems. This event bus allows us to make the system more responsive to critical system events like low disk space or compute node failures.
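As a toy illustration of the pattern, here is an in-process publish/subscribe bus. The real event manager connects separate Arvados components, and every name in this sketch is hypothetical:

```python
# Toy in-process publish/subscribe bus illustrating the event-manager
# pattern: components register interest in an event type and are
# notified as soon as it fires, instead of polling a database. The real
# Arvados event manager spans processes; all names here are illustrative.
from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        for callback in self._subscribers[event_type]:
            callback(payload)


bus = EventBus()
alerts = []
bus.subscribe("node_failed", alerts.append)
bus.publish("node_failed", {"node": "compute9", "reason": "disk full"})
```

Replacing the polling loop with push notifications like this is what cuts the latency between back-end systems.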
It's now possible to build and run a Crunch job in a Docker container, drastically improving pipeline provisioning and reproducibility. The benefits of deploying a job in a Docker container include:
- simpler provisioning: Build your Docker image once, then deploy it consistently across all your compute nodes. Make as many as you need to suit different analyses.
- reproducibility: Users now have a way to control the whole environment their analysis runs in, all the way down through the system’s C library.
We've also implemented a Data Manager tool to help site administrators keep tabs on the health of a local Arvados installation. The Data Manager will allow administrators to identify "garbage" data (blocks and collections that are no longer in use by any pipeline), control cache utilization and monitor user usage.
As always, if you're an Arvados user or curious about what we're doing, we'd love to hear from you. The engineering team coordinates on IRC, so if you're an IRC user, pop into #arvados on the OFTC network and say hi!