A picture is worth a thousand words

Added by Tim Pierce over 6 years ago

We've just wrapped up another engineering sprint at Curoverse -- 35 bugs fixed and over a dozen major new features -- and are pretty excited to show you some of the new things you can do with Arvados.

Some of our most exciting new user interface features include:

  • Real-time CPU and I/O graphs for running jobs. Now, while you're watching the status of a running job, you can also see a graph of the job's CPU and I/O activity update in real time.
  • The new arv-run command provides a convenient shell-like syntax for composing and launching Arvados pipelines. Creating a new pipeline to run the same command in parallel on hundreds or thousands of input files is as simple as:
    arv-run grep -H -n ATTGGAGGAAAGATGAGTGAC -- *.fastq

  • A file-like I/O interface for collections in the Python SDK. Within a pipeline, opening and reading files in a collection now uses a pattern that will feel very natural and familiar to Python programmers:
    c = arvados.CollectionReader(collection_id)
    with c.open('input.txt') as infile:
      for line in infile:
        process(line)
  • Infinite scroll for the pipeline view page
  • As-you-type filename filtering in the collection view
  • Improved formatting for the provenance graph.
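
The arv-run example above fans the shell wildcard's matches out into one parallel task per input file. A rough sketch of that expansion (illustrative only, not the actual arv-run scheduler):

```python
def expand_tasks(command, input_files):
    """Build one task per input file: the command with that file appended.

    This mirrors how a shell-expanded `arv-run grep ... -- *.fastq`
    becomes many parallel grep invocations (sketch, not the real code).
    """
    return [command + [path] for path in input_files]

tasks = expand_tasks(["grep", "-H", "-n", "ATTGGAGGAAAGATGAGTGAC"],
                     ["a.fastq", "b.fastq"])
```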

Under the hood, we've added loads of internal improvements to make the Arvados site administrator's job easier:

  • Rendezvous hashing for Keep ensures that Keep clients and proxies store blocks evenly across Keep servers, and permits adding new Keep servers to an existing cluster without substantially degrading performance.
  • Better timeout handling in the Python SDK improves latency for Keep requests.
  • Consistent log formatting for the Keep server logs for easier automated analysis.
  • Many improvements to our new Node Manager:
    • Google Compute Engine support to give you more options for running compute nodes in the cloud.
    • Administrators can specify a minimum number of compute nodes to keep alive at all times
    • New nodes can be brought up automatically when all existing ones are busy.
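
The rendezvous hashing mentioned above can be sketched in a few lines. This is an illustrative highest-random-weight implementation, not the actual Keep code; the server names and MD5 weighting are assumptions for the example:

```python
import hashlib

def rendezvous_order(block_hash, servers):
    """Rank servers for a block by highest-random-weight (rendezvous) hashing.

    Every client computes the same ranking independently, so blocks spread
    evenly across servers, and adding a new server remaps only the blocks
    for which the new server wins -- roughly 1/N of them.
    """
    def weight(server):
        return hashlib.md5((block_hash + server).encode()).hexdigest()
    return sorted(servers, key=weight, reverse=True)
```

Because the weights of existing servers never change, adding "keep3" to the cluster leaves the relative order of the old servers intact.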

We've already launched our next engineering sprint and are fast adding new features to Arvados, but we're looking forward to your feedback! As always, please feel free to check out Arvados from Github if you want to try it out, or get in touch with us in email or on IRC!

February development review: Sharing is caring

Added by Brett Smith about 6 years ago

Once you've found an interesting result in your analysis, you don't want to keep it to yourself. You want to share it with the world! The Arvados development team just wrapped up a sprint making it easier than ever to share those results, along with the pipelines and data that generated them.

Public Workbench sharing


The Arvados Workbench already makes it simple to share your work with other Arvados users. We've extended that to let you share with people who don't have logins on your cluster. Once a project is shared publicly, anyone can visit its page to view everything in it: collections of input and output data, Arvados pipeline templates, and even the specific pipelines that have been run. You can make all of that available in just a few clicks.

It's worth noting that all of these features are opt-in. Projects are not shared by default, and site administrators have to set a specific configuration value to enable public sharing in Workbench.

Dockerized Web services using arv-web

Of course, in many cases you want more than raw data. You'd like to be able to view it through dedicated software to help find patterns or unique results. arv-web is a new tool to configure those services with Arvados data, complete with automatic updates for new results.

Here's how it works: you have a Web service that provides a nice interface to your data. You build a Docker image to run that application and read data from a dedicated directory. Now start arv-web and point it at your Arvados project. arv-web will run your Docker container, filling the data directory with files from the latest collection in the project. As new collections are published (for example, because pipelines finish), arv-web automatically updates the Docker container's data directory, and can even run a reload script inside it. You get a nice view to your data that stays in sync with your latest results.
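
The arv-web cycle described above boils down to a simple watch loop. In this sketch the Arvados and Docker plumbing is injected as plain callables; `list_collections`, `update_data_dir`, and `run_reload` are stand-ins for the real API, not actual function names:

```python
def watch_project(list_collections, update_data_dir, run_reload, iterations):
    """Refresh the container's data directory whenever the project's
    newest collection changes (simplified model of the arv-web loop)."""
    latest = None
    for _ in range(iterations):
        collections = list_collections()  # newest collection last
        if collections and collections[-1] != latest:
            latest = collections[-1]
            update_data_dir(latest)   # repopulate the Docker data directory
            run_reload()              # optional reload script in the container
```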

We're really looking forward to seeing what you all build with these new features. You can try them out on our public beta, and feel free to get in touch by IRC or e-mail to learn more.

A quick fix: Migrating Google login from OpenID to OAuth2

Added by Tom Clegg about 6 years ago

Perhaps you have a web application that uses OpenID to authenticate Google users. That will stop working on April 20, 2015, so you need to switch to OAuth2! Or perhaps you are making a new web application and you just want a Google login button working as quickly as possible.

(If you're not planning to make any Google login buttons, this post won't be very interesting.)

Google provides many pages of helpful documentation about implementing OAuth2 and migrating from OpenID. As a supplement to that -- in case you don’t want to wade through all the exciting possibilities of Google’s many and varied APIs, and you just want to stop worrying about April 20 -- here is the surprisingly short story about how to implement OAuth2 from (nearly) scratch. It should be easier to migrate to OAuth2 than it was to implement OpenID in the first place.

The objectives are simple here:
  1. Implement Google login with OAuth2, so your users can still log in after April 20.
  2. Recognize your existing OpenID users when they log in with OAuth2. You don’t want them to show up in a brand new empty account.
This post focuses on the specifics of what needs to work, rather than how to get any particular library or deployment strategy to work. For the sake of clarity, I’ll assume:
  • You use a language/libraries that can do what PHP can do, like make HTTPS requests and decode JSON.
  • You can decode JSON web tokens (JWT). A link to a PHP library is provided below. (For other languages: find one yourself.)
  • Your web application is installed in only one place, and your source tree is private. (This probably isn’t true, but I’m sure you can figure out the relevant config and deployment stuff yourself so I won’t discuss it here.)
There are four things you need to do.
  • Establish credentials for communication between your web application and Google.
  • Make a new login button for unauthenticated users to click.
  • Make a new callback handler. This sets up a cookie/session when Google assures you that user 12345 has clicked the login button.
  • Make account migration code in (or after) the callback handler, so you can recognize that user 12345 logging in with Google+ today is the same person as user https://google/accounts/o8/id?id=BlUrFl… who logged in with Google OpenID yesterday.

Step 0. Get ready.

Before getting to work, you need to make two easy decisions.
  • Your application’s realm. It’s the same realm you used with OpenID. Usually, this is the root URI of your application.
  • Your OAuth2 callback URI. This is where you find out that a user is trying to log in to your app using Google. For our example we’ll use

Step 1. Tell Google Developer Console how to recognize your web application.

  • Visit
  • Click “Create Project”. Give it a name and ID.
  • In the new project, click “APIs” in the “APIs & auth” section in the left nav.
  • Find “Google+ API” in the big list and enable it.
  • Click “Credentials” in the “APIs & auth” section in the left nav.
  • Under OAuth, click the “Create a new Client ID” button. In the dialog:
    • Application type: Web application
    • Authorized Javascript origins:
    • Authorized redirect URIs:
    • Yes, you can use stuff like “localhost:1234” in both of these fields, for testing.
    • Click “Create Client ID”
  • You should have a “Client ID for web application” table on the right.
  • Copy the client ID and client secret. You’ll need those soon.

Step 2. Make a new login button.

There are complicated fancy ways to do this with Google JS libraries, but you’re going to do it the easy way instead:

<form action="" method="get">
<input type="hidden" name="response_type" value="code" />
<input type="hidden" name="client_id" value="client_id_from_dev_console_goes_here" />
<input type="hidden" name="redirect_uri" value="" />
<input type="hidden" name="state" value="see_note_about_state" />
<input type="hidden" name="scope" value="email openid profile" />
<input type="hidden" name="access_type" value="online" />
<input type="hidden" name="approval_prompt" value="auto" />
<input type="hidden" name="openid.realm" value="" />
<input type="submit" value="Log in using Google" />
</form>

Use your own values in the client_id, redirect_uri, and openid.realm inputs, of course.

Note about “state”: This just gets passed along to your google-oauth2.php. Maybe you want to put a timestamp here, so you can say “yay it took you 2.3 seconds to log in”. You can also leave it blank or delete the input entirely.

Step 3. Make a callback handler (google-oauth2.php).

This requires a JWT decoder. Here we use JWT.php from the BSD-licensed php-jwt project.

$oauth2_code = $_GET['code'];
$discovery = json_decode(file_get_contents(''));
$ctx = stream_context_create(array(
    'http' => array(
        'header'  => "Content-type: application/x-www-form-urlencoded\r\n",
        'method'  => 'POST',
        'content' => http_build_query(array(
            'client_id' => 'client_id_from_dev_console_goes_here',         // <-- edit this
            'client_secret' => 'client_secret_from_dev_console_goes_here', // <-- edit this
            'code' => $oauth2_code,
            'grant_type' => 'authorization_code',
            'redirect_uri' => '',                                          // <-- edit this
            'openid.realm' => '',                                          // <-- edit this
        )),
    ),
));
$resp = file_get_contents($discovery->token_endpoint, false, $ctx);
if (!$resp) {
    // $http_response_header here got magically populated by file_get_contents(), surprise
    error_out('Error verifying token: ' . $http_response_header[0]);
}
$resp = json_decode($resp);
$access_token = $resp->access_token;
$id_token = $resp->id_token;

// Skip JWT verification: we got it directly from Google via https, nothing could go wrong.
$id_payload = JWT::decode($resp->id_token, null, false);
if (!$id_payload->sub) {
    error_out('No subject ID provided in ID token! See error log for details.');
}

// Hurray, authenticated.
// Edit the following section to suit your application.

$user_id = 'google+' . $id_payload->sub;
$user_email = $id_payload->email;

// Do whatever you do to keep people logged in. Maybe something like this.
$_SESSION['user_id'] = $user_id;

That’s it for authentication.
  • $user_id is ‘google+X’ where X is a bunch of digits uniquely identifying this user. Unlike OpenID, X stays the same when the same user logs in to different web applications. You can make up your own translation from X to a local identifier, of course.
  • $user_email is the user’s email. $id_payload has other stuff too, all fascinating I’m sure.
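
If you are curious what JWT::decode is doing with verification turned off: a JWT’s claims are just a base64url-encoded JSON segment. Here is a minimal sketch of that decoding step, written in Python for illustration (skipping verification is acceptable here only because the token arrived directly from Google over HTTPS):

```python
import base64
import json

def jwt_claims(token):
    """Decode a JWT's payload (its middle segment) WITHOUT verifying
    the signature. Only safe when the token source is already trusted."""
    payload_b64 = token.split('.')[1]
    payload_b64 += '=' * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```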

Step 4. Migrate user accounts.

In 2017, Google will stop giving you the OpenID-to-Google+-ID mapping, so you’d better start weaning your system off the OpenID identifiers. Of course you can skip this step (and remove the bits mentioning “openid” above) if you’re working on a new web app with no existing OpenID accounts to worry about.

$openid_id = $id_payload->openid_id;
if ($openid_id) {
    // Anything in our database referring to $openid_id should change to refer to $user_id.
    search_and_replace($openid_id, $user_id);
}

(Try it once to make sure it works, if you’re into that sort of thing.)

March development review: Keeping a good thing going

Added by Brett Smith about 6 years ago

Sometimes when we plan our sprints, we like to leave a little time for users to try out new features and give us feedback before we take a second pass on it. Other times, the idea is so obviously right that there's no point to that. The project sharing we developed last sprint was an instant hit! Naturally, we had to follow up by improving Arvados project pages. When the project has a description, that's the first thing visitors will see, in a dedicated tab. This gives project developers much more space to describe the project's pipelines and outputs, and help collaborators make sense of all the data.

There's another new feature to help Arvados users make their way around the system: documentation search. Simple, we know, but effective. If you're looking for information about a specific Arvados concept or tool, try using the search bar at the top of any documentation page.

Last but certainly not least, we've extended our Python SDK to make it easier to manipulate collections. Now instead of dealing with separate CollectionReader and CollectionWriter objects, you can instantiate Collection objects, which support both reading and writing. You can manipulate the contents of these objects with familiar methods like open, copy, exists, and remove. When you're done, just call the save method to update the collection in Arvados. This gives you a comfortable interface to update collections, while retaining top performance and better error handling by interacting directly with Keep in Python.
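
To illustrate the shape of that unified interface, here is a toy in-memory stand-in. This is not the real SDK class (the real one reads and writes data in Keep); it only models the open/exists/copy/remove/save pattern described above:

```python
import io
from contextlib import contextmanager

class Collection:
    """Toy in-memory model of the read/write Collection interface."""
    def __init__(self):
        self._files = {}
        self.saved = False

    @contextmanager
    def open(self, path, mode='r'):
        """File-like access: 'w' buffers writes, 'r' streams stored text."""
        if 'w' in mode:
            buf = io.StringIO()
            yield buf
            self._files[path] = buf.getvalue()
        else:
            yield io.StringIO(self._files[path])

    def exists(self, path):
        return path in self._files

    def copy(self, src, dst):
        self._files[dst] = self._files[src]

    def remove(self, path):
        del self._files[path]

    def save(self):
        # The real SDK would write the updated collection back to Arvados here.
        self.saved = True
```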

All this, in addition to a suite of bug fixes and interface improvements. If you haven't tried Arvados lately, definitely log in to our public beta server and kick the tires. And if you'd like to learn more, drop us a line on IRC or our mailing list.

May development review: Project promotion and Keep performance

Added by Brett Smith almost 6 years ago

We want Arvados to be the premier way for bioinformaticians and data scientists to collaborate. And as an open source project, we know that helping people understand your work is an important step to get others interested and joining you. In the last sprint, we added Open Graph support to public projects. Now when you share those projects on social media, people will see a useful preview with basic project information and a short introduction. We also improved the project description editing interface, so you'll have an easier time writing the information visitors need right on the project's front page.

Looking under Arvados' hood a bit, we also made several improvements to Keep performance, both on the server and client. The server's request handling has been reorganized to send each request through a wait queue fewer times, and it notifies clients when that queue is full to avoid overloading. Our Python Keep client library now makes its requests with the PycURL library. In our experience, PycURL lets us more accurately detect when a server is unreachable. This means our code can still retry requests quickly in that scenario, but we have more reliability in other bad conditions like a strained network. In the I/O-heavy pipelines our users run, these little improvements add up quickly.

The performance improvements won't end there. We're also putting the speed of Arvados collections under a microscope. We've collected a variety of performance metrics about them across the system, and we have plans for improvement over the next few development cycles. Watch this space for the updates.

If you have questions or feedback about these or other updates, don't hesitate to drop us a line in our IRC channel or mailing list. Or if you just want to kick the tires, our open beta program is running the very latest Arvados code. Let us know what you think.

A good helping of Docker with a sprinkling of Common Workflow Language

Added by Ward Vandewege about 6 years ago

Our most recent sprint contained the usual mix of bug fixes and new features. Here's a quick overview of what happened.

The pre-built Arvados Docker images were greatly improved. They are now the easiest way to test Arvados on a local workstation. Specifically, we added support for the web-based file uploader and websockets to the Docker containers. The containers can now also be stopped and restarted via the arvdock command.

You can get started with the Arvados Docker containers at

In Workbench, users can now create and manage their own Arvados-hosted git repositories.

We have more git-related Workbench features in the pipeline, so stay tuned!

Also on Workbench, the collection page now has a much more useful summary at the top that shows the collection UUID, content address, and some information about the contents of the collection.

Workbench is now also smarter about in-browser file previews, as we taught it to use standard MIME types to determine when a file preview makes sense.
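
That preview decision can be approximated with the standard mimetypes module. A sketch of the kind of check involved (the allowed-type prefixes here are an assumption for the example, not Workbench's actual list):

```python
import mimetypes

PREVIEW_PREFIXES = ('text/', 'image/')

def can_preview(filename):
    """Guess whether an in-browser preview makes sense, based on the
    standard MIME type inferred from the file name."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return bool(mime_type) and mime_type.startswith(PREVIEW_PREFIXES)
```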

The Arvados Node Manager now fully supports Google Compute Engine (GCE), in addition to Amazon Web Services (AWS).

We added basic LDAP/Active Directory authentication support to the Arvados SSO server.

Large (1 GiB+) file downloads through Workbench now complete (that was a bug). Quite a few other, smaller bugs in Workbench were also fixed.

And finally, we wrote the draft 2 specification for the Common Workflow Language (CWL). It is available online. We're very excited about the progress that is being made with the CWL, and we look forward to making it possible to run CWL pipelines on Arvados soon.

May/June development review: Performance across the board

Added by Brett Smith almost 6 years ago

The Arvados development team just concluded another sprint, continuing to build on a lot of our work from last time. The work you share through Arvados is more discoverable than ever thanks to Workbench's new public project listing. This page provides a helpful overview of all the public projects available through a cluster. It's immediately available from any Workbench page, even for folks who aren't logged in, so it's easy for anyone to find and browse the listings. You can see it in action by checking out the public projects on our open beta.

We also continued to improve the performance of Arvados collections. We now have a broad test suite to report how different collection operations perform in the Arvados API server and Workbench. Using this data, we made a few performance optimizations to the API server's collections handling. In the end, we reduced API response times by 35% for most requests. You'll feel the difference whether you're working with data sets through our Python SDK, FUSE driver, or Workbench.

The performance improvements don't end with collections. We put our public beta cloud through some scalability tests by running eight GATK variant caller pipelines in parallel, using GATK Queue to distribute work across multiple compute nodes. This led us to make some configuration changes to help the cluster perform more consistently; logging improvements to help track down issues with jobs at this scale; and a few fixes for corner-case bugs in Crunch's job dispatch code. Ultimately, we demonstrated that the pipeline's run time stayed flat even with this much parallelization—a testament to Arvados' design for scale.

You don't have to take our word for it. If you want to see Arvados' scalability for yourself, sign up for the open beta and run some pipelines. If you run into questions, don't hesitate to get in touch with us by IRC or e-mail.

April development review: Workbench makes it easy

Added by Brett Smith almost 6 years ago

Over the last month, we've made it easier than ever to get started with Arvados. You'll literally see the results as soon as you log in to Workbench: now a guide pops up to walk you through each step of running your own pipeline. The material will be familiar if you've already been through our tutorials, but the integration will make things easier for first-timers.

But even if you've been with us for a while, the improvements don't end there. Workbench now also recognizes a wider range of pipeline input errors, like insufficient rights to read a collection, and will prompt you to fix them before you submit the pipeline to run. We all make mistakes now and then, and Workbench will help you find and fix them faster than ever.

There's another tool to help you verify and debug your pipelines: we built a Git repository browser into Workbench. You can see the commit that a job used, and browse your repository at that point. This makes it super simple to do quick code checks on a job, like seeing whether files or particular bug fixes are missing.

Git integration improvements don't end with Workbench, though. We also taught Crunch to fetch job code from remote repositories. Now if you find a Crunch script you'd like to run in a public repository, you can run it simply by writing the public repository URL instead of an Arvados repository name when you submit a job or pipeline. Arvados will automatically fetch and store everything it needs.

Finally, we made Crunch's resource allocation smarter. In the previous sprint, we imposed memory limits on tasks to prevent one from interfering with others, allocating an equal amount of memory per compute node core. Crunch now recognizes when a job is running fewer tasks than cores available, and provides as much memory as it can to each task while preventing interference. This means Crunch provides better resource allocation to map-reduce and scatter-gather jobs.
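
The allocation rule described above amounts to dividing a node's memory among whichever is smaller: the number of tasks actually running or the core count. A sketch (the function name and numbers are illustrative, not Crunch's actual code):

```python
def memory_per_task_mb(node_memory_mb, node_cores, concurrent_tasks):
    """Equal memory share per task, but never split across more slots
    than there are tasks actually running on the node."""
    slots = min(node_cores, max(1, concurrent_tasks))
    return node_memory_mb // slots
```

So on an 8-core node, two concurrent tasks each get a quarter of the memory rather than an eighth, while a fully loaded node still divides memory evenly per core.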

We hope these new features are as useful to you as they are to us. You can still try out our open beta program if you haven't already. We'd be happy to hear your feedback on our IRC channel or mailing list. Get in touch!

June development review: Writing to FUSE on Red Hat installs

Added by Brett Smith almost 6 years ago

A couple of major new developments landed in the last Arvados development sprint. Whether you're using Arvados to analyze data, administering it in your lab, or both, you'll find something to like.

Users can now write files in our FUSE driver. When you enable this feature, new data you write will be uploaded to Keep, and all your changes will update the Arvados collection on the API server. Of course, data is never lost; every byte is still in Keep, and previous versions of collections can be identified by their content address. Now you have even more ways to combine existing data analysis tools with Arvados' rich data tracking and organizing features. Not every POSIX operation is supported yet, but this is enough to let you run interactive analysis directly on collections. Check our FUSE documentation for all the details.

Administrators now have more options for installing Arvados. We've long packaged each Arvados component for Debian 7, and we're happy to add CentOS 6.6 and Ubuntu 12.04 to our list of supported distributions. Of course, we took the opportunity to update our installation guide to match—and while we were at it, we improved several sections to add useful pointers and describe how we deploy Arvados at Curoverse. All of these changes should help anyone installing Arvados on any kind of cluster to have a smoother experience.

We hope you'll try out the new features, and let us know how they work out for you. We're always happy to hear your feedback on our IRC channel or mailing list. And if you're curious but not sure how to get started, the easiest way is to sign up for our public beta.

Introducing the Arvados Web shell

Added by Brett Smith over 5 years ago

Arvados includes a full suite of command-line tools to create and query objects, upload and download collections, and run jobs and pipelines. However, not everybody wants to install these tools, especially if they're just trying out or getting started with Arvados.

One way we've made Arvados easier for them is by providing a shell box on each cluster. This host can provide an SSH account to any user, and it already has all the client tools installed. Users can simply SSH to the shell box, and use the tools already there. But even this solution isn't seamless. If you haven't already used SSH, you'll have to generate a keypair and upload the public half to Arvados, which can be a finicky process to learn. On top of that, Windows doesn't include an SSH client, so Windows users have to find and install a third-party client. All of this is still more overhead than we'd like.

We've just rolled out a feature to make it much easier to get to work with these Arvados tools: Web shell. Now you have access to everything you need, right through your browser. When you visit Workbench's "Virtual machines" page, you'll see a login button next to each shell box you have access to.

Press that button to open a shell in your browser. This is a fully-featured SSH client, so you can use all the same tools you would normally—even text editors and others that draw on the screen. That includes all the Arvados tools installed on our shell boxes, too, like the Keep FUSE mount.

If you're curious about the technical details, the way this works under the hood is that we've written a PAM module to authenticate users with their Arvados API token. It's an API client just like most of our tools: it uses the provided API token to query login permissions for the shell box, and verifies the user's access if the API token owner and login link line up. With that done, we just have to configure SSH to use the new module, and the Web shell client to pass along your API token automatically.
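
That "line up" check reduces to: does some login-permission link grant this shell account to the owner of the presented API token? A simplified sketch with plain data structures (the link field names here are illustrative, not the exact Arvados schema):

```python
def login_permitted(token_owner_uuid, username, login_links):
    """True if some login link ties the API token's owner to this
    shell account name (simplified model of the PAM module's check)."""
    return any(
        link['tail_uuid'] == token_owner_uuid
        and link['properties'].get('username') == username
        for link in login_links
    )
```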

We think this feature will make Arvados more accessible to a lot of people, so we're excited for you to try it out and share your feedback. It's available right now on our public beta (you can sign up for an account if you haven't already). Let us know what you think on our IRC channel or mailing list.


Also available in: Atom RSS