Lifecycle of an Arvados compute node

Many components work in concert to prepare a compute node to receive Crunch jobs. This page describes that process from start to finish.

Arvados node creation

Before you can set up a compute node, you need to create an Arvados node record. The API server will assign a random ping_secret value in the node's info hash.
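
To make the shape of that record concrete, here is a minimal sketch in Python. It is not the API server's actual code: the function name new_node_record and the hostname and ip_address fields are assumptions for illustration. The only details taken from this page are that the secret is random and that it lives under ping_secret in the node's info hash.

```python
import secrets


def new_node_record():
    """Return a hypothetical, freshly created node record.

    Only info["ping_secret"] comes from the page; the other
    fields are placeholders that a later ping would fill in.
    """
    return {
        "hostname": None,      # assumed field, assigned later
        "ip_address": None,    # assumed field, reported by the node
        "info": {
            # Random, URL-safe secret that the node must present when it pings.
            "ping_secret": secrets.token_urlsafe(32),
        },
    }
```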

Node setup

When a compute node comes up, it registers itself with Arvados by sending a node ping action. This ping includes basic information that Arvados needs to talk to the node, such as its IP address. The node should keep sending pings regularly for as long as it is available to do compute work.

The ping request must include the previously-assigned ping secret. Different deployments may use different strategies to get the ping secret onto the new node. Here's how we do it on Curoverse EC2 cloud installations of Arvados:

  • We've prepared an EC2 node image that includes the ping script, and the cron jobs necessary to run it regularly.
  • When we spin up a new node (e.g., in Node Manager), we use that image, and put a base ping URL (including the ping secret) in the node's "user data." This is set at node creation time, and the ping script can read it via a special file on the filesystem.
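
As a rough illustration of what a ping script might do with that base URL, here is a sketch in Python. It is not the script shipped in our image. The function names, the ip query parameter, the example URL, and the use of an HTTP POST are all assumptions. The only thing taken from this page is that the base ping URL already carries the ping secret and that the node adds information about itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def build_ping_url(base_ping_url, node_info):
    """Append node details to a base ping URL, keeping its existing query.

    base_ping_url is expected to include the ping secret already.
    node_info is a dict of extra parameters (names are hypothetical).
    """
    parts = urlsplit(base_ping_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(sorted(node_info.items()))
    return urlunsplit(parts._replace(query=urlencode(query)))


def send_ping(url, timeout=30):
    """Send one ping and return the HTTP status code.

    POST is an assumption; use whatever method your API server expects.
    """
    request = Request(url, data=b"", method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A cron job could call build_ping_url with the URL read from user data, then pass the result to send_ping every few minutes.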

Node association with Arvados

When the API server receives a ping, it will update the node record with the provided information, and automatically assign any fields that are necessary for operation but not yet set (e.g., hostname). At this point, crunch-dispatch will see that the node is available to do work, and send jobs to it.
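
The server side of that exchange can be sketched as follows. This is an illustrative Python model, not the API server's implementation: the handle_ping function, the ALLOWED_PING_FIELDS tuple, the next_slot argument, and the "compute<N>" hostname pattern are assumptions. It shows the three steps described above: check the secret, record what the node reported, and fill in anything still missing.

```python
import hmac

# Hypothetical list of fields a node is allowed to report about itself.
ALLOWED_PING_FIELDS = ("ip_address",)


def handle_ping(node, ping_secret, reported, next_slot):
    """Apply one ping to a node record (a plain dict) and return it."""
    expected = node["info"]["ping_secret"]
    # Constant-time comparison, so timing does not leak the secret.
    if not hmac.compare_digest(expected.encode(), ping_secret.encode()):
        raise PermissionError("ping secret does not match node record")

    # Update the record with the information the node provided.
    for field in ALLOWED_PING_FIELDS:
        if field in reported:
            node[field] = reported[field]

    # Assign required fields that are not yet set (hostname, in this sketch).
    if not node.get("hostname"):
        node["hostname"] = "compute{}".format(next_slot)
    return node
```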

You may want to propagate some information assigned by the API server back to the real node, or other places. For example, on EC2 clouds, Node Manager updates the node's name tag with the name assigned by the API server.

Node shutdown

Nodes can go down for any number of reasons, planned or unplanned. There is no explicit shutdown mechanism. Instead, Arvados components that care should assume that a node that has not sent a ping in some time is down, or at least unable to do compute work.

Updated by Brett Smith almost 10 years ago · 3 revisions