Feature #4138

[Node Manager] Support Google Cloud Platform in node manager

Added by Tom Clegg almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Node Manager
Target version:
Start date:
11/17/2014
Due date:
% Done:

100%

Estimated time:
(Total: 2.00 h)
Story points:
1.0

Subtasks

Task #4580: Review 4138-node-manager-google-cloudClosedBrett Smith

Task #4880: get credentials for testingResolvedWard Vandewege

Task #4203: Implement arvnodeman.computenode.googlecloudResolvedTim Pierce

Task #5222: Review 4138-node-manager-gce-wipResolvedTom Clegg

Associated revisions

Revision 319503f1
Added by Brett Smith over 5 years ago

Merge branch '4138-node-manager-gce-wip'

Closes #4138, #5222.

Thanks, Tim.

History

#1 Updated by Tim Pierce almost 6 years ago

  • Assigned To set to Tim Pierce

#2 Updated by Tim Pierce over 5 years ago

  • Target version changed from 2014-10-29 sprint to Arvados Future Sprints

#3 Updated by Ward Vandewege over 5 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint

#4 Updated by Brett Smith over 5 years ago

  • Subject changed from [Crunch] Support Google Cloud Platform in node manager to [Node Manager] Support Google Cloud Platform in node manager
  • Category set to Node Manager

#5 Updated by Tim Pierce over 5 years ago

  • Status changed from New to In Progress

#6 Updated by Tim Pierce over 5 years ago

  • Target version changed from 2014-11-19 sprint to 2014-12-10 sprint

#7 Updated by Brett Smith over 5 years ago

Reviewing ab8acfe

GCE implementation

I realize how much it's a challenge to write this sort of code without being able to actually test it, and I realize that you're blocked on external resources for testing. Bearing that in mind, there are some things I know won't work, and others I wonder about:

  • ex_userdata is completely EC2-specific, so node creation will crash. And the ping secret has to get to the node somehow, because otherwise it has no way to register itself with Arvados. If you haven't seen it yet, the Node lifecycle wiki page might help illuminate what's going on there, under the Node setup section. You'll have to coordinate with Ward to figure out the right way to seed this to new GCE nodes.
  • In the EC2 driver, you pass in a dictionary to ex_metadata to set tags on the created node. GCE apparently has an ex_tags argument for this. Given this, it's not clear to me what all the ex_metadata handling is accomplishing.
  • In EC2, the hostname tag gets special treatment, and is treated as the hostname for other purposes. Your comment makes it sounds like we really want to be using GCE's metadata call. This is also something to settle with Ward, including what to do about libcloud's lack of support for it. (Do you know if there's an upstream issue we can follow, at least? Should we file one?)
  • The code to figure out the node's launch time is probably EC2-specific. Even if the GCE driver stores it in node.extra as well, I bet the key name or timestamp format are different, just to spite us.

Candidly, I'm not wild about the idea of merging this without being able to kick the tires in a real cloud environment, at least a little bit. How can anyone say that the story is done without that? You're going to have a round of "make sure it works in reality" debugging one way or another; why not do it in the branch, and have it be part of this story, rather than separating things out?

Class organization

  • You commented that filtering listed nodes is a client-side responsibility. I think gce.ComputeNode.list_nodes is the client side for our purposes. It could call super() for the real API response, then pare that list down in Python. There's no other place in the code that can understand how to distinguish compute nodes from the list that GCE provides. I really would like to see arvnodeman.computenode.drivers be the sole place where any kind of cloud-specific logic lives.
  • Do you think we should refactor the loop that calls _init methods for specific kwargs into BaseComputeNodeDriver? If it's generally useful (and you would know better than anyone else right now), I don't see why we wouldn't; it introspects enough that it shouldn't interfere with new classes.

Configuration

  • The libcloud GCE documentation makes a brief note that GCE has "minute-level billing (10-minute minimum)." This is consistent with other material I've seen, although I haven't seen it printed chapter and verse from Google. But if that's right, a good setting for shutdown_windows might be "30, 999999". The first window needs to be long enough for the node to safely bootstrap, and the sysadmin could set the second window (much) shorter if they want less flapping on the nodes, but after that, it's all up to how they want to balance billing concerns with stability/uptime concerns.
  • I'm not sure I understand your comment about the cloud timeout argument. Are you saying NodeManagerConfig requires it? If so, that's a bug. But if you pass the argument in to a libcloud compute node driver, it propagates all the way down to libcloud.common.base.Connection, where it has an effect.
  • The example settings that are commented out shouldn't have quotes around the values. I don't think ConfigParser will parse that as intended.
  • There's one trailing whitespace character in the example configuration.

Thanks.

#8 Updated by Tim Pierce over 5 years ago

Brett Smith wrote:

Reviewing ab8acfe

GCE implementation

I realize how much it's a challenge to write this sort of code without being able to actually test it, and I realize that you're blocked on external resources for testing. Bearing that in mind, there are some things I know won't work, and others I wonder about:

  • ex_userdata is completely EC2-specific, so node creation will crash. And the ping secret has to get to the node somehow, because otherwise it has no way to register itself with Arvados. If you haven't seen it yet, the Node lifecycle wiki page might help illuminate what's going on there, under the Node setup section. You'll have to coordinate with Ward to figure out the right way to seed this to new GCE nodes.
  • In the EC2 driver, you pass in a dictionary to ex_metadata to set tags on the created node. GCE apparently has an ex_tags argument for this. Given this, it's not clear to me what all the ex_metadata handling is accomplishing.

Both EC2 and GCE support ex_metadata: see e.g. http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#libcloud.compute.drivers.gce.GCENodeDriver.create_node In fact, in GCE, ex_metadata is the formally recommended way to deliver SSH keys to a new instance. So I think ex_metadata is probably the preferred route for implementing both of these (but I'll try to find out whether there are obstacles to delivering ping_secret this way).

GCE's ex_tags argument appears to be something entirely different -- a flat list of keywords that are associated with the node, no key-value pairs. ex_tags may actually turn out not to be useful to us at all in the end.

  • In EC2, the hostname tag gets special treatment, and is treated as the hostname for other purposes. Your comment makes it sounds like we really want to be using GCE's metadata call. This is also something to settle with Ward, including what to do about libcloud's lack of support for it. (Do you know if there's an upstream issue we can follow, at least? Should we file one?)

I haven't followed up to see if there are outstanding upstream issues for this or an ETA. That's a very good idea, I'll research it some more.

I want to say "how hard can it be to provide our own abstraction call over setMetadata?" but I know that way lies madness.

  • The code to figure out the node's launch time is probably EC2-specific. Even if the GCE driver stores it in node.extra as well, I bet the key name or timestamp format are different, just to spite us.

Good point. I can dig in to try to figure this out too, but this seems like another issue that's best solved by poking at a real node :-/

Candidly, I'm not wild about the idea of merging this without being able to kick the tires in a real cloud environment, at least a little bit. How can anyone say that the story is done without that? You're going to have a round of "make sure it works in reality" debugging one way or another; why not do it in the branch, and have it be part of this story, rather than separating things out?

Oh, I agree strongly. In my defense, I still have not gotten GCE credentials from Ward to test against a real environment, so by the time we got to the end of the sprint, my only option that gave me a chance of closing out my stories for the sprint was to break up the stories :-)

Class organization

  • You commented that filtering listed nodes is a client-side responsibility. I think gce.ComputeNode.list_nodes is the client side for our purposes. It could call super() for the real API response, then pare that list down in Python. There's no other place in the code that can understand how to distinguish compute nodes from the list that GCE provides. I really would like to see arvnodeman.computenode.drivers be the sole place where any kind of cloud-specific logic lives.

That seems reasonable. The main question to solve is how the user wants to be able to filter nodes and on what criteria, so I'll work with Ward to find out how those should be expressed in the config file, and that will help me implement a concrete feature here.

  • Do you think we should refactor the loop that calls _init methods for specific kwargs into BaseComputeNodeDriver? If it's generally useful (and you would know better than anyone else right now), I don't see why we wouldn't; it introspects enough that it shouldn't interfere with new classes.

I think that's probably a good idea? Each driver will have its own keywords that are not known to any others, so we might as well have BaseComputeNodeDriver know how to perform that introspection. I can do the refactor.

Configuration

  • The libcloud GCE documentation makes a brief note that GCE has "minute-level billing (10-minute minimum)." This is consistent with other material I've seen, although I haven't seen it printed chapter and verse from Google. But if that's right, a good setting for shutdown_windows might be "30, 999999". The first window needs to be long enough for the node to safely bootstrap, and the sysadmin could set the second window (much) shorter if they want less flapping on the nodes, but after that, it's all up to how they want to balance billing concerns with stability/uptime concerns.

Seems reasonable to me.

  • I'm not sure I understand your comment about the cloud timeout argument. Are you saying NodeManagerConfig requires it? If so, that's a bug. But if you pass the argument in to a libcloud compute node driver, it propagates all the way down to libcloud.common.base.Connection, where it has an effect.

Ah, so it is -- I was getting myself confused about the difference between the Arvados.timeout option and the Cloud Credentials.timeout. I left that comment as a reminder (mainly to myself) about what level in the abstraction layer consumed that option.

  • The example settings that are commented out shouldn't have quotes around the values. I don't think ConfigParser will parse that as intended.

Good catch, thanks.

  • There's one trailing whitespace character in the example configuration.

Fixed.

#9 Updated by Brett Smith over 5 years ago

Tim Pierce wrote:

Brett Smith wrote:

  • In the EC2 driver, you pass in a dictionary to ex_metadata to set tags on the created node. GCE apparently has an ex_tags argument for this. Given this, it's not clear to me what all the ex_metadata handling is accomplishing.

Both EC2 and GCE support ex_metadata: see e.g. http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#libcloud.compute.drivers.gce.GCENodeDriver.create_node In fact, in GCE, ex_metadata is the formally recommended way to deliver SSH keys to a new instance. So I think ex_metadata is probably the preferred route for implementing both of these (but I'll try to find out whether there are obstacles to delivering ping_secret this way).

Sorry, I got a little sloppy with my commentary. Definitely use ex_metadata where it's appropriate. I'm specifically worried about copying self.list_kwargs into it. In the EC2 driver, that takes any tags that we search for during listing, and makes sure they're applied to new nodes as well. Since GCE doesn't even have key-value tags, that can't be what it's doing here. So I guess my question is, what's the intent here?

  • The code to figure out the node's launch time is probably EC2-specific. Even if the GCE driver stores it in node.extra as well, I bet the key name or timestamp format are different, just to spite us.

Good point. I can dig in to try to figure this out too, but this seems like another issue that's best solved by poking at a real node :-/

Yes, definitely.

#10 Updated by Tim Pierce over 5 years ago

  • Target version changed from 2014-12-10 sprint to 2015-01-07 sprint

#11 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2015-01-07 sprint to 2015-01-28 Sprint

#12 Updated by Tim Pierce over 5 years ago

Creating new nodes on GCE requires a valid "Project ID", which is apparently accessible only through the GCE console. From
http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html under Service Accounts:

You will also need your “Project ID” which can be found by clicking on the “Overview” link on the left sidebar.

Please send this along when time permits so that I can proceed with testing and bug fixing.

#13 Updated by Tim Pierce over 5 years ago

Ready for review at 5fd3e8c.

Notes on configuration settings:
  • Cloud Credentials
    • json_credential_file is a path to the JSON file with credentials delivered from the GCE dashboard.
    • project is the name of the project, cut and pasted from the GCE dashboard.
  • Cloud Create
    • image and location are required fields.
    • gce.example.cfg provides suitable defaults; see the comments for guidance on selecting other valid image and location names.
  • Size

This code has been unit tested and was manually exercised in the Python console, creating a Node Manager generic driver with config.new_cloud_client() and using that to generate and destroy new GCE nodes with the same interface used by the NodeManagerDaemonActor when new jobs are created in Arvados.

#14 Updated by Tom Clegg over 5 years ago

  • Assigned To changed from Tim Pierce to Brett Smith

#15 Updated by Tom Clegg over 5 years ago

  • Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint

#16 Updated by Tom Clegg over 5 years ago

Only a couple of comments at 8f28c53... with the caveat that I'm still pretty new at libcloud/nodemanager:

Config
  • services/nodemanager/doc/gce.example.cfg seems to have forked the documentation in ec2.example.cfg. Should we move the docs and defaults into a default.cfg, and reserve the platform-example files for just the values and comments that are platform specific? (I don't think this needs to hold up merging, or even closing the issue, but we might save ourselves some trouble if we deal with it before it gets too much more forked...)
  • Better to use RFC-protected rather than ?
  • I looked around for where cost info could come from, when choosing the cheapest node size. It seems to just come from the config file, except that we don't have any examples of it -- is that right? Should we add an example, maybe commented out so we don't have to worry so hard about keeping it accurate?
Fake node_start_time
  • This resets node_start_time to a recent time when node manager starts up. But ComputeNodeMonitorActor.shutdown_eligible relies on a "started recently" test to tell the difference between a new unpaired node (which is probably a bootstrapping failure and should be shut down) and an old unpaired node (which means the node has been operating for a while without node manager's help, which might mean it's a bad idea to touch it at all). Should we be worried about this? (Even if so, I haven't thought of a better plan...)

#17 Updated by Brett Smith over 5 years ago

Tom Clegg wrote:

  • services/nodemanager/doc/gce.example.cfg seems to have forked the documentation in ec2.example.cfg. Should we move the docs and defaults into a default.cfg, and reserve the platform-example files for just the values and comments that are platform specific? (I don't think this needs to hold up merging, or even closing the issue, but we might save ourselves some trouble if we deal with it before it gets too much more forked...)

As a sysadmin, I really like it when server software comes with "follow the bouncing ball" example configuration files. You just take the one you want, read the comments, edit settings to taste, and you're good to go. The current setup facilitates this for each service, and I like that.

It is admittedly not DRY. I feel like the trade-off is worth it as a user affordance, but that's admittedly kind of hypothetical right now until people start delopying Arvados on clouds themselves. If you're not swayed by this, I can change it.

Yes, done.

  • I looked around for where cost info could come from, when choosing the cheapest node size. It seems to just come from the config file, except that we don't have any examples of it -- is that right?

libcloud provides it, based on information available through each cloud's respective API. Nifty, huh?

  • This resets node_start_time to a recent time when node manager starts up. But ComputeNodeMonitorActor.shutdown_eligible relies on a "started recently" test to tell the difference between a new unpaired node (which is probably a bootstrapping failure and should be shut down) and an old unpaired node (which means the node has been operating for a while without node manager's help, which might mean it's a bad idea to touch it at all). Should we be worried about this?

Yes. I've made the GCE driver smarter: when it boots a node, it stores the boot time in the node's metadata. If that information isn't available, the boot time method returns 0. This way, we'll retain the behavior of avoiding touching nodes that we don't have enough information about. We'll also retain the information when Node Manager is restarted.

Now at 28467b4.

#18 Updated by Tom Clegg over 5 years ago

Brett Smith wrote:

Tom Clegg wrote:

  • services/nodemanager/doc/gce.example.cfg seems to have forked the documentation in ec2.example.cfg. Should we move the docs and defaults into a default.cfg, and reserve the platform-example files for just the values and comments that are platform specific? (I don't think this needs to hold up merging, or even closing the issue, but we might save ourselves some trouble if we deal with it before it gets too much more forked...)

As a sysadmin, I really like it when server software comes with "follow the bouncing ball" example configuration files. You just take the one you want, read the comments, edit settings to taste, and you're good to go. The current setup facilitates this for each service, and I like that.

It is admittedly not DRY. I feel like the trade-off is worth it as a user affordance, but that's admittedly kind of hypothetical right now until people start delopying Arvados on clouds themselves. If you're not swayed by this, I can change it.

Yes, I'm fine with leaving it alone for now. Eventually maybe we can have it both ways, somehow -- perhaps that means auto-generating example configs (when?) by combining the docs+defaults file with the example config values / additional comments?

(As an occasional sysadmin, I like the quick-start of skimming a well-documented config file and editing the parts I need... but I also dislike manually merging the vendor's updated comments into my giant mostly-docs-and-defaults config file N times after that, and having to do a diff to find out which things I changed and which are just defaults.)

libcloud provides it, based on information available through each cloud's respective API. Nifty, huh?

Yes! I suspected/hoped that was happening somehow, but I didn't see it.

Yes. I've made the GCE driver smarter: when it boots a node, it stores the boot time in the node's metadata. If that information isn't available, the boot time method returns 0. This way, we'll retain the behavior of avoiding touching nodes that we don't have enough information about. We'll also retain the information when Node Manager is restarted.

Now at 28467b4.

LGTM, thanks!

#19 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 75 to 100

Applied in changeset arvados|commit:319503f1a8eda9fb9cea0bff038ad437e88ebeac.

Also available in: Atom PDF