Story #4904

[SDKs] Use websockets to restart/reconfigure a web service running on a VM whenever the content of a project changes

Added by Tom Clegg over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Sample Pipelines
Target version:
Start date:
01/09/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

  1. a really really short Python program that uses a super convenient SDK library call to listen to websockets, waits for a specific event type, and prints "foo"
    • this is mostly just a copy of the main section of ws.py, with "print json.dumps(ev)" and the "subscribe to extra stuff" functionality replaced with something simple like print "new collection arrived: %s" % uuid
  2. a longer but more useful example of that same pattern which brings up a web service based on what it sees there. Example: shut down the old/existing docker container "foo", and start up a new one, passing it the portable_data_hash of the most recently added/updated collection.

Subtasks

Task #4947: Provide example to science teamResolvedPeter Amstutz

Task #5018: Review 4904-arv-webResolvedPeter Amstutz

Task #4946: Fix project directory refresh bug in FUSEResolvedPeter Amstutz

Associated revisions

Revision a67bb34f
Added by Peter Amstutz over 5 years ago

Merge branch '4904-arv-web' closes #4904

History

#1 Updated by Tom Clegg over 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-01-28 Sprint

#2 Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz

#3 Updated by Peter Amstutz over 5 years ago

  • Subject changed from [DRAFT] [SDKs] Use websockets to restart/reconfigure a web service running on a VM whenever the content of a project changes to [SDKs] Use websockets to restart/reconfigure a web service running on a VM whenever the content of a project changes

#4 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)

#5 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)

#6 Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg over 5 years ago

  • Category set to Sample Pipelines

b4fd2ab4 looks pretty good.

I haven't scrutinized the code (or docs) as a proper review. I noticed a couple of things:
  • The doc page needs some proofreading/copy-editing.
  • The doc page should address the downtime issue. It looks like there's significant downtime every time the "current collection" changes. This is fine for now, but I think it deserves to be mentioned.
  • I think it would make a better example (starting point for users to fork) if we tidy up the big while loop so the "what to do" details are more cleanly separated from the "when to do stuff" details.
  • The "running" variable seems to mean "waiting".

#8 Updated by Brett Smith over 5 years ago

Reviewing b4fd2ab

arv-web

I double-checked that I was running arvados_fuse from the branch, but I get this error when trying to run arv-web:

Traceback (most recent call last):
  File "arv-web.py", line 84, in <module>
    mountdir = run_fuse_mount(collection)
  File "arv-web.py", line 36, in run_fuse_mount
    operations = Operations(os.getuid(), os.getgid(), "utf-8", True)
TypeError: __init__() takes at most 4 arguments (5 given)

I have the latest llfuse (0.40) installed.

It's kind of a downer that there are no tests for arv-web. It's also a downer that the code will require refactoring to make it testable, since the code is in the module top level, it relies on global variables, it instantiates those objects directly, etc. I know there's a little tension here with Tom's request to keep it simple, but if this is going to be an advertised feature of Arvados, I imagine we'll end up writing tests for it sooner rather than later.

Beyond that:

  • --project help should specify that it's expecting a UUID. It should also accept --project-uuid as a synonym, for consistency with our other tools.
  • I think "Local bind port" might be too terse a hint for our users. What about "Local port to access application"? Or something else, I'm not wedded to that.
  • The code crashes with an IndexError if you pass it a project with no collections. Please convert this to a nicer error.
  • A collection can change in ways that don't affect its contents. I think making newcollection != collection more specific (checking portable_data_hash?) could help avoid unnecessary restart overhead in these cases.
  • In on_message, would it help readability to resolve the old and new owner_uuids to local variables up front? Repeatedly resolving three levels of dictionary lookups so often is a bit of a bear, especially since the important part (old vs. new) is in the middle of it.
  • In logging.exception(str(e)) - logging.exception always writes the full backtrace and exception, so passing it the stringified exception is redundant. Pass it a message to explain the traceback's context, or nothing.
  • I don't have a strong opinion about whether or not you use the Arvados logging infrastructure, but using the logging module methods means all your log messages refer to "root"—it means the root logger, but it's easy to assume it means the root user. Using a named logger would avoid that.

These don't need to be addressed for the branch, just FYI:

  • The logging module does string interpolation for you, so you can say, e.g., logging.info("%s %s", eq[1], eq[2]). It usually looks a little nicer.
  • In t = threading.Thread(None, lambda: llfuse.main()), the lambda is redundant. You can just pass in the llfuse.main callable directly.

Documentation

  • "Using arv-web" subheader is redundant with the title.
  • "arv-web is a long running service for managing a web service …" - Try to avoid echoing "service." Maybe "arv-web supervises a web service …"?
  • "Arv-web applications are reproducible, immutable application bundle where …" has number disagreement and echo of "application."
  • "it is efficient to add produce a new application bundle" has an extra word.
  • /mnt should always be set in monospace font.
  • "UUID" should always be uppercase.
  • "Docker" should always be capitalized (unless you're referring to the docker command).
  • "this is part section of the URL" is missing a word.
  • The second paragraph of "Writing your own applications" is missing its ending period.
  • "it will render default Apache index pages permitting simple browsing the collection contents" could use rewording to avoid the double gerund. Maybe "… index pages to let you browse the collection contents"?

Documentation code

I think it's really worthwhile to have our code samples represent really beautifully idiomatic Python. Users often start their work by copying and pasting ours, so having more robustness than is immediately necessary helps the code stand up to some of those changes. Plus, we benefit if they learn a thing or two about Python from it.

  • Prefer ev.get('event_type') … to 'event_type' in ev and ev['event_type'] ….
  • Please use a consistent quoting convention for dictionary lookups and comparisons. Either style is fine, but using two different styles on two lines that do the same thing is a little jarring.
  • "A new collection was created %s" needs a colon or something to separate the text from the UUID.
  • Please remove cache=False from the arvados.api call. I don't see any reason it's necessary in this example, and we might as well get in front of #5037.
  • import json is unnecessary.

Docker

  • Please hook the arv-web image into the existing build infrastructure so people can run build.sh arv-web.
  • trap "kill -TERM -$pgrp; exit" EXIT TERM KILL SIGKILL SIGTERM SIGQUIT - Some of these spellings are redundant, and the OS won't let you trap SIGKILL. I think you just need to list EXIT TERM QUIT, and maybe INT if you want to throw it in.

Thanks.

#9 Updated by Peter Amstutz over 5 years ago

  • Target version changed from 2015-01-28 Sprint to 2015-02-18 sprint

#10 Updated by Brett Smith over 5 years ago

Reviewing 8233bab

Bugs

  • On my system, using our own Docker package, the service refresh causes the Docker daemon and most of my own processes to die. I reproduced the problem outside arv-web by just running:
    $ docker run -d arvados/debian sleep 1h
    $ docker exec [that cid] killall --regexp '.*' --signal HUP

    Here's arv-web's log when it happens:
    2015-02-02 14:04:52 arvados.arv-web[5220] INFO: Mounting 4xphq-4zz18-kc5hdi6uphu2l7k
    2015-02-02 14:04:52 arvados.arv-web[5220] INFO: Starting Docker container arvados/arv-web
    2015-02-02 14:04:52 arvados.arv-web[5220] INFO: Container id 8e2c470ebfdc795b9945a36a092ea99985dc22d0393e745108789dea63b3b17a
    2015-02-02 14:04:52 arvados.arv-web[5220] INFO: Waiting for events
    apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3 for ServerName
    2015-02-02 14:06:03 arvados.arv-web[5220] INFO: create 4xphq-4zz18-v374kx4pf7ckmi5
    2015-02-02 14:06:03 arvados.arv-web[5220] INFO: Mounting 4xphq-4zz18-v374kx4pf7ckmi5
    2015-02-02 14:06:04 arvados.arv-web[5220] INFO: Sending refresh signal to container
    2015/02/02 14:06:04 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
    2015-02-02 14:06:04 arvados.arv-web[5220] INFO: Waiting for events
  • sample-rack-app appears to have a misconfiguration. When I try to start arvados/arv-web with that mount, a Python process apparently enters an infinite loop (~100% CPU usage), and docker run never returns.

Lesser bugs

  • When the code loops waiting to read content from a file, it should check that the exception's errno is ENOENT, and reraise it otherwise. Without this, it could enter an infinite loop if there's some other read problem.
  • If docker run fails for some reason, there's no cidfile to remove. When the clean-up code tries to do that later, it fails and eats the exception, and skips removing ciddir. In my testing, because of the first bug, I ended up with a lot of stray tempdirs lying around.
  • logger.exception(e) - This is still not correct. Like I said, "Pass it a message to explain the traceback's context, or nothing." The method gets the current exception information from sys.exc_info(), so passing in the exception itself is also redundant.
  • The code should probably check that the provided project UUID is actually a project UUID before it enters the main loop. I got turned around and accidentally entered a collection UUID, and it happily entered a main loop that would never actually do anything.
  • The clean-up process in the finally block does a lazy unmount, but immediately tries to rmdir the mount point, without checking if the unmount has gone through. Given the arv-web owns the mount, I think it probably should omit -z, although there are probably other ways to deal with this.
  • Above that, if docker stop fails we'll get an exception, but the code doesn't do anything about this, which will interrupt the rest of the cleanup. Is this intended behavior?

Design issues

I am very concerned about arv-web's deep integration with CollectionDirectory from the FUSE driver. arv-web basically half-reinstantiates the object by modifying a few instance attributes. But we don't treat the FUSE module's interface as a public interface like we do with the Python SDK; we don't have tests for this use, either on the FUSE end or on arv-web's; and even the FUSE driver doesn't ever exercise CollectionDirectory this way.

I would be much more comfortable if arv-web ran arv-mount as a subprocess like everything else, and then had another channel to give the Docker image the collection identifier. If that's too difficult, at least give CollectionDirectory a single reconfiguration method to do what arv-web needs, taking all the necessary parameters as arguments, and document the dependency.

main() has gotten really hairy. It has a block that's 11 levels deep. It has a 74-line try block which is nested immediately inside a 113-line try block, making it difficult to pair up the try lines with the except lines. Please break out useful functional units for readability.

There are a few chunks of repeated code: the "keep trying to read this file until I get something" loop, the "log and stop Docker" 2-liner, the block to get the project's latest Collection. DRYing some of these up might be a good start for cleaning up.

Python style

Documentation

  • The included --help output needs to be updated.
  • The first paragraph of "Running sample applications" is missing its ending period.
  • References to the docker_image file in the last paragraph should be monospace.

Thanks.

#11 Updated by Peter Amstutz over 5 years ago

Will go back to the "shut down the container, shut down the mount, start a new mount, start a new container" approach. Most of the cleverness relates to trying to get it to reload quickly, but that's not an acceptable tradeoff if the result is just broken.

#12 Updated by Peter Amstutz over 5 years ago

On second thought, I have a couple of ideas that would be slightly less hacky without going all the way backwards.

#13 Updated by Peter Amstutz over 5 years ago

  • Target version deleted (2015-02-18 sprint)

Brett Smith wrote:

Reviewing 8233bab

Bugs

  • On my system, using our own Docker package, the service refresh causes the Docker daemon and most of my own processes to die.

This is an epic container fail if it killed processes outside the container. It did work for me. However, I have implemented a better mechanism. If there is a file called "reload" in the collection, it is executed to tell the server inside the container to reload. If there is no "reload" script, or the new docker image is different from the current docker image, the container is stopped and started with the new image.

Lesser bugs

  • When the code loops waiting to read content from a file, it should check that the exception's errno is ENOENT, and reraise it otherwise. Without this, it could enter an infinite loop if there's some other read problem.

Worked around this problem by accessing the "docker_image" and "reload" files directly via CollectionDirectory instead of open(). This is much less problematic since it is already in-processes.

  • If docker run fails for some reason, there's no cidfile to remove. When the clean-up code tries to do that later, it fails and eats the exception, and skips removing ciddir. In my testing, because of the first bug, I ended up with a lot of stray tempdirs lying around.

Now detaches the container and gets the container id from the docker stdout.

  • logger.exception(e) - This is still not correct. Like I said, "Pass it a message to explain the traceback's context, or nothing." The method gets the current exception information from sys.exc_info(), so passing in the exception itself is also redundant.

Fixed.

  • The code should probably check that the provided project UUID is actually a project UUID before it enters the main loop. I got turned around and accidentally entered a collection UUID, and it happily entered a main loop that would never actually do anything.

Fixed.

  • The clean-up process in the finally block does a lazy unmount, but immediately tries to rmdir the mount point, without checking if the unmount has gone through. Given the arv-web owns the mount, I think it probably should omit -z, although there are probably other ways to deal with this.

Fixed.

  • Above that, if docker stop fails we'll get an exception, but the code doesn't do anything about this, which will interrupt the rest of the cleanup. Is this intended behavior?

Fixed.

Design issues

I am very concerned about arv-web's deep integration with CollectionDirectory from the FUSE driver. arv-web basically half-reinstantiates the object by modifying a few instance attributes. But we don't treat the FUSE module's interface as a public interface like we do with the Python SDK; we don't have tests for this use, either on the FUSE end or on arv-web's; and even the FUSE driver doesn't ever exercise CollectionDirectory this way.

I would be much more comfortable if arv-web ran arv-mount as a subprocess like everything else, and then had another channel to give the Docker image the collection identifier. If that's too difficult, at least give CollectionDirectory a single reconfiguration method to do what arv-web needs, taking all the necessary parameters as arguments, and document the dependency.

Added CollectionDirectory.change_collection() method.

main() has gotten really hairy. It has a block that's 11 levels deep. It has a 74-line try block which is nested immediately inside a 113-line try block, making it difficult to pair up the try lines with the except lines. Please break out useful functional units for readability.

There are a few chunks of repeated code: the "keep trying to read this file until I get something" loop, the "log and stop Docker" 2-liner, the block to get the project's latest Collection. DRYing some of these up might be a good start for cleaning up.

Refactored into an ArvWeb class and split out chunks of code into separate methods.

Python style

Fixed.

Documentation

  • The included --help output needs to be updated.

Fixed.

  • The first paragraph of "Running sample applications" is missing its ending period.

Fixed.

  • References to the docker_image file in the last paragraph should be monospace.

Fixed.

#14 Updated by Peter Amstutz over 5 years ago

  • Target version set to 2015-02-18 sprint

#15 Updated by Brett Smith over 5 years ago

Reviewing 5dbf5c8.

Peter Amstutz wrote:

Brett Smith wrote:

  • On my system, using our own Docker package, the service refresh causes the Docker daemon and most of my own processes to die.

This is an epic container fail if it killed processes outside the container.

Yeah, definitely. It smells more like a Docker bug than anything arv-web did wrong. But we go to code with the Docker we have, not the Docker we want to have or wished we had.

  • If docker run fails for some reason, there's no cidfile to remove. When the clean-up code tries to do that later, it fails and eats the exception, and skips removing ciddir. In my testing, because of the first bug, I ended up with a lot of stray tempdirs lying around.

Now detaches the container and gets the container id from the docker stdout.

The code still builds ciddir and cidfilepath, which are now extraneous and can be removed.

  • The code should probably check that the provided project UUID is actually a project UUID before it enters the main loop. I got turned around and accidentally entered a collection UUID, and it happily entered a main loop that would never actually do anything.

Fixed.

Please make it exit nonzero in this case.

It's good to merge with these changes. Thanks.

#16 Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:a67bb34f6f19662f0a30e4aa670774c4595cb7a4.

Also available in: Atom PDF