Project

General

Profile

Actions

Idea #3609

closed

[SDKs] arv-run wrapper to interactively submit 'run-command' jobs

Added by Peter Amstutz over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
09/16/2014
Due date:
Story points:
2.0

Subtasks 6 (0 open6 closed)

Task #3731: Add pipe support to run-commandResolvedPeter Amstutz10/14/2014Actions
Task #4168: Talk to awz about his vision for this toolResolvedPeter Amstutz09/16/2014Actions
Task #3733: Write wrapperResolvedPeter Amstutz09/16/2014Actions
Task #4183: Review 3609-arv-wsResolvedPeter Amstutz09/16/2014Actions
Task #3729: Design command lineResolvedPeter Amstutz09/16/2014Actions
Task #4241: Review 3609-arv-runResolvedTim Pierce09/16/2014Actions
Actions #1

Updated by Peter Amstutz over 9 years ago

Brainstorming:

/keep/foo/1.baz /keep/foo/2.baz /keep/foo/3.baz

Example command line:

arv-run grep /keep/foo/*.baz '|' cut -c5 '>' stuff

Runs as separate tasks:
bash -c grep /keep/foo/1.baz | cut -c5 > stuff
bash -c grep /keep/foo/2.baz | cut -c5 > stuff
bash -c grep /keep/foo/3.baz | cut -c5 > stuff

Combine "stuff" at the end.

Collect everything at the end of the command line that doesn't start with '-' as separate files. If necessary, signal start of file list using '--'

Run one task per file, or specify grouping

arv-run --group-by=2 bwa mem -- *.fastq

Actions #2

Updated by Tom Clegg over 9 years ago

  • Subject changed from arv-run wrapper to interactively submit 'run-command' jobs to [SDKs] arv-run wrapper to interactively submit 'run-command' jobs
  • Story points set to 2.0
Actions #3

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Actions #4

Updated by Peter Amstutz over 9 years ago

  • Category set to SDKs
  • Assigned To set to Peter Amstutz
Actions #5

Updated by Peter Amstutz over 9 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Peter Amstutz over 9 years ago

  • Target version changed from 2014-09-17 sprint to 2014-10-08 sprint
Actions #7

Updated by Tom Clegg over 9 years ago

  • Target version changed from 2014-10-08 sprint to Arvados Future Sprints
Actions #8

Updated by Peter Amstutz over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-29 sprint
Actions #9

Updated by Brett Smith over 9 years ago

Reviewing the revamped arv-ws at 55888e6.

events module

  • Please add tests for PollClient.
  • It shouldn't be necessary to do your own JSON encoding of API parameters; the API client library should take care of that for you. It definitely works for filters, and seems to do OK with a single-string order parameter too. It seems to mess up when you pass a list to order. Maybe that's a bug in our discovery document?
  • _logger.error("Web sockets not available at %s" % api._rootDesc['websocketUrl']) can crash if there is no websocketUrl (e.g., we got here because PollClient raised an exception). I think you could avoid these sorts of problems, and reduce code that's redundant between the branching logic and the exception logic, by having the try clause only cover the main websockets setup code, and do necessary cleanup if it fails.
  • This is a modernization thing not directly related to your branch, but while you're here, the websocket URL should be generated using api.api_token rather than config.get('ARVADOS_API_TOKEN').

ws command

  • Please add help for all the options. I don't think I would've figured out that --filters expects a JSON string without reading the code.
  • The name --poll-fallback sounds like a boolean, especially since it's accompanied by --no-poll-fallback. Please consider a different name that clarifies that it's defining a poll interval.
  • lambda ev: on_message(ev) is redundant. You can just write on_message here.
  • Please remove the commented code from on_message.

Thanks.

Actions #10

Updated by Brett Smith over 9 years ago

Reviewing arv-ws follow-ups at 58014b9.

  • Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.
  • In the tests, lambda ev: self.on_event(ev) is redundant as before.
  • Suggest signal.pause() instead of time.sleep(60) to wait for the child thread to finish.
  • The polling-related switches would be nicer as a mutually exclusive group, since it makes no sense to set a poll interval and turn polling off.
  • The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.

Thanks.

Actions #11

Updated by Peter Amstutz over 9 years ago

Brett Smith wrote:

Reviewing the revamped arv-ws at 55888e6.

events module

  • Please add tests for PollClient.

Test added.

  • It shouldn't be necessary to do your own JSON encoding of API parameters; the API client library should take care of that for you. It definitely works for filters, and seems to do OK with a single-string order parameter too. It seems to mess up when you pass a list to order. Maybe that's a bug in our discovery document?

You're right, the discovery document does specify type=string (instead of type=array) for the "order" parameter. However changing this falls into a gray area of backwards compatibility that seems out of scope for this branch. Instead, since there's only a single order column, I just changed to be a string (no json encoding necessary).

  • _logger.error("Web sockets not available at %s" % api._rootDesc['websocketUrl']) can crash if there is no websocketUrl (e.g., we got here because PollClient raised an exception). I think you could avoid these sorts of problems, and reduce code that's redundant between the branching logic and the exception logic, by having the try clause only cover the main websockets setup code, and do necessary cleanup if it fails.

Fixed.

  • This is a modernization thing not directly related to your branch, but while you're here, the websocket URL should be generated using api.api_token rather than config.get('ARVADOS_API_TOKEN').

Fixed.

ws command

  • Please add help for all the options. I don't think I would've figured out that --filters expects a JSON string without reading the code.

Fixed.

  • The name --poll-fallback sounds like a boolean, especially since it's accompanied by --no-poll-fallback. Please consider a different name that clarifies that it's defining a poll interval.

Fixed.

  • lambda ev: on_message(ev) is redundant. You can just write on_message here.

Fixed.

  • Please remove the commented code from on_message.

Fixed.

Actions #12

Updated by Peter Amstutz over 9 years ago

Brett Smith wrote:

Reviewing arv-ws follow-ups at 58014b9.

  • Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.

Changed to use event variable.

  • In the tests, lambda ev: self.on_event(ev) is redundant as before.

Fixed.

  • Suggest signal.pause() instead of time.sleep(60) to wait for the child thread to finish.

Didn't know about that one. Fixed.

  • The polling-related switches would be nicer as a mutually exclusive group, since it makes no sense to set a poll interval and turn polling off.

That was my mistake, I intended them to be mutually exclusive but used the wrong method.

  • The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.

Fixed.

Actions #13

Updated by Brett Smith over 9 years ago

Reviewing 2379619

The new logic for cloning an API client for PollClient seems to be inconsistently applied. We use a threaded WebSocketClient, so it seems like that whatever thread safety policy we apply to PollClient, we should also apply to EventClient.

Personally, I think we should make that the caller's responsibility: "When you pass an API client to subscribe(), you can't use it again until you close the client." Like you pointed out on IRC, api() kwargs make it difficult to do proper cloning, and it seems like most users will fall into the pattern that they have an API client to do some setup, and then use it for websockets the rest of the time.

But, if you feel strongly about cloning, then I think we need to do the clone at the top of subscribe(), before we build any client.

Peter Amstutz wrote:

Brett Smith wrote:

Reviewing arv-ws follow-ups at 58014b9.

  • Sleeping in multithreaded tests is really dicey business. It's difficult to find values that are both fast for developer machines, and accommodating enough for Jenkins. What if on_event in both these test classes set a threading.Event for state 2, and the test method waited for that with a generous timeout (10 seconds, maybe)? There's really no surefire way to test state 3 with the current structure, since that requires solving the halting problem.

Changed to use event variable.

What about the time.sleep(1) before creating the object? If we need to make sure the 200 OK is handled before we send that, I think we need an event flag for that too. If not, I think we can ditch the sleep.

  • Suggest signal.pause() instead of time.sleep(60) to wait for the child thread to finish.

Didn't know about that one. Fixed.

You can remove the unused time import now.

  • The help and error messages inconsistently use both "websockets" and "web sockets." Please figure out the preferred spelling and make them consistent.

Fixed.

There's still a logger warning with "web sockets" on events.py line 100.

Actions #14

Updated by Peter Amstutz over 9 years ago

Brett Smith wrote:

Reviewing 2379619

Personally, I think we should make that the caller's responsibility: "When you pass an API client to subscribe(), you can't use it again until you close the client." Like you pointed out on IRC, api() kwargs make it difficult to do proper cloning, and it seems like most users will fall into the pattern that they have an API client to do some setup, and then use it for websockets the rest of the time.

Fixed.

What about the time.sleep(1) before creating the object? If we need to make sure the 200 OK is handled before we send that, I think we need an event flag for that too. If not, I think we can ditch the sleep.

Yep, added a "subscribed" event variable too. Refactored EventClientTest and PollClientTest to use the same test code.

You can remove the unused time import now.

Fixed.

There's still a logger warning with "web sockets" on events.py line 100.

Fixed.

Actions #15

Updated by Brett Smith over 9 years ago

Peter Amstutz wrote:

Yep, added a "subscribed" event variable too. Refactored EventClientTest and PollClientTest to use the same test code.

Just a couple of small things about the new test structure:

  • In EventTestBase.runTest, assertIsInstance would generate a nicer diagnostic than the current condition+fail.
  • In the subclasses' tearDown, please guard against the possibility that self.ws is None.
  • super(run_test_server.TestCaseWithServers, self).tearDown() should have the current class (WebsocketTest or PollClientTest) as the first argument. The current invocation is running the tearDown method of TestCaseWithServers' superclass, which is presumably a noop.

Please go ahead and merge with these changes. Thanks.

Actions #16

Updated by Tim Pierce over 9 years ago

Review @ 0edcc26f:

sdk/python/arvados/commands/run.py
  • The slots stuff was really confusing to me, and I had to read the code and the man page a few times to tell what was going on. Can I suggest a comment to clarify what's going on here? Something like
    # Parse the command arguments into 'slots'.
    # All words following '>' are output arguments and are collected into slots[0].
    # All words following '<' are input arguments and are collected into slots[1].
    # slots[2] and on represent individual pipelines.
    #
    # e.g. arv-run foo arg1 arg2 '|' bar arg3 arg4 '<' input1 input2 input3 '>' output.txt
    # will be parsed into:
    #   [['output.txt'],
    #    ['input1', 'input2', 'input3'],
    #    ['foo', 'arg1', 'arg2'],
    #    ['bar', 'arg3', 'arg4']]
    
  • determine_project and is_in_collection should list which exceptions they're catching
  • import arvados.commands._util so we can include the default --retries argument, instead of defaulting max_retries to 3
  • L123-146:
    • I'm not sure I understand what I'm going on here. Is this here to catch command-line flags that may be hiding filenames that need to be translated, like --filename=zzzzz-4zz18-23u3hxugbm71qmn.txt or -fblurfl.txt? If so:
      • Add a comment explaining what we're doing.
      • Does this correctly handle directories or filenames with equal signs in them?
  • L191:
    • if f is more than one directory deep under the current dir, this looks like it will make the stream name just the top level directory. Shouldn't the stream name be os.path.basename(f) or am I confused?
    • If f does not include a "/" then the stream name will be the empty string "", not ".".
  • L245: trailing "\"?
  • L267: needs ensure_unique_name=True
crunch_scripts/run-command
  • I'm pretty lost about how the list parsing works here, particularly on what var_items, get_items and expand_item are for and how they fit together, and what overall goal they accomplish. Document at least the general logic flow for a new reader.
  • L395: I think this is equivalent to success = any([status == 0 for status in rcode.values()])? The reduce call is sort of confusing to me because it suggests that something is being accumulated.
Actions #17

Updated by Peter Amstutz over 9 years ago

  • Target version changed from 2014-10-29 sprint to 2014-11-19 sprint
Actions #18

Updated by Peter Amstutz over 9 years ago

Tim Pierce wrote:

Review @ 0edcc26f:

sdk/python/arvados/commands/run.py
  • The slots stuff was really confusing to me, and I had to read the code and the man page a few times to tell what was going on. Can I suggest a comment to clarify what's going on here? Something like
    [...]

Done

  • determine_project and is_in_collection should list which exceptions they're catching

Done

  • import arvados.commands._util so we can include the default --retries argument, instead of defaulting max_retries to 3

Done

  • L123-146:
    • I'm not sure I understand what I'm going on here. Is this here to catch command-line flags that may be hiding filenames that need to be translated, like --filename=zzzzz-4zz18-23u3hxugbm71qmn.txt or -fblurfl.txt? If so:
      • Add a comment explaining what we're doing.

That's correct. Added comments.

  • Does this correctly handle directories or filenames with equal signs in them?

I reworked the code a bit so that it tests to see if something is a filename first, instead of pattern matching first, so if a filename has a equals sign it should get recognized correctly.

I just added the ability to recognize directories in collections.

Currently directories are not uploaded automatically since I don't feel like I have enough data to know what the right behavior is; recursively uploading the contents of a directory seems much more dangerous than uploading explicitly specified files.

  • L191:
    • if f is more than one directory deep under the current dir, this looks like it will make the stream name just the top level directory. Shouldn't the stream name be os.path.basename(f) or am I confused?

os.path.split() in python returns 2 items, the entire leading directory path in [0] and the leaf name in [1]

  • If f does not include a "/" then the stream name will be the empty string "", not ".".

Fortunately, CollectionWriter already handles that case and turns "" into "."

  • L245: trailing "\"?

Fixed.

  • L267: needs ensure_unique_name=True

Actually I don't think this is necessary as we currently enforce uniqueness for pipeline instance names, but out of an abundance of caution I added it anyway.

crunch_scripts/run-command
  • I'm pretty lost about how the list parsing works here, particularly on what var_items, get_items and expand_item are for and how they fit together, and what overall goal they accomplish. Document at least the general logic flow for a new reader.

I added some comments, probably inadequate but I'm trying to turn this code around.

  • L395: I think this is equivalent to success = any([status == 0 for status in rcode.values()])? The reduce call is sort of confusing to me because it suggests that something is being accumulated.

Not quite, but I did come up with a use for any() that does do the right thing. I agree that using reduce() here is being excessively clever.

Thanks!

Actions #19

Updated by Tim Pierce over 9 years ago

Comments at 51de3bf2

doc/user/topics/run-command.html.textile.liquid
  • The documentation for list expansion functions uses a lot of repeated component names like a and b. That makes it confusing to tell which parameters are expanded to what values. This would be much easier to grasp if each example is rewritten to use unique parameter names.
  • The group list function in particular needs to be clarified. The inner foreach loop is puzzling. I can work out that the outer b variable in this scope is getting expanded to the lists ["alice", "carol", "dave"] and ["bob"] based on how they match the regex, but the nested foreach loops are confusing, especially that one foreach is on "$(b)" and one is on "b" -- do those both work and are they different?
  • The section List context describes what happens "when a parameter is evaluated in a list context", but it doesn't describe what a list context is or how the user puts an expression into a list context. Is a list context something the user has any control over, or is it just implicitly invoked by list functions like foreach?
crunch_scripts/run-command
  • The logic for evaluating a parameter template into an argument list is hard to follow. The run-command user documentation fills in some of the gaps here, which is helpful -- we should copy some or most of this into the source as well, to make reading the source easier.
  • It is true that this process is a little bit like having a tiny language interpreter in the code. We need to start checking inputs and types more strictly here, e.g. these fields are assumed to hold a string value:
    • c['foreach']
    • c['list']
    • c['var']
    • (others?)
  • But if one of these holds a list instead, or a dict, or None, things will start to go sideways quickly (e.g. var_items will return None for var and this will set params[None] to each item in turn). As far as I can tell, that's a badly formed pipeline and should not work anyway, but we should proactively detect those cases and raise an exception.
  • FWIW I think we are going to need to make the definition and parsing of this metalanguage more rigorous, and that should happen sooner rather than later, before we get locked into weird implementation idiosyncrasies that are hard to back out.
Actions #20

Updated by Peter Amstutz over 9 years ago

Tim Pierce wrote:

Comments at 51de3bf2

doc/user/topics/run-command.html.textile.liquid
  • The documentation for list expansion functions uses a lot of repeated component names like a and b. That makes it confusing to tell which parameters are expanded to what values. This would be much easier to grasp if each example is rewritten to use unique parameter names.

Improved the formatting and added explicit "var" naming instead of using the implicit variable shadowing.

  • The group list function in particular needs to be clarified. The inner foreach loop is puzzling. I can work out that the outer b variable in this scope is getting expanded to the lists ["alice", "carol", "dave"] and ["bob"] based on how they match the regex, but the nested foreach loops are confusing, especially that one foreach is on "$(b)" and one is on "b" -- do those both work and are they different?

Fixed.

  • The section List context describes what happens "when a parameter is evaluated in a list context", but it doesn't describe what a list context is or how the user puts an expression into a list context. Is a list context something the user has any control over, or is it just implicitly invoked by list functions like foreach?

I tried to clarify this, let me know if you have any more questions.

crunch_scripts/run-command
  • The logic for evaluating a parameter template into an argument list is hard to follow. The run-command user documentation fills in some of the gaps here, which is helpful -- we should copy some or most of this into the source as well, to make reading the source easier.

Added comments about what each of the functions do.

  • It is true that this process is a little bit like having a tiny language interpreter in the code. We need to start checking inputs and types more strictly here, e.g. these fields are assumed to hold a string value:
    • c['foreach']
    • c['list']
    • c['var']
    • (others?)
  • But if one of these holds a list instead, or a dict, or None, things will start to go sideways quickly (e.g. var_items will return None for var and this will set params[None] to each item in turn). As far as I can tell, that's a badly formed pipeline and should not work anyway, but we should proactively detect those cases and raise an exception.

Added more typechecking.

  • FWIW I think we are going to need to make the definition and parsing of this metalanguage more rigorous, and that should happen sooner rather than later, before we get locked into weird implementation idiosyncrasies that are hard to back out.

In terms of strategy, I don't want to continue extending run-command's bespoke template language, but instead start implementing the common workflow language tool description which (a) will use Javascript instead of a custom expression language and (b) have a somewhat narrower scope (things like task parallelization will be handled at a different layer). (Eventually, if we're using tool description files in pipelines, then run-command in its current form may only be necessary support arv-run.)

Actions #21

Updated by Tim Pierce over 9 years ago

This LGTM enough to merge, thanks. The documentation makes me more confident that a new and unfamiliar user (i.e. me) could write a template based on this description, and knowing that we expect the common workflow language to replace this templating system makes me feel a lot more comfortable with going with it in its current state.

Until then we should be aggressive about writing tests for these templates. I'll add a separate story for that.

Actions #22

Updated by Anonymous over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:788ecdf8085f5e69cd3dc960f510b49f11432cb3.

Actions

Also available in: Atom PDF