Project

General

Profile

Actions

Feature #18790

closed

Access live container logs through arvados-client and crunch-run container gateway

Added by Tom Clegg over 2 years ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
2.0
Release relationship:
Auto

Description

A command like this should show live logs on stdout, and exit when the container finishes:

$ arvados-client logs $container_request_uuid

To be implemented using Efficient live access to container logs

For now:
  • Show content from crunch-run.txt, stderr.txt, stdout.txt
  • No headers/prefixes showing which file the logs are coming from
  • Uses the endpoint on controller /arvados/v1/container_requests/{uuid}/log/{container_uuid}/{file} to poll the files.

(description updated after the fact to reflect API change, see #18790#note-23)


Subtasks 1 (0 open1 closed)

Task #20277: Review 18790-log-clientResolvedTom Clegg04/04/2023Actions

Related issues

Related to Arvados Epics - Idea #16442: Scalable + reliable container loggingResolved03/15/202308/31/2023Actions
Related to Arvados - Feature #19889: access current container logs at /arvados/v1/containers/{uuid}/log/{filename}ResolvedTom Clegg03/23/2023Actions
Related to Arvados - Idea #20458: Document arvados-client logs commandNewActions
Actions #1

Updated by Tom Clegg over 2 years ago

  • Related to Idea #16442: Scalable + reliable container logging added
Actions #2

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz over 1 year ago

  • Release set to 60
Actions #4

Updated by Tom Clegg over 1 year ago

  • Target version set to Future
Actions #5

Updated by Tom Clegg over 1 year ago

  • Release deleted (60)
Actions #6

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Related to Feature #19889: access current container logs at /arvados/v1/containers/{uuid}/log/{filename} added
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Story points set to 2.0
Actions #10

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to To be scheduled
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to Development 2023-04-12 sprint
Actions #12

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Tom Clegg
Actions #13

Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress

18790-log-client @ fc6b3cd79ba9e07810330d0d47a0ab89ad8857f7 -- developer-run-tests: #3578

(based on unmerged #19889 branch)

  • adds arvados-client logs {container-or-request-uuid} command.
  • poll interval adjusts between 2s and 10s depending on whether any logs have appeared recently.
  • ...or, if command line parameter -poll=1s is given, poll interval adjusts between 1s and 5s
  • each line is prefixed with the filename it came from (even though issue description says we're not doing that for now).
  • surely after getting 10s worth of stderr.txt and 10s worth of crunch-run.txt, we should merge-sort them before displaying them, but I haven't done this yet. Should it be in this branch, or should we save it for future work?
  • if one of the log files (presumably crunch-run.txt) has a line ending with "Complete", "Cancelled", or "Queued", we immediately reload the container record, because that is typically the last thing logged before finalizing the container, and we want to exit as soon as we notice the container is finalized.
  • if the UUID given on the command line is a container request UUID (as opposed to a container UUID) and the container whose logs we're watching gets requeued/retried, we switch to the new container UUID and keep logging.
  • no user-facing docs yet -- where should this go? Getting started at the command line might be too early, perhaps just before Analyzing workflow performance ?
Actions #14

Updated by Lucas Di Pentima about 1 year ago

Haven't looked in detail at the code, these comments are from my blackbox type testing:

  • When using the logs command against an older cluster, I think it should report some error instead of exiting with errorlevel 0.
  • If the provided UUID is from a non-existing CR, the command exits saying that it wasn't found. If the provided UUID is from a non-existing container, it keeps trying while reporting the raw 404 error.
  • If the provided UUID has an incorrect resource prefix, it errors out correctly, but if it has a CR/Container prefix but the rest is wrong (example: zzzzz-dz642-), it tries to use it, giving the repeating 404 errors described above.
  • Waited for tordo to get deployed with d9fdb96 merged and tried a revsort wf. Got this:
    $ ./arvados-client logs tordo-xvhdp-g7x47zuycvzhguo
    connecting to container tordo-dz642-g46qwpu9o1ugk1t
    error getting stderr.txt: 416 Requested Range Not Satisfiable
    error getting crunch-run.txt: 500 Internal Server Error
    error getting crunch-run.txt: 500 Internal Server Error
    error getting crunch-run.txt: 500 Internal Server Error
    [...repeats until ctrl-c'ed...]
    
  • When requesting logs from a finished container (e.g. the mentioned above), it says 'connecting to <uuid>' and then outputs "error getting crunch-run.txt: 500 Internal Server Error" repeatedly. I think it should just say that the container is not running anymore instead of "connecting..." and exit, if the idea is to only show live logs. Maybe point the user to the log collection's UUID?
Actions #15

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-13:

  • if one of the log files (presumably crunch-run.txt) has a line ending with "Complete", "Cancelled", or "Queued", we immediately reload the container record, because that is typically the last thing logged before finalizing the container, and we want to exit as soon as we notice the container is finalized.

What happens if the running command logs a bunch of status messages like Sample 00000001 Complete, Sample 00000002 Queued, … ? This description sounds like it will cause a container record reload every poll interval. If I followed that right, that's not the end of the world, but it seems like a regrettable source of API server load.

I don't want to sweat an occasional needless reload, but would it be reasonable to put in guardrails to prevent this pathological case? Either by doing a fuller match on the line, or checking only the lines from crunch-run.txt while excluding other files?

I think my vote is for option 2. I think in general this section of the user guide could maybe benefit from some reorganization, but that's definitely out of scope for this ticket.

Actions #16

Updated by Tom Clegg about 1 year ago

  • When using the logs command against an older cluster, I think it should report some error instead of exiting with errorlevel 0.

Indeed. It was assuming that 404 meant "no stderr.txt file in log collection" rather than "API endpoint does not exist". I've added an OPTIONS request and a check for the WebDAV header . If that's missing, you get this:

$ arvados-client logs 2xpu4-xvhdp-owbbxbmz30lbnj1
connecting to container 2xpu4-dz642-xicm99f80mqe0ht
server does not support container logs API (OPTIONS request returned HTTP 200 OK, Dav: "")
$ echo $?
1
  • If the provided UUID is from a non-existing CR, the command exits saying that it wasn't found. If the provided UUID is from a non-existing container, it keeps trying while reporting the raw 404 error.

Trying to avoid making a (possibly transient) container lookup failure a fatal error here. I've adjusted the logic so a container lookup failure is fatal until we successfully receive any data, and after that we retry.

(Am I overdoing this? Should we just bail out on errors and let the user re-run?)

  • If the provided UUID has an incorrect resource prefix, it errors out correctly, but if it has a CR/Container prefix but the rest is wrong (example: zzzzz-dz642-), it tries to use it, giving the repeating 404 errors described above.

Improved the error-checking to check for some more obvious examples (like zzzzz-dz642-). Less-obvious nonexistent UUIDs are covered by the retry logic adjustment above.

$ arvados-client logs 2xpu4-xvhdp-owbbxbmz30lbnj
target UUID is not a container or container request UUID: 2xpu4-xvhdp-owbbxbmz30lbnj
$ arvados-client logs 2xpu4-xvhdp-owbbxbmz30lbnj2
container request "2xpu4-xvhdp-owbbxbmz30lbnj2" not found
  • Waited for tordo to get deployed with d9fdb96 merged and tried a revsort wf. Got this:
    [...]

"error getting stderr.txt: 416 Requested Range Not Satisfiable" -- oops, it turns out requesting "bytes=0-0" of a nonexistent file returns 416 instead of 404. I've changed that to request "bytes=-0", which returns 404.

  • When requesting logs from a finished container (e.g. the mentioned above), it says 'connecting to <uuid>' and then outputs "error getting crunch-run.txt: 500 Internal Server Error" repeatedly. I think it should just say that the container is not running anymore instead of "connecting..." and exit, if the idea is to only show live logs. Maybe point the user to the log collection's UUID?

The intent is to show the last few KB if the container is already finished.

The underlying problem here was that the cluster config didn't allow controller to connect to keep-web. Fixes:
  • note the new requirement on the "upgrading" page
  • show the response body when reporting an API error

Testing tordo now looks like this:

$ arvados-client logs tordo-xvhdp-arvyjwhn4jmmo15
connecting to container tordo-dz642-ldzih14s3aw114j
error getting crunch-run.txt: HTTP 500 Internal Server Error -- {"errors":["no internalURLs configured for WebDAV service"]}

I also changed the code and install docs to use WebDAV.InternalURLs instead of WebDAVDownload.InternalURLs. (WebDAVDownload sort of made sense because the proxy feature uses keep-web in download mode. However, keep-web looks at WebDAV.InternalURLs when choosing a listening address, and WebDAVDownload.InternalURLs are currently not used for anything, so I think it makes more sense for everyone to use WebDAV.InternalURLs than to require them to be identical or figure out what it means for them to be different.)

Actions #17

Updated by Tom Clegg about 1 year ago

Brett Smith wrote in #note-15:

I don't want to sweat an occasional needless reload, but would it be reasonable to put in guardrails to prevent this pathological case? Either by doing a fuller match on the line, or checking only the lines from crunch-run.txt while excluding other files?

Yes, checking only the lines from crunch-run.txt is an easy fix for this.

18790-log-client @ 1c534c7835db9be61f76d0a68fdaedf640848f68 -- developer-run-tests: #3581

retry wb1 developer-run-tests-apps-workbench-integration: #3863

Actions #18

Updated by Tom Clegg about 1 year ago

18790-log-client @ ba6af950d1cfa7606c2bea02916580dd2f57b174 -- developer-run-tests: #3582

retry wb1 developer-run-tests-apps-workbench-integration: #3865

I noticed adding the sort-by-timestamp feature was too easy to skip.

Actions #19

Updated by Lucas Di Pentima about 1 year ago

Sorry for the delay, I took the opportunity to correct tordo's config so we can do some manual testing. Here're some more comments:

  • I've added a couple of commits with salt installer updates that reflect the new upgrade notes.
  • I think the "connecting to container xxxxx" message could be more useful/less confusing if it's changed to "connected to container xxxxxx" on a successful connection. My particular case is when running arvados-client logs xxxxxx on a container that is still waiting for its instance to start, it seems that the message "connecting..." could give the impression to the operator that it's trying to connect but haven't done it yet (because no other output is printed on screen for some time)
  • Do you think we could add some explicit diagnostics tool test for this feature? Not sure if it's too much scope creeping.
  • I tried running someting on tordo and cancelling, the logs viewer didn't realized that it should finish:
    crunch-run.txt 2023-04-10T15:48:02.229198010Z caught signal: terminated
    crunch-run.txt 2023-04-10T15:48:02.229227792Z stopping container
    crunch-run.txt 2023-04-10T15:48:04.704751337Z Cancelled
    Get "https://tordo.arvadosapi.com/arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/crunch-run.txt": context deadline exceeded
    Get "https://tordo.arvadosapi.com/arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/crunch-run.txt": context deadline exceeded
    [... repeats ...]
    
  • Cancelling and running arvados-client logs tordo-xvhdp-d60uvzf6m1eb5sx again (from the previous test) gives me the following message: Options "https://tordo.arvadosapi.com/arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/": context deadline exceeded
Actions #20

Updated by Tom Clegg about 1 year ago

salt installer

Excellent, thanks.

diagnostics

Sounds valuable, given our experience here. I think we should add some checks in arvados-client diagnostics. Process-wise, maybe best to make a separate issue#, and do it after this is working/merged?

context deadline exceeded

Well, there is a design flaw here. I forgot the container log collection gets trashed as soon as the container finishes (at that point the log data gets copied into a new container request log collection, and the user is supposed to access it that way). So the idea of getting the log collection data from keep-web by PDH really doesn't work.
  • as an aside, workbench2 doesn't seem to know how to look up container UUIDs -- it thinks they are container request UUIDs, and tries to load /arvados/v1/container_requests/{cr_uuid}, which of course always returns 404.
  • as an aside, when controller proxies the containers/{uuid}/log request to keep-web with a PDH, keep-web uses controller to look up the PDH, which tries railsapi first, and then tries to load it from remote clusters, which return 401 because we're using the local cluster's SystemRootToken. Clearly we shouldn't be checking the remote clusters at all here -- perhaps we need to send keep-web a bypass_federation flag.
    • as an aside, keep-web's response here (a) only mentions the 401 error, not the 404 from local railsapi, and (b) takes the full RequestTimeout (5m) instead of failing right away. Evidently, controller is returning 502 instead of 404 (because of the 401s), and keep-web is auto-retrying for 5 minutes instead of returning 404 or 401.
      • keep-web shouldn't be enabling the SDK auto-retry
      • even when auto-retry is enabled, SDK shouldn't auto-retry 401
      • as an aside, keep-web should, but evidently doesn't, cancel the upstream request when the caller (arvados-client logs) hangs up after 20s.
Things we can do:
  • If the original target was a container request, get log_uuid from the container request record, and fetch /{container_uuid}/{filename} from there. (This is how "access logs from a container you ran" is intended to work in general.)
  • If the original target was a container, set include_trash=true in the upstream request to keep-web, and add a feature in keep-web to accept that flag via HTTP header (X-Arvados-Include-Trash) or query param. Sitefs doesn't currently expose trash, but I don't think it would be too hard to add a /trash_by_id virtual dir feature in sitefs, which keep-web would enable.
Actions #21

Updated by Lucas Di Pentima about 1 year ago

I think the 2nd bulletpoint (enabling trash on keep-web) although seems useful, it does also seem too big of a story to add it here. Can we get the corresponding CR for the specified container target and apply the 1st bulletpoint on it?

Actions #22

Updated by Tom Clegg about 1 year ago

Lucas Di Pentima wrote in #note-21:

I think the 2nd bulletpoint (enabling trash on keep-web) although seems useful, it does also seem too big of a story to add it here. Can we get the corresponding CR for the specified container target and apply the 1st bulletpoint on it?

Oh, yeah, I meant to add that option too. Also: assuming we look up a matching CR, I think the only potential advantage of the "show trashed log" version is that it can work for a container that no longer matches the container_uuid of any readable container request due to auto-retry. And in that case, the container itself won't be readable, so we won't even get as far as finding the trashed log PDH. So in practice it doesn't accomplish anything at all, except for an admin user.

Unfortunately, until we fix the underlying "lose relation to prior attempts" issue, I think we're stuck with a race condition where copying a running container UUID into a "logs" command line just before it ends/retries can cause an unrecoverable error.

Actions #23

Updated by Tom Clegg about 1 year ago

Do we need to go back and fix the design so the controller endpoint is /arvados/v1/container_requests/{cr_uuid}/log/ where .../log/{c_uuid}/stderr.txt either proxies through to live logs (if c_uuid is the currently assigned container and it's still running) or proxies to keep-web to serve saved logs from the container request log?
  • The saved container request log already has that directory structure.
  • I think this will continue to make sense if/when we implement service containers where a request can be associated with multiple concurrent containers (run N instances of a service and watch all of their logs). Controller would merge the log_events streams from N containers.
Actions #24

Updated by Tom Clegg about 1 year ago

  • Target version changed from Development 2023-04-12 sprint to Development 2023-04-26 sprint
Actions #25

Updated by Tom Clegg about 1 year ago

18790-log-client @ caa54be9b2984ce2d08dcc6ad377e7f0be339dae -- developer-run-tests: #3616

This branch is a bit of a monster, merging main and #20319 multiple times. I'm putting it here to establish that it actually works, but we might want to merge #20319 and then rebase this before reviewing it...?

Actions #26

Updated by Tom Clegg about 1 year ago

Rebased, now it's just the previously reviewed branch + merge main + change to new API.

18790-log-client @ e7cb79569f8047de64ccec7ec4382bf0f4b7423d -- developer-run-tests: #3622

retry wb1 developer-run-tests-apps-workbench-integration: #3907

Actions #27

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-04-26 sprint to Development 2023-05-10 sprint
Actions #28

Updated by Lucas Di Pentima about 1 year ago

Some last comments:

  • The help message is a bit confusing (there're "log" & "logs" subcommands, container-uuid param is not documented when using -help):
    $ ./arvados-client logs
    missing required argument: container-uuid (try -help)
    $ ./arvados-client logs -help
    Usage of ./arvados-client logs:
      -poll duration
            minimum duration to wait before polling for new data (default 2s)
    
  • Do you think it would be convenient to emulate the way journalctl works? that is, without "-f" just show the available logs, with "-f" enable the polling behavior.
  • I looked for specific documentation on how to look for logs but it doesn't seem to be anything obvious, except for some mention on the old user guide (wb1). I guess we need at least a ticket for documenting this feature from the user perspective, or update some existing documentation that I may have overlooked?

The rest LGTM.

Actions #29

Updated by Tom Clegg about 1 year ago

  • Related to Idea #20458: Document arvados-client logs command added
Actions #30

Updated by Tom Clegg about 1 year ago

Lucas Di Pentima wrote in #note-28:

Some last comments:

  • The help message is a bit confusing (there're "log" & "logs" subcommands, container-uuid param is not documented when using -help):

Ah. The missing container-uuid param in the help message (which I've now updated to container-request-uuid) was supposed to be shown by the flag parsing wrapper in lib/cmd, but that relied on the default value of flagset.Usage being nil, and the default has since changed to a private func. With that fixed, we get

Usage: arvados-client logs [options] container-request-uuid
  -f    follow: poll for new data until the container finishes
  -poll duration
        minimum duration to wait before polling for new data (default 2s)
  • Do you think it would be convenient to emulate the way journalctl works? that is, without "-f" just show the available logs, with "-f" enable the polling behavior.

Yes, updated so we don't do poll/follow unless -f is given.

  • I looked for specific documentation on how to look for logs but it doesn't seem to be anything obvious, except for some mention on the old user guide (wb1). I guess we need at least a ticket for documenting this feature from the user perspective, or update some existing documentation that I may have overlooked?

Yes, I think this deserves some thought beyond just finding a place to say "here's what arvados-client logs does". Added #20458.

18790-log-client @ 680a7bbd73fa20861ab4213b36947b835304532d -- developer-run-tests: #3623

retry wb1 developer-run-tests-apps-workbench-integration: #3909

Actions #31

Updated by Lucas Di Pentima about 1 year ago

This LGTM, thanks!

Actions #32

Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved
Actions #33

Updated by Tom Clegg 10 months ago

  • Description updated (diff)
Actions #34

Updated by Peter Amstutz 10 months ago

  • Release set to 66
Actions

Also available in: Atom PDF