Project

General

Profile

Actions

Feature #19886

closed

crunch-run commits log collection at container start

Added by Peter Amstutz about 1 month ago. Updated 3 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
01/13/2023
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5
Release relationship:
Auto

Description

The new Resources panel uses node.json from the log collection. This doesn't appear until the 1st commit of the log collection, which only happens at the 1st checkpoint (after 30 minutes of running, I think?) or the end of the run.

crunch-run should make an early commit of the log collection once the system info files (node.json, node-info.txt, etc) have been recorded.


Files

19886-prove-tests-unrun.patch (1.14 KB) 19886-prove-tests-unrun.patch Brett Smith, 01/13/2023 12:04 AM

Subtasks 1 (0 open1 closed)

Task #19903: Review 19886-crunch-run-early-log-commitResolvedBrett Smith01/13/2023

Actions
Actions #1

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to New
  • Category set to Crunch
  • Tracker changed from Bug to Feature
Actions #3

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
  • Subject changed from crunch-run commits logs at container start to crunch-run commits log collection at container start
Actions #4

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 25 days ago

  • Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Actions #6

Updated by Tom Clegg 25 days ago

  • Story points set to 0.5
  • Assigned To set to Tom Clegg
Actions #7

Updated by Peter Amstutz 24 days ago

  • Target version changed from 2023-02-01 sprint to 2023-01-18 sprint
Actions #8

Updated by Peter Amstutz 24 days ago

  • Assigned To changed from Tom Clegg to Brett Smith
Actions #9

Updated by Brett Smith 16 days ago

I have a branch up. 19886-crunch-run-early-log-commit @ 40d8256b92b5e2b64f1ea5a65a03e01e1763a6d9 - developer-run-tests: #3439

The tests are failing with:

17:06:19 ----------------------------------------------------------------------
17:06:19 FAIL: crunchrun_test.go:1714: TestSuite.TestStdoutWithMultipleMountPointsUnderOutputDir
17:06:19 
17:06:19 SecretMounts decoded {map[]} json "{\"secret_mounts\":null}" 
17:06:19 crunchrun_test.go:1756:
17:06:19     c.Check(v["ensure_unique_name"], Equals, true)
17:06:19 ... obtained = nil
17:06:19 ... expected bool = true

Unfortunately this has led me down a rabbit hole, because there's a disconnect between the tests and reality. It looks like what the tests intend to test is that an output collection is created with a certain manifest. However, they do this by searching for the output collection in a loop, and then checking matching results. Right now there are no matching results, and the tests never check that, which is why they're quietly passing on main. You can prove this to yourself by applying the attached patch to main and running the tests.

I've done a little digging but it's still not clear to me whether the tests should pass, and there's a problem in crunch-run or the test harnesses; or if these tests are no longer relevant (at least as written), and they just conveniently took themselves out of commission. I don't mind doing a little test updating as part of this branch but I would appreciate a couple pointers on this since it's not directly relevant to this story, and I'm guessing a little history goes a long way to finding the right solution.

Actions #10

Updated by Brett Smith 15 days ago

  • Status changed from New to In Progress

19886-crunch-run-early-log-commit @ 499848b567ebecadbdff1fcbbdb4683dfbf097da - developer-run-tests: #3441 - Ready for review

Actions #11

Updated by Tom Clegg 10 days ago

Hm. The documentation says the container log will be null or a UUID until the container is finished. The existing updateLogs loop periodically updates it to a PDH. And this branch updates it to a UUID during the Locked→Running transition.

In RailsAPI, Container#update_cr_logs calls ContainerRequest#update_collections, which assumes the log is a PDH.

I think the easy solution is for the new code to save a PDH instead of UUID, and update the documentation to say "...otherwise a PDH, UUID, or null", and a better solution (unless I'm missing some reason this won't work) is to update RailsAPI and crunch-run to do what the current documentation says.

Other than that, this LGTM, thanks.

Actions #12

Updated by Peter Amstutz 10 days ago

  • Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Actions #13

Updated by Brett Smith 10 days ago

Tom Clegg wrote in #note-11:

I think the easy solution is for the new code to save a PDH instead of UUID, and update the documentation to say "...otherwise a PDH, UUID, or null", and a better solution (unless I'm missing some reason this won't work) is to update RailsAPI and crunch-run to do what the current documentation says.

I'm not sure I understand why the second option is better? ContainerRequest#update_collections ensures that every request has its own log collection. Assuming we maintain that behavior, I'm not sure what the "log is a UUID before container is finished" rule is intended to accomplish for us, in terms of implementation flexibility or what.

Without more background, it seems to me like the best option that minimizes work while maximizing consistency is to update crunch-run to always send a PDH, and update the documentation to say that the log will always be a PDH or null, perhaps with a note that this restriction may be relaxed to allow UUIDs in the future. Is there some reason I'm missing that's not a good idea?

Actions #14

Updated by Tom Clegg 10 days ago

I'm not sure I understand why the second option is better?

I was thinking it would be better in that it would allow a client to get the log collection UUID and then watch that collection for updates, rather than having to use a special-to-container-logs approach, and would eliminate the "log pdh is now X" API call after saving the collection. Practically speaking neither seems like a huge deal though.

Without more background, it seems to me like the best option that minimizes work while maximizing consistency is to update crunch-run to always send a PDH, and update the documentation to say that the log will always be a PDH or null, perhaps with a note that this restriction may be relaxed to allow UUIDs in the future. Is there some reason I'm missing that's not a good idea?

I agree this is what we should do for now, and maybe we'll revisit the UUID option in the future.

Actions #15

Updated by Brett Smith 10 days ago

Tom Clegg wrote in #note-14:

I'm not sure I understand why the second option is better?

I was thinking it would be better in that it would allow a client to get the log collection UUID and then watch that collection for updates, rather than having to use a special-to-container-logs approach

Isn't this provided by the log propagation the API server does though? When a container request is updated with a container's new log, it makes a point to load and update its existing collection. Following updates for a specific container request's logs seems like it should just work. I realize you said "container logs" and I'm saying "container request logs" and those aren't the same thing, but practically is there any reason for clients to prefer reading the container logs directly over the container request logs?

I agree this is what we should do for now, and maybe we'll revisit the UUID option in the future.

19886-crunch-run-early-log-commit @ 6477db8a1f5ed0b2f81cf743bbea32c681c7c10c - developer-run-tests: #3449

Actions #16

Updated by Tom Clegg 9 days ago

If a running container is updated with a portable data hash in its log field, that collection will be propagated to the logs of associated container request(s). This behaivor may be extended to UUID updates in the future.

I'm not sure it fits here to say "if X happens" where X is never supposed to be done other than by a system component. Compare started_at -- "Null if container has not yet started" rather than getting into whether/how it gets updated through the API by a dispatcher. Maybe we should stick with "PDH when log data is available, otherwise null" and leave out the "we might change this API" part? (I know I asked for it earlier but the UUID angle seems less appealing every day.)

That would get rid of "behaivor" too ;)

The code update LGTM.

Actions #17

Updated by Brett Smith 9 days ago

Tom Clegg wrote in #note-16:

I'm not sure it fits here to say "if X happens" where X is never supposed to be done other than by a system component.

I'll only do one round of this, since it's kind of tangential to the branch, but this kind of opens a philosophical question about who's the audience for the API documentation, yeah? I agree only a dispatcher should be updating this field—which means someone who's writing their own dispatcher might like to know about this behavior in the API server. And we have clear enough boundaries between components you could actually imagine someone doing that.

Actions #18

Updated by Tom Clegg 9 days ago

I'd say the audience of that page is application developers. Arvados developers currently rely on reading the railsapi source code.

Thinking as a non-Arvados-developer, the proposed wording ("if a running container is updated...") makes me wonder "is this suggesting I/anyone can update a container log with a PDH? with a non-PDH? or is this trying to tell me I always need to check the container log because sometimes the CR log field doesn't get auto-updated?"

It might be more productive to enforce "container log must be pdh or nothing" in railsapi because it does not handle UUIDs in a desirable way. Then the documented API is much friendlier: c.log is a PDH and cr.log_uuid has the latest saved logs.

Actions #19

Updated by Brett Smith 8 days ago

I've updated the documentation in the branch at 7f359d51c3dca6c04003d56f3f18f86cd0916bae. I admit I'm not doing a test run since this new commit just touches one line of documentation, and there aren't any links in that line that could be broken.

I'm personally fine with a Rails API change like you suggest too, but IMO it should be a separate story, mostly because I think we should try to get as much feedback as practical before narrowing the API like that. I don't think that's a change that should get rolled into a branch that was never intended to touch the API server.

Actions #20

Updated by Tom Clegg 8 days ago

Technically it's crunch-run that does the updating so "the dispatcher will update" is perhaps slightly misleading. (Or we could just say "system" if we're trying to focus on what to expect from the system rather than how its parts work.)

Either way LGTM, thanks.

Actions #21

Updated by Brett Smith 8 days ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #22

Updated by Peter Amstutz 3 days ago

  • Release set to 57
Actions

Also available in: Atom PDF