Feature #19886
closedcrunch-run commits log collection at container start
Description
The new Resources panel uses node.json from the log collection. This doesn't appear until the 1st commit of the log collection, which only happens at the 1st checkpoint (after 30 minutes of running, I think?) or the end of the run.
crunch-run should make an early commit of the log collection once the system info files (node.json, node-info.txt, etc) have been recorded.
Files
Updated by Peter Amstutz almost 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
- Status changed from In Progress to New
- Category set to Crunch
- Tracker changed from Bug to Feature
Updated by Peter Amstutz almost 2 years ago
- Description updated (diff)
- Subject changed from crunch-run commits logs at container start to crunch-run commits log collection at container start
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Updated by Tom Clegg almost 2 years ago
- Story points set to 0.5
- Assigned To set to Tom Clegg
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-02-01 sprint to 2023-01-18 sprint
Updated by Peter Amstutz almost 2 years ago
- Assigned To changed from Tom Clegg to Brett Smith
Updated by Brett Smith almost 2 years ago
I have a branch up. 19886-crunch-run-early-log-commit @ 40d8256b92b5e2b64f1ea5a65a03e01e1763a6d9 - developer-run-tests: #3439
The tests are failing with:
17:06:19 ---------------------------------------------------------------------- 17:06:19 FAIL: crunchrun_test.go:1714: TestSuite.TestStdoutWithMultipleMountPointsUnderOutputDir 17:06:19 17:06:19 SecretMounts decoded {map[]} json "{\"secret_mounts\":null}" 17:06:19 crunchrun_test.go:1756: 17:06:19 c.Check(v["ensure_unique_name"], Equals, true) 17:06:19 ... obtained = nil 17:06:19 ... expected bool = true
Unfortunately this has led me down a rabbit hole, because there's a disconnect between the tests and reality. It looks like what the tests intend to test is that an output collection is created with a certain manifest. However, they do this by searching for the output collection in a loop, and then checking matching results. Right now there are no matching results, and the tests never check that, which is why they're quietly passing on main. You can prove this to yourself by applying the attached patch to main and running the tests.
I've done a little digging but it's still not clear to me whether the tests should pass, and there's a problem in crunch-run or the test harnesses; or if these tests are no longer relevant (at least as written), and they just conveniently took themselves out of commission. I don't mind doing a little test updating as part of this branch but I would appreciate a couple pointers on this since it's not directly relevant to this story, and I'm guessing a little history goes a long way to finding the right solution.
Updated by Brett Smith almost 2 years ago
- Status changed from New to In Progress
19886-crunch-run-early-log-commit @ 499848b567ebecadbdff1fcbbdb4683dfbf097da - developer-run-tests: #3441 - Ready for review
Updated by Tom Clegg almost 2 years ago
Hm. The documentation says the container log will be null or a UUID until the container is finished. The existing updateLogs loop periodically updates it to a PDH. And this branch updates it to a UUID during the Locked→Running transition.
In RailsAPI, Container#update_cr_logs calls ContainerRequest#update_collections, which assumes the log is a PDH.
I think the easy solution is for the new code to save a PDH instead of UUID, and update the documentation to say "...otherwise a PDH, UUID, or null", and a better solution (unless I'm missing some reason this won't work) is to update RailsAPI and crunch-run to do what the current documentation says.
Other than that, this LGTM, thanks.
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-11:
I think the easy solution is for the new code to save a PDH instead of UUID, and update the documentation to say "...otherwise a PDH, UUID, or null", and a better solution (unless I'm missing some reason this won't work) is to update RailsAPI and crunch-run to do what the current documentation says.
I'm not sure I understand why the second option is better? ContainerRequest#update_collections
ensures that every request has its own log collection. Assuming we maintain that behavior, I'm not sure what the "log is a UUID before container is finished" rule is intended to accomplish for us, in terms of implementation flexibility or what.
Without more background, it seems to me like the best option that minimizes work while maximizing consistency is to update crunch-run to always send a PDH, and update the documentation to say that the log will always be a PDH or null, perhaps with a note that this restriction may be relaxed to allow UUIDs in the future. Is there some reason I'm missing that's not a good idea?
Updated by Tom Clegg almost 2 years ago
I'm not sure I understand why the second option is better?
I was thinking it would be better in that it would allow a client to get the log collection UUID and then watch that collection for updates, rather than having to use a special-to-container-logs approach, and would eliminate the "log pdh is now X" API call after saving the collection. Practically speaking neither seems like a huge deal though.
Without more background, it seems to me like the best option that minimizes work while maximizing consistency is to update crunch-run to always send a PDH, and update the documentation to say that the log will always be a PDH or null, perhaps with a note that this restriction may be relaxed to allow UUIDs in the future. Is there some reason I'm missing that's not a good idea?
I agree this is what we should do for now, and maybe we'll revisit the UUID option in the future.
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-14:
I'm not sure I understand why the second option is better?
I was thinking it would be better in that it would allow a client to get the log collection UUID and then watch that collection for updates, rather than having to use a special-to-container-logs approach
Isn't this provided by the log propagation the API server does though? When a container request is updated with a container's new log, it makes a point to load and update its existing collection. Following updates for a specific container request's logs seems like it should just work. I realize you said "container logs" and I'm saying "container request logs" and those aren't the same thing, but practically is there any reason for clients to prefer reading the container logs directly over the container request logs?
I agree this is what we should do for now, and maybe we'll revisit the UUID option in the future.
19886-crunch-run-early-log-commit @ 6477db8a1f5ed0b2f81cf743bbea32c681c7c10c - developer-run-tests: #3449
Updated by Tom Clegg almost 2 years ago
If a running container is updated with a portable data hash in its log field, that collection will be propagated to the logs of associated container request(s). This behaivor may be extended to UUID updates in the future.
I'm not sure it fits here to say "if X happens" where X is never supposed to be done other than by a system component. Compare started_at
-- "Null if container has not yet started" rather than getting into whether/how it gets updated through the API by a dispatcher. Maybe we should stick with "PDH when log data is available, otherwise null" and leave out the "we might change this API" part? (I know I asked for it earlier but the UUID angle seems less appealing every day.)
That would get rid of "behaivor" too ;)
The code update LGTM.
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-16:
I'm not sure it fits here to say "if X happens" where X is never supposed to be done other than by a system component.
I'll only do one round of this, since it's kind of tangential to the branch, but this kind of opens a philosophical question about who's the audience for the API documentation, yeah? I agree only a dispatcher should be updating this field—which means someone who's writing their own dispatcher might like to know about this behavior in the API server. And we have clear enough boundaries between components you could actually imagine someone doing that.
Updated by Tom Clegg almost 2 years ago
I'd say the audience of that page is application developers. Arvados developers currently rely on reading the railsapi source code.
Thinking as a non-Arvados-developer, the proposed wording ("if a running container is updated...") makes me wonder "is this suggesting I/anyone can update a container log with a PDH? with a non-PDH? or is this trying to tell me I always need to check the container log because sometimes the CR log field doesn't get auto-updated?"
It might be more productive to enforce "container log must be pdh or nothing" in railsapi because it does not handle UUIDs in a desirable way. Then the documented API is much friendlier: c.log is a PDH and cr.log_uuid has the latest saved logs.
Updated by Brett Smith almost 2 years ago
I've updated the documentation in the branch at 7f359d51c3dca6c04003d56f3f18f86cd0916bae. I admit I'm not doing a test run since this new commit just touches one line of documentation, and there aren't any links in that line that could be broken.
I'm personally fine with a Rails API change like you suggest too, but IMO it should be a separate story, mostly because I think we should try to get as much feedback as practical before narrowing the API like that. I don't think that's a change that should get rolled into a branch that was never intended to touch the API server.
Updated by Tom Clegg almost 2 years ago
Technically it's crunch-run that does the updating so "the dispatcher will update" is perhaps slightly misleading. (Or we could just say "system" if we're trying to focus on what to expect from the system rather than how its parts work.)
Either way LGTM, thanks.
Updated by Brett Smith almost 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|c4d1131613de217719929ff05a06f19321cb5fdc.