Bug #15910
closed [crunch-run] Crash while writing output files
Description
crunch-run process disappears while writing outputs (~1GB in ~1700 files in 1 dir).
Suspect this is caused by a data race: there's a concurrent map write that sometimes causes panics and complaints from the Go race detector in a test case.
Related issues
Updated by Tom Clegg almost 5 years ago
15910-crunch-run-crash @ 5f9e11ae95dc2b768ccc7e5f165be06b6baabdb2 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1683/
Updated by Lucas Di Pentima almost 5 years ago
Is the new test intended to expose the bug? I tried running it locally against an older version of fs_collection.go and it doesn't fail.
Updated by Lucas Di Pentima almost 5 years ago
From chat:
lucas: you might need to run it a few times before it crashes -- but "test services/crunch-run -race" (from run-tests interactive mode) seems to find it reliably
Running the sdk/go/arvados tests with --repeat 50 on an older collection_fs.go file didn't fail for me.
I've also tried locally running the services/crunch-run -race tests in interactive mode, and they still fail with the fix:
[...]
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000003Z Goodbye
OK: 71 passed
--- FAIL: TestCrunchExec (52.93s)
testing.go:809: race detected during execution of test
FAIL
exit status 1
FAIL git.curoverse.com/arvados.git/services/crunch-run 53.010s
======= services/crunch-run tests -- FAILED =======
test services/crunch-run -- 55s
Failures (1):
Fail: services/crunch-run tests (55s)
Are these race tests being run on our Jenkins test pipelines? If not, can we make Jenkins run them or is there any limitation on how they work?
Updated by Tom Clegg almost 5 years ago
Jenkins doesn't enable the race detector. I see crunch-run test races here too (although the first few are just races between tested code and the tests themselves, so they might not indicate real bugs).
Both should be addressed. I don't think we should block this bugfix behind either of them. It would be good to verify that the race detector confirms this bugfix for you like it does for me, though.
I should have said:
"test sdk/go/arvados -race" (from run-tests interactive mode) seems to find it reliably
(The new test case in sdk/go/arvados is the one that sets off the race detector without the fix, but not with the fix.)
Updated by Lucas Di Pentima almost 5 years ago
I've tried sdk/go/arvados -race and I'm getting a "signal: killed" message before the failed test report. RAM usage climbs to 100% just before it crashes, even after I increased my VM's memory to 9.4 GiB. I also tried going from 2 to 8 vCPUs and got the same result.
So I'm not able to confirm on my end but if you think it shouldn't block the fix, please go ahead.
Updated by Tom Clegg almost 5 years ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg almost 5 years ago
- Related to Bug #15946: [crunch-run] [collectionfs] Deadlock while writing output collection added