[crunch-run] Crash while writing output files
crunch-run process disappears while writing outputs (~1GB in ~1700 files in 1 dir).
Suspect this is caused by a data race: there's a concurrent map write that sometimes causes panics and complaints from the Go race detector in a test case.
15910-crunch-run-crash @ 5f9e11ae95dc2b768ccc7e5f165be06b6baabdb2 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1683/
#5 Updated by Lucas Di Pentima 7 months ago
lucas: you might need to run it a few times before it crashes -- but "test services/crunch-run -race" (from run-tests interactive mode) seems to find it reliably
Running the sdk/go/arvados tests with --repeat 50 against an older collection_fs.go didn't fail for me.
I've also tried locally running the services/crunch-run -race tests using the interactive mode, and they still fail with the fix:
[...]
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000003Z Goodbye
OK: 71 passed
--- FAIL: TestCrunchExec (52.93s)
    testing.go:809: race detected during execution of test
FAIL
exit status 1
FAIL	git.curoverse.com/arvados.git/services/crunch-run	53.010s
======= services/crunch-run tests -- FAILED =======
test services/crunch-run -- 55s
Failures (1):
Fail: services/crunch-run tests (55s)
Are these race tests being run on our Jenkins test pipelines? If not, can we make Jenkins run them or is there any limitation on how they work?
Jenkins doesn't enable the race detector. I see crunch-run test races here too (although the first few are just races between tested code and the tests themselves, so they might not indicate real bugs).
Both should be addressed. I don't think we should block this bugfix behind either of them. It would be good to verify that the race detector confirms this bugfix for you like it does for me, though.
I should have said:
"test sdk/go/arvados -race" (from run-tests interactive mode) seems to find it reliably
(The new test case in sdk/go/arvados is the one that sets off the race detector without the fix, but not with the fix.)
#7 Updated by Lucas Di Pentima 7 months ago
I ran "test sdk/go/arvados -race" and I'm getting a "signal: killed" message before the failed test report. Checking RAM usage, it goes up to 100% utilization (even after I upped my VM to 9.4 GiB) just before it crashes. I also tried going from 2 to 8 vCPUs and got the same result.
So I'm not able to confirm it on my end, but if you think it shouldn't block the fix, please go ahead.