Bug #15910

closed

[crunch-run] Crash while writing output files

Added by Tom Clegg about 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

crunch-run process disappears while writing outputs (~1GB in ~1700 files in 1 dir).

Suspect this is caused by a data race: there's a concurrent map write that sometimes causes panics and complaints from the Go race detector in a test case.


Subtasks 1 (0 open, 1 closed)

Task #15914: Review 15910-crunch-run-crash | Resolved | Tom Clegg | 12/04/2019

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #15946: [crunch-run] [collectionfs] Deadlock while writing output collection | Resolved | Tom Clegg | 12/23/2019
Actions #4

Updated by Lucas Di Pentima about 5 years ago

Is the new test intended to expose the bug? I’ve locally tried to run it with an older version of fs_collection.go and it doesn’t fail.

Actions #5

Updated by Lucas Di Pentima about 5 years ago

From chat:

lucas: you might need to run it a few times before it crashes -- but "test services/crunch-run -race" (from run-tests interactive mode) seems to find it reliably

Running sdk/go/arvados tests with --repeat 50 on an older collection_fs.go file didn't fail for me.
I've also tried locally running the services/crunch-run -race tests using the interactive mode and they still fail with the fix:

[...]
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000002Z Goodbye
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000001Z Hello world!
zzzzz-zzzzzzzzzzzzzzz 2015-12-29T15:51:45.000000003Z Goodbye
OK: 71 passed
--- FAIL: TestCrunchExec (52.93s)
    testing.go:809: race detected during execution of test
FAIL
exit status 1
FAIL    git.curoverse.com/arvados.git/services/crunch-run    53.010s
======= services/crunch-run tests -- FAILED
======= test services/crunch-run -- 55s
Failures (1):
Fail: services/crunch-run tests (55s)

Are these race tests being run on our Jenkins test pipelines? If not, can we make Jenkins run them or is there any limitation on how they work?

Actions #6

Updated by Tom Clegg about 5 years ago

Jenkins doesn't enable the race detector. I see crunch-run test races here too (although the first few are just races between tested code and the tests themselves, so they might not indicate real bugs).

Both should be addressed. I don't think we should block this bugfix behind either of them. It would be good to verify that the race detector confirms this bugfix for you like it does for me, though.

I should have said:

"test sdk/go/arvados -race" (from run-tests interactive mode) seems to find it reliably

(The new test case in sdk/go/arvados is the one that sets off the race detector without the fix, but not with the fix.)

Actions #7

Updated by Lucas Di Pentima about 5 years ago

I've tried sdk/go/arvados -race and I'm getting a "signal: killed" message just before the failed test report. RAM usage climbs to 100% (even after I raised my VM to 9.4 GiB) right before the crash. I also tried going from 2 to 8 vCPUs and got the same result.

So I'm not able to confirm on my end but if you think it shouldn't block the fix, please go ahead.

Actions #8

Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved
Actions #9

Updated by Tom Clegg about 5 years ago

  • Related to Bug #15946: [crunch-run] [collectionfs] Deadlock while writing output collection added
Actions #10

Updated by Peter Amstutz almost 5 years ago

  • Release set to 22