Bug #4297
closed[Crunch] crunch-dispatch eats way too much cpu
Added by Ward Vandewege about 10 years ago. Updated about 10 years ago.
Updated by Ward Vandewege about 10 years ago
crunch-dispatch in --jobs mode eats way to much cpu. The ps output below has 2 ruby processess owned by root. Those are crunch-dispatch processes in --jobs mode. One is in drain mode but has been running for a few days. The other one is actively tracking 2 local jobs, and is waiting to start 24 queued jobs that are blocked by not enough compute nodes.
24383 root 20 0 309m 172m 6636 R 92.7 1.2 74:48.85 ruby 30182 www-data 20 0 500m 350m 4416 S 72.0 2.3 6:26.02 ruby 29983 www-data 20 0 594m 424m 4376 S 61.1 2.8 7:47.53 ruby 8601 root 20 0 125m 53m 4836 S 40.3 0.4 0:01.22 ruby 30062 www-data 20 0 448m 296m 4440 S 20.1 2.0 9:22.83 ruby 8566 root 20 0 111m 37m 4964 S 6.9 0.2 0:01.06 ruby 8728 root 20 0 284m 150m 6612 S 6.6 1.0 5:33.49 ruby 29870 root 20 0 121m 4980 3264 S 6.3 0.0 1:10.19 PassengerHelper 30209 postgres 20 0 105m 33m 29m S 4.0 0.2 0:21.25 postgres 25753 postgres 20 0 103m 32m 29m S 3.6 0.2 3:28.84 postgres 29991 postgres 20 0 105m 33m 29m S 3.0 0.2 0:24.99 postgres 29888 www-data 20 0 106m 5632 2280 S 2.3 0.0 0:28.99 nginx 30069 postgres 20 0 104m 33m 29m S 2.0 0.2 0:27.90 postgres 5634 root 20 0 497m 198m 6660 S 1.0 1.3 695:58.35 ruby 1326 root 20 0 23304 1724 1192 S 0.3 0.0 0:11.17 top 7122 root 20 0 23492 1972 1192 R 0.3 0.0 0:10.25 top
Updated by Tom Clegg about 10 years ago
4297-dispatch-load @ 789e479
- Don't refresh node status more than once per second
- Cache api tokens, don't make one per second
- If repo does not exist, fail the job instead of retrying forever
- Cache git-tag and git-fetch-pack results
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to 2014-10-29 sprint
Updated by Ward Vandewege about 10 years ago
- Status changed from New to In Progress
- Story points set to 0.5
Updated by Brett Smith about 10 years ago
Reviewing 789e479
On the code itself:
fail_job
usesrescue
to detect when Log creation fails, but it only usesLog.new.save
, which should return a boolean to indicate success. Did you mean to useLog.new.save!
?- This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new
have_*
instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggestfetched_commits
andjob_tags
as more descriptive names, but I'm not wedded to them either.
Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run
is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run
, read_pipes
, or reap_children
to make real gains here.
Hope this helps.
Updated by Tom Clegg about 10 years ago
Brett Smith wrote:
Reviewing 789e479
On the code itself:
fail_job
usesrescue
to detect when Log creation fails, but it only usesLog.new.save
, which should return a boolean to indicate success. Did you mean to useLog.new.save!
?- This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new
have_*
instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggestfetched_commits
andjob_tags
as more descriptive names, but I'm not wedded to them either.
Agreed. Changed (and added a comment above each one to explain the significance.)
Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in
run
is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more ofrun
,read_pipes
, orreap_children
to make real gains here.
I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start, and improved noticeably when those jobs were finally able to start, which is why I concentrated on the "incorrectly predict that salloc will succeed" code path.
It would be ideal to address this with test cases that actually expose poor performance scenarios. But until then I'm expecting this will improve the worst case at hand or at least help us see what needs to be fixed next...
Now at 8c492b3
Updated by Brett Smith about 10 years ago
Tom Clegg wrote:
Brett Smith wrote:
Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in
run
is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more ofrun
,read_pipes
, orreap_children
to make real gains here.I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start…
Oh, thank you for reminding me about that. Actually, I think I'm sold now that this is going to be a very effective change, mostly because marking jobs failed more aggressively should mean many fewer calls of start_jobs
and more time for select
to pause. 8c492b3 is good to merge, thanks.
Updated by Tom Clegg about 10 years ago
- Category set to Crunch
- Status changed from In Progress to Feedback
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 10 years ago
- Status changed from Feedback to Resolved