Bug #4297: [Crunch] crunch-dispatch eats way too much cpu - Arvados

Custom queries

All assigned issues
All issues assigned for grooming
My issues for grooming
My issues for grooming (no story pts)
Prioritized open issues

Actions

Copy link

Bug #4297

closed

[Crunch] crunch-dispatch eats way too much cpu

Added by Ward Vandewege about 10 years ago. Updated about 10 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Crunch

Target version:

2014-10-29 sprint

Story points:

0.5

Subtasks 1 (0 open — 1 closed)

Task #4323: Review 4297-dispatch-load

Resolved

Tom Clegg

10/27/2014

Actions

Related issues 1 (0 open — 1 closed)

Has duplicate Arvados - Bug #3278: [Crunch] crunch-dispatch should cache ApiClientAuthorizations it creates

Closed

07/17/2014

Actions

Issue # Delay: days Cancel

History
Notes
Property changes
Associated revisions

Actions

Copy link

Updated by Ward Vandewege about 10 years ago

crunch-dispatch in --jobs mode eats way to much cpu. The ps output below has 2 ruby processess owned by root. Those are crunch-dispatch processes in --jobs mode. One is in drain mode but has been running for a few days. The other one is actively tracking 2 local jobs, and is waiting to start 24 queued jobs that are blocked by not enough compute nodes.

24383 root      20   0  309m 172m 6636 R  92.7  1.2  74:48.85 ruby                                                                                    
30182 www-data  20   0  500m 350m 4416 S  72.0  2.3   6:26.02 ruby                                                                                    
29983 www-data  20   0  594m 424m 4376 S  61.1  2.8   7:47.53 ruby                                                                                    
 8601 root      20   0  125m  53m 4836 S  40.3  0.4   0:01.22 ruby                                                                                    
30062 www-data  20   0  448m 296m 4440 S  20.1  2.0   9:22.83 ruby                                                                                    
 8566 root      20   0  111m  37m 4964 S   6.9  0.2   0:01.06 ruby                                                                                    
 8728 root      20   0  284m 150m 6612 S   6.6  1.0   5:33.49 ruby                                                                                    
29870 root      20   0  121m 4980 3264 S   6.3  0.0   1:10.19 PassengerHelper                                                                         
30209 postgres  20   0  105m  33m  29m S   4.0  0.2   0:21.25 postgres                                                                                
25753 postgres  20   0  103m  32m  29m S   3.6  0.2   3:28.84 postgres                                                                                
29991 postgres  20   0  105m  33m  29m S   3.0  0.2   0:24.99 postgres                                                                                
29888 www-data  20   0  106m 5632 2280 S   2.3  0.0   0:28.99 nginx                                                                                   
30069 postgres  20   0  104m  33m  29m S   2.0  0.2   0:27.90 postgres                                                                                
 5634 root      20   0  497m 198m 6660 S   1.0  1.3 695:58.35 ruby                                                                                    
 1326 root      20   0 23304 1724 1192 S   0.3  0.0   0:11.17 top                                                                                     
 7122 root      20   0 23492 1972 1192 R   0.3  0.0   0:10.25 top

Actions

Copy link

Updated by Tom Clegg about 10 years ago

4297-dispatch-load @ 789e479

Don't refresh node status more than once per second
Cache api tokens, don't make one per second
If repo does not exist, fail the job instead of retrying forever
Cache git-tag and git-fetch-pack results

Actions

Copy link

Updated by Ward Vandewege about 10 years ago

Target version changed from Bug Triage to 2014-10-29 sprint

Actions

Copy link

Updated by Ward Vandewege about 10 years ago

Status changed from New to In Progress
Story points set to 0.5

Actions

Copy link

Updated by Brett Smith about 10 years ago

Reviewing 789e479

On the code itself:

fail_job uses rescue to detect when Log creation fails, but it only uses Log.new.save, which should return a boolean to indicate success. Did you mean to use Log.new.save!?
This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new have_* instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggest fetched_commits and job_tags as more descriptive names, but I'm not wedded to them either.

Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.

Hope this helps.

Actions

Copy link

Updated by Tom Clegg about 10 years ago

Brett Smith wrote:

Reviewing 789e479

On the code itself:

fail_job uses rescue to detect when Log creation fails, but it only uses Log.new.save, which should return a boolean to indicate success. Did you mean to use Log.new.save!?

This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new have_* instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggest fetched_commits and job_tags as more descriptive names, but I'm not wedded to them either.

Agreed. Changed (and added a comment above each one to explain the significance.)

Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.

I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start, and improved noticeably when those jobs were finally able to start, which is why I concentrated on the "incorrectly predict that salloc will succeed" code path.

It would be ideal to address this with test cases that actually expose poor performance scenarios. But until then I'm expecting this will improve the worst case at hand or at least help us see what needs to be fixed next...

Now at 8c492b3

Actions

Copy link

Updated by Brett Smith about 10 years ago

Tom Clegg wrote:

Brett Smith wrote:

Now, more philosophically: these changes make crunch-dispatch more responsible, I think they're good and and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.

I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start…

Oh, thank you for reminding me about that. Actually, I think I'm sold now that this is going to be a very effective change, mostly because marking jobs failed more aggressively should mean many fewer calls of start_jobs and more time for select to pause. 8c492b3 is good to merge, thanks.

Actions

Copy link