Bug #4297

[Crunch] crunch-dispatch eats way too much cpu

Added by Ward Vandewege about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
10/27/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
0.5

Subtasks

Task #4323: Review 4297-dispatch-load (Resolved, Tom Clegg)


Related issues

Has duplicate: Arvados - Bug #3278: [Crunch] crunch-dispatch should cache ApiClientAuthorizations it creates (Closed, 07/17/2014)

Associated revisions

Revision e68bf42e
Added by Tom Clegg about 5 years ago

Merge branch '4297-dispatch-load' refs #4297

History

#1 Updated by Ward Vandewege about 5 years ago

crunch-dispatch in --jobs mode eats way too much cpu. The ps output below has 2 ruby processes owned by root. Those are crunch-dispatch processes in --jobs mode. One is in drain mode but has been running for a few days. The other one is actively tracking 2 local jobs, and is waiting to start 24 queued jobs that are blocked by not enough compute nodes.

24383 root      20   0  309m 172m 6636 R  92.7  1.2  74:48.85 ruby                                                                                    
30182 www-data  20   0  500m 350m 4416 S  72.0  2.3   6:26.02 ruby                                                                                    
29983 www-data  20   0  594m 424m 4376 S  61.1  2.8   7:47.53 ruby                                                                                    
 8601 root      20   0  125m  53m 4836 S  40.3  0.4   0:01.22 ruby                                                                                    
30062 www-data  20   0  448m 296m 4440 S  20.1  2.0   9:22.83 ruby                                                                                    
 8566 root      20   0  111m  37m 4964 S   6.9  0.2   0:01.06 ruby                                                                                    
 8728 root      20   0  284m 150m 6612 S   6.6  1.0   5:33.49 ruby                                                                                    
29870 root      20   0  121m 4980 3264 S   6.3  0.0   1:10.19 PassengerHelper                                                                         
30209 postgres  20   0  105m  33m  29m S   4.0  0.2   0:21.25 postgres                                                                                
25753 postgres  20   0  103m  32m  29m S   3.6  0.2   3:28.84 postgres                                                                                
29991 postgres  20   0  105m  33m  29m S   3.0  0.2   0:24.99 postgres                                                                                
29888 www-data  20   0  106m 5632 2280 S   2.3  0.0   0:28.99 nginx                                                                                   
30069 postgres  20   0  104m  33m  29m S   2.0  0.2   0:27.90 postgres                                                                                
 5634 root      20   0  497m 198m 6660 S   1.0  1.3 695:58.35 ruby                                                                                    
 1326 root      20   0 23304 1724 1192 S   0.3  0.0   0:11.17 top                                                                                     
 7122 root      20   0 23492 1972 1192 R   0.3  0.0   0:10.25 top           

#2 Updated by Tom Clegg about 5 years ago

4297-dispatch-load @ 789e479

  • Don't refresh node status more than once per second
  • Cache api tokens, don't make one per second
  • If repo does not exist, fail the job instead of retrying forever
  • Cache git-tag and git-fetch-pack results
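The first two items above boil down to throttling and memoization. A minimal sketch of those two patterns, with invented names (DispatchSketch, fetch_node_state, create_api_token) rather than crunch-dispatch's actual code:

```ruby
# Illustrative sketch (invented names, not crunch-dispatch's real code)
# of two of the fixes above: refresh node status at most once per
# second, and cache one API token per job instead of minting a new one
# on every pass through the dispatch loop.
class DispatchSketch
  def initialize
    @node_state = nil
    @node_state_fetched_at = Time.at(0)
    @auth_cache = {} # job uuid => cached token
  end

  # Return the cached node status unless it is more than a second old.
  def node_state
    if Time.now - @node_state_fetched_at >= 1.0
      @node_state = fetch_node_state
      @node_state_fetched_at = Time.now
    end
    @node_state
  end

  # Reuse the token already created for this job, if any.
  def auth_token_for(job_uuid)
    @auth_cache[job_uuid] ||= create_api_token(job_uuid)
  end

  private

  # Stand-ins for the real node-status query and token-creation API call.
  def fetch_node_state
    {nodes: []}
  end

  def create_api_token(job_uuid)
    "token-for-#{job_uuid}"
  end
end
```

The same memoize-by-key shape applies to the git-tag and git-fetch-pack caching in the last item.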

#3 Updated by Ward Vandewege about 5 years ago

  • Target version changed from Bug Triage to 2014-10-29 sprint

#4 Updated by Ward Vandewege about 5 years ago

  • Status changed from New to In Progress
  • Story points set to 0.5

#5 Updated by Brett Smith about 5 years ago

Reviewing 789e479

On the code itself:

  • fail_job uses rescue to detect when Log creation fails, but it only uses Log.new.save, which should return a boolean to indicate success. Did you mean to use Log.new.save!?
  • This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new have_* instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggest fetched_commits and job_tags as more descriptive names, but I'm not wedded to them either.
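On the first point: with ActiveRecord semantics, save reports a validation failure by returning false, so a surrounding rescue never fires, while save! raises. A minimal stand-in (FakeLog is invented here, not the real Log model) makes the difference concrete:

```ruby
# Minimal stand-in (not the real Log model) showing why a rescue-based
# error path needs save! rather than save.
class FakeLog
  def initialize(valid)
    @valid = valid
  end

  # Like ActiveRecord#save: signals failure with a false return value.
  # A rescue wrapped around this call is dead code.
  def save
    @valid
  end

  # Like ActiveRecord#save!: signals failure by raising, which is what
  # a rescue-based error path actually needs.
  def save!
    raise "Record invalid" unless @valid
    true
  end
end

log = FakeLog.new(false)
log.save # returns false; nothing is raised
begin
  log.save!
rescue RuntimeError
  # only this path detects the failure
end
```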

Now, more philosophically: these changes make crunch-dispatch more responsible. I think they're good and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.
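The kind of change I have in mind could look roughly like this: block in IO.select with a timeout so the loop sleeps while child pipes are quiet, instead of spinning. drain_pipes and the one-second timeout are invented for illustration, not crunch-dispatch's actual read_pipes:

```ruby
# Illustrative sketch (invented name and timeout) of a read loop that
# pauses in IO.select instead of spinning: when no child pipe has data,
# select blocks for up to `timeout` seconds before the loop continues.
def drain_pipes(pipes, timeout: 1.0)
  output = ""
  until pipes.empty?
    ready, = IO.select(pipes, nil, nil, timeout)
    next if ready.nil? # timed out: a real dispatcher would do housekeeping here
    ready.each do |io|
      begin
        output << io.read_nonblock(4096)
      rescue EOFError
        # child closed its end of the pipe; stop watching it
        io.close
        pipes.delete(io)
      end
    end
  end
  output
end
```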

Hope this helps.

#6 Updated by Tom Clegg about 5 years ago

Brett Smith wrote:

Reviewing 789e479

On the code itself:

  • fail_job uses rescue to detect when Log creation fails, but it only uses Log.new.save, which should return a boolean to indicate success. Did you mean to use Log.new.save!?
  • This is highly subjective, but FWIW, it took me a minute to grapple with the meaning of the new have_* instance variables. I think mostly because it leaves open the question, where do we have those things? In the source git repository, our internal git repository, in Arvados, somewhere else? I would humbly suggest fetched_commits and job_tags as more descriptive names, but I'm not wedded to them either.

Agreed. Changed (and added a comment above each one to explain the significance).

Now, more philosophically: these changes make crunch-dispatch more responsible. I think they're good and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.

I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start, and improved noticeably when those jobs were finally able to start, which is why I concentrated on the "incorrectly predict that salloc will succeed" code path.

It would be ideal to address this with test cases that actually expose poor performance scenarios. But until then I'm expecting this will improve the worst case at hand or at least help us see what needs to be fixed next...

Now at 8c492b3

#7 Updated by Brett Smith about 5 years ago

Tom Clegg wrote:

Brett Smith wrote:

Now, more philosophically: these changes make crunch-dispatch more responsible. I think they're good and should generally be merged, but I'm not convinced they're going to resolve this ticket. Fundamentally, the main loop in run is very tight. If the jobs are producing any kind of output—and they usually will be, thanks to crunchstat if nothing else—then crunch-dispatch will keep looping without pause. In fact, I wouldn't be surprised if this branch causes crunch-dispatch to use more CPU, because it spends less time calling and waiting for child processes. I think we will have to change one or more of run, read_pipes, or reap_children to make real gains here.

I'm sure you're right that there are more performance problems to fix. But we did notice that performance was especially bad when jobs were queued and failing to start…

Oh, thank you for reminding me about that. Actually, I think I'm sold now that this is going to be a very effective change, mostly because marking jobs failed more aggressively should mean many fewer calls of start_jobs and more time for select to pause. 8c492b3 is good to merge, thanks.

#8 Updated by Tom Clegg about 5 years ago

  • Category set to Crunch
  • Status changed from In Progress to Feedback
  • Assigned To set to Tom Clegg

#9 Updated by Tom Clegg about 5 years ago

  • Status changed from Feedback to Resolved
