Project

General

Profile

Actions

Bug #4882

closed

[Crunch] crunchstat reports surprising CPU usage when container appears and disappears

Added by Bryan Cosca almost 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
-

Description

Background

source:services/crunchstat/crunchstat.go has a list of locations where cgroup accounting files are likely to be found. Historically this was needed because the location varies from one kernel/system to another. Each time crunchstat reads stats, it tries each location in turn until it finds one that exists. When the target container is up, this works fine, but the last two locations give accounting information for the host instead of the container. Often, crunchstat tries to read accounting data before the container has come up or after the container has been destroyed (in the interval before it gets notified that the docker process has exited). Naturally, the transition between host stats and container stats results in reporting wild deltas.

The transition from one stat file to another causes an "error" log message which is easy to misinterpret as an error affecting the job, or explaining some later job failure -- see #5523.

Proposed fix

One or more of:
  • Don't report host-level stats if container-level stats have ever been reported.
  • Don't report host-level stats ever.
  • When the previous sample point and the current sample point come from different accounting files, don't print a delta.

Bug report

Originally reported as "Log graph goes out of bounds":

about 5/6th of the way through, you can see the yellow point go off the screen. Suggestion: maybe the graph should scale in real time... Hmm, I'm not sure because then the graph would keep scaling and would probably get annoying. It would still be useful to at least see that point though.


Files

log_graph_oob.png (56.3 KB) log_graph_oob.png Bryan Cosca, 12/29/2014 08:48 PM

Subtasks 1 (0 open1 closed)

Task #5932: Review 4882-no-host-when-containerResolvedTom Clegg12/29/2014Actions

Related issues 1 (1 open0 closed)

Related to Arvados - Bug #5523: [Crunch] crunchstat should not report errors during normal timing racesNewTom CleggActions
Actions #1

Updated by Tim Pierce almost 10 years ago

  • Target version set to Bug Triage
Actions #2

Updated by Tim Pierce almost 10 years ago

  • Subject changed from Log graph goes out of bounds to [Workbench] Log graph goes out of bounds
  • Category set to Workbench
Actions #3

Updated by Brett Smith almost 10 years ago

The job in question is qr1hi-8i9sb-ajgkcn4cuk4cntm. At 2014-12-29_21:48:34, where previous CPU user values were 10 or below, suddenly it skyrocketed to over 4000. After that, samples were below 1 for the remainder of the job.

crunchstat apparently didn't miss a sample. The logs show them being reported every 10 seconds consistently.

I wouldn't rule out the possibility that this is a bug in the underlying data—either in crunchstat, or the records it reads.

Actions #4

Updated by Tom Clegg almost 10 years ago

  • Subject changed from [Workbench] Log graph goes out of bounds to [Workbench] crunchstat reports surprising CPU usage when container appears and disappears
  • Description updated (diff)
  • Category changed from Workbench to Crunch
  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #5

Updated by Tom Clegg almost 10 years ago

  • Subject changed from [Workbench] crunchstat reports surprising CPU usage when container appears and disappears to [Crunch] crunchstat reports surprising CPU usage when container appears and disappears
Actions #6

Updated by Tom Clegg over 9 years ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg over 9 years ago

  • Description updated (diff)
Actions #8

Updated by Tom Clegg over 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-05-20 sprint
Actions #9

Updated by Tom Clegg over 9 years ago

  • Status changed from New to In Progress
Actions #10

Updated by Peter Amstutz over 9 years ago

It seems likely that there will be spurious "stats not available" messages during container startup and teardown, it would be nice to suppress those since that is an expected condition. What do you think about adding a counter or timer so that it only prints the notice after several consecutive failed attempts to read the stats?

Actions #11

Updated by Peter Amstutz over 9 years ago

  • Assigned To set to Tom Clegg
Actions #12

Updated by Tom Clegg over 9 years ago

Peter Amstutz wrote:

It seems likely that there will be spurious "stats not available" messages during container startup and teardown, it would be nice to suppress those since that is an expected condition. What do you think about adding a counter or timer so that it only prints the notice after several consecutive failed attempts to read the stats?

That seems reasonable. "Several" should probably be two: neither of those races lasts more than a few milliseconds unless something's wrong. It would be ideal to handle the startup race by waiting a short time and repolling, to avoid missing the first interval of data. Added this to #5523 ("verbose logs sound too scary") rather than blocking this ("stats are incorrect").

Actions #13

Updated by Tom Clegg over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:31d8ece649c63b008ed79a930a5b237ec795ff22.

Actions

Also available in: Atom PDF