Bug #5523

[Crunch] crunchstat should not report errors during normal timing races

Added by Peter Amstutz about 4 years ago. Updated about 4 years ago.

Status: New
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 05/07/2015
Due date:
% Done: 50%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Container stat files appear and disappear in normal operation. In the "normal" cases, such events should not be logged (let alone as an error).

We expect zero or one episode of "cannot find stats file" when cidfile != "" and we're collecting stats for the first time.
  • If the first collection attempt for a given statistic results in "cannot find file", we should block in OpenStatFile and poll quickly over a short interval (say, every 100ms, max 1s) because we probably just won the race with the container setup process.
  • If the stat files don't show up within that max interval (~1s) it means something is wrong, and this should (still) be logged.
We expect zero or one episode of "stats file disappeared" when cidfile != "" and we happen to poll between container shutdown and (crunchstat's) child exit. For a given statistic:
  • The first time this occurs, we should not log anything.
  • The second time this occurs, we should log "warning: stats file disappeared {duration} ago, but child has not exited".
  • The third+ time this occurs, we should not log anything.
  • If the stat file reappears, we should reset the "went missing" counter to zero.

Subtasks

Task #5918: Review 5523-stats-error (Resolved, Tom Clegg)

Task #5938: Handle normal container startup and shutdown races without logging an error/notice or missing the first interval (New, Tom Clegg)


Related issues

Related to Arvados - Bug #4882: [Crunch] crunchstat reports surprising CPU usage when container appears and disappears (Resolved, 12/29/2014)

Associated revisions

Revision 200f7004
Added by Tom Clegg about 4 years ago

Merge branch '5523-stats-error' closes #5523

History

#1 Updated by Peter Amstutz about 4 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#2 Updated by Tom Clegg about 4 years ago

Would be good to have more detail - job ID or log fragment.

We could probably start by changing that message to say "warning".

#3 Updated by Bryan Cosca about 4 years ago

Here is an example: https://cloud.curoverse.com/collections/111de0eb53db4ff9849dc2e63178425d+85/qr1hi-8i9sb-8vyxlg0da770t8e.log.txt

I think this should get moved up: a customer mistook this error for a RAM issue and stopped using Arvados.

#4 Updated by Tom Clegg about 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-05-20 sprint

#5 Updated by Tom Clegg about 4 years ago

  • Status changed from New to In Progress

#6 Updated by Tom Clegg about 4 years ago

  • Category set to Crunch
  • Assigned To set to Tom Clegg

#7 Updated by Peter Amstutz about 4 years ago

ae6b514 5523-stats-error LGTM

#8 Updated by Tom Clegg about 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:200f7004f921a68ec40b407dfe31f1db95e98fb9.

#9 Updated by Tom Clegg about 4 years ago

  • Subject changed from [Crunch] crunchstat The "error reading stats from /sys/fs/cgroup/memory/memory.stat to [Crunch] crunchstat should not report errors during normal timing races
  • Description updated (diff)
  • Status changed from Resolved to In Progress

Reopening to add TODO from #4882.

#10 Updated by Tom Clegg about 4 years ago

  • Story points set to 0.5

#11 Updated by Tom Clegg about 4 years ago

  • Status changed from In Progress to New

#12 Updated by Brett Smith about 4 years ago

  • Target version changed from 2015-05-20 sprint to Arvados Future Sprints
