Feature #4233
[Workbench] Graph CPU and IO statistics using live log feed on Jobs#show
Status: Closed
Description
- A graph appears on the log tab of a job that is running or queued.
- When a websocket message arrives with a "stderr" log entry, in addition to appending the text to the existing log viewer window, split the log text into lines (there can be multiple lines per log entry), grep for "crunchstat:" lines, and update the graph accordingly.
- For now, ignore the cumulative stats (everything before the "--" delimiter) and just graph the "interval" stats.
- Each log line is expected to have a numeric task identifier, T (it's "0" for the first task).
- Each crunchstat sample is expected to have the format "crunchstat: L X1 L1 X2 L2 ... -- interval Dt seconds D1 L1 D2 L2 ..." where L* are labels and D* are deltas.
- Each crunchstat sample should be rendered on the graph as a data point (T, sum(Dx)/Dt). The legend for the series should be "L". The tooltip for the data point should list the individual D1 L1 etc. ("123 tx / 345 rx"), as well as sum(Dx)/Dt, the task id T ("task 0"), and the label L ("net:eth0"). (A parsing sketch of this rule appears after this list.)
- The graph should have a scale of ~1 pixel per second on the X axis: this will show 10-20 minutes of activity. Older activity will scroll off the left edge. For now, there's no need to support scrolling to see older data points, unless of course this turns out to be trivial to implement.
- Ignore lines with no " -- interval ".
- Probably best to use Morris. There is an existing example in app/assets/javascripts/keep_disks.js.coffee.
- Provide a test helper (in apps/workbench/lib?) that replays crunch job logs.
- Accept arguments: path to plain text file, speed multiplier, optional job UUID.
- Read log messages from the specified file.
- Parse the timestamps.
- Replace the job UUID in the text file with the job UUID specified by the caller, if any.
- Copy the modified log entry into the Logs table using Log.create().
- Introduce delays between log messages such that the logs are replayed faster (by the given speed multiplier) than suggested by the timestamps in the text file. If the speed multiplier given is zero or negative, don't delay at all.
- Do not modify the timestamps in the log text, though. The log record itself will have a current timestamp which will disagree with the text, but that should be fine.
- Raise an exception if RAILS_ENV is not "test" or "development".
- Make a rake task so the helper can be invoked from the command line as "rake replaylog log.txt 2.0".
- Note on the Hacking Workbench wiki page how to use the helper to inject logs while hacking the graphing code (or the text log viewer, for that matter).
- A test case uses the helper to inject logs and confirm that the graph gets updated as a result.
- Add a real log file (or an excerpt of one) to the source tree for use in tests. (Check with Abram / science team for a good public example that has multiple concurrent tasks and some periods of interesting activity.)
Future work (out of scope for this story):
- Graph stats from multiple jobs at once (e.g., on the "show pipeline" page, when more than one job in the pipeline is running)
- View historical logs (the data will come from plain text, not websockets)
- Restrict the series to a subset of tasks or statistics (for now, a job with many concurrent tasks will have too much data to digest, although "near zero" and "not near zero" should still be possible to distinguish, which is usually the first question).
- Group the statistics by node.
- Group the statistics by label (e.g., total CPU load across all running tasks).
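For reference, a minimal JavaScript sketch (hypothetical function and field names, not the Workbench implementation) of how one "crunchstat:" line could be turned into the (T, sum(Dx)/Dt) data point described above, after the log entry's text has been split into lines:

function parseCrunchstatLine(line) {
  // Example input (one physical line; a single log entry may contain several):
  // "2014-10-31_15:46:04 qr1hi-... 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0001 seconds 0 tx 0 rx"
  var m = /^\S+ \S+ \d+ (\d+) stderr crunchstat: (\S+) (.*) -- interval (\d+\.\d+) seconds (.*)$/.exec(line);
  if (!m) return null;                       // ignore lines with no " -- interval "
  var task = parseInt(m[1], 10);             // T
  var label = m[2];                          // L, e.g. "net:eth0"
  var dt = parseFloat(m[4]);                 // Dt
  var deltas = [];                           // [[D1, L1], [D2, L2], ...]
  var tokens = m[5].trim().split(/\s+/);
  for (var i = 0; i + 1 < tokens.length; i += 2) {
    deltas.push([parseFloat(tokens[i]), tokens[i + 1]]);
  }
  var sum = deltas.reduce(function (acc, d) { return acc + d[0]; }, 0);
  return {
    task: task,
    label: label,
    y: sum / dt,                                                                // the plotted value: sum(Dx)/Dt
    detail: deltas.map(function (d) { return d[0] + ' ' + d[1]; }).join(' / ')  // e.g. "0 tx / 0 rx"
  };
}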
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Tom Clegg about 10 years ago
- Category set to Workbench
- Assigned To set to Tom Clegg
Updated by Tom Clegg about 10 years ago
- Assigned To changed from Tom Clegg to Phil Hodgson
Updated by Tom Clegg about 10 years ago
Here's a mini log excerpt.
2014-10-31_15:45:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 198369280 cache 0 swap 11 pgmajfault 213213184 rss
2014-10-31_15:45:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.2700 sys 8 cpus -- interval 10.0000 seconds 0.1300 user 0.2700 sys
2014-10-31_15:45:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read
2014-10-31_15:45:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx
2014-10-31_15:46:04 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 601022464 cache 0 swap 11 pgmajfault 615866368 rss
2014-10-31_15:46:04 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.4300 sys 8 cpus -- interval 10.0002 seconds 0.0000 user 0.1600 sys
2014-10-31_15:46:04 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0001 seconds 0 write 0 read
2014-10-31_15:46:04 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0001 seconds 0 tx 0 rx
2014-10-31_15:46:14 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 1003675648 cache 0 swap 11 pgmajfault 1018519552 rss
2014-10-31_15:46:14 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.5300 sys 8 cpus -- interval 10.0002 seconds 0.0000 user 0.1000 sys
2014-10-31_15:46:14 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0002 seconds 0 write 0 read
2014-10-31_15:46:14 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0002 seconds 0 tx 0 rx
2014-10-31_15:46:24 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 1406328832 cache 0 swap 11 pgmajfault 1421172736 rss
2014-10-31_15:46:24 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.6100 sys 8 cpus -- interval 10.0001 seconds 0.0000 user 0.0800 sys
2014-10-31_15:46:24 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0002 seconds 0 write 0 read
2014-10-31_15:46:24 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0002 seconds 0 tx 0 rx
2014-10-31_15:46:34 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 1808982016 cache 0 swap 11 pgmajfault 1823825920 rss
2014-10-31_15:46:34 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.6800 sys 8 cpus -- interval 10.0002 seconds 0.0000 user 0.0700 sys
2014-10-31_15:46:34 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0002 seconds 0 write 0 read
2014-10-31_15:46:34 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0002 seconds 0 tx 0 rx
2014-10-31_15:46:44 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 2211635200 cache 0 swap 11 pgmajfault 2226479104 rss
2014-10-31_15:46:44 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.7400 sys 8 cpus -- interval 10.0002 seconds 0.0000 user 0.0600 sys
2014-10-31_15:46:44 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0002 seconds 0 write 0 read
2014-10-31_15:46:44 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3298 tx 150945 rx -- interval 10.0002 seconds 0 tx 0 rx
2014-10-31_15:46:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: mem 2614288384 cache 0 swap 11 pgmajfault 2629132288 rss
2014-10-31_15:46:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: cpu 0.1300 user 0.7800 sys 8 cpus -- interval 10.0001 seconds 0.0000 user 0.0400 sys
2014-10-31_15:46:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: blkio:202:16 0 write 1167360 read -- interval 10.0001 seconds 0 write 0 read
2014-10-31_15:46:54 qr1hi-8i9sb-nf3qk0xzwwz3lre 29975 0 stderr crunchstat: net:eth0 3430 tx 151108 rx -- interval 10.0001 seconds 132 tx 163 rx
Updated by Phil Hodgson about 10 years ago
I am psyched to work on this, but my websockets server doesn't work in my dev environment. I've tried a series of different things using what clues I could find, but I feel like I'm shooting in the dark and wasting time. What's the trick?
Firefox can't establish a connection to the server at wss://localhost:3030/websocket?api_token=[etc]
event_log_disp = new WebSocket(websocket_url);
Updated by Tom Clegg about 10 years ago
Phil Hodgson wrote:
I am psyched to work on this, but my websockets server doesn't work in my dev environment. I've tried a series of different things using what clues I could find, but I feel like I'm shooting in the dark and wasting time. What's the trick?
[...]
Do you have something like this when you start up your API server (see Hacking API server): export ARVADOS_WEBSOCKETS=true ...?
Peter reported recently that Firefox silently fails when it can't verify a websocket server's SSL certificate. (Hence the recent change to the Workbench test suite that makes it start up two concurrent API server instances, https:// and ws://, to test itself against.) It's possible that wss://X works if you visit https://X first and force it to accept the certificate, though.
Staring at the code, it looks like API server advertises the websocket_address given in config/application*.yml even if it is running a websocket server itself. So, if you have ARVADOS_WEBSOCKETS set, you probably want websocket_address=false in your config.
Updated comments in the config file @ 8900727
(Helpful?)
Updated by Phil Hodgson about 10 years ago
It's possible that wss://X works if you visit https://X first and force it to accept the certificate, though.
Good call: this works, as does switching to Chrome which is even better because it doesn't require that first step of forced acceptance.
Thank you Tom and Peter!
Updated by Phil Hodgson about 10 years ago
There's a y-axis problem here, because not all the sum(Dx) are measured in seconds, so there are differing units of measurement. Morris is not yet capable of doing two different y axes (github issue may get resolved someday).
Simple solution may be to have multiple graphs? Maybe one per series, or maybe one per unit of measurement (e.g. one graph for the % time and another for bytes sent/received during an interval)?
Updated by Tom Clegg about 10 years ago
Phil Hodgson wrote:
There's a y-axis problem here, because not all the sum(Dx) are measured in seconds, so there are differing units of measurement. Morris is not yet capable of doing two different y axes (github issue may get resolved someday).
Darn.
Simple solution may be to have multiple graphs? Maybe one per series, or maybe one per unit of measurement (e.g. one graph for the % time and another for bytes sent/received during an interval)?
Another possibility: Scale each category of series (e.g., "net" scales by max(cpu)/max(net)), and just live with not having scale markings for the non-CPU figures. I'm imagining that could be coded such that it's easily removed when the appropriate feature (or a different graphing library) comes along. Changing (scaling) the existing points during an update shouldn't interfere with anything, right?
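Roughly what I have in mind, as a sketch only (made-up names, not a spec):

function rescaleToCpu(points) {
  // points: [{category: 'net', rawY: ..., y: ...}, ...] -- rawY stays available for the tooltip.
  var maxByCategory = {};
  points.forEach(function (p) {
    maxByCategory[p.category] = Math.max(maxByCategory[p.category] || 0, p.rawY);
  });
  var cpuMax = maxByCategory['cpu'] || 1;
  points.forEach(function (p) {
    var catMax = maxByCategory[p.category] || 1;
    // Plot non-CPU categories scaled so their maximum lines up with max(cpu).
    p.y = (p.category === 'cpu') ? p.rawY : p.rawY * (cpuMax / catMax);
  });
  return points;
}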
The github issue does explicitly ask for pull requests. It also says "it'll need some serious work". Hmm...
Updated by Bryan Cosca about 10 years ago
Here are a couple examples:
41c21fe24f7e861f61af30a52e3ae3ef+87/qr1hi-8i9sb-xca86qu6z460no8.log.txt
682b61f380d303d173e78fde9eb60c98+87/qr1hi-8i9sb-l7vv9qqxozezv38.log.txt
Let me know if you would like something different!
Updated by Abram Connelly about 10 years ago
Here are a couple from 9tee4 for me:
539086fd4061f946361b858d58ed487f+89/9tee4-8i9sb-fiqxbkp5j4mydb8.log.txt
0bb61e46c383b8c9d3979126d561fa4d+89/9tee4-8i9sb-u4i066qpr4zyidg.log.txt
badada516becda39a1319b728403d3f5+89/9tee4-8i9sb-2q8t7tnu7pit8eh.log.txt
Updated by Bryan Cosca about 10 years ago
Here's one that switches from 1 to 8:
718372dd89ec3873df80ffd7dff00768+89/qr1hi-8i9sb-gctmuofg58gkt3k.log.txt
Updated by Bryan Cosca about 10 years ago
4821022082141e6f5b79266ec6acf139+87/qr1hi-8i9sb-k3s82jjrvwqr1oq.log.txt
Hopefully this is the one! :)
Updated by Phil Hodgson about 10 years ago
I've made a lot of progress on this feature and I should like to get some feedback on where it's at before dotting teas and crossing eyes.
Biggest eye to cross is probably the tests, which are cursory but decidedly much better than nothing. It proved to be very difficult indeed to make sure that the graph is actually displaying properly, and so what is being tested at the moment is actually the Javascript array object that is used by Morris to make the graph. If the right data end up in that object then the trickiest stuff is already working well.
Another big tea to dot would be the pipeline_instances logs, but I'm not sure how those work, or even if they have crunchstat intervals. Do they?
Also it should be noted that the chart is only displayed when using the live log display. When the job is done and the log is saved as a collection, an entirely different log display is used (the one with all the radio button filters and paging), and there is no chart displayed in that context.
My technique for testing generally runs like this: using a Job that is not attached to a log file Collection (and so presumably one that is not finished and for which the live log display will be active), I clear its log entries from the database with DELETE FROM logs WHERE object_uuid = 'whateveruuid' and then I run the new rake task from the Arvados API server:
rake replay_job_log[path_to_log,whateveruuid]
or with a multiplier, for the impatient (this will play back the log at 4x speed):
rake replay_job_log[path_to_log,whateveruuid,4]
Note that the log should be a relatively recent one (newer than ~Oct 22, 2014), with crunchstat lines that have the "-- interval" part recorded, or nothing will happen.
The whateveruuid will replace the one in the log, so that any log can be used for testing.
I'll write something like the above into the appropriate Hacking wiki at the same time as merging.
Also note that of the logs posted above by Bryan and Abram, only the last one has the "-- interval" recordings (direct link).
Updated by Tom Clegg about 10 years ago
Well, I think this is already pretty fun to watch, even with simulated data. I haven't looked at the code very hard, or since Monday, but here are some thoughts anyway.
Phil Hodgson wrote:
Biggest eye to cross is probably the tests, which are cursory but decidedly much better than nothing. It proved to be very difficult indeed to make sure that the graph is actually displaying properly, and so what is being tested at the moment is actually the Javascript array object that is used by Morris to make the graph. If the right data end up in that object then the trickiest stuff is already working well.
We might be able to use some sort of "perceptual diff" for this (and provenance graphs), but getting the array data right sounds like a good start.
Beyond a certain point I suppose we'll just be testing whether Morris works. A thought: after telling Morris about the new data points, we could stuff something (data point timestamp?) into a data-* attribute in a DOM element, in order to signal to the test case that it actually got that far...?
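Something like this, maybe (sketch only; the Morris setData call is real, but the element and attribute names are made up):

function updateGraph(chart, series) {
  chart.setData(series);   // Morris redraws from the new array
  // Leave a breadcrumb for the test suite: the timestamp of the newest plotted point.
  var latest = series.length ? series[series.length - 1].t : '';
  $('#log_graph_div').attr('data-graphed-through', latest);
}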
Another big tea to dot would be the pipeline_instances logs, but I'm not sure how those work, or even if they have crunchstat intervals. Do they?
The pipeline logs are just the union of job logs for all jobs used in the pipeline. I think this should stay as "future work" for now -- for the time being, with the single-job case merged, users can approximate "pipeline graph" by opening multiple job tabs.
Also it should be noted that the chart is only displayed when using the live log display. When the job is done and the log is saved as a collection, an entirely different log display is used (the one with all the radio button filters and paging), and there is no chart displayed in that context.
Ultimately we'll converge those two views, but for now the question "is my job wedged?" seems more interesting by a decent margin than "was my job wedged before it stopped?". For now, I hope we can live with the disappointment of having the graph snatched away just as it was getting interesting.
My technique for testing generally runs like this: using a Job that is not attached to a log file Collection (and so presumably one that is not finished and for which the live log display will be active), I clear its log entries from the database with
DELETE FROM logs WHERE object_uuid = 'whateveruuid'
and then I run the new rake task from the arvados API:
Hm, how easy would it be to add a "delete existing logs first" option to the rake task?
I'll write something like the above into the appropriate Hacking wiki at the same time as merging.
Excellent plan.
I have a few requests, mostly with the implied phrase "unless it's a big deal...", and ~ in order by priority:
- At initial page load, and while the job is queued / starting up (I'm assuming, since I've only tried it as a simulation), the log div looks like accidental whitespace. I think some sort of placeholder would be good: perhaps a bootstrap "alert alert-info" box that says the job hasn't produced any runtime statistics yet but when it does they will appear here? Although in general moving stuff around the page is worth avoiding, I wonder if in this case it would be good to use a slim info box and then replace it with a full-height graph div when there's something to display.
- I left this out of the issue description and neglected to add it when it came up in discussion: having the memory figures (especially swap and rss) is expected to be very useful. Can we easily add those? They're not "interval" figures, just "current reading" ("memory in use right now" being a better indicator of performance than "memory allocated in the last 10 seconds").
- Things look a bit odd when the first few data points are plotted. Could we have a minimum time span for the x axis, so the first few points appear at the right hand end and scroll to the left?
- Hover-detail box formatting:
- Left-align
- Sort by task#, then statistic
- Add the word "task" ("Task 1: cpu 39% ...")
- Hover-detail box pinned to top or bottom? I find it a bit jumpy.
- Color assignment: I don't want to get bogged down in this (finding an Ideal Solution might take forever and/or be impossible) but just as a thought to throw out there, I wonder if a scheme like {blue=net, red=disk, green=cpu} / {dark=task1, lighter=task2, ??=task3} could work? Or perhaps this will cease to be an issue at all once we start adding "show/hide/group" controls for both dimensions.
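For illustration only, the sort of mapping I mean (a sketch, not a requirement):

function seriesColor(statFamily, taskNumber) {
  // Hue encodes the statistic family, lightness encodes the task number
  // (task 0 darkest, later tasks progressively lighter).
  var hues = { cpu: 120, net: 210, blkio: 0, mem: 45 };
  var hue = (statFamily in hues) ? hues[statFamily] : 280;
  var lightness = Math.min(75, 35 + 10 * taskNumber);
  return 'hsl(' + hue + ', 70%, ' + lightness + '%)';
}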
Updated by Tom Clegg about 10 years ago
Bug reports after I tried running a job with lots of tasks:
- I don't think this is related to the number of tasks, but it just shows me one point at t="2014". I'm guessing it's a timestamp formatting problem, and my "Do not modify the timestamps in the log text, though" idea was not ideal: the *.log.txt we use as testing data has timestamps like "2014-11-07_23:22:11" but the log entries crunch-dispatch feeds to the event log have timestamps like "Wed Nov 19 05:07:38 2014". Whoops! I've pushed a fix (b40e379) to this branch so future live logs generate UTC just like the test data, but it would be great if the graphing code could accept the previous/current format as well. (A sketch of a parser that accepts both formats appears after this list.)
- While on the Log tab, if I scroll down a bit, the browser window seems to spontaneously scroll itself back to the top of the page now and then. I haven't figured out exactly what circumstances cause this -- it certainly doesn't do it on every event bus update -- or whether it's even related to this branch. I haven't seen this before. Thought I'd mention it here just in case.
- Looks like morris can handle dynamic resize: https://github.com/morrisjs/morris.js/issues/111, http://jsbin.com/nejoleqi/3
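For the timestamp point above, a tolerant parser could look roughly like this (a sketch; the real branch may well do it differently):

function parseLogTimestamp(s) {
  // "2014-11-07_23:22:11" (old logs and the test data), treated as UTC.
  var m = /^(\d{4})-(\d{2})-(\d{2})_(\d{2}):(\d{2}):(\d{2})$/.exec(s);
  if (m) return Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]);
  // "Wed Nov 19 05:07:38 2014" (what crunch-dispatch had been emitting), also treated as UTC.
  var months = { Jan: 0, Feb: 1, Mar: 2, Apr: 3, May: 4, Jun: 5, Jul: 6, Aug: 7, Sep: 8, Oct: 9, Nov: 10, Dec: 11 };
  m = /^\w{3} (\w{3}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})$/.exec(s);
  if (m) return Date.UTC(+m[6], months[m[1]], +m[2], +m[3], +m[4], +m[5]);
  return NaN;
}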
Updated by Tom Clegg about 10 years ago
- Subject changed from [Workbench] Graph CPU and IO statistics using live log feed on PipelineInstance#show and Jobs#show to [Workbench] Graph CPU and IO statistics using live log feed on Jobs#show
- Status changed from New to In Progress
Updated by Ward Vandewege about 10 years ago
- Target version changed from 2014-11-19 sprint to 2014-12-10 sprint
Updated by Phil Hodgson about 10 years ago
There was a bit of talk in the Sprint Review about what to do about the maxima for y-scales for the various stats. It was decided that it is possible to know the highest numbers one can expect from most of the stats. It will be easy to implement once we figure these out.
Meanwhile I went to some trouble to make it so that for stats where isJobSeriesRescalable(stat_name) (a new Javascript function) is true, the highest reading currently on the page is always 100%. A spike (a datapoint valued much higher than any recently preceding one in its series), appearing on the right, will suddenly make all the existing datapoints in that series look like dwarves, pushed against the bottom of the graph. Later, when the spike scrolls off to the left, the remaining datapoints in that series will suddenly turn back into mountains, scaled up again. This is pretty normal graphing behaviour when a y maximum is not known, but it remains a bit strange, especially when no y-axis labels are shown, as now: you have to hover over a point and look at it.
I can also propose making it so that a y-scale for a series is never scaled up, only down, i.e. a cumulative maximum detection. E.g. once a datapoint in a series has hit 967, or whatever, that becomes the new maximum y value even if that datapoint scrolls off to the left.
I could make it an option. I'd suggest making this into a new Redmine story where we can configure data series using a yml file in the config directory. That would replace the current hard-coding of the "cpu" series, and the regex for matching, and how the positions of the various data in the regex are defined.
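For what it's worth, the cumulative-maximum variant would be small; roughly this (a sketch, with made-up names apart from isJobSeriesRescalable):

var seriesMaxima = {};   // remembered even after the peak scrolls off the left edge

function scaledValue(seriesName, rawValue) {
  if (!isJobSeriesRescalable(seriesName)) return rawValue;
  seriesMaxima[seriesName] = Math.max(seriesMaxima[seriesName] || 0, rawValue);
  var max = seriesMaxima[seriesName];
  return max > 0 ? rawValue / max : 0;   // the highest reading ever seen maps to 1.0 (100%)
}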
Updated by Phil Hodgson about 10 years ago
There is a Firefox bug that makes the x-axis timeline wrong right now. This needs to be fixed.
Tom Clegg wrote:
Phil Hodgson wrote:
My technique for testing generally runs like this: using a Job that is not attached to a log file Collection (and so presumably one that is not finished and for which the live log display will be active), I clear its log entries from the database with DELETE FROM logs WHERE object_uuid = 'whateveruuid' and then I run the new rake task from the Arvados API server:
Hm, how easy would it be to add a "delete existing logs first" option to the rake task?
I tried this but then it hit me that actually it's not so easy because one wants to clear the graph (probably refresh the whole page) after deleting the data but before re-running the rake task. The graph isn't smart enough to detect log lines being removed, which anyway I don't think should ever happen in Real Life. One would want to send a signal from the rake task to the browser for it to refresh itself, and I judged it not worth while to do this and so we're back to deleting manually, unless someone has another idea.
I have a few requests, mostly with the implied phrase "unless it's a big deal...", and ~ in order by priority:
- At initial page load, and while the job is queued / starting up (I'm assuming, since I've only tried it as a simulation), the log div looks like accidental whitespace. I think some sort of placeholder would be good: perhaps a bootstrap "alert alert-info" box that says the job hasn't produced any runtime statistics yet but when it does they will appear here? Although in general moving stuff around the page is worth avoiding, I wonder if in this case it would be good to use a slim info box and then replace it with a full-height graph div when there's something to display.
I have made an empty graph appear, with no labels, no axes, just horizontal lines. Maybe enough for now?
- I left this out of the issue description and neglected to add it when it came up in discussion: Having the memory figures (especially swap and rss) is expected to be very useful. Can we easily add those? They're not "interval" figures, just "current reading" ("memory in use right now" being a better indicator of performance than "memory allocated in the last 10 seconds").
I had to do some refactoring to accomplish it, since on one log line there are multiple memory stats whereas with other log line types there's only one stat, but it was a healthy refactor and now it's done. Latest commits will be charting two more stats, if detected, per task: rss and swap.
- Things look a bit odd when the first few data points are plotted. Could we have a minimum time span for the x axis, so the first few points appear at the right hand end and scroll to the left?
Done. Currently 10 minutes. Even this is a lot of data to load when first showing the log page if the job has already been underway. I changed it so that the graph data is loaded via AJAX, which helps mitigate the feeling of slowness, but the log line display itself is still pretty slow when loading so much history. It used to be that only 100 records were loaded, I believe, but Tim fixed this. We'll wait and see if this is a problem.
- Hover-detail box formatting:
- Left-align
- Sort by task#, then statistic
- Add the word "task" ("Task 1: cpu 39% ...")
Done, done, done.
- Hover-detail box pinned to top or bottom? I find it a bit jumpy.
Not yet fixed.
- Color assignment: I don't want to get bogged down in this (finding an Ideal Solution might take forever and/or be impossible) but just as a thought to throw out there, I wonder if a scheme like {blue=net, red=disk, green=cpu} / {dark=task1, lighter=task2, ??=task3} could work? Or perhaps this will cease to be an issue at all once we start adding "show/hide/group" controls for both dimensions.
Not yet done.
Updated by Phil Hodgson about 10 years ago
Tom Clegg wrote:
Bug reports after I tried running a job with lots of tasks:
- I don't think this is related to the number of tasks, but it just shows me one point at t="2014". I'm guessing it's a timestamp formatting problem, and my "Do not modify the timestamps in the log text, though" idea was not ideal: the *.log.txt we use as testing data has timestamps like "2014-11-07_23:22:11" but the log entries crunch-dispatch feeds to the event log have timestamps like "Wed Nov 19 05:07:38 2014". Whoops! I've pushed a fix (b40e379) to this branch so future live logs generate UTC just like the test data, but it would be great if the graphing code could accept the previous/current format as well.
So it's now also handling the old date format.
This dataset is still wonky, though, and reveals other problems. There is, for example, at least one huge negative spike in CPU which it calculates at -300,000% usage or something, which really screws up the y axis and makes the whole thing look wrong. I'll try to find some simple way to take care of this.
It also has so many dozens of tasks that it leads to hundreds of data series. I made another change so that undefined datapoints were no longer listed in the hover popup (before they were still listed but with a hyphen next to them), which cleans things up dramatically. I'll see how it looks again after taking care of the above y axis issue with seemingly random outliers.
- While on the Log tab, if I scroll down a bit, the browser window seems to spontaneously scroll itself back to the top of the page now and then. I haven't figured out exactly what circumstances cause this -- it certainly doesn't do it on every event bus update -- or whether it's even related to this branch. I haven't seen this before. Thought I'd mention it here just in case.
Not yet fixed. I don't know why this happens but I've noticed it too.
Another feature request
- Looks like morris can handle dynamic resize
Done.
Updated by Tom Clegg about 10 years ago
Phil Hodgson wrote:
Tom Clegg wrote:
Hm, how easy would it be to add a "delete existing logs first" option to the rake task?
I tried this but then it hit me that actually it's not so easy because one wants to clear the graph (probably refresh the whole page) after deleting the data but before re-running the rake task. The graph isn't smart enough to detect log lines being removed, which anyway I don't think should ever happen in Real Life. One would want to send a signal from the rake task to the browser for it to refresh itself, and I judged it not worth while to do this and so we're back to deleting manually, unless someone has another idea.
Indeed, not so simple. Yes, let's skip it.
I have made an empty graph appear, with no labels, no axes, just horizontal lines. Maybe enough for now?
SGTM.
I had to do some refactoring to accomplish it, since on one log line there are multiple memory stats whereas with other log line types there's only one stat, but it was a healthy refactor and now it's done. Latest commits will be charting two more stats, if detected, per task: rss and swap.
Great!
- Things look a bit odd when the first few data points are plotted. Could we have a minimum time span for the x axis, so the first few points appear at the right hand end and scroll to the left?
Done. Currently 10 minutes. Even this is a lot of data to load when first showing the log page if the job has already been underway. I changed it so that the graph data is loaded via AJAX, which helps mitigate the feeling of slowness, but the log line display itself is still pretty slow when loading so much history. It used to be that only 100 records were loaded, I believe, but Tim fixed this. We'll wait and see if this is a problem.
Ah, I wasn't even thinking of loading the older data here, I was just thinking it would be better if the x scale didn't change wildly during the first few seconds. But having 10 minutes of recent activity is certainly more useful than the cosmetic change I was thinking of.
Looking at the code, it seems like we're supplying two "recent log entries" buffers per tab/page load: one inline, to fill the plain-text <div>, and one as an ajax call with a new push_log action, to fill the graph. It seems like it would be much more efficient if the graphing code could read the historical entries straight out of the plain-text div instead of doing an additional ajax call. (Or am I misreading this?)
In stderr_log_records we could use .select(['event_at']) since we only ever look at that attribute.
There are a few stray tab characters in js files. git diff --check should show them, and git config --global color.ui true should make git diff highlight them. Assuming you have git config --global core.whitespace tab-in-indent,trailing-space, that is.
Although it seems to work in this case, we've had trouble before putting inline javascript tags in ajax-loaded content. I think it would be nicer to move the code to app/assets/javascripts and set up some handler to do the required setup at the right time. The pattern we've been using for this is something like
$(document).on('ready ajax:success', function() {
  $('.graph-divs-have-this-magic-class').not('.graph-is-setup').addClass('graph-is-setup').each(function() {
    // do whatever setup is needed here
  });
});
This should work regardless of when the graph div shows up.
- Hover-detail box formatting:
- Left-align
- Sort by task#, then statistic
- Add the word "task" ("Task 1: cpu 39% ...")
Done, done, done.
Great!
I don't know if it's related, but in the detail popup I'm currently getting two memory-related lines and zero non-memory-related lines for each of my running tasks. (See attached image, from http://curover.se/9tee4-8i9sb-yzcpfzb3rkztdpc. This might be another useful test dataset: it starts off with many short tasks, but also has some longer-running ones.)
One other minor complaint: when I switch to a different tab, then switch back to the Log tab, the graph is blank for a few seconds until (I assume) the refresh timer goes off and it redraws. (I'm not sure why it goes blank in the first place. I'm not worried about it being updated immediately, but it would be less disconcerting if it still showed what it was showing before!) This doesn't happen consistently: perhaps only while log data is arriving?
I think CPU figures have to be subject to scaling because they're reported as core-seconds vs. clock-seconds. If a task uses a multithreaded process, or multiple CPU-intensive processes in a unix pipeline, etc., CPU usage can go to 16.0 on a 16-core machine. (Perhaps it should have a minimum range of 1.0, but auto-scaling just like others would also be OK.)
The negative CPU (and other) statistics doubtless come from crunchstat's way of collecting statistics: at the very beginning and very end of each task, there's an interval during which it can't find the statistics sources for the docker container, so (if it happens to hit its 10-second timer during such an interval) it prints statistics from the worker host instead. Ultimately crunchstat should do a better job of picking one container and sticking with it, but for now I think we can probably just ignore negative deltas in the "interval" numbers.
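i.e., something as blunt as this when the interval deltas are parsed (a sketch, made-up names):

function sanitizeSample(sample) {
  // sample.deltas: [[delta, label], ...] parsed from the "-- interval" section.
  var negative = sample.deltas.some(function (d) { return d[0] < 0; });
  return negative ? null : sample;   // drop the whole data point for that interval
}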
Updated by Tom Clegg about 10 years ago
Updated by Phil Hodgson about 10 years ago
Tom Clegg wrote:
Looking at the code, it seems like we're supplying two "recent log entries" buffers per tab/page load: one inline, to fill the plain-text <div>, and one as an ajax call with a new push_log action, to fill the graph. It seems like it would be much more efficient if the graphing code could read the historical entries straight out of the plain-text div instead of doing an additional ajax call. (Or am I misreading this?)
I tried something else. Now the log page loads right away and then makes an AJAX call to get the last 10 minutes' worth of log lines, each of which triggers an event just as if it had arrived through the websocket. The idea of course is to be DRY and straightforward, but the fact is that if there are thousands of log lines in the past 10 minutes this will be slow any way we slice it. At first I was simply setting a limit on the number of log lines (500), and we could go back to something like this if this turns out to be a problem.
In stderr_log_records we could use .select(['event_at']) since we only ever look at that attribute.
And event_type, and properties[:text], and object_uuid... Good idea though so I selected only these.
There are a few stray tab characters in js files. git diff --check should show them, and git config --global color.ui true should make git diff highlight them. Assuming you have git config --global core.whitespace tab-in-indent,trailing-space, that is.
Sorry! Fixed. Only found the one though.
Although it seems to work in this case, we've had trouble before putting inline javascript tags in ajax-loaded content. I think it would be nicer to move the code to app/assets/javascripts and set up some handler to do the required setup at the right time. The pattern we've been using for this is something like [...]
This should work regardless of when the graph div shows up.
Done.
I don't know if it's related, but in the detail popup I'm currently getting two memory-related lines and zero non-memory-related lines for each of my running tasks. (See attached image, from http://curover.se/9tee4-8i9sb-yzcpfzb3rkztdpc. This might be another useful test dataset: it starts off with many short tasks, but also has some longer-running ones.)
Recently I changed it to not show in the detail popup any datapoints that are undefined at that timestamp. Could that be it?
One other minor complaint: when I switch a different tab, then switch back to the Log tab, the graph is blank for a few seconds until (I assume) the refresh timer goes off and it redraws. (I'm not sure why it goes blank in the first place. I'm not worried about it being updated immediately, but it would be less disconcerting if it still showed what it was showing before!) This doesn't happen consistently: perhaps only while log data is arriving?
Hm! Tricky! I will have to get back to this.
I think CPU figures have to be subject to scaling because they're reported as core-seconds vs. clock-seconds. If a task uses a multithreaded process, or multiple CPU-intensive processes in a unix pipeline, etc., CPU usage can go to 16.0 on a 16-core machine. (Perhaps it should have a minimum range of 1.0, but auto-scaling just like others would also be OK.)
I am detecting the CPU count and dividing by that number. Should that not address this? Or is that CPU count not the "core" count?
The negative CPU (and other) statistics doubtless come from crunchstat's way of collecting statistics: at the very beginning and very end of each task, there's an interval during which it can't find the statistics sources for the docker container, so (if it happens to hit its 10-second timer during such an interval) it prints statistics from the worker host instead. Ultimately crunchstat should do a better job of picking one container and sticking with it, but for now I think we can probably just ignore negative deltas in the "interval" numbers.
Now ignoring deltas < 0.
Updated by Phil Hodgson about 10 years ago
Phil Hodgson wrote:
There is a Firefox bug that makes the x-axis timeline wrong right now. This needs to be fixed.
There is now more consistent use of UTC and Firefox is now giving the same results as Chrome.
Updated by Tom Clegg about 10 years ago
Phil Hodgson wrote:
Tom Clegg wrote:
Looking at the code, it seems like we're supplying two "recent log entries" buffers per tab/page load: one inline, to fill the plain-text <div>, and one as an ajax call with a new push_log action, to fill the graph. It seems like it would be much more efficient if the graphing code could read the historical entries straight out of the plain-text div instead of doing an additional ajax call. (Or am I misreading this?)
I tried something else. Now the log page loads right away and then makes an AJAX call to get the last 10 minutes' worth of log lines, each of which triggers an event just as if it had arrived through the websocket.
Hm, I like the way this defers the slow part of getting the logs, but there are a couple of things I don't like about it too:
- The logs can arrive out of order: websockets will start pushing "live" log messages into the viewer window (and graph) as soon as the Log tab loads, when the push_logs ajax request is just getting started. So logs 1,2,3,4,5 might show up as 4,5,1,2,3 -- or even 4,5,1,2,3,4,5 if 4,5 are generated after websocket connects but before push_logs does its query.
- jobs#push_logs does something that could be done with the existing logs#index?filters= (do the ajax JSON, and put the parse+generate_event logic in an ajax done() instead of a .js view). FWIW, one reason it's nicer to do this with the existing logs controller is that it'll be easy to modify this to request the logs directly from the API server instead of using Workbench as a proxy. The uploader in #3781 is starting down this road. I don't want to hold up the feature over this point, though.
- Will the log graph do something reasonable when it gets logs with timestamps 4,5,1,2,3,6,7,...? I think it's fine if it ignores points 4,5 (that part should only cover a few seconds), and even if it ignores 1,2,3 (although that'd mean we'd lose the historical view sometimes, depending on the race between websockets and ajax) -- as long as 6,7,... still render OK after that. I'm wondering a little about the code around the "this data is supposed to be sorted" comment. What's the worst that can happen there?
- Perhaps there's something easy/nice that we could do for the viewer, perhaps prepend the entire set of ajaxed logs to the viewer when they all arrive, rather than appending?
The idea of course is to be DRY and straightforward, but the fact is that if there are thousands of log lines in the past 10 minutes this will be slow any way we slice it. At first I was simply setting a limit on the number of log lines (500), and we could go back to something like this if this turns out to be a problem.
I wonder if it would be better (and simpler?) to use limit() and ignore the timestamps. If the logs are very busy, it might be better to get <10 minutes' worth of pre-populated graph than to wait for a lot of old logs before the current stuff can be displayed. But this point, too, should surely be set aside as a future bikeshed.
There are a few stray tab characters in js files. git diff --check should show them, and git config --global color.ui true should make git diff highlight them. Assuming you have git config --global core.whitespace tab-in-indent,trailing-space, that is.
Sorry! Fixed. Only found the one though.
I still see some, although they're not all in js files like I insinuated. git diff --check master...origin/4233-graph-job-stats
I don't know if it's related, but in the detail popup I'm currently getting two memory-related lines and zero non-memory-related lines for each of my running tasks. (See attached image, from http://curover.se/9tee4-8i9sb-yzcpfzb3rkztdpc. This might be another useful test dataset: it starts off with many short tasks, but also has some longer-running ones.)
Recently I changed it to not show in the detail popup any datapoints that are undefined at that timestamp. Could that be it?
Hm, I suppose that could be it, if task 1's CPU and memory statistics show up with slightly different timestamps. But I think they usually (at least) come very close together -- which would be good, because plugging that hole in the general case could be a bit more involved than I really want this to get right now. A quick examination of the log file suggests the point in the png wouldn't be explained by that:
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 23 stderr crunchstat: mem 495616 cache 0 swap 0 pgmajfault 34435072 rss
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 23 stderr crunchstat: cpu 157.5000 user 260.0400 sys 20 cpus -- interval 10.0002 seconds 2.9600 user 7.0400 sys
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 23 stderr crunchstat: net:eth0 3390 tx 172925 rx -- interval 10.0002 seconds 0 tx 0 rx
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 12 stderr crunchstat: mem 483328 cache 0 swap 0 pgmajfault 27914240 rss
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 12 stderr crunchstat: cpu 143.2400 user 305.6400 sys 20 cpus -- interval 10.0003 seconds 2.8600 user 7.1400 sys
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 12 stderr crunchstat: blkio:202:3 0 write 0 read -- interval 10.0003 seconds 0 write 0 read
2014-11-22_00:26:26 9tee4-8i9sb-yzcpfzb3rkztdpc 8074 12 stderr crunchstat: net:eth0 3390 tx 180137 rx -- interval 10.0003 seconds 0 tx 0 rx
I think CPU figures have to be subject to scaling because they're reported as core-seconds vs. clock-seconds. If a task uses a multithreaded process, or multiple CPU-intensive processes in a unix pipeline, etc., CPU usage can go to 16.0 on a 16-core machine. (Perhaps it should have a minimum range of 1.0, but auto-scaling just like others would also be OK.)
I am detecting the CPU count and dividing by that number. Should that not address this? Or is that CPU count not the "core" count?
Aha. CPU count is core count, yes, so this does mean legitimate figures will be in [0,1]. However, it also means that running 20 single-threaded tasks will give you 20 series squished into the bottom 5% of the graph, which might not be as revealing. I suppose we could restrict scaling so that Ymax is always between "1 core pegged" and "all cores pegged". This would ensure that 1%-of-one-core still looks as small as it is...
The negative CPU (and other) statistics doubtless come from crunchstat's way of collecting statistics: at the very beginning and very end of each task, there's an interval during which it can't find the statistics sources for the docker container, so (if it happens to hit its 10-second timer during such an interval) it prints statistics from the worker host instead. Ultimately crunchstat should do a better job of picking one container and sticking with it, but for now I think we can probably just ignore negative deltas in the "interval" numbers.
Now ignoring deltas < 0.
Great.
Updated by Phil Hodgson about 10 years ago
Tom Clegg wrote:
Hm, I like the way this defers the slow part of getting the logs, but there are a couple of things I don't like about it too:
- The logs can arrive out of order: websockets will start pushing "live" log messages into the viewer window (and graph) as soon as the Log tab loads, when the push_logs ajax request is just getting started. So logs 1,2,3,4,5 might show up as 4,5,1,2,3 -- or even 4,5,1,2,3,4,5 if 4,5 are generated after websocket connects but before push_logs does its query.
This is true. I added the prepend trick which should mitigate or even eliminate this problem.
jobs#push_logs does something that could be done with the existing logs#index?filters= (do the ajax JSON, and put the parse+generate_event logic in an ajax done() instead of a .js view). FWIW, one reason it's nicer to do this with the existing logs controller is that it'll be easy to modify this to request the logs directly from the API server instead of using Workbench as a proxy. The uploader in #3781 is starting down this road. I don't want to hold up the feature over this point, though.
I still used a jobs route, rather than a logs route, but otherwise followed your sage advice. As you know I'm partial (so to speak) to JS views, but I'll try to kick the habit.
For the out-of-order issue, it's probably not a show-stopper but:
- Will the log graph do something reasonable when it gets logs with timestamps 4,5,1,2,3,6,7,...? I think it's fine if it ignores points 4,5 (that part should only cover a few seconds), and even if it ignores 1,2,3 (although that'd mean we'd lose the historical view sometimes, depending on the race between websockets and ajax) -- as long as 6,7,... still render OK after that. I'm wondering a little about the code around the "this data is supposed to be sorted" comment. What's the worst that can happen there?
Thankfully, Morris does not care about the ordering of data points! It will show them in a sensible way. The "data is supposed to be sorted" is meant for the sake of efficiently dropping data that scrolls off to the left rather than easing Morris's job.
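That pruning is roughly this (a sketch, not the exact code); it only relies on the ordering to stop scanning early:

function pruneOldPoints(points, windowSeconds, nowMs) {
  // points are kept in timestamp order, so everything before the first
  // in-window point can be dropped in one slice.
  var cutoff = nowMs - windowSeconds * 1000;
  var firstKept = 0;
  while (firstKept < points.length && points[firstKept].t < cutoff) {
    firstKept++;
  }
  return points.slice(firstKept);
}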
- Perhaps there's something easy/nice that we could do for the viewer, perhaps prepend the entire set of ajaxed logs to the viewer when they all arrive, rather than appending?
Yes! Done, as mentioned above.
I wonder if it would be better (and simpler?) to use limit() and ignore the timestamps. If the logs are very busy, it might be better to get <10 minutes' worth of pre-populated graph than to wait for a lot of old logs before the current stuff can be displayed. But this point, too, should surely be set aside as a future bikeshed.
I set a 500 record limit rather than calculating 10 minutes. It is definitely faster and simpler.
There are a few stray tab characters in js files. git diff --check should show them, and git config --global color.ui true should make git diff highlight them. Assuming you have git config --global core.whitespace tab-in-indent,trailing-space, that is.
Sorry! Fixed. Only found the one though.
I still see some, although they're not all in js files like I insinuated.
git diff --check master...origin/4233-graph-job-stats
Sorry. I think I got 'em now.
I don't know if it's related, but in the detail popup I'm currently getting two memory-related lines and zero non-memory-related lines for each of my running tasks. (See attached image, from http://curover.se/9tee4-8i9sb-yzcpfzb3rkztdpc. This might be another useful test dataset: it starts off with many short tasks, but also has some longer-running ones.)
Recently I changed it to not show in the detail popup any datapoints that are undefined at that timestamp. Could that be it?
Hm, I suppose that could be it, if task 1's CPU and memory statistics show up with slightly different timestamps. But I think they usually (at least) come very close together -- which would be good, because plugging that hole in the general case could be a bit more involved than I really want this to get right now. A quick examination of the log file suggests the point in the png wouldn't be explained by that:
[...]
I could not reproduce this with that dataset. Hopefully fixed. :)
I am detecting the CPU count and dividing by that number. Should that not address this? Or is that CPU count not the "core" count?
Aha. CPU count is core count, yes, so this does mean legitimate figures will be in [0,1]. However, it also means that running 20 single-threaded tasks will give you 20 series squished into the bottom 5% of the graph, which might not be as revealing. I suppose we could restrict scaling so that Ymax is always between "1 core pegged" and "all cores pegged". This would ensure that 1%-of-one-core still looks as small as it is...
Hmm... I'm not sure I follow, though it sounds like I'm almost there. Perhaps a slight modification to the way logs are recorded could help differentiate?
Updated by Tom Clegg about 10 years ago
Phil Hodgson wrote:
This is true. I added the prepend trick which should mitigate or even eliminate this problem.
Ah, that looks less intrusive than I imagined, great.
jobs#push_logs does something that could be done with the existing logs#index?filters= (do the ajax JSON, and put the parse+generate_event logic in an ajax done() instead of a .js view). FWIW, one reason it's nicer to do this with the existing logs controller is that it'll be easy to modify this to request the logs directly from the API server instead of using Workbench as a proxy. The uploader in #3781 is starting down this road. I don't want to hold up the feature over this point, though.
I still used a jobs route, rather than a logs route, but otherwise followed your sage advice. As you know I'm partial (so to speak) to JS views, but I'll try to kick the habit.
Looks good
Thankfully, Morris does not care about the ordering of data points! It will show them in a sensible way. The "data is supposed to be sorted" is meant for the sake of efficiently dropping data that scrolls off to the left rather than easing Morris's job.
Ah, great. Sounds like we're doing pretty well wrt sorting, then.
I set a 500 record limit rather than calculating 10 minutes. It is definitely faster and simpler.
Great
tabs
Sorry. I think I got 'em now.
LGTM
Hm, I suppose that could be it, if task 1's CPU and memory statistics show up with slightly different timestamps. But I think they usually (at least) come very close together -- which would be good, because plugging that hole in the general case could be a bit more involved than I really want this to get right now. A quick examination of the log file suggests the point in the png wouldn't be explained by that:
[...]
I could not reproduce this with that dataset. Hopefully fixed. :)
I'll try it again with 244dddb and report back...
Aha. CPU count is core count, yes, so this does mean legitimate figures will be in [0,1]. However, it also means that running 20 single-threaded tasks will give you 20 series squished into the bottom 5% of the graph, which might not be as revealing. I suppose we could restrict scaling so that Ymax is always between "1 core pegged" and "all cores pegged". This would ensure that 1%-of-one-core still looks as small as it is...
Hmm... I'm not sure I follow, though it sounds like I'm almost there. Perhaps a slight modification to the way logs are recorded could help differentiate?
Earlier crunchstat versions reported "all cores pegged" as 100%, but that was confusing because most other tools (like top) report "16 cores pegged" as 1600%. We decided it made more sense in general to log the raw numbers (in this case #user/sys seconds of usage reported by the kernel over an interval) and let downstream readers decide whether to normalize it one way or another for human consumption.
So I think the way to get easy-to-interpret results for both single-threaded and many-threaded tasks is:
- 10s interval, 16 cpus, 320s consumption → data point hidden/off-the-map, detail popup says 3200%
- 10s interval, 16 cpus, 160s consumption → data point at Ymax, detail popup says 1600%
- 10s interval, 16 cpus, 10s consumption (and no data points are >10s consumption) → data point at Ymax, detail popup says 100%
- 10s interval, 16 cpus, 5s consumption (and no data points are >5s consumption) → data point at Ymax÷2, detail popup says 50%
Does that make sense?
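In code form, the rule above could look roughly like this (a sketch, not the branch's implementation; maxCoresBusyOnScreen is a made-up input meaning the largest per-task "cores busy" value currently plotted):

function cpuPoint(cpuSeconds, dt, nCpus, maxCoresBusyOnScreen) {
  var coresBusy = cpuSeconds / dt;                    // 1.0 = one core fully busy
  var yMax = maxCoresBusyOnScreen > 1 ? nCpus : 1;    // pick the "all cores" or "one core" scale
  return {
    y: Math.min(coresBusy / yMax, 1),                 // anything above Ymax gets clipped at the top
    popup: Math.round(coresBusy * 100) + '%'          // "top"-style: two pegged cores = 200%
  };
}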
Updated by Phil Hodgson about 10 years ago
Good news: I solved the problem you found with the missing non-memory datapoints. I didn't realize that log lines could come in with newlines! Now splitting those up and everything is fine.
Updated by Phil Hodgson about 10 years ago
For the CPU-pegging business, I've attempted something at this commit: 7ab6b64c5fa3b958752ecb22751630b6e1016bd1. Basically I interpreted your analysis this way: "divide the stat by the number of cpus unless the time count is less than the interval length". I hope that makes sense.
Updated by Phil Hodgson about 10 years ago
...which didn't work perfectly either, because sometimes the total CPU time is for example 10.02 seconds and the interval length is 10.01 seconds. I rounded to 1 decimal place before making the comparison and now it seems to work more reasonably.
Updated by Tom Clegg about 10 years ago
- 10s interval, 16 cpus, 80s consumption (and no data points are >80s consumption) → data point at Ymax, detail popup says 800%
(i.e., we can scale dynamically within the "one cpu busy" to "all cpus busy" range.)
But the "pick one of two scales" way seems quite reasonable too, and we've covered the points I was worried about, which are:- Single-threaded tasks ranging from 10-100% CPU are clearly visible (not squished against the X axis)
- Multi-threaded tasks are visible (not above Ymax)
- Detail popup reports CPU usage in a way similar to "top" (i.e., pegging two CPUs looks like 200%).
Therefore, I think this is ready to merge without further bikeshed.
Updated by Anonymous about 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 83 to 100
Applied in changeset arvados|commit:968526e233fede288bf332aab724c170fe88345c.