Bug #2934

Limit impact of chatty crunch job on websocket/Log mechanism

Added by Tom Clegg over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
-
Start date:
06/10/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
1.0

Description

Current code limits each job to one Log entry per second. Additional limits to impose:
  • 64K Log entries per job
  • 64Mbytes of log data per job

When one of these limits is reached, add one more entry like "Server configured limit reached (crunch_limit_log_events_per_job: 65536). Subsequent logs truncated".

Subsequent logs will still get stored in Keep, but won't be available through event bus.

These limits (and the current LOG_BUFFER_SIZE constant) should be configurable. Add to services/api/config/application.default.yml in the common section:

  crunch_limit_log_events_per_job: 65536
  crunch_limit_log_event_bytes_per_job: 67108864
  crunch_limit_log_bytes_per_event: 4096

Then use Rails.configuration.crunch_limit_log_bytes_per_event instead of LOG_BUFFER_SIZE, etc.

Optional: Add crunch_limit_seconds_between_log_events (this is a minimum, and currently hard coded at 1, but we can probably support 0, 0.5, 5, etc., easily enough).


Subtasks

Task #3012: Review 2934-limit-crunch-logsResolvedPeter Amstutz

Associated revisions

Revision 4a72800f (diff)
Added by Tim Pierce over 6 years ago

2934: add limits for crunch job log generation.

  • crunch_limit_log_events_per_job
  • crunch_limit_log_event_bytes_per_job

    Limit the number of events a job may log, and the total number of
    bytes a job may log.

  • crunch_log_bytes_per_event
  • crunch_log_seconds_between_events

    Set minimum logging thresholds: event logs will only be flushed to the
    database when at least crunch_log_bytes_per_event have been generated,
    or crunch_log_seconds_between_events have passed.

Integration test CrunchDispatchTest exercised creating a job.

Refs #2934.

Revision 54e43ffa (diff)
Added by Tim Pierce over 6 years ago

2934: add launch_crunch_dispatch to integration test

Added functionality for launching crunch-dispatch in
crunch_dispatch_test.rb while integration tests are running.

Refs #2934.

Revision d7e23d9a (diff)
Added by Tim Pierce over 6 years ago

2934: count events and bytes in-memory

To avoid thrashing the database, crunch-dispatch.rb counts log events
and bytes in RAM instead of reading the database each time.

Refs #2934.

Revision 525c9425 (diff)
Added by Tim Pierce over 6 years ago

2934: add "bar" repository to test fixture

Add the "bar" repository to test/fixtures/repositories.yml so
CrunchDispatchTest may create a job using a commit from that
repository. (Change lost in a bad rebase.)

Refs #2934

Revision 783343e5
Added by Tim Pierce over 6 years ago

Merge branch 'master' into 2934-limit-crunch-logs

Refs #2934

Revision 46ab8685 (diff)
Added by Tim Pierce over 6 years ago

2934: add crunch_dispatch_test repository

Move files for crunch_dispatch_test to their own test fixture
repository.

Refs #2934.

Revision 0143d26d (diff)
Added by Tim Pierce over 6 years ago

2934: document stderr messages for commit tests

Some unit tests for the Commit models issue alarming but expected git
error messages on stderr. Added comments documenting these in
commits_controller_test.rb.

Refs #2934.

Revision 95811b3a
Added by Tim Pierce over 6 years ago

Merge branch '2934-limit-crunch-logs'

Closes #2934

Revision ae8829d3 (diff)
Added by Tim Pierce over 6 years ago

2934: add crunch-dispatch default settings

Add default values to application.default.yml for:
  • crunch_limit_log_events_per_job
  • crunch_limit_log_event_bytes_per_job
  • crunch_log_bytes_per_event
  • crunch_log_seconds_between_events

Refs #2934.

Revision 3b72eb35
Added by Tim Pierce over 6 years ago

Merge branch '2934-limit-crunch-logs-bugfix'

Refs #2934.

History

#1 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg over 6 years ago

  • Assigned To set to Tim Pierce

#3 Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress

#4 Updated by Tim Pierce over 6 years ago

As of d7e23d9, crunch-dispatch.rb keeps event counts and byte counts in memory instead of reading them from disk.

Per discussion on IRC: we will almost certainly need to tune and tweak these parameters to find good settings. The ones that are defined in the story should be considered a starting point and not definitive. They're all configurable parameters.

We could implement byte limits per event rather than across all events (i.e. limit to 1Kb per event rather than 64MB total for all logged events), but I don't see the benefit. Applying the limit as a cap on total bytes generated gives jobs the flexibility to generate a few log events that are longer than the average without truncating them, and still prevents jobs from running away and filling the entire disk.

#5 Updated by Peter Amstutz over 6 years ago

  • Suggest changing the read size on read_nonblock() in #read_pipes to Rails.configuration.crunch_log_bytes_per_event instead of hardcoded at 2**20, otherwise if something sufficiently spammy could write out (a lot) more than Rails.configuration.crunch_log_bytes_per_event bytes in a single event.

#6 Updated by Tim Pierce over 6 years ago

Peter Amstutz wrote:

  • Suggest changing the read size on read_nonblock() in #read_pipes to Rails.configuration.crunch_log_bytes_per_event instead of hardcoded at 2**20, otherwise if something sufficiently spammy could write out (a lot) more than Rails.configuration.crunch_log_bytes_per_event bytes in a single event.

crunch_log_bytes_per_event isn't an upper limit, it's a lower one (which is why I didn't add "limit" to the name) -- it's the lower threshold for flushing the event to the Log table, if events are getting created more often than once per second. So we expect and allow that some events will actually produce more than crunch_log_bytes_per_event bytes at a time. (I'm absolutely open to changing the name: crunch_log_min_bytes_per_event? crunch_log_byte_threshold_for_fast_events? I dunno.)

#7 Updated by Peter Amstutz over 6 years ago

My understanding of the current logic:
  • If it's been at least N seconds (where currently N=1) then flush the buffer
  • If there are at least M bytes (where M is from the configuration file) then flush the buffer
  • There's also some line buffering going on, so I think it does generally force events to break evenly across lines.

My suggestion is simply that if you only read M bytes at a time, then you'll produce records that are at most (2*M-1) bytes instead of the current 1 MiB read size which means you could get a 1 MiB record, which seems really unwieldy and unfriendly to downstream log consumers. However, I'm inclined to let this go, and see how it performs in testing.

#8 Updated by Tim Pierce over 6 years ago

OK. I agree that sounds like a good idea, but would just as soon confirm whether it really is a problem first. (It's Been Working Okay So Far [tm])

Changes at 46ab868 per our offline discussion to add a new test fixture repository to test/test.git.tar rather than mucking with the repos that are already there. This still requires a small change to commits_controller_test.rb to account for the fact that there's a whole new repository in there.

#9 Updated by Peter Amstutz over 6 years ago

Not a fatal error but still disconcerting:

[ 40/196] Arvados::V1::CommitsControllerTest#test_test_find_commit_rangefatal: bad object 31ce37fe365b3dc204300a3e4c396ad333ed0556
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: ambiguous argument 'tag1': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57

#10 Updated by Tim Pierce over 6 years ago

Peter Amstutz wrote:

Not a fatal error but still disconcerting:

[...]

Yes, these are expected -- the tests are attempting to look up the specific revision 077ba2ad3ea24a929091a9e6ce545c93199b8e57 on every repository readable to users(:active), and now there are additional test fixture repositories readable to :active that do not include that commit. It's definitely disconcerting. Added documentation in 0143d26 to explain this output.

#11 Updated by Tim Pierce over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:95811b3a901df3ceab701fe18e9ec1bb614bef90.

Also available in: Atom PDF