Bug #9709

[API] logging filled up the disk -- API server should not log entire manifests

Added by Nico César over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 08/03/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

Qr1hi had a 23 GB production.log. Do a postmortem of that and prevent it in the future.

The file is located at qr1hi:/data1/production.log.

There is also an already compressed 2.3 GB log:

qr1hi:/var/www/arvados-api/current/log# ls -l production.log-20160802.bz2
-rw-r--r-- 1 www-data www-data 2404165961 Aug  2 06:29 production.log-20160802.bz2

Goal of this ticket:
  • the API server should not log entire manifests.
  • ideally, figure out what caused the excessive logging.

Related ticket: #9696, which is what Jiayong was working on when the log file filled up the disk.

When there is a conclusion on the root cause, please delete /data1/production.log on qr1hi, or ask a sysadmin to do it.
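The fix that eventually landed suppresses manifest_text from collection audit logs. As a rough illustration of the idea (a minimal stand-alone sketch, not the actual Arvados code; the function name and hash-based interface are hypothetical), the model's attribute hash can be filtered before it is copied into a Log row:

```ruby
# Hypothetical sketch: attributes that should never be copied into
# audit-log rows. In Arvados this later became site-configurable
# (the unlogged_attributes setting), but a hardcoded list shows the idea.
UNLOGGED_ATTRIBUTES = ['manifest_text'].freeze

# Return a copy of a model's attribute hash with the unlogged attributes
# removed, so the Log row still records that the object changed without
# storing the (potentially huge) manifest text.
def suppress_unlogged(attrs)
  attrs.reject { |k, _v| UNLOGGED_ATTRIBUTES.include?(k.to_s) }
end
```

With this in place, a "collection updated" log entry keeps small fields such as name and uuid but drops the manifest body.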


Subtasks

Task #9724: Review branch 9709-dont-log-manifest-texts (Resolved, Radhika Chippada)

Task #9855: Review 9709-restore-manifest-logging (Resolved, Radhika Chippada)


Related issues

Related to Arvados - Bug #15167: Complete logs table management docs with unlogged_attributes setting (Resolved, 05/14/2019)

Associated revisions

Revision 41901325
Added by Radhika Chippada over 4 years ago

closes #9709
Merge branch '9709-dont-log-manifest-texts'

Revision a786d20e
Added by Tom Clegg over 4 years ago

Merge branch '9818-logstash'

refs #9818
refs #9709

Revision 7213d309
Added by Tom Clegg over 4 years ago

Merge branch '9709-restore-manifest-logging'

closes #9709

History

#1 Updated by Nico César over 4 years ago

  • Project changed from OPS to Arvados
  • Subject changed from API logging filled up the disk to [API] logging filled up the disk

#2 Updated by Nico César over 4 years ago

  • Target version set to 2016-08-17 sprint

#3 Updated by Radhika Chippada over 4 years ago

  • Assigned To set to Tom Clegg
  • Story points set to 0.5

#4 Updated by Radhika Chippada over 4 years ago

  • Assigned To changed from Tom Clegg to Lucas Di Pentima
  • Story points changed from 0.5 to 1.0

#5 Updated by Ward Vandewege over 4 years ago

  • Subject changed from [API] logging filled up the disk to [API] logging filled up the disk -- API server should not log entire manifests
  • Description updated (diff)

#6 Updated by Ward Vandewege over 4 years ago

  • Description updated (diff)

Update: it turns out that Jiayong did an rm -rf on a large collection right before this happened. That would explain the explosion of log entries, so it is likely the root cause. I suggest updating the goal of this ticket to just "the API server should not log entire manifests".

#7 Updated by Radhika Chippada over 4 years ago

  • Status changed from New to In Progress
  • Assigned To changed from Lucas Di Pentima to Radhika Chippada

#8 Updated by Radhika Chippada over 4 years ago

Commit 39292137 suppresses manifest_text in collection logs. Added a test by repurposing an existing test.

#9 Updated by Lucas Di Pentima over 4 years ago

Looks good to me; it took me a while to understand that we were logging through a Log object.
If we are logging to the database using the Log model, and the creation of Log objects is itself logged to the production.log file, wouldn't it be possible to avoid logging the Log model's own events to that file? I spent some time searching for this, and all I could find were custom logging solutions, not ways to customize the built-in logger.

#10 Updated by Radhika Chippada over 4 years ago

Lucas: since we use the log files for debug logging as well, the log files and the Log table serve different purposes. The Log table is also used by the crunch dispatcher to log events, and various clients such as jobs and Workbench subscribe to live logs over websockets, which I think are served from the Log table. BTW, we also have a task in the API server named delete_old_job_logs.rake that deletes job log entries as configured (because the crunch job logs are expected to be in Keep by then).
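The pruning idea behind delete_old_job_logs.rake can be sketched in plain Ruby (an illustrative stand-in, not the actual rake task; the function name and entry format are hypothetical):

```ruby
# Hypothetical sketch of the pruning logic: drop log entries older than
# a configured maximum age and keep the rest. The real task operates on
# rows in the logs table; here entries are plain hashes carrying a
# :created_at timestamp.
def prune_old_logs(entries, max_age_seconds, now: Time.now)
  entries.select { |e| (now - e[:created_at]) <= max_age_seconds }
end
```

A scheduled job would run this (as a DELETE against the logs table) so that job-log rows whose content is already archived in Keep do not accumulate indefinitely.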

#11 Updated by Radhika Chippada over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:419013258ddead1ece05327e96017222f4aa7de8.

#12 Updated by Tom Clegg over 4 years ago

  • Category set to API
  • Status changed from Resolved to In Progress
  • Assigned To changed from Radhika Chippada to Tom Clegg

My impression is that the problem report concerns a case of production.log filling up with Rails request logs -- not SQL logs or our own logs about model changes.

For example, an unauthenticated client could send a lot of unroutable requests with huge POST bodies and produce the same problem, even though those requests never even hit our application code.
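One way to blunt that failure mode (a hedged sketch only, not something this ticket's branches implement; the helper name is hypothetical) is to cap how much of any single value can make it into a log line:

```ruby
# Hypothetical helper: truncate oversized values before they are
# interpolated into a log line, so a multi-megabyte POST body (or a
# manifest) cannot bloat production.log on its own.
def truncate_for_log(value, limit = 1024)
  s = value.to_s
  return s if s.length <= limit
  "#{s[0, limit]}...[truncated #{s.length - limit} more bytes]"
end
```

Rails' built-in config.filter_parameters covers a related case (redacting named parameters), but it does not bound the size of arbitrary logged values the way a truncating helper does.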

Losing the copy of the old manifest_text in the logs table addresses a different problem ("postgres database is too big", not "production.log is too big") and represents a loss of functionality. I'd like to revert this, or at least put it behind a config knob, because
  • I expect some sites still want an audit trail even though it makes their Postgres database bigger.
  • #9363.

#13 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-08-17 sprint to 2016-08-31 sprint

#14 Updated by Nico César over 4 years ago

Tom Clegg wrote:

My impression is that the problem report concerns a case of production.log filling up with Rails request logs -- not SQL logs or our own logs about model changes.

For example, an unauthenticated client could send a lot of unroutable requests with huge POST bodies and produce the same problem, even though those requests never even hit our application code.

You're correct, but we went from ~100 MB to 27 GB (uncompressed) per day. Even a DDoS would have a hard time creating this kind of log size.

Losing the copy of the old manifest_text in the logs table addresses a different problem ("postgres database is too big", not "production.log is too big") and represents a loss of functionality. I'd like to revert this, or at least put it behind a config knob, because
  • I expect some sites still want an audit trail even though it makes their Postgres database bigger.
  • #9363.

I have no real objection here, but bear in mind that if "the possibility of an audit" is really affecting Postgres performance too, we need to review the log-storage process as a whole. For the record, the pg database is stored on a separate /data partition, while production.log is on a much smaller one; that's why the disk filled up.

#15 Updated by Tom Clegg over 4 years ago

Nico Cesar wrote:

You're correct, but we went from ~100 MB to 27 GB (uncompressed) per day. Even a DDoS would have a hard time creating this kind of log size.

1 MB POST request per second = 86 GB logs per day ... not hard enough!

#16 Updated by Tom Clegg over 4 years ago

9709-restore-manifest-logging

test 2bd7edf29bad74e61da31b801afd85b4c33ef7fe

#17 Updated by Radhika Chippada over 4 years ago

Just a couple minor comments.

  • I think it would be nice to have tests for both cases here, with and without the attribute-suppression config setting: test "manifest_text not included in collection logs". You might also want to consider testing with multiple values in the config setting.
  • In the config YAML comment, "# Attributes to avoid saving in events and audit logs": can you use the word "suppress" instead of "saving" here?

LGTM

#18 Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:7213d3096cdb5d5e03b559a04f88fcd22a835076.

#19 Updated by Tom Morris over 1 year ago

  • Related to Bug #15167: Complete logs table management docs with unlogged_attributes setting added
