Idea #5554
closed[API] Delete crunch job log messages from logs table after the jobs finish
Description
After a job completes, the full job log is written to Keep, so it isn't necessary to leave the (possibly incomplete) copy in the Postgres database as well.
In order to do this on the fly without adversely affecting users, we would need to ensure the Workbench log viewer retrieves the log messages from Keep when possible -- including the case of the Log tab of pipeline_instances#show when some of the pipeline's jobs are still running and some have had their logs written to Keep and deleted from the API server's database.
Alternatively, we could get nearly equivalent behavior by scheduling a task to delete the "stderr" log entries for jobs that finished >30 days ago. This would address the long term proliferation problem while permitting more admin troubleshooting opportunities, avoiding race conditions like "last log deleted before it got sent to workbench via websockets", and avoiding logs falling through the cracks due after "job finished but cleanup statement failed", "crunch-job crashed", etc.- The 30 day threshold should be configurable.
- Ideally the cleanup script should be an API client, but if there's no "delete records matching filters" API, we might have to make do with a rake task for now.
- crunch-dispatch already takes care of marking jobs as failed if they have been stuck in the queue too long, so this cleanup job doesn't have to worry about those -- just check the finished_at timestamp.
Updated by Radhika Chippada almost 10 years ago
- Category set to API
While addressing #5534 (Log tab takes too long to load ...), I noticed that the pipeline instance tb05z-d1hrv-lx5w1pngtl2poj9 had more than 100,000 log lines.
- Should we consider keeping the newest 10,000 or so log lines per object in the log table?
- Do we want to consider archiving logs table contents that are older than 2 weeks or so?
- When we do this, we would need to revisit the implementation of log viewer to fetch log files from keep for "old" objects.
Updated by Radhika Chippada almost 10 years ago
- Subject changed from [API] Purse logs table contents to [API] Purge logs table contents
Updated by Radhika Chippada almost 10 years ago
- Tracker changed from Bug to Idea
- Target version set to 2015-04-29 sprint
Updated by Tom Clegg almost 10 years ago
- Subject changed from [API] Purge logs table contents to [API] Delete crunch job log messages from logs table after the jobs finish
- Description updated (diff)
Updated by Tom Clegg almost 10 years ago
Radhika Chippada wrote:
While addressing #5534 (Log tab takes too long to load ...), I noticed that the pipeline instance tb05z-d1hrv-lx5w1pngtl2poj9 had more than 100,000 log lines.
Does this refer to crunch job logs ("stderr"), or "pipeline_instance updated" events? If the latter, most are probably superfluous and we should find somewhere to quell logs for update events that don't actually [need to] change any attributes.
Updated by Brett Smith over 9 years ago
5554-delete-job-log-rows-wip is up for review. Suggested test: add finished_at: <%= 14.days.ago.to_s(:db) %>
to the previous_job_run
fixture. Watch its corresponding stderr log (id 7) be deleted, or not, when you run the task.
Updated by Radhika Chippada over 9 years ago
Review comments:
- Can you please make the default setting value 30 in application.default.yml file?
- Would you please add a comment at the top of the rake task describing that logs older than the configured time (found in application.yml) for finished jobs …. will be deleted?
- The comment in the config file is a bit confusing: “If a single job in a pipeline runs longer than this, client applications … ” Do you mean to say the pipeline is still running “other” jobs, but has a job or two that are finished earlier than the configured time whose logs will be removed?
- So what happens if this task is executed without updating the code areas that are depending on the logs from the table (stderr_log_history method in jobs_helper.rb)? I am not sure if we should be enabling this task until these areas are addressed.
- Without this, the “disable log tab when user cannot read logs” will fail again and I think this is a concern.
- And I also think the pipeline instance’ log tab would be empty in the meantime?
- Would it be possible to move the core logic from the task into the lib and write a test for it (similar to simulate_job_log or something?)
- Given the concern about "log tab disabled" feature above, I am thinking we might want to add a test that does the purging of log table data and verifies the log tab is not disabled (when it should not be) after addressing jobs_helper updates.
Updated by Brett Smith over 9 years ago
Radhika Chippada wrote:
- Can you please make the default setting value 30 in application.default.yml file?
- Would you please add a comment at the top of the rake task describing that logs older than the configured time (found in application.yml) for finished jobs …. will be deleted?
Both done.
- The comment in the config file is a bit confusing: “If a single job in a pipeline runs longer than this, client applications … ” Do you mean to say the pipeline is still running “other” jobs, but has a job or two that are finished earlier than the configured time whose logs will be removed?
Yes. I have reworded the comment to try to clarify this, and make it a little more accurate.
- So what happens if this task is executed without updating the code areas that are depending on the logs from the table (stderr_log_history method in jobs_helper.rb)? I am not sure if we should be enabling this task until these areas are addressed.
- Without this, the “disable log tab when user cannot read logs” will fail again and I think this is a concern.
- And I also think the pipeline instance’ log tab would be empty in the meantime?
stderr_log_history
is still likely to find logs for the pipeline instance itself, which only exist in the logs table and are not removed by this branch. Workbench's logic to disable a pipeline instance's log tab should continue working as designed.
It is true that the tab will now display less information once a pipeline instance's job logs are flushed from the logs table. This is an expected consequence of the branch. I talked with Tom a bit about the possibility of fetching logs from Keep in this case, as discussed in the description. However, we would prefer to punt that work until we have a better way to stream large logs through the Workbench UI generally, so it's out of scope for this branch.
Since your comment about testing was tied to this, I've left that alone for now. If I'm mistaken and you still think it's worth writing a test given this simpler spec, let me know.
Now at 0bef99f.
Updated by Radhika Chippada over 9 years ago
Brett said:
stderr_log_history
is still likely to find logs for the pipeline instance itself, which only exist in the logs table and are not removed by this branch. Workbench's logic to disable a pipeline instance's log tab should continue working as designed.
Ok, that helps. So, the logs such as when the pipeline instance was created or started running etc would still be available.
It is true that the tab will now display less information once a pipeline instance's job logs are flushed from the logs table. This is an expected consequence of the branch. I talked with Tom a bit about the possibility of fetching logs from Keep in this case, as discussed in the description. However, we would prefer to punt that work until we have a better way to stream large logs through the Workbench UI generally, so it's out of scope for this branch.
So, in this case, basic information such as when the pipeline was created, finished etc is still available and displayed. That can be sufficient in the mean time. I think we need to document or clarify this to science team so that they understand why the log is very limited in such as case.
Since your comment about testing was tied to this, I've left that alone for now. If I'm mistaken and you still think it's worth writing a test given this simpler spec, let me know.
I think it would still be nice to have a test (as you described in note #10), possibly rearranging code if need be. However, if this is too much work, I understand and that is fine. BTW, I was able to run the task and verify that it is working correctly against my dev db.
Updated by Brett Smith over 9 years ago
My earlier comment was mistaken. While it's true that this branch does not remove pipeline stderr logs, currently crunch-dispatch only writes those when there's an error. So your earlier concerns are still valid.
I have updated the branch to have Workbench link to logs for finished jobs. This is not the nicest presentation, but it's the best option in scope for this branch. Like I mentioned on IRC, we want to improve Workbench's display of large logs generally before we spend more time on this.
If you have comments on the appearance of the log links, please let me know exactly what you'd like so I can implement it directly.
There were some semi-related changes to the anonymous access test because of this, and limitations in Capybara. See the git commit message for details.
I have also added tests for the Rake task. Now at fbf2382d. Thanks.
Updated by Radhika Chippada over 9 years ago
Just a couple minor comments:
- has_readable_logs regexp: can we use CollectionHelper for regexp matching rather than repeating it here?
- Can we add a bold title “Logs for jobs in progress” or something like that just above the “event_log_div” in Logs tab? I am thinking this might help distinguish the two sections in the Log page. If you do not like how it looks, I am fine with what we have.
- Thanks for adding all those additional tests.
LGTM with or without those. Thanks.
Updated by Brett Smith over 9 years ago
Radhika Chippada wrote:
- has_readable_logs regexp: can we use CollectionHelper for regexp matching rather than repeating it here?
I have switched it to using Keep::Locator.parse
. CollectionHelper does a lot of grouping that isn't really applicable for this use. Using Keep::Locator removes the duplication without requiring some of those extra parsing details.
- Can we add a bold title “Logs for jobs in progress” or something like that just above the “event_log_div” in Logs tab? I am thinking this might help distinguish the two sections in the Log page. If you do not like how it looks, I am fine with what we have.
I'm happy to add some kind of divider, but the text "Logs for jobs in progress" does not exactly describe it right. It will include any logs in the logs table, as long as at least one job is still running. It will also always include any pipeline handling error messages, no matter what. If you want to work on different text with me in IRC, I'm happy to figure something out and add it.
Now at d5d43d2. Thanks.
Updated by Brett Smith over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:97403e08475b328115373a2c6a23e82116199aad.