Bug #9848

[Crunch2] [API] Ensure saved logs and outputs from containers are accessible to non-admin users

Added by Tom Clegg about 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
08/24/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

crunch-run saves container logs (and outputs) with owner_uuid = uuid of the dispatch user.

Currently, this means only admins are able to read them.

A user who submits a container request should be able to read the log (and output) of the corresponding container.


Subtasks

Task #10226: review 9848-finalize-on-reuseResolvedPeter Amstutz

Task #10043: Review 9848-copy-container-outputResolvedPeter Amstutz

Task #10044: Copy logs and outputs to CR when a container completesResolvedTom Clegg

Task #10163: Finalize committed container request immediately if reusing a finished containerResolvedTom Clegg


Related issues

Related to Arvados - Bug #9799: [Crunch2] Ensure live logs for containers are accessible to non-admin usersResolved08/17/2016

Associated revisions

Revision 6613ec1e
Added by Tom Clegg almost 5 years ago

Merge branch '9848-copy-container-output' refs #9848

Revision 06f40b3d
Added by Tom Clegg almost 5 years ago

Merge branch '9848-finalize-on-reuse' closes #9848

History

#1 Updated by Tom Clegg about 5 years ago

Possible approach:

In ContainerRequest#container_completed! in source:services/api/app/models/container_request.rb, create new log and output collections with the same owner_uuid as the container request itself (copying manifest_text from the container log/output).

In crunch-run, set expires_at=(now + permissionTTL) when saving log and output collections (otherwise, all logs and outputs will use up storage space forever, even after users delete their copies).

#2 Updated by Tom Clegg about 5 years ago

  • Assigned To set to Tom Clegg

#3 Updated by Tom Morris about 5 years ago

  • Story points set to 1.0
  • Target version set to Arvados Future Sprints

#4 Updated by Tom Clegg about 5 years ago

In the container re-use code, we'll need to check whether the log is still available (as we currently do with outputs in crunch1)

#5 Updated by Tom Clegg about 5 years ago

  • Subject changed from [API] Ensure saved logs and outputs from containers are accessible to non-admin users to [Crunch2] [API] Ensure saved logs and outputs from containers are accessible to non-admin users

#6 Updated by Tom Clegg about 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-09-28 sprint

#7 Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg almost 5 years ago

9848-copy-container-output @ feff3b3

#9 Updated by Peter Amstutz almost 5 years ago

  • The log checking code:
      readable_pdh = Collection.
        readable_by(current_user).
        select('portable_data_hash')
      completed = completed.
        where("log in (#{readable_pdh.to_sql})").
        order('finished_at asc').
        limit(1)

This kind of hurts my head. I get this is using a subquery, so we're assuming the postgres query planner will DTRT. However, if it does not optimize the query, it could inefficiently enumerate every collection visible to the user. Can we verify that it is going to be smart and execute the query in a way that ensures that it uses the portable_data_hash index?

  • Right above the new code I noticed this line (from Lucas):
    completed = candidates.where(state: Complete).where(exit_code: 0)

I don't think we want to filter on exit code, since a deterministic container would always return the same exit code? Also a few tools like "grep" return non-zero codes for conditions that are not errors.

  • It appears that when job reuse happens, the container_request state remains in Committed (which means it won't create the output/log collections). I think there needs to be an additional hook for when the container request is in Committed state which checks to see if the referenced container is in Complete/Cancelled state and calls container_completed!

#10 Updated by Tom Clegg almost 5 years ago

Peter Amstutz wrote:

  • The log checking code:

[...]

This kind of hurts my head. I get this is using a subquery, so we're assuming the postgres query planner will DTRT. However, if it does not optimize the query, it could inefficiently enumerate every collection visible to the user. Can we verify that it is going to be smart and execute the query in a way that ensures that it uses the portable_data_hash index?

"Explain analyze" on a small database doesn't necessarily tell us very much about real-world performance (postgres chooses seq scan over an index when it knows the table is small) but here it is:

arvados_production=# explain analyze SELECT "containers".* FROM "containers" WHERE (log in (SELECT portable_data_hash FROM "collections" WHERE (expires_at IS NULL or expires_at > statement_timestamp()))) ORDER BY finished_at asc LIMIT 1;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1136.43..1136.43 rows=1 width=2299) (actual time=0.031..0.031 rows=0 loops=1)
   ->  Sort  (cost=1136.43..1136.43 rows=1 width=2299) (actual time=0.031..0.031 rows=0 loops=1)
         Sort Key: containers.finished_at
         Sort Method: quicksort  Memory: 25kB
         ->  Hash Join  (cost=1072.28..1136.42 rows=1 width=2299) (actual time=0.014..0.014 rows=0 loops=1)
               Hash Cond: ((collections.portable_data_hash)::text = (containers.log)::text)
               ->  HashAggregate  (cost=1071.16..1117.80 rows=4664 width=36) (never executed)
                     Group Key: (collections.portable_data_hash)::text
                     ->  Seq Scan on collections  (cost=0.00..1047.53 rows=9455 width=36) (never executed)
                           Filter: ((expires_at IS NULL) OR (expires_at > statement_timestamp()))
               ->  Hash  (cost=1.05..1.05 rows=5 width=2299) (actual time=0.005..0.005 rows=0 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 0kB
                     ->  Seq Scan on containers  (cost=0.00..1.05 rows=5 width=2299) (actual time=0.003..0.005 rows=5 loops=1)
 Planning time: 1.122 ms
 Execution time: 0.149 ms
(15 rows)

The more obvious solution is to get batches of containers and do the readable_by query manually until one is readable, which seems messier and as likely to be slower as anything. I think by default we should let Postgres do this sort of thing for us, and only poke it in the eye on an as-deserved basis.

  • Right above the new code I noticed this line (from Lucas):

[...]

I don't think we want to filter on exit code, since a deterministic container would always return the same exit code? Also a few tools like "grep" return non-zero codes for conditions that are not errors.

It's common for a container to exit non-zero after a fatal transient error, like OOM or a network problem that prevents it from reading its input. I don't think it would make sense to reuse those outputs, or disqualify all successful containers with identical outputs just because there was also one attempt that produced different output because it failed.

  • It appears that when job reuse happens, the container_request state remains in Committed (which means it won't create the output/log collections). I think there needs to be an additional hook for when the container request is in Committed state which checks to see if the referenced container is in Complete/Cancelled state and calls container_completed!

Yes, presumably committed requests should get finalized immediately if the container is already finished. I guess we didn't notice this in #9623? We should add a test. OK to merge this in the meantime?

#11 Updated by Tom Clegg almost 5 years ago

On the topic of exit codes, surely the only sane approach is to follow the rule: exit 0 means success, exit non-zero means failure. Programs that are broken (e.g., a shell script that says "grep x y" where it should say "grep x y || true") are broken. If we start second-guessing exit codes, we'll just never know whether anything succeeded, even programs that aren't broken...

#12 Updated by Radhika Chippada almost 5 years ago

  • Target version changed from 2016-09-28 sprint to 2016-10-12 sprint

#14 Updated by Tom Clegg almost 5 years ago

  • Target version changed from 2016-10-12 sprint to 2016-10-26 sprint

#15 Updated by Peter Amstutz almost 5 years ago

9848-finalize-on-reuse LGTM.

#16 Updated by Tom Clegg almost 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:06f40b3d477f3ab1c834d9671808c3da02ca7dfd.

Also available in: Atom PDF