2014-05-07 Storing and Organizing Data

05/07/2014 (Sprint start date 04/17/2014)


114 issues   (114 closed — 0 open)


Rough demo outline

  1. Automatic ingest from a POSIX directory to Keep
    • Ingestor's access to staging area (could be remote NFS or sshfs mount) is arranged ahead of time
    • 3rd-party's access to staging area is arranged ahead of time
    • Ingestor runs in a screen session. Command line parameters provide project (group/folder) ID and a tag that indicates "this is for me to ingest".
    • Someone ("3rd-party") uploads some files to the staging area via SFTP or whatever
    • 3rd-party does an API call to "ingest-notify app". This might be a short bash script culminating in a curl command. In the API call, the 3rd-party provides a label (e.g., a sample ID) and a list of files, checksums, and an arbitrary "properties" hash containing whatever the 3rd-party wants.
    • Ingest-notify app generates a "data in staging area is ready to ingest" event via API server.
    • Ingestor waits for a "data in staging area is ready to ingest" notification via API server.
    • Ingestor reads the data from the staging area and writes it into Keep (creates one collection per API call made by 3rd-party).
    • Ingestor (or arv-put on behalf of ingestor?) makes API calls while working, to indicate progress (bytes done/todo). arvados.v1.logs.create(object_uuid=uuid_of_upload_object)
    • In Workbench the imported Datasets appear as Collections in the designated project
    • After data has been copied into Keep, ingestor deletes the files from the staging area (if --delete-after flag given).
  2. My data gets into the right project as specified by the uploader (API call)
    • How is the staging-area ↔ project mapping specified, and how/where is it encoded/stored?
  3. Subscribe to notifications (by email and/or Workbench dashboard): when files start/finish uploading; when files are shared with customer; when files are downloaded by third party
    • For now, use existing Logs table + automatic logging of create/update/delete operations + "progress" event from arv-put (see above)
    • "Show project" page shows recent activity: one progress bar for each unfinished upload, one entry for each start/finish event.
    • Dashboard page shows recent activity from all of my projects.
  4. Move/copy collections between projects (Project RX1234, or Customer X’s files), tag them in destination project with the appropriate string (e.g., sample ID) -- defaulting to existing tag used in source project (e.g., provided at time of upload).
    • UI for presenting Groups as Projects/Folders: create, view, rename, share, delete
    • UI for copying/moving objects between folders
    • How to avoid confusion about "is this one object in two places, or are there two objects?" Note GDocs has a bit of both, "My Drive" / "Shared with me" vs. regular folders
  5. Share project with other users/groups
  6. “Anyone with this secret link can view/download” mode. Enable, disable, change magic link. Use cases: browser + “wget -r”.
    • Perhaps the secret in the secret link is an ApiClientAuthorization token, belonging to the person creating the link, scoped to a single project/collection
    • How do we implement "Anonymous user, not logged in"?
  7. See log/overview of who has accessed your shared data (incl. “anonymous user” if using secret-link-to-share); when shared/unshared; when each upload started/finished -- for a single project, and across all projects
  8. Pilot alternate Workbench group/dashboard view
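
The notify call in step 1 could build its request body like this. A minimal sketch, not the real ingest-notify app: the function name and payload field names (`label`, `files`, `properties`) are illustrative, chosen to match the fields the outline says the 3rd-party provides.

```python
import hashlib
import json
import os

def build_notify_payload(label, paths, properties=None):
    """Build the JSON body for a hypothetical ingest-notify call:
    a label (e.g. a sample ID), one entry per file with its size and
    MD5 checksum, and an arbitrary "properties" hash containing
    whatever the 3rd-party wants."""
    files = []
    for path in paths:
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        files.append({
            "path": os.path.basename(path),
            "size": os.path.getsize(path),
            "md5": md5.hexdigest(),
        })
    return json.dumps({
        "label": label,
        "files": files,
        "properties": properties or {},
    })
```

The "short bash script culminating in a curl command" mentioned above would then just POST this string to the ingest-notify app with an auth token.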
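The bytes done/todo progress events from steps 1 and 3 could be emitted per chunk during the copy. In this sketch `post_log` is any callable standing in for `arvados.v1.logs.create(object_uuid=uuid_of_upload_object)`; the event type and property names are assumptions, not the SDK's API.

```python
import os

CHUNK = 1 << 20  # copy in 1 MiB chunks

def ingest_file(src, dst, post_log):
    """Copy one staged file, emitting a progress event after each
    chunk so Workbench can draw a progress bar for the upload."""
    todo = os.path.getsize(src)
    done = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            post_log({
                "event_type": "upload-progress",
                "properties": {"bytes_done": done, "bytes_todo": todo},
            })
    return done
```

A real ingestor would pass a closure that wraps the logs-create API call; the "Show project" page would then render one progress bar per unfinished upload from these events.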
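The secret link in step 6 could work as sketched below, under the assumption that the secret is an ApiClientAuthorization token scoped to read-only access on a single collection. Treat the exact scope strings and URL shape as illustrative, not a settled design.

```python
def collection_readonly_scopes(collection_uuid):
    """Scope list for a token that can only GET one collection
    (assumed scope-string format: method + API path)."""
    return ["GET /arvados/v1/collections/%s" % collection_uuid]

def share_link(base_url, collection_uuid, token):
    """Hypothetical 'anyone with this link can view' URL: the scoped
    token rides along as a query parameter that Workbench would use
    in place of a logged-in session."""
    return "%s/collections/%s?api_token=%s" % (
        base_url, collection_uuid, token)
```

Disabling or changing the magic link then means deleting or re-issuing that one token, without touching the owner's own credentials; the "anonymous user, not logged in" question is what identity such requests are logged under.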

Retrospective notes

  • Went well overall; still some merge races at the end
  • Lots of branches going in
  • Not a lot of merge conflicts
  • Big spec change (rejecting "ingestor" story in favor of future "remote arv-put")
  • Some in-sprint deployment dependency stuff (crunch+docker, websockets)
  • Please tag commits with story numbers. Use "refs #1234" for merges. (Use "refs #1234" for individual commits too?)
  • Consider extracting a task into a story if it grows into its own thing (e.g., token handling as part of collection sharing)
  • In sprint review, include two more agenda items: a summary of things not done, and a high-level overview of the next sprint
  • How to improve support/feedback loop for beta users?

Time tracking

Estimated time: 265.00 hours
Related issues
# Subject Story Points
Feature #2733 Generate test coverage report in CI pipeline 1.0
Idea #2638 Workbench displays garbage collection histogram 1.0
Idea #2631 datamanager logs block age vs disk space histogram 1.0
Feature #2328 Verify and generate permission hints in Keep 4.0
Feature #2505 Recurring: Update doc site to ensure it is internally consistent and accurately reflects the current behavior of the software. 1.0
Idea #1904 User can get a no-auth-required link to an Arvados object, i.e., turn on "anyone with the link can view" permission 3.0
Idea #2640 API server has functionality and test fixtures for folders 1.0
Idea #2492 Run Job tasks in a Docker container 1.0
Idea #2035 arv-mount exposes filesystem paths like /tag/tagname/collection-uuid and /folder/foldername/collection-uuid 3.0
Idea #2608 Implement a websocket read-only event bus backed by the logs table
Idea #2525 Generate a Java SDK using Google API tools 3.0
Bug #2744 Developer doc bug-fixes 1.0
Feature #2587 add docker to compute node image 1.0
Idea #1970 Create groups and use them like project folders to manage objects in Workbench. 5.0
Idea #2043 arv-mount is set up automatically when user logs in to VM 1.0
Feature #2620 Keep supports I/O locking 2.0
Idea #1969 Control transient/persistent switch in Workbench 2.0
Bug #2223 Repository owner_uuid and Arvados admins should get RW permission in gitolite 1.0
Idea #2622 Datamanager outputs garbage collection list 1.0
Bug #2409 Remove unused top level controllers and routes in apiserver. 1.0
Idea #2612 Workbench displays user usage in logs 1.0
Idea #1971 When a job output contains a single image file, show a thumbnail image inline on Workbench pipeline_instance and job pages. 0.5