Feature #8707: Arvados job: download data from remote site into Keep - Tapestry - Arvados


Feature #8707

open

Arvados job: download data from remote site into Keep

Added by Tom Clegg almost 9 years ago. Updated over 5 years ago.

Status: In Progress
Priority: Normal
Assigned To: Tom Clegg
Category: Third party integration
Target version: Interpretation automation
Story points: 1.0

Description

...to satisfy an API request like #8688

Implementation

One task per requested file -- this avoids retrying everything whenever one file fails

Use writable FUSE (task output dir)

Run wget or curl, probably with some sort of batch-progress flag
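The plan above (one task per file, each writing into the output directory and shelling out to curl) could be sketched roughly as follows. This is only an illustration, not the script that was merged: `build_download_tasks`, the task dictionary layout, and the fallback file name are hypothetical, and the progress flag is left out because the ticket does not name one. The curl options used (`--fail`, `--silent`, `--show-error`, `--location`, `--output`) are standard curl flags.

```python
import os
from urllib.parse import urlparse


def build_download_tasks(urls, output_dir):
    """Return one independent task description per requested URL.

    Hypothetical helper: keeping each file in its own task means a failed
    download can be retried without repeating the ones that succeeded.
    """
    tasks = []
    for sequence, url in enumerate(urls):
        # Name the output after the last path component of the URL; fall
        # back to a numbered name when the URL has no usable file name.
        name = os.path.basename(urlparse(url).path) or "download-%d" % sequence
        outpath = os.path.join(output_dir, name)
        tasks.append({
            "sequence": sequence,
            "url": url,
            "outpath": outpath,
            # --fail makes curl exit non-zero on HTTP errors, so the task
            # itself can be marked as failed and retried on its own.
            "command": ["curl", "--fail", "--silent", "--show-error",
                        "--location", "--output", outpath, url],
        })
    return tasks
```

Each `command` list could then be passed to something like `subprocess.check_call` inside its own task, with `output_dir` pointing at the writable task output directory.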

Subtasks 3 (0 open — 3 closed)

Task #8795: download script	Resolved	Tom Clegg	03/15/2016
Task #8897: Review 8707-download	Resolved	Tom Clegg	03/15/2016
Task #8708: open firewall rules on su92l to allow download from veritas	Resolved	Ward Vandewege	03/15/2016

Related issues 1 (0 open — 1 closed)

Related to Tapestry - Feature #8688: Accept authenticated API calls from data providers to add datasets to a public profile	Resolved	Tom Clegg	03/24/2016


History

Updated by Tom Clegg almost 9 years ago

Description updated (diff)


Updated by Tom Clegg almost 9 years ago

Story points set to 1.0


Updated by Tom Clegg almost 9 years ago

Category set to Third party integration
Assigned To set to Tom Clegg


Updated by Tom Clegg over 8 years ago

8707-download @ db7bd2a8f4981c079ced6c09646ac297790326ae

failure due to successful download with right size but wrong md5sum: https://crvr.se/su92l-8i9sb-ful8qhzowkshfoq
success: https://crvr.se/su92l-8i9sb-aizw0cupzxafowf


Updated by Brett Smith over 8 years ago

Reviewing db7bd2a. This is good to merge; these are all just "idiomatic Python" nits that you can take or leave as you like.

cStringIO provides the same API as StringIO with better performance. You can switch to it with a one-line change by changing your import to import cStringIO as StringIO.

It seems a little odd that you open the URL, then check its scheme. Maybe move that up? You might also consider saving the result of urlparse.urlparse() and reusing it, but that's really small potatoes.

Your download loop can be written a little DRYer as:

    with open(outpath, 'w') as outfile:
        for chunk in iter(lambda: httpresp.read(BUFFER_SIZE), ''):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()

Thanks.
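For readers following along, here is a self-contained sketch of the two suggestions above: check the URL scheme before opening anything, and use the two-argument form of `iter()` to read until end of stream. It is written for Python 3 (the branch under review appears to be Python 2), and the names `check_scheme`, `copy_and_hash`, `ALLOWED_SCHEMES`, and the `BUFFER_SIZE` value are assumptions for illustration, not taken from the 8707-download branch.

```python
import hashlib
from urllib.parse import urlparse

BUFFER_SIZE = 1 << 20  # assumed chunk size; the script's real value is not shown
ALLOWED_SCHEMES = ("http", "https")  # assumption: which schemes are acceptable


def check_scheme(url):
    """Parse the URL once and reject unsupported schemes before any I/O."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported URL scheme: %r" % parsed.scheme)
    return parsed


def copy_and_hash(resp, outfile, buffer_size=BUFFER_SIZE):
    """Copy resp to outfile in chunks; return (bytes copied, md5 hex digest)."""
    got_md5 = hashlib.md5()
    got_size = 0
    # iter(callable, sentinel) keeps calling read() until it returns b'',
    # which is what a binary stream returns at end of file.
    for chunk in iter(lambda: resp.read(buffer_size), b""):
        outfile.write(chunk)
        got_md5.update(chunk)
        got_size += len(chunk)
    return got_size, got_md5.hexdigest()
```

Any object with a `read(n)` method that returns bytes works as `resp`, so the loop can be exercised with `io.BytesIO` objects without touching the network. Comparing both returned values against the expected ones matters because, as the failed test job shows, a download can have the right size and still the wrong md5sum.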


Updated by Tom Clegg over 8 years ago

All of that sounds better, thanks. I was torn between the two uglies -- while-True-if-cond-break and duplicating the read() -- the iter solution is just what I was wishing for.

Now at aee617c with new test jobs: