Feature #8707

Arvados job: download data from remote site into Keep

Added by Tom Clegg almost 4 years ago. Updated 10 months ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: Third party integration
Start date: 03/15/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

...to satisfy an API request like #8688

Implementation

One task per requested file -- this avoids retrying everything whenever one file fails

Use writable FUSE (task output dir)

Run wget or curl, probably with some sort of batch-progress flag
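
The one-task-per-file idea can be sketched in plain Python as follows. This is only an illustration, not the Arvados task API: `download_each`, `fetch`, and `max_attempts` are hypothetical names, and a real job would presumably queue one task per URL rather than loop in a single process. The point is that each file succeeds or fails on its own, so a retry touches only the files that failed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def download_each(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    max_attempts: int = 3,
) -> Tuple[List[str], Dict[str, str]]:
    """Fetch each URL independently; return (succeeded, failed-with-reason).

    `fetch` is a hypothetical callable that downloads one URL and raises
    on failure (for example, a thin wrapper around wget or curl).
    """
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                fetch(url)
            except Exception as exc:
                # Record the failure and retry this file only;
                # the other files are unaffected.
                failed[url] = f"attempt {attempt}: {exc}"
            else:
                failed.pop(url, None)
                succeeded.append(url)
                break
    return succeeded, failed
```

With this shape, a file that keeps failing ends up in the second return value with its last error, while the files that were already downloaded are never fetched again.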


Subtasks

Task #8795: download script (Resolved, Tom Clegg)

Task #8897: Review 8707-download (Resolved, Tom Clegg)

Task #8708: open firewall rules on su92l to allow download from veritas (Resolved, Ward Vandewege)


Related issues

Related to Tapestry - Feature #8688: Accept authenticated API calls from data providers to add datasets to a public profile (Resolved, 03/24/2016)

Associated revisions

Revision 90c420b8
Added by Tom Clegg almost 4 years ago

Merge branch '8707-download'

refs #8707

History

#1 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg almost 4 years ago

  • Story points set to 1.0

#3 Updated by Tom Clegg almost 4 years ago

  • Category set to Third party integration
  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg almost 4 years ago

8707-download @ db7bd2a8f4981c079ced6c09646ac297790326ae

#5 Updated by Brett Smith almost 4 years ago

Reviewing db7bd2a. This is good to merge; these are all just "idiomatic Python" nits that you can take or leave as you like.

cStringIO provides the same API as StringIO with better performance. You can switch to it with a one-line change by changing your import to import cStringIO as StringIO.

It seems a little odd that you open the URL, then check its scheme. Maybe move that up? You might also consider saving the result of urlparse.urlparse() and reusing it, but that's really small potatoes.
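
A minimal sketch of that ordering, parsing once and checking the scheme before any connection is opened. It is written for Python 3, where the module is `urllib.parse` (the reviewed script used Python 2's `urlparse`). The `checked_url` helper and the `ALLOWED_SCHEMES` list are assumptions for illustration, not taken from the branch.

```python
from urllib.parse import urlparse  # Python 2 spells this module `urlparse`

ALLOWED_SCHEMES = ("http", "https")  # assumed allow-list, not from the ticket


def checked_url(url):
    """Parse once, reject unexpected schemes, and return the parsed result."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported URL scheme: %r" % (parsed.scheme,))
    return parsed
```

The caller can then reuse the returned object (for example `parsed.netloc` or `parsed.path`) rather than parsing the URL a second time.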

Your download loop can be written a little DRYer as:

    with open(outpath, 'w') as outfile:
        for chunk in iter(lambda: httpresp.read(BUFFER_SIZE), ''):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()

Thanks.
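
For reference, a self-contained Python 3 version of the loop suggested above might look like the sketch below. The assumptions: `copy_with_md5` is a hypothetical wrapper name, the `BUFFER_SIZE` value is made up (the ticket does not give one), the file is opened in binary mode, and the sentinel is `b""` because Python 3 reads bytes. An `io.BytesIO` object stands in for the HTTP response.

```python
import hashlib
import io
import os
import tempfile

BUFFER_SIZE = 1 << 20  # assumed chunk size; the ticket does not give a value


def copy_with_md5(resp, outpath):
    """Stream `resp` to `outpath`; return (hex md5, byte count)."""
    got_md5 = hashlib.md5()
    with open(outpath, "wb") as outfile:
        # iter(callable, sentinel) calls read() until it returns b"" (EOF).
        for chunk in iter(lambda: resp.read(BUFFER_SIZE), b""):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()
    return got_md5.hexdigest(), got_size


# Usage with an in-memory stand-in for the HTTP response.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "out.bin")
    digest, size = copy_with_md5(io.BytesIO(b"hello"), path)
```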

#6 Updated by Tom Clegg almost 4 years ago

All of that sounds better, thanks. I was torn between the two uglies -- while-True-if-cond-break and duplicating the read() -- the iter solution is just what I was wishing for.
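
To make the trade-off concrete, here is a hedged Python 3 sketch of the two forms mentioned here next to the `iter` form. The function names and the tiny `BUFFER_SIZE` are illustrative assumptions, not code from the branch.

```python
import io

BUFFER_SIZE = 4  # tiny, assumed value so the example takes several reads


def read_with_break(resp):
    """The while-True-if-cond-break form."""
    chunks = []
    while True:
        chunk = resp.read(BUFFER_SIZE)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


def read_with_duplicate(resp):
    """The form that writes the read() call twice."""
    chunks = []
    chunk = resp.read(BUFFER_SIZE)
    while chunk:
        chunks.append(chunk)
        chunk = resp.read(BUFFER_SIZE)
    return chunks


def read_with_iter(resp):
    """The two-argument iter() form from the review."""
    return list(iter(lambda: resp.read(BUFFER_SIZE), b""))
```

All three read the same chunks from a stream such as `io.BytesIO(b"0123456789")`; the `iter` form simply states the loop condition once.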

Now at aee617c with new test jobs:

#7 Updated by Brett Smith almost 4 years ago

Tom Clegg wrote:

Now at aee617c with new test jobs:

That looks great, thanks.

#8 Updated by Tom Clegg almost 4 years ago

  • Status changed from New to In Progress
