Feature #8707

open

Arvados job: download data from remote site into Keep

Added by Tom Clegg over 8 years ago. Updated about 5 years ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: Third party integration
Story points: 1.0

Description

...to satisfy an API request like #8688

Implementation

One task per requested file -- this avoids retrying everything whenever one file fails

Use writable FUSE (task output dir)

Run wget or curl, probably with some sort of batch-progress flag
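
The per-file split above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the script from the 8707-download branch: `build_curl_command`, `run_downloads`, and the `runner` parameter are hypothetical names, the curl flags are one plausible choice rather than the ones actually used, and a plain loop stands in for separate Arvados tasks.

```python
import subprocess


def build_curl_command(url, outpath):
    """Return a curl argv that downloads one URL to one output path."""
    return [
        "curl",
        "--fail",        # exit non-zero on HTTP errors instead of saving the error page
        "--silent",      # suppress the interactive progress meter in batch logs
        "--show-error",  # still print error messages when something fails
        "--location",    # follow redirects
        "--output", outpath,
        url,
    ]


def run_downloads(jobs, runner=subprocess.call):
    """Run one independent download per (url, outpath) pair.

    Returns the jobs that failed, so the caller can retry only those
    instead of repeating every download.
    """
    failed = []
    for url, outpath in jobs:
        if runner(build_curl_command(url, outpath)) != 0:
            failed.append((url, outpath))
    return failed
```

Passing a fake `runner` makes the retry bookkeeping testable without network access; in a real job each pair would presumably become its own task writing into the task output directory.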


Subtasks 3 (0 open, 3 closed)

Task #8795: download script | Resolved | Tom Clegg | 03/15/2016
Task #8897: Review 8707-download | Resolved | Tom Clegg | 03/15/2016
Task #8708: open firewall rules on su92l to allow download from veritas | Resolved | Ward Vandewege | 03/15/2016

Related issues

Related to Tapestry - Feature #8688: Accept authenticated API calls from data providers to add datasets to a public profile | Resolved | Tom Clegg | 03/24/2016
Actions #1

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 8 years ago

  • Story points set to 1.0
Actions #3

Updated by Tom Clegg over 8 years ago

  • Category set to Third party integration
  • Assigned To set to Tom Clegg
Actions #4

Updated by Tom Clegg over 8 years ago

8707-download @ db7bd2a8f4981c079ced6c09646ac297790326ae
Actions #5

Updated by Brett Smith over 8 years ago

Reviewing db7bd2a. This is good to merge; these are all just "idiomatic Python" nits that you can take or leave as you like.

cStringIO provides the same API as StringIO with better performance. You can switch to it with a one-line change by changing your import to import cStringIO as StringIO.

It seems a little odd that you open the URL, then check its scheme. Maybe move that up? You might also consider saving the result of urlparse.urlparse() and reusing it, but that's really small potatoes.
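
A rough sketch of that ordering, parsing once and rejecting unsupported schemes before any connection is opened. It is written for Python 3, where `urlparse` lives in `urllib.parse` rather than the Python 2 `urlparse` module; `parse_download_url` and `ALLOWED_SCHEMES` are hypothetical names, and the accepted schemes are an assumption since the script's actual list is not shown here.

```python
from urllib.parse import urlparse

# Assumption: only plain HTTP(S) downloads are accepted.
ALLOWED_SCHEMES = ("http", "https")


def parse_download_url(url):
    """Parse the URL once and validate its scheme before opening it.

    Returns the parsed result so later code can reuse it instead of
    calling urlparse() again.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError("unsupported URL scheme: %r" % parsed.scheme)
    return parsed
```

The caller would then open the URL only after `parse_download_url()` returns, and could take the output filename from `parsed.path` without parsing a second time.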

Your download loop can be written a little DRYer as:

    with open(outpath, 'w') as outfile:
        for chunk in iter(lambda: httpresp.read(BUFFER_SIZE), ''):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()
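
For comparison, here is a self-contained Python 3 rendering of the same loop, offered as a sketch rather than the code under review. Under Python 3 the response yields bytes, so the file is opened in binary mode and the `iter()` sentinel becomes `b''`. `save_stream` is a hypothetical wrapper name, and the `BUFFER_SIZE` value is an assumption because the original constant is not shown.

```python
import hashlib

BUFFER_SIZE = 1 << 20  # assumption: 1 MiB per read


def save_stream(httpresp, outpath):
    """Copy a file-like response to outpath; return (size, md5 hex digest)."""
    got_md5 = hashlib.md5()
    with open(outpath, "wb") as outfile:
        # iter(callable, sentinel) keeps calling read() until it returns b"",
        # which is how file-like objects signal end of stream.
        for chunk in iter(lambda: httpresp.read(BUFFER_SIZE), b""):
            outfile.write(chunk)
            got_md5.update(chunk)
        got_size = outfile.tell()
    return got_size, got_md5.hexdigest()
```

Any object with a `read(n)` method works as `httpresp`, so an `io.BytesIO` can stand in for a real HTTP response when checking the size and checksum logic.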

Thanks.

Actions #6

Updated by Tom Clegg over 8 years ago

All of that sounds better, thanks. I was torn between the two uglies -- while-True-if-cond-break and duplicating the read() -- the iter solution is just what I was wishing for.

Now at aee617c with new test jobs:
Actions #7

Updated by Brett Smith over 8 years ago

Tom Clegg wrote:

Now at aee617c with new test jobs:

That looks great, thanks.

Actions #8

Updated by Tom Clegg over 8 years ago

  • Status changed from New to In Progress