Bug #3351

closed

[SDK] arv-put hangs during remote upload

Added by Tom Clegg over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -

Description

The sequence seems to be:
  1. Upload hangs (progress stuck at 832M / 938M 88.7%)
  2. Nginx proxy (in front of Keep proxy) reports 408 Request Timeout
  3. User types ^C
  4. arv-put reports a BadStatusLine warning, then a stack trace
  5. Nginx proxy reports a few 200 requests for earlier PUT requests

Files


Subtasks: 2 (0 open, 2 closed)

Task #3374: Test patched Python package (Resolved, Abram Connelly, 07/28/2014)
Task #3375: Review 3351-keep-timeout (Resolved, Radhika Chippada, 07/29/2014)

Related issues: 2 (0 open, 2 closed)

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. (Resolved, Brett Smith, 08/22/2014)
Related to Arvados - Bug #3346: ActiveRecord::SaveFailed exception does not include useful error message if AuthorizedKey#permission_to_set_authorized_user_uuid fails during @object.save! (Closed)
Actions #1

Updated by Tom Clegg over 10 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 10 years ago

  • Assigned To set to Peter Amstutz
Actions #5

Updated by Tom Clegg over 10 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
tom       8732  4280  8732  0    1 13:59 pts/13   00:00:02 strace -f -C -o arv-put.1343.strace.txt arv-put hg38.fa.gz
tom       8737  8732  8737  0    2 13:59 pts/13   00:00:02 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz
tom       8737  8732  8877  0    2 14:01 pts/13   00:00:00 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz

https://workbench.qr1hi.arvadosapi.com/collections/ef96312671f76e17b558bd445b404bc2+77

Actions #6

Updated by Tom Clegg over 10 years ago

  • Subject changed from [SDK] Retry Keep requests once after httplib2.BadStatusLine exception to [SDK] arv-put hangs during remote upload
  • Description updated (diff)
Actions #7

Updated by Tom Clegg over 10 years ago

8877  write(3, "\27\3\3@\30\24\\\16\366p\313\330R\215\3001\342\7\267\255\f\325t\t\2]v\310'\307\33\30"..., 16413 <unfinished ...>
8737  <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
8737  --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
8737  rt_sigreturn()                    = -1 EINTR (Interrupted system call)
8737  futex(0xd55a70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8877  <... write resumed> )             = 9595
8877  write(3, "\210\226\3.\214 \33{\16N\334\251\254\351\303\3577\"\251\373-\267\323\"\316\242^E$#h\224"..., 6818) = -1 ETIMEDOUT (Connection timed out)
Actions #8

Updated by Tom Clegg over 10 years ago

Setting httplib2's socket timeout to 60 seconds does cause an occasional crash instead of an occasional hang.

Actions #9

Updated by Tom Clegg over 10 years ago

Added a 60-second timeout and automatic retry (once) after timeout.

To try the patch without waiting for code review, download the .tar.gz package from the "files" section on this page, and install it on your workstation like this:

sudo pip install --upgrade arvados-python-client-0.1.1406400797.673a10d.tar.gz

Now, instead of hanging, your arv-put process should occasionally blab timeout warnings and then continue uploading, like this:

576M / 938M 61.4% WARNING:root:Request fail: PUT https://keep.qr1hi.arvadosapi.com:443/ec7741d104001d7e259287029a2d595a => <class 'ssl.SSLError'>: The write operation timed out
WARNING:root:Retrying: PUT https://keep.qr1hi.arvadosapi.com:443/ ec7741d104001d7e259287029a2d595a
938M / 938M 100.0%
576a6e6e0ad9f3e07ff56b8623263fed+655
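
The timeout-and-retry-once behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in the patch: `put_with_retry`, `is_timeout`, and the `do_put` callable are hypothetical names, and the warning text merely imitates the output shown above.

```python
import logging
import socket
import ssl


def is_timeout(err):
    """Return True if err looks like a socket or SSL timeout."""
    if isinstance(err, socket.timeout):
        return True
    # Assumption: SSL-layer timeouts carry "timed out" in their message,
    # as in "The write operation timed out" in the output above.
    return isinstance(err, ssl.SSLError) and "timed out" in str(err)


def put_with_retry(do_put, url, retries=1):
    """Call do_put(url); after a timeout, warn and retry up to `retries` times.

    do_put is a hypothetical callable that performs one PUT request and
    raises socket.timeout or ssl.SSLError when the request times out.
    """
    for attempt in range(retries + 1):
        try:
            return do_put(url)
        except (socket.timeout, ssl.SSLError) as err:
            # Give up on non-timeout errors, or once the retries are used up.
            if not is_timeout(err) or attempt == retries:
                raise
            logging.warning("Request fail: PUT %s => %s: %s", url, type(err), err)
            logging.warning("Retrying: PUT %s", url)
```

With `retries=1`, a block that times out once is sent a second time, and a second timeout propagates to the caller.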
Actions #10

Updated by Tom Clegg over 10 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Tom Clegg over 10 years ago

  • Story points deleted (0.5)
Actions #13

Updated by Tom Clegg over 10 years ago

Notes
  • Default timeout is 60 seconds and it's not yet user-controllable.
  • The exact meaning of timeout is elusive: Does it set a maximum time for an entire 64MiB block write to turn around? Or do httplib2 and the network stack send the data in smaller chunks and limit the buffering between us and the server? If the latter, 60 seconds is the maximum time to send the last {network buffer size} bytes of data, transfer 64MiB from Keep proxy to Keep block server, write the data to disk, and return a response to the client. This should be plenty.
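
One relevant fact can be checked with a small experiment: in Python, a socket timeout bounds each blocking socket call, not the whole exchange. The sketch below uses a plain loopback socket, not httplib2 or SSL, and `trickle_server` and `timed_read` are hypothetical helpers. It shows a transfer that takes longer than the timeout in total yet never times out, because each individual `recv` returns in time. How this maps onto a 64MiB PUT depends on how httplib2 and the SSL layer split the writes, which this sketch does not examine.

```python
import socket
import threading
import time


def trickle_server(listener, chunks, delay):
    """Accept one connection and send `chunks` bytes, pausing before each."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(chunks):
            time.sleep(delay)
            conn.sendall(b"x")


def timed_read(timeout, chunks, delay):
    """Read `chunks` bytes from a slow local server using a socket timeout.

    Returns (bytes_received, elapsed_seconds). socket.timeout is raised
    only if a single recv() call waits longer than `timeout`.
    """
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)
    server = threading.Thread(target=trickle_server,
                              args=(listener, chunks, delay))
    server.start()
    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    start = time.monotonic()
    received = 0
    try:
        while received < chunks:
            data = client.recv(1)  # each call gets its own timeout budget
            if not data:
                break
            received += len(data)
    finally:
        client.close()
        server.join()
        listener.close()
    return received, time.monotonic() - start
```

For example, `timed_read(timeout=1.0, chunks=5, delay=0.3)` takes about 1.5 seconds overall and still completes, since no single `recv` waits longer than about 0.3 seconds.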

Also, added d525bce to fix an issue exposed in review:

3351: Propagate caller-provided timeout from KeepClient to KeepWriterThread.
Also, use the timeout in the sign_for_old_server block (even though
this branch is unlikely to be used in real life now that the old Keep
server has been replaced).
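
The idea behind that commit, handing a caller-provided timeout down to the worker threads instead of letting each thread fall back to a default, can be sketched as below. `SketchClient` and `SketchWriterThread` are hypothetical stand-ins, not the SDK's actual `KeepClient` and `KeepWriterThread` classes, and each `sender` callable is an assumed placeholder for one upload request.

```python
import threading

DEFAULT_TIMEOUT = 60  # seconds, matching the default mentioned in the notes


class SketchWriterThread(threading.Thread):
    """Hypothetical writer thread that uses the timeout it is given."""

    def __init__(self, sender, timeout=DEFAULT_TIMEOUT):
        super().__init__()
        self.sender = sender
        self.timeout = timeout
        self.result = None

    def run(self):
        # Pass the timeout on to whatever performs the actual request.
        self.result = self.sender(timeout=self.timeout)


class SketchClient:
    """Hypothetical client that propagates its timeout to every writer."""

    def __init__(self, timeout=None):
        self.timeout = DEFAULT_TIMEOUT if timeout is None else timeout

    def put(self, senders):
        threads = [SketchWriterThread(s, timeout=self.timeout) for s in senders]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return [t.result for t in threads]
```

If the client omitted `timeout=self.timeout` when building the threads, a caller's custom timeout would be silently ignored, which is the kind of gap the commit message describes.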
Actions #14

Updated by Anonymous over 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:7a54fd6a073e03f98f68d0b6c816a9f394f625dd.
