Bug #3351

[SDK] arv-put hangs during remote upload

Added by Tom Clegg about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
07/28/2014
Due date:
% Done:

100%

Estimated time:
(Total: 3.00 h)
Story points:
-

Description

Sequence seems to be
  1. Upload hangs (progress stuck at 832M / 938M 88.7%)
  2. Nginx proxy (in front of Keep proxy) reports 408 Request Timeout
  3. User types ^C
  4. arv-put reports a BadStatusLine warning, then a stack trace
  5. Nginx proxy reports a few 200 requests for earlier PUT requests

Subtasks

Task #3374: Test patched Python packageResolvedAbram Connelly

Task #3375: Review 3351-keep-timeoutResolvedRadhika Chippada


Related issues

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips.Resolved08/22/2014

Related to Arvados - Bug #3346: ActiveRecord::SaveFailed exception does not include useful error message if AuthorizedKey#permission_to_set_authorized_user_uuid fails during @object.save!New

Associated revisions

Revision 7a54fd6a
Added by Tom Clegg about 5 years ago

Merge branch '3351-keep-timeout' closes #3351

History

#1 Updated by Tom Clegg about 5 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg about 5 years ago

  • Assigned To set to Peter Amstutz

#5 Updated by Tom Clegg about 5 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
tom       8732  4280  8732  0    1 13:59 pts/13   00:00:02 strace -f -C -o arv-put.1343.strace.txt arv-put hg38.fa.gz
tom       8737  8732  8737  0    2 13:59 pts/13   00:00:02 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz
tom       8737  8732  8877  0    2 14:01 pts/13   00:00:00 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz

https://workbench.qr1hi.arvadosapi.com/collections/ef96312671f76e17b558bd445b404bc2+77

#6 Updated by Tom Clegg about 5 years ago

  • Subject changed from [SDK] Retry Keep requests once after httplib2.BadStatusLine exception to [SDK] arv-put hangs during remote upload
  • Description updated (diff)

#7 Updated by Tom Clegg about 5 years ago

8877  write(3, "\27\3\3@\30\24\\\16\366p\313\330R\215\3001\342\7\267\255\f\325t\t\2]v\310'\307\33\30"..., 16413 <unfinished ...>
8737  <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
8737  --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
8737  rt_sigreturn()                    = -1 EINTR (Interrupted system call)
8737  futex(0xd55a70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8877  <... write resumed> )             = 9595
8877  write(3, "\210\226\3.\214 \33{\16N\334\251\254\351\303\3577\"\251\373-\267\323\"\316\242^E$#h\224"..., 6818) = -1 ETIMEDOUT (Connection timed out)

#8 Updated by Tom Clegg about 5 years ago

Setting httplib2's socket timeout to 60 seconds does cause occasional crash instead of occasional hang.

#9 Updated by Tom Clegg about 5 years ago

Added a 60-second timeout and automatic retry (once) after timeout.

To try the patch without waiting for code review, download the .tar.gz package from the "files" section on this page, and install it on your workstation like this:

sudo pip install --upgrade arvados-python-client-0.1.1406400797.673a10d.tar.gz

Now, instead of hanging, your arv-put process should occasionally blab timeout warnings and then continue uploading, like this:

576M / 938M 61.4% WARNING:root:Request fail: PUT https://keep.qr1hi.arvadosapi.com:443/ec7741d104001d7e259287029a2d595a => <class 'ssl.SSLError'>: The write operation timed out
WARNING:root:Retrying: PUT https://keep.qr1hi.arvadosapi.com:443/ ec7741d104001d7e259287029a2d595a
938M / 938M 100.0%
576a6e6e0ad9f3e07ff56b8623263fed+655

#10 Updated by Tom Clegg about 5 years ago

  • Status changed from New to In Progress

#12 Updated by Tom Clegg about 5 years ago

  • Story points deleted (0.5)

#13 Updated by Tom Clegg about 5 years ago

Notes
  • Default timeout is 60 seconds and it's not yet user-controllable.
  • The exact meaning of timeout is elusive: Does it set a maximum time for an entire 64MiB block write to turn around? Or do httplib2 and the network stack send the data in smaller chunks and limit the buffering between us and the server? If the latter, 60 seconds is the maximum time to send the last {network buffer size} bytes of data, transfer 64MiB from Keep proxy to Keep block server, write the data to disk, and return a response to the client. This should be plenty.

Also, added d525bce to fix issue exposed in review:

3351: Propagate caller-provided timeout from KeepClient to KeepWriterThread.
Also, use the timeout in the sign_for_old_server block (even though
this branch is unlikely to be used in real life now that the old Keep
server has been replaced).

#14 Updated by Anonymous about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:7a54fd6a073e03f98f68d0b6c816a9f394f625dd.

Also available in: Atom PDF