Bug #3351
closed
[SDK] arv-put hangs during remote upload
Added by Tom Clegg over 10 years ago. Updated over 10 years ago.
Description
- Upload hangs (progress stuck at 832M / 938M 88.7%)
- Nginx proxy (in front of Keep proxy) reports 408 Request Timeout
- User types ^C
- arv-put reports a BadStatusLine warning, then a stack trace
- Nginx proxy reports a few 200 requests for earlier PUT requests
Files
arvados-python-client-0.1.1406400797.673a10d.tar.gz (45.1 KB) | Tom Clegg, 07/26/2014 03:22 PM
Updated by Tom Clegg over 10 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
tom 8732 4280 8732 0 1 13:59 pts/13 00:00:02 strace -f -C -o arv-put.1343.strace.txt arv-put hg38.fa.gz
tom 8737 8732 8737 0 2 13:59 pts/13 00:00:02 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz
tom 8737 8732 8877 0 2 14:01 pts/13 00:00:00 /usr/bin/python /usr/local/bin/arv-put hg38.fa.gz
https://workbench.qr1hi.arvadosapi.com/collections/ef96312671f76e17b558bd445b404bc2+77
Updated by Tom Clegg over 10 years ago
- Subject changed from [SDK] Retry Keep requests once after httplib2.BadStatusLine exception to [SDK] arv-put hangs during remote upload
- Description updated (diff)
Updated by Tom Clegg over 10 years ago
8877 write(3, "\27\3\3@\30\24\\\16\366p\313\330R\215\3001\342\7\267\255\f\325t\t\2]v\310'\307\33\30"..., 16413 <unfinished ...>
8737 <... futex resumed> ) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
8737 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
8737 rt_sigreturn() = -1 EINTR (Interrupted system call)
8737 futex(0xd55a70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8877 <... write resumed> ) = 9595
8877 write(3, "\210\226\3.\214 \33{\16N\334\251\254\351\303\3577\"\251\373-\267\323\"\316\242^E$#h\224"..., 6818) = -1 ETIMEDOUT (Connection timed out)
Updated by Tom Clegg over 10 years ago
Setting httplib2's socket timeout to 60 seconds does cause an occasional crash instead of an occasional hang.
Updated by Tom Clegg over 10 years ago
Added a 60-second timeout and automatic retry (once) after timeout.
To try the patch without waiting for code review, download the .tar.gz package from the "files" section on this page, and install it on your workstation like this:
sudo pip install --upgrade arvados-python-client-0.1.1406400797.673a10d.tar.gz
Now, instead of hanging, your arv-put process should occasionally blab timeout warnings and then continue uploading, like this:
576M / 938M 61.4%
WARNING:root:Request fail: PUT https://keep.qr1hi.arvadosapi.com:443/ec7741d104001d7e259287029a2d595a => <class 'ssl.SSLError'>: The write operation timed out
WARNING:root:Retrying: PUT https://keep.qr1hi.arvadosapi.com:443/ec7741d104001d7e259287029a2d595a
938M / 938M 100.0%
576a6e6e0ad9f3e07ff56b8623263fed+655
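The timeout-plus-single-retry behavior described above can be sketched roughly as follows. This is not the code from the patch: `call_with_one_retry`, `request`, and `warn` are hypothetical names, and catching `socket.timeout` is an assumption (the log above shows the failure surfacing as `ssl.SSLError` in the original environment).

```python
import socket


def call_with_one_retry(request, warn=print):
    """Run request(); if it times out, warn and run it exactly once more.

    `request` is a hypothetical zero-argument callable that performs the
    PUT with a socket timeout already configured (for example, via
    httplib2.Http(timeout=60)). A second timeout is not caught and
    propagates to the caller.
    """
    try:
        return request()
    except socket.timeout as err:
        warn("Request fail: %s" % err)
        warn("Retrying")
        return request()
```

Retrying only once keeps the worst case bounded: a request that times out twice fails after roughly two timeout periods instead of hanging indefinitely.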
Updated by Tom Clegg over 10 years ago
- Default timeout is 60 seconds and it's not yet user-controllable.
- The exact meaning of timeout is elusive: Does it set a maximum time for an entire 64MiB block write to turn around? Or do httplib2 and the network stack send the data in smaller chunks and limit the buffering between us and the server? If the latter, 60 seconds is the maximum time to send the last {network buffer size} bytes of data, transfer 64MiB from Keep proxy to Keep block server, write the data to disk, and return a response to the client. This should be plenty.
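As a small illustration of the second interpretation, the sketch below uses plain Python sockets (not httplib2 or Keep) to show that a socket timeout limits each blocking operation, not the whole transfer. The helper names `slow_sender` and `receive_all` are hypothetical, and whether httplib2 behaves the same way for a full block upload is exactly the open question above.

```python
import socket
import threading
import time


def slow_sender(sock, chunks, delay):
    """Send `chunks` single bytes, pausing `delay` seconds before each."""
    for _ in range(chunks):
        time.sleep(delay)
        sock.sendall(b"x")
    sock.close()


def receive_all(timeout, chunks, delay):
    """Read until EOF with a per-operation timeout.

    Returns (bytes_received, elapsed_seconds).
    """
    reader, writer = socket.socketpair()
    reader.settimeout(timeout)  # applies to each recv() call separately
    sender = threading.Thread(target=slow_sender, args=(writer, chunks, delay))
    sender.start()
    start = time.time()
    received = 0
    while True:
        data = reader.recv(1024)
        if not data:  # empty read means the sender closed its end
            break
        received += len(data)
    elapsed = time.time() - start
    sender.join()
    reader.close()
    return received, elapsed


if __name__ == "__main__":
    # The whole transfer takes longer than the timeout, but no single
    # recv() waits that long, so no socket.timeout is raised.
    print(receive_all(timeout=0.5, chunks=8, delay=0.1))
```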
Also, added d525bce to fix an issue exposed in review:
3351: Propagate caller-provided timeout from KeepClient to KeepWriterThread.
Also, use the timeout in the sign_for_old_server block (even though
this branch is unlikely to be used in real life now that the old Keep
server has been replaced).
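The kind of fix this commit message describes, passing a caller-provided timeout down to the worker threads instead of letting them fall back to a default, might look roughly like the sketch below. The classes `Client` and `WriterThread` and the `send` callable are hypothetical stand-ins, not the actual KeepClient or KeepWriterThread code.

```python
import threading

DEFAULT_TIMEOUT = 60  # seconds; matches the default mentioned in this ticket


class WriterThread(threading.Thread):
    """Hypothetical worker that performs one write with a given timeout."""

    def __init__(self, send, timeout=DEFAULT_TIMEOUT):
        super().__init__()
        self.send = send
        self.timeout = timeout
        self.result = None

    def run(self):
        # Use the timeout handed down by the client, not a hard-coded one.
        self.result = self.send(timeout=self.timeout)


class Client:
    """Hypothetical client that fans one write out to several threads."""

    def __init__(self, timeout=DEFAULT_TIMEOUT):
        self.timeout = timeout

    def put(self, send, copies=2):
        # Propagate the caller-provided timeout to every writer thread.
        threads = [WriterThread(send, timeout=self.timeout)
                   for _ in range(copies)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return [t.result for t in threads]
```

For example, `Client(timeout=5).put(lambda timeout: timeout)` returns `[5, 5]`, confirming that each thread saw the caller's value, not the default.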
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:7a54fd6a073e03f98f68d0b6c816a9f394f625dd.