Bug #10932

[arv-put] job resume too slow & uninformative

Added by Tom Morris over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
02/02/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Completed 97% of a 636959M upload (637GB), then attempted to resume after arv-put was killed. It took almost 3 hours (175 minutes) of 100% CPU for arv-put to calculate/confirm the restart point (which appeared to be only 45%, not 97%) before it began any additional uploading. During this time the only user feedback was "0M / 636959M 0.0%", for hours on end.

Some other, possibly relevant, details:

98 MB cache file in ~/.cache/arvados/ :

97913649 Jan 18 21:54 476acf1a6d4cfb9b630a13289c4a72be

of which ~7 MB is manifest text:

$ jq .manifest 476acf1a6d4cfb9b630a13289c4a72be | wc
1 240325 6861939

and the rest is info for the 478502 files:

$ jq '.files | length' 476acf1a6d4cfb9b630a13289c4a72be
478502
$ jq -r '.files[] | .size ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478502 478502 3827704
$ jq -r '.files[] | .mtime ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478502 478502 8928344
$ jq -r '.files | keys ' 476acf1a6d4cfb9b630a13289c4a72be | wc
478504 480984 69204103

That's 69 MB of file names, which are compressible to 1.4 MB due to high levels of redundancy:

$ jq -r '.files | keys ' 476acf1a6d4cfb9b630a13289c4a72be | gzip | wc
10074 39234 1442694
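
The compression ratio the `jq ... | gzip | wc` pipeline shows can be reproduced with a small sketch. The path pattern below is made up for illustration (the real cache has 478502 paths), but any set of file names sharing long common prefixes behaves the same way:

```python
import gzip

# Hypothetical paths with long shared prefixes, standing in for the
# real cache's 478502 redundant file names.
keys = [f"dataset/sample_{i:06d}/reads_{i:06d}.fastq" for i in range(10000)]

raw = "\n".join(keys).encode()
compressed = gzip.compress(raw)
print(f"{len(raw)} bytes raw -> {len(compressed)} bytes gzipped "
      f"({len(raw) / len(compressed):.0f}x)")
```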


Subtasks

Task #11057: Review 10932-arvput-slow-resuming (Resolved, Lucas Di Pentima)

Associated revisions

Revision a8b210d2
Added by Lucas Di Pentima over 2 years ago

Merge branch '10932-arvput-slow-resuming'
Closes #10932

History

#1 Updated by Tom Morris over 2 years ago

I added some logging to the client and it shows that loading from JSON takes just seconds, but the deep object copy takes almost half an hour and the JSON dump takes OVER AN HOUR.

Investigate:
- more compact storage structure like a trie or b-tree
- cPickle for save/restore
- non-JSON storage format since JSON is overkill

2017-01-19 03:24:49 arvados.arv_put571: About to do setup state
2017-01-19 03:24:49 arvados.arv_put571: Loading JSON cache file
2017-01-19 03:24:53 arvados.arv_put571: Done loading cache file
2017-01-19 03:24:53 arvados.arv_put571: Loading local collection from manifest text
2017-01-19 03:25:08 arvados.arv_put571: Done loading local collection
2017-01-19 03:25:08 arvados.arv_put571: done setup state
2017-01-19 03:25:53 arvados.arv_put571: Removing local files which don't exist any more from collection
2017-01-19 03:26:08 arvados.arv_put571: Entering _update
2017-01-19 03:26:19 arvados.arv_put571: getting nonfinal local manifest text
2017-01-19 03:27:06 arvados.arv_put571: Done getting local manifest text
2017-01-19 03:27:06 arvados.arv_put571: acquired lock doing deep copy self._state - 280 bytes
2017-01-19 03:53:13 arvados.arv_put571: done deep copy - 26 minutes
2017-01-19 03:53:17 arvados.arv_put571: About to do json dump
2017-01-19 04:55:52 arvados.arv_put571: Leaving _save_state - 63 minutes
282275M / 636959M 44.3%
2017-01-19 04:55:56 arvados.arv_put571 WARNING: Leaving _update
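
The 26-minute deep copy versus the seconds-long JSON load in the log above can be sketched at small scale. The dict shape is an assumption about the cache layout and the size is scaled down, but the ordering of the costs holds: copy.deepcopy() recurses object-by-object in Python, while json.dumps() runs mostly in C:

```python
import copy
import json
import time

# Hypothetical per-file state dict, scaled down for illustration.
state = {"files": {f"file_{i}": {"size": i, "mtime": float(i)}
                   for i in range(200_000)}}

t0 = time.perf_counter()
snapshot = copy.deepcopy(state)      # recurses object-by-object in Python
deepcopy_secs = time.perf_counter() - t0

t0 = time.perf_counter()
serialized = json.dumps(state)       # single pass, mostly in C
dumps_secs = time.perf_counter() - t0

print(f"deepcopy: {deepcopy_secs:.2f}s  json.dumps: {dumps_secs:.2f}s")
```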

#2 Updated by Tom Morris over 2 years ago

  • Target version changed from Arvados Future Sprints to 2017-02-15 sprint

#3 Updated by Tom Morris over 2 years ago

  • Target version changed from 2017-02-15 sprint to Arvados Future Sprints

#4 Updated by Tom Morris over 2 years ago

  • Subject changed from arv-put job resume too slow & uninformative to [arv-put] job resume too slow & uninformative

#5 Updated by Lucas Di Pentima over 2 years ago

  • Assigned To changed from Tom Morris to Lucas Di Pentima
  • Target version changed from Arvados Future Sprints to 2017-02-15 sprint
  • Story points set to 0.5

#6 Updated by Lucas Di Pentima over 2 years ago

  • Status changed from New to In Progress

#7 Updated by Lucas Di Pentima over 2 years ago

Updates at: 9c9b520
Tests: https://ci.curoverse.com/job/developer-run-tests/151/

A couple of optimizations have improved times a lot.
In my tests with a ~12MB cache, state-saving time went down from ~14 secs to ~0.8 secs just by avoiding copy.deepcopy() and json.dump() and replacing them with json.dumps().
As for the resume start delay, using a set() instead of a list reduced the time spent checking whether every file in the local collection still existed on local disk before starting to upload the rest.

#8 Updated by Peter Amstutz over 2 years ago

Is there any reason self._file_paths can't be a set()? It looks like the only thing that would need to be changed is to update self._file_paths.append(filename) to self._file_paths.add(filename).

Caching change looks good, might want to add a comment with what you learned about the performance of deepcopy/dump/dumps.
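
The suggested swap from list to set matters because membership tests against a list are O(n) per lookup, versus O(1) for a set. A hedged sketch (counts are scaled down from the real ~478k paths for illustration):

```python
import timeit

# Checking `path in list` scans the whole list on a miss; a set
# hashes directly to the answer.
paths_list = [f"file_{i}" for i in range(100_000)]
paths_set = set(paths_list)

miss = "not_cached"
list_time = timeit.timeit(lambda: miss in paths_list, number=100)
set_time = timeit.timeit(lambda: miss in paths_set, number=100)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```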

#9 Updated by Lucas Di Pentima over 2 years ago

Peter Amstutz wrote:

Is there any reason self._file_paths can't be a set()? It looks like the only thing that would need to be changed is to update self._file_paths.append(filename) to self._file_paths.add(filename).

I had read that sets aren't as performant to iterate as lists are, but hadn't tested it myself. As you mentioned, the difference between the two is a matter of microseconds, so it's updated as you requested.

Caching change looks good, might want to add a comment with what you learned about the performance of deepcopy/dump/dumps.

Done.

Test run: https://ci.curoverse.com/job/developer-run-tests/153/
Updates: e41dc06

#10 Updated by Lucas Di Pentima over 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:a8b210d2bb5b10e6583abf295b99788b3dff7479.
