Idea #21078 (open)
Performance of trashing / deleting large numbers of objects on S3
Description
Update:
We now have a process documented and some associated changes were made:
https://doc.arvados.org/v2.7/admin/keep-faster-gc-s3.html
Metrics suggest this is now deleting around 700 objects per second, which is a significant improvement.
At this rate, we should be able to delete about 1 million objects per day.
Unfortunately, we have 31 million objects to delete, so this is still going to take an entire month.
If we used the DeleteObjects API, we could delete up to 1000 objects per request. This would presumably improve object deletion throughput by somewhere between 10 and 1000 times, which would enable us to delete a petabyte of data in a few days instead of a few weeks.
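For comparison, here is a minimal sketch of what batched deletion with DeleteObjects could look like, using the AWS SDK for Go v2. This is not the keepstore driver code; the bucket name and keys are placeholders. It only illustrates the 1000-keys-per-request batching and the fact that per-key failures come back in the response body rather than as a call error.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// maxKeysPerRequest is the S3 DeleteObjects per-request limit.
const maxKeysPerRequest = 1000

// batchDelete deletes keys in batches of up to 1000 per DeleteObjects call.
func batchDelete(ctx context.Context, client *s3.Client, bucket string, keys []string) error {
    for start := 0; start < len(keys); start += maxKeysPerRequest {
        end := start + maxKeysPerRequest
        if end > len(keys) {
            end = len(keys)
        }
        objects := make([]types.ObjectIdentifier, 0, end-start)
        for _, key := range keys[start:end] {
            objects = append(objects, types.ObjectIdentifier{Key: aws.String(key)})
        }
        out, err := client.DeleteObjects(ctx, &s3.DeleteObjectsInput{
            Bucket: aws.String(bucket),
            Delete: &types.Delete{Objects: objects},
        })
        if err != nil {
            return fmt.Errorf("DeleteObjects: %w", err)
        }
        // Per-key failures are reported in the response, not as a call error.
        for _, e := range out.Errors {
            log.Printf("failed to delete %s: %s", aws.ToString(e.Key), aws.ToString(e.Message))
        }
    }
    return nil
}

func main() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    // Placeholder bucket and key; real keys would come from the trash list.
    err = batchDelete(context.Background(), client, "example-keep-bucket",
        []string{"trash/0cc175b9c0f1b6a831c399e269772661"})
    if err != nil {
        log.Fatal(err)
    }
}

Even at one request per second, this pattern would delete on the order of 1000 objects per second, which is where the 10x to 1000x estimate above comes from.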
Old description:
I need to delete 100s of TB of data from S3.
It seems we can submit delete requests at a pretty high rate, but "trash" operations are a bottleneck.
I currently have trash concurrency set to 40, and it reports running about 60 keep operations per second.
In 3 hours it is able to put somewhere between 20,000 and 90,000 blocks in the trash. Trashing each block is a multi-step operation; presumably making a copy of each block under the "trash/" prefix is the main bottleneck (see the sketch after this paragraph).
At the current rate, it is deleting somewhere between 1 TiB and 5 TiB of data on each 3-hour EmptyTrash cycle.
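To make the multi-step trash operation concrete, here is a rough sketch of the copy-then-delete pattern described above, again using the AWS SDK for Go v2 as an assumed client. This is not the actual keepstore implementation; the bucket name and block hash are placeholders.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// trashObject moves one block under the "trash/" prefix by copying it and
// then deleting the original key: at least two round trips per block, which
// is why trash throughput lags far behind plain deletes.
func trashObject(ctx context.Context, client *s3.Client, bucket, key string) error {
    _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
        Bucket:     aws.String(bucket),
        // CopySource is "bucket/key" (URL-encoded if the key has special characters).
        CopySource: aws.String(bucket + "/" + key),
        Key:        aws.String("trash/" + key),
    })
    if err != nil {
        return fmt.Errorf("copy %s to trash: %w", key, err)
    }
    _, err = client.DeleteObject(ctx, &s3.DeleteObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
    })
    if err != nil {
        return fmt.Errorf("delete original %s: %w", key, err)
    }
    return nil
}

func main() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)
    // Placeholder bucket and block hash.
    if err := trashObject(context.Background(), client, "example-keep-bucket",
        "0cc175b9c0f1b6a831c399e269772661"); err != nil {
        log.Fatal(err)
    }
}

Because each trashed block costs a CopyObject plus a DeleteObject, the trash rate is bounded well below what a single bulk delete call could achieve.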
I think the concurrency is actually a little too high: the log is showing 503 errors, and since I dialed up the concurrency it hasn't been able to return a full object index to keep-web, presumably because the list-objects requests are also failing with 503 errors.
I'm looking at the code, and maybe this is a situation where setting BlobTrashLifetime to 0 and enabling UnsafeDelete is the best option. But aside from forfeiting the ability to recover from a keep-balance mistake, the only warning is this comment:
# Enable deletion (garbage collection) even when the
# configured BlobTrashLifetime is zero. WARNING: eventual
# consistency may result in race conditions that can cause
# data loss. Do not enable this unless you understand and
# accept the risk.
I don't know how anyone could understand the risk since it isn't documented anywhere.
Updated by Peter Amstutz 9 months ago
- Target version changed from Future to Development 2024-01-31 sprint
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-01-31 sprint to Future