Idea #21078

open

Performance of trashing / deleting large numbers of objects on S3

Added by Peter Amstutz almost 1 year ago. Updated 9 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Keep
Target version:
Start date:
Due date:
Story points:
-

Description

Update:

We now have a process documented and some associated changes were made:

https://doc.arvados.org/v2.7/admin/keep-faster-gc-s3.html

Metrics suggest this is now deleting around 700 objects per second, which is a significant improvement.

At this rate, we should be able to delete about 1 million objects per day.

Unfortunately, we have 31 million objects to delete, so this is still going to take an entire month.

If we used the DeleteObjects API, we could delete up to 1000 objects per request. This would presumably improve delete throughput by somewhere between 10 and 1000 times, which would enable us to delete a petabyte of data in a few days instead of a few weeks.
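A minimal sketch of the batching this would involve, assuming a boto3-style S3 client (the bucket and key names here are hypothetical, and error handling is omitted). DeleteObjects accepts at most 1000 keys per request, so the keys must be chunked:

```python
MAX_BATCH = 1000  # hard limit on keys per DeleteObjects request

def chunked(keys, size=MAX_BATCH):
    """Yield successive batches of at most `size` keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def bulk_delete(s3, bucket, keys):
    """Delete `keys` from `bucket`, 1000 per request.

    `s3` is expected to be a boto3 S3 client (or anything with a
    compatible delete_objects method).
    """
    for batch in chunked(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={
                "Objects": [{"Key": k} for k in batch],
                "Quiet": True,  # only report per-key errors in the response
            },
        )
```

Compared with issuing one DeleteObject call per key, this trades 1000 round trips for one, which is where the order-of-magnitude throughput estimate above comes from.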

Old description:

I need to delete hundreds of TB of data from S3.

It seems we can submit delete requests at a pretty high rate, but "trash" operations are a bottleneck.

I currently have trash operations at 40 concurrent operations, and it reports running about 60 keep operations per second.

In 3 hours it is able to put somewhere between 20,000 and 90,000 blocks in the trash. Trashing each block is a multi-step operation; presumably making a copy of each block to the "trash/" prefix is the main bottleneck.
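A hedged sketch of what that multi-step trash operation plausibly looks like in S3 terms: a server-side copy of the block to the "trash/" prefix, followed by deletion of the original. The helper name and prefix layout here are assumptions for illustration, not keepstore's actual code:

```python
def trash_block(s3, bucket, block_hash):
    """Move one block under the trash/ prefix (two S3 requests).

    `s3` is expected to be a boto3-style S3 client.
    """
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": block_hash},
        Key="trash/" + block_hash,
    )
    s3.delete_object(Bucket=bucket, Key=block_hash)
```

The server-side copy has to duplicate the full block (up to 64 MiB in Keep), which would explain why trashing is so much slower than plain deletion.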

At the current rate, it is deleting somewhere between 1 TiB and 5 TiB of data on each 3-hour EmptyTrash cycle.
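A back-of-the-envelope check of that estimate, assuming blocks at Keep's 64 MiB maximum size (real blocks can be smaller, so this is an upper bound per block):

```python
MiB = 1 << 20
TiB = 1 << 40
BLOCK_SIZE = 64 * MiB  # Keep's maximum block size

# 20,000 to 90,000 blocks trashed per 3-hour cycle
low = 20_000 * BLOCK_SIZE / TiB
high = 90_000 * BLOCK_SIZE / TiB
print(f"{low:.1f} TiB to {high:.1f} TiB per cycle")  # → 1.2 TiB to 5.5 TiB per cycle
```

which is consistent with the observed 1-5 TiB per cycle.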

I think the concurrency rates are actually a little too high: the log is showing 503 errors, and since I dialed up the concurrency, it hasn't been able to return a full object index to keep-web, presumably because the list-objects requests are also failing with 503 errors.

I'm looking at the code, and maybe this is a situation where setting BlobTrashLifetime to 0 and enabling UnsafeDelete is the best option. But aside from forfeiting the ability to recover from a keep-balance mistake, the config comment warns:

          # Enable deletion (garbage collection) even when the
          # configured BlobTrashLifetime is zero.  WARNING: eventual
          # consistency may result in race conditions that can cause
          # data loss.  Do not enable this unless you understand and
          # accept the risk.

I don't know how anyone could understand the risk since it isn't documented anywhere.
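For reference, a hedged sketch of where these two settings would live in the cluster config, assuming the standard Arvados layout (the cluster ID and volume name below are placeholders):

```yaml
Clusters:
  xxxxx:
    Collections:
      # 0s means blocks are deleted immediately instead of trashed
      BlobTrashLifetime: 0s
    Volumes:
      xxxxx-nyw5e-000000000000000:
        Driver: S3
        DriverParameters:
          # Allow garbage collection even with a zero trash lifetime;
          # see the WARNING above about eventual-consistency races
          UnsafeDelete: true
```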

Actions #1

Updated by Peter Amstutz almost 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 1 year ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz almost 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz almost 1 year ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 9 months ago

  • Target version changed from Future to Development 2024-01-31 sprint
Actions #6

Updated by Peter Amstutz 9 months ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-01-31 sprint to Future