Story #21078

Updated by Peter Amstutz 4 months ago

I need to delete 100s of TB of data from S3. 

 It seems we can submit delete requests at a pretty high rate, but "trash" operations are a bottleneck. 

 I currently have trash concurrency set to 40, and it reports running about 60 keep operations per second. 

 In 3 hours it is able to put somewhere between 20,000 and 90,000 blocks in the trash. 

 At the current rate, it is deleting somewhere between 1 TiB and 5 TiB of data on each 3 hour EmptyTrash cycle. 
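 As a rough sanity check, the block counts and the data volumes above are consistent if most trashed blocks are close to the 64 MiB maximum Keep block size. The sketch below assumes every block is full-sized (so the TiB figures are upper bounds) and uses a hypothetical 100 TiB backlog purely for illustration. The function names are made up for this example and are not part of keepstore.

```python
# Back-of-the-envelope check of the figures above.
# Assumption (not stated in the post): every trashed block is a full
# 64 MiB Keep block, so these are upper bounds on the data volume.
MIB = 2**20
TIB = 2**40
BLOCK_SIZE = 64 * MIB
CYCLE_HOURS = 3


def tib_per_cycle(blocks: int, block_size: int = BLOCK_SIZE) -> float:
    """Upper bound on TiB trashed in one cycle for a given block count."""
    return blocks * block_size / TIB


def hours_to_delete(total_tib: float, tib_each_cycle: float,
                    cycle_hours: float = CYCLE_HOURS) -> float:
    """Hours needed to delete total_tib at a steady per-cycle rate."""
    return total_tib / tib_each_cycle * cycle_hours


low = tib_per_cycle(20_000)
high = tib_per_cycle(90_000)
print(f"{low:.2f} to {high:.2f} TiB per cycle")

# Hypothetical backlog of 100 TiB, chosen only for illustration.
backlog_tib = 100
print(f"{hours_to_delete(backlog_tib, high):.0f} to "
      f"{hours_to_delete(backlog_tib, low):.0f} hours for {backlog_tib} TiB")
```

 Under these assumptions, every 100 TiB of backlog takes somewhere between about two and ten days of continuous cycles, which is why the trash step matters more than the delete rate.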

 I think the concurrency rates are actually a little too high: the log is showing 503 errors, and since I dialed up the concurrency, it hasn't been able to return a full object index to keep-web, presumably because the list objects requests are also failing with 503 errors. 

 I'm looking at the code and maybe this is a situation where setting BlobTrashLifetime to 0 and enabling UnsafeDelete is the best option, but aside from forfeiting the ability to recover from a keep-balance mistake, we don't document what the risk is. 

           # Enable deletion (garbage collection) even when the 
           # configured BlobTrashLifetime is zero. WARNING: eventual 
           # consistency may result in race conditions that can cause 
           # data loss. Do not enable this unless you understand and 
           # accept the risk. 

 I don't know how anyone could understand the risk since it isn't documented anywhere.