Bug #19563

Updated by Peter Amstutz over 1 year ago

We have crunch-run processes that are getting OOM killed (and restarted) in the upload phase. 

Crunch-run is uploading very large files (30+ GB) while running on very small nodes (t3.small), which have 1 core, 2 GB RAM, and throttled network bandwidth. The hoststat numbers show far more data being received than transmitted.

 The suspicion is that the crunch-run process is buffering data in RAM, which is piling up until it gets OOM killed. 

a) Determine whether the queue of blocks to be uploaded is in fact uncapped.

b) If so, make it possible to set a cap that provides backpressure, blocking the uploader until more buffer space is available. Experimentally, I think we've found optimal upload rates with around 4-6 parallel block uploads.
