Bug #21640

Updated by Brett Smith about 1 month ago

User's Arvados controller started failing with this error on all incoming requests: 

 <pre><code>Mar 27 17:31:08 arvados-controller[3502095]: 2024/03/27 17:31:08 http: Accept error: accept tcp 127.0.0.1:8003: accept4: too many open files; retrying in 1s 
 </code></pre> 

 User deployed with our provided @arvados-controller.service@ definition, which sets @LimitNOFILE=65536@. Controller apparently exceeded this limit. This seems to correlate with when Keep balance starts working. 

 User has had the cluster for over a year and only encountered this issue when running a workflow with a particularly large multithreaded job. The error has happened twice with this job in the mix (not necessarily the same container; the workflow is also being debugged). 

 I note this "@LimitNOFILE@ documentation":https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#LimitNOFILE=: 

 > Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax= to control overall service memory use, including file descriptor memory. 

 * Does controller use @select@? Is it possible we're exceeding 1024 open connections, and then hitting that documented limit? 
 * Is it possible the controller is leaking file descriptors? I can go get other logs if someone can give me at least a general sense of what I should be looking for. 
 * If we're not using @select@ and we think everything is well behaved, I think it would be a nice change to have controller raise its own soft limit as suggested.
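
 To make the last two questions concrete, here is a minimal Go sketch, not taken from the controller source. It assumes Linux, and the helper names @raiseNoFileLimit@ and @countOpenFDs@ are hypothetical. The first raises the soft @RLIMIT_NOFILE@ to the hard limit, as the systemd documentation suggests. The second counts the entries in a process's @/proc/<pid>/fd@ directory, which could be sampled over time to see whether the descriptor count keeps growing.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// raiseNoFileLimit (hypothetical helper) raises the soft RLIMIT_NOFILE to the
// hard limit and returns the limits in effect afterwards.
func raiseNoFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	if lim.Cur < lim.Max {
		lim.Cur = lim.Max
		if err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
			return 0, 0, err
		}
	}
	// Read the limit back so the caller sees what the kernel actually applied.
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

// countOpenFDs (hypothetical helper) counts the entries in a Linux fd
// directory such as /proc/self/fd or /proc/<pid>/fd. The count includes the
// descriptor briefly opened to read the directory itself.
func countOpenFDs(fdDir string) (int, error) {
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	soft, hard, err := raiseNoFileLimit()
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not raise RLIMIT_NOFILE:", err)
		os.Exit(1)
	}
	fmt.Printf("RLIMIT_NOFILE soft=%d hard=%d\n", soft, hard)

	n, err := countOpenFDs("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not list open file descriptors:", err)
		os.Exit(1)
	}
	fmt.Printf("open file descriptors: %d\n", n)
}
```

 Two caveats. A single-value @LimitNOFILE=65536@ sets both the soft and the hard limit, so raising the soft limit inside the process would not get past 65536 unless the unit file changes too. Also, as far as I know, Go 1.19 and later already raise the soft limit to the hard limit at startup for programs that import @os@, so an explicit call like this may only matter for older toolchains.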
