Bug #22411 (Closed)
keepstore index API should be exempt from default 5m request timeout
Description
It is normal for keepstore index to take longer than 5 minutes.
If the keepstore index handler times out after 5 minutes, keep-balance will log something like this:
run failed: [...]: retrieve index: Index response had no EOF marker
Updated by Tom Clegg about 1 month ago
22411-keepstore-index-timeout @ 103db3abb62683329c93378e88347a476286a920 -- developer-run-tests: #4592
retry wb2 run-tests-services-workbench2-integration: #88
Well, this involved more refactoring than I expected.
We had an "exempt from timeout" facility for SSH tunnels, but it relied on intercepting the (ResponseWriter)Hijack() call, so
- it can't be used by a handler (like keepstore index) that returns an HTTP response instead of hijacking the connection, and
- it is no longer a reliable way to detect connection hijacking, now that the net/http package offers the ResponseController mechanism.
I figured the best way forward would be to modernize our Hijack usage and make deadline exemptions explicit.
Modernize:
- remove the code that detects hijacking
- remove the code that preserves the Hijack interface when wrapping ResponseWriter (the reason stdlib added ResponseController is to make this code unnecessary)
- use the ResponseController mechanism when we hijack connections
- update dependencies (our old version of the prometheus metrics collector did not support ResponseController)

Make deadline exemptions explicit:
- any handler that is exempt from RequestTimeout calls httpserver.ExemptFromDeadline(req)
- this works exactly the same way for connection-hijacking (websocket, tunnel) and regular (keepstore index) handlers
Updated by Peter Amstutz about 1 month ago
- Target version changed from Development 2025-01-08 to Development 2025-01-29
Updated by Tom Clegg about 1 month ago
22411-keepstore-index-timeout @ 103db3abb62683329c93378e88347a476286a920 (tests passed with wb2 retry, see #note-2)
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- ✅
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- ✅ added test for keepstore index timeout exemption
- New or changed UI/UX and has gotten feedback from stakeholders.
- N/A
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- N/A
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
Updated by Brett Smith about 1 month ago
I mean, look, I have thoughts about how the tests could be improved, but I always make these comments and they're obviously never gonna stick, so I give up. We don't have any criteria about whether the tests are maintainable or understandable, our only criteria is that tests exist, and yes, you wrote tests, so great, LGTM. I will make my peace with the fact that it is normal and acceptable for every test to start with 50 lines of boilerplate to open new TCP ports, inline 20 lines of half-baked cluster configuration YAML, start multiple services, and then wait for them to actually come up, because we still have no standard way to do any of this even though we do it every single time, just to make an HTTP request.
Is there anything in TestTimeout to stop the keepstore process after the test is done with it? I see the code in RunCommand that checks handler.Done() but the test doesn't seem to do anything with the handler so I'm not confident either way this will ever happen. I realize there's a lot of places in the stack it could happen that I'm not familiar with though.
Updated by Tom Clegg about 1 month ago
Brett Smith wrote in #note-8:
I mean, look, I have thoughts about how the tests could be improved, but I always make these comments and they're obviously never gonna stick, so I give up. We don't have any criteria about whether the tests are maintainable or understandable, our only criteria is that tests exist, and yes, you wrote tests, so great, LGTM. I will make my peace with the fact that it is normal and acceptable for every test to start with 50 lines of boilerplate to open new TCP ports, inline 20 lines of half-baked cluster configuration YAML, start multiple services, and then wait for them to actually come up, because we still have no standard way to do any of this even though we do it every single time, just to make an HTTP request.
FBOW the other keepstore tests don't bring in the lib/service wrapper, they just test the keepstore handler, but for this I wanted an integration test. Indeed, this is not the best place for all that integration-test boilerplate. It should probably be provided by lib/service.
Is there anything in TestTimeout to stop the keepstore process after the test is done with it? I see the code in RunCommand that checks handler.Done() but the test doesn't seem to do anything with the handler so I'm not confident either way this will ever happen. I realize there's a lot of places in the stack it could happen that I'm not familiar with though.
No. This is a shortcoming/defect in the lib/cmd.Handler interface -- it should take a context argument so it can be started and then interrupted by the caller. This is one of those things that never fits in scope of the current story. I guess it would help to add a ticket, so now we have #22457.
Updated by Tom Clegg about 1 month ago
It wouldn't be a big deal to add something to services/keepstore.router so a test case can terminate the service via lib/service.Handler.Done(). Perhaps that (with a comment that it should be removed/replaced by #22457) would be worthwhile?
Updated by Brett Smith about 1 month ago
Tom Clegg wrote in #note-9:
FBOW the other keepstore tests don't bring in the lib/service wrapper, they just test the keepstore handler, but for this I wanted an integration test. Indeed, this is not the best place for all that integration-test boilerplate. It should probably be provided by lib/service.
So, like, look, you're catching me at a bad time because #22406 follows all these same patterns and I've been spending a lot of time untangling and dealing with all the problems that causes. (#22406 is extra bad because on top of all this it's trying to do inter-container networking which is its whole own kettle of fish.) And while I've also been working on "put the boilerplate somewhere maintainable" I don't think that's really a fantastic end state either.
In Rails we can just modify Rails.configuration and then test the thing. This is great because it's very easy for the reader to follow (the specific configuration changes are likely the ones being tested) and it's basically the lowest possible overhead. IMO something like that should be the ideal. It drives me batty that in our Go services, changing the configuration is the primary way administrators parametrize the install, and we have not implemented any way to parametrize this in testing.
The barest minimum thing I would call "good" is if every Go service under test had a well-defined way to (a) splice in some provided configuration (being complete must not be required) and (b) revert back to the configuration on disk, both without restarting. e.g., maybe these could be endpoints that are only exposed under test a la database reset. That way you can test different configuration using just the cluster brought up by run_test_server.py. And then arvadostest provides functions with the simplest possible interface to do both these things, so you have no excuse not to use them in exactly the way you want, whether that's in test setup or suite setup or whatever. And then we review and make sure every test that should use these interfaces is actually using these interfaces, so future developers copy the good pattern instead of a bad one.
And yes, like you said, obviously that's out of scope for that ticket, but we have definitely wasted more time writing and maintaining the existing tests that try to cobble together a second test cluster from scratch than it would've taken to write this stuff to begin with.
In order to write good tests, you have to write testable code. Our core Go services don't get good marks on this front. And that's extremely frustrating given they're critical pieces of the stack.
It wouldn't be a big deal to add something to services/keepstore.router so a test case can terminate the service via lib/service.Handler.Done(). Perhaps that (with a comment that it should be removed/replaced by #22457) would be worthwhile?
I think so. I currently have a wrapper that runs run-tests.sh under systemd-run because the script leaves a lingering tail process that screws up my terminal if nothing else cleans it up, so I forced it all into a cgroup to make sure it behaved. Leaving a lingering daemon listening on a network port is way, way worse. We shouldn't do that.
Updated by Tom Clegg 28 days ago
Brett Smith wrote in #note-11:
I think so. I currently have a wrapper that runs run-tests.sh under systemd-run because the script leaves a lingering tail process that screws up my terminal if nothing else cleans it up, so I forced it all into a cgroup to make sure it behaved. Leaving a lingering daemon listening on a network port is way, way worse. We shouldn't do that.
Well, yes, but we already aren't doing that here. We're just listening on the port for the duration of the keepstore test process, instead of hanging up at the end of the one test case that uses it. There's no stray background process or anything.
Updated by Brett Smith 28 days ago
Tom Clegg wrote in #note-12:
Well, yes, but we already aren't doing that here. We're just listening on the port for the duration of the keepstore test process, instead of hanging up at the end of the one test case that uses it. There's no stray background process or anything.
Oh, I think I see now. In my head I was modeling it like a subprocess but it's not, it's a goroutine so it lives as long as the go test process, right? Yeah that's fine to merge.
Updated by Tom Clegg 28 days ago
Brett Smith wrote in #note-13:
Oh, I think I see now. In my head I was modeling it like a subprocess but it's not, it's a goroutine so it lives as long as the go test process, right? Yeah that's fine to merge.
The name "RunCommand" does evoke "start a process", doesn't it, but ... yes.
Updated by Tom Clegg 28 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|c261c58e6e983a92f9bee5b759ddd1a89975f192.