Bug #22655

closed

Block randomly found when not expected in keepclient TestPutGetHead

Added by Brett Smith 4 days ago. Updated about 22 hours ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Tests
Target version: Development 2025-03-19
Story points: -

Description

run-tests-remainder: #5036

======= test sdk/go/keepclient
child.pid is 48663
child.pid is 48673

----------------------------------------------------------------------
FAIL: keepclient_test.go:1133: ServerRequiredSuite.TestPutGetHead

keepclient_test.go:1145:
    c.Check(err, Equals, BlockNotFound)
... obtained *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0xc0012ac000), isTemp:false}} ("HEAD a8383a37e70cc4c384694c32c3face5f+14 failed: [http://localhost:37375/a8383a37e70cc4c384694c32c3face5f+14: HTTP 405 \"\" http://localhost:37701/a8383a37e70cc4c384694c32c3face5f+14: HTTP 404 \"\"]")
... expected *keepclient.ErrNotFound = &keepclient.ErrNotFound{multipleResponseError:keepclient.multipleResponseError{error:(*errors.errorString)(0x11457f0), isTemp:false}} ("Block not found")
... Difference:
...     multipleResponseError.error.s: "HEAD a8383a37e70cc4c384694c32c3face5f+14 failed: [http://localhost:37375/a8383a37e70cc4c384694c32c3face5f+14: HTTP 405 \"\" http://localhost:37701/a8383a37e70cc4c384694c32c3face5f+14: HTTP 404 \"\"]" != "Block not found" 

Sent SIGTERM to 48673 (/tmp/workspace/run-tests-remainder/tmp/keep1.pid)
OOPS: 53 passed, 1 FAILED
--- FAIL: Test (5.45s)
FAIL
coverage: 85.3% of statements
FAIL    git.arvados.org/arvados.git/sdk/go/keepclient    5.470s
FAIL
======= sdk/go/keepclient tests -- FAILED
======= test sdk/go/keepclient -- 6s

Files

run-tests-remainder-5036.log (822 KB), Brett Smith, 03/11/2025 01:38 PM

Subtasks 1 (0 open, 1 closed)

Task #22663: Review 22655-keepclient-test (Resolved, Tom Clegg, 03/14/2025)
Actions #1

Updated by Tom Clegg 4 days ago

keep1.log

{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000002.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:38533/","time":"2025-03-10T20:23:23.172325876Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000003.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:55825/","time":"2025-03-10T20:23:23.172366356Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000000.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:37375/","time":"2025-03-10T20:23:23.172373786Z"}
{"ClusterID":"zzzzz","PID":46042,"level":"warning","msg":"ARVADOS_API_HOST environment variable is present, but will not be used","time":"2025-03-10T20:23:23.172391216Z"}
{"ClusterID":"zzzzz","PID":46042,"level":"warning","msg":"ARVADOS_API_TOKEN environment variable is present, but will not be used","time":"2025-03-10T20:23:23.172399486Z"}
{"ClusterID":"zzzzz","PID":46042,"level":"warning","msg":"Possible configuration mistake: not useful to set API.MaxKeepBlobBuffers (128) higher than API.MaxConcurrentRequests (64)","time":"2025-03-10T20:23:23.172546046Z"}
{"ClusterID":"zzzzz","PID":46042,"level":"info","msg":"started volume zzzzz-nyw5e-000000000000001 (d1bc9782-2153-4316-b1a2-4e373a432577/tmp/workspace/run-tests-remainder/tmp/keep1.data), AllowWrite=true, AllowTrash=true","time":"2025-03-10T20:23:23.178694225Z"}
{"ClusterID":"zzzzz","Listen":"127.0.0.1:37701","PID":46042,"Service":"keepstore","URL":"http://127.0.0.1:37701/","Version":"1bca6728bf4659e4b13146a6f041a0c188a0a8bc-dev (go1.23.6)","level":"info","msg":"listening","time":"2025-03-10T20:23:23.179219675Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000003.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:55825/","time":"2025-03-10T20:24:03.982246237Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000000.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:37375/","time":"2025-03-10T20:24:03.982290317Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000002.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:38533/","time":"2025-03-10T20:24:03.982298167Z"}
{"ClusterID":"zzzzz","PID":48673,"level":"warning","msg":"ARVADOS_API_HOST environment variable is present, but will not be used","time":"2025-03-10T20:24:03.982331707Z"}
{"ClusterID":"zzzzz","PID":48673,"level":"warning","msg":"ARVADOS_API_TOKEN environment variable is present, but will not be used","time":"2025-03-10T20:24:03.982340077Z"}
{"ClusterID":"zzzzz","PID":48673,"level":"warning","msg":"Possible configuration mistake: not useful to set API.MaxKeepBlobBuffers (128) higher than API.MaxConcurrentRequests (64)","time":"2025-03-10T20:24:03.982493597Z"}
{"ClusterID":"zzzzz","PID":48673,"level":"info","msg":"started volume zzzzz-nyw5e-000000000000001 (d1bc9782-2153-4316-b1a2-4e373a432577/tmp/workspace/run-tests-remainder/tmp/keep1.data), AllowWrite=true, AllowTrash=true","time":"2025-03-10T20:24:03.988949667Z"}
{"ClusterID":"zzzzz","Listen":"127.0.0.1:37701","PID":48673,"Service":"keepstore","URL":"http://127.0.0.1:37701/","Version":"1bca6728bf4659e4b13146a6f041a0c188a0a8bc-dev (go1.23.6)","level":"info","msg":"listening","time":"2025-03-10T20:24:03.989514657Z"}
{"ClusterID":"zzzzz","PID":48673,"RequestID":"req-755yrrwrn8x51gp68t70","level":"info","msg":"request","remoteAddr":"127.0.0.1:38756","reqBytes":0,"reqForwardedFor":"","reqHost":"localhost:37701","reqMethod":"HEAD","reqPath":"a8383a37e70cc4c384694c32c3face5f+14","reqQuery":"","time":"2025-03-10T20:24:04.308891718Z"}

but... keep0.log

{"ClusterID":"zzzzz","PID":44641,"RequestID":"req-gre4z9ubu2z1tyvf0bst","level":"info","msg":"response","priority":0,"queue":"api","remoteAddr":"127.0.0.1:52254","reqBytes":0,"reqForwardedFor":"","reqHost":"localhost:37375","reqMethod":"GET","reqPath":"index/dcd0348cb2532ee90c99f1b846efaee7","reqQuery":"","respBytes":1,"respStatus":"OK","respStatusCode":200,"time":"2025-03-10T20:22:41.816645158Z","timeToStatus":0.000097,"timeTotal":0.000137,"timeWriteBody":0.000040}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000001.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:37701/","time":"2025-03-10T20:23:23.012468258Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000002.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:38533/","time":"2025-03-10T20:23:23.012510778Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000003.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:55825/","time":"2025-03-10T20:23:23.012518148Z"}
{"ClusterID":"zzzzz","PID":46026,"level":"warning","msg":"ARVADOS_API_HOST environment variable is present, but will not be used","time":"2025-03-10T20:23:23.012536328Z"}
{"ClusterID":"zzzzz","PID":46026,"level":"warning","msg":"ARVADOS_API_TOKEN environment variable is present, but will not be used","time":"2025-03-10T20:23:23.012549158Z"}
{"ClusterID":"zzzzz","PID":46026,"level":"warning","msg":"Possible configuration mistake: not useful to set API.MaxKeepBlobBuffers (128) higher than API.MaxConcurrentRequests (64)","time":"2025-03-10T20:23:23.012738438Z"}
{"ClusterID":"zzzzz","PID":46026,"level":"info","msg":"started volume zzzzz-nyw5e-000000000000000 (d1bc9782-2153-4316-b1a2-4e373a432577/tmp/workspace/run-tests-remainder/tmp/keep0.data), AllowWrite=true, AllowTrash=true","time":"2025-03-10T20:23:23.018740318Z"}
{"ClusterID":"zzzzz","Listen":"127.0.0.1:37375","PID":46026,"Service":"keepstore","URL":"http://127.0.0.1:37375/","Version":"1bca6728bf4659e4b13146a6f041a0c188a0a8bc-dev (go1.23.6)","level":"info","msg":"listening","time":"2025-03-10T20:23:23.019221788Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000001.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:37701/","time":"2025-03-10T20:24:03.929526819Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000002.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:38533/","time":"2025-03-10T20:24:03.929591459Z"}
{"level":"warning","msg":"Volumes.zzzzz-nyw5e-000000000000003.AccessViaHosts refers to nonexistent keepstore server http://127.0.0.1:55825/","time":"2025-03-10T20:24:03.929599879Z"}
{"ClusterID":"zzzzz","PID":48663,"level":"warning","msg":"ARVADOS_API_HOST environment variable is present, but will not be used","time":"2025-03-10T20:24:03.929630779Z"}
{"ClusterID":"zzzzz","PID":48663,"level":"warning","msg":"ARVADOS_API_TOKEN environment variable is present, but will not be used","time":"2025-03-10T20:24:03.929641769Z"}
{"error":"could not enable the \"keepstore\" service on this host: http://127.0.0.1:37375/: listen tcp 127.0.0.1:37375: bind: address already in use","level":"error","msg":"exiting","time":"2025-03-10T20:24:03.929818079Z"}
Actions #2

Updated by Tom Clegg 4 days ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

The problem is that something (other than the intended keepstore process) was using port 37375, so keepstore didn't start up, and run_test_servers.py didn't notice, because it just waits for something to listen on the port.

It could have been a keepstore process started by a previous test suite (that would have been sdk/go/arvadosclient according to the timestamps) -- perhaps "stop_keep" isn't reliable? -- but I don't think so because
  • keepstore would still have been logging to keep0.log in that case, and this 405 wasn't logged; and
  • keepstore only returns 405 for "GET locator without size hint", and this wasn't that.

I've updated run_test_servers.py to wait for the intended process to listen on the port in the cases where that's easy to check (i.e., not passenger, where r_t_s's child process isn't the one that binds).

This still doesn't fix the issue, but it's a slight improvement: we now get a log message like "WARNING: Process 1752372 is not listening on port 51063 (waited 10 seconds)."
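
For illustration, a minimal Python sketch of that kind of pid-aware check (names and details here are hypothetical, not the actual run_test_servers.py code): ask netstat which process owns the listening socket, and only count the port as up if it belongs to the pid we launched.

    # Hypothetical sketch; names and details are illustrative only.
    import re
    import subprocess
    import time

    def pid_is_listening(port, pid):
        # With -p, netstat marks listening sockets with "LISTEN <pid>/<program>"
        # for processes the caller is allowed to see.
        out = subprocess.run(['netstat', '-tlnp'],
                             capture_output=True, text=True).stdout
        return re.search(r'\ntcp.*:%d\s.*LISTEN\s+%d/' % (port, pid), out) is not None

    def wait_until_port_listens(port, pid, timeout=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if pid_is_listening(port, pid):
                return True
            time.sleep(0.5)
        print("WARNING: Process %d is not listening on port %d (waited %d seconds)"
              % (pid, port, timeout))
        return False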

If this interfering process is short-lived -- which it seems to have been in this case, given that 37 seconds later, the next time another test suite attempted to bring up keepstore on the same port, it worked -- we could work around it in run-tests or (easier) in our "start http service on port" code in lib/service with a configurable timeout. Should we do this?

While tracking this down I
  • fixed some needless global state changes in arvadostest (chdir) and arvadosclient tests (setenv).
  • updated the start_keep code to pass ARVADOS_SERVICE_INTERNAL_URL, because (contrary to what the comment used to say) writing a new config file is no longer the only way to tell keepstore which port to listen on. We still need to write a custom config file anyway to obey the blob_signing arg in the general case, but at least the config is a little more like real life, where all keepstores get to see one another's InternalURLs.
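
For the second point, the shape of the change is roughly this (hypothetical sketch; the env var name comes from this ticket, everything else, including the function name and arguments, is illustrative):

    # Hypothetical sketch: pass the listen address via the environment instead
    # of relying solely on a per-test config file.
    import os
    import subprocess

    def start_keep(keepstore_cmd, port):
        # keepstore_cmd is whatever command line the test helper already uses;
        # the only change sketched here is the extra environment variable.
        env = dict(os.environ)
        env['ARVADOS_SERVICE_INTERNAL_URL'] = 'http://127.0.0.1:%d/' % port
        return subprocess.Popen(keepstore_cmd, env=env)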

22655-keepclient-test @ ef51dea84d27adbc556f184186909dee383fce1b -- developer-run-tests: #4692

Actions #3

Updated by Tom Clegg 4 days ago

  • Target version set to Development 2025-03-19
Actions #4

Updated by Tom Clegg 3 days ago

conclusion from eng meeting
  • When running inside the test suite (check ARVADOS_TEST_API_HOST), keepstore et al. should retry for up to 5 minutes if listening fails with "address already in use"
  • In run_test_servers.py, _wait_until_port_listens should crash (instead of just warning) if the specified process is not listening after 5 minutes
  • In run_test_servers.py, _wait_until_port_listens should crash if the specified process exits
  • Test suites that invoke run_test_servers.py start_keep should fail on non-zero exit
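
A rough Python sketch of the second and third bullets, reusing the pid_is_listening check sketched above in #note-2 (again hypothetical, not the actual run_test_servers.py code):

    # Hypothetical sketch of the agreed behavior: crash if the child exits,
    # crash if it never listens within the timeout.
    import sys
    import time

    def wait_until_port_listens_or_die(port, child, timeout=300):
        # child is the subprocess.Popen we started for the service;
        # pid_is_listening is the helper sketched in the earlier note.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if child.poll() is not None:
                # The service exited before it ever listened: fail right away.
                sys.exit("child %d exited %d before listening on port %d"
                         % (child.pid, child.returncode, port))
            if pid_is_listening(port, child.pid):
                return
            time.sleep(0.5)
        sys.exit("process %d is not listening on port %d (waited %d seconds)"
                 % (child.pid, port, timeout))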
Actions #5

Updated by Tom Clegg 3 days ago

22655-keepclient-test @ 599e538d27f34c1d54c334f7678cfa9477358c0b -- developer-run-tests: #4693

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ when a go service starts up and ARVADOS_TEST_API_HOST and ARVADOS_SERVICE_INTERNAL_URL env vars are set (i.e., keepstore running under run-tests.sh), if listening fails with "address already in use", it will retry for up to 5 minutes instead of giving up immediately
    • run_test_servers.py waits up to 5 minutes for a service to come up and listen on its assigned port
    • run_test_servers.py gives up right away and crashes (instead of just warning and exiting 0) if a service does not listen on its assigned port
    • run_test_servers.py gives up right away and crashes (instead of waiting the full timeout) if a service process exits while we're waiting for it to listen
    • arvadostest.StartKeep et al. fail the test suite (instead of ignoring the error) if run_test_servers.py exits non-zero (with a bit more effort it could fail the test case, I'm just not sure how far to creep scope)
    • keepstore and run_test_servers.py both log a "waiting..." message if a retry/wait is needed
    • In the case of Passenger we don't know the pid of the grandchild process that will listen, so we just check that something is listening, and use a shorter timeout -- but we still get the "give up without waiting 10 seconds if passenger crashes/exits at startup" benefit.
    • See also "while tracking this down..." extras noted above in #note-2
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ tested "wait for port" manually by running nc -l -p $port -s 127.0.0.1
      ======= test sdk/go/keepclient
      START: keepclient_test.go:68: ServerRequiredSuite.SetUpSuite
      keep0.pid is 1811992
      waiting for port 54501...
      keep1.pid is 1812050
      PASS: keepclient_test.go:68: ServerRequiredSuite.SetUpSuite     4.750s
      
    • ✅ tested "fail tests if run_test_servers.py fails" by adding a typo
      ======= test sdk/go/keepclient
      START: keepclient_test.go:68: ServerRequiredSuite.SetUpSuite
      keep0.pid is 1810600
      Traceback (most recent call last):
        File "/home/tom/arvados/sdk/python/tests/run_test_server.py", line 1037, in <module>
          run_keep(blob_signing=args.keep_blob_signing, num_servers=args.num_keep_servers)
        File "/home/tom/arvados/sdk/python/tests/run_test_server.py", line 570, in run_keep
          port = _start_keep(d, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/tom/arvados/sdk/python/tests/run_test_server.py", line 553, in _start_keep
          _wait_until_port_listens(port, pid=child.pid)
        File "/home/tom/arvados/sdk/python/tests/run_test_server.py", line 226, in _wait_until_port_listens
          if re.searchg(r'\ntcp.*:'+str(port)+' .* LISTEN +'+matchpid+'/',
             ^^^^^^^^^^
      AttributeError: module 're' has no attribute 'searchg'. Did you mean: 'search'?
      2025/03/12 13:23:42 [python run_test_server.py start_keep --num-keep-servers 2]: exited 1
      exit status 1
      FAIL    git.arvados.org/arvados.git/sdk/go/keepclient   0.292s
      ======= sdk/go/keepclient tests -- FAILED
      
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • n/a
  • Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • Follows our coding standards and GUI style guidelines.
    • ✅ I hope my Python wrapping/indentation is correct
Actions #6

Updated by Tom Clegg 3 days ago

  • Subtask #22663 added
Actions #7

Updated by Brett Smith 2 days ago

Tom Clegg wrote in #note-5:

22655-keepclient-test @ 599e538d27f34c1d54c334f7678cfa9477358c0b -- developer-run-tests: #4693

  • ef51dea84d27adbc556f184186909dee383fce1b just removes the wait-for-port call for keepproxy. Was this intentional? What's the rationale?
  • In _wait_until_port_listens it would make a nicer API to make listener_must_be_pid a keyword-only argument (i.e., put it after * in the signature). This way callers have to spell out listener_must_be_pid=True and can't just pass a bare True that's hard to read at the call site.
  • raise "error" is not and has never been valid Python 3, you gotta raise an exception:
    >>> raise "foo" 
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: exceptions must derive from BaseException
    
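For example, any actual exception class works (placeholder message):

    raise RuntimeError("service is not listening on its assigned port")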

Thanks.

Actions #8

Updated by Tom Clegg about 24 hours ago

Brett Smith wrote in #note-7:

  • ef51dea84d27adbc556f184186909dee383fce1b just removes the wait-for-port call for keepproxy. Was this intentional? What's the rationale?

Yes. There used to be two calls for the same keepproxy process, now there's just one.

  • In _wait_until_port_listens it would make a nicer API to make listener_must_be_pid a keyword-only argument (i.e., put it after * in the signature). This way callers have to spell out listener_must_be_pid=True and can't just pass a bare True that's hard to read at the call site.

Good call, thanks. Done.

  • raise "error" is not and has never been valid Python 3, you gotta raise an exception:[...]

Fixed (2 new ones plus 1 from 2022).

22655-keepclient-test @ 207f55d8ec4a6bae04652ece8a63495718028f94 -- developer-run-tests: #4707

Actions #9

Updated by Brett Smith about 23 hours ago

Tom Clegg wrote in #note-8:

  • In _wait_until_port_listens it would make a nicer API to make listener_must_be_pid a keyword-only argument (i.e., put it after * in the signature). This way callers have to spell out listener_must_be_pid=True and can't just pass a bare True that's hard to read at the call site.

Good call, thanks. Done.

For the ticket record I only meant listener_must_be_pid, not all the keyword arguments, but sure this works for me too.

22655-keepclient-test @ 207f55d8ec4a6bae04652ece8a63495718028f94 -- developer-run-tests: #4707

LGTM (the test failure is #22654), thanks.

Actions #10

Updated by Tom Clegg about 22 hours ago

  • Status changed from In Progress to Resolved