Project

General

Profile

Actions

Bug #21258

closed

lib/dispatchcloud TestDispatchToStubDriver is racey

Added by Brett Smith about 1 year ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Story points:
-
Release:
Release relationship:
Auto

Description

Failure mode 2

This failure was seen in a run that had b015c9e45f2a81b7069e5ecde3e0e9e0c5c619fa merged.

dispatcher_test.go:297:
    c.Check(resp.Body.String(), check.Matches, `(?ms).*boot_outcomes{outcome="failure"} [^0].*`)
... value string = "" +
...     "# HELP arvados_dispatchcloud_at_quota Flag indicating the cloud driver is reporting an at-quota condition.\n" +
...     "# TYPE arvados_dispatchcloud_at_quota gauge\n" +
...     "arvados_dispatchcloud_at_quota 0\n" +
...     "# HELP arvados_dispatchcloud_boot_outcomes Boot outcomes by type.\n" +
...     "# TYPE arvados_dispatchcloud_boot_outcomes counter\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"aborted\"} 7\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"disappeared\"} 5\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"failure\"} 0\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"success\"} 41\n" +
...     "# HELP arvados_dispatchcloud_containers_allocated_not_started Number of containers allocated to a worker but not started yet (worker is booting).\n" +
...     "# TYPE arvados_dispatchcloud_containers_allocated_not_started gauge\n" +
...     "arvados_dispatchcloud_containers_allocated_not_started 0\n" +
...     "# HELP arvados_dispatchcloud_containers_longest_wait_time_seconds Current longest wait time of any container since queuing, and before the start of crunch-run.\n" +
...     "# TYPE arvados_dispatchcloud_containers_longest_wait_time_seconds gauge\n" +
...     "arvados_dispatchcloud_containers_longest_wait_time_seconds 0\n" +
...     "# HELP arvados_dispatchcloud_containers_not_allocated_over_quota Number of containers not allocated to a worker because the system has hit a quota.\n" +
...     "# TYPE arvados_dispatchcloud_containers_not_allocated_over_quota gauge\n" +
...     "arvados_dispatchcloud_containers_not_allocated_over_quota 0\n" +
...     "# HELP arvados_dispatchcloud_containers_running Number of containers reported running by cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_containers_running gauge\n" +
...     "arvados_dispatchcloud_containers_running 0\n" +
...     "# HELP arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds Number of seconds between the queuing of a container and the start of crunch-run.\n" +
...     "# TYPE arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds summary\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.5\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.9\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.95\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.99\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds_sum 1.9461314997763523e+12\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds_count 211\n" +
...     "# HELP arvados_dispatchcloud_driver_operations Number of instance-create/destroy/list operations performed via cloud driver.\n" +
...     "# TYPE arvados_dispatchcloud_driver_operations counter\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"Create\"} 53\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"Destroy\"} 53\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"List\"} 111\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"SetTags\"} 25\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"Create\"} 15\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"Destroy\"} 7\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"List\"} 0\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"SetTags\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_disappeared Number of occurrences of an instance disappearing from the cloud provider's list of instances.\n" +
...     "# TYPE arvados_dispatchcloud_instances_disappeared counter\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"idle\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"running\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"shutdown\"} 52\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_price Price of cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_instances_price gauge\n" +
...     "arvados_dispatchcloud_instances_price{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"idle\"} 0.984\n" +
...     "arvados_dispatchcloud_instances_price{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_run_probe_duration_seconds Number of seconds per runProbe call.\n" +
...     "# TYPE arvados_dispatchcloud_instances_run_probe_duration_seconds summary\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.5\"} 0.000299431\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.9\"} 0.00055759\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.95\"} 0.00072938\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.99\"} 0.00126165\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_sum{outcome=\"fail\"} 0.04786191400000002\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_count{outcome=\"fail\"} 128\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.5\"} 0.000334\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.9\"} 0.00065138\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.95\"} 0.00091094\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.99\"} 0.00174714\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_sum{outcome=\"success\"} 0.3429287439999997\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_count{outcome=\"success\"} 876\n" +
...     "# HELP arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds Number of seconds between the first shutdown attempt and the disappearance of the worker.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.5\"} 0.00741143\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.9\"} 0.01473868\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.95\"} 0.01628774\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.99\"} 0.01721516\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds_sum 0.40030032699999996\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds_count 52\n" +
...     "# HELP arvados_dispatchcloud_instances_time_to_ready_for_container_seconds Number of seconds between the first successful SSH connection and ready to run a container.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_to_ready_for_container_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.5\"} 0.009143631\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.9\"} 0.01485901\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.95\"} 0.01578203\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.99\"} 0.0195959\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds_sum 0.37931822099999996\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds_count 41\n" +
...     "# HELP arvados_dispatchcloud_instances_time_to_ssh_seconds Number of seconds between instance creation and the first successful SSH connection.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_to_ssh_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.5\"} 0.0181355\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.9\"} 0.02653959\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.95\"} 0.02768423\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.99\"} 0.03185776\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds_sum 0.9493334640000001\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds_count 50\n" +
...     "# HELP arvados_dispatchcloud_instances_total Number of cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_instances_total gauge\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type8\"} 1\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type8\"} 0\n" +
...     "# HELP arvados_dispatchcloud_last_503_time Time of most recent 503 error received from API.\n" +
...     "# TYPE arvados_dispatchcloud_last_503_time gauge\n" +
...     "arvados_dispatchcloud_last_503_time 1.703100514e+09\n" +
...     "# HELP arvados_dispatchcloud_max_concurrent_containers Dynamically assigned limit on number of containers scheduled concurrency, set after receiving 503 errors from API.\n" +
...     "# TYPE arvados_dispatchcloud_max_concurrent_containers gauge\n" +
...     "arvados_dispatchcloud_max_concurrent_containers 5\n" +
...     "# HELP arvados_dispatchcloud_memory_bytes_total Total memory on all cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_memory_bytes_total gauge\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"idle\"} 8.589934592e+09\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_probe_age_seconds_max Maximum number of seconds since an instance's most recent successful probe.\n" +
...     "# TYPE arvados_dispatchcloud_probe_age_seconds_max gauge\n" +
...     "arvados_dispatchcloud_probe_age_seconds_max 0.01719935\n" +
...     "# HELP arvados_dispatchcloud_probe_age_seconds_median Median number of seconds since an instance's most recent successful probe.\n" +
...     "# TYPE arvados_dispatchcloud_probe_age_seconds_median gauge\n" +
...     "arvados_dispatchcloud_probe_age_seconds_median 0.01719935\n" +
...     "# HELP arvados_dispatchcloud_vcpus_total Total VCPUs on all cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_vcpus_total gauge\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"idle\"} 8\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"unknown\"} 0\n" 
... regex string = "(?ms).*boot_outcomes{outcome=\"failure\"} [^0].*" 

time="2023-12-20T19:28:35.182068484Z" level=info msg="instance disappeared in cloud" Instance="inst46,providertype8" WorkerState=shutdown
OOPS: 11 passed, 1 FAILED
--- FAIL: Test (1.25s)
FAIL
    git.arvados.org/arvados.git/lib/dispatchcloud    coverage: 75.1% of statements
FAIL    git.arvados.org/arvados.git/lib/dispatchcloud    1.271s
FAIL

Failure mode 1

FAIL: dispatcher_test.go:160: DispatcherSuite.TestDispatchToStubDriver
[way too many logs omitted, see attachment]
dispatcher_test.go:296:
    c.Check(resp.Body.String(), check.Matches, `(?ms).*boot_outcomes{outcome="failure"} [^0].*`)
... value string = "" +
...     "# HELP arvados_dispatchcloud_at_quota Flag indicating the cloud driver is reporting an at-quota condition.\n" +
...     "# TYPE arvados_dispatchcloud_at_quota gauge\n" +
...     "arvados_dispatchcloud_at_quota 0\n" +
...     "# HELP arvados_dispatchcloud_boot_outcomes Boot outcomes by type.\n" +
...     "# TYPE arvados_dispatchcloud_boot_outcomes counter\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"aborted\"} 9\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"disappeared\"} 4\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"failure\"} 0\n" +
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"success\"} 49\n" +
...     "# HELP arvados_dispatchcloud_containers_allocated_not_started Number of containers allocated to a worker but not started yet (worker is booting).\n" +
...     "# TYPE arvados_dispatchcloud_containers_allocated_not_started gauge\n" +
...     "arvados_dispatchcloud_containers_allocated_not_started 0\n" +
...     "# HELP arvados_dispatchcloud_containers_longest_wait_time_seconds Current longest wait time of any container since queuing, and before the start of crunch-run.\n" +
...     "# TYPE arvados_dispatchcloud_containers_longest_wait_time_seconds gauge\n" +
...     "arvados_dispatchcloud_containers_longest_wait_time_seconds 0\n" +
...     "# HELP arvados_dispatchcloud_containers_not_allocated_over_quota Number of containers not allocated to a worker because the system has hit a quota.\n" +
...     "# TYPE arvados_dispatchcloud_containers_not_allocated_over_quota gauge\n" +
...     "arvados_dispatchcloud_containers_not_allocated_over_quota 0\n" +
...     "# HELP arvados_dispatchcloud_containers_running Number of containers reported running by cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_containers_running gauge\n" +
...     "arvados_dispatchcloud_containers_running 0\n" +
...     "# HELP arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds Number of seconds between the queuing of a container and the start of crunch-run.\n" +
...     "# TYPE arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds summary\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.5\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.9\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.95\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds{quantile=\"0.99\"} 9.223372036854776e+09\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds_sum 1.9830249879237712e+12\n" +
...     "arvados_dispatchcloud_containers_time_from_queue_to_crunch_run_seconds_count 215\n" +
...     "# HELP arvados_dispatchcloud_driver_operations Number of instance-create/destroy/list operations performed via cloud driver.\n" +
...     "# TYPE arvados_dispatchcloud_driver_operations counter\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"Create\"} 62\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"Destroy\"} 62\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"List\"} 103\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"0\",operation=\"SetTags\"} 21\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"Create\"} 13\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"Destroy\"} 9\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"List\"} 0\n" +
...     "arvados_dispatchcloud_driver_operations{error=\"1\",operation=\"SetTags\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_disappeared Number of occurrences of an instance disappearing from the cloud provider's list of instances.\n" +
...     "# TYPE arvados_dispatchcloud_instances_disappeared counter\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"idle\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"running\"} 0\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"shutdown\"} 62\n" +
...     "arvados_dispatchcloud_instances_disappeared{state=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_price Price of cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_instances_price gauge\n" +
...     "arvados_dispatchcloud_instances_price{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"idle\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_instances_price{category=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_instances_run_probe_duration_seconds Number of seconds per runProbe call.\n" +
...     "# TYPE arvados_dispatchcloud_instances_run_probe_duration_seconds summary\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.5\"} 0.00025819\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.9\"} 0.00049002\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.95\"} 0.00068864\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"fail\",quantile=\"0.99\"} 0.0016936\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_sum{outcome=\"fail\"} 0.04495730099999998\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_count{outcome=\"fail\"} 138\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.5\"} 0.00025914\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.9\"} 0.00051182\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.95\"} 0.00081387\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds{outcome=\"success\",quantile=\"0.99\"} 0.001874889\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_sum{outcome=\"success\"} 0.2827972210000001\n" +
...     "arvados_dispatchcloud_instances_run_probe_duration_seconds_count{outcome=\"success\"} 843\n" +
...     "# HELP arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds Number of seconds between the first shutdown attempt and the disappearance of the worker.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.5\"} 0.009525099\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.9\"} 0.016427759\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.95\"} 0.017818548\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds{quantile=\"0.99\"} 0.018422408\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds_sum 0.5827228839999999\n" +
...     "arvados_dispatchcloud_instances_time_from_shutdown_request_to_disappearance_seconds_count 62\n" +
...     "# HELP arvados_dispatchcloud_instances_time_to_ready_for_container_seconds Number of seconds between the first successful SSH connection and ready to run a container.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_to_ready_for_container_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.5\"} 0.007698769\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.9\"} 0.018579038\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.95\"} 0.021335438\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds{quantile=\"0.99\"} 0.036827616\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds_sum 0.4455547920000001\n" +
...     "arvados_dispatchcloud_instances_time_to_ready_for_container_seconds_count 49\n" +
...     "# HELP arvados_dispatchcloud_instances_time_to_ssh_seconds Number of seconds between instance creation and the first successful SSH connection.\n" +
...     "# TYPE arvados_dispatchcloud_instances_time_to_ssh_seconds summary\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.5\"} 0.017542368\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.9\"} 0.026203417\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.95\"} 0.028872557\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds{quantile=\"0.99\"} 0.033338397\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds_sum 1.1238655879999997\n" +
...     "arvados_dispatchcloud_instances_time_to_ssh_seconds_count 60\n" +
...     "# HELP arvados_dispatchcloud_instances_total Number of cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_instances_total gauge\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"booting\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"hold\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type2\"} 1\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"idle\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"inuse\",instance_type=\"type8\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type1\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type16\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type2\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type3\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type4\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type6\"} 0\n" +
...     "arvados_dispatchcloud_instances_total{category=\"unknown\",instance_type=\"type8\"} 0\n" +
...     "# HELP arvados_dispatchcloud_last_503_time Time of most recent 503 error received from API.\n" +
...     "# TYPE arvados_dispatchcloud_last_503_time gauge\n" +
...     "arvados_dispatchcloud_last_503_time 1.701292139e+09\n" +
...     "# HELP arvados_dispatchcloud_max_concurrent_containers Dynamically assigned limit on number of containers scheduled concurrency, set after receiving 503 errors from API.\n" +
...     "# TYPE arvados_dispatchcloud_max_concurrent_containers gauge\n" +
...     "arvados_dispatchcloud_max_concurrent_containers 6\n" +
...     "# HELP arvados_dispatchcloud_memory_bytes_total Total memory on all cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_memory_bytes_total gauge\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"idle\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_memory_bytes_total{category=\"unknown\"} 0\n" +
...     "# HELP arvados_dispatchcloud_probe_age_seconds_max Maximum number of seconds since an instance's most recent successful probe.\n" +
...     "# TYPE arvados_dispatchcloud_probe_age_seconds_max gauge\n" +
...     "arvados_dispatchcloud_probe_age_seconds_max 0\n" +
...     "# HELP arvados_dispatchcloud_probe_age_seconds_median Median number of seconds since an instance's most recent successful probe.\n" +
...     "# TYPE arvados_dispatchcloud_probe_age_seconds_median gauge\n" +
...     "arvados_dispatchcloud_probe_age_seconds_median 0\n" +
...     "# HELP arvados_dispatchcloud_vcpus_total Total VCPUs on all cloud VMs.\n" +
...     "# TYPE arvados_dispatchcloud_vcpus_total gauge\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"booting\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"hold\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"idle\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"inuse\"} 0\n" +
...     "arvados_dispatchcloud_vcpus_total{category=\"unknown\"} 0\n" 
... regex string = "(?ms).*boot_outcomes{outcome=\"failure\"} [^0].*" 

Files

developer-run-tests-remainder-4133.log (1.33 MB) developer-run-tests-remainder-4133.log Brett Smith, 12/01/2023 05:05 PM
developer-run-tests-remainder-4175.log (1.31 MB) developer-run-tests-remainder-4175.log Full log exhibiting failure mode #2 Brett Smith, 12/28/2023 05:26 PM

Subtasks 2 (0 open2 closed)

Task #21270: Review 21258-flaky-adc-testResolvedTom Clegg12/06/2023Actions
Task #21352: Review 21258-flaky-adc-testResolvedTom Clegg12/06/2023Actions
Actions #1

Updated by Tom Clegg about 1 year ago

  • Target version changed from To be scheduled to Development 2024-01-03 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

21258-flaky-adc-test @ b015c9e45f2a81b7069e5ecde3e0e9e0c5c619fa -- developer-run-tests: #3951

Fixes
  • regex string = "(?ms).*boot_outcomes{outcome=\"failure\"} [^0].*"
    • the previous attempt to fix this was ensuring that one instance would always crash crunch-run when starting a container, but that didn't ensure that the instance would be reported as a boot failure
    • the real fix is just to set Broken=now
    • there was also a potential race condition that could skip n=3 and miss trying to create a guaranteed-broken node, so that is fixed too
  • "container completed twice"
    • not previously mentioned on this issue, but I've seen it before, and it came up while testing for the other failure mode
    • caused by crunch-run on a stub instance making API calls after the stub cloud driver already reported that the instance no longer exists (which we're assuming is impossible IRL)
    • fixed in stub driver by waiting for the fake crunch-run processes to finish before removing the stub instance from the "running instances" list
Actions #2

Updated by Lucas Di Pentima about 1 year ago

This LGTM, thanks.

Actions #3

Updated by Lucas Di Pentima about 1 year ago

Sidenote: I've seen a test keep-balance failure once while running the tests: developer-run-tests-remainder: #4159

...
11:59:36 ----------------------------------------------------------------------
11:59:36 FAIL: balance_run_test.go:626: runSuite.TestRunForever
11:59:36 
11:59:36 time="2023-12-07T14:59:33.352850978Z" level=info msg="acquiring service lock" 
11:59:36 time="2023-12-07T14:59:33.357886518Z" level=info msg="starting up: will scan every 1ms and on SIGUSR1" 
11:59:36 time="2023-12-07T14:59:33.357987218Z" level=info msg="acquiring active lock" 
...
11:59:36 balance_run_test.go:667:
11:59:36     c.Check(metrics, check.Matches, `(?ms).*\narvados_keepbalance_changeset_compute_seconds_count `+fmt.Sprintf("%d", pullReqs.Count()/4)+`\n.*`)
11:59:36 ... value string = "" +
11:59:36 ...     "# HELP arvados_keep_dedup_block_ratio deduplication ratio, blocks referenced / blocks stored\n" +
11:59:36 ...     "# TYPE arvados_keep_dedup_block_ratio gauge\n" +
...
11:59:36 ...     "arvados_keepbalance_sweep_seconds_sum 0.022947318\n" +
11:59:36 ...     "arvados_keepbalance_sweep_seconds_count 5\n" 
11:59:36 ... regex string = "(?ms).*\\narvados_keepbalance_changeset_compute_seconds_count 4\\n.*" 
...
11:59:36 OOPS: 37 passed, 1 FAILED
11:59:36 --- FAIL: Test (4.01s)
11:59:36 FAIL
11:59:36     git.arvados.org/arvados.git/services/keep-balance    coverage: 89.4% of statements
11:59:36 FAIL    git.arvados.org/arvados.git/services/keep-balance    4.021s
11:59:36 FAIL
11:59:36 ======= services/keep-balance tests -- FAILED
...

(not sure if this is the real error with all the verbosity)

Actions #4

Updated by Brett Smith about 1 year ago

Lucas Di Pentima wrote in #note-3:

Sidenote: I've seen a test keep-balance failure once while running the tests

That's #21254.

Actions #5

Updated by Tom Clegg about 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #6

Updated by Brett Smith 12 months ago

Reopening this because we got a very similar-looking failure on a test run that included the previous fix.

Actions #7

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #8

Updated by Tom Clegg 12 months ago

It looks like the test run in #note-6 completed all four of the containers that needed a type4 instance, and then there was a quota error, so the "guaranteed broken" node (inst3) got shut down (to reduce quota pressure) before it could reach boot timeout and be classified as a boot failure.

Excerpt from developer-run-tests-remainder-4175.log

time="2023-12-20T19:28:33.995770518Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000020 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:33.995891178Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000060 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:33.995919378Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000100 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:33.995943748Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000140 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:33.997648898Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000140 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:34.003513368Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000140 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:34.004813378Z" level=info msg="instance appeared in cloud" Address= IdleBehavior=run Instance="inst3,providertype4" InstanceType=type4 State=booting
time="2023-12-20T19:28:34.018466178Z" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000140 InstanceType="{type4 providertype4 4 4294967296 0 0 0 0.492 false {  0}}" 
time="2023-12-20T19:28:34.078744648Z" level=info msg="crunch-run process ended" Address= ContainerUUID=zzzzz-dz642-000000000000020 Instance="inst6,providertype4" InstanceType=type4
time="2023-12-20T19:28:34.098808389Z" level=info msg="crunch-run process ended" Address= ContainerUUID=zzzzz-dz642-000000000000060 Instance="inst6,providertype4" InstanceType=type4
time="2023-12-20T19:28:34.105088569Z" level=info msg="crunch-run process ended" Address= ContainerUUID=zzzzz-dz642-000000000000100 Instance="inst9,providertype4" InstanceType=type4
time="2023-12-20T19:28:34.124886829Z" level=info msg="crunch-run process ended" Address= ContainerUUID=zzzzz-dz642-000000000000140 Instance="inst6,providertype4" InstanceType=type4
time="2023-12-20T19:28:34.145611669Z" level=info msg="quota error -- waiting for next instance shutdown" InstanceType=type8 atQuotaUntilFewerInstances=11
time="2023-12-20T19:28:34.146530459Z" level=info msg="shutting down" Instance="inst3,providertype4" InstanceType=type4
time="2023-12-20T19:28:34.163362939Z" level=info msg="instance disappeared in cloud" Instance="inst3,providertype4" WorkerState=shutdown
Actions #9

Updated by Tom Clegg 12 months ago

21258-flaky-adc-test @ c5380a9ea994f5727ffe0e45dfe7c46a20673db0 -- developer-run-tests: #3984

retry wb2 developer-run-tests-services-workbench2: #127

See commit message for explanation.

With this fix I ran tests 200x with no failures. This would be even more convincing if I hadn't already run tests 200x with no failures even without the fix. 🤷

Actions #10

Updated by Tom Clegg 12 months ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by Peter Amstutz 8 months ago

  • Release set to 70
Actions

Also available in: Atom PDF