Project

General

Profile

Actions

Bug #21521

closed

Uploading deb packages fails intermittently

Added by Brett Smith 10 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Story points:
-

Description

Because Jenkins hates me personally, occasionally build-packages-debian12 fails on the upload step:

======= Start upload packages
/usr/local/arvados-dev/jenkins/run_upload_packages.py --repo dev -H jenkinsapt@apt.arvados.org -o Port=2222 --workspace /tmp/workspace/build-packages-debian12 debian12
Unable to open database, sleeping 8.753268149s, attempts left 10...
Unable to open database, sleeping 9.298699287s, attempts left 9...
Unable to open database, sleeping 8.465308774s, attempts left 8...
Unable to open database, sleeping 9.709491264s, attempts left 7...
Unable to open database, sleeping 8.100403568s, attempts left 6...
Unable to open database, sleeping 11.044444422s, attempts left 5...
Unable to open database, sleeping 11.153889278s, attempts left 4...
Unable to open database, sleeping 8.410966832s, attempts left 3...
Unable to open database, sleeping 10.525453171s, attempts left 2...
Unable to open database, sleeping 11.82935613s, attempts left 1...
ERROR: unable to reopen the DB, maximum number of retries reached
Traceback (most recent call last):
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 362, in <module>
    main(sys.argv[1:])
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 358, in main
    build_suite_and_upload(target, last_upload_ts, args)
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 348, in build_suite_and_upload
    suite.update_packages(since_timestamp)
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 115, in update_packages
    self.post_uploads(upload_paths)
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 237, in post_uploads
    self._run_script(self.APT_SCRIPT, self.REMOTE_DEST_DIR + '/' + self.target,
  File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 193, in _run_script
    subprocess.check_call(self._build_cmd(
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-oPort=2222', '-q', 'jenkinsapt@apt.arvados.org', 'bash', '-ec', '\'\ncd "$1"; shift\nDISTNAME=$1; shift\nfor package in "$@"; do\n  set +e\n  aptly repo search "$DISTNAME" "${package%.deb}" >/dev/null 2>&1\n  RET=$?\n  set -e\n  if [[ $RET -eq 0 ]]; then\n    echo "Not adding $package, it is already present in repo $DISTNAME"\n    rm "$package"\n  else\n    aptly repo add -remove-files "$DISTNAME" "$package"\n  fi\ndone\naptly publish update "$DISTNAME" filesystem:"${DISTNAME%-*}":\n\'', 'DebianPackageSuite', 'tmp/debian12', 'bookworm-dev', "'arvados-sync-users_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-slurm_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-crunchstat-summary_2.8.0~dev20240213172635-1_amd64.deb'", "'keep-web_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-sync-groups_2.8.0~dev20240214163440-1_arm64.deb'", "'keepproxy_2.8.0~dev20240214163440-1_arm64.deb'", "'crunch-run_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-exercise_2.8.0~dev20240214163440-1_arm64.deb'", "'python3-arvados-user-activity_2.8.0~dev20240213172635-1_amd64.deb'", "'crunch-dispatch-local_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-ws_2.8.0~dev20240214163440-1_arm64.deb'", "'keepstore_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-sync-users_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-git-httpd_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-local_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-dispatch-lsf_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-controller_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-client_2.8.0~dev20240214163440-1_amd64.deb'", "'libpam-arvados-go_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-block-check_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-client_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-ws_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-server_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-health_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-src_2.8.0~dev20240214163440-1_all.deb'", "'arvados-docker-cleaner_2.8.0~dev20240207214436-1_amd64.deb'", "'keep-balance_2.8.0~dev20240214163440-1_arm64.deb'", "'keepstore_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-rsync_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-sync-groups_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-dispatch-lsf_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-block-check_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-dispatch-cloud_2.8.0~dev20240214163440-1_arm64.deb'", "'crunch-run_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-git-httpd_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-balance_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-health_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-rsync_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-fuse_2.8.0~dev20240213172635-1_amd64.deb'", "'arvados-workbench2_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-exercise_2.8.0~dev20240214163440-1_amd64.deb'", 'python3-cwltest_2.3.20230108193615-1_amd64.deb', "'arvados-dispatch-cloud_2.8.0~dev20240214163440-1_amd64.deb'", "'keepproxy_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-python-client_2.8.0~dev20240213172635-1_amd64.deb'", "'arvados-api-server_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-slurm_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-controller_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-server_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-cwl-runner_2.8.0~dev20240213172635-1_amd64.deb'", "'keep-web_2.8.0~dev20240214163440-1_arm64.deb'"]' returned non-zero exit status 1.
======= upload packages -- FAILED
======= End of upload packages (200s)

Those "Unable to open database" errors are from aptly. I don't understand what seems to be special about debian12 that it seems more prone to these database timeouts, but it definitely happens too often to be just bad luck.

The silver lining here is at least all this is infrastructure under our control so we can presumably fix it once we figure out what causes the issue.


Files

build-packages-debian12-43.log (1.54 MB) build-packages-debian12-43.log Brett Smith, 02/14/2024 06:34 PM

Subtasks 1 (0 open1 closed)

Task #21564: Review arvados-dev branch 21521-aptly-flockResolvedBrett Smith04/29/2024Actions
Actions #2

Updated by Peter Amstutz 10 months ago

FWIW I think the "unable to open database" error is a locking issue, but what else would be trying to update aptly at the same time?

Actions #3

Updated by Brett Smith 10 months ago

Peter Amstutz wrote in #note-2:

FWIW I think the "unable to open database" error is a locking issue, but what else would be trying to update aptly at the same time?

Yeah, that's my understanding too and I have the same question open in my head.

Actions #4

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #5

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #6

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #7

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #8

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Actions #9

Updated by Brett Smith 8 months ago

  • Status changed from New to In Progress

Earlier this morning this happened to ubuntu2004, which made me think it's actually not distro-specific. Take a look at the build logs: build-packages-multijob: #4179

build-packages-ubuntu2004 tried and failed to get the lock between 07:53:51 and 07:55:28. This hits the tail end of build-packages-debian11, which started cleaning up at 07:53:49, and then bleeds into the upload of build-packages-debian12, which ran from 07:53:57 to 07:55:40.

In other words, our uploads take long enough (about 120 seconds), and the attempt to get the lock is impatient enough (about 100 seconds), that if one upload tries to start shortly after another, it's very likely to fail. The intermittentness is mostly coming just from Jenkins jitter.

I'll see if there are aptly parameters to adjust the lock wait. If not I'll just wrap it in flock.

Actions #10

Updated by Brett Smith 8 months ago

Brett Smith wrote in #note-9:

I'll see if there are aptly parameters to adjust the lock wait. If not I'll just wrap it in flock.

No parameters, these durations are hardcoded. flock it is.

Actions #11

Updated by Brett Smith 8 months ago

  • Subject changed from Uploading debian12 packages fails intermittently to Uploading deb packages fails intermittently
Actions #12

Updated by Brett Smith 8 months ago

arvados-dev branch 21521-aptly-flock @ e9615c3c0d8a4820f2dcd967a02e70a046a77182

  • All agreed upon points are implemented / addressed.
    • Yes
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • N/A
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • I haven't tested the code beyond some local scripts to convince myself I have the basic idea right. The change is too small to sweat too much. IMO we should just deploy it, and if it turns out to be completely unworkable, it's not a huge loss to revert it.
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • Probably will add some overhead to uploads, but a relatively small amount compared to the rest of the process generally.
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
    • N/A, we have no style guide for shell
Actions #13

Updated by Tom Clegg 8 months ago

LGTM, thanks

Actions #14

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Actions #15

Updated by Brett Smith 7 months ago

  • Status changed from In Progress to Resolved

Last failure was on April 26 so calling this a success.

Actions

Also available in: Atom PDF