Bug #21521
closedUploading deb packages fails intermittently
Description
Because Jenkins hates me personally, occasionally build-packages-debian12 fails on the upload step:
======= Start upload packages /usr/local/arvados-dev/jenkins/run_upload_packages.py --repo dev -H jenkinsapt@apt.arvados.org -o Port=2222 --workspace /tmp/workspace/build-packages-debian12 debian12 Unable to open database, sleeping 8.753268149s, attempts left 10... Unable to open database, sleeping 9.298699287s, attempts left 9... Unable to open database, sleeping 8.465308774s, attempts left 8... Unable to open database, sleeping 9.709491264s, attempts left 7... Unable to open database, sleeping 8.100403568s, attempts left 6... Unable to open database, sleeping 11.044444422s, attempts left 5... Unable to open database, sleeping 11.153889278s, attempts left 4... Unable to open database, sleeping 8.410966832s, attempts left 3... Unable to open database, sleeping 10.525453171s, attempts left 2... Unable to open database, sleeping 11.82935613s, attempts left 1... ERROR: unable to reopen the DB, maximum number of retries reached Traceback (most recent call last): File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 362, in <module> main(sys.argv[1:]) File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 358, in main build_suite_and_upload(target, last_upload_ts, args) File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 348, in build_suite_and_upload suite.update_packages(since_timestamp) File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 115, in update_packages self.post_uploads(upload_paths) File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 237, in post_uploads self._run_script(self.APT_SCRIPT, self.REMOTE_DEST_DIR + '/' + self.target, File "/usr/local/arvados-dev/jenkins/run_upload_packages.py", line 193, in _run_script subprocess.check_call(self._build_cmd( File "/usr/lib/python3.9/subprocess.py", line 373, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['ssh', '-oPort=2222', '-q', 'jenkinsapt@apt.arvados.org', 'bash', '-ec', '\'\ncd "$1"; shift\nDISTNAME=$1; shift\nfor package in "$@"; do\n set +e\n aptly repo search "$DISTNAME" "${package%.deb}" >/dev/null 2>&1\n RET=$?\n set -e\n if [[ $RET -eq 0 ]]; then\n echo "Not adding $package, it is already present in repo $DISTNAME"\n rm "$package"\n else\n aptly repo add -remove-files "$DISTNAME" "$package"\n fi\ndone\naptly publish update "$DISTNAME" filesystem:"${DISTNAME%-*}":\n\'', 'DebianPackageSuite', 'tmp/debian12', 'bookworm-dev', "'arvados-sync-users_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-slurm_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-crunchstat-summary_2.8.0~dev20240213172635-1_amd64.deb'", "'keep-web_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-sync-groups_2.8.0~dev20240214163440-1_arm64.deb'", "'keepproxy_2.8.0~dev20240214163440-1_arm64.deb'", "'crunch-run_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-exercise_2.8.0~dev20240214163440-1_arm64.deb'", "'python3-arvados-user-activity_2.8.0~dev20240213172635-1_amd64.deb'", "'crunch-dispatch-local_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-ws_2.8.0~dev20240214163440-1_arm64.deb'", "'keepstore_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-sync-users_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-git-httpd_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-local_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-dispatch-lsf_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-controller_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-client_2.8.0~dev20240214163440-1_amd64.deb'", "'libpam-arvados-go_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-block-check_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-client_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-ws_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-server_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-health_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-src_2.8.0~dev20240214163440-1_all.deb'", "'arvados-docker-cleaner_2.8.0~dev20240207214436-1_amd64.deb'", "'keep-balance_2.8.0~dev20240214163440-1_arm64.deb'", "'keepstore_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-rsync_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-sync-groups_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-dispatch-lsf_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-block-check_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-dispatch-cloud_2.8.0~dev20240214163440-1_arm64.deb'", "'crunch-run_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-git-httpd_2.8.0~dev20240214163440-1_arm64.deb'", "'keep-balance_2.8.0~dev20240214163440-1_amd64.deb'", "'arvados-health_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-rsync_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-fuse_2.8.0~dev20240213172635-1_amd64.deb'", "'arvados-workbench2_2.8.0~dev20240214163440-1_amd64.deb'", "'keep-exercise_2.8.0~dev20240214163440-1_amd64.deb'", 'python3-cwltest_2.3.20230108193615-1_amd64.deb', "'arvados-dispatch-cloud_2.8.0~dev20240214163440-1_amd64.deb'", "'keepproxy_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-python-client_2.8.0~dev20240213172635-1_amd64.deb'", "'arvados-api-server_2.8.0~dev20240214163440-1_amd64.deb'", "'crunch-dispatch-slurm_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-controller_2.8.0~dev20240214163440-1_arm64.deb'", "'arvados-server_2.8.0~dev20240214163440-1_amd64.deb'", "'python3-arvados-cwl-runner_2.8.0~dev20240213172635-1_amd64.deb'", "'keep-web_2.8.0~dev20240214163440-1_arm64.deb'"]' returned non-zero exit status 1. ======= upload packages -- FAILED ======= End of upload packages (200s)
Those "Unable to open database" errors are from aptly. I don't understand what seems to be special about debian12 that it seems more prone to these database timeouts, but it definitely happens too often to be just bad luck.
The silver lining here is at least all this is infrastructure under our control so we can presumably fix it once we figure out what causes the issue.
Files
Updated by Brett Smith 9 months ago
Updated by Peter Amstutz 9 months ago
FWIW I think the "unable to open database" error is a locking issue, but what else would be trying to update aptly at the same time?
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-2:
FWIW I think the "unable to open database" error is a locking issue, but what else would be trying to update aptly at the same time?
Yeah, that's my understanding too and I have the same question open in my head.
Updated by Peter Amstutz 9 months ago
- Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
Updated by Brett Smith 7 months ago
- Status changed from New to In Progress
Earlier this morning this happened to ubuntu2004, which made me think it's actually not distro-specific. Take a look at the build logs: build-packages-multijob: #4179
build-packages-ubuntu2004 tried and failed to get the lock between 07:53:51 and 07:55:28. This hits the tail end of build-packages-debian11, which started cleaning up at 07:53:49, and then bleeds into the upload of build-packages-debian12, which ran from 07:53:57 to 07:55:40.
In other words, our uploads take long enough (about 120 seconds), and the attempt to get the lock is impatient enough (about 100 seconds), that if one upload tries to start shortly after another, it's very likely to fail. The intermittentness is mostly coming just from Jenkins jitter.
I'll see if there are aptly parameters to adjust the lock wait. If not I'll just wrap it in flock.
Updated by Brett Smith 7 months ago
Brett Smith wrote in #note-9:
I'll see if there are aptly parameters to adjust the lock wait. If not I'll just wrap it in flock.
No parameters, these durations are hardcoded. flock it is.
Updated by Brett Smith 7 months ago
- Subject changed from Uploading debian12 packages fails intermittently to Uploading deb packages fails intermittently
Updated by Brett Smith 7 months ago
arvados-dev branch 21521-aptly-flock @ e9615c3c0d8a4820f2dcd967a02e70a046a77182
- All agreed upon points are implemented / addressed.
- Yes
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- I haven't tested the code beyond some local scripts to convince myself I have the basic idea right. The change is too small to sweat too much. IMO we should just deploy it, and if it turns out to be completely unworkable, it's not a huge loss to revert it.
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- Probably will add some overhead to uploads, but a relatively small amount compared to the rest of the process generally.
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A, we have no style guide for shell
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
Updated by Brett Smith 6 months ago
- Status changed from In Progress to Resolved
Last failure was on April 26 so calling this a success.