Support #22370 (open)
Rework how development and testing/release package versions are determined in the source tree
Description
Per Dec 4, 2024 sprint retrospective discussion:
The way version numbers are assigned for the SDK packages is awkward. In the case of Python, it involves running code that does a couple of different things based on whether it is being used to generate development packages (the version is derived from the git commit and git tags) or release packages (the version is specified by an environment variable).
The code for determining the version is a shell script with a little bit of Perl. It isn't the nicest.
The things we don't like about this scheme
- The arvados_version.py script itself is over 140 lines of code to do something that ought to be simple, and it still ends up calling out to version-at-commit.sh.
- Correctly assigning versions across module dependencies within Arvados is a little complicated and involves having the whole module dependency graph embedded in arvados_version.py.
- The arvados_version.py script is copied into every Python module. It can't be symlinked because Python packaging tries very hard to avoid including symlinks in the package.
- The logic in arvados_version.py also needs to exist for Ruby, but until the other day, it didn't. Now we have even more logic duplication.
- In general, it doesn't feel great to have to run code at package build/install time to determine the version, and to have that depend on the build/install environment.
What this scheme does well
- Taking the version number from the git timestamp makes it fairly easy to work backwards from a package version to a specific commit it was built from.
- We've done many development iterations and full releases with this system, so we understand the benefits/drawbacks of the approach.
Alternative approaches
One suggestion was to simply rip it all out and manage package versions by hand, relying on the review process to make sure versions get updated.
However, I have to categorically rule out any completely manual process. It's impractical for tasks like making a release candidate to involve updating 12-15 files by hand.
I also dislike the question of "when should the version number be updated" becoming a judgement call instead of a purely mechanical one, because this makes it that much harder to work backwards from "package version Y was built from commit X".
Some automation is necessary.
It would be nice if an appropriate version number was committed to git automatically so there is a static version string in the package file instead of calling out to code. I am envisioning a single script that could automatically determine all the package versions and update the package files for you.
However, the challenge with committing version numbers to git is that it generates another git commit. So a utility that checks whether a version number needs to be updated by looking for commits to a particular module subdirectory since the last version-number update would naively loop forever if it doesn't deal with that: the commit that updates the version number is itself a new commit to the subdirectory, which would appear to call for yet another update.
It also means the scheme that assigns a version number based on the git commit timestamp is forced to take the timestamp from commit N-1 (the last code change) instead of commit N (the one where the version number is actually updated).
If, instead of a timestamp, we use a sequence number, then we know what the next number will be, so it sort of avoids that problem. On the other hand, a nice characteristic of timestamps is that they actually mean something: if you're looking at a log, it's easier to figure out when a code change happened.
However, in general I believe this issue of "committing a module version to git changes git" is the reason the current system determines version numbers at build time.
(Just to state the obvious, a large part of the reason to assign version numbers at all is to be able to work backwards from a generated artifact to the code that was used to generate it).
So at this point I don't have a design that is clearly better than what we're currently doing, but we should continue the discussion.
Updated by Peter Amstutz 16 days ago
- Target version changed from Development 2025-01-08 to Future
Updated by Brett Smith 5 days ago
Proposed New Git Version Generation Scheme
Current Setup
Every package that wants to be able to have a Git-based version number has code like the following:
- Call git log -n1 with the right arguments to get the hash of the commit we want to use for versioning.
- Call version-at-commit.sh with that hash to turn it into a full version number string.
- Make the version number conform to language conventions.
- Save the version number in the appropriate place(s) for the build.
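For orientation, the per-package pattern amounts to roughly the sketch below. This is a simplification, not the real arvados_version.py (which is much longer and also handles release builds, running outside of Git, and other edge cases); the exact paths and git arguments here are illustrative assumptions.

import os
import subprocess

# Simplified sketch of the logic currently duplicated in each package
# (illustrative only; paths and git arguments are assumptions).
SETUP_DIR = os.path.dirname(os.path.abspath(__file__))
VERSION_SCRIPT = os.path.join(SETUP_DIR, '..', 'build', 'version-at-commit.sh')

def choose_version_commit():
    # Step 1: ask Git for the newest commit that touched this package
    # (or the versioning script itself).
    return subprocess.run(
        ['git', 'log', '-n1', '--first-parent', '--format=%H',
         '--', VERSION_SCRIPT, SETUP_DIR],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def get_version():
    # Step 2: turn that commit into a full version string via version-at-commit.sh.
    commit = choose_version_commit()
    version = subprocess.run(
        [VERSION_SCRIPT, commit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # Steps 3-4 (conforming to language conventions and saving the result)
    # are likewise duplicated per package and omitted here.
    return version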
Challenges
- This code gets duplicated a lot: seven copies of arvados_version.py, three gemspec files, and probably more I don't know about.
- We have the usual code duplication problem of logic not being kept consistent. The initial 3.0.0 release of gems was buggy because the gemspecs didn't have this logic right.
- Because this code generally has to be prepared to run outside of Git (e.g., from PyPI source), it's basically impossible to fail correctly: the code generally has to assume that if there's a problem generating a version from Git, then it just shouldn't, and use an existing static version instead. This makes it more difficult to notice and debug problems.
Proposal
A successor tool version-at-commit (probably written in a more ergonomic language than shell) takes over the entire process. It has the following responsibilities:
- It knows the source location of every package that wants to use a Git version.
- It knows each package's dependencies so it can find the right commit to use for versioning.
- It knows the language version conventions of each package.
- It knows how to write a static file inside that package's source tree to define string constants with all the necessary version information.
As an illustrative example, let's say we want to generate a new version number for services/fuse. The new versioning tool gets the Git commit hash by calling git log -n1 --first-parent -- build/version-at-commit.sh sdk/python services/fuse (note it knows about the sdk/python dependency), and then writes services/fuse/arvados_fuse/_version.py with the following:
__version__ = '3.1.0.dev20241209170209'
commit = '05e1189fb680150c7737fe957b43b314e48daeb1'
timestamp = '2024-12-09T17:02:09Z'
interdependency = '~=3.1.0.dev0'
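For illustration, the core of that step might look something like the sketch below. This is not the actual tool: the helper names, the version-derivation rule (next release version plus a .dev timestamp suffix, matching the example above), and the dependency list are assumptions.

import datetime
import subprocess

def git_commit_info(*paths):
    # Hash and committer date of the newest commit touching the package
    # directory, its Arvados dependencies, or the versioning tool itself.
    out = subprocess.run(
        ['git', 'log', '-n1', '--first-parent', '--format=%H %ct', '--', *paths],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    commit = out[0]
    timestamp = datetime.datetime.fromtimestamp(int(out[1]), datetime.timezone.utc)
    return commit, timestamp

def write_python_version_file(path, next_release, commit, timestamp):
    # Write static string constants from scratch; never edit an existing file.
    version = f"{next_release}.dev{timestamp.strftime('%Y%m%d%H%M%S')}"
    interdependency = f"~={next_release}.dev0"
    with open(path, 'w') as f:
        f.write(f"__version__ = {version!r}\n")
        f.write(f"commit = {commit!r}\n")
        f.write(f"timestamp = {timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')!r}\n")
        f.write(f"interdependency = {interdependency!r}\n")

# Hypothetical usage for services/fuse:
commit, ts = git_commit_info('build/version-at-commit.sh', 'sdk/python', 'services/fuse')
write_python_version_file('services/fuse/arvados_fuse/_version.py', '3.1.0', commit, ts)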
Now services/fuse/setup.py just has to say:
import runpy
from setuptools import setup

# Read the static constants written by the versioning tool.
version_info = runpy.run_path('arvados_fuse/_version.py')

setup(name='arvados_fuse',
      version=version_info['__version__'],
      install_requires=[
          f"arvados{version_info['interdependency']}",
          ...
      ],
      ...,
      )
Details
version-at-commit still respects ARVADOS_BUILDING_VERSION. If that's set, it uses that directly for the version number, and generates an interdependency string to match (e.g., '==3.0.0').
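In other words, the top-level decision might reduce to something like the following sketch (using the hypothetical helper output from above; only ARVADOS_BUILDING_VERSION and the version/interdependency formats come from the proposal itself).

import os

def resolve_version(next_release, git_dev_version):
    # Release builds: take the version verbatim from the environment and pin
    # interdependencies to exactly that release.
    explicit = os.environ.get('ARVADOS_BUILDING_VERSION')
    if explicit:
        return explicit, f'=={explicit}'
    # Development builds: use the Git-derived version and a compatible-release
    # range such as '~=3.1.0.dev0'.
    return git_dev_version, f'~={next_release}.dev0'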
Exactly where the tool writes version information, and how the package's build tool reads it, can vary by language. The goal should be that this tool just writes static strings in the source language's syntax, and that's simple enough that the build system can source-level include it, or whatever's easy. The tool should not try to edit existing files, just write new ones from scratch. These files should be gitignored (just like they are now) and do not need to be committed.
The introduction of this tool means any time you want to do any kind of build from Git, you must first run this tool before any other build steps. However, this doesn't really add any overhead to our standard processes. Instead, it just means that our build orchestration scripts like run-tests.sh, run-build-packages.sh, etc. need to call this tool early on.
Benefits
It solves all of the Challenges listed above.
The package-level logic becomes much simpler and unconditional: it is the same whether it's being built from Git, a source package, etc. This greatly reduces the chances that bugs arise in the less common code paths (e.g., doing a release build instead of a development build).
The change is transparent to most users most of the time. The only time you need to think about it is if you're developing other build tooling (e.g., editing run-tests.sh, adding a new build script, etc.). It creates no extra work when developers write branches or prepare a release. It does not require dedicated Git commits.
It reduces the implicit dependencies of source packages. Right now we have a handful of test Docker images that have to have Git installed just so the source code can generate a Git version. Now instead the version files can be generated before the Docker build/start, and then the Docker image only needs the language's standard build tooling to read the static files.
The approach doesn't lock us in to any particular build system. For example, we can switch our Python packages to pyproject.toml in the near future; version-at-commit is not deeply intertwined with the mechanics of old setuptools. Build changes like this might require us to write version information in a different file or syntax, but it's loose coupling, not deep nesting.