Support #22370
open
Rework how development and testing/release package versions are determined in the source tree
Description
Per Dec 4, 2024 sprint retrospective discussion:
The way version numbers are assigned for the SDK packages is awkward. In the case of Python, it involves running code that does a couple of different things based on whether it is being used to generate development packages (the version is derived from the git commit and git tags) or release packages (the version is specified by an environment variable).
The code for determining the version is a shell script with a little bit of Perl. It isn't the nicest.
The things we don't like about this scheme
- The arvados_version.py script itself is over 140 lines of code to do something that ought to be simple, and it still ends up calling out to version-at-commit.sh.
- Correctly assigning versions across module dependencies within Arvados is a little complicated and involves having the whole module dependency graph embedded in arvados_version.py.
- The arvados_version.py script is copied in every Python module. It can't be symlinked because Python packaging tries very hard to avoid including symlinks in the package.
- The logic in arvados_version.py also needs to exist for Ruby, but until the other day, it didn't. Now we have even more logic duplication.
- In general, it doesn't feel great to have to run code at package build/install time to determine the version, and have that depend on the build/install environment.
What this scheme does well
- Taking the version number from the git timestamp makes it fairly easy to work backwards from a package version to a specific commit it was built from.
- We've done many development iterations and full releases with this system, so we understand the benefits/drawbacks of the approach.
Alternative approaches
One suggestion was to simply rip it all out and manage package versions by hand, relying on the review process to make sure versions get updated.
However, I have to categorically rule out any completely manual process. It's impractical for tasks like making a release candidate to involve updating 12-15 files by hand.
I also dislike the question of "when should the version number be updated" becoming a judgement call instead of purely mechanical, because this makes it that much harder to work backwards from "package version Y was built from commit X".
Some automation is necessary.
It would be nice if an appropriate version number was committed to git automatically so there is a static version string in the package file instead of calling out to code. I am envisioning a single script that could automatically determine all the package versions and update the package files for you.
However, the challenge with committing version numbers to git is that it generates another git commit. So a utility that checks to see if a version number needs to be updated by checking if there are commits to a particular module subdirectory since the last version number update would naively loop forever if it doesn't deal with that.
It also means the scheme that assigns a version number based on the git commit timestamp is forced to take the timestamp from commit N-1 (the last code change) instead of commit N (the one where the version number is actually updated).
If, instead of a timestamp, we use a sequence number, then we know what the next number will be, so it sort of avoids that problem. On the other hand, a nice characteristic of timestamps is that they actually mean something: if you're looking at a log, it's easier to figure out when a code change happened.
However, in general I believe this issue of "committing a module version to git changes git" is the reason the current system determines version numbers at build time.
(Just to state the obvious, a large part of the reason to assign version numbers at all is to be able to work backwards from a generated artifact to the code that was used to generate it).
So at this point I don't have a design that is clearly better than what we're currently doing, but we should continue the discussion.
Updated by Peter Amstutz about 2 months ago
- Target version changed from Development 2025-01-08 to Future
Updated by Brett Smith about 1 month ago
Proposed New Git Version Generation Scheme
Current Setup
Every package that wants to be able to have a Git-based version number has code like the following:
- Call git log -n1 with the right arguments to get the hash of the commit we want to use for versioning.
- Call version-at-commit.sh with that hash to turn it into a full version number string.
- Make the version number conform to language conventions.
- Save the version number in the appropriate place(s) for the build.
Challenges
- This code gets duplicated a lot: seven copies of arvados_version.py, three gemspec files, and probably more I don't know about.
- We have the usual code duplication problem of logic not being kept consistent. The initial 3.0.0 release of gems was buggy because gemfiles didn't have this logic right.
- Because this code generally has to be prepared to run outside of Git (e.g., from PyPI source), it's basically impossible to fail correctly: the code generally has to assume that if there's a problem generating a version from Git, then it just shouldn't, and use an existing static version instead. This makes it more difficult to notice and debug problems.
Proposal
A successor tool version-at-commit (probably written in a more ergonomic language than shell) takes over the entire process. It has the following responsibilities:
- It knows the source location of every package that wants to use a Git version.
- It knows each package's dependencies so it can find the right commit to use for versioning.
- It knows the language version conventions of each package.
- It knows how to write a static file inside that package's source tree to define string constants with all the necessary version information.
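The first two responsibilities can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Package` class, the `PACKAGES` registry, and both function names are hypothetical, and `--format=%H` is added on the assumption that the tool only wants the commit hash back from Git.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """One versioned package in the monorepo (hypothetical structure)."""
    source_dir: str
    language: str
    depends_on: tuple = ()


# Hypothetical registry. Only these two packages and their relationship
# come from the example in this ticket.
PACKAGES = {
    "sdk/python": Package("sdk/python", "python"),
    "services/fuse": Package("services/fuse", "python", depends_on=("sdk/python",)),
}

# Changes to the versioning tool itself also count toward the version.
VERSION_TOOL_PATH = "build/version-at-commit.sh"


def versioning_paths(name, packages=PACKAGES):
    """Return the package's source directory plus those of all transitive
    dependencies, sorted for a stable command line."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(packages[current].depends_on)
    return sorted(packages[pkg].source_dir for pkg in seen)


def git_log_command(name, packages=PACKAGES):
    """Build the argv that finds the newest commit touching the package,
    its dependencies, or the versioning tool."""
    return [
        "git", "log", "-n1", "--first-parent", "--format=%H", "--",
        VERSION_TOOL_PATH, *versioning_paths(name, packages),
    ]
```

Keeping the dependency walk separate from the Git call means the graph logic can be tested without a repository checkout.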
As an illustrative example, let's say we want to generate a new version number for services/fuse. The new versioning tool gets the Git commit hash by calling git log -n1 --first-parent -- build/version-at-commit.sh sdk/python services/fuse (note it knows about the sdk/python dependency), and then writes services/fuse/arvados_fuse/_version.py with the following:
__version__ = '3.1.0.dev20241209170209'
commit = '05e1189fb680150c7737fe957b43b314e48daeb1'
timestamp = '2024-12-09T17:02:09Z'
interdependency = '~=3.1.0.dev0'
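Now services/fuse/setup.py just has to say:
import runpy
from setuptools import setup

# Load the generated constants without importing the arvados_fuse package.
version_info = runpy.run_path('arvados_fuse/_version.py')
setup(name='arvados_fuse',
      version=version_info['__version__'],
      install_requires=[
          # Pin the arvados SDK to the matching version range.
          f"arvados{version_info['interdependency']}",
          ...
      ],
      ...,
)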
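On the tool's side, writing that static file could look something like the sketch below. It assumes the base version (3.1.0 in the example) has already been worked out from Git tags and is passed in, and that the commit time is a timezone-aware datetime. The function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def dev_version(base_version, commit_time):
    """Build a development version string from a commit timestamp in UTC."""
    stamp = commit_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base_version}.dev{stamp}"


def render_version_file(base_version, commit, commit_time):
    """Return the text of a static Python version file.

    The four names match the example in this ticket. Values are written
    with repr() so the file contains plain string literals and no logic.
    """
    utc_time = commit_time.astimezone(timezone.utc)
    fields = {
        "__version__": dev_version(base_version, utc_time),
        "commit": commit,
        "timestamp": utc_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "interdependency": f"~={base_version}.dev0",
    }
    return "".join(f"{name} = {value!r}\n" for name, value in fields.items())


def write_version_file(path, content):
    """Write the file from scratch; the tool never edits existing files."""
    Path(path).write_text(content)
```

Because the output is only string assignments, setup.py can read it with runpy.run_path() without importing the package being built.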
Details
version-at-commit still respects ARVADOS_BUILDING_VERSION. If that's set, it uses that directly for the version number, and generates an interdependency string to match (e.g., '==3.0.0').
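A minimal sketch of that override, assuming the Git-derived values have already been computed elsewhere. The function name is hypothetical, and treating an empty variable the same as an unset one is an assumption rather than something the ticket specifies.

```python
def choose_version(environ, git_version, git_interdependency):
    """Return (version, interdependency) for a build.

    ARVADOS_BUILDING_VERSION wins when it is set to a non-empty value
    (assumption: an empty value means "not set"). Otherwise fall back to
    the values derived from Git.
    """
    release = environ.get("ARVADOS_BUILDING_VERSION")
    if release:
        # Release builds pin sibling packages to exactly the same version.
        return release, f"=={release}"
    return git_version, git_interdependency
```

Passing the environment in as a mapping (for example os.environ) rather than reading it inside the function keeps the release path easy to exercise in tests.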
Exactly where the tool writes version information, and how the package's build tool reads it, can vary by language. The goal should be that this tool just writes static strings in the source language's syntax, and that's simple enough that the build system can source-level include it, or whatever's easy. The tool should not try to edit existing files, just write new ones from scratch. These files should be gitignored (just like they are now) and do not need to be committed.
The introduction of this tool means any time you want to do any kind of build from Git, you must first run this tool before any other build steps. However, this doesn't really add any process overhead to our standard processes. Instead, it just means that our build orchestration scripts like run-tests.sh, run-build-packages.sh, etc. need to call this tool early on.
Benefits
It solves all the Challenges.
The package-level logic becomes much simpler and unconditional: it's the same whether the package is being built from Git, a source package, etc. This greatly reduces the chances that bugs arise in the less common code paths (e.g., doing a release build instead of a development build).
The change is transparent to most users most of the time. The only time to think about it is if you're developing other build tooling (e.g., editing run-tests.sh, adding a new build script, etc.). It makes no extra work when developers write branches or prepare a release. It does not require dedicated Git commits.
It reduces the implicit dependencies of source packages. Right now we have a handful of test Docker images that have to have Git installed just so the source code can generate a Git version. Now instead the version files can be generated before the Docker build/start, and then the Docker image only needs the language's standard build tooling to read the static files.
The approach doesn't lock us in to any particular build system. For example, we can switch our Python packages to pyproject.toml in the near future, and version-at-commit is not deeply intertwined in the mechanics of old setuptools. Build changes like this might require us to write version information in a different file or syntax, but it's loose coupling, not deep nesting.
Updated by Peter Amstutz 8 days ago
This is great. I like it. Let's do it.
I'm sorry that it took a month for me to get around to reviewing it.
Updated by Peter Amstutz 8 days ago
- Target version changed from Future to Development 2025-02-05
Updated by Brett Smith 6 days ago
At the engineering meeting we talked about:
- I would like to write this in a declarative way, where there is a non-code configuration file in a standard format that describes each "package" in the Arvados monorepo, with information about what build system it uses, where to write version information, what other packages it depends on, etc.
- Once we have that, it would be nice if the tool could not just write version information but do source builds, run tests, etc. run-tests.sh has a lot of this information baked into its logic. Moving it to a declarative system would be nice.
- Having the tool be in Python with minimal dependencies seems like the lowest common denominator we can agree upon for bootstrapping.
A couple things shake out of this.
One, I think the configuration format should be TOML. tomllib was added to the Python stdlib as of 3.11, so as of Debian 12 it requires nothing extra. Debian 11 packages python3-toml, which provides a compatible interface, so it's easy to get without setting up a virtualenv or other bootstrapping. If we accept this bit of temporary transitional pain, we get a configuration file format with comments and strong types including arrays. Our other options are:
- INI format parsed by configparser, which is human-friendly and in every Python stdlib, but all the values are string-typed so we have to write more parsing code ourselves.
- JSON parsed by json, which lacks comments and is kind of annoying for humans to write.
- YAML parsed by some non-standard library, which adds a dependency and has several footguns.
Since the TOML format is strictly better than INI, and the downsides compared to INI disappear over time, I think it's the best call overall.
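To make the TOML option concrete, here is one possible shape for the configuration and a loader for it. Every table and key name below is an assumption made for illustration, as is the sdk/python version file path; nothing here is a settled schema. The import fallback relies on the ticket's statement that python3-toml offers a compatible interface, and the sketch only uses loads() on a string.

```python
try:
    import tomllib  # Python 3.11+ standard library
except ModuleNotFoundError:
    import toml as tomllib  # python3-toml fallback for older systems

# Hypothetical layout: one table per package, keyed by source directory.
EXAMPLE_CONFIG = """
[packages."sdk/python"]
build_system = "setuptools"
version_file = "arvados/_version.py"
depends_on = []

[packages."services/fuse"]
build_system = "setuptools"
version_file = "arvados_fuse/_version.py"
# Arrays are real lists after parsing, with no string splitting needed.
depends_on = ["sdk/python"]
"""


def load_packages(text):
    """Parse the configuration and check that every dependency is declared."""
    packages = tomllib.loads(text)["packages"]
    for name, entry in packages.items():
        for dep in entry.get("depends_on", []):
            if dep not in packages:
                raise ValueError(f"{name} depends on unknown package {dep}")
    return packages
```

The typed array for depends_on is the kind of value that would need hand-written parsing under configparser.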
Two, the interface should probably be: command SUBCOMMAND [OPTIONS ...] [TARGET ...]. This gives us room to grow to support other operations in the future. Also, subcommands can be added over a series of tickets. The first subcommand can be write-version to fulfill this ticket.
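That interface maps fairly directly onto argparse subparsers. A minimal sketch, in which the program name "command" is the placeholder used above and no options are defined yet because none have been decided:

```python
import argparse


def build_parser():
    """Build a parser for: command SUBCOMMAND [OPTIONS ...] [TARGET ...]"""
    parser = argparse.ArgumentParser(prog="command")
    # Each future operation gets its own subparser, so subcommands can be
    # added one ticket at a time without touching the others.
    subcommands = parser.add_subparsers(dest="subcommand", required=True)

    write_version = subcommands.add_parser(
        "write-version",
        help="write static version files for the given targets",
    )
    write_version.add_argument(
        "targets",
        nargs="*",
        metavar="TARGET",
        help="package source directories; how an empty list is handled is left open",
    )
    return parser
```

For example, `build_parser().parse_args(["write-version", "services/fuse"])` yields a namespace with `subcommand` set to "write-version" and `targets` set to a one-item list.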
Three, the implementation should probably be in a module directory with a __main__.py so we have room to grow (and write tests!). We can write a command.py that provides a thin wrapper around runpy.run_module() if that's helpful. Either way it still should not require pre-installation to use.
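One way such a wrapper might work is sketched below. The function name and its parameters are hypothetical; the only library behavior relied on is that runpy.run_module() executes a package's __main__ submodule when given a package name and returns the resulting module globals.

```python
import runpy
import sys


def run_tool(module_name, search_dir, argv):
    """Run a module directory's __main__.py without installing it.

    search_dir is the directory that contains the module directory. It is
    put on sys.path only for the duration of the call.
    """
    search_dir = str(search_dir)
    saved_argv = sys.argv
    sys.path.insert(0, search_dir)
    sys.argv = [module_name, *argv]
    try:
        # run_name="__main__" makes the module behave as under `python -m`.
        return runpy.run_module(module_name, run_name="__main__", alter_sys=True)
    finally:
        sys.argv = saved_argv
        sys.path.remove(search_dir)
```

A command.py next to the module directory could then call something like `run_tool("some_module", Path(__file__).parent, sys.argv[1:])`, where "some_module" stands in for whatever the module directory ends up being named.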