Support #22370
Updated by Peter Amstutz 17 days ago
Per Dec 4, 2024 sprint retrospective discussion:
The way version numbers are assigned for the SDK packages is awkward. In the case of Python, it involves running code that does one of two things depending on whether it is generating development packages (the version is derived from the git commit and git tags) or release packages (the version is specified by an environment variable).
The code for determining the version is a shell script with a little bit of Perl. It isn't the nicest.
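As a rough illustration of the two modes described above, the selection logic boils down to something like the following Python sketch. The environment variable name @PACKAGE_VERSION@, the function names, and the @.dev<timestamp>@ format are assumptions made for this example, not a description of what @arvados_version.py@ actually does.

```python
from datetime import datetime, timezone


def dev_version(last_tag, commit_epoch):
    """Build a development version from the nearest tag and the commit time.

    The '<tag>.dev<YYYYMMDDHHMMSS>' layout is an assumption for illustration.
    """
    stamp = datetime.fromtimestamp(commit_epoch, tz=timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{last_tag}.dev{stamp}"


def package_version(env, last_tag, commit_epoch, env_var="PACKAGE_VERSION"):
    """Pick the release version from the environment, else derive a dev version.

    env_var is a hypothetical name; env is any mapping such as os.environ.
    """
    release = env.get(env_var)
    if release:
        return release  # release build: the version is dictated from outside
    return dev_version(last_tag, commit_epoch)  # development build: derived from git data
```

In a real build, @last_tag@ and @commit_epoch@ would come from git, which is exactly the build-time dependency on the environment that this ticket is about.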
h2. The things we don't like about this scheme
* The @arvados_version.py@ script itself is over 140 lines of code to do something that ought to be simple, and it _still_ ends up calling out to @version-at-commit.sh@
* Correctly assigning versions across module dependencies within Arvados is a little complicated and involves having the whole module dependency graph embedded in @arvados_version.py@
* The @arvados_version.py@ script is copied in every Python module. It can't be symlinked because Python packaging tries very hard to avoid including symlinks in the package.
* The logic in @arvados_version.py@ also needs to exist for Ruby, but until the other day, it didn't. Now we have even more logic duplication.
* In general, it doesn't feel great to have to run code at package build/install time to determine the version, and have _that_ depend on the build/install environment.
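To make the dependency-graph point concrete, here is a minimal sketch of why such a graph might be needed. It assumes that a module's version should advance whenever the module itself or anything it depends on changes. The module names, the graph, and the function name are all hypothetical and are not taken from @arvados_version.py@.

```python
# Hypothetical module names and dependency edges, for illustration only.
DEPENDENCIES = {
    "sdk": [],
    "tools": ["sdk"],
    "fuse": ["sdk", "tools"],
}


def effective_timestamp(module, last_change, deps=DEPENDENCIES):
    """Return the newest change time among a module and everything it depends on.

    last_change maps each module name to the time of its most recent commit.
    """
    newest = last_change[module]
    for dep in deps[module]:
        # Recurse so that indirect dependencies are taken into account as well.
        newest = max(newest, effective_timestamp(dep, last_change, deps))
    return newest
```

Under this assumption, every copy of the script needs to know the whole graph, which is part of what makes the duplication painful.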
h2. What this scheme does well
* Taking the version number from the git commit timestamp makes it fairly easy to work backwards from a package version to the specific commit it was built from.
* We've done many development iterations and full releases with this system, so we understand the benefits/drawbacks of the approach.
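Working backwards from a version to a commit can be sketched as follows. The example assumes a development version that ends in @.dev@ followed by a 14-digit UTC timestamp; that format and the function name are assumptions for illustration.

```python
import re
from datetime import datetime, timezone


def commit_time_from_version(version):
    """Extract the UTC commit time from a version like '3.0.0.dev20241204153000'.

    Returns None when the version carries no trailing timestamp.
    """
    match = re.search(r"dev(\d{14})$", version)
    if match is None:
        return None  # e.g. a release version, which has no timestamp in this sketch
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)
```

The recovered time can then be compared against the commit times reported by @git log@ to find the matching commit.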
h2. Alternative approaches
One suggestion was to simply rip it all out and manage package versions by hand, relying on the review process to make sure versions get updated.
However, I have to categorically rule out any completely manual process. It's impractical for tasks like making a release candidate to involve updating 12-15 files by hand.
I also dislike the question of "when should the version number be updated" becoming a judgement call instead of purely mechanical, because this makes it that much harder to work backwards from "package version Y was built from commit X".
Some automation is necessary.
It would be nice if an appropriate version number was committed to git automatically so there is a static version string in the package file instead of calling out to code. I am envisioning a single script that could automatically determine all the package versions and update the package files for you.
However, the challenge with committing version numbers to git is that doing so generates another git commit. A utility that decides whether a version number needs updating by checking for commits to a particular module subdirectory since the last version number update would naively loop forever, because its own version-update commit counts as a new commit to that subdirectory. It has to recognize and ignore those commits.
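One way to avoid the loop is to treat a commit that touches nothing in the module except the version file as a version update rather than a code change. The sketch below models history as a newest-first list of sets of changed paths instead of calling git, so it stays self-contained. The function name, the paths, and the convention that version updates are always separate commits are assumptions.

```python
def needs_version_bump(commits, module_dir, version_file):
    """Decide whether a module has code changes newer than its last version update.

    commits: newest-first list of sets of paths changed by each commit.
    version_file: path of the file inside module_dir that holds the version.
    """
    prefix = module_dir.rstrip("/") + "/"
    for changed in commits:
        in_module = {path for path in changed if path.startswith(prefix)}
        if not in_module:
            continue  # this commit did not touch the module at all
        if in_module == {version_file}:
            return False  # newest relevant commit is itself a version update
        return True  # a code change is newer than the last version update
    return False  # the module has no history, so there is nothing to version
```

After the utility commits a version update, a second run sees that commit first and returns @False@, so it terminates instead of bumping again.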
It also means the scheme that assigns a version number based on the git commit timestamp is forced to take the timestamp from commit N-1 (the last code change) instead of commit N (the one where the version number is actually updated).
If we use a sequence number instead of a timestamp, then we know in advance what the next number will be, which sidesteps that problem. On the other hand, a nice characteristic of timestamps is that they actually mean something: if you're looking at a log, it's easier to figure out when a code change happened.
However, in general I believe this issue of "committing a module version to git changes git" is the reason the current system determines version numbers at build time.
(Just to state the obvious, a large part of the reason to assign version numbers at all is to be able to work backwards from a generated artifact to the code that was used to generate it.)
So at this point I don't have a design that is clearly better than what we're currently doing, but we should continue the discussion.