Support #18799
closedStrategy to generate Python SDK docstrings based on API docs
Added by Peter Amstutz almost 3 years ago. Updated over 1 year ago.
Description
write script that
- takes the discovery document
- produces Python stubs with docstrings, type annotations etc corresponding to the google api client
- adds the stub files to the python SDK
- runs pydoc
The goal is for the methods/objects found under arvados.api() (generated on the fly by google api client) to be browsable in pydoc.
Files
GroupsIndexDoc.png (124 KB) GroupsIndexDoc.png | Brett Smith, 01/16/2023 08:39 PM | ||
GroupsIndexReturns.png (213 KB) GroupsIndexReturns.png | Brett Smith, 01/16/2023 08:39 PM | ||
discovery-pydoc-prototype.py (1.71 KB) discovery-pydoc-prototype.py | Brett Smith, 01/16/2023 08:39 PM |
Updated by Peter Amstutz almost 3 years ago
- Related to Support #18263: Plan to document the Python SDK added
Updated by Peter Amstutz almost 3 years ago
- Related to Idea #18800: Update Python SDK documentation added
Updated by Peter Amstutz about 2 years ago
- Target version set to 2022-11-23 sprint
Updated by Brett Smith about 2 years ago
One possible implemention: google-api-python-client already generates docstrings for API methods, based on information in the discovery document. For example:
>>> arvc = arvados.api('v1') >>> print(arv.users().create.__doc__) Create a new User. Args: body: object, The request body. (required) select: array, Attributes of the new object to return in the response. ensure_unique_name: boolean, Adjust name to ensure uniqueness instead of returning an error on (owner_uuid, name) collision. cluster_id: string, Create object on a remote federated cluster instead of the current one. Returns: An object of the form: { # User "uuid": "A String", "etag": "A String", # Object version. "owner_uuid": "A String", "created_at": Unknown type! datetime …
Probably the cheapest implementation is to instantiate an API client as normal, then introspect the generated methods to write the stubs. One major downside of this approach is that the docstring generation seems to be very static. I don't think we could customize it (e.g., to follow our own docstring style) without serious monkeypatching. See every mention of docs
starting from https://github.com/googleapis/google-api-python-client/blob/3bbefc1352bcb2e302f7736643c9363799d5f5df/googleapiclient/discovery.py#L1193
If we want more control over the formatting, we'll probably end up basically rewriting all this ourselves. At which point, yeah, we can just work from the discovery document directly instead of the generated Python objects. (We can still use discovery document deserialization from apiclient.schema
.)
Question: Where should the stubs go? In real code all these methods will be attached to the return value of arvados.api
. Maybe call that result arvados.api.Client
or arvados.api.Resources
, and write the stubs under there?
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-23 sprint to 2022-12-21 Sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Updated by Brett Smith about 2 years ago
- Subject changed from Strategy to tie the Python SDK to the API docs to Strategy to generate Python SDK docstrings based on API docs
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Updated by Peter Amstutz about 2 years ago
- Tracker changed from Idea to Support
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2023-02-01 sprint to Future
Updated by Brett Smith about 2 years ago
- File discovery-pydoc-prototype.py discovery-pydoc-prototype.py added
- File GroupsIndexReturns.png GroupsIndexReturns.png added
- File GroupsIndexDoc.png GroupsIndexDoc.png added
Brett Smith wrote in #note-7:
Probably the cheapest implementation is to instantiate an API client as normal, then introspect the generated methods to write the stubs.
I prototyped this. See the attached script (it's just one page!). Call it like this with an Arvados API configuration in place:
python3 discovery-pydoc-prototype.py >arvados/sdk/python/arvados/api_resources.py
Then generate documentation as normal. The documentation will include this api_resources
stub with information about all the API resources and methods.
The formatting is pretty rough. The docstrings only seem to care about plaintext presentation, so pdoc3 makes relatively big formatting decisions based on small whitespace inconsistencies. See attached for a couple of examples of how it looks.
If we need to do the cheapest thing that could possibly work, this is probably it. But there are definitely noticeable presentation improvements to be found by walking the discovery document ourselves and writing our own docstrings instead of using the ones generated by apiclient
.
Updated by Brett Smith about 2 years ago
Doing it ourselves is a matter of iterating over the method definitions that match:
arv_client._resourceDesc['resources'][resource_name]['methods'][method_name]
For each method, look at description
, parameters
, and response
. For each parameter, look at description
, type
, required
, default
, enum
, and enumDescriptions
. Not every parameter will define every key but those should all be checked. Consider special-casing parameters that have only a single enum
possibility.
Cross-reference response
against arv_client._resourceDesc['schemas'][response_type]
.
Updated by Brett Smith about 2 years ago
- Related to Feature #19929: Improve documentation in the discovery document added
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to Development 2023-04-12 sprint
Updated by Brett Smith almost 2 years ago
Brett Smith wrote in #note-15:
Doing it ourselves is a matter of iterating over the method definitions that match:
Discussed after standup. The goal is to write something along these lines, and at least have a decent prototype by the end of the sprint.
Updated by Brett Smith almost 2 years ago
I have an initial prototype up as 18799-api-pydoc-wip. This generates a skeleton that documents all the API methods, with proper signatures. Improvements that could be made from here:
- Document the schema of returned objects too.
- Skip things we want skipped like the
index
method. The R SDK has a list we use for this purpose.
Updated by Brett Smith almost 2 years ago
Brett Smith wrote in #note-19:
Improvements that could be made from here:
- Document the schema of returned objects too.
- Skip things we want skipped like the
index
method. The R SDK has a list we use for this purpose.
These are done in the branch. I think the big question that's left is, where does this run from?
- We want to include the output in Python packages so people can read the documentation via
pydoc
. (It could also potentially help with type checking with a little more glue, but that glue is out of scope for this ticket.) - We want the output to be included in the Python SDK when we build the
doc/
output.
Given this, I think the tool should live under sdk/python/
, and then it can be hooked into setup.py
and doc/Rakefile
as appropriate. But maybe I'm wrong?
There's also the question of what the Python output should be called. So far I've been calling it api_resources.py
, because that hints it's related to the api
module, and the objects returned by google-api-python-client are Resource
objects. But I'm open to suggestions there too.
Updated by Brett Smith almost 2 years ago
Brett Smith wrote in #note-20:
These are done in the branch. I think the big question that's left is, where does this run from?
Everything I've written about this so far assumes that this needs to be run as part of the build process. But maybe we don't need to go that far. Maybe we can just run this manually whenever we make a change to the discovery document, and commit the result to Git, so the Python output is always in the source tree.
The main pro of this approach is it can save significant immediate engineering effort. Doing this correctly will require spinning up an development API server at the same commit as the Python SDK, then fetching its discovery document to generate the output. While we might be able to reuse some test infrastructure for this, it's non-trivial work, especially to integrate it into an existing process like the Python package build.
The main con is of course this becomes something we forget to do. But honestly we have that problem across a lot of our documentation. Having the generation script at least means it's quick to fix when it does happen.
Updated by Peter Amstutz almost 2 years ago
Brett Smith wrote in #note-21:
Brett Smith wrote in #note-20:
These are done in the branch. I think the big question that's left is, where does this run from?
Everything I've written about this so far assumes that this needs to be run as part of the build process. But maybe we don't need to go that far. Maybe we can just run this manually whenever we make a change to the discovery document, and commit the result to Git, so the Python output is always in the source tree.
The main pro of this approach is it can save significant immediate engineering effort. Doing this correctly will require spinning up an development API server at the same commit as the Python SDK, then fetching its discovery document to generate the output. While we might be able to reuse some test infrastructure for this, it's non-trivial work, especially to integrate it into an existing process like the Python package build.
The main con is of course this becomes something we forget to do. But honestly we have that problem across a lot of our documentation. Having the generation script at least means it's quick to fix when it does happen.
We could have a copy of the discovery document committed for use by the documentation build, along with a test case that checks (at a time when the API server is conveniently available) to see that the committed version matches the live version.
Updated by Brett Smith almost 2 years ago
Peter Amstutz wrote in #note-22:
We could have a copy of the discovery document committed for use by the documentation build, along with a test case that checks (at a time when the API server is conveniently available) to see that the committed version matches the live version.
This sounds good to me, but we can't do the trivial version, because (a) the discovery document includes local URLs and other configuration that are expected to change, and (b) the order of keys in the document doesn't seem to stay consistent. So we need to canonicalize it as part of the saving and comparison process.
I've gone through it and I think this is the list of keys we want, with their usual values. Any concerns about this list or process?
{ "kind": "discovery#restDescription", "discoveryVersion": "v1", "id": "arvados:v1", "name": "arvados", "version": "v1", "revision": "20220510", "title": "Arvados API", "description": "The API to interact with Arvados.", "documentationLink": "http://doc.arvados.org/api/index.html", "protocol": "rest", "basePath": "/arvados/v1/", "servicePath": "arvados/v1/", "batchPatch": "batch", "parameters": {...}, "auth": {...}, "schemas": {...}, "resources": {...} }
Updated by Peter Amstutz almost 2 years ago
I don't know the format of the discovery document off hand so I'll take your word for it.
I agree the easiest way to avoid document churn is to do normalization. Sorting the keys and cleaning up a field fields shouldn't be too complicated.
Updated by Brett Smith almost 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Development 2023-04-12 sprint to Development 2023-04-26 sprint
Updated by Brett Smith almost 2 years ago
18799-api-pydoc @ bce5560c00bc5cb6301da0ddd82f3af8a4b24778 - developer-run-tests: #3604
Tom commented in Matrix that the up-to-date test for the committed discovery document could go in the Python SDK, since it starts up an API server. That's true, but before he said that I already wrote the test as an API server integration test. I think that does have a couple of advantages: it's most likely what we will want if we move the discovery document in the future, and if you're changing the discovery document you're more likely to run the API server tests and encounter this failure sooner, rather than waiting for a full Jenkins run.
If the comment had come before I had written the test I probably would've followed it, but since history went the other way around I figure we might as well keep these advantages.
Updated by Brett Smith almost 2 years ago
Meant to add: This branch writes out the discovery document schemas as TypedDicts. I am not thrilled with the way these are currently rendered in either pydoc or pdoc. However, I still propose to move forward with this version because I think this is skating to where the puck is going: as Python's typing becomes more mature and popular, I think documentation systems are likely to improve their support for TypedDicts, and this also gives us a base to start providing type checking for API clients too. So, we're maybe slightly bleeding edge here, but I think it's a reasonable amount.
Updated by Tom Clegg almost 2 years ago
This looks great. And I agree, having the up-to-date test in railsapi sounds more dev-friendly.
Flipping through the output I see various things that aren't quite right, and in every case it's the discovery doc's fault.
Refer to the API documentation for details about how to retrieve specific keys if you need them
Could we refer to the select
parameter specifically here? And perhaps say "attributes" to match the term used in the select
parameter and the generic API docs?
I like that it's easy to navigate from -> ArvadosAPIClient -> Collections -> Collections.list() -> CollectionList, and then back and forth between CollectionList and Collections.list(). You can almost do the same thing with Collections.get() -> Collection, except there's no link back from Collection to Collections.get(). Perhaps worth adding a phrase like "returned by API calls that work on a single item, like Collection.get()"?
(Collection.get() does accept a select
param, which is almost a convenient way to insert this link... except currently the discovery doc doesn't advertise it.)
The way the classes are sorted (there are several classes between ApiClient and ApiClients) and the fact that the left frame doesn't auto-scroll to keep up with the right frame, seem to make the navigation links a little more valuable.
I don't see any issues with the build glue. Everything worked for me on first attempt.
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-30:
Flipping through the output I see various things that aren't quite right, and in every case it's the discovery doc's fault.
Yeah, you may have seen we have #19929 for that and I'm very interesting in tackling that too. The ordering could've gone either way, but Peter preferred to do this first since it had more unknowns and we wanted to make sure it was actually a workable idea.
Could we refer to the
select
parameter specifically here?
Sure, done.
And perhaps say "attributes" to match the term used in the
select
parameter and the generic API docs?
This is trickier. I like consistency, but "attribute" means something specific in Python, it's a thing you access with .
instead of indexing []
. And because of the way the documentation systems are currently rendering TypedDicts, they already make it look like the data is stored in attributes rather than a dictionary. So on balance I feel like it's more important for all the text we write to use the Python language of dictionaries, keys, and values to try to help steer the target audience (Python developers) in the right direction. Still open to discussing it further but that's how I see the priorities for this documentation.
I like that it's easy to navigate from -> ArvadosAPIClient -> Collections -> Collections.list() -> CollectionList, and then back and forth between CollectionList and Collections.list(). You can almost do the same thing with Collections.get() -> Collection, except there's no link back from Collection to Collections.get(). Perhaps worth adding a phrase like "returned by API calls that work on a single item, like Collection.get()"?
I like the bidirectional link, but calling out the get
method feels too-specific to me, since basically every Collections
method returns a Collection
(and even for the exceptions that return CollectionList
, most users will be most interested in the items
that's just a list[Collection]
). Made it so that the Collection
docstring just links to the whole Collections
resource instead.
(Collection.get() does accept a
select
param, which is almost a convenient way to insert this link... except currently the discovery doc doesn't advertise it.)
THAT EXPLAINS IT! I've tried to do this before but it never worked, and now I know why, it's because the Python SDK doesn't know it's there and doesn't generate the code for it. I'll add a note to the discovery document ticket about this.
Now 18799-api-pydoc @ 499240fcfa25c10bc22277b7d382dcbc31436cbe - developer-run-tests: #3606 (I am not waiting for this to finish because if changing a little docstring text causes the tests to fail for a legitimate reason I will eat my hat)
Updated by Tom Clegg almost 2 years ago
Brett Smith wrote in #note-31:
This is trickier. I like consistency, but "attribute" means something specific in Python
Yeah, I thought you might say that. I can live with "keys".
I sort of want us to say this explicitly ("don't get confused just because you see the word attribute in the parameter descriptions") ... otoh, we're leading with "This is the dictionary object" and people have probably seen several examples by the time they get here, so I expect people will get it just fine.
LGTM, thanks.
Updated by Brett Smith almost 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|7ea5b1b2b78ceaa326d8f21ce1d08df0be1d9cda.