Bug #22660
closeda-c-r can leak credentials if present in the git repo URL
Description
Reported by user: some git services use HTTP basic auth with the username and password or API token embedded in the repo URL. The arvados-cwl-runner feature that records git metadata will leak this by accident. If the URL starts with http or https it should filter out the username/password portion of the URL.
Updated by Peter Amstutz 3 days ago
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Sanitizes the git remote URL so it doesn't contain a username or password.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- Added a unit test for the sanitize_url function. Testing the
get_git_info
method is a bit more annoying since it run git in a subprocess and would require more mocking.
- Added a unit test for the sanitize_url function. Testing the
- New or changed UX/UX and has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- no change in scale
- Considered backwards and forwards compatibility issues between client and server.
- no compatibility issues, the arv:gitOrigin field is purely informational
- Follows our coding standards and GUI style guidelines.
- yes (I even went and cleaned up util.py a little bit to fit PEP 8 better)
It was privately reported by a user a-c-r was leaking github tokens by copying the remote-url directly into the workflow metadata, as one way of providing a github token is write it as:
https://x-access-token:blahblahblah@github.com/foo/bar.git
This branch adds a sanitize step to remove the username/password section of the netloc part of the URL, if present.
Updated by Brett Smith 2 days ago
Peter Amstutz wrote in #note-4:
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
I just don't understand why we care about the URL scheme. Can you posit a scenario where logging the credentials is ever the right thing to do? Is there any reason the function couldn't just be this?
def sanitize_url(url):
parts = urllib.parse.urlparse(url)
if parts.port is None:
netloc = parts.hostname
else:
netloc = f'{parts.hostname}:{parts.port}'
return urllib.parse.urlunparse(parts._replace(netloc=netloc))
Thanks.
Updated by Peter Amstutz 2 days ago
Brett Smith wrote in #note-7:
Peter Amstutz wrote in #note-4:
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
I just don't understand why we care about the URL scheme. Can you posit a scenario where logging the credentials is ever the right thing to do? Is there any reason the function couldn't just be this? [...]
Mainly I wasn't sure how urlparse
would behave with SSH connection strings. The answer seems to be that they are not valid URLs but it doesn't throw an error either.
>>> urllib.parse.urlparse("git@github.com:arvados/arvados.git") ParseResult(scheme='', netloc='', path='git@github.com:arvados/arvados.git', params='', query='', fragment='')
That's fine because I'm pretty sure you can't hide credentials in a SSH connection string. I also added a test case to make sure ssh connection strings don't get mangled.
22660-acr-cred-leak @ 0bf6de85d8751c5ccc0a26b855bc588f5d68e258
Updated by Brett Smith 2 days ago
Peter Amstutz wrote in #note-8:
22660-acr-cred-leak @ 0bf6de85d8751c5ccc0a26b855bc588f5d68e258
LGTM, thanks.