Bug #22660
closed
a-c-r can leak credentials if present in the git repo URL
Added by Peter Amstutz 3 days ago.
Updated 1 day ago.
Release relationship:
Auto
Description
Reported by user: some git services use HTTP basic auth with the username and password or API token embedded in the repo URL. The arvados-cwl-runner feature that records git metadata will leak this by accident. If the URL starts with http or https it should filter out the username/password portion of the URL.
- Position changed from -944463 to -944451
- Description updated (diff)
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
developer-run-tests: #4694 
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Sanitizes the git remote URL so it doesn't contain a username or password.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- Added a unit test for the sanitize_url function. Testing the
get_git_info
method is a bit more annoying since it run git in a subprocess and would require more mocking.
- New or changed UX/UX and has gotten feedback from stakeholders.
- Documentation has been updated.
- Behaves appropriately at the intended scale (describe intended scale).
- Considered backwards and forwards compatibility issues between client and server.
- no compatibility issues, the arv:gitOrigin field is purely informational
- Follows our coding standards and GUI style guidelines.
- yes (I even went and cleaned up util.py a little bit to fit PEP 8 better)
It was privately reported by a user a-c-r was leaking github tokens by copying the remote-url directly into the workflow metadata, as one way of providing a github token is write it as:
https://x-access-token:blahblahblah@github.com/foo/bar.git
This branch adds a sanitize step to remove the username/password section of the netloc part of the URL, if present.
- Status changed from New to In Progress
Peter Amstutz wrote in #note-4:
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
I just don't understand why we care about the URL scheme. Can you posit a scenario where logging the credentials is ever the right thing to do? Is there any reason the function couldn't just be this?
def sanitize_url(url):
parts = urllib.parse.urlparse(url)
if parts.port is None:
netloc = parts.hostname
else:
netloc = f'{parts.hostname}:{parts.port}'
return urllib.parse.urlunparse(parts._replace(netloc=netloc))
Thanks.
Brett Smith wrote in #note-7:
Peter Amstutz wrote in #note-4:
22660-acr-cred-leak @ cd0639a7128e46709af34c7e56f24d082195dc25
I just don't understand why we care about the URL scheme. Can you posit a scenario where logging the credentials is ever the right thing to do? Is there any reason the function couldn't just be this? [...]
Mainly I wasn't sure how urlparse
would behave with SSH connection strings. The answer seems to be that they are not valid URLs but it doesn't throw an error either.
>>> urllib.parse.urlparse("git@github.com:arvados/arvados.git")
ParseResult(scheme='', netloc='', path='git@github.com:arvados/arvados.git', params='', query='', fragment='')
That's fine because I'm pretty sure you can't hide credentials in a SSH connection string. I also added a test case to make sure ssh connection strings don't get mangled.
22660-acr-cred-leak @ 0bf6de85d8751c5ccc0a26b855bc588f5d68e258
developer-run-tests: #4697 
- Status changed from In Progress to Resolved
Also available in: Atom
PDF