Distinguish PGP-originated and other files on public profiles
There is no way for the viewer of a public profile page to tell the difference between a user-uploaded file and a PGP-uploaded file.
Add a way to tell them apart, either as a column in the table or by splitting the profile into two sections.
- Provide more information in the JSON version of the public genetic data page.
- Bring the data_source field in the JSON version of the public profiles
and the public genetic data page in line with the HTML version.
- Provide more information in the JSON version of the public profiles.
#3 Updated by Phil Hodgson about 5 years ago
- % Done changed from 0 to 80
I think I may have determined the difference, and I'm going to use it as a guess: the list of "user files and datasets" shows two basic types of files: ones from the UserFile model and ones from the Dataset model (in the case of the Public Profile, also using the "published" scope). I'm not sure whether the Dataset model implies that its source is the PGP, but it seems a reasonable guess. I'll make a commit which exposes the model class, and maybe that way it will be possible to distinguish them. I'm also going to throw in the JSON rendering of the Public Profile information, because it's so easy for us to do and so helpful for what is needed (i.e. rather than scraping the public profile HTML!).
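A minimal sketch of the idea above, written in Python rather than in the application's own code. The two classes are hypothetical stand-ins for the UserFile and Dataset models, and the field names (name, download_url, model_class) and the "participant" label are assumptions made for illustration only. It shows how exposing the model class lets a reader of the JSON derive a data_source value.

```python
import json
from dataclasses import dataclass


# Hypothetical stand-ins for the two models mentioned above; the field
# names are assumptions, not the application's real schema.
@dataclass
class UserFile:
    name: str
    download_url: str


@dataclass
class Dataset:
    name: str
    download_url: str
    published: bool = True


def file_record(item):
    """Describe one listed file, exposing the model class it came from."""
    return {
        "name": item.name,
        "download_url": item.download_url,
        "model_class": type(item).__name__,
        # Assumption from the discussion: Dataset rows originate from the PGP.
        "data_source": "PGP" if isinstance(item, Dataset) else "participant",
    }


def public_profile_json(user_files, datasets):
    """Render the file list as JSON, keeping only published datasets."""
    items = list(user_files) + [d for d in datasets if d.published]
    return json.dumps({"files": [file_record(i) for i in items]}, indent=2)
```

Calling `public_profile_json(user_files, datasets)` with a mix of both kinds returns one JSON document in which every entry carries both the raw model class and the derived source, so a consumer can filter on either.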
#4 Updated by Ward Vandewege about 5 years ago
That distinction is correct. I also have another way almost ready to download each set of data, by using workbench. I've created projects for both kinds of data, and am wrapping up some information for Madeleine to get at the data that way. Anyway, what you're doing is fine too; it will be good to indicate on the download page which files originate from Harvard PGP.
#5 Updated by Phil Hodgson about 5 years ago
Okay, that's good.
Meanwhile, for the sake of documenting the process, here's Madeleine's email. I'll start with this, and leave some of the other bits out for another iteration (like CCRs and Samples).
Yes, a REST API representing public profile content would be great! I prefer JSON to XML (pretty sure this is also easy for you to do). That JSON should include distinguishing the source of a file.
We should still distinguish sources on the HTML version of the page too, though; I think it's just user-unfriendly not to be clear about that...
Lots of public profile data could (should?) be represented in a JSON version. My current focus is on getting:
* list of links to genome data files generated by the PGP (i.e. PGP is source), plus some descriptor of file type, e.g. "Complete Genomics var file", "Complete Genomics masterVarBeta file" (you might not have data types recorded in this much detail, so it's ok if I have to continue inferring data type)
* JSON-format survey data, including for each: name of survey, timestamp, and a list of question and response values
My current code infers all this from the HTML content. But scraping is fragile, so JSON would be great to get. It's okay if I still have to do some inference (e.g. infer file types based on the file title); just avoiding parsing this from HTML would be an important improvement.
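For illustration, a consumer of such a JSON profile might look roughly like the Python sketch below. The JSON shape (a top-level "files" list whose entries have name, download_url and data_source keys) and the "PGP" value are assumptions, not the actual output format; only the data_source field name comes from this ticket. The title-based type inference is deliberately crude and stands in for whatever inference the client already does.

```python
import json

# Hypothetical title keywords mapped to file types. Order matters because
# "mastervarbeta" also contains "var", so the longer keyword is tried first.
TYPE_HINTS = [
    ("mastervarbeta", "Complete Genomics masterVarBeta file"),
    ("var", "Complete Genomics var file"),
]


def infer_file_type(title):
    """Guess a file type from its title; return None when nothing matches."""
    lowered = title.lower()
    for needle, label in TYPE_HINTS:
        if needle in lowered:
            return label
    return None


def pgp_genome_files(profile_json):
    """Return URL and inferred type for every file whose source is the PGP."""
    profile = json.loads(profile_json)
    results = []
    for entry in profile.get("files", []):
        # Assumed field and value; skip anything not marked as PGP-sourced.
        if entry.get("data_source") != "PGP":
            continue
        results.append({
            "url": entry.get("download_url"),
            "file_type": infer_file_type(entry.get("name", "")),
        })
    return results
```

Because the source is read from a field rather than inferred from page layout, a change to the HTML would not break this client; only the type guess still depends on file titles.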