THIS PAGE HAS MOVED
See https://sites.google.com/a/google.com/sage-google/projects/sra-data
Context
- Sage scientific expertise
- Sage expertise in data-intensive biological analysis and interest in initiating a large-scale sequencing analysis program.
- Google engineering expertise
- Google tools that could enable query and analysis of large-scale (tera- or petabyte) data.
Once we have started, we will think hard about the scientific and engineering challenges. I am confident that if the data are transparently queryable, we can make a "go" decision and commit to pursuing the collaboration. Data access, however, is a major challenge given regulatory hurdles (e.g. dbGaP) and the difficulty of organizing the data. I would therefore like to focus this hour on determining, in practical terms, whether we can interact with SRA data through Google tools, which will allow us to decide on next steps.
Meeting objectives:
- What data can we access?
- How do we handle dbGaP-type access restrictions if we want to do meta-analysis across all of SRA?
- How do we access the data?
- What formats are the data in?
- How are the data organized, including sample-annotation metadata, and how can we interact with them?
- Is it possible to drill down to something akin to a schema description of how the data are organized and begin exploring the database?
** B. Bot - a few notes - I think we should steer clear of the "can we get our hands on the data" questions, which many of the above seem to allude to. We may want to steer the questions (first) more towards what we can do with these data, and then drill down to the details listed above later. I also think it is important to go into this meeting thinking about the long term, as in "what can we do over time as these data continue to evolve and become even richer" as opposed to "what can we do right now". There are plenty of other groups that have access to the SRA data who I'm sure are thinking about the 'now'. I believe the long-term vision is where the true value of this possible collaboration lies. **
Notes on available sequencing data
...
3) The data model (study/experiment/sample/run) is not intuitive and often does not make sense
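For reference while discussing the data model, here is a minimal Python sketch of how the four record types relate, as we understand them: a study groups experiments, each experiment points at one sample, and each experiment holds one or more runs. The class names, fields, and accessions below are made up for illustration; this is not the actual SRA schema or any NCBI API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    accession: str  # hypothetical, e.g. "SRR0000001"


@dataclass
class Sample:
    accession: str  # hypothetical, e.g. "SRS0000001"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:
    accession: str  # hypothetical, e.g. "SRX0000001"
    sample: Sample  # the biological sample that was sequenced
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str  # hypothetical, e.g. "SRP0000001"
    experiments: List[Experiment] = field(default_factory=list)


def runs_by_sample(study: Study) -> Dict[str, List[str]]:
    """Map each sample accession to the run accessions that belong to it."""
    result: Dict[str, List[str]] = {}
    for experiment in study.experiments:
        run_ids = [run.accession for run in experiment.runs]
        result.setdefault(experiment.sample.accession, []).extend(run_ids)
    return result


# Made-up accessions, for illustration only.
liver = Sample("SRS0000001", {"tissue": "liver"})
study = Study(
    "SRP0000001",
    [
        Experiment("SRX0000001", liver, [Run("SRR0000001"), Run("SRR0000002")]),
        Experiment("SRX0000002", liver, [Run("SRR0000003")]),
    ],
)
print(runs_by_sample(study))
```

Part of what makes the model unintuitive is visible here: getting from a sample to its raw data means going through experiments, and the same sample can appear under several of them.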
DNAnexus SRA: An Evaluation
...
Slightly better search capabilities than NCBI
Bioconductor tools
To do: Check out Bioconductor tools for SRA access: http://bioconductor.org/help/workflows/high-throughput-sequencing/
** B. Bot - The Bioconductor tools are nice for querying the SRA database for information about studies, samples, etc., as Mette has done through the web UI ... but I'm not sure there is a lot of functionality there as far as accessing the raw data files. This entire project was put on hold when the future of the SRA was in jeopardy earlier this year. I am in contact with one of the co-authors of the package, though, if we need more specific information. **