How to Organize Data

New projects are set up with a basic structure that data contributors can build upon. This page documents best practices for organizing data and other materials within your NF project. The organization of your data can also affect the later annotation workflow.

Project Folders

A new Synapse Project is initialized using a default structure with these three folders:

Raw Data or Data
Milestone Reports or Reporting
Analysis

Some older projects, or independent projects not sponsored by one of our funders, may not follow this exact top-level scheme, but it is considered the community standard. You have considerable flexibility in how you structure assets within these core folders. The best practices below make a project easier for the community to navigate and easier to annotate.
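As a rough illustration of the default scheme, the sketch below checks a list of top-level folder names against the three default folders and their accepted alternative names. The function name check_top_level and the role labels ("data", "reporting", "analysis") are hypothetical and exist only for this example. The sketch works on plain strings and does not contact Synapse.

```python
# Accepted names for each default top-level folder, as listed above.
# The role keys are labels invented for this sketch.
DEFAULT_FOLDERS = {
    "data": {"Raw Data", "Data"},
    "reporting": {"Milestone Reports", "Reporting"},
    "analysis": {"Analysis"},
}


def check_top_level(folder_names):
    """Compare top-level folder names with the default scheme.

    Returns a tuple (missing_roles, extra_folders):
    - missing_roles: default roles with none of their accepted names present
    - extra_folders: folders that are not part of the default scheme
    """
    present = set(folder_names)
    missing = sorted(
        role for role, names in DEFAULT_FOLDERS.items() if not (names & present)
    )
    known = set().union(*DEFAULT_FOLDERS.values())
    extra = sorted(present - known)
    return missing, extra


if __name__ == "__main__":
    # A project with a data folder, an analysis folder, and one additional folder.
    print(check_top_level(["Data", "Analysis", "Protocols"]))
```

An extra folder reported here is not an error; it only flags material that sits outside the three default folders.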

What if I have something that is not raw data, milestone report, or analysis?

A project may create an additional folder to house materials that fall outside the scope of these three folders. This is usually not an issue.

Raw Data or Data

This folder is intended to be further partitioned by type of raw data. For raw data types and formats commonly seen in this location, see How to Format Your Data. As advised in How to Upload Data (https://sagebionetworks.jira.com/wiki/spaces/NPD/pages/2137326583/How+to+Upload+Data#3.-Create-a-folder-for-your-data), create a folder under this location for each data type.

Working Example

The Synodos NF2 project provides a good working example of how to organize multiple raw data types within Data. It demonstrates the following guidelines:

Data type is the first and most important grouping factor. Create separate folders for separate data types, e.g. an “RNA-seq” folder that will have .fastq files.

A metadata schema can be applied at the folder level to describe all files within that folder. Since metadata are specific to data types, having the same type within a folder helps keep metadata valid and consistent.

For each data type, the data can be further grouped in whatever way makes the most sense for the study. The example project further groups RNA-seq data by release year; grouping by cohort would be another reasonable choice if the study had multiple cohorts.

Original raw data are separated from processed data. A folder can be created to store the processed versions.
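The guidelines above can be sketched as a small planning step that decides where each file should go before anything is uploaded. This is a minimal sketch under stated assumptions: the function plan_layout, the record keys (name, data_type, group, processed), the sample file names, and the choice to place a folder named "processed" directly under the data type folder are all hypothetical. The guidelines only say that processed data should live in a separate folder, not where that folder belongs.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_layout(files, root="Data"):
    """Map each file to a target folder path under the data root.

    `files` is an iterable of dicts with the hypothetical keys:
      name       file name
      data_type  first-level grouping, e.g. "RNA-seq"
      group      optional second-level grouping, e.g. a release year or cohort
      processed  optional flag; processed files go to a separate folder
    """
    layout = defaultdict(list)
    for f in files:
        # Data type is always the first grouping factor.
        folder = PurePosixPath(root) / f["data_type"]
        if f.get("processed"):
            # Assumption: processed versions sit in their own folder
            # under the data type, apart from the original raw files.
            folder = folder / "processed"
        elif f.get("group"):
            # Raw files may be grouped further, e.g. by year or cohort.
            folder = folder / str(f["group"])
        layout[str(folder)].append(f["name"])
    return dict(layout)


if __name__ == "__main__":
    example = [
        {"name": "sample1.fastq", "data_type": "RNA-seq", "group": 2016},
        {"name": "sample2.fastq", "data_type": "RNA-seq", "group": 2017},
        {"name": "sample1_counts.tsv", "data_type": "RNA-seq", "processed": True},
    ]
    for folder, names in plan_layout(example).items():
        print(folder, names)
```

Keeping one data type per folder in the plan means a single metadata schema can later be applied to everything in that folder.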

Milestone Reports or Reporting

This folder should house the summary reports that link data files to specific award milestones. Unlike Raw Data, files in this folder usually don’t need further partitioning, and they are more relevant to funders than to data re-users. Some of the reports placed here are generated by the NF data coordination team.

Analysis

This can house the protocols, code, and derived results that comprise an analysis performed on raw data.

Alongside the Analysis folder, each project has its own Docker Registry to store and distribute Docker images. To make analysis code more reproducible, Docker images can recreate the environment that includes software dependencies and configurations needed for the analysis. See https://help.synapse.org/docs/Synapse-Docker-Registry.2011037752.html.
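To show how an analysis image might be named for a project's registry, here is a minimal sketch that builds an image reference. It assumes the docker.synapse.org/<project ID>/<repository name> naming convention described in the Synapse Docker Registry documentation linked above; confirm the convention there before relying on it. The function name synapse_image_ref, the project ID syn0000000, and the repository name are placeholders invented for this example.

```python
import re

# Assumed registry host; verify against the Synapse Docker Registry documentation.
REGISTRY_HOST = "docker.synapse.org"


def synapse_image_ref(project_id, repo, tag="latest"):
    """Build an image reference of the form host/projectId/repo:tag."""
    # Synapse IDs are written as "syn" followed by digits.
    if not re.fullmatch(r"syn\d+", project_id):
        raise ValueError(f"not a Synapse ID: {project_id!r}")
    return f"{REGISTRY_HOST}/{project_id}/{repo}:{tag}"


if __name__ == "__main__":
    # Placeholder project ID and repository name.
    print(synapse_image_ref("syn0000000", "rnaseq-analysis", "v1"))
```

The resulting string is what would be passed to the standard docker tag and docker push commands after logging in to the registry.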