Contributing from R

Introduction

Here we describe the contributeData function, available through the rSCR package.  We designed this function to simplify adding study or data entities to the Synapse Commons Repository (SCR) project.  The function accepts an R list, an R data frame, or a file name.  All inputs must adhere to a set of standards designed to ensure that enough information is available to uniquely identify the project, study, and data, and to locate the external file that contains the raw data.  To distinguish information that applies to a study from information that applies to data, use the study. or data. prefix (note the trailing period).  For example, use data.name to name the data and study.name to name the study.  The examples below show these conventions in practice.

Installing the rSCR package

source("http://depot.sagebase.org/CRAN.R")
pkgInstall("rSCR")

Adding raw data layers from an R list

The example project used below has the project ID syn1127233 and can be viewed on the web here.  The following code includes the steps used to build this project.  Note that the createEntity call will not succeed because the project already exists, so it is commented out; we include it here for completeness.  The remainder of the code creates a study for GEO dataset GSE7765.

library(rSCR)
# userName and password must already hold your Synapse credentials
synapseLogin(userName, password)


## The only required value is name
myProject <- Project(list(name = "Public Example Sandbox", description="A project designed to help users learn how to use Synapse."))
# The next line is commented out because the project already exists: syn1127233
# myProject <- createEntity(myProject)

# Generate a random study name
studyName <- paste("GSE7765_eg_", as.character(round(abs(rnorm(1)), 2)), sep="")
contribution <- list(study.name = studyName,
                     species = "Homo sapiens",
                     description = "MCF7 cells were treated with DMSO or 100 nM Dioxin for 16 hr...",
                     numSamples = 12,
                     platform = "hgu133a;hgu133b",
                     cellLine = "MCF7",
                     data.url = "ftp://anonymous:anonymous@ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE7765/GSE7765_RAW.tar",
                     data.status = "raw",
                     data.type = "E",
                     data.lastUpdate = "2007/05/12",                     
                     data.name = "GSE7765 Raw Data Layer from NCBI GEO",
                     data.compound = "DMSO,Dioxin")

entity <- contributeData(contribution,   
			project="syn1127233", # Project to which we wish to add the new data entity.
			keepLocal=FALSE, # Should we keep the raw data after we've downloaded it? FALSE implies it should be deleted
			logFile=TRUE) # Should we keep a logFile.  

The first time we ran this code it created a dataset in the SCR project.  You can view the study here and the raw data here.  Notice that the compound annotation is only present on the data page (you might have to scroll down to see it).  This is because we denoted it using the data. prefix in the R list.  
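
To illustrate how the prefixes separate study fields from data fields, the following sketch splits a contribution list by name using base R only.  This is not the rSCR implementation: the helper splitByPrefix is hypothetical, and how rSCR treats unprefixed fields is not stated here, so the sketch simply sets them aside.

```r
# Hypothetical helper, NOT part of rSCR: groups list elements by name prefix.
splitByPrefix <- function(contribution) {
  fields  <- names(contribution)
  isStudy <- startsWith(fields, "study.")
  isData  <- startsWith(fields, "data.")
  list(study      = contribution[isStudy],
       data       = contribution[isData],
       unprefixed = contribution[!isStudy & !isData])  # set aside, see lead-in
}

# Abbreviated version of the list used above.
example <- list(study.name    = "GSE7765_eg",
                species       = "Homo sapiens",
                data.name     = "GSE7765 Raw Data Layer from NCBI GEO",
                data.compound = "DMSO,Dioxin")

parts <- splitByPrefix(example)
names(parts$study)  # fields marked for the study
names(parts$data)   # fields marked for the data, including data.compound
```

In this sketch data.compound lands only in the data group, which mirrors why the compound annotation appears only on the data page.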

Adding many datasets from a dataframe or file

We expect that some contributors will want to contribute tens, hundreds, or even thousands of studies.  For these users, contributions can be provided in an R data frame or in an external file.  In either format, each row describes one study and each column holds a specific annotation or property.  The column headers name the properties and annotations, again with the study. and data. prefixes used for fine control over annotations.  We created an example file with contributions from the TG Gate repository.  You can view this file as a Google spreadsheet here.  Assuming this file is on your computer and named tg.txt (note that it is a tab-delimited text file, not an Excel file; you can name it whatever you want), you can contribute these studies using the following code:

# Option 1: contribute the datasets directly from the file
entities <- contributeData("tg.txt")

# Option 2 (alternative): load the file as a data frame, then contribute it
contributions <- read.delim("tg.txt",header=TRUE,stringsAsFactors=FALSE)
entities <- contributeData(contributions)
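
As a sketch of the tab-delimited layout described above, the following base R code writes a two-row file and reads it back with read.delim.  The study names, URLs, and file name are placeholders invented for illustration; only the column naming convention comes from this page.

```r
# Placeholder rows for illustration only; replace with real study details.
exampleContributions <- data.frame(
  study.name  = c("example_study_1", "example_study_2"),
  species     = c("Rattus norvegicus", "Rattus norvegicus"),
  data.name   = c("Example raw data 1", "Example raw data 2"),
  data.url    = c("ftp://example.org/study1_RAW.tar",
                  "ftp://example.org/study2_RAW.tar"),
  data.status = c("raw", "raw"),
  stringsAsFactors = FALSE
)

# One row per study, one column per property or annotation, tab-separated.
examplePath <- file.path(tempdir(), "contributions_example.txt")
write.table(exampleContributions, file = examplePath, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the prefixed headers survive the round trip.
roundTrip <- read.delim(examplePath, header = TRUE, stringsAsFactors = FALSE)
colnames(roundTrip)
```

Writing with quote = FALSE and row.names = FALSE keeps the file to plain headers and values, which is the shape a spreadsheet exported as tab-delimited text would have.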

Useful Points

  • Contributing raw data to Synapse requires the md5 sum of the raw data files.  Most operating systems provide a tool for computing it, and you should pass it to contributeData whenever it is available by supplying it under the name data.md5.  If the md5 sum is not provided, the function downloads the file (pointed to by data.url), calculates the md5 sum, and then contributes the layer to Synapse.  For large files the download time can be considerable, so provide the md5 sum up front whenever you can.
  • If the data lives on an FTP site that uses anonymous login, include the username and password in the URL, as we did for the GEO FTP site in the example above: ftp://anonymous:anonymous@ftp.com/files.tar.gz.
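
If a local copy of the raw file is already available, one way to obtain the md5 sum from within R is tools::md5sum, which ships with base R.  The sketch below uses a small temporary file as a stand-in for the real archive; the file contents, URL, and abbreviated contribution list are placeholders, not values from the SCR project.

```r
# Stand-in for a local copy of the raw data archive (placeholder content).
localFile <- file.path(tempdir(), "example_RAW.tar")
writeLines("placeholder content", localFile)

# tools::md5sum returns a named character vector; unname() keeps just the hash.
md5 <- unname(tools::md5sum(localFile))

# Abbreviated list for illustration; a real contribution needs the other fields too.
exampleContribution <- list(study.name = "example_study",
                            data.name  = "Example raw data",
                            data.url   = "ftp://example.org/example_RAW.tar",
                            data.md5   = md5)

nchar(exampleContribution$data.md5)  # an md5 hex digest is 32 characters long
```

Supplying data.md5 this way lets contributeData skip the download it would otherwise need to compute the checksum.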