We needed to build a new coexpression network with the same patients whose DNA methylation data was used for comethylation networks and other analyses.
Steps:
- Retrieve level 3 data from TCGA, request the same patient barcodes that we have in DNA methylation dataset. Level 3 data contains within array normalized and summarized data (infomation from multiple probes is summarized to a single gene)
- Perform normalization (most likely I will only need to remove the batch effect since from the analyses of DNA methylation data I know that other variables available to us such as day, month and center are correlated with batch)
- Use normalized data to build gene coexpression network
Data retrieval
- Get the patient barcodes from DNA methylation data (511 patients). Truncate the barcodes so they have only 3 fields ("-" separated), request data for download from TCGA: OV cancer, level 3, expression-gene, copy paste 511 truncate barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes Affymetrix data wasn't "applicable".
- Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. Total number of obtained files is 523.
- Get clinical files for the same 511 patient barcodes. Download to the same folder.
- Data is stored as one file per patient with the following header: barcode, gene symbol, value. The whole first column represents patient barcode. The code to extract the data in a single list and then combining it in a single matrix:
exprProcess<-function(){
patientIDs<-list()
data<-list()
files<-list.files(pattern=".txt")
y<-read.table(files[1],skip=1,header=F)
geneSymbols<-y[,2]
n=0
for (i in seq(along=files)){
print(files[i])
patExpr<-read.table(files[i],skip=1,header=F)
patientIDs[[length(patientIDs)+1]]<-patExpr[1,1]
data[[length(data)+1]]<-patExpr[,3]
n<-n+1
print(n)
}
return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
}