We needed to build a new coexpression network with the same patients whose DNA methylation data was used for comethylation networks and other analyses.
Steps:
- Retrieve level 3 data from TCGA, request the same patient barcodes that we have in DNA methylation dataset. Level 3 data contains within array normalized and summarized data (infomation from multiple probes is summarized to a single gene)
- Perform normalization (most likely I will only need to remove the batch effect since from the analyses of DNA methylation data I know that other variables available to us such as day, month and center are correlated with batch)
- Use normalized data to build gene coexpression network
Data retrieval
- Get the patient barcodes from DNA methylation data (511 patients). Truncate the barcodes so they have only 3 fields ("-" separated), request data for download from TCGA: OV cancer, level 3, expression-gene, copy paste 511 truncate barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes Affymetrix data wasn't "applicable".
- Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. Total number of obtained files is 523.
- Get clinical files for the same 511 patient barcodes. Download to the same folder.
- Data is stored as one file per patient with the following header: barcode, gene symbol, value. The whole first column represents patient barcode. The code to extract the data in a single list and then combining it in a single matrix:
After that I combined the list of datasets from each patient to a single matrix, geneSymbols were used as row names and patientIDs as column names.
exprProcess<-function(){ patientIDs<-list() data<-list() files<-list.files(pattern=".txt") y<-read.table(files[1],skip=1,header=F) geneSymbols<-y[,2] n=0 for (i in seq(along=files)){ print(files[i]) patExpr<-read.table(files[i],skip=1,header=F) patientIDs[[length(patientIDs)+1]]<-patExpr[1,1] data[[length(data)+1]]<-patExpr[,3] n<-n+1 print(n)} return(list(Data=data,patNames=patientIDs,genes=geneSymbols)) }
Normalization
Next step is to do PCA that I will use for identification of variables that affect the data.
library(corpcor) u<-fast.svd(sweep(expr),1,rowMeans(expr)) #to plot relative variance: plot(u$d^2/sum(u$d^2),main="Relative variance")