...
- Get the patient barcodes from DNA methylation data (511 patients). Truncate the barcodes so they have only 3 fields ("-" separated), request data for download from TCGA: OV cancer, level 3, expression-gene, copy paste 511 truncate barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes Affymetrix data wasn't "applicable".
- Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. Total number of obtained files is 523.
- Get clinical files for the same 511 patient barcodes. Download to the same folder.
- Data is stored as one file per patient with the following header: barcode, gene symbol, value. The whole first column represents patient barcode. The code to extract the data in a single list and then combining it in a single matrix:
Code Block exprProcess<-function(){
...
patientIDs<-list()
...
data<-list()
...
files<-list.files(pattern=".txt")
...
y<-read.table(files[1],skip=1,header=F)
...
geneSymbols<-y[,2]
...
n=0
...
for (i in seq(along=files)){
...
print(files[i])
...
patExpr<-read.table(files[i],skip=1,header=F)
...
patientIDs[[length(patientIDs)+1]]<-patExpr[1,1]
...
data[[length(data)+1]]<-patExpr[,3]
...
n<-n+1
...
print(n)}
...
return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
...
After that I combined the list of datasets from each patient to a single matrix, geneSymbols were used as row names and patientIDs as column names.}
Normalization
Next step is to do PCA that I will use for identification of variables that affect the data.
Code Block |
---|
library(corpcor)
u<-fast.svd(sweep(expr),1,rowMeans(expr))
#to plot relative variance:
plot(u$d^2/sum(u$d^2),main="Relative variance") |