Data retrieval and normalization (large set of patients)

We needed to build a new coexpression network with the same patients whose DNA methylation data was used for comethylation networks and other analyses.

Steps:

Retrieve level 3 data from TCGA, request the same patient barcodes that we have in DNA methylation dataset. Level 3 data contains within array normalized and summarized data (infomation from multiple probes is summarized to a single gene)
Perform normalization (most likely I will only need to remove the batch effect since from the analyses of DNA methylation data I know that other variables available to us such as day, month and center are correlated with batch)
Use normalized data to build gene coexpression network

Data retrieval

Get the patient barcodes from DNA methylation data (511 patients). Truncate the barcodes so they have only 3 fields ("-" separated), request data for download from TCGA: OV cancer, level 3, expression-gene, copy paste 511 truncate barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes Affymetrix data wasn't "applicable".
Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. Total number of obtained files is 522. Types of patients represented in this set:
```
> colnames(expr)[1:4]
[1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
[3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
> sp<-strsplit(colnames(expr),split="-")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D 02A
481  24   2   1  14
```
So all the patients are represented by the tumor samples (01-09: tumor types; 01 - solid tumor; 02 - recurrent solid tumor; A-D: vial count as pertains to an individual patient_sample)
For the comparison, I also looked at the samples in the methylation dataset:
```
> mb<-read.table("methylated_batch.txt",header=T,row.names=1)
> length(colnames(mb))
[1] 511
> colnames(mb)[1:4]
[1] "TCGA.04.1331.01A.01D.0432.05" "TCGA.04.1332.01A.01D.0432.05"
[3] "TCGA.04.1335.01A.01D.0432.05" "TCGA.04.1336.01A.01D.0432.05"
> sp<-strsplit(colnames(mb),"[.]")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D
483  25   2   1
```
So for the expression analysis I need to get rid of all patients that have secondary tumor (02) and I am missing one patient with 01A and 01B in the expression dataset. So the solution is to match the long patient IDs and work with that.
Get clinical files for the same 511 patient barcodes. Download to the same folder.

Data is stored as one file per patient with the following header: barcode, gene symbol, value. The whole first column represents patient barcode. The code to extract the data in a single list and then combining it in a single matrix:

exprProcess<-function(){
  patientIDs<-list()
  data<-list()
  files<-list.files(pattern=".txt")
  y<-read.table(files[1],skip=1,header=F)
  geneSymbols<-y[,2]
  n=0
  for (i in seq(along=files)){
    print(files[i])
    patExpr<-read.table(files[i],skip=1,header=F)
    patientIDs[[length(patientIDs)+1]]<-patExpr[1,1]
    data[[length(data)+1]]<-patExpr[,3]
    n<-n+1
    print(n)}

  return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
}

After that I combined the list of datasets from each patient to a single matrix, geneSymbols were used as row names and patientIDs as column names.

Normalization

Next step is to do PCA that I will use for identification of variables that affect the data.

library(corpcor)
u<-fast.svd(sweep(expr),1,rowMeans(expr))


#to plot relative variance:
plot(u$d^2/sum(u$d^2),main="Relative variance")


#plot the first 4 eigengenes:
par(mfrow=c(2,2))
plot(u$v[,1],main="1")
plot(u$v[,2],main="2")plot(u$v[,3],main="3")plot(u$v[,4],main="4")

#Identification of the "worst" offenders of the first eigengene:
order(u$v[,1])[1]
order(-u$v[,1])[1]
plot(expr[,c(37,383)],main="37 vs 383")

So it seems to me that the data on overall doesn't look horrible. Now I wanted to see if the first eigengene is correlated with batch (eyeballing it). Batch is the sixth field in the patient barcode. So just split patient barcodes by "-" and get the sixth field. First four eigengenes in relation to the batch effect:

library(lattice)
xyplot(u$v[,1]~factor(batch),main="1")
xyplot(u$v[,2]~factor(batch),main="2")
xyplot(u$v[,3]~factor(batch),main="3")
xyplot(u$v[,4]~factor(batch),main="4")

Remove batch effect:

X<-model.matrix(~factor(batch))
bch <- solve(t(X) %*% X) %*% t(X) %*% t(expr)
resExpr <- expr-t(X %*% bch)

Lets do the same diagnostic tests:

The only thing that concerns me after removing the batch effect is how "the worst" offenders look like. These two arrays were identified as the "worst" offenders of the first eigengene after removing the batch effect and here how they look like (first column). The second column shows the same 2 datasets before removing the batch effect:

- batch effect	+ batch effect

It seems that removing the batch effect completely reversed their relationship.

Now another comparison for the two datasets that were identified as the worst offenders before removing the batch effect and what happened to them after removing the batch effect:

- batch	+ batch

Is it too drastic?