/
Data retrieval and normalization (large set of patients)

Data retrieval and normalization (large set of patients)

We needed to build a new coexpression network with the same patients whose DNA methylation data was used for comethylation networks and other analyses.

Steps:

  1. Retrieve level 3 data from TCGA, request the same patient barcodes that we have in DNA methylation dataset. Level 3 data contains within array normalized and summarized data (infomation from multiple probes is summarized to a single gene)
  2. Perform normalization (most likely I will only need to remove the batch effect since from the analyses of DNA methylation data I know that other variables available to us such as day, month and center are correlated with batch)
  3. Use normalized data to build gene coexpression network

Data retrieval

  1. Get the patient barcodes from DNA methylation data (511 patients). Truncate the barcodes so they have only 3 fields ("-" separated), request data for download from TCGA: OV cancer, level 3, expression-gene, copy paste 511 truncate barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes Affymetrix data wasn't "applicable". 
  2. Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. 
  3. Get clinical files for the same 511 patient barcodes. Download to the same folder.
  4. Data is stored as one file per patient with the following header: barcode, gene symbol, value. The whole first column represents patient barcode. The code to extract the data in a single list and then combining it in a single matrix: 
    exprProcess<-function(){
      patientIDs<-list()
      data<-list()
      files<-list.files(pattern=".txt")
      y<-read.table(files[1],skip=1,header=F)
      geneSymbols<-y[,2]
      n=0
      for (i in seq(along=files)){
        print(files[i])
        patExpr<-read.table(files[i],skip=1,header=F)
        patientIDs[[length(patientIDs)+1]]<-patExpr[1,1]
        data[[length(data)+1]]<-patExpr[,3]
        n<-n+1
        print(n)}
    
      return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
    }
    After that I combined the list of datasets from each patient to a single matrix, geneSymbols were used as row names and patientIDs as column names. 

Total number of obtained files is 522. Types of patients represented in this set:

> colnames(expr)[1:4]
[1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
[3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
> sp<-strsplit(colnames(expr),split="-")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D 02A
481  24   2   1  14

So all the patients are represented by the tumor samples (01-09: tumor types; 01 - solid tumor; 02 - recurrent solid tumor; A-D: vial count as pertains to an individual patient_sample) 
For the comparison, I also looked at the samples in the methylation dataset:

> mb<-read.table("methylated_batch.txt",header=T,row.names=1)
> length(colnames(mb))
[1] 511
> colnames(mb)[1:4]
[1] "TCGA.04.1331.01A.01D.0432.05" "TCGA.04.1332.01A.01D.0432.05"
[3] "TCGA.04.1335.01A.01D.0432.05" "TCGA.04.1336.01A.01D.0432.05"
> sp<-strsplit(colnames(mb),"[.]")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D
483  25   2   1

So for the expression analysis I need to get rid of all patients that have secondary tumor (02) and I am missing one patient with 01A and 01B in the expression dataset. So the solution is to match the patient IDs using only the first 4 fields (because the rest of them obviously differ (sample type (D or R))). As the result I got 507 patients for gene expression data (no secondary tumor), 3 missing from the 01A and 1 is missing from 01B group. I think I will stick to this group for the future normalization steps and the analyses. 

Normalization

Next step is to do PCA that I will use for identification of variables that affect the data. 

library(corpcor)
u<-fast.svd(sweep(expr,1,rowMeans(expr)))


#to plot relative variance:
plot(u$d^2/sum(u$d^2),main="Relative variance")

#Identification of the "worst" offenders of the first eigengene:
order(u$v[,1])[1]
order(-u$v[,1])[1]
plot(expr[,c(37,313)],main="37 vs 383")


These two arrays came from different batches (graph title is wrong: it is 37 vs 313). Plot 1st eigengene vs batch and label the outliers:

x<-u507$v[,1]

names(x)<-colnames(exp507)

#Identification of outliers in the boxplot of the first eigengene grouped by batch
> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out#Identification of outliers in the boxplot of the first eigengene grouped by batch
> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out
TCGA-13-0762-01A TCGA-13-0768-01A TCGA-24-1614-01A TCGA-04-1655-01A
     -0.12132315       0.06403801      -0.10078455      -0.05562895
TCGA-04-1649-01A TCGA-04-1652-01A TCGA-09-2049-01D TCGA-29-2425-01A
     -0.07523963      -0.15237043      -0.09206651      -0.05197294
TCGA-13-0762-01A TCGA-13-0768-01A TCGA-24-1614-01A TCGA-04-1655-01A
     -0.12132315       0.06403801      -0.10078455      -0.05562895
TCGA-04-1649-01A TCGA-04-1652-01A TCGA-09-2049-01D TCGA-29-2425-01A
     -0.07523963      -0.15237043      -0.09206651      -0.0519729

> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")
> text(x=boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$group,y=boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out,labels=names(boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out))

 

Kruskal-Wallis test for association of the first 4 egengenes with batch:

> kruskal.test(u507$v[,1],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 1] and shortBatch
Kruskal-Wallis chi-squared = 126.2598, df = 12, p-value < 2.2e-16

> kruskal.test(u507$v[,2],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 2] and shortBatch
Kruskal-Wallis chi-squared = 44.3239, df = 12, p-value = 1.345e-05

> kruskal.test(u507$v[,3],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 3] and shortBatch
Kruskal-Wallis chi-squared = 22.1812, df = 12, p-value = 0.03554

> kruskal.test(u507$v[,4],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 4] and shortBatch
Kruskal-Wallis chi-squared = 62.7275, df = 12, p-value = 7.152e-09

I identified that the first 150 eigengenes account for 80% of the variance, performed Kruskal-Wallis test with all 150 and plotted the P value. Also looked at the correlation of the batch and the center effect and looked at the correlation of the first 150 eigengenes with the center:

Remove the batch effect:

X<-model.matrix(~factor(batch))
bch <- solve(t(X) %*% X) %*% t(X) %*% t(expr)
resExpr <- expr-t(X %*% bch)

Look at the relative variance and the association of the eigengenes with the batch and the center after removing the batch effect (again, looking at the first 150 eigengenes):

I looked specifically at the p values from the Kruskal-Wallis test for association with the center effect: 

> kruskal.test(uRes$v[,1],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 1] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 34.8516, df = 14, p-value = 0.001546

> kruskal.test(uRes$v[,2],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 2] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 16.1882, df = 14, p-value = 0.302

> kruskal.test(uRes$v[,3],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 3] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 16.0633, df = 14, p-value = 0.3095

> kruskal.test(uRes$v[,4],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 4] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 32.3457, df = 14, p-value = 0.003576

> kruskal.test(uRes$v[,5],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 5] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 23.6943, df = 14, p-value = 0.04987

> kruskal.test(uRes$v[,6],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 6] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 21.6685, df = 14, p-value = 0.08569

I guess for some eigengenes these p values might be considered significant but for our analyses I stopped at this step (only the batch effect was removed) and I now will proceed with building a coexpression network using these 507 patients. 

Note: I also performed a few analyses where I removed batch and the center and then also looked at the distribution of Kruskal-Wallis test with day of shipment, month of shipment, year of shipment, concentration, plate column, plate row and amount. Justin suggested that center effect is very minor and not worth removing but I also noticed that removing batch and center also completely removed the day, month and year of shipment effects (which I saw that in DNA methylation normalization these technical factors were highly correlated with the batch effect) and concentration, plate column, plate row and amount are insignificant. The graphs for these analyses can be found here.