Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

Code Block
library(corpcor)
u<-fast.svd(sweep(expr,1,rowMeans(expr)))


#to plot relative variance:
plot(u$d^2/sum(u$d^2),main="Relative variance")


#plot the first 4 eigengenes:
par(mfrow=c(2,2))
plot(u$v[,1],main="1")
plot(u$v[,2],main="2")plot(u$v[,3],main="3")plot(u$v[,4],main="4")

#Identification of the "worst" offenders of the first eigengene:
order(u$v[,1])[1]
order(-u$v[,1])[1]
plot(expr[,c(37,383313)],main="37 vs 383")

Image Removed Image Removed Image Removed
So it seems to me that the data on overall doesn't look horrible. Now I wanted to see if the first eigengene is correlated with batch (eyeballing it). Batch is the sixth field in the patient barcode. So just split patient barcodes by "-" and get the sixth field. First four eigengenes in relation to the batch effect:

Code Block
library(lattice)
xyplot(u$v[,1]~factor(batch),main="1")
xyplot(u$v[,2]~factor(batch),main="2")
xyplot(u$v[,3]~factor(batch),main="3")
xyplot(u$v[,4]~factor(batch),main="4")

Image Removed Image Removed Image Removed Image RemovedImage Added Image Added
These two arrays came from different batches (graph title is wrong: it is 37 vs 313). Plot 1st eigengene vs batch and label the outliers:

Code Block
x<-u507$v[,1]

names(x)<-colnames(exp507)

#Identification of outliers in the boxplot of the first eigengene grouped by batch
> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out#Identification of outliers in the boxplot of the first eigengene grouped by batch
> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out
TCGA-13-0762-01A TCGA-13-0768-01A TCGA-24-1614-01A TCGA-04-1655-01A
     -0.12132315       0.06403801      -0.10078455      -0.05562895
TCGA-04-1649-01A TCGA-04-1652-01A TCGA-09-2049-01D TCGA-29-2425-01A
     -0.07523963      -0.15237043      -0.09206651      -0.05197294
TCGA-13-0762-01A TCGA-13-0768-01A TCGA-24-1614-01A TCGA-04-1655-01A
     -0.12132315       0.06403801      -0.10078455      -0.05562895
TCGA-04-1649-01A TCGA-04-1652-01A TCGA-09-2049-01D TCGA-29-2425-01A
     -0.07523963      -0.15237043      -0.09206651      -0.0519729

> boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")
> text(x=boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$group,y=boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out,labels=names(boxplot(split(x,shortBatch),xlab="batch",ylab="1st eigengene")$out))

Image Added  Image Added

Kruskal-Wallis test for association of each egengene the first 4 egengenes with batch:

Code Block
> kruskal.test(u507$v[,1],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 1] and shortBatch
Kruskal-Wallis chi-squared = 126.2598, df = 12, p-value < 2.2e-16

> kruskal.test(u507$v[,2],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 2] and shortBatch
Kruskal-Wallis chi-squared = 44.3239, df = 12, p-value = 1.345e-05

> kruskal.test(u507$v[,3],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 3] and shortBatch
Kruskal-Wallis chi-squared = 22.1812, df = 12, p-value = 0.03554

> kruskal.test(u507$v[,4],shortBatch)

        Kruskal-Wallis rank sum test

data:  u507$v[, 4] and shortBatch
Kruskal-Wallis chi-squared = 62.7275, df = 12, p-value = 7.152e-09

I identified that the first 150 eigengenes account for 80% of the variance, performed Kruskal-Wallis test with all 150 and plotted the P value. Also looked at the correlation of the batch and the center effect and looked at the correlation of the first 150 eigengenes with the center:
Image Added Image Added Image Added
Remove the batch effect:

Code Block
X<-model.matrix(~factor(batch))
bch <- solve(t(X) %*% X) %*% t(X) %*% t(expr)
resExpr <- expr-t(X %*% bch)

Lets do the same diagnostic tests:

Image Removed Image Removed
The only thing that concerns me after removing the batch effect is how "the worst" offenders look like. These two arrays were identified as the "worst" offenders of the first eigengene Look at the relative variance and the association of the eigengenes with the batch and the center after removing the batch effect and here how they look like (first column). The second column shows the same 2 datasets before removing the batch effect:

- batch effect

+ batch effect

Image Removed

Image Removed

It seems that removing the batch effect completely reversed their relationship.

Now another comparison for the two datasets that were identified as the worst offenders before removing the batch effect and what happened to them after removing the batch effect:

- batch

+ batch

Image Removed

Image Removed

Is it too drastic?(again, looking at the first 150 eigengenes):
Image Added Image Added Image Added
I looked specifically at the p values from the Kruskal-Wallis test for association with the center effect: 

Code Block
> kruskal.test(uRes$v[,1],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 1] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 34.8516, df = 14, p-value = 0.001546

> kruskal.test(uRes$v[,2],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 2] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 16.1882, df = 14, p-value = 0.302

> kruskal.test(uRes$v[,3],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 3] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 16.0633, df = 14, p-value = 0.3095

> kruskal.test(uRes$v[,4],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 4] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 32.3457, df = 14, p-value = 0.003576

> kruskal.test(uRes$v[,5],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 5] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 23.6943, df = 14, p-value = 0.04987

> kruskal.test(uRes$v[,6],shortCenter[,3])

        Kruskal-Wallis rank sum test

data:  uRes$v[, 6] and shortCenter[, 3]
Kruskal-Wallis chi-squared = 21.6685, df = 14, p-value = 0.08569

I guess for some eigengenes these p values might be considered significant but for our analyses I stopped at this step (only the batch effect was removed) and I now will proceed with building a coexpression network using these 507 patients. 

Note: I also performed a few analyses where I removed batch and the center and then also looked at the distribution of Kruskal-Wallis test with day of shipment, month of shipment, year of shipment, concentration, plate column, plate row and amount. Justin suggested that center effect is very minor and not worth removing but I also noticed that removing batch and center also completely removed the day, month and year of shipment effects (which I saw that in DNA methylation normalization these technical factors were highly correlated with the batch effect) and concentration, plate column, plate row and amount are insignificant. The graphs for these analyses can be found here.