M value

M value: why I decided to split the probes into "red" and "green"

SVD on the entire 27k CpG by 511 patient M value matrix, plot the 1st eigenarray (first column of the u matrix) and "color" the points according to the dye with which each CpG was labeled:
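A minimal sketch of this step on simulated data. The matrix `M`, the `dye` labels and the size of the dye shift are all made up for illustration; only `svd()` and the choice of the first column of `u` come from the description above.

```r
# Simulated stand-in for the M value matrix: rows are CpG probes, columns are patients.
set.seed(1)
n_probes <- 200
n_patients <- 20
dye <- rep(c("red", "green"), each = n_probes / 2)   # hypothetical dye label per probe
M <- matrix(rnorm(n_probes * n_patients), n_probes, n_patients)
M[dye == "red", ] <- M[dye == "red", ] + 3           # simulated dye-specific shift

s <- svd(M)
eigenarray1 <- s$u[, 1]   # first eigenarray: one value per probe

# Color each probe by its dye; the two dyes separate into two bands.
plot(eigenarray1, col = dye, pch = 20,
     xlab = "CpG probe index", ylab = "First eigenarray (column 1 of u)")
```

If the two dyes form clearly separated bands in a plot like this, analysing the red and green probes separately is a reasonable response.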

M value: analysis of red probes, identification of adjustment variables

Percent variance explained of the "red" dataset:
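A sketch of how such percentages can be computed from the singular values, assuming "percent variance explained" means each squared singular value as a share of their sum (the notes don't say whether the data were centered first). `red_demo` is a simulated stand-in, not the real red matrix.

```r
set.seed(2)
red_demo <- matrix(rnorm(50 * 12), 50, 12)   # hypothetical: 50 probes x 12 patients

d <- svd(red_demo)$d                 # singular values, largest first
pct_var <- 100 * d^2 / sum(d^2)      # share of total sum of squares per eigengene

barplot(pct_var[1:10], names.arg = 1:10,
        xlab = "Eigengene", ylab = "Percent variance explained")
```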

Potential adjustment variables: batch (sixth field in the patient ID barcode); center (second field in the patient ID barcode); day, month, year of shipment; concentration, amount, plate row, plate column. Potential biology: tumor stage, tumor grade, age.

Association (p-values) of the first four eigengenes of the red probes with the adjustment and biological variables:

(Kruskal-Wallis test for categorical variables and Spearman correlation for age)

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	2.2e-16	2.2e-16	2.2e-16	2.2e-16	2.2e-16	2.2e-16	0.0004486	5.882547e-02	6.881028e-02	0.32	0.27	0.1071
2	0.5383	0.3486	0.5577	0.9876	0.2710	0.04873	0.6482	0.2862026	0.4786892	0.31	0.10	0.006634
3	0.04048	0.05258	0.03756	0.01480	0.1233	0.1786	0.5335	0.55585676	0.25289498	0.50	0.35	0.5131
4	0.0003439	0.01709	0.0008948	0.0001387	0.5725	0.7225	0.5267	0.0516508987	0.1404578746	0.43	0.43	0.02168
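A table like the one above can be produced along these lines. This is only a sketch on simulated data: the covariate data frame, its values and the simulated batch effect are assumptions; the tests are base R's `kruskal.test()` for factors and `cor.test(method = "spearman")` for numeric variables.

```r
set.seed(3)
n <- 60
# Hypothetical covariates; in the notes these come from the barcode and the adj table.
covars <- data.frame(
  batch = factor(sample(c("0359", "0402", "0432"), n, replace = TRUE)),
  stage = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  age   = round(runif(n, 40, 85))
)
M <- matrix(rnorm(100 * n), 100, n)                      # probes x patients
M <- M + outer(rnorm(100), as.integer(covars$batch))     # simulated batch effect

eigengenes <- svd(M)$v[, 1:4]                            # patients x 4

# One p-value per (eigengene, covariate) pair.
pvals <- sapply(names(covars), function(nm) {
  x <- covars[[nm]]
  apply(eigengenes, 2, function(e) {
    if (is.factor(x)) {
      kruskal.test(e, x)$p.value
    } else {
      cor.test(e, x, method = "spearman", exact = FALSE)$p.value
    }
  })
})
rownames(pvals) <- paste0("PC", 1:4)
signif(pvals, 3)
```

In the simulated data only batch is built into the matrix, so the PC1/batch cell should be tiny.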

It looks like removing the batch and the plate row helped somewhat with the center effect, but the plate column effect is still highly significant. Need to remove that.

From my previous work with Brig we identified that day, month and year of shipment and center are highly correlated with batch. Therefore start by removing the batch effect. Percent variance explained after removing the batch effect:

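> # design matrix: intercept plus batch indicator columns
> X<-model.matrix(~factor(batch))
> # least-squares coefficients for every probe (red is probes x patients)
> Xbc<-solve(t(X) %*% X) %*% t(X) %*% t(red)
> # residuals = batch-adjusted data (the per-probe mean is removed as well)
> redB<- red-t(X %*% Xbc)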
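The batch adjustment above is ordinary least squares applied to every probe, with the residuals kept. As a cross-check, the same residuals can be obtained from a QR decomposition, which avoids inverting t(X) %*% X explicitly. The sketch below uses simulated data; `batch_demo` and `red_demo` are hypothetical stand-ins.

```r
set.seed(4)
batch_demo <- rep(c("0359", "0402", "0432"), each = 5)   # hypothetical batch labels
red_demo <- matrix(rnorm(30 * 15), 30, 15)               # probes x patients

X <- model.matrix(~ factor(batch_demo))

# Normal-equations form, as in the notes
coefs <- solve(t(X) %*% X) %*% t(X) %*% t(red_demo)
resid_normal <- red_demo - t(X %*% coefs)

# QR-based equivalent: residuals of each probe regressed on the design matrix
resid_qr <- t(qr.resid(qr(X), t(red_demo)))

max(abs(resid_normal - resid_qr))   # difference at the level of floating-point error
```

Because the design matrix contains an intercept and one indicator per remaining batch, every probe ends up with mean zero within each batch.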

Looks like the first eigengene now explains ~20% of the variance. Correlation with the adjustment and bio variables after removing the batch effect:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	0.7748	9.445e-05	6.8e-01	8.2e-01	8.0e-01	5.4e-01	1.1e-01	1.3e-07	9.4e-05	0.541	0.093	0.6379
2	1	0.4722	1.00	1.00	0.98	0.78	0.76	0.26	0.45	0.18	0.11	0.01303
3	1	0.03917	1.00	1.00	0.94	0.98	0.45	0.59	0.15	0.60	0.24	0.06834
4	1	0.3475	1.000	1.000	0.998	0.969	0.428	0.055	0.104	0.56	0.73	0.1463

Removing the batch effect didn't completely remove the center and the row/column effects. Day, month and year of shipment have been taken care of. Next, remove the batch and the plate row effects. We identified with Justin that the entire 0652 batch doesn't have any plate row or column information:

> table(batch, adj$plate_row)

batch  A B C D E F G H
  0359 5 5 5 5 5 5 4 4
  0402 5 4 4 5 5 5 4 4
  0432 6 5 5 5 6 5 6 5
  0460 6 6 6 6 6 6 6 5
  0475 6 6 6 6 5 6 6 5
  0501 3 3 3 3 3 3 2 2
  0536 6 6 6 6 6 6 6 5
  0563 6 6 6 6 6 6 6 5
  0581 6 6 6 6 6 6 6 5
  0652 0 0 0 0 0 0 0 0
  0667 6 6 6 6 6 6 5 5
  0708 6 6 6 6 5 6 5 5
  0807 1 0 0 2 0 1 0 0
> table(batch, adj$plate_column)
batch  1 2 3 4 5 6
  0359 8 8 8 8 6 0
  0402 8 8 6 8 6 0
  0432 5 8 7 8 8 7
  0460 8 8 8 8 8 7
  0475 8 8 7 8 8 7
  0501 8 8 6 0 0 0
  0536 8 8 8 8 8 7
  0563 8 8 8 8 8 7
  0581 8 8 8 8 8 7
  0652 0 0 0 0 0 0
  0667 8 7 8 8 8 7
  0708 8 7 7 8 8 7
  0807 2 1 1 0 0 0

We decided to remove the 43 patients in batch 0652 from further analyses and work with a smaller dataset (468 patients):

> mask<-batch!="0652"
> length(mask)
[1] 511
> table(mask)
mask
FALSE  TRUE
   43   468
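> # design matrix: batch plus plate row, restricted to the 468 retained patients
> X<-model.matrix(~factor(batch[mask]) + adj$plate_row[mask])
> Xbcrw<-solve(t(X) %*% X) %*% t(X) %*% t(red[,mask])
> # residuals after removing the batch and plate row effects
> redBR<- red[,mask] - t(X %*% Xbcrw)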

Percent variance explained after removing the batch and the plate row effects:

The first principal component is smaller but not significantly so. Let's look again at the variables:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	0.7905	0.0001809	6.9e-01	8.4e-01	8.2e-01	6.4e-01	9.7e-02	9.9e-01	1.2e-05	0.41	0.53	0.6716
2	1	0.7522	1.00	1.00	0.96	0.93	0.75	0.96	0.46	0.36	0.30	0.02475
3	1	0.03907	1.00	1.00	0.93	0.97	0.52	1.00	0.17	0.30	0.27	0.1425

Now remove the batch, plate row and plate column, look at the percent variance explained:

Now the first principal component explains a little less than 15% of the overall variability. Correlation with the adjustment and bio variables, to see if the center effect is still present:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	0.936	7.104e-05	0.881	0.915	0.938	0.780	0.027	0.998	0.757	0.36	0.22	0.6967
2	1	0.7801	1.00	1.00	0.96	0.92	0.75	0.97	0.45	0.36	0.30	0.02479
3	1	0.04068	1.00	1.00	0.93	0.97	0.51	1.00	0.18	0.30	0.27	0.1425
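The successive adjustments (batch, then batch + row, then batch + row + column) all follow the same pattern, so they can be wrapped in one small helper. This is a sketch, not the code used for the tables above: `remove_effects` is a hypothetical name, the data are simulated, and it uses a QR decomposition instead of solve(), which also copes with a rank-deficient design (for example if a center were fully nested within a batch).

```r
# Hypothetical helper: residualize each probe (row of M) on a set of covariates.
remove_effects <- function(M, covars) {
  X <- model.matrix(~ ., data = covars)
  stopifnot(nrow(X) == ncol(M))   # model.matrix drops rows with NA by default
  t(qr.resid(qr(X), t(M)))
}

set.seed(5)
# Simulated plate layout: 8 rows x 6 columns, three batches spread across it.
covars <- expand.grid(plate_row = LETTERS[1:8], plate_column = factor(1:6))
covars$batch <- factor(rep(c("0359", "0402", "0432"), length.out = nrow(covars)))
M <- matrix(rnorm(40 * nrow(covars)), 40, nrow(covars))   # probes x patients

adjB   <- remove_effects(M, covars["batch"])
adjBR  <- remove_effects(M, covars[c("batch", "plate_row")])
adjBRC <- remove_effects(M, covars[c("batch", "plate_row", "plate_column")])

# Residual sum of squares can only shrink as more variables are removed.
c(B = sum(adjB^2), BR = sum(adjBR^2), BRC = sum(adjBRC^2))
```

The `stopifnot()` inside the helper is there because patients with missing plate information would otherwise be dropped from the design matrix without warning, which is the reason batch 0652 had to be masked first.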

Finally, the center effect needs to go. Variables to adjust for: batch, center, plate row, plate column. Percent variance explained after the adjustment:

Under 14%! Still a large effect, look at the variables:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	0.9986	0.9998	0.993	0.987	0.963	0.897	0.042	0.999	0.433	0.39	0.34	0.8608
2	1	1	1.00	1.00	0.95	0.96	0.86	0.97	0.61	0.27	0.36	0.02406
3	1	1	1.00	1.00	0.84	0.96	0.50	1.00	0.75	0.33	0.45	0.3325

Now let's take a look at the outliers of the first eigengene (patients 6 vs 367):

I guess it doesn't look too terrible. I also tried to remove all the listed variables as well as the first principal component and here is what I got in terms of the percent variance explained and the outliers:
To me it looks worse than when the first principal component is left in. Final: remove the batch, center, plate row and plate column from the data.
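For the outlier comparison, one way to pick a pair of patients is to take the two extremes of the first eigengene and plot their probe profiles against each other. That selection rule is an assumption (the notes don't say how patients 6 and 367 were chosen), and `adj_demo` below is simulated with two planted outliers.

```r
set.seed(6)
adj_demo <- matrix(rnorm(100 * 30), 100, 30)   # hypothetical adjusted data: probes x patients
a <- rnorm(100)
adj_demo[, 7]  <- adj_demo[, 7]  + 3 * a       # planted outlier in one direction
adj_demo[, 19] <- adj_demo[, 19] - 3 * a       # planted outlier in the other

v1 <- svd(adj_demo)$v[, 1]                     # first eigengene: one value per patient
lo <- which.min(v1)
hi <- which.max(v1)

# Probe-by-probe comparison of the two most extreme patients
plot(adj_demo[, lo], adj_demo[, hi], pch = 20,
     xlab = paste("Patient", lo), ylab = paste("Patient", hi))
```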

M value: analysis of green probes, identification of adjustment variables

Percent variance explained:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	2.2e-16	2.2e-16	1.6e-61	5.7e-39	3.8e-31	1.6e-19	6.9e-04	3.2e-02	2.2e-01	0.36	0.17	0.2778

Remove the batch, center, plate row, plate column (also mask batch 0652), look at the first eigengene and the eigenarray:

P values after the adjustment:

PCs	Batch	Center	Day	Month	Year	Amount	Concentr.	Row	Column	Stage	Grade	Age
1	0.999	0.9999	0.995	0.991	0.956	0.863	0.074	0.997	0.626	0.60	0.26	0.6206

Look at the outliers:

Weird!
Do one more test and remove the first eigengene together with the variables above:

Conclusion

Now scale (center=TRUE, scale=TRUE) both datasets (red and green probes) and combine them for network construction.
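A sketch of this last step, assuming each probe is scaled across patients and the two sets are combined as patients x probes. The orientation is my assumption; `red_adj` and `green_adj` are simulated stand-ins for the adjusted matrices.

```r
set.seed(7)
red_adj   <- matrix(rnorm(20 * 10, mean = 2, sd = 3), 20, 10)   # hypothetical: 20 red probes x 10 patients
green_adj <- matrix(rnorm(15 * 10, mean = -1), 15, 10)          # hypothetical: 15 green probes x 10 patients

# scale() standardizes columns, so transpose to patients x probes first.
red_scaled   <- scale(t(red_adj), center = TRUE, scale = TRUE)
green_scaled <- scale(t(green_adj), center = TRUE, scale = TRUE)

combined <- cbind(red_scaled, green_scaled)   # patients x (red + green probes)
dim(combined)
```

After this every probe has mean 0 and standard deviation 1 across patients, so red and green probes enter the network on the same scale.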