Forcing clusters in colorectal cancer

In the colon cancer analysis described here, I wasn't able to identify clusters that correlate with the mutations. So I asked another question: could the reason for not identifying CIMP be that I am dealing with colon cancer alone rather than colorectal cancer? To test this hypothesis:

Download rectal cancer 27k data from TCGA
Normalize together with the colon cancer data
Test whether we can identify CIMP clusters using either Shen's or Laird's definitions

Normalization

Colon cancer data: 166 patients; rectal cancer data: 70 patients. Mutation data are available for only 128 patients (from Qingying's all_mut.txt table).

I combined the datasets into one and analysed the batch effect. The number above each boxplot is the number of patients in that batch. Because batch is strongly associated with the PCs (PC1: 2.2e-16) and with the tissue source (rectal vs. colon: p-value < 2.2e-16), I decided to select the most similar batches (all boxplots whose median is above the line), because this gave more patients (149). After selecting these patients, PC1 is no longer highly correlated with batch, but PC2 is: p-value = 1.499e-06. I then removed the batch effect from the data and removed 5 outliers.
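The batch selection step (keep every batch whose boxplot median sits above the line) can be sketched roughly as below. This is only an illustration: the function name `select_batches`, the threshold of 0.45, and the toy values are all assumptions, not the code or numbers used in this analysis.

```python
from statistics import median


def select_batches(values_by_batch, threshold):
    """Return the sorted batch IDs whose median value is above `threshold`."""
    return sorted(
        batch
        for batch, values in values_by_batch.items()
        if median(values) > threshold
    )


# Toy per-patient summary values grouped by batch (made-up, not TCGA data).
toy = {
    "batch_A": [0.52, 0.55, 0.58],
    "batch_B": [0.31, 0.35, 0.33, 0.36],
    "batch_C": [0.49, 0.51],
}

# Hypothetical position of "the line" on the boxplot figure.
kept = select_batches(toy, threshold=0.45)

# Number of patients retained after dropping the dissimilar batches.
n_patients = sum(len(toy[batch]) for batch in kept)
print(kept, n_patients)
```

Counting the retained patients for a few candidate thresholds is one way to see the trade-off between batch similarity and sample size.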

Shen markers

The analysis was done in the same way as described here.

	BRAF	KRAS	TP53
Shen 3 clusters, CRC	0.8204	0.001973	0.1588

Laird markers

M value, 2 clusters

	BRAF	KRAS	TP53
Laird, 2 clusters, M value	0.2114	0.01141	0.5281
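These notes do not record which test produced the p-values above. For two clusters and a mutated/wild-type status, one common choice is Fisher's exact test on the 2x2 table; the sketch below assumes that choice. The function name and the counts are made up for illustration and are not the counts from this dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one (small relative tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Made-up counts: rows = cluster 1 / cluster 2, columns = mutated / wild type.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

With three clusters, as in the Shen table, the table is 3x2 and a chi-square test or an exact test generalised to larger tables would be needed instead.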

With Beta values I see a very similar trend in p-values, with some improvement in the association with KRAS.
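For reference, Beta and M values are related by a logit transform, M = log2(Beta / (1 - Beta)), if the small offsets used when computing them from probe intensities are ignored. A minimal sketch of the conversion follows; the function names and the clipping constant `eps` are assumptions for illustration.

```python
from math import log2


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta value in [0, 1] to an M value (logit, base 2)."""
    # Clip away from 0 and 1 so the logarithm stays finite.
    b = min(max(beta, eps), 1 - eps)
    return log2(b / (1 - b))


def m_to_beta(m):
    """Convert an M value back to a Beta value."""
    return 2 ** m / (2 ** m + 1)


for beta in (0.1, 0.5, 0.8):
    m = beta_to_m(beta)
    print(f"beta={beta:.2f}  M={m:+.3f}  back={m_to_beta(m):.2f}")
```

Because the transform is monotonic but nonlinear, clustering on M values and on Beta values can give somewhat different results, which fits the similar but not identical p-values seen here.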