Page Comparison

...

Important

...

update

...

(January

...

20th,

...

2011):

...

the

...

data

...

below

...

have

...

been

...

corrected

...

for

...

the

...

BCR

...

batch

...

which

...

is

...

not

...

necessarily

...

the

...

processing

...

batch.

...

The

...

dataset

...

needs

...

to

...

be

...

reanalyzed.

...

Correlation between BCR batch and the processing batch for 27k arrays in COAD (January 20, 2012)

Correlation between BCR batch and the processing batch for 27k arrays in READ (January 20, 2012):

Batch vs clinical traits, data collection

Correlation of batch with clinical traits (READ: can be found here, COAD: can be found here). READ, number of batches: 11, COAD, number of batches: 15

Additional technical variables to be considered in normalization: batch (6th field in the patient's barcode), center (2nd field in the patients' barcode), amount, barcode bottom (?), concentration, day, month, year (to be concatenated and considered as a single variable), plate row, plate column. This information is available from clinical_aliquot_public_CANCERTYPE.txt files.

DNA methylation: 27k, level1 downloaded on December 19th, 2011.

Total number of COAD patients (tumor/matched normal/unmatched normal) = 212

Total number of READ patients (tumor/matched normal/unmatched normal) = 81

Place them in the same directory and combine in a single matrix of M values (log2(methylated/unmethylated))

Total number of tumor samples: 237

Extracted batch and center information, relationship between them:

Code Block

> table(center,batch)
      batch
center 0820 0825 0904 1020 1110 1116 A004 A00B A081
    A6    4    2    0    04    0     AG   12   10    8    0    0   17    9    9    0     AY    0
   0 AA   0  26  0  5  2 31   044    04    0   15 0 {code} 17 For the combined13
DNA methylation dataset (237 patients)AF batch vs clinical traits4 significant correlations (P value0 = 0.05) are listed.0 Note: traits 1 and0 2 should be used for0 the calculation of the1 survival and then Cox0 model should be used0 to show that batch0
is predictive of survival. CompleteAG list can be12 found [here|^BatchClinicalInfoCorrelationsCOLON.csv] 10 {csv}"COLON_clinical traits","DataType","NumberOfNAs","Test","Pvalue" "days_to_last_followup","integer",0,"Kruskal-Wallis rank sum test",1.19E-23
"days_to_last_known_alive","integer",0,"Kruskal-Wallis rank sum test",2.63E-19
"days_to_form_completion","integer",0,"Kruskal-Wallis rank sum test",1.32E-14
"tumor_tissue_site","factor",2,"Pearson's Chi-squared test",3.41E-14
"histological_type","factor",7,"Pearson's Chi-squared test",9.69E-11
"year_of_initial_pathologic_diagnosis","integer",0,"Kruskal-Wallis rank sum test",3.77E-09
"anatomic_organ_subdivision","factor",2,"Pearson's Chi-squared test",8.00E-07
"vascular_invasion_present","factor",27,"Pearson's Chi-squared test",8.85E-06
"vital_status","factor",1,"Pearson's Chi-squared test",3.20E-05
"lymphatic_invasion_present","factor",12,"Pearson's Chi-squared test",3.53E-05
"residual_tumor","factor",6,"Pearson's Chi-squared test",5.04E-04
"number_of_lymphnodes_examined","integer",2,"Kruskal-Wallis rank sum test",1.48E-03
"kras_gene_analysis_performed","factor",1,"Pearson's Chi-squared test",5.44E-02
"gender","factor",0,"Pearson's Chi-squared test",6.25E-02
"person_neoplasm_cancer_status","factor",0,"Pearson's Chi-squared test",8.27E-02
"tumor_stage","factor",27,"Pearson's Chi-squared test",8.89E-02
{csv}

h4. Correlation with survival

Survival object, relevant clinical variables: days to death (8), days to the last follow up (11), days to the last know alive (12), vitalin status (48). Summaries:
{code} 8    0    0   17    9    9    0
    AY    0    0    0    0    2    0    0    0    0

For the combined DNA methylation dataset (237 patients) batch vs clinical traits significant correlations (P value = 0.05) are listed. Note: traits 1 and 2 should be used for the calculation of the survival and then Cox model should be used to show that batch is predictive of survival. Complete list can be found here

Correlation with survival

Survival object, relevant clinical variables: days to death (8), days to the last follow up (11), days to the last know alive (12), vitalin status (48). Summaries:

Code Block

> table(bio[,48]) # vital status

DECEASED   LIVING
      22      214
> summary(bio[,12]) # days to the last known alive
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0    31.0   204.3   332.0  1581.0
> summary(bio[,8]) # days to death
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   37.25  105.50  374.10  535.00 1581.00  215.00
> summary(bio[,11]) # days to the last follow - up
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0    30.0   197.1   332.0  1581.0{code}

It

...

is

...

interesting

...

to

...

see

...

that

...

days

...

to

...

the

...

last

...

follow

...

up

...

and

...

the

...

days

...

to

...

the

...

last

...

know

...

alive

...

are

...

almost

...

identical.

...

In

...

some

...

cases

...

when

...

days

...

to

...

the

...

last

...

known

...

alive

...

has

...

a

...

number

...

(not

...

0)

...

days

...

to

...

the

...

last

...

follow

...

up

...

is

...

showing

...

0.

...

I

...

decided

...

to

...

use

...

days

...

to

...

the

...

last

...

known

...

alive

...

instead

...

of

...

the

...

days

...

to

...

the

...

last

...

follow

...

up

...

for

...

the

...

construction

...

of

...

the

...

survival

...

object.

...

Image Added
There are no patients that would have no information for the days to the last follow up and vital status. Create the survival object:

Code Block

collapse	true

> head(x)
             VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath
TCGA-A6-2674           2            523              523        NA
TCGA-A6-2677           2            541              541        NA
TCGA-A6-2678           2            437              437        NA
TCGA-A6-2683           2            472              472        NA
TCGA-AA-3514           2             31                0        NA
TCGA-AA-3517           2             31                0        NA
> dim(x[is.na(x[, 3]) & is.na(x[, 4]), ])
[1] 0 4
> status<-rep(1,237)
> status[which(is.na(x[, 4]))] <- 0
> x[is.na(x[, 4]), 4] <- x[which(is.na(x[, 4]), 4), 2]
> x1<-cbind(x,status)
> head(x1)
             VitalStatus DaysKnownAlive DaysLastFollowUp DaysDeath status
TCGA-A6-2674           2            523              523       523      0
TCGA-A6-2677           2            541              541       541      0
TCGA-A6-2678           2            437              437       437      0
TCGA-A6-2683           2            472              472       472      0
TCGA-AA-3514           2             31                0        31      0
TCGA-AA-3517           2             31                0        31      0
> surv <- Surv(as.numeric(x1[, 4]), as.numeric(x1[, 5]))


# Kaplan Meier curve:
> plot(survfit(surv ~ 1), xlab = "days to death", ylab = "Probability", main = "Kaplan-Meier survival curve for TCGA \n COLON (COAD+READ) (237 patients)")



# Correlation of survival with batch:
> plot(survfit(surv ~ k[, 10]), xlab = "days to death", ylab = "Probability", col = 1:9, main = "TCGA COLON survival correlation with batch")
> legend(750, 0.4, levels(as.factor(k[,10])), text.col = 1:9,ncol=3)
> summary(coxph(surv ~ k[,10]))
Call:
coxph(formula = surv ~ k[, 10])

  n= 237, number of events= 22

                  coef  exp(coef)   se(coef)      z Pr(>|z|)
k[, 10]0825 -1.802e+01  1.499e-08  1.522e+04 -0.001   0.9991
k[, 10]0904  2.521e+00  1.244e+01  1.238e+00  2.036   0.0417 *
k[, 10]1020  1.686e+00  5.395e+00  1.170e+00  1.441   0.1497
k[, 10]1110  2.491e+00  1.207e+01  1.228e+00  2.028   0.0425 *
k[, 10]1116  1.584e+00  4.875e+00  1.429e+00  1.109   0.2676
k[, 10]A004 -1.790e+01  1.684e-08  4.977e+03 -0.004   0.9971
k[, 10]A00B  7.077e-01  2.029e+00  1.109e+00  0.638   0.5233
k[, 10]A081  4.812e-01  1.618e+00  1.197e+00  0.402   0.6876
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


exp(coef) exp(-coef) lower .95 upper .95
k[, 10]0825 1.499e-08  6.671e+07    0.0000       Inf
k[, 10]0904 1.244e+01  8.038e-02    1.0989    140.86
k[, 10]1020 5.395e+00  1.853e-01    0.5447     53.44
k[, 10]1110 1.207e+01  8.282e-02    1.0874    134.07
k[, 10]1116 4.875e+00  2.051e-01    0.2962     80.22
k[, 10]A004 1.684e-08  5.940e+07    0.0000       Inf
k[, 10]A00B 2.029e+00  4.928e-01    0.2309     17.83
k[, 10]A081 1.618e+00  6.181e-01    0.1550     16.89

Concordance= 0.743  (se = 0.082 )
Rsquare= 0.083   (max possible= 0.521 )
Likelihood ratio test= 20.5  on 8 df,   p=0.008608
Wald test            = 9.14  on 8 df,   p=0.331
Score (logrank) test = 20.77  on 8 df,   p=0.007792{code}

Kaplan

...

-

...

Meier

...

curve

...

and

...

correlation

...

with

...

batch:

Image Added Image Added

Data normalization

For the normalization I didn't split the probes into "red" and "green".

...

Approach:

...

perform

...

SVD,

...

identify

...

tech

...

variables

...

that

...

are

...

highly

...

correlated

...

with

...

the

...

first

...

PCs

...

and

...

remove

...

the

...

variables.

...

Use

...

M

...

value

...

to

...

express

...

methylation

...

status

...

of

...

each

...

probes.

...

The

...

M

...

value

...

was

...

calculated

...

in

...

the

...

following

...

way

...

(the

...

code

...

was

...

take

...

from

...

Bioconductor

...

lumi

...

package,

...

functions

...

for

...

the

...

analysis

...

of

...

Illumina

...

Human

...

Methylation

...

27k

...

data):

...

SVD

...

on

...

the

...

whole

...

matrix,

...

relative

...

variance

...

before

...

normalization:

...

Image Added

Created a file with technical variables, merged day, month and year into a single variable. Tech file, head:

Code Block

> head(k)
  row.names bcr_sample_barcode          bcr_aliquot_barcode  amount biospecimen_barcode_bottom
1        63   TCGA-A6-2674-01A TCGA-A6-2674-01A-02D-0820-05 26.7 uL                 0096500431
2       150   TCGA-A6-2677-01A TCGA-A6-2677-01A-01D-0820-05 26.7 uL                 0096500430
3       182   TCGA-A6-2678-01A TCGA-A6-2678-01A-01D-0820-05 26.7 uL                 0096500407
4       334   TCGA-A6-2683-01A TCGA-A6-2683-01A-01D-0820-05 26.7 uL                 0096500406
5      1186   TCGA-AA-3514-01A TCGA-AA-3514-01A-02D-0820-05 26.7 uL                 0096500383
6      1254   TCGA-AA-3517-01A TCGA-AA-3517-01A-01D-0820-05 26.7 uL                 0096500382
  concentration  shipment plate_column plate_row batch      shortID
1    0.14 ug/uL 22-2-2010            1         B  0820 TCGA-A6-2674
2    0.15 ug/uL 22-2-2010            1         C  0820 TCGA-A6-2677
3    0.15 ug/uL 22-2-2010            1         D  0820 TCGA-A6-2678
4    0.16 ug/uL 22-2-2010            1         E  0820 TCGA-A6-2683
5    0.15 ug/uL 22-2-2010            1         F  0820 TCGA-AA-3514
6    0.13 ug/uL 22-2-2010            1         G  0820 TCGA-AA-3517{code}

Ignore

...

bcr_sample_barcode,

...

bcr_aliquot_barcode,

...

these

...

are

...

not

...

tech

...

variables.

...

Here

...

are

...

the

...

correlation

...

of

...

the

...

tech

...

variables

...

with

...

the

...

first

...

4

...

PCs.

...

Code Block

> k
   bcr_sample_barcode bcr_aliquot_barcode     amount biospecimen_barcode_bottom concentration     shipment
V1          0.4700026           0.4877315 0.04672452                  0.3017151  1.601134e-07 8.334984e-34
V2          0.4510048           0.4877315 0.48950693                  0.5316963  3.392774e-01 5.699460e-03
V3          0.4510048           0.4877315 0.18777379                  0.6925323  1.141026e-01 7.733551e-02
V4          0.4513158           0.4877315 0.58958142                  0.6705972  6.211816e-01 6.664406e-02
   plate_column   plate_row        batch   shortID
V1 3.009061e-05 0.001491552 2.491659e-33 0.4702436
V2 1.947964e-01 0.989450736 1.349671e-04 0.4511082
V3 9.369050e-01 0.463424193 1.437189e-01 0.4511082
V4 3.395272e-01 0.688070623 1.012497e-01 0.4514159
{code}

Outliers of the first principal component:
!Colon_Mval_noNorm_PC1outliers.png|thumbnail!
Seems that the batch has the biggest effect on the data. Remove the batch, relative variance after that:

!Colon_RelativeVariance_noBatch.png|thumbnail!
Calculate&nbsp;correlation with the clinical variables after removing the batch effect:
{code}

Outliers of the first principal component:
Image Added
Seems that the batch has the biggest effect on the data. Remove the batch, relative variance after that:

Image Added
Calculate correlation with the clinical variables after removing the batch effect:

Code Block

> k
   bcr_sample_barcode bcr_aliquot_barcode    amount biospecimen_barcode_bottom concentration  shipment
V1          0.4691690           0.4877315 0.6106733                  0.5489680     0.4263794 0.9956323
V2          0.4534001           0.4877315 0.8880134                  0.7069529     0.4187061 0.9950998
V3          0.4536645           0.4877315 0.8589996                  0.7271536     0.4000974 0.9856611
V4          0.4543147           0.4877315 0.8880134                  0.6459688     0.9405449 0.9999056
   plate_column plate_row     batch   shortID
V1   0.04191285 0.5627797 0.9981536 0.4690814
V2   0.97317059 0.6821160 0.9999954 0.4536160
V3   0.55783761 0.0703874 0.9992998 0.4537400
V4   0.52248494 0.8718225 0.9999849 0.4543834{code}

Seems

...

that

...

this

...

took

...

care

...

of

...

all

...

the

...

problems,

...

there

...

is

...

still

...

a

...

borderline significant correlation of the PC1 with the plate column, not going to worry about it.

Look at the outliers of the first principal component after removing the batch:

Image Added

Conclusion: consider the data to be normalized. eSet object for this data set is available.

Version	Old Version 11	New Version Current
Changes made by	Vitalina Komashko	Vitalina Komashko
Saved on	Jan 20, 2012	Jan 20, 2012

Versions Compared

Key

Correlation between BCR batch and the processing batch for 27k arrays in COAD (January 20, 2012)

Correlation between BCR batch and the processing batch for 27k arrays in READ (January 20, 2012):

Batch vs clinical traits, data collection

Correlation with survival

Data normalization