Page Comparison

...

Get the patient barcodes from the DNA methylation data (511 patients). Truncate the barcodes to their first 3 "-"-separated fields, then request the data for download from TCGA: OV cancer, level 3, expression-gene, pasting in the 511 truncated barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 of the barcodes, Affymetrix data was marked as not "applicable".
Get the download link and download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011.
Get the clinical files for the same 511 patient barcodes and download them to the same folder.
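
The barcode truncation step could be sketched in R as follows. This is only an illustration: the helper name truncateBarcode is hypothetical, and the two example barcodes are the dotted column names that appear in the methylation data further down this page.

```r
# Hypothetical helper: keep the first 3 fields of a TCGA barcode and
# join them with "-". `sep` is a regular expression for the separator
# used in the input ("[.]" for dotted column names, "-" otherwise).
truncateBarcode <- function(barcodes, sep = "[.]") {
  fields <- strsplit(barcodes, sep)
  unique(sapply(fields, function(f) paste(f[1:3], collapse = "-")))
}

# Example input taken from the methylation column names
methBarcodes <- c("TCGA.04.1331.01A.01D.0432.05",
                  "TCGA.04.1332.01A.01D.0432.05")
patients <- truncateBarcode(methBarcodes)
print(patients)
```

The resulting vector can be written out with writeLines() and pasted into the download form.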

The data are stored as one file per patient with the following columns: barcode, gene symbol, value. The entire first column holds the patient barcode. The code below extracts the data into a single list, which is then combined into a single matrix:

Code Block

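# Read every per-patient expression file in the working directory and
# collect the barcodes, expression values and gene symbols.
exprProcess <- function() {
  # Escape the dot and anchor the pattern so only names ending in ".txt" match
  files <- list.files(pattern = "\\.txt$")
  patientIDs <- vector("list", length(files))
  data <- vector("list", length(files))
  # Gene symbols are taken from the first file; all files are assumed
  # to list the genes in the same order
  y <- read.table(files[1], skip = 1, header = FALSE, stringsAsFactors = FALSE)
  geneSymbols <- y[, 2]
  for (i in seq_along(files)) {
    print(files[i])
    patExpr <- read.table(files[i], skip = 1, header = FALSE, stringsAsFactors = FALSE)
    patientIDs[[i]] <- patExpr[1, 1]  # barcode is repeated down column 1
    data[[i]] <- patExpr[, 3]         # expression values
    print(i)                          # progress counter
  }
  return(list(Data = data, patNames = patientIDs, genes = geneSymbols))
}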

After that I combined the per-patient list of datasets into a single matrix, with geneSymbols as row names and patientIDs as column names.
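
A minimal sketch of that combining step, assuming the list has the shape returned by exprProcess(). The toy values and the gene names GENE1 to GENE3 are made up for illustration; the two barcodes are the first two column names shown further down.

```r
# Toy stand-in for the list returned by exprProcess()
res <- list(
  Data = list(c(5.1, 6.2, 7.3), c(4.9, 6.0, 7.8)),
  patNames = list("TCGA-04-1331-01A-01R-0434-01",
                  "TCGA-04-1332-01A-01R-0434-01"),
  genes = c("GENE1", "GENE2", "GENE3")
)

# Bind the per-patient vectors column-wise: genes in rows, patients in columns
expr <- do.call(cbind, res$Data)
rownames(expr) <- res$genes
colnames(expr) <- unlist(res$patNames)
print(dim(expr))
```

This relies on every file listing the genes in the same order, which is worth checking before trusting the row names.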

The total number of files obtained is 522. Sample types represented in this set:

Code Block

> colnames(expr)[1:4]
[1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
[3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
> sp<-strsplit(colnames(expr),split="-")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D 02A
481  24   2   1  14

So all the patients are represented by tumor samples (01-09: tumor types; 01: solid tumor; 02: recurrent solid tumor; A-D: vial count as pertains to an individual patient sample).
For comparison, I also looked at the samples in the methylation dataset:

Code Block

> mb<-read.table("methylated_batch.txt",header=T,row.names=1)
> length(colnames(mb))
[1] 511
> colnames(mb)[1:4]
[1] "TCGA.04.1331.01A.01D.0432.05" "TCGA.04.1332.01A.01D.0432.05"
[3] "TCGA.04.1335.01A.01D.0432.05" "TCGA.04.1336.01A.01D.0432.05"
> sp<-strsplit(colnames(mb),"[.]")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D
483  25   2   1

So for the expression analysis I need to drop all samples from recurrent tumors (02), and the expression dataset also has fewer 01A and 01B samples than the methylation dataset (481 vs. 483 and 24 vs. 25). The solution is to match the barcodes using only the first 4 fields, because the remaining fields necessarily differ between the two datasets (for example the analyte code, D for DNA or R for RNA). As a result I got 507 patients with gene expression data (no recurrent tumor samples); relative to the methylation set, 3 are missing from the 01A group and 1 from the 01B group. I think I will stick with this group for the subsequent normalization steps and analyses.
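
The matching on the first 4 barcode fields could look roughly like the sketch below. It is not the code that was actually run: the helper key4 is hypothetical, and the barcode "TCGA-XX-0000-02A-01R-0434-01" is invented to show a recurrent-tumor sample being dropped.

```r
# Hypothetical helper: first 4 barcode fields joined with "-".
# `sep` is a regular expression for the separator used in the input.
key4 <- function(barcodes, sep) {
  sapply(strsplit(barcodes, sep), function(f) paste(f[1:4], collapse = "-"))
}

exprIDs <- c("TCGA-04-1331-01A-01R-0434-01",
             "TCGA-04-1332-01A-01R-0434-01",
             "TCGA-XX-0000-02A-01R-0434-01")   # invented 02A sample
methIDs <- c("TCGA.04.1331.01A.01D.0432.05",
             "TCGA.04.1332.01A.01D.0432.05",
             "TCGA.04.1335.01A.01D.0432.05")

exprKey <- key4(exprIDs, "-")
methKey <- key4(methIDs, "[.]")

# Keep expression samples that are not recurrent tumors (02)
# and that have a methylation counterpart
keep <- !grepl("-02[A-Z]$", exprKey) & exprKey %in% methKey
matched <- exprKey[keep]
missingFromExpr <- methKey[!methKey %in% exprKey]
print(matched)
print(missingFromExpr)
```

Applied to the real matrices, keep would be used to subset the columns of expr, and match(matched, methKey) would put the methylation columns in the same order.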

...


Normalization

The next step is to run PCA, which I will use to identify variables that affect the data.
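
One way this PCA step might be approached is sketched below on simulated data. Everything in it is an assumption for illustration: the toy matrix stands in for the real expression matrix, the batch covariate is hypothetical, and the Kruskal-Wallis test is just one possible way to check whether a principal component tracks a variable.

```r
set.seed(1)

# Toy matrix: 50 genes (rows) x 12 samples (columns)
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:12)))

# Hypothetical covariate, e.g. a processing batch
batch <- factor(rep(c("A", "B"), each = 6))

# Add a known shift to batch B so the toy data contain a batch effect
expr[, batch == "B"] <- expr[, batch == "B"] + 2

# prcomp expects samples in rows, so transpose the gene x sample matrix
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# Test whether the PC1 scores differ between the levels of the covariate
pc1 <- pca$x[, 1]
pval <- kruskal.test(pc1, batch)$p.value
print(round(varExplained[1:3], 3))
print(pval)
```

Repeating the test for the first few components and for each candidate variable gives a quick picture of which variables dominate the variance.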

...

Version	Old Version 8	New Version 9
Changes made by	Vitalina Komashko (Unlicensed)	Vitalina Komashko (Unlicensed)
Saved on	Sept 29, 2011	Oct 03, 2011
