  1. Get the patient barcodes from the DNA methylation data (511 patients). Truncate the barcodes to their first 3 fields ("-" separated), then request the data for download from TCGA: OV cancer, level 3, expression-gene, pasting in the 511 truncated barcodes. Platform of choice: BI HT_HG-U133A (Affymetrix). For 3 barcodes, Affymetrix data wasn't "applicable".
  2. Get the download link, download the data to belltown: /work/DAT_002__TCGA_Ovarian/01_2011/vita_work/data/AffyExpression_Sep2011. 
  3. Get clinical files for the same 511 patient barcodes. Download to the same folder.
  4. Data is stored as one file per patient with the following header: barcode, gene symbol, value. Every row of the first column holds the patient barcode. The code below extracts the data into a list, which is then combined into a single matrix:
    Code Block
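    # Read every per-patient expression file in the working directory.
    exprProcess<-function(){
      patientIDs<-list()
      data<-list()
      # list.files() takes a regular expression, so escape the dot and anchor the extension
      files<-list.files(pattern="\\.txt$")
      y<-read.table(files[1],skip=1,header=FALSE,stringsAsFactors=FALSE)
      geneSymbols<-y[,2]
      for (i in seq_along(files)){
        print(files[i])
        patExpr<-read.table(files[i],skip=1,header=FALSE,stringsAsFactors=FALSE)
        # the barcode is repeated down column 1, so the first row is enough
        patientIDs[[i]]<-patExpr[1,1]
        data[[i]]<-patExpr[,3]
        print(i)
      }
      return(list(Data=data,patNames=patientIDs,genes=geneSymbols))
    }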
    After that I combined the list of datasets from each patient to a single matrix, geneSymbols were used as row names and patientIDs as column names. 
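
The combining step is not shown on this page, so here is a minimal sketch of one way to do it. It assumes every file lists the genes in the same order; the toy list, its values, and the gene symbols are made up for illustration and only mimic the shape of what exprProcess() returns.

```r
# Toy stand-in for the list returned by exprProcess(): 3 genes, 2 patients.
# Values and gene symbols are hypothetical.
res <- list(
  Data = list(c(5.1, 7.3, 6.0), c(4.8, 7.9, 6.2)),
  patNames = list("TCGA-04-1331-01A-01R-0434-01",
                  "TCGA-04-1332-01A-01R-0434-01"),
  genes = c("GENE1", "GENE2", "GENE3")
)

# Bind the per-patient vectors column-wise into one genes x patients matrix.
expr <- do.call(cbind, res$Data)
rownames(expr) <- as.character(res$genes)
colnames(expr) <- unlist(res$patNames)

dim(expr)
```

If the gene order could differ between files, each vector would need to be reordered by gene symbol before binding.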

The total number of files obtained is 522. Sample types represented in this set:

Code Block
> colnames(expr)[1:4]
[1] "TCGA-04-1331-01A-01R-0434-01" "TCGA-04-1332-01A-01R-0434-01"
[3] "TCGA-04-1335-01A-01R-0434-01" "TCGA-04-1336-01A-01R-0434-01"
> sp<-strsplit(colnames(expr),split="-")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D 02A
481  24   2   1  14

So all the patients are represented by tumor samples. In the fourth barcode field, the two digits give the sample type (01-09: tumor types; 01: solid tumor; 02: recurrent solid tumor) and the letter (A-D) is the vial, counted per patient sample.
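
As a small illustration of how that field breaks down, the sketch below splits it into the type code and the vial letter. The lookup table covers only the two codes described above, and the variable names are my own.

```r
# Fourth barcode field for a few samples (values taken from the table above).
samField <- c("01A", "01B", "02A")

typeCode <- substr(samField, 1, 2)  # two-digit sample type
vial     <- substr(samField, 3, 3)  # vial letter

# Lookup limited to the two codes mentioned in the text.
labels <- c("01" = "solid tumor", "02" = "recurrent solid tumor")
typeLabel <- unname(labels[typeCode])

data.frame(samField, typeCode, vial, typeLabel)
```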
For comparison, I also looked at the samples in the methylation dataset:

Code Block
> mb<-read.table("methylated_batch.txt",header=T,row.names=1)
> length(colnames(mb))
[1] 511
> colnames(mb)[1:4]
[1] "TCGA.04.1331.01A.01D.0432.05" "TCGA.04.1332.01A.01D.0432.05"
[3] "TCGA.04.1335.01A.01D.0432.05" "TCGA.04.1336.01A.01D.0432.05"
> sp<-strsplit(colnames(mb),"[.]")
> samType<-sapply(sp,"[",4)
> table(samType)
samType
01A 01B 01C 01D
483  25   2   1

So for the expression analysis I need to drop all samples from recurrent tumors (02). The expression dataset is also short of a few 01A and 01B samples compared with the methylation dataset (481 vs. 483 and 24 vs. 25). The solution is to match the patient IDs using only the first 4 fields, because the remaining fields differ between the two datasets (for example, the analyte letter is D for DNA and R for RNA). As a result I got 507 patients for the gene expression data (no recurrent tumors); 3 are missing from the 01A group and 1 from the 01B group. I think I will stick to this group for the future normalization steps and analyses.
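
The matching itself could look roughly like the sketch below. It is not the code I ran, just an illustration on toy barcodes: the first two IDs in each vector come from the output above, the third in each is made up, and key4 is a hypothetical helper name.

```r
# Toy barcodes; the third entry in each vector is hypothetical.
exprIDs <- c("TCGA-04-1331-01A-01R-0434-01",
             "TCGA-04-1332-01A-01R-0434-01",
             "TCGA-04-1335-02A-01R-0434-01")
methIDs <- c("TCGA.04.1331.01A.01D.0432.05",
             "TCGA.04.1332.01A.01D.0432.05",
             "TCGA.04.1336.01A.01D.0432.05")

# Build a key from the first 4 barcode fields, joined with "-".
key4 <- function(ids, sep) {
  vapply(strsplit(ids, sep),
         function(p) paste(p[1:4], collapse = "-"),
         character(1))
}

exprKey <- key4(exprIDs, "-")
methKey <- key4(methIDs, "[.]")  # read.table turned "-" into "." in column names

# Keep expression samples that have a methylation match and are not recurrent (02).
keep <- exprKey %in% methKey & !grepl("-02[A-Z]$", exprKey)
matched <- exprKey[keep]
matched
```

The same logical vector can be used to subset the columns of the expression matrix, e.g. expr[, keep].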

Normalization

The next step is to run PCA, which I will use to identify variables that affect the data.
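
A minimal sketch of that PCA step, using base R's prcomp on a random toy matrix in place of the real expression data. Centering without scaling is an assumption here, not a decision already made for this dataset.

```r
# Toy genes x samples matrix standing in for the real expression matrix.
set.seed(1)
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))

# prcomp expects observations in rows, so transpose to samples x genes.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Fraction of total variance carried by each principal component.
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample scores on the first two components.
head(pca$x[, 1:2])
```

The scores in pca$x can then be plotted or tabulated against candidate variables to see which ones line up with the leading components.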