Procedure of fundamental data analysis
Here we describe how to construct a pipeline that normalizes the data, identifies differentially expressed genes with the t-test, and clusters the data. The example file can be obtained at http://xip.hgc.jp/samples/prostateTudoPronto.csv . The sample data consists of 57 columns (one per subject) and about 24,000 rows (one per microarray probe). The first 32 columns from the left are gene expression values from normal subjects, and the remaining 25 are from subjects with prostate tumors. This pipeline uses the R server, so please set up the R server beforehand.
Normalization
Microarray data are usually normalized before any further analysis. Here, normalization is carried out with the Fast Loess algorithm, using the component Normalize fast Loess.
XML file of the entire pipeline
Assembling pipeline
As shown in the figure above, connect the following components: Input EDF to R, Normalize fast Loess, Export RMatrix to JMatrix, and General JData viewer & editor. The parameters of each component are set up as shown below.
Inputting of the parameters to each component
- Input EDF to R
  - EDF File Name: http://xip.hgc.jp/samples/prostateTudoPronto.csv
- Normalize fast Loess
  - EDF: edf
- Export RMatrix to JMatrix
  - Matrix Name: normalized$expr
- General JData viewer & editor
  - no change
Run and Result
The normalized data is a large matrix of approximately 24,000 rows and 57 columns.
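The internals of the Normalize fast Loess component are not described on this page. As a rough illustration only, the sketch below shows one common formulation of fast loess normalization in plain R: each array is adjusted toward the average of all arrays by subtracting a smoothed, intensity-dependent trend. The function name `fast_loess_sketch`, its parameters, and the simulated matrix are assumptions made for this example and are not part of CelliP.

```r
# Hypothetical helper, not the CelliP component: a simplified fast loess
# normalization for a numeric matrix (rows = probes, columns = samples).
fast_loess_sketch <- function(x, span = 0.7, iterations = 3) {
  for (k in seq_len(iterations)) {
    ref <- rowMeans(x)                       # reference profile: mean of all arrays
    for (i in seq_len(ncol(x))) {
      dev <- x[, i] - ref                    # deviation of array i from the reference
      fit <- lowess(ref, dev, f = span)      # smooth trend of deviation vs. intensity
      trend <- approx(fit$x, fit$y, xout = ref, ties = mean)$y
      x[, i] <- x[, i] - trend               # remove the trend from array i
    }
  }
  x
}

# Simulated data (not the prostate data set): 500 probes, 3 samples,
# the third sample carrying an artificial intensity-dependent bias.
set.seed(1)
base <- rnorm(500, mean = 8, sd = 2)
raw <- cbind(base + rnorm(500, sd = 0.1),
             base + rnorm(500, sd = 0.1),
             1.1 * base + 0.5 + rnorm(500, sd = 0.1))
normalized_expr <- fast_loess_sketch(raw)

# Column means before and after; the bias of the third sample should shrink.
print(round(colMeans(raw), 2))
print(round(colMeans(normalized_expr), 2))
```

The normalized matrix keeps the dimensions of the input, which matches the behavior described above, where the result still has one row per probe and one column per sample.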
t-test
Next, we apply the t-test to the normalized data. In this example, the test compares two columns of the normalized matrix: column 1 (a normal sample) and column 33 (a tumor sample). The component used is T-test.
XML file of the entire pipeline
Assembling pipeline
Delete the component Normalize fast Loess from the pipeline and add the following components to the canvas: T-test, R evaluated result to log, Export RPrimitive to JPrimitive x 2, Export RVector to JVector x 2, and General JData viewer & editor x 4. Connect them as illustrated in the figure above. The components R evaluated result to log, Export RPrimitive to JPrimitive x 2, Export RVector to JVector x 2, and General JData viewer & editor x 4 are used only to make the results easier to view. The parameters of each component are listed below.
Inputting of the parameters to each component
- Input EDF to R
  - EDF File Name: http://xip.hgc.jp/samples/prostateTudoPronto.csv (already typed above)
- Normalize fast Loess
  - EDF: edf (already typed above)
- T-test
  - x: normalized$expr[,1] (data in the 1st column, a normal sample)
  - y: normalized$expr[,33] (data in the 33rd column, a tumor sample)
  - result: ttest
- R evaluated result to log
  - X: ttest
- 1st Export RPrimitive to JPrimitive
  - Primitive Name: ttest$statistic
- 2nd Export RPrimitive to JPrimitive
  - Primitive Name: ttest$p.value
  - JType: String
- 1st Export RVector to JVector
  - Vector Name: ttest$conf
- 2nd Export RVector to JVector
  - Vector Name: ttest$estimate
- 1st General JData viewer & editor
  - Tab Name: t value
- 2nd General JData viewer & editor
  - Tab Name: p-value
- 3rd General JData viewer & editor
  - Tab Name: 95 percent confidence interval
- 4th General JData viewer & editor
  - Tab Name: mean of x mean of y
Run and Result
The results of the t-test are displayed in the following tabs: "t value", "p-value", "95 percent confidence interval", and "mean of x, mean of y".
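To see what these exported values look like, the following sketch calls R's built-in `t.test` on two simulated vectors that stand in for `normalized$expr[,1]` and `normalized$expr[,33]`. It assumes that the T-test component wraps `t.test` with default settings, which this page does not state explicitly. The simulated matrix and all variable names other than `ttest` are illustrative only.

```r
# Simulated stand-in for normalized$expr: 200 probes x 57 samples
# (columns 1-32 normal, columns 33-57 tumor). Not the real data set.
set.seed(42)
expr <- matrix(rnorm(200 * 57, mean = 8), nrow = 200, ncol = 57)

# Same comparison as in the pipeline: column 1 versus column 33.
ttest <- t.test(expr[, 1], expr[, 33])

print(ttest$statistic)  # t value
print(ttest$p.value)    # p-value
print(ttest$conf.int)   # 95 percent confidence interval
                        # (ttest$conf reaches the same element by partial matching)
print(ttest$estimate)   # mean of x, mean of y

# Possible extension (not part of the pipeline above): one t-test per probe,
# comparing the 32 normal columns with the 25 tumor columns.
normal_cols <- 1:32
tumor_cols <- 33:57
p_values <- apply(expr, 1, function(probe) {
  t.test(probe[normal_cols], probe[tumor_cols])$p.value
})
print(head(sort(p_values)))  # smallest p-values first
```

The per-probe loop is one way to screen for differentially expressed genes across all subjects. With about 24,000 probes, the resulting p-values would normally need a multiple-testing correction, for example with `p.adjust`.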
Clustering
Next, we cluster the normalized data with the hierarchical clustering algorithm. Because clustering all 24,000 probes takes a long time, we limit the analysis here to 1,000 probes. The file containing the 1,000 probes can be obtained at http://xip.hgc.jp/samples/test1000.csv . The components used are Hierarchical clustering and Hierarchical clustering plot.
XML file of the entire pipeline
Assembling pipeline
Construct the pipeline as illustrated in the figure above by using the following components: Edit R script x 2, Hierarchical clustering, and Hierarchical clustering plot.
Inputting of the parameters to each component
- Input EDF to R
  - EDF File Name: http://xip.hgc.jp/samples/test1000.csv
- Normalize fast Loess
  - EDF: edf (already typed above)
- 1st Edit R script
  - R script: colname <- normalized$attribute[1,]; collength <- length(colname); colname <- colname[2:collength];
- 2nd Edit R script
  - R script: nn <- normalized; nn$expr <- t(normalized$expr); rownames(nn$expr) <- t(colname)
- Hierarchical clustering
  - x: dist(nn$expr)
- Hierarchical clustering plot
  - x: result
Run and Result
When the pipeline is run, the result of the clustering algorithm is displayed in a dialog window. The labels 0 to 31 represent the normal samples, and the labels 32 to 56 represent the tumor samples. The plot on the right shows that the tumor samples and the normal samples fall into distinct clusters.
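The clustering steps above can be approximated in plain R as follows. This is a sketch under assumptions: it uses simulated data instead of test1000.csv, assumes that the Hierarchical clustering component calls `hclust` with its default (complete) linkage, and sets the sample labels 0 to 56 directly rather than reading them from `normalized$attribute`.

```r
# Simulated stand-in for normalized$expr: 1,000 probes x 57 samples.
# The first 200 probes are shifted upward in the 25 tumor columns so that
# the two groups are separable. Not the real data set.
set.seed(7)
expr <- matrix(rnorm(1000 * 57), nrow = 1000, ncol = 57)
expr[1:200, 33:57] <- expr[1:200, 33:57] + 4

# Transpose so that each row is a sample, as in the second Edit R script,
# and label the samples 0..56 (0-31 normal, 32-56 tumor).
samples <- t(expr)
rownames(samples) <- as.character(0:56)

# Euclidean distances between samples, then hierarchical clustering.
result <- hclust(dist(samples))

# Draw the dendrogram when run in an interactive R session.
if (interactive()) {
  plot(result, main = "Hierarchical clustering of samples (simulated)")
}

# Cut the tree into two clusters and compare them with the known groups.
clusters <- cutree(result, k = 2)
group <- rep(c("normal", "tumor"), times = c(32, 25))
print(table(clusters, group))
```

The transpose matters because `dist` computes distances between rows. Without it, the probes would be clustered instead of the samples.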