
ttest

Synopsis

Perform various two-sided statistical analyses of data that is divided into two groups, including the Student's t-test. Each row (gene) in the data file is given a p value.

 ttest
	[-r: format line needs to be removed
	 -w: use Welch approximate t
	 -m: do Mann-Whitney U (a.k.a. Wilcoxon rank-sum) test instead (this is experimental)
	 -rank: use rank transformation of the data for ttest
	 -l: log transform the data]
	<data> <layout>

Example (csh/tcsh syntax; >! overwrites an existing output file):

ttest -r -rank affydatafile.txt affydatafile-layout | /usr/local/bin/sort -gk 3 >! test.rank.out
Produces tab-delimited output:

label        stat              p                     fold
160901_at    7.38548945875996  7.50313817077242e-07  0.354838709677419
94821_at     7.38548945875996  7.50313817077242e-07  0.354838709677419
104139_at    6.3733037727251   5.29481385225239e-06  0.372549019607843
93974_at     6.3733037727251   5.29481385225239e-06  0.372549019607843
98579_at     6.3733037727251   5.29481385225239e-06  0.372549019607843
104312_at    5.61862983682549  2.48391733865816e-05  0.390728476821192
102362_i_at  5.61281765999638  2.514365670514e-05    0.390728476821192
94490_at     5.61281765999638  2.514365670514e-05    0.390728476821192

(The -rank option typically yields many p values that are equal.)
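
One way to see why equal p values are common with -rank: after rank transformation, the statistic depends only on the ordering of the values in a row, so rows with the same ordering give identical results. The following is a minimal Python sketch, not the script's own code. It assumes ranks are assigned within each row across all samples and that ties receive the average rank; the script's actual tie handling is not documented here, and `rank_transform` is a hypothetical helper name.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; tied values get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# Two rows with very different magnitudes but the same ordering (made-up values).
row_a = [1.2, 3.4, 2.2, 10.0, 12.5, 11.1]
row_b = [100.0, 900.0, 500.0, 5000.0, 9000.0, 7000.0]
print(rank_transform(row_a))
print(rank_transform(row_b))
```

Both rows map to the same ranks, so any statistic computed from the ranks, and therefore the p value, is the same for both.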

Inputs

  • data: A tab-delimited data file, where each row represents a set of measurements to be analyzed. A p value is generated for each row in the file. See the detailed instructions for the format.
  • layout: A simple file describing the experimental design. See the documentation of the format.

Outputs

The output contains the following columns:

  • The gene identifier
  • The statistic ('t', or 'u' if the M-W test was used)
  • The two-sided p value.
  • The 'fold change' between the two groups. This is provided as a convenience and is not directly used in the analysis.
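
For downstream processing without GNU sort, the four columns can be read with a few lines of Python. This is only a sketch: it assumes the output starts with the header line shown in the synopsis (label, stat, p, fold), and `read_ttest_output` is a hypothetical helper name, not part of ttest.

```python
import csv
import io


def read_ttest_output(handle):
    """Parse tab-delimited ttest output into a list of dicts with numeric fields."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "label": rec["label"],
            "stat": float(rec["stat"]),
            "p": float(rec["p"]),
            "fold": float(rec["fold"]),
        })
    return rows


# Two rows taken from the example output, in unsorted order.
sample = (
    "label\tstat\tp\tfold\n"
    "104312_at\t5.61862983682549\t2.48391733865816e-05\t0.390728476821192\n"
    "160901_at\t7.38548945875996\t7.50313817077242e-07\t0.354838709677419\n"
)

# Sort by p value, smallest first (comparable to `sort -gk 3`).
rows = sorted(read_ttest_output(io.StringIO(sample)), key=lambda r: r["p"])
for r in rows:
    print(r["label"], r["p"])
```

In practice the `io.StringIO(sample)` argument would be replaced by an open file handle on the saved output.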

Options

  • -r: The data file includes an extra line after the first line. (See the data format page for an explanation.)
  • -rank: Use the rank transformation of the data. The ranks are used instead of the raw data, giving a nonparametric version of the t-test.
  • -m: EXPERIMENTAL: Do the Mann-Whitney (Wilcoxon) test instead (a nonparametric test). Note that in the current implementation the p values this yields are not very accurate for small numbers of samples.
  • -l: Use the log transformation of the data. Do not use this if your data includes non-positive values.
  • -w: Use the Welch 'approximate t' (applied when the variances of the two groups are not equal).
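
To illustrate the difference between the default test and -w, the sketch below computes the pooled-variance Student's t and the Welch approximate t for one row. It is written in Python for illustration and is not taken from the script. The degrees of freedom for the Welch case use the Welch-Satterthwaite formula, which is an assumption about what the script does; the function names are hypothetical. Converting t and the degrees of freedom to a p value requires the t distribution and is omitted.

```python
import math


def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, v


def student_t(a, b):
    """Pooled-variance t statistic and its degrees of freedom."""
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    t = (ma - mb) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, df


def welch_t(a, b):
    """Welch approximate t with Welch-Satterthwaite degrees of freedom."""
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    na, nb = len(a), len(b)
    sa, sb = va / na, vb / nb  # squared standard errors of the two means
    t = (ma - mb) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


# Made-up measurements for two groups of three samples each.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
print(student_t(group_a, group_b))
print(welch_t(group_a, group_b))
```

With equal group sizes and equal variances, as in this example, the two statistics coincide; they diverge as the variances or group sizes become unequal.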

Dependencies

  • Stats.pm
  • This isn't a dependency, but GNU sort is useful for processing the output.

Problems/bugs

  • The U test is approximate and is not very accurate for small numbers of samples (fewer than 20 or so).
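
This page does not say which approximation the script uses. A common choice is the normal approximation to the distribution of U, and the Python sketch below assumes that choice (with no tie or continuity correction) to show why small samples are a problem. The function name is hypothetical and the code is not the script's implementation.

```python
import math


def mann_whitney_u_normal(a, b):
    """U statistic for group a and a two-sided p value from the normal approximation."""
    na, nb = len(a), len(b)
    # U counts pairs (x from a, y from b) with x > y; ties count one half.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    mu = na * nb / 2.0                                   # mean of U under the null
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12.0)    # std. dev. of U, no ties
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))               # two-sided normal tail area
    return u, p


# Made-up data: two completely separated groups of three samples each.
u, p = mann_whitney_u_normal([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(u, p)
```

For these two groups the approximation gives a p value of roughly 0.05, whereas the exact two-sided p value is 0.1 (2 of the 20 possible arrangements are this extreme), which shows how far off the approximation can be when samples are few.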

Version history

Script

References

Many of the methods were implemented with help from Zar, Biostatistical Analysis.
Numerical Recipes in C is an invaluable book for programming statistical distributions, even when programming in another language.