Pattern or template definition

Patterns, or templates, are simple files that define an expression pattern of interest. This page explains how to define templates and set up the files. Templates are used as inputs to patternmatch.

A template file is just a one-line list of numbers, and might look like this:

0 0 0 1 1 1 0 0 0 .5 .5 .5

where each number represents one column in the data file.

Important points to remember

  • The terms template and pattern are used interchangeably.
  • The template file must be a plain text file. You can create one in Excel or Word, or any other program with the option to "Save as text". For instructions on how to do this in Excel, follow this link.
  • The template must be on the first line of the file.
  • There must be a number in the template file for each and every data column. Missing columns will yield undefined results.
  • The numbers in a template refer to arbitrary expression values.
  • If you use the 'absolute value' option in patternmatch, genes matching either the pattern or its "opposite" will be given equal scores.
  • If you change the order of columns in your data file, you must also change the templates you are using to analyze it.
  • Only the shape of the template is important when using the correlation coefficient (as patternmatch does). The correlation of the gene expression pattern with the template is a measure of "how much the gene looks like the template" (loosely speaking).
  • A t-test between two groups in your data will give equivalent results to a template having only two expression levels.
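
The file-format rules above (template on the first line, one number per data column) can be checked before running an analysis. The following is a minimal sketch, not part of patternmatch itself; the function names parse_template and check_template are hypothetical, and the rejection of a constant template is an added assumption based on the fact that a correlation with a constant vector is undefined.

```python
def parse_template(text):
    """Return the template as a list of floats, read from the first line only."""
    lines = text.splitlines()
    if not lines or not lines[0].strip():
        raise ValueError("the template must be on the first line of the file")
    # float() accepts values written without a leading zero, such as ".5"
    return [float(token) for token in lines[0].split()]


def check_template(template, n_data_columns):
    """Raise ValueError unless the template has one number per data column."""
    if len(template) != n_data_columns:
        raise ValueError(
            f"template has {len(template)} numbers, "
            f"but the data file has {n_data_columns} data columns"
        )
    # Assumption: a template with a single level has no shape to correlate with.
    if len(set(template)) < 2:
        raise ValueError("template needs at least two different levels")


# The example template from the top of this page, for a 12-column data file.
template = parse_template("0 0 0 1 1 1 0 0 0 .5 .5 .5\n")
check_template(template, 12)
```

Calling check_template(template, 6) with the same template would raise an error, since the column counts no longer agree.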

A simple example

We start with a data file that looks something like this:

gene        mutant    mutant    mutant    wildtype  wildtype  wildtype
100001_at   -36.3     77.8      64.4      89.4      126.6     86.2
100002_at   1504.2    1512      944.5     1157.9    1652      1358.9
100003_at   845.9     966.5     1057.4    987.4     764.1     878.5

(etc, for many lines)

We wish to define 'idealized' expression patterns that we are interested in. In this simple example, we are probably interested in genes which are different between mutant and wild type animals. Let's assume that genes which are overexpressed in the mutant are as interesting as ones which are underexpressed. So we can define one template that corresponds to either expression pattern.

If we plotted the expression values, an ideal kind of gene we are looking for would have an expression profile like this:


A less ideal example might be like this one:


We want our method to give the first gene a higher score than the second one.

We're equally interested in genes like this, which has the "opposite" expression pattern:


The template or pattern we use to define this situation is:

0   0   0   1   1   1

0 and 1 stand for two different arbitrary expression levels. You can read this template as "A gene which is expressed at one level in the first three samples, and at another level in the next three samples." Another way to put this is to think of the "expression" arrow at the left of the pictures above pointing in the opposite direction. The absolute value of the correlation coefficient gives equally high scores to each situation.

If you want to select ONLY genes like the first one, and handle the third type (the "opposite" pattern) separately, be sure NOT to use the absolute value of the correlation coefficient. You would then run a separate template to identify genes like the third one shown above:

1   1   1   0   0   0

Otherwise (that is, when the absolute value is used), patternmatch considers these two templates to be equivalent.
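
The effect of the absolute value option can be illustrated with a small sketch. This is not patternmatch's own code: it assumes the score is a plain Pearson correlation between a gene's expression values and the template, and the function names pearson and score are hypothetical. The expression values are those of 100001_at from the data file shown earlier.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def score(expression, template, absolute=False):
    """Correlation with the template, optionally as an absolute value."""
    r = pearson(expression, template)
    return abs(r) if absolute else r


gene = [-36.3, 77.8, 64.4, 89.4, 126.6, 86.2]  # 100001_at
low_then_high = [0, 0, 0, 1, 1, 1]
high_then_low = [1, 1, 1, 0, 0, 0]

# Signed scores have opposite signs for the two templates ...
signed_a = score(gene, low_then_high)
signed_b = score(gene, high_then_low)
# ... while absolute scores are the same for both.
abs_a = score(gene, low_then_high, absolute=True)
abs_b = score(gene, high_then_low, absolute=True)
print(signed_a, signed_b, abs_a, abs_b)
```

With signed scores the two templates rank genes in opposite order, which is why each direction needs its own run when the absolute value is not used.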

A final note: this case is so simple that the template matching problem reduces to a t-test between the two groups. In fact, any time you have only two expression levels in your template, you can do a t-test to get equivalent results. The only advantage of the template matching method here is that it is helpful to conceptualize the problem as one of finding expression patterns. In more complex cases, the t-test does not apply. That's what the next example shows.
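
The equivalence with the t-test can be checked numerically. The sketch below assumes the classical two-sample t-test with pooled variance (not Welch's unequal-variance version) and uses the standard identity t = r * sqrt((n - 2) / (1 - r^2)), where r is the correlation with a two-level template and n is the number of samples. The function names are hypothetical, and the values are those of 100003_at from the data file above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pooled_t(group0, group1):
    """Two-sample t statistic with pooled variance (group1 minus group0)."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))


def t_from_r(r, n):
    """Convert a correlation with a two-level template into a t statistic."""
    return r * math.sqrt((n - 2) / (1 - r * r))


gene = [845.9, 966.5, 1057.4, 987.4, 764.1, 878.5]  # 100003_at
template = [0, 0, 0, 1, 1, 1]

r = pearson(gene, template)
t_via_template = t_from_r(r, len(gene))
t_direct = pooled_t(gene[:3], gene[3:])  # mutant vs. wildtype columns
print(t_via_template, t_direct)
```

Because the t statistic is a monotonic function of r, ranking genes by correlation with a two-level template gives the same order as ranking them by this t statistic.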

A more complex example

In this example, we have multiple experimental groups. We decide (for some reason) that we want genes that show three levels of expression. Here is what our data file looks like.

gene   sample1   sample2   sample3   sample4   sample5   sample6   sample7   sample8   sample9   sample10  sample11  sample12
gene1  94.2      227.7     308.3     48.8      170.1     154.3     160.4     106.2     40.7      22.5      184.4     98.9
gene2  73.4      172.5     242.1     -8.8      148.8     190.3     155.1     205.3     337.8     276       -64.2     295.4
gene3  30893.6   30933.3   28685.9   33768.5   27873.8   28285.6   30592.3   33936.9   31307.9   27969     31922.8   28736.9
gene4  2106.8    514.1     1918.7    213.6     783.3     878.4     1060.4    678.1     659.9     1133.7    2090.8    469.6
gene5  38821.6   38903.1   39683     43102.3   39869.7   36460.7   37030.1   40379.5   37286.5   37800.6   38389.9   37619.6
gene6  103.4     -100.5    306.5     204.2     391.6     141       270.1     125.5     195       477.9     153.2     124.3

And here is the expression pattern we are after (defined by the experimenter):


And here is the template we might use:

0 0 0 1 1 1 0 0 0 -1 -1 -1

We could have equivalently used the following template, with identical results:

1 1 1 2 2 2 1 1 1 0 0 0
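
The two templates give identical results because the Pearson correlation does not change when the same constant is added to every template value (here, 1). The sketch below checks this on gene1 and gene2 from the data file above. It assumes a plain Pearson correlation as the score and is not patternmatch's own code; the pearson function is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


template_a = [0, 0, 0, 1, 1, 1, 0, 0, 0, -1, -1, -1]
template_b = [1, 1, 1, 2, 2, 2, 1, 1, 1, 0, 0, 0]  # template_a shifted up by 1

genes = {
    "gene1": [94.2, 227.7, 308.3, 48.8, 170.1, 154.3,
              160.4, 106.2, 40.7, 22.5, 184.4, 98.9],
    "gene2": [73.4, 172.5, 242.1, -8.8, 148.8, 190.3,
              155.1, 205.3, 337.8, 276, -64.2, 295.4],
}

# Score every gene against both templates; each pair should agree.
scores = {
    name: (pearson(values, template_a), pearson(values, template_b))
    for name, values in genes.items()
}
for name, (score_a, score_b) in scores.items():
    print(name, score_a, score_b)
```

Multiplying every template value by a positive constant leaves the correlation unchanged as well, so only the relative spacing of the levels matters.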

Note that template matching may not be the best starting point when looking for complex but 'nonspecific' patterns. If you are really interested in a very specific pattern like this one (including the relative levels of expression it implies), then it should work well. Otherwise, a better method might be to use ANOVA first and then perhaps apply the templates, or even to use clustering.

References

For more details on correlation measures, consult any statistics textbook.