Background model
PEnG-motif searches sequence sets for enriched patterns. The following sections will explain what enrichment means in the context of PEnG-motif, how it is calculated and how you can influence the background model to obtain better results.
Enriched patterns are k-mers that occur significantly more often than expected by random chance. For each pattern, we can assess significance by calculating a p-value or z-score under a Poisson null distribution. The only parameter we need to estimate is the expected frequency of the pattern in unbound sequences.
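As a rough illustration of this test (a minimal sketch, not PEnG-motif's actual implementation; the counts below are hypothetical example values):

```python
import math
from scipy.stats import poisson

def pattern_significance(observed: int, expected: float):
    """P-value and z-score for seeing `observed` occurrences of a pattern
    when the null model predicts `expected` occurrences."""
    # P(X >= observed) under a Poisson null with mean `expected`
    p_value = poisson.sf(observed - 1, expected)
    # Normal approximation: a Poisson variable has variance equal to its mean
    z_score = (observed - expected) / math.sqrt(expected)
    return p_value, z_score

# Hypothetical numbers: pattern seen 180 times, 100 expected by chance
print(pattern_significance(180, 100.0))  # tiny p-value, z-score of 8
```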
Given a large enough set of negative (= unbound) sequences, we could estimate the expected frequencies directly from the relative frequencies in the negative set. This is usually not feasible for two reasons:
- If we assume patterns of length 10, there are 4^10 ≈ 1 million different patterns. To estimate their expected counts accurately, we would need a huge amount of sequence data.
- Often there is no way to obtain a representative set of unbound sequences.
PEnG-motif tackles both problems by modeling the pattern occurrences with a Markov model. With a background model of order 2 (= trimers), we approximate the probability of the pattern ACGTAC by
P(ACGTAC) = P(ACG) * P(T|CG) * P(A|GT) * P(C|TA)
This directly solves the first problem: the number of parameters drops from about 1 million to 4^3 = 64, and 64 parameters can be estimated accurately even from small sequence sets!
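A minimal sketch of this factorization (illustrative only, not PEnG-motif's code): given the initial trimer probabilities and the conditional probabilities P(base | two preceding bases), the probability of a longer pattern is the product shown above.

```python
from itertools import product

def pattern_prob(pattern, p_start, p_cond):
    """Probability of `pattern` under an order-2 Markov model:
    P(pattern) = P(first trimer) * product of P(base | two preceding bases)."""
    prob = p_start[pattern[:3]]
    for i in range(3, len(pattern)):
        prob *= p_cond[(pattern[i - 2:i], pattern[i])]
    return prob

# Sanity check with a uniform model: every trimer has probability 1/64 and
# every conditional is 0.25, so P(ACGTAC) = (1/64) * 0.25**3 = 4**-6.
bases = "ACGT"
p_start = {"".join(t): 1 / 64 for t in product(bases, repeat=3)}
p_cond = {("".join(c), b): 0.25
          for c in product(bases, repeat=2) for b in bases}
print(pattern_prob("ACGTAC", p_start, p_cond))  # 0.000244140625 == 4**-6
```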
If the user provides a negative set of unbound sequences with the `--background-sequences` flag, PEnG-motif learns a Markov model of order `--bg-model-order` (default 2) on the negative set and uses this model to calculate the expected counts for the null model, as sketched below.
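The training step amounts to counting (order+1)-mers with a pseudocount and normalizing. A simplified re-implementation of the idea (assuming plain sequence strings rather than PEnG-motif's FASTA handling, and a uniform initial distribution, which PEnG-motif may treat differently):

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"

def train_background(seqs, order=2, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from (order+1)-mer
    counts in the negative set, smoothed with a pseudocount."""
    counts = defaultdict(float)
    for seq in seqs:
        for i in range(len(seq) - order):
            counts[seq[i:i + order + 1]] += 1.0
    p_cond = {}
    for ctx in map("".join, product(BASES, repeat=order)):
        total = sum(counts[ctx + b] for b in BASES) + 4 * pseudocount
        for b in BASES:
            p_cond[(ctx, b)] = (counts[ctx + b] + pseudocount) / total
    return p_cond

def expected_count(pattern, p_cond, seqs, order=2):
    """Expected occurrences of `pattern` in `seqs` under the background
    model, multiplying the pattern probability by the number of possible
    start positions. The first `order` bases are taken as uniform here."""
    prob = 1.0 / 4 ** order
    for i in range(order, len(pattern)):
        prob *= p_cond[(pattern[i - order:i], pattern[i])]
    positions = sum(max(0, len(s) - len(pattern) + 1) for s in seqs)
    return prob * positions
```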
If the user does not specify background sequences with the `--background-sequences` flag, PEnG-motif learns the background model on the positive set itself. While this sounds paradoxical at first, it works: when a long pattern such as ACGTTGCA is overrepresented, the trimer parameters of a second-order background model barely react, because the estimates of the pattern's 3-mers (ACG, CGT, GTT, TTG, TGC, GCA) are dominated by counts that do not come from the motif.
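You can convince yourself of this numerically with a toy simulation (purely illustrative, not part of PEnG-motif): plant an 8-mer motif into a fraction of random sequences and compare the trimer statistics against the uniform baseline.

```python
import random
random.seed(0)

def random_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

# 1000 random sequences of length 200; plant the motif into every 5th one
motif = "ACGTTGCA"
seqs = []
for i in range(1000):
    s = random_seq(200)
    if i % 5 == 0:
        pos = random.randrange(len(s) - len(motif) + 1)
        s = s[:pos] + motif + s[pos + len(motif):]
    seqs.append(s)

def freq(seqs, kmer):
    hits = total = 0
    for s in seqs:
        for i in range(len(s) - len(kmer) + 1):
            total += 1
            hits += s[i:i + len(kmer)] == kmer
    return hits / total

# The 200 planted motifs shift the trimer estimate of e.g. CGT by only
# about 6% (baseline 1/64 ~ 0.0156), while the 8-mer itself is enriched
# roughly 70-fold over its random expectation of 4**-8.
print(freq(seqs, "CGT"), freq(seqs, motif))
```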
Keep in mind that this approximation works well only if the enriched pattern is considerably longer than the order of the background model. It is, for example, not possible to learn a pattern of length 4 with a background model of order 3: an order-3 model is estimated from 4-mer counts, so the expected frequency of every 4-mer matches its observed frequency and no enrichment can be detected. If you expect very short patterns and provide no background sequences, you have to reduce the background model order!
If you expect the negative sequences to have the same biases (e.g. GC content, CpG frequencies, ...) as the positive sequences, they will make a great negative set and should be passed to PEnG-motif. If you are not sure, it may be safer to learn the background model on the positive sequences.
If the background sequences are strongly enriched for long patterns such as repeats or microsatellites, PEnG-motif may report repetitive patterns as enriched even though they are not. This stems from the assumption that the negative sequences can be well approximated by a low-order Markov model: such a model underestimates the frequency of long repetitive patterns, so their observed counts look significant.