Information Theoretic Alignment Analysis


Overview

In the world of bioinformatics, information is an absolute measure of sequence conservation whose units are called bits. Relative entropy is a similar measure of sequence conservation that is relative to the background sequence and thus has no units. Using this IGS webtool you can easily submit an alignment, calculate several information measures for the alignment, and then test other novel sequences against the alignment.

If you would like to know more about information theory and its applications to biology we suggest checking out Tom Schneider's website.

For help with the different aspects of this tool see the sections below.


Submitting A Multiple Sequence Alignment

Before you can do anything else, you must either enter a multiple sequence alignment into the text field or upload a file containing an alignment.

Specify the format of the alignment. The format must be one of the following:

Blank lines and any lines starting with a # will be ignored.
There cannot be more than 200000 residues of sequence data in an alignment.
Sequence headers/names will be truncated if they are longer than 100 characters.

Choose an alignment type (either DNA/RNA or protein).

Recognized DNA/RNA residue symbols are: A, C, G, T and U.
Note: All U's will be converted to T's.

Recognized protein residue symbols are: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y.

Lower case residue symbols will be converted to upper case.
Any sequence character that is not a recognized residue symbol will be considered a gap.
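
The symbol rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the tool's actual code; the function name normalize_sequence, the "dna"/"protein" type labels, and the use of "-" as the gap character are assumptions made for the example.

```python
DNA_SYMBOLS = set("ACGT")
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY")

def normalize_sequence(seq, alignment_type="dna", gap="-"):
    """Apply the symbol rules: upper-case, U -> T (DNA/RNA only), others -> gap.

    The gap character "-" is an assumption; the rules only say that
    unrecognized characters are treated as gaps.
    """
    allowed = DNA_SYMBOLS if alignment_type == "dna" else PROTEIN_SYMBOLS
    out = []
    for ch in seq.upper():
        if alignment_type == "dna" and ch == "U":
            ch = "T"
        out.append(ch if ch in allowed else gap)
    return "".join(out)
```

For example, normalize_sequence("acgu-nX") gives "ACGT---": the lower case symbols are converted, the U becomes a T, and the remaining characters are treated as gaps.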

You can adjust the background frequencies of the different residues. This will affect all relative entropy calculations. Leaving the background frequencies at their default values will simply cause all relative entropy calculations to be equal to their information counterparts.


Information Profile

The information profile contains one row for each position in the alignment. Check/uncheck the boxes to show/hide the corresponding columns (i.e. the attributes of the positions) in the information profile. The columns are as follows:

pos The position in the alignment.
countX The number of occurrences of residue X at this position.
freqX This is equal to countX / sampleSize.
sampleSize The number of sequences that have a residue (not a gap) at this position.
correction The entropy correction for this position (increases as sample size decreases).
info The information content at this position (measured in bits).
relEntropy The relative entropy at this position (uses provided background frequencies).
correctedInfo This is equal to info - correction.
correctedRelEntropy This is equal to relEntropy - correction.
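
As a rough illustration of how the info and relEntropy columns relate, the sketch below computes both for a single alignment column using the standard definitions: info is log2(number of symbols) minus the Shannon entropy of the observed frequencies, and relEntropy is the sum of freqX * log2(freqX / backgroundX). The function name position_profile, the assumption that the default background is uniform, and the handling of an all-gap column are ours and may differ from what the tool does.

```python
import math
from collections import Counter

def position_profile(column, symbols="ACGT", background=None):
    """Return sampleSize, freq, info and relEntropy for one alignment column."""
    if background is None:
        # Assumption: the default background is uniform over the symbols.
        background = {x: 1.0 / len(symbols) for x in symbols}
    # Characters that are not recognized symbols (gaps) are not counted.
    counts = Counter(ch for ch in column if ch in symbols)
    n = sum(counts.values())
    if n == 0:
        # Assumption: an all-gap column carries no information.
        return {"sampleSize": 0, "freq": {}, "info": 0.0, "relEntropy": 0.0}
    freq = {x: counts[x] / n for x in symbols}
    entropy = -sum(f * math.log2(f) for f in freq.values() if f > 0)
    info = math.log2(len(symbols)) - entropy
    rel = sum(f * math.log2(f / background[x]) for x, f in freq.items() if f > 0)
    return {"sampleSize": n, "freq": freq, "info": info, "relEntropy": rel}
```

With a uniform background the two measures coincide: position_profile("AACC-") reports a sample size of 4 and both info and relEntropy equal to 1 bit.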

Note: If the number of residue symbols is 4 (i.e. DNA/RNA) and the sample size is <= 125, then the sample size correction used is the expected entropy assuming a multinomial distribution. Otherwise, the correction used is an asymptotic approximation (which is, unfortunately, a poor approximation for small sample sizes).
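
To make the two kinds of correction concrete, here is a hedged sketch. The exact form and parameters used by the tool are not given on this page, so both functions are assumptions: exact_correction enumerates every possible composition of n residues drawn with equal probability from 4 symbols and subtracts the expected entropy from 2 bits, and approx_correction uses a widely used asymptotic form, (s - 1) / (2 * ln(2) * n), for s symbols and sample size n.

```python
import math

def exact_correction(n):
    """2 bits minus the expected entropy of n residues, assuming a
    multinomial distribution with equal probabilities for 4 symbols."""
    log_fact = [math.lgamma(k + 1) for k in range(n + 1)]
    log_quarter = math.log(0.25)
    expected_entropy = 0.0
    for a in range(n + 1):
        for c in range(n - a + 1):
            for g in range(n - a - c + 1):
                counts = (a, c, g, n - a - c - g)
                # Multinomial probability of this composition, in log space.
                log_prob = (log_fact[n] - sum(log_fact[k] for k in counts)
                            + n * log_quarter)
                h = -sum((k / n) * math.log2(k / n) for k in counts if k > 0)
                expected_entropy += math.exp(log_prob) * h
    return 2.0 - expected_entropy

def approx_correction(sample_size, num_symbols):
    """Asymptotic approximation in bits: (s - 1) / (2 * ln(2) * n)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return (num_symbols - 1) / (2.0 * math.log(2) * sample_size)
```

Under these assumptions the two values differ noticeably at small sample sizes (for a sample size of 2, 1.25 bits exactly versus about 1.08 bits from the approximation), which illustrates why the approximation is a poor fit for small samples.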


Cumulative Information

The cumulative information/relative entropy of an alignment is simply the sum of the information/relative entropy of all of the positions. You can break up the positions into windows (one row per window), calculate the sums for each of these windows, and sort the windows by any of several different columns (i.e. attributes). The columns are as follows:

startPos The first position in the alignment window.
endPos The last position in the alignment window.
sumInfo The sum of the information content of all the positions in the window.
sumRelEntropy The sum of the relative entropy of all the positions in the window.
sumCorrectedInfo The sum of the corrected information content of all the positions in the window.
sumCorrectedRelEntropy The sum of the corrected relative entropy of all the positions in the window.
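
The windowed sums can be sketched as follows. We assume here that windows are consecutive and non-overlapping, that positions are numbered from 1, and that the last window may be shorter than the others; the function name window_sums and the generic "sum" key are ours, and the tool may divide the positions differently.

```python
def window_sums(values, window_size):
    """Split per-position values into consecutive windows and sum each one."""
    rows = []
    for start in range(0, len(values), window_size):
        chunk = values[start:start + window_size]
        rows.append({
            "startPos": start + 1,          # 1-based, inclusive
            "endPos": start + len(chunk),   # 1-based, inclusive
            "sum": sum(chunk),
        })
    return rows

# Example: per-position information values for a 5-position alignment.
rows = window_sums([1.0, 0.5, 2.0, 0.0, 1.5], 2)
# Sort windows from highest to lowest sum, as the sortable table would.
ranked = sorted(rows, key=lambda r: r["sum"], reverse=True)
```

The same function applies to any of the four per-position columns (info, relEntropy, correctedInfo, correctedRelEntropy).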


Individual Information

The individual information/relative entropy score of a sequence from the alignment can be thought of as a measure of the degree to which the sequence contributes to the overall information content/relative entropy of the alignment. High-scoring sequences are a good fit to the rest of the alignment, and low-scoring sequences are a poor fit. You can choose to use only a portion of the alignment in calculating sequence scores. Sequences can then be sorted by any of the following columns (i.e. attributes):

seqNum The order in which the sequence appeared in the submitted alignment.
seqName The name/header of the sequence.
startPos The first position in the sequence segment.
endPos The last position in the sequence segment.
indInfo The individual information score for the sequence.
indRelEntropy The individual relative entropy score for the sequence.
indCorrectedInfo The corrected individual information score for the sequence.
indCorrectedRelEntropy The corrected individual relative entropy score for the sequence.

Note: All individual sequence scores are calculated using only the chosen positions from the alignment.
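
One common way to define an individual information score, which we assume here for illustration only, is to sum log2(number of symbols) + log2(frequency of the sequence's residue) over the chosen positions. With this definition the average score of the sequences in the alignment equals the cumulative information of the same positions. The function name individual_info, the 0-based positions argument, and the choice to skip gap positions are assumptions; the sample size correction is left out for brevity.

```python
import math

def individual_info(seq, alignment, symbols="ACGT", positions=None):
    """Uncorrected individual information of seq against the alignment.

    positions is a list of 0-based column indexes; None means all columns.
    Intended for sequences that are part of the alignment, so every residue
    has a non-zero frequency at its position.
    """
    if positions is None:
        positions = range(len(seq))
    score = 0.0
    for i in positions:
        column = [s[i] for s in alignment if s[i] in symbols]
        if seq[i] not in symbols or not column:
            continue  # assumption: gap positions contribute nothing
        freq = column.count(seq[i]) / len(column)
        score += math.log2(len(symbols)) + math.log2(freq)
    return score

alignment = ["ACGT", "ACGA", "ACCT", "AGGT"]
scores = [individual_info(s, alignment) for s in alignment]
```

In this small example the first sequence matches the majority residue at every position and therefore gets the highest score.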


Testing Sequences Against The Alignment

Sequences not present in the alignment can be submitted and tested against the alignment. We can calculate individual information/relative entropy scores for these sequences in much the same way that we do for sequences in the alignment (see above). If a test sequence is longer than the sequences in the alignment, it will be split up into segments of equal length and each segment will be scored individually. This is sometimes referred to as sequence walking: walking along the sequence, continually recalculating the scores as you go.
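
Sequence walking can be sketched as a sliding window over the test sequence. We assume here that each segment has the same length as the alignment (or the chosen portion of it) and that the window advances one position at a time; the function name walk_segments and the step parameter are ours.

```python
def walk_segments(test_seq, segment_length, step=1):
    """Yield (startPos, endPos, segment) with 1-based, inclusive positions.

    A test sequence shorter than segment_length yields nothing in this
    sketch; how the tool treats such sequences is not covered here.
    """
    for start in range(0, len(test_seq) - segment_length + 1, step):
        yield start + 1, start + segment_length, test_seq[start:start + segment_length]

segments = list(walk_segments("ACGTAC", 4))
```

Each yielded segment would then be scored on its own, giving one row of results per segment.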

You can choose to use only a portion of the alignment in calculating the sequence/segment scores. The results can then be sorted by any of the following columns (i.e. attributes):

seqNum The order in which the sequence appeared in the submitted test set.
seqName The name/header of the sequence.
indInfo The individual information score for the sequence.
indRelEntropy The individual relative entropy score for the sequence.
indCorrectedInfo The corrected individual information score for the sequence.
indCorrectedRelEntropy The corrected individual relative entropy score for the sequence.

Note: Test sequences need not all be the same length as each other or the same length as the sequences in the alignment.

Note: All individual sequence scores are calculated using only the chosen positions from the alignment.

Note: When scoring test sequences, we use pseudo frequencies of residues. Pseudo frequencies are slightly adjusted from the frequencies observed in the alignment. This adjustment allows us to calculate scores for sequences that contain residues at positions where those residues have counts of zero in the alignment; calculating scores for such sequences would otherwise not be possible. However, it also causes the individual scores for sequences in the alignment to be slightly different if they are submitted as test sequences.
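
The exact adjustment used by the tool is not described on this page. As an illustration only, the sketch below uses a simple additive pseudocount (add a small constant to every count before dividing), which is one common way to obtain non-zero frequencies. The function name pseudo_frequencies and the default pseudocount of 1 are assumptions.

```python
import math

def pseudo_frequencies(counts, symbols="ACGT", pseudocount=1.0):
    """Return frequencies with a pseudocount added to every symbol count."""
    n = sum(counts.get(x, 0) for x in symbols)
    total = n + pseudocount * len(symbols)
    return {x: (counts.get(x, 0) + pseudocount) / total for x in symbols}

# A position where all 4 aligned sequences have an A.
freqs = pseudo_frequencies({"A": 4})
# With observed frequencies, a C here would give log2(0), which is undefined.
# With pseudo frequencies the contribution is finite (and negative).
c_contribution = math.log2(4) + math.log2(freqs["C"])
```

Here the pseudo frequency of A is 0.625 rather than the observed 1.0, which shows why sequences from the alignment score slightly differently when resubmitted as test sequences.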


Design and Implementation: Robert Stewart, Wesleyan University, 2005.