Thursday, February 21, 2008

A forest of trees

How much recombination occurs in H. influenzae? Does more recombination occur in strains that are more competent?

These questions can be partly addressed by inferring the phylogenetic relationships of H. influenzae strains, something I have been working on for a few months now. My first approach was to take different-sized fragments of the genome, align the same fragment from all strains, and infer a topology for each fragment. When I determine which clades the tree topologies have in common, I find that no clade is present in a majority of topologies. This means that if we take the same section of the genome from each strain, its evolutionary history will differ from that of another section. Recombination is a likely culprit, and the statistical tests I have done so far do support recombination.
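To make the "clades in a majority of topologies" idea concrete, here is a minimal sketch of the counting step. It is not the software I actually used. It assumes each tree is rooted and written as nested tuples of strain names, and the names s1 to s4 are placeholders, not real strains.

```python
from collections import Counter

def leaves_and_clades(tree, clades):
    """Return the leaf set under `tree`, recording every internal node's leaf set in `clades`."""
    if isinstance(tree, str):  # a leaf is just a strain name
        return frozenset([tree])
    leaf_set = frozenset().union(*(leaves_and_clades(child, clades) for child in tree))
    clades.add(leaf_set)
    return leaf_set

def clade_sets(tree):
    """All non-trivial clades of a rooted tree (the root clade holding every strain is dropped)."""
    clades = set()
    all_leaves = leaves_and_clades(tree, clades)
    clades.discard(all_leaves)
    return clades

def majority_clades(trees, threshold=0.5):
    """Map each clade found in more than `threshold` of the trees to its frequency."""
    counts = Counter()
    for tree in trees:
        counts.update(clade_sets(tree))
    return {c: n / len(trees) for c, n in counts.items() if n / len(trees) > threshold}

# Hypothetical fragment trees for four placeholder strains.
trees = [
    (("s1", "s2"), ("s3", "s4")),
    ((("s1", "s2"), "s3"), "s4"),
    (("s1", "s3"), ("s2", "s4")),
]
print(majority_clades(trees))
```

In this toy input only the (s1, s2) clade appears in more than half of the trees. With my real fragment trees the equivalent result is empty.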

But there is a problem with this consensus tree approach. Not all of the genome fragments are the same length. This means that some topologies are based on an alignment that is >200 kbp while others are based on an alignment that is <200 bp. Clearly, the topology inferred using the lengthier dataset should be weighted more than the topology inferred using the shorter dataset. But weighting individual topologies when determining a consensus tree is not possible, as far as I know (if anyone knows differently, I would love to hear about it).
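Just to illustrate what I mean by weighting, here is a naive sketch in which each fragment's clades count in proportion to its alignment length. I am not aware of this being an established consensus method, and weighting linearly by length is purely my assumption for the example. The clades and lengths below are made up.

```python
from collections import defaultdict

def weighted_clade_support(fragments):
    """fragments: iterable of (clades, alignment_length) pairs.

    Returns, for each clade, the fraction of total aligned sequence
    that comes from fragments whose tree contains that clade.
    """
    total = 0
    support = defaultdict(float)
    for clades, length in fragments:
        total += length
        for clade in clades:
            support[clade] += length
    return {clade: weight / total for clade, weight in support.items()}

# Hypothetical example: one long and one short fragment that disagree.
ab = frozenset({"s1", "s2"})
ac = frozenset({"s1", "s3"})
fragments = [({ab}, 200_000), ({ac}, 200)]
support = weighted_clade_support(fragments)
print(support[ab], support[ac])
```

Counted equally, the two clades would each appear in half of the trees. Weighted by length, the clade from the long fragment dominates.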

To get around this problem, all of the genome fragment alignments can be concatenated into one large alignment. I have done this and am currently inferring the topology using parsimony, distance, and maximum likelihood methods. The parsimony and distance analyses each produce a well-supported, fairly well-resolved phylogeny, but the two methods do not agree on all clades. I am now waiting for the maximum likelihood analysis, which may take a few days given that the alignment is over 1.8 Mbp. If all three methods produce different topologies, I suppose I can still believe the clades that all of them agree upon, if there are any, and use those clades to map competence onto the phylogeny. I am also using a completely different method, implemented in ClonalFrame, that excludes regions influenced by recombination and infers the phylogeny from the clonal frame only. We will see what all these methods agree upon, if anything. Either way, the outcome will be interesting.
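The two bookkeeping steps in this plan, concatenating the fragment alignments and keeping only the clades that every method recovers, can be sketched as follows. This is an illustration, not my actual pipeline. It assumes each fragment alignment is a dictionary from strain name to aligned sequence, and that each method's tree has already been reduced to a set of clades. All names and sequences are placeholders.

```python
def concatenate_alignments(fragments):
    """Join per-fragment alignments (dicts of strain -> aligned sequence) end to end."""
    strains = sorted(fragments[0])
    for frag in fragments:
        if set(frag) != set(strains):
            raise ValueError("every fragment must contain the same strains")
        if len({len(seq) for seq in frag.values()}) != 1:
            raise ValueError("sequences within a fragment must have equal aligned length")
    return {s: "".join(frag[s] for frag in fragments) for s in strains}

def shared_clades(*clade_sets):
    """Clades recovered by every method (a plain set intersection)."""
    return set.intersection(*map(set, clade_sets))

# Hypothetical toy data.
fragments = [
    {"s1": "ACGT", "s2": "AC-T"},
    {"s1": "GG", "s2": "GA"},
]
supermatrix = concatenate_alignments(fragments)

ab = frozenset({"s1", "s2"})
cd = frozenset({"s3", "s4"})
ac = frozenset({"s1", "s3"})
agreed = shared_clades({ab, cd}, {ab}, {ab, ac})  # e.g. parsimony, distance, likelihood
print(supermatrix, agreed)
```

Only the clades in `agreed` would then be used to map competence onto the phylogeny.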