Tuesday, July 22, 2008

Recombination headache

I have really slacked on my blog posts lately. I would like to blame this on being busy in the lab, but I know that is only an excuse. Besides the lab work, I am still working on the manuscript detailing my attempts to infer the phylogenetic relationships of the H. influenzae strains so that I can map DNA uptake and transformation ability onto the phylogeny. In previous posts, I've whined about the lack of resolution in the topology, which is most likely due to recombination between strains. One recombination-detection program in particular, GENECONV, is causing me some stress right now. The program is straightforward to use, but my results are difficult to interpret.

GENECONV takes pairs of aligned sequences and compares them to find regions that are more similar to each other than expected. These regions are considered candidate recombinant fragments. What worries me is how GENECONV determines the divergence expected between sequences. This is a concern because my output shows that, for some strains, over half of the genome is more similar than expected. To make matters worse, the strains with these large numbers of "recombinant" regions are the ones that phylogenetic analyses identify as close relatives. These results make me think that instead of finding recombinant fragments, GENECONV is finding regions of vertical descent.
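
To make the worry concrete, here is a minimal Python sketch of the general idea of a pairwise scan. It is not GENECONV's actual scoring or significance test. I am assuming a simplified version in which only alignment columns that vary among the strains are considered, and the "score" for a pair is just the longest run of consecutive variable columns at which the two sequences agree. The function names and the toy alignment are made up for illustration.

```python
from itertools import combinations


def polymorphic_sites(alignment):
    """Return the column indices at which the aligned sequences are not all identical."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def longest_shared_run(seq_a, seq_b, sites):
    """Longest stretch of consecutive polymorphic sites at which two sequences agree.

    Returns (run_length, first_column, last_column); the columns are None
    when the pair agrees at no polymorphic site.
    """
    best = (0, None, None)
    run_len, run_start = 0, None
    for col in sites:
        if seq_a[col] == seq_b[col]:
            if run_len == 0:
                run_start = col
            run_len += 1
            if run_len > best[0]:
                best = (run_len, run_start, col)
        else:
            run_len = 0  # a discordant site ends the current run
    return best


def scan_pairs(alignment):
    """Apply longest_shared_run to every pair of sequences in the alignment."""
    sites = polymorphic_sites(alignment)
    return {
        (a, b): longest_shared_run(alignment[a], alignment[b], sites)
        for a, b in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment: strainA and strainB are near-identical, strainC is divergent.
alignment = {
    "strainA": "ACGTACGTAC",
    "strainB": "ACGTACGTTC",
    "strainC": "TCGAACCTTG",
}

for pair, (length, start, end) in scan_pairs(alignment).items():
    print(pair, length, start, end)
```

In this toy case the two near-identical strains produce the longest shared run simply because they differ at almost no sites. If the expected run length is judged against the alignment as a whole, close relatives will tend to look "more similar than expected" over long stretches, which is the pattern I seem to be seeing.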

It could be that these putative recombinant regions are fooling the phylogenetic reconstruction methods, such that these strains are placed together as close relatives only because they share so many regions due to recombination. But if the majority of the genome is the result of homogenization of sequence between two strains, aren't these strains each other's closest relatives anyway, whether through vertical or horizontal descent? Nevertheless, I need to sort out exactly what is happening during GENECONV runs, starting with how it determines what similarity is to be expected between two sequences. Does the method used to measure expected similarity take phylogenetic structure into account?

