Saturday, July 26, 2008

Back in the classroom

I think it is wise for "educated" people to have to return to the other side of the classroom now and then. When we do this we are reminded that, even though we went to college for almost a decade, we still only know a teeny little bit about one subject. This is how I feel this week as I attend a course on Computational Phyloinformatics at NESCent in Durham, NC. The goal of this course is to teach biologists how to use large datasets and write our own scripts to improve our abilities to infer phylogenies. The prerequisites for the course are to have a solid understanding of phylogenetic inference methods and to have some programming skills.

I have some experience programming in Perl, mostly to write/edit portions of the USS evolution code Rosie blogs about often, but also to facilitate my laziness when dealing with many files and their contents. I've also written a few functions for statistical analysis in R. Even though I still struggle with programming, I was just beginning to feel as though I had a solid grasp on it. That was until I started this course. We are only two days into it, and only just finished the refresher, and my head is full of new information, so full that I am having difficulty digesting it all. I've learned many new things about the Perl language and hope to learn even more. It sure hasn't been easy, and will probably only get more difficult, but I will take solace in the fact that my skills will be better after the course than they were before, even if I still have a lot to learn.

Tuesday, July 22, 2008

Recombination headache

I have really slacked with my blog posts lately and I would like to blame this on being busy in the lab, but I know that is only an excuse. Besides the lab work, I am still working on the manuscript detailing my attempts to infer the phylogenetic relationships of the H. influenzae strains so that I can map DNA uptake and transformation ability onto the phylogeny. In previous posts, I've whined about the lack of resolution in the topology, which is most likely due to recombination between strains. One recombination-detection program in particular, GENECONV, is causing me some stress right now. The program is easy and straightforward to use but my results are difficult to understand.

GENECONV takes pairs of aligned sequences and compares them to find regions that are more similar to each other than expected. These regions are considered to be candidate recombinant fragments. But the worrisome thing is how GENECONV determines the divergence expected between sequences. This is a concern because my output shows that, for some strains, over half of the genome is more similar than expected. And to make matters worse, the strains with these large numbers of "recombinant" regions are those that phylogenetic analyses identify as close relatives. These results make me think that instead of finding recombinant fragments, GENECONV is finding regions of vertical descent.
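
To make the worry concrete, here is a toy Python sketch of the general idea of scanning a pair of aligned sequences for long stretches of agreement. This is not GENECONV's actual algorithm or scoring; the strain names, the sequences, and the function names polymorphic_columns and longest_shared_run are all made up for illustration, and I am assuming that only columns that vary somewhere in the alignment are informative.

```python
from itertools import combinations


def polymorphic_columns(alignment):
    """Return indices of alignment columns where at least two sequences differ."""
    length = len(next(iter(alignment.values())))
    return [i for i in range(length)
            if len({seq[i] for seq in alignment.values()}) > 1]


def longest_shared_run(seq_a, seq_b, columns):
    """Find the longest run of consecutive polymorphic columns where the pair agrees.

    Returns (run_length, first_column, last_column), or (0, None, None)
    if the pair disagrees at every polymorphic column.
    """
    best = (0, None, None)
    run_len, run_start = 0, None
    for col in columns:
        if seq_a[col] == seq_b[col]:
            if run_len == 0:
                run_start = col
            run_len += 1
            if run_len > best[0]:
                best = (run_len, run_start, col)
        else:
            run_len = 0  # a mismatch ends the current run
    return best


# Hypothetical toy alignment: strainA and strainB are near-identical,
# strainC is more divergent.
alignment = {
    "strainA": "ACGTACGTACGTACGT",
    "strainB": "ACGTACGTACGTACGA",
    "strainC": "TCGAACCTAGGTTCGA",
}

columns = polymorphic_columns(alignment)
for name_a, name_b in combinations(alignment, 2):
    run = longest_shared_run(alignment[name_a], alignment[name_b], columns)
    print(name_a, name_b, run)
```

In this toy case the two near-identical strains share a run that covers almost every variable site, simply because they are close relatives. Any method that flags "unexpectedly long" shared runs without accounting for relatedness would flag most of their alignment, which is the pattern I am seeing in my output.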

It could be the case that these putative recombinant regions are fooling the phylogenetic reconstruction methods, such that these strains are placed together as close relatives only because they share so many regions due to recombination. But if the majority of the genome is the result of homogenization of sequence between two strains, aren't these each other's closest relatives anyway, whether through vertical or horizontal descent? Nevertheless, I need to sort out exactly what is happening during GENECONV runs, starting with how it determines what similarity is to be expected between two sequences. Does the method used to measure expected similarity take phylogenetic structure into account?