Tuesday, September 9, 2008

sxy differences

One of the major regulators of competence development in Haemophilus influenzae is the Sxy protein (also known as TfoX). We think it might interact with Crp to induce transcription of genes required for DNA uptake. A lot of work has gone on in the lab to investigate how the sxy gene is regulated, and all of this work has been done in H. influenzae strain Rd. Now that I have measured DNA uptake in 30+ strains, I am comparing the sequence of the sxy gene, and its promoter region, from 12 additional strains whose genomes have been sequenced. I have found a few interesting things, although none of them seem to be a smoking gun.

First, the region preceding the start codon. I've aligned the region between the start codon of sxy and the start codon of the upstream gene, recA; this region is approximately 372 bp. One of the most striking differences between strains is that Rd seems to have a 52 bp deletion about 185 bp before the sxy start codon (I call this a deletion in Rd because it is unique to Rd, and the 12 other strains are almost identical in this region). Looking at an old paper mapping the transcription start site and other juicy details of this region (Zulty and Barcak 1995), it seems as though this may be a binding site for the integration host factor (IHF) protein, which facilitates the integration of phage genomes. However, my not-so-thorough literature search on IHF binding sites in H. influenzae didn't come up with much, so I will leave this alone for now. And it is so far away from the start codon that I don't know if it would have implications for the transcription of sxy. Another noteworthy result: some strains differ in the sequence of their -35 site, which isn't that surprising as the consensus for this site is not that strict, if I remember correctly.
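
A deletion call like this can be checked mechanically. Here is a minimal Python sketch, assuming the alignment is held as a dictionary of equal-length gapped strings; the function name, the strain labels other than Rd, and the toy sequences are all made up for illustration and are far shorter than the real 372 bp region.

```python
def private_gap_runs(alignment, strain, gap="-"):
    """Return (start, length) for each run of gap columns that occurs in
    `strain` while no other sequence has a gap in those columns.
    Positions are 0-based alignment columns."""
    target = alignment[strain]
    others = [seq for name, seq in alignment.items() if name != strain]
    runs = []
    start = None
    for i, base in enumerate(target):
        is_private_gap = base == gap and all(seq[i] != gap for seq in others)
        if is_private_gap and start is None:
            start = i                        # a new private gap run begins
        elif not is_private_gap and start is not None:
            runs.append((start, i - start))  # the run ended at column i - 1
            start = None
    if start is not None:                    # run reaches the end of the alignment
        runs.append((start, len(target) - start))
    return runs


# Toy alignment (hypothetical sequences, not the real sxy upstream region).
toy = {
    "Rd":      "ACGT----ACGTAC",
    "strainA": "ACGTTTGCACGTAC",
    "strainB": "ACGTTTGCACGAAC",
}
print(private_gap_runs(toy, "Rd"))
```

On the real alignment, a single run of length 52 reported only for Rd would match the pattern described above.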

Second, the coding region. Overall, there is very little variation between strains in the coding region. Strains differed from one another by ~10 positions at most. But what is most interesting is that there seem to be two alleles segregating. And even more interesting is that the majority of nucleotides that define each allele would cause a different amino acid to be incorporated into the protein (i.e. are nonsynonymous). This may not seem so surprising, as there are more nonsynonymous sites in a gene than there are synonymous ones; so if mutations happen at random, they will most often occur at a nonsynonymous site. Because most of these mutations will change the amino acid sequence, they are likely to be deleterious and removed from the population by natural selection. So one could argue that there hasn't been enough time for selection to purge these nonsynonymous mutations from the H. influenzae population. But I know from the analysis of sxy and other genes that there has been plenty of time for selection to purge such mutations from the population, if they are indeed harmful. Therefore, this finding raises several questions: are the two alleles functionally different? Are they associated with differences in competence? And has selection maintained both alleles, or is one on the rise?
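
The synonymous/nonsynonymous bookkeeping can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis behind the numbers above: the function names are invented here, the sequences are assumed to be aligned, in frame and gap-free, a codon with several changes is classified as a whole by comparing the two encoded amino acids, and changes to a stop codon are counted as nonsynonymous.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_differences(seq_a, seq_b):
    """Count differing codons between two aligned coding sequences,
    split by whether the encoded amino acid changes."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be equal length and a multiple of 3")
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        same_aa = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
        counts["synonymous" if same_aa else "nonsynonymous"] += 1
    return counts


def nonsynonymous_fraction():
    """Fraction of all single-nucleotide changes to sense codons that alter
    the encoded amino acid (every codon and change weighted equally)."""
    total = nonsyn = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue                      # skip stop codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                nonsyn += CODON_TABLE[mutant] != aa
    return nonsyn / total


# Toy example: AAA (K) -> AGA (R) is nonsynonymous, CTG (L) -> CTA (L) is synonymous.
print(classify_codon_differences("ATGAAACTG", "ATGAGACTA"))
print(nonsynonymous_fraction() > 0.5)
```

The second function is only a rough check on the claim that random mutations land on nonsynonymous sites more often than not; it ignores codon usage and mutation bias.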

It would be great to address some of these questions. I have data for the second one, but the other two are a little more difficult, especially since sxy is not an easy gene to work with.

Thursday, August 14, 2008

Manuscript reviews

A few weeks ago Rosie and I received our reviews for the variation-in-uptake/transformation paper. One of the reviews is really good and suggests that we do a few experiments to clarify our findings. This will not be a problem and should be done by the end of the month. The second reviewer wanted to complain about Rosie's hypothesis of DNA as a nutrient more than they wanted to critically review the content of our submitted paper, so this review will be easy to deal with.

In addition to doing the experiments requested by the first reviewer, we have decided to add some genetic data to the paper to make it stronger. I have done some analysis with competence gene sequences from the strains with genome sequences available. There are many differences in sequence between strains and some genes appear to be pseudogenes. Therefore, there are plenty of potential causes for the differences in competence between strains; now it is time to try to narrow down which ones are the cause.

Saturday, July 26, 2008

Back in the classroom

I think it is wise for "educated" people to have to return to the other side of the classroom now and then. When we do this we are reminded that, even though we went to college for almost a decade, we still only know a teeny little bit about one subject. This is how I feel this week as I attend a course on Computational Phyloinformatics at NESCent in Durham, NC. The goal of this course is to teach biologists how to use large datasets and write our own scripts to improve our abilities to infer phylogenies. The prerequisites for the course are to have a solid understanding of phylogenetic inference methods and to have some programming skills.

I have some experience programming in PERL, mostly to write/edit portions of the USS evolution code Rosie blogs about often, but also to facilitate my laziness when dealing with many files and their contents. I've also written a few functions for statistical analysis in R. Even though I still struggle with programming, I was just beginning to feel as though I had a solid grasp on it. That was until I started this course. We are only two days into it, and only just finished the refresher, and my head is full of new information, so full that I am having difficulty digesting it all. I've learned many new things about the PERL language and hope to learn even more. It sure hasn't been easy, and will probably only get more difficult, but I will take solace in the fact that my skills will be better after the course than they were before, even if I still have a lot to learn.

Tuesday, July 22, 2008

Recombination headache

I have really slacked with my blog posts lately and I would like to blame this on being busy in the lab, but I know that is only an excuse. Besides the lab work, I am still working on the manuscript detailing my attempts to infer the phylogenetic relationships of the H. influenzae strains so that I can map DNA uptake and transformation ability onto the phylogeny. In previous posts, I've whined about the lack of resolution in the topology, which is most likely due to recombination between strains. One recombination-detection program in particular, GENECONV, is causing me some stress right now. The program is easy and straightforward to use but my results are difficult to understand.

GENECONV takes pairs of aligned sequences and compares them to find regions that are more similar to each other than expected. These regions are considered to be candidate recombinant fragments. But the worrisome thing is how GENECONV determines the divergence expected between sequences. This is a concern because my output shows that, for some strains, over half of the genome is more similar than expected. And to make matters worse, the strains with these large numbers of "recombinant" regions are those that phylogenetic analyses identify as close relatives. These results make me think that instead of finding recombinant fragments, GENECONV is finding regions of vertical descent.

It could be the case that these putative recombinant regions are fooling the phylogenetic reconstruction methods, such that these strains are placed together as close relatives only because they share so many regions due to recombination. But if the majority of the genome is the result of homogenization of sequence between two strains, aren't these each other's closest relatives anyway, whether through vertical or horizontal descent? Nevertheless, I need to sort out exactly what is happening during GENECONV runs, starting with how it determines what similarity is to be expected between two sequences. Does the method used to measure expected similarity take phylogenetic structure into account?

Thursday, June 12, 2008

New Direction

I have spent the last 8 months attempting to infer the phylogenetic relationships of H. influenzae strains. All methods and datasets used have converged on one answer: if there is a phylogeny it is not resolvable by our current methods. It is possible that recombination between strains has obliterated phylogenetic signal. But it is also possible that there is a tree-like structure underlying the relationships of strains but there are not enough informative sites for phylogenetic inference. I find the latter possibility very unlikely given the average pairwise difference in nucleotide sequence between strains is 3%. Furthermore, the independent evidence I have for recombination between strains is consistent with recombination preventing resolution of the tree.
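
For reference, an average pairwise difference like the 3% quoted above can be computed along these lines. This is a minimal Python sketch with invented function names and a toy alignment; it assumes columns with a gap in either sequence are simply skipped, which may not match how the real figure was obtained.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared columns at which two aligned sequences differ.
    Columns with a gap in either sequence are ignored."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0


def mean_pairwise_difference(alignment):
    """Average p-distance over all pairs of sequences in the alignment."""
    pairs = list(combinations(alignment.values(), 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical sequences).
toy = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAAAT",
    "s3": "AAAAAAAATT",
}
print(round(mean_pairwise_difference(toy), 4))
```

A 3% difference across a genome of roughly 1.8 Mb would be on the order of tens of thousands of differing positions per pair of strains, which is why a shortage of variable sites seems an unlikely explanation.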

What does this tell us about the evolution of competence in particular, and the evolution of H. influenzae in general? Certainly the alleles of competence genes are mixed between strains (I also have evidence supporting this), an observation consistent with the phenotypic variation we observe in DNA uptake and transformation. This tells us that the ability to take up DNA and recombine it changes relatively quickly during the evolution of strains. This also tells us that strains of H. influenzae are not diverging from one another but that they form a cohesive group, linked by the sharing of alleles.

But even though my inability to infer a resolved tree tells us interesting things about the evolution of competence and H. influenzae, it is still unsatisfying. And I keep asking myself what other methods or datasets could be used. But I have to stop somewhere and move on from here.

Monday, April 28, 2008

Skeptical

Transformational recombination in bacteria occurs by a gene conversion mechanism, where incoming ssDNA replaces one of the DNA strands in the chromosome. Sometimes a new allele will be introduced and this could result in the converted region having a different evolutionary history than the rest of the genome. Thus, the evolutionary history of the locus will be mosaic; when used in phylogenetic analysis there will be conflicting signal and sites in the converted region will have a different evolutionary history than sites outside of the converted region, i.e. they are incompatible (Phi test, Bruen et al. 2006).
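
The idea of incompatible sites can be illustrated with the classic four-gamete check, sketched below in Python. To be clear, this is not the Phi statistic of Bruen et al. (which builds on a more refined incompatibility score); it is only the simplest version of the underlying notion, restricted to sites with exactly two alleles, and the function names and toy sequences are made up here.

```python
from itertools import combinations


def sites_incompatible(col_i, col_j):
    """Four-gamete check for two biallelic sites: if all four allele
    combinations are observed, no single tree explains both sites
    without recurrent mutation or recombination."""
    return len(set(zip(col_i, col_j))) == 4


def incompatible_pairs(sequences):
    """Return index pairs of biallelic alignment columns that fail the
    four-gamete check. `sequences` is a list of equal-length strings."""
    columns = list(zip(*sequences))
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    return [(i, j) for i, j in combinations(biallelic, 2)
            if sites_incompatible(columns[i], columns[j])]


# Toy data: columns 0 and 1 show all four combinations (AA, AT, TA, TT).
print(incompatible_pairs(["AAC", "ATC", "TAC", "TTC"]))
# Toy data consistent with a single tree: no incompatible pairs.
print(incompatible_pairs(["AAC", "AAC", "TTC", "TTC"]))
```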

I now have two sources of evidence supporting recombination in H. influenzae. First, the inability to resolve the phylogenetic relationships of strains, even when using whole genome sequences. Second, pairs of sites have independent evolutionary histories, indicative of mosaicism within a locus (determined using the Phi test of Bruen et al. 2006). Star phylogenies can be a sign of recombination and/or population expansion. But taken together with the independent evolution of pairs of sites, there is good evidence that recombination occurs between strains of H. influenzae. But despite the evidence from phylogenetic methods and the Phi test, I would still like to see at least one signature of recombination.

To do this, and to quantify the number of recombination events that have occurred during the divergence of each strain, I have been using a program called GENECONV (Sawyer 1999). This program identifies regions in an alignment where pairs of strains are more similar to each other than would be expected, given the distribution of polymorphism throughout the rest of the gene/locus/alignment. These are examples of allele exchange between strains in the alignment (or between a strain not in the alignment and both strains in the alignment). I will refer to these as INs. The program also identifies regions where a particular strain is more different from other strains than expected, given the distribution of polymorphism throughout the rest of the gene/locus/alignment. These are examples of allele exchange between a strain in the alignment and a strain not in the alignment (OUTs). When I used an alignment of the whole genomes of 13 strains as input, and counted how many nucleotides are recombinant from the output, I got very similar numbers for all strains. For example, for the INs I got numbers that ranged from 444,214 nt to 538,306 nt. This means that this many nucleotides in one particular strain were involved in a recombination event (either as donor or recipient, we will never know which). And for the OUTs I got numbers that ranged from 122,124 nt to 128,546 nt. This means that this many nucleotides in one particular strain were involved in a recombination event (probably as a recipient). As a control, when I randomized the order of nucleotides in the input alignment, no recombinant fragments were found. And when I have looked at regions of the alignment identified by the program, I do see evidence for recombination.
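
To make the logic of "more similar than expected" and of the shuffling control concrete, here is a deliberately simplified Python sketch. It is not GENECONV's algorithm or its statistics: it just takes the longest run of identical columns between two aligned sequences and asks how often a run at least that long appears when the column order is randomized. The function names, the number of permutations and the seed are arbitrary choices made for illustration.

```python
import random


def longest_match_run(seq_a, seq_b):
    """Length of the longest stretch of consecutive identical columns."""
    longest = current = 0
    for a, b in zip(seq_a, seq_b):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


def run_length_p_value(seq_a, seq_b, n_perm=200, seed=0):
    """Permutation p-value for the observed longest matching run.
    Columns are shuffled together, so the overall number of differences
    between the two sequences is preserved; only their order changes."""
    rng = random.Random(seed)
    columns = list(zip(seq_a, seq_b))
    observed = longest_match_run(seq_a, seq_b)
    as_long = 0
    for _ in range(n_perm):
        rng.shuffle(columns)
        shuffled_a, shuffled_b = zip(*columns)
        if longest_match_run(shuffled_a, shuffled_b) >= observed:
            as_long += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (as_long + 1) / (n_perm + 1)


# Toy pair: 20 identical columns in one block, then 20 differing columns.
a = "A" * 40
b = "A" * 20 + "C" * 20
print(longest_match_run(a, b), run_length_p_value(a, b))
```

Note that in this toy version the expectation comes only from the pair's own differences, with no reference to how the two strains are related to the rest of the sample. Whether GENECONV does something comparable is exactly the kind of detail worth checking in its documentation.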

Basically what these results mean is that all strains have similar numbers of nucleotides that have entered the genome through recombination events. But I am bothered by how similar the numbers are for each strain, and therefore find this result hard to believe. I think I need to learn a lot more about the GENECONV program.

Friday, March 7, 2008

Equivalent competence phenotypes?

I am working on the Discussion of our competence manuscript and am trying to think about whether all DNA uptake phenotypes might be functionally equivalent. This means that cells are equally happy if they each take up 100 molecules or only one molecule. Given what we know about competence I don't think that this is the case but we evolutionists like to consider neutral explanations along with selective ones. So here is the paragraph:

"An additional ultimate cause of variation in competence is that the majority of the variation is functionally neutral. This would be the case if it has been beneficial for cells to take up some DNA with the precise quantity being unimportant. Under this scenario, the high DNA uptake phenotype of strain Rd would be functionally equivalent to the low DNA uptake phenotype of strain Eagan. If we assume that all cells in a culture are competent then in our assays each Rd cell took up an average of 89 molecules of donor DNA whereas in Eagan each cell took up only one molecule of DNA. Each double-stranded donor molecule was 444 nucleotides (222 bp) so Rd and Eagan would have taken up 444 x 44 = 19,536 and 444 x 1 = 444 nucleotides, respectively. Considering the genome sizes of these bacteria are greater than 3.6 x 10^6 nucleotides (1.8 Mb), the DNA taken up by both strains would contribute little to whole genome replication. However, DNA is taken up by most strains when replication is needed for DNA repair and not whole chromosome copying. Depending on the extent to which cells need to use DNA replication during repair, every nucleotide taken up may help so the differences in competence between strains are likely to have a strong influence on the function of competence."

I am trying to think of obvious flaws in this reasoning. I am sure they are there but my brain is tired today.
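
One easy thing to check is the arithmetic. The following Python sketch is only that, a check: the constant and function names are invented here, the genome is taken as 1.8 Mb of double-stranded DNA, and because the draft paragraph mentions both 89 and 44 molecules per Rd cell, the molecule count is left as a parameter rather than fixed to either value.

```python
DONOR_LENGTH_BP = 222        # donor fragment length from the draft paragraph
GENOME_SIZE_BP = 1_800_000   # 1.8 Mb, i.e. 3.6 million nucleotides on two strands


def nucleotides_taken_up(molecules, length_bp=DONOR_LENGTH_BP):
    """Nucleotides in `molecules` double-stranded fragments (two strands each)."""
    return molecules * 2 * length_bp


def fraction_of_genome(molecules, genome_bp=GENOME_SIZE_BP):
    """Nucleotides taken up as a fraction of the nucleotides in one genome."""
    return nucleotides_taken_up(molecules) / (2 * genome_bp)


# One molecule (Eagan) and the two Rd values that appear in the draft.
for molecules in (1, 44, 89):
    print(molecules, nucleotides_taken_up(molecules),
          f"{fraction_of_genome(molecules):.4%}")
```

Whichever Rd value is right, the DNA taken up comes to about 1% of a genome's worth of nucleotides at most, so the "contributes little to whole genome replication" statement holds either way.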

Thursday, February 21, 2008

A forest of trees

How much recombination occurs in H. influenzae? Does more recombination occur in strains that are more competent?

The answers to these questions can be partly addressed by inferring the phylogenetic relationships of H. influenzae strains, something that I have been working on for a few months now. The first way I did this was to take different sized fragments of the genome, align the same fragment from all strains, and infer a topology for each fragment. When I determine which clades all tree topologies have in common, I find that none of the clades are present in a majority of topologies. What this means is that if we take the same section of the genome from each strain, the evolutionary history of this section will differ from that of another section. Recombination is a likely culprit here and statistical tests I've done so far do support recombination.

But there is a problem with this consensus tree approach. Not all of the genome fragments are the same length. This means that some topologies are based on an alignment that is >200 kbp while others are based on an alignment that is <200 bp. Clearly, the topology inferred using the lengthier dataset should be weighted more than the topology inferred using the shorter dataset. But weighting individual topologies when determining a consensus tree is not possible, as far as I know (if anyone knows differently, I would love to hear about it).
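
Short of a program that does this, one conceivable workaround is to tally clade frequencies by hand and weight each topology by the length of its alignment. The Python sketch below shows the idea under loud assumptions: each tree is represented simply as a set of clades (each clade a frozenset of strain names), the function names are invented, and this is not a method taken from any existing consensus software.

```python
from collections import defaultdict


def weighted_clade_support(trees):
    """trees: list of (clades, alignment_length) pairs, where `clades` is a
    set of frozensets of strain names. Returns each clade's share of the
    total alignment length among the trees that contain it."""
    total = sum(length for _, length in trees)
    support = defaultdict(float)
    for clades, length in trees:
        for clade in clades:
            support[clade] += length / total
    return dict(support)


def majority_clades(trees, threshold=0.5):
    """Clades whose length-weighted support exceeds the threshold."""
    return {clade for clade, share in weighted_clade_support(trees).items()
            if share > threshold}


# Toy example: one long alignment supports (A,B); two short ones support (A,C).
toy_trees = [
    ({frozenset({"A", "B"})}, 200_000),
    ({frozenset({"A", "C"})}, 200),
    ({frozenset({"A", "C"})}, 300),
]
print(majority_clades(toy_trees))
```

In the toy example an unweighted majority rule would favour (A,C), found in two of three trees, while the length-weighted tally favours (A,B). A caveat is that a long alignment can itself be a mosaic of different histories, so weighting by length is not obviously the right answer either.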

To get around this problem, all of the genome fragment alignments can be concatenated into one large alignment. I have done this and am currently inferring the topology using different methods such as parsimony, distance, and maximum likelihood. The parsimony and distance analyses produce a well-supported, fairly resolved phylogeny but the two methods do not agree on all clades. Now I am waiting for the outcome of the maximum likelihood analysis that may take a few days, given the alignment is over 1.8 Mbp. If all three methods produce different topologies I suppose I can believe the clades all of them agree upon, if there are any. Then I can have confidence in these clades and use them to map competence onto the phylogeny. But I am also using a completely different method implemented in ClonalFrame that will exclude regions that have been influenced by recombination and infers the phylogeny based on the clonal frame only. We will see what all these methods agree upon, if anything. Nevertheless the outcome will be interesting.
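
The concatenation step itself is simple. A minimal Python sketch follows, assuming each fragment alignment is a dictionary mapping strain name to an aligned sequence; the function name is invented, and padding a strain that is missing from a fragment with gaps is my assumption, not necessarily what was done for the real dataset.

```python
def concatenate_alignments(fragments, strains, gap="-"):
    """Join per-fragment alignments end to end into one alignment.
    fragments: list of dicts mapping strain name -> aligned sequence.
    A strain absent from a fragment is padded with gap characters."""
    parts = {strain: [] for strain in strains}
    for fragment in fragments:
        lengths = {len(seq) for seq in fragment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a fragment must have equal length")
        width = lengths.pop()
        for strain in strains:
            parts[strain].append(fragment.get(strain, gap * width))
    return {strain: "".join(pieces) for strain, pieces in parts.items()}


# Toy fragments (hypothetical sequences); Eagan is missing from the second one.
fragments = [
    {"Rd": "ACG", "Eagan": "ACT"},
    {"Rd": "TT"},
]
print(concatenate_alignments(fragments, ["Rd", "Eagan"]))
```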

Tuesday, January 29, 2008

Scientist doping

I am currently sitting in on an ethics class because NIH requires that I do so if I want to accept their funding. I had an ethics class as a graduate student and we talked about all of the usual things like stem cell research, gene therapy, experimentation with animals, etc. The ethics class I am in now is a bit different, which is nice because I can think about some new issues. The issue up for discussion yesterday was sports doping, which is when athletes take drugs to enhance their natural abilities. Most people find this unethical because athletes should perform using only what is available to them naturally. Enhancement is considered to be unfair and a sign of bad sportsmanship. But I couldn't help but compare the enhancement of athleticism using steroids to the enhancement of brain power/alertness using caffeine. Every day I enhance my alertness by drinking tea or coffee. Does this mean that I have an unfair advantage in science because I am doping my brain with caffeine? I would say no but only because the majority of scientists I know rely on caffeine themselves. Basically, we are all enhancing our natural abilities. But no one has a problem with this. So I guess caffeine enhancement of scientists is not as unethical as steroid enhancement of athletes. Why is this?

Thursday, January 17, 2008

Recombination

In my previous post I alluded to the lack of resolution in the H. influenzae phylogeny. I have now attempted to use three different methods of phylogenetic reconstruction (maximum likelihood, distance, and parsimony) and, due to some characteristics of the datasets, only parsimony was feasible. From the whole genome alignments I had 101 alignment files. A phylogeny was inferred for each of these alignment files using parsimony and 100 bootstrap pseudoreplicates. Most clades in each of these topologies had poor bootstrap support. Then, a consensus tree was inferred from the 101 trees. Guess what? Absolutely no resolution of any clades, nothing, a "star" phylogeny.

Is it due to recombination? Probably. I tested for recombination using a parsimony based approach (Bruen et al. 2006. Genetics 172:2665) in each of the 101 alignment files (so within each alignment region, not between alignment files) and found that only 11 alignment files did NOT have evidence of recombination. The lack of evidence for recombination in all but one of these is probably due to their small size; most of the 11 were under 1 kb.

What does this all mean? Recombination was previously thought to be rare in H. influenzae, based on linkage between phenotypes and multi-locus enzyme electrophoresis types. But sequence data are changing what we know about the population structure of H. influenzae. MLST studies have already shown modest incongruence between gene trees, and fancy Bayesian statistics have shown that the recombination rate is moderate. My work is now showing that recombination occurs often enough to obliterate phylogenetic signal. My next question is: is it mostly due to transformational recombination?