Meiotic recombination is a fundamental biological event and one of the principal evolutionary forces responsible for shaping
genetic variation within species. In addition to its fundamental role, recombination is central to several critical applied
problems. The most important example is “association mapping” in populations, which is widely hoped to help find genes that
influence genetic diseases [3, 4]. Hence, a great deal of recent attention has focused on problems of inferring the historical
derivation of sequences in populations when both mutations and recombinations have occurred. In the algorithms literature,
most of that recent work has been directed to single-crossover recombination. However, gene-conversion is an important, and more common, form of (two-crossover) recombination which has been much less investigated in the algorithms
literature.
In this paper we explicitly incorporate gene-conversion into discrete methods to study historical recombination. We are concerned
with algorithms for identifying and locating the extent of historical crossing-over and gene-conversion (along with single-nucleotide
mutation), and problems of constructing full putative histories of those events. The novel technical issues concern the incorporation
of gene-conversion into recently developed discrete methods [20, 26] that compute lower and upper-bound information on the amount of needed recombination without gene-conversion. We first examine the most natural extension of
the lower bound methods from [20], showing that the extension can be computed efficiently, but that this extension can only
yield weak lower bounds. We then develop additional ideas that lead to higher lower bounds, and show how to solve, via integer-linear
programming, a more biologically realistic version of the lower bound problem. We also show how to compute effective upper
bounds on the number of needed single-crossovers and gene-conversions, along with explicit networks showing a putative history
of mutations, single-crossovers and gene-conversions.
We validate the significance of these methods by showing that they can be effectively used to distinguish simulation-derived
sequences generated without gene-conversion from sequences that were generated with gene-conversion. We apply the methods
to recently studied sequences of Arabidopsis thaliana, identifying many more regions in the sequences than were previously identified [22], where gene-conversion may have played
a significant role. Demonstration software is available at www.cs. ucdavis.edu/~gusfield.