Fig. 4. An example of a data set that cannot fit onto a conventional phylogenetic tree diagram. Positions where there is no homologous residue are represented by a dot. (b) Results of an all-versus-all Blast search with all proteins represented by a node and all significant hits represented by an arc drawn between nodes. According to these perspectives, the community of descent that unites complete genes with complete genes corresponds to objects such as the branches on phylogenetic trees or networks when these structures have been constructed from genes that are homologous along their entire length (Li et al. 2003) and where the genes have not been remodeled by illegitimate recombination throughout their history. The four sequences at the bottom of the alignment, identified by the brown taxon labels on the alignment, brown nodes on the network, and brown tip labels in the tree, have an 18-amino acid stretch that is clearly homologous among these four sequences and is absent in the other six sequences.

Relied upon exclusively, it prevents us from investigating those non-tree-like evolutionary events and relationships that could be revealed through a more pluralistic view of homology. See supplementary figure S1 (Supplementary Material online) for a pie chart representation of the proportions of noncomposite, composite, and multicomposite genes in each community. Databases such as HomoloGene (last accessed December 10, 2013) and COG (last accessed December 10, 2013) only hold genes that are allowed to be in one family. The sequence similarity network displays the significant similarity results from a Blast search of the collection of proteins against one another.
Three Homology Models
In terms of the homology concept and delineating homology groupings, a fundamental problem lies in the a priori model that we apply to our approach. The first consequence is that the organ should be clearly defined as a 1:1 correspondence. Arcs drawn as dashed lines reflect those edges that are removed in a standard TribeMCL (Enright et al. 2002) analysis. (c) A phylogenetic tree inferred from the alignment.

We choose the word tribes because this is the original meaning of the word phylogeny (from the Greek Phylos meaning "tribe" and Genis meaning "origin"; Sapp 2009). Its biological implications are potentially huge because it has been proposed that introgression of domains has resulted in the evolution of various signaling systems (Apic and Russell 2010) and a correlation has been suggested between the prevalence of proteins with multidomain architectures and organismal complexity (Apic, Gough, Teichmann 2001a).

Traditionally, evolutionary biologists have used the word "a" in the STT sense (O'Hara 1997) or the PNT sense and judge that it means "one." For both of these perspectives, the definition of homology can only mean that homologs must trace back to a single common ancestor without gene remodeling by sharing of DNA from other lineages. Finally, we have "goods thinking" (GT) that sees evolutionary history as being characterized by the vertical and horizontal transmission of genetic goods, allowing introgressive evolutionary events (e.g., legitimate and illegitimate recombination events, fusion, fission, etc.) and depicting relationships between sequences in a more pluralistic manner (McInerney et al. 2011; Bapteste et al. 2012). GT is the least conservative perspective and is the main focus of this manuscript. Two genes that share a single domain and whose common ancestor had quite a different structure are not considered to be homologous in their model. This problem affects both the homology concept and the homology definition. Therefore, such tribes of sequences are likely to be amenable to phylogenetic tree or network construction using standard software currently available (Felsenstein 2004; Huson and Scornavacca 2011). Most algorithms would run quickly if genetic data had genuinely evolved in a tree-like way. This does not happen, so MCL divides up the gene family into three tribes. However, we argue here that there are additional important relationships beyond those found in epaktologs (see later). Nodes represent communities as identified using a single pass of the Louvain algorithm.
Defining Homologs Meets Different Kinds of Problems
The lack of agreement in how to define homologs (Fitch 2000; Enright et al. 2003; Li et al. 2003; Wong and Ragan 2008; Majumdar et al. 2009; Dessimoz et al. 2012; Miele et al. 2012) reflects the historical ideas concerning homology and the attempt to fit notions that were developed for one purpose (morphological systematics and comparative anatomy) to data that are only obliquely related to this purpose.

In these tribes, all members recognize all other members. The four sequences shaded in brown contain a conserved 18-amino acid stretch that has either been gained by these sequences or lost in the others.

In fact, the majority of the literature from that time to the present day suggests that homology is a term that specifically refers to genes or proteins that manifest significant sequence similarity along the majority of their length. Table 1. An Illustration of Four Hypothetical Genes That Manifest a History of Introgressive Processes. In practice, only "significant" levels of sequence similarity are considered at all, and these significant similarities are likely to reflect homologous relationships because they are too similar to have arisen by random chance. Recently, Song et al. (2008) offered a good example of this when they asserted the restrictive caveat that homologous genes must be descended from a common ancestor that had the same multidomain structure as contemporary sequences. Is a gene that has a transposon inserted into the middle no longer considered to be a member of this family? We think that the most efficient way to reduce the risk of error and to really account for evolutionary relationships between sequences is to realize where the most fundamental problem lies. This approach is hugely successful, garnering well in excess of 1,500 citations at the time of writing.

Genes evolve by point mutation, legitimate and illegitimate recombination, exon shuffling, fusion, fission, invasion by selfish mobile elements, domain replacement, and so forth.
Its members display family resemblances, as they can be connected through intermediates and relationships of GT homology (see main text). Additionally, focusing on different aspects of sequence relationships, that is, the homology of entireties or of parts, leads to different inferences of relationships and, consequently, to a lack of consensus.
This perspective sees that the important, perhaps the only, relationships are those that have arisen along a diversifying phylogenetic tree, and events such as residue substitution and small indel events account for the changes between sequences. N-rooted fusion networks: A new kind of network that depicts rooted networks with at least one fusion node and at least two roots. The tree on the left is the tree recovered from a concatenated data analysis and rooted arbitrarily on the internal branch separating the COG1123 proteins from the rest. The homologies are not the same in both directions if the proteins are of unequal length. It would be a mistake to consider such a change in phrasing merely as a matter of rhetoric.

This last point, we feel, is particularly important. We propose that methods for defining homologous genes (gene families) that require homology to extend along most of the sequence (Miele et al. 2012) might be described as the search for "tribes" of proteins.

We might consider that at one extreme there are tribes of sequences that are mostly isolated "closed" tribes (fig. 1, Family A) but that there are also tribes that are more "open" in terms of tribal mergers and divisions (fig. 1, Family B; Boucher and Bapteste 2009). So far, this GT perspective has not been explored much. The problem we have with molecular sequence data is that we now know that a great number of molecular sequences are related to a great many other molecular sequences with varying amounts of structural (e.g., domain content) similarity (Adai et al. 2004; Halary et al. 2010; McInerney et al. 2011; Bapteste et al. 2012; Alvarez-Ponce et al. 2013). Consider the thought experiment where we have four proteins (see table 1): each protein has two domains, and we have four domains in total.

Though it is perfectly reasonable to imply that convergently remodeled proteins with similar structures cannot be true orthologs or paralogs, they are homologs, nonetheless.
In other words, our current knowledge of the diversity of evolutionary processes means that the generally agreed upon concept of homology needs revision and clarification, and other concepts such as family resemblance need to be introduced. Subnetworks of four communities are displayed around the figure.

These communities have been chosen along the range of composite relationship (from light green to light red) to illustrate the variety of community structures. Recently, there has been an increased focus on the problems that domain shuffling in particular has created for efforts to distinguish orthologs and paralogs from sequences that appear to be orthologous and paralogous, when in fact they are not.
The color coding of the sequences on the Blast graph, the alignment, and the phylogenetic tree reflects how MCL would carve up the data. A sequence similarity network is composed of nodes and edges, with the nodes representing gene or protein sequences and the edges representing some measure of similarity between the sequences. This perspective has some consequences for the breadth and depth of analyses that can be carried out. As can be seen, not all genes show significant sequence similarity with all other genes according to this analysis. These open tribes are not readily analyzed using current phylogenetic methods, because the components of some of the sequences have separate origins and separate roots (in our toy example, the black, blue, yellow, and red gene parts all have separate roots). Family A is a closed family shown to evolve according to a strict tree-like process; Family B is an open family that evolves by horizontal and vertical evolutionary processes.
The distinction between the two different kinds of evolutionary trajectory is of course important; however, it does seem to confuse the notion of homology being the concept of relationship through common ancestry, irrespective of how subsequent introgressive events have changed the overall domain neighbourhood.

PNT is extremely useful for analyzing legitimate recombination (Huson and Bryant 2006) and understanding incongruence in gene or genome histories. TribeMCL: One of the most successful approaches to finding communities in networks of gene similarity. The most widely used method of allocating genes to a gene family is the Markov Clustering Algorithm (MCL) (Enright et al. 2002), which simulates flow through a network of sequence similarity and cuts the network at those places where flow is most restricted.
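The flow simulation described above can be illustrated with a toy sketch. It is not the MCL or TribeMCL software, only a minimal pure-Python version of the two alternating steps usually described for Markov clustering (expansion and inflation), and it assumes an unweighted similarity network. The function names, the added self-loops, the tolerance, and the example graph are all assumptions made for this illustration; the inflation value of 2.0 matches the default mentioned in the text.

```python
def adjacency_from_edges(n, edges):
    """Build a symmetric 0/1 adjacency matrix for n nodes (hypothetical helper)."""
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = matrix[j][i] = 1.0
    return matrix


def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (a column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(matrix):
    """Expansion: square the matrix, letting flow spread along longer paths."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(matrix, power):
    """Inflation: raise entries to a power and renormalize, favouring strong flow."""
    return normalize_columns([[value ** power for value in row] for row in matrix])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering; returns clusters as sorted lists of node indices."""
    n = len(adjacency)
    # Self-loops are added so that no column is empty (an assumption of this sketch).
    matrix = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        updated = inflate(expand(matrix), inflation)
        change = max(
            abs(updated[i][j] - matrix[i][j]) for i in range(n) for j in range(n)
        )
        matrix = updated
        if change < tol:
            break
    # Rows that still carry flow act as attractors; the columns they reach form a cluster.
    clusters = {
        frozenset(j for j in range(n) if row[j] > tol)
        for row in matrix
        if max(row) > tol
    }
    return sorted(sorted(cluster) for cluster in clusters)


# Two triangles of mutually similar sequences joined by a single hit (2-3).
bridged = adjacency_from_edges(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
)
print(mcl(bridged))
```

In a sketch like this, every sequence ends up in at least one cluster, and a weak connection between two dense groups is the kind of place where flow is restricted. Which relationships are discarded depends on the inflation value and on the network itself.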

GT (goods thinking): A perspective that sees homology relationships encompass illegitimate recombination, fusion, and fission of evolving entities in addition to vertical descent. Nodes from these insets are colored in green for noncomposite sequences, yellow for composite sequences, and red for multicomposite sequences, that is, composite sequences whose component genes are themselves composites.

Then, our concept of homology is quite different and allows us to analyze a greater number of evolutionary events and relationships, though we must be much more careful about what we claim about these evolving entities. In the case of Family B, it would be standard practice to split the family into four tribes to carry out phylogenetic analyses, thereby missing out the context in which the entire family has evolved. However, an ontological premise for this method is that a gene can only belong to one homologous family; the method explicitly does not allow a gene to belong to more than one family. Paralogs can trace their most recent common ancestor to a duplication event, again with the expectation that the most recent common ancestor will have had a similar structure.

Thus, the consensus among molecular biologists became that similarity was defined as quantitative by comparing the sequences in question, but that homology was qualitative: sequences are homologs or they are not.
Although the philosophy of the approach (clearly influenced by the underlying assumption that gene evolution might be tree-like and takes place independently in different families) has not been explored extensively in the literature, we will argue that the effect of this algorithm is principally to enforce a tree-based viewpoint on gene families.

This introduces persistent issues in homology definition that can best be overcome first by adopting more realistic starting assumptions on how genes evolve, second by adopting new concepts of homology, and third by adjusting our methods accordingly. We ran BlastP (Altschul et al. 1997) and then passed the data through the MCL software (Enright et al. 2002) using default parameters.
Given that discussions of the pruned parts of alignments rarely make their way into the final manuscript, we have no clear idea how often these nonconforming data sets arise as a result of introgression and gene family membership that involves more than one family.

Gene1 has domains A and B, Gene2 has domains B and C, Gene3 has domains C and D, and Gene4 has domains A and D. All four proteins have different kinds of relationships to the others that cannot be described by an “all or nothing” model.
TRIBES: Homologs that have a 1:1 correspondence in terms of being homologous for most or all of their length. Note: Each gene consists of two domains; the colors are the same for homologous domains. We will refer to this thought experiment when dealing with real data in “case 4” later.
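The relationships in this thought experiment can be enumerated with a short sketch. The dictionary `GENES`, the helper names, and the decision to treat any shared domain as an edge are assumptions made for illustration only; they are not taken from any published pipeline.

```python
from itertools import combinations

# Domain content of the four hypothetical genes in the thought experiment.
GENES = {
    "Gene1": {"A", "B"},
    "Gene2": {"B", "C"},
    "Gene3": {"C", "D"},
    "Gene4": {"A", "D"},
}


def shared_domains(genes):
    """Return {(gene_x, gene_y): set of shared domains} for every gene pair."""
    return {
        (x, y): genes[x] & genes[y]
        for x, y in combinations(sorted(genes), 2)
    }


def connected_components(genes, pairs):
    """Group genes linked, directly or through intermediates, by shared domains."""
    neighbours = {gene: set() for gene in genes}
    for (x, y), domains in pairs.items():
        if domains:  # any shared domain counts as an edge (assumption)
            neighbours[x].add(y)
            neighbours[y].add(x)
    seen, components = set(), []
    for start in sorted(genes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            gene = stack.pop()
            if gene not in component:
                component.add(gene)
                stack.extend(neighbours[gene] - component)
        seen |= component
        components.append(component)
    return components


pairs = shared_domains(GENES)
for pair, domains in pairs.items():
    print(pair, sorted(domains) if domains else "no shared domain")
print(connected_components(GENES, pairs))
```

Gene1 and Gene3 share no domain, and neither do Gene2 and Gene4, yet all four genes fall into one connected component. No pair is homologous along its entire length, so an approach that demands full-length similarity would recognize no family here at all.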

Most phylogenetic software programs today require such rectangular matrices, and if the sequence data do not fit into a matrix, then the user has two choices: either add characters to show "missing" data or prune the data until it becomes rectangular (Capella-Gutierrez et al. 2009). Therefore, there is an implicit assumption that data matrices should look like this and an explicit requirement that the data are made to look this way. The idea behind clustering approaches such as MCL is that unimportant relationships as defined by small, common, promiscuous domains can be safely deleted, leaving the more important relationships, and these can be used to define families. We note that this fits well with the objective of such programs as TribeMCL (Enright et al. 2003). In continuing with the etymology of the word phylogeny, we wish to point out, however, that tribes are known to split and merge with other tribes, to subsume, and to be subsumed. This is probably the most commonly understood definition of homology, and it is certainly the focus of many software tools and algorithmic developments. Short gene length reduces the possibility that Blast can detect significant sequence similarity. Here we define three sets of models, and we discuss how these models can affect notions of homology.

Although analyzing homology along the entire length of a sequence is somewhat akin to a tribal origin analysis (a phylogenetic analysis of that tribe), it is by no means the only way that we can look at homology. However, if we interpret "a" in the GT sense (McInerney et al. 2011), "a common ancestor" means "at least one" ancestor in common with other proteins. However, using Clustal Omega (Sievers et al. 2011), the alignment shown in figure 2 can be produced, and using FastTree with the default parameters (Price et al. 2010), the tree shown in figure 2 can be produced from that alignment. In the following three examples, we use a standard set of analytical tools to demonstrate how our views of what constitutes a homologous family are influenced by the use of such heuristic approaches. In fact, no sophisticated algorithm would be necessary at all, as the gene families could be easily parsed from an all-versus-all gene similarity search and, assuming the search was sensitive enough, they would be expected to fit into their respective families. If a gene loses an exon and is now quite different in length from other members, then is it no longer considered to be a member of this family? However, in the event that two genes or proteins look similar because they have been independently assembled through domain shuffling, they will not fulfill these criteria.

Fig. 5. GCC from all-against-all BlastP search of 15 eukaryotic genomes. The Blast network also shows an analysis of what happens if the MCL software (Enright et al. 2002) is used to identify homologs with the default inflation value set at 2.0. MCL cuts this graph into three tribes. Node area reflects the size of the community, and edge thickness is the square root of the number of edges connecting two nodes, with the exception of the largest edge, whose size is represented by a thickness five times smaller (corresponding to 220,000 edges instead of the actual 1,100,000).

This is because it is assumed that either there are "natural" discrete families and the relative strength of association between a gene and its family will emerge from the analysis or that some relationships are more important than others and the minor relationships can be dismissed as relatively unimportant.
This sentiment is often used in the teaching of evolutionary biology classes and indeed is often quoted. In addition, there is considerable length variation at the N- and C-termini of the sequences. This is the only situation where “percent homology” has a legitimate meaning and, even there, it is dangerous and better called, as Hillis has suggested, partial homology. Therefore, the phrase “partial homology” needs to be used with care and should only mean that “this part (X%) of sequence 1 is homologous to that part (Y%) of sequence 2.” In this case, some parts of sequences 1 and 2 do have a common ancestor, but we are implicitly acknowledging that their last common ancestor is not also a common ancestor of sequences 1 and 2 in their entirety. Gene length can vary from dozens of nucleotides (the shortest human gene is 252 nucleotides in length) to several hundreds of thousands of nucleotides. These other debates illustrate how new data and new understandings of phylogeny often necessitate new usage of terms and clarification of concepts and models.
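The X% and Y% in the "partial homology" statement can be made concrete with a small sketch. It assumes, for simplicity, that the homologous part has the same length in both sequences (no indels within it); the function name and the example lengths are hypothetical and chosen only for illustration.

```python
def partial_homology(shared_length, length_1, length_2):
    """Return (X, Y): the percentage of each sequence covered by the shared part."""
    if length_1 <= 0 or length_2 <= 0:
        raise ValueError("sequence lengths must be positive")
    if not 0 <= shared_length <= min(length_1, length_2):
        raise ValueError("shared part cannot be longer than either sequence")
    return (100.0 * shared_length / length_1, 100.0 * shared_length / length_2)


# Hypothetical example: an 18-residue homologous stretch in proteins of unequal length.
x, y = partial_homology(shared_length=18, length_1=120, length_2=360)
print(f"{x:.0f}% of sequence 1 is homologous to {y:.0f}% of sequence 2")
```

Because the two percentages differ whenever the sequences differ in length, a single "percent homology" figure cannot describe the relationship in both directions.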

The authors have been careful to disclose that this method should be used with care, and indeed, appropriate usage of MCL for conservative analyses of particular kinds of homologs is expected to result in few if any errors. A reading of the literature today would corroborate the feeling that the practical level seems to be the one at which the problems of "defining" homologous genes lie, though in fact, the problems have much deeper ontological roots. The different parts of a protein-coding gene might themselves be homologs of one another and may have arisen by means of tandem duplication or introgression of previously spatially separated DNA sequences (Bapteste et al. 2012). Even within morphology, it has been recognized that partial homologies offer a much broader view of evolution (Sattler 1984). It can be expected that the sequence in this common ancestor was not significantly different in domain architecture to the orthologs we observe today, though it is not clear how different is too different. Reeck et al. (1987) pointed out that

The importance of ontogeny notwithstanding, of particular conceptual interest is the notion of genetic piracy (Roth 1988), in which homology of some morphological character persists despite the genetic basis of the trait changing more or less completely over evolutionary time.