Performed the experiments: MD CY RLH. Analyzed the data: MD CY RLH. Wrote the paper: MD CY. Conceived and designed method: SSTY. Normalizations of method: MD. Programming: MD QL. Helped write the paper: RLH SSTY.

The authors have declared that no competing interests exist.

Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyze the phylogeny of a whole genome. This motivates us to create a new approach to characterizing genetic sequences.

To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondria. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses.

Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than with multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas multiple alignment methods require realignment whenever new sequences are added. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.

Computational and statistical clustering methods have been successfully applied to DNA sequences, protein sequences and microarray data.

In this paper we propose a method of characterizing DNA sequences that uses a specific mathematical description of the distributions of nucleotides in a DNA sequence to represent the biological information in the sequence. To each DNA sequence we associate a natural sequence of parameters, called a natural vector, describing the numbers and distributions of nucleotides in the sequence. We show that the correspondence between a natural vector and a DNA sequence is one-to-one. The natural distance between two genes is the distance between their corresponding natural vectors. This creates a genome space with a biological distance, which allows us to do phylogenetic analysis in the most natural and easy manner. This alignment-free method is much faster than conventional multiple sequence alignment methods. Multiple sequence alignment (MSA) can be seen as a generalization of pairwise sequence alignment in which, instead of aligning two sequences, k sequences are aligned simultaneously. MSA is the most powerful method for analyzing genetic sequences, and most state-of-the-art algorithms are built on it.

For our first application, we analysed the new influenza A (H1N1) virus based on the whole genome (

We apply our method to analyze 59 influenza viruses based on their whole genomes. The natural vector and the hierarchical clustering methods are used to reconstruct the phylogenetic tree for nucleotide sequences of the whole genome sequences of selected influenza viruses. The selected viruses are chosen to be representative from among all available relevant sequences in GenBank. Sequences have both high and low divergence to avoid biasing the distribution of branch lengths. Strains are representative of the major gene lineages from different hosts. The robustness of individual nodes of the tree is assessed using a bootstrap resampling analysis with 1000 replicates shown in

All HRV data are provided in

We applied our method to analyze 31 mammalian mitochondrial genomes. Our clustering analysis correctly groups all 31 genomes into 7 known clusters: Erinaceomorpha (cluster 1), Primates (cluster 2), Carnivora (cluster 3), Perissodactyla (cluster 4), Cetacea and Artiodactyla (cluster 5), Lagomorpha (cluster 6) and Rodentia (cluster 7). Data are provided in

It is clear that swine flu viruses are not clustered correctly using the ML method (

Number | Genome name on the tree | GenBank ID |
1 | Human | V00662 |
2 | pigmy chimpanzee | D38116 |
3 | common chimpanzee | D38113 |
4 | Gibbon | X99256 |
5 | Baboon | Y18001 |
6 | vervet monkey | AY863426 |
7 | Macaca thibetana | NC_002764 |
8 | bornean orang-utan | D38115 |
9 | sumatran orang-utan | NC_002083 |
10 | Gorilla | D38114 |
11 | Cat | U20753 |
12 | Dog | U96639 |
13 | Pig | AJ002189 |
14 | Sheep | AF010406 |
15 | Goat | AF533441 |
16 | Cow | V00654 |
17 | Buffalo | AY488491 |
18 | Wolf | EU442884 |
19 | Tiger | EF551003 |
20 | Leopard | EF551002 |
21 | indian rhinoceros | X97336 |
22 | white rhinoceros | Y07726 |
23 | black bear | DQ402478 |
24 | brown bear | AF303110 |
25 | polar bear | AF303111 |
26 | giant panda | EF212882 |
27 | Rabbit | AJ001588 |
28 | Hedgehog | X88898 |
29 | Dormouse | AJ001562 |
30 | Squirrel | AJ238588 |
31 | blue whale | X72204 |

As an application, we first use our method to analyze the new influenza A (H1N1) virus. Recent reports of widespread transmission of swine-origin influenza A (H1N1) viruses in humans in Mexico, the United States, and elsewhere, highlighted this ever-present threat to global public health

In addition, we applied our approach to study another group of viruses, human rhinovirus (HRV). Infection by HRV is a major cause of upper and lower respiratory disease worldwide and displays considerable phenotypic variation. Recently, Palmenberg et al.

As another biological application, we consider the phylogeny of mitochondrial genomes. Mitochondrial DNA is not highly conserved and has a rapid mutation rate, so it is very useful for studying the evolutionary relationships of organisms.

In the

In order to compare the computation time of the natural vector method and state-of-the-art methods ClustalW2, MUSCLE and MAFFT

In this paper, we report a new mathematical method that characterizes a genetic sequence as a natural vector, so that we can perform clustering analysis and build a phylogenetic tree from it. We introduce a natural vector system to represent a DNA sequence, and we prove mathematically that the correspondence between a DNA sequence and its natural vector is one-to-one. With this natural vector system, each genome sequence can be represented as a multidimensional vector. Genomes with a close evolutionary relationship and similar properties are placed close to each other when we construct the phylogenetic tree. The method thus provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships, and it handles whole or partial genomes more easily and quickly than multiple alignment methods. There are four major advantages to our method: (1) Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for any subsequent application, whereas multiple alignment methods require realignment whenever new sequences are added. (2) One can perform a global comparison of all genomes simultaneously, which no other existing method can achieve. (3) Our method is quicker than alignment methods and easier to manipulate, because not all dimensions of the natural vectors are needed for computation. Instead, the first several dimensions of the natural vectors are sufficient to cluster DNA sequences or genomes. Generally, we select the first

Although the natural vector method can be used to reconstruct the phylogenetic trees of DNA sequences, genes and whole genomes, it may not be a suitable substitute for local multiple sequence alignment when one wants to identify the similarity of genomic subsequences and does not know a priori which subsequences to identify.

We first introduce normalized central moments, which are the most important part of the natural vector method. Normalized central moments are defined as follows:

The method described below is intended to give a complete description of the distribution of the four nucleotides A, C, G and T.

The quantities of the four nucleotides: A, C, G and T of a DNA sequence are chosen as the first four parameters of the natural vector. Four integers

The second group of numerical parameters which are a part of the natural vector are the mean values of total distance, one for each of the four nucleotide bases:

As a simple illustration for the DNA sequence GTTCAATACT: The total distance of A is

The final group of parameters that we include in the natural vector are composed of normalized central moments. The first normalized central moment is:^{th} moment will be <0, 0,

Obviously, higher moments converge to 0 for a random generated sequence since for any given ^{th} moment will converge to 0.
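As a hedged illustration, the three groups of parameters described above (nucleotide counts, mean positions, and normalized central moments) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' Matlab implementation. It assumes that positions are measured from the first nucleotide, taken as the origin (position 0), and that the j-th normalized central moment of nucleotide k has the form $\sum_i (s_i - \mu_k)^j / (n_k^{j-1} n^{j-1})$, where $n_k$ is the count of k and $n$ is the sequence length. The function name `natural_vector`, the parameter `max_moment`, and the ordering of the components are illustrative choices.

```python
NUCLEOTIDES = "ACGT"


def natural_vector(seq, max_moment=2):
    """Return [n_A..n_T, mu_A..mu_T, D_2^A..D_2^T, ..., D_m^A..D_m^T].

    Assumptions (not confirmed by the text): positions start at 0 for the
    first base, and moments are scaled by n_k**(j-1) * n**(j-1).
    """
    seq = seq.upper()
    n = len(seq)

    # Collect the positions at which each nucleotide occurs.
    positions = {k: [] for k in NUCLEOTIDES}
    for i, base in enumerate(seq):
        if base in positions:
            positions[base].append(i)

    # First group: how many times each nucleotide occurs.
    counts = [len(positions[k]) for k in NUCLEOTIDES]

    # Second group: mean position of each nucleotide (0.0 if absent).
    means = [
        sum(positions[k]) / len(positions[k]) if positions[k] else 0.0
        for k in NUCLEOTIDES
    ]

    # Third group: normalized central moments of order 2..max_moment.
    moments = []
    for j in range(2, max_moment + 1):
        for k, mu in zip(NUCLEOTIDES, means):
            n_k = len(positions[k])
            if n_k == 0:
                moments.append(0.0)
                continue
            scale = (n_k ** (j - 1)) * (n ** (j - 1))
            moments.append(sum((s - mu) ** j for s in positions[k]) / scale)

    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("GTTCAATACT"))
```

Under this 0-based convention, A occurs in GTTCAATACT at positions 4, 5 and 7, so its count is 3 and its mean position is 16/3. If positions were counted from 1 instead, every mean would shift by 1 while the counts and central moments would be unchanged.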

We have used the natural vector to obtain a good numerical characterization of a DNA sequence. We now discuss the construction of natural vectors for genomes. A linear single-strand genome is treated as a linear DNA sequence. For a circular single-strand genome, we treat every point as the starting point and then take the average. A general double-strand genome is treated as two single-strand genomes, and we then take the average. More details are discussed in
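The averaging rules for circular and double-strand genomes can be sketched as follows. This is a naive illustration under assumptions: the second strand of a double-strand genome is taken to be the reverse complement of the first, `vectorize` stands for any function mapping a sequence to a fixed-length numeric vector (such as a natural vector routine), and all names here are hypothetical. Averaging over every rotation costs one vectorization per starting point, so this form is only practical for short sequences.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def average_vectors(vectors):
    """Component-wise mean of equally long numeric vectors."""
    vectors = list(vectors)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def circular_vector(seq, vectorize):
    """Average the vectors obtained from every possible starting point."""
    rotations = (seq[i:] + seq[:i] for i in range(len(seq)))
    return average_vectors(vectorize(r) for r in rotations)


def reverse_complement(seq):
    """Complement each base, then reverse the strand (assumed second strand)."""
    return seq.translate(COMPLEMENT)[::-1]


def double_strand_vector(seq, vectorize):
    """Average the vectors of a strand and its reverse complement."""
    return average_vectors(
        [vectorize(seq), vectorize(reverse_complement(seq))]
    )
```

A circular double-strand genome could combine both helpers by passing `lambda s: circular_vector(s, vectorize)` as the vectorizer to `double_strand_vector`, although whether the original method composes the two averages in this order is an assumption.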

A central result of this paper is that the correspondence between a DNA sequence and its natural vector is provably one-to-one (see

The Euclidean distance between two sequences

We used Matlab to calculate the natural vectors of genes and genomes. The package HCLUST of R language (for algorithmic details, please refer to
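To make the distance and clustering step concrete, here is a minimal Python sketch that computes Euclidean distances between vectors and merges clusters agglomeratively. It is an illustration only: the linkage rule used in the original analysis is not stated in this section, so average linkage is an assumption, and the function names `euclidean`, `distance_matrix` and `average_linkage` are hypothetical. The naive search over all cluster pairs is suitable only for small inputs.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Map every ordered pair of names to the distance between their vectors."""
    names = list(vectors)
    return {(a, b): euclidean(vectors[a], vectors[b]) for a in names for b in names}


def average_linkage(vectors):
    """Naive agglomerative clustering (average linkage is an assumption).

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of names.
    """
    dist = distance_matrix(vectors)
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                d = sum(dist[x, y] for x in a for y in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

The merge history can be read as a tree: each entry joins two existing clusters at the reported distance, and the last entry is the root. Because the vectors are computed once per sequence, adding a new sequence only requires its distances to the stored vectors rather than a new alignment.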

Supporting Information S1 contains the complete proof of the correspondence theorem, the bootstrapping analysis of the influenza A (H1N1) genomes, the distribution of the distance between each pair of randomly shuffled genomes under simulation, the clustering of the segmented gene PB2, a computation time chart for the natural vector method, ClustalW2, MUSCLE and MAFFT, and the GenBank IDs of the data used in this paper.

(PDF)

We thank Dr. Max Benson for critically reading and editing the manuscript. We also thank the editor and anonymous reviewers for their thorough review and constructive comments.