On the origin of COVID-19.
This article gives an introduction to comparing viruses from the Coronaviridae family and using their similarities to build a phylogenetic tree. For more complete information, I highly recommend this article from Nature Medicine.
Phylogenetic Tree:
A phylogenetic tree (sometimes called an evolutionary tree) is a branching diagram showing the relationships between different species or individuals based on observed differences. Traditional models were essentially based on quantified morphological or phenotypic traits (such as beak length in birds), while more recent approaches mostly focus on nucleotide or amino acid sequences (DNA or protein sequences).
Here we are going to use genetic variation to build this tree from the RNA sequences of different samples of viruses from the same family. The samples are isolates taken from different host organisms, so we expect some differences.
Coronaviruses-Coronaviridae-SARS-CoV.
Before we start, we should discuss the taxonomy of SARS-CoV-2, the virus responsible for the disease called COVID-19. This virus is part of the much bigger Coronaviridae family, itself part of the order Nidovirales. The next figure illustrates this relationship alongside the equivalent taxonomic ranks for modern humans:
Another important fact about the Coronaviridae family is that its members tend to “jump” from one species to another. The jump itself is called a “spillover event”, and when it goes from a non-human host to a human host the resulting infection is called a zoonosis.
This occurred, for example, in 2002–2003 when SARS-CoV-1 was transmitted from horseshoe bats to humans (possibly through civets as an intermediate host), and again in 2012 when MERS-CoV jumped from Egyptian tomb bats to camels and then to humans.
When one of these “jumps” happens, it can be seen in the close proximity between the (new) human strain and the (old) strain from its native host, as illustrated for MERS-CoV in the following figure:
To create the phylogenetic tree of SARS-CoV-2, we need the genetic sequences of those viruses (in this case RNA), and we then need to find which viruses of the same family (Coronaviridae) from different hosts are the most closely related.
Getting the Data:
For this example, we will use the sequence data from the public library:
You can then select the virus family (Coronaviridae) and the host species you are interested in. For this example, we will only consider complete genomes, and we will explore the wide variety of hosts listed below:
+------------+----------------------+-------------+
| Accession  | Host                 | Virus       |
+------------+----------------------+-------------+
| MN908947   | Homo sapiens         | SARS-CoV-2  |
| NC_019843  | Homo sapiens         | MERS-CoV    |
| MT121216   | Manis javanica       |             |
| MN996532   | Rhinolophus affinis  | RaTG13      |
| JQ065048   | Mareca penelope      |             |
| NC_034972  | Apodemus chevrieri   |             |
+------------+----------------------+-------------+
FASTA files are read using the Biopython library as shown below (for FASTA files containing only one sequence):
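With Biopython, reading a single-record file is typically one call, SeqIO.read(path, "fasta"), which expects exactly one record in the file. As a minimal sketch of what that step involves, here is a dependency-free version in plain Python. The function name read_single_fasta and the toy record are assumptions made for illustration; they are not part of Biopython or of the data set above.

```python
import os
import tempfile


def read_single_fasta(path):
    """Return (header, sequence) from a FASTA file holding exactly one record."""
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    raise ValueError("more than one record found in " + path)
                header = line[1:]  # drop the leading '>'
            else:
                if header is None:
                    raise ValueError("sequence data found before any header line")
                chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA record found in " + path)
    return header, "".join(chunks)


# Toy record standing in for a downloaded genome (not real viral sequence data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as out:
    out.write(">demo_record toy example\nATTAAAGG\nTTTATACC\n")

header, sequence = read_single_fasta(demo_path)
print(header, len(sequence))
```

Sequence lines are joined into a single string, so the line wrapping used in the file does not matter.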
For more details on Biopython, see the official documentation.
Alignment:
The alignment can be performed either directly on the online library shown above (before downloading the files) or locally with the Clustal Omega binary downloaded from http://www.clustal.org/omega/
From the sequence alignment we can also derive the distance matrix (dm), which measures how far apart the sequences are according to their divergence.
The following command calls the Clustal Omega binary to align the sequences stored in the all.fasta file. The result is then saved in the aligned.fasta file.
!./clustal-omega-1.2.3-macosx -i all.fasta -o aligned.fasta --auto --force -v
Note: Use the binary file name corresponding to your download. In my case the name is ‘clustal-omega-1.2.3-macosx’. The ‘--force’ option specifies that if a file named ‘aligned.fasta’ already exists it should be overwritten, and ‘-v’ turns on verbose output.
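The leading ‘!’ only works inside a notebook. From a plain Python script, the same call could go through the standard subprocess module. The sketch below is one possible way to do that, using only the options already described in this section. The helper names clustalo_command and run_alignment are hypothetical, and the binary name is the one from my download.

```python
import subprocess


def clustalo_command(binary, infile, outfile, overwrite=True, verbose=True):
    """Build the argument list for a Clustal Omega run."""
    cmd = [binary, "-i", infile, "-o", outfile, "--auto"]
    if overwrite:
        cmd.append("--force")  # overwrite an existing output file
    if verbose:
        cmd.append("-v")  # verbose output
    return cmd


def run_alignment(binary, infile, outfile):
    """Run Clustal Omega; raises CalledProcessError if it exits with an error."""
    subprocess.run(clustalo_command(binary, infile, outfile), check=True)


# Only build and show the command here; call run_alignment(...) to execute it.
cmd = clustalo_command("./clustal-omega-1.2.3-macosx", "all.fasta", "aligned.fasta")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.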
The alignment file is then read with Biopython's AlignIO module:
from Bio import AlignIO
align = AlignIO.read("aligned.fasta", "fasta")
Calculate the Distance Matrix:
We then need to calculate the distance matrix, which contains the distance between each sequence and every other sequence in the alignment file:
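Biopython provides a DistanceCalculator class in Bio.Phylo.TreeConstruction for this step. To make the idea of a distance concrete, here is a minimal sketch in plain Python of one simple measure: the fraction of aligned positions at which two sequences differ. The function names and the toy sequences are assumptions for illustration, and this sketch counts a gap against a base as a mismatch, which may differ from how a given library treats gaps.

```python
def identity_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = identity_distance(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy aligned sequences (not real viral data); '-' marks an alignment gap.
names = ["A", "B", "C"]
sequences = ["ACGTACGT", "ACGTACGA", "ACGAAC-A"]
dm = distance_matrix(sequences)
for name, row in zip(names, dm):
    print(name, row)
```

A distance of 0 means two aligned sequences are identical, and larger values mean more divergence.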
Phylogenetic Tree:
We can now draw the tree:
In this case we use the UPGMA method. Note that this method assumes a constant rate of evolution, which is usually not the case. It is also important to note that we construct the tree from a whole-sequence alignment, and that recombination events (which are quite common among coronaviruses) can lead us to overestimate the genetic distance between two samples.
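In Biopython, UPGMA is available through DistanceTreeConstructor in Bio.Phylo.TreeConstruction. To show what the method does, here is a minimal sketch in plain Python: it repeatedly merges the two closest clusters and replaces their distances to the others with a size-weighted average. The function names and the toy matrix are assumptions for illustration, and branch lengths are left out of the output to keep the sketch short.

```python
def upgma(names, matrix):
    """Cluster with UPGMA; return (tree, root_height), tree as nested tuples."""
    # Each cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between current clusters, keyed by the pair of cluster ids.
    dist = {}
    for i in range(len(names)):
        for j in range(i):
            dist[frozenset((i, j))] = matrix[i][j]
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        a, b = sorted(pair)
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        height = dist.pop(pair) / 2.0  # the new node sits halfway between them
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Average over all leaf pairs, weighted by cluster size.
            merged = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            dist[frozenset((next_id, c))] = merged
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height


def to_newick(tree):
    """Render nested tuples as a Newick topology string (no branch lengths)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Toy distance matrix (not real data): A and B are closest, D is the outlier.
names = ["A", "B", "C", "D"]
matrix = [
    [0.0, 0.125, 0.25, 0.5],
    [0.125, 0.0, 0.25, 0.5],
    [0.25, 0.25, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.0],
]
tree, root_height = upgma(names, matrix)
newick = to_newick(tree) + ";"
print(newick, root_height)
```

Because every merge places the new node at half the distance between the two clusters, all leaves end up at the same distance from the root. That is the constant-rate assumption mentioned above.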
Result:
It seems that once again a bat is to blame for our troubles, which is corroborated by the academic research on the subject. It seems unlikely, however, that the virus made the jump directly from bats to humans; as for MERS-CoV and SARS-CoV, an intermediate host was possibly involved.
Note: if you wonder why bats seem to be the original reservoir hosts for so many newly emerging diseases, you are not the only one. It seems that some of the bats' physiological and behavioral traits make them ideal incubators for pathogens. Here is a short video from SciShow that summarizes some of the research conducted on the subject.
Full code and sample data available here.
Thank you for reading, feedback welcome!