bioRxiv preprint doi: https://doi.org/10.1101/2020.01.30.927871; this version posted January 31, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

We then translated the aligned genome and found that these inserts are present in all Wuhan 2019-nCoV viruses except the 2019-nCoV virus with bat as a host [Fig.S4]. Intrigued by the 4 highly conserved inserts unique to 2019-nCoV, we wanted to understand their origin. For this purpose, we ran a local alignment with each 2019-nCoV insert as the query against all virus genomes and considered only hits with 100% sequence coverage. Surprisingly, each of the four inserts aligned with short segments of the Human Immunodeficiency Virus-1 (HIV-1) proteins. The amino acid positions of the inserts in 2019-nCoV and the corresponding residues in HIV-1 gp120 and HIV-1 Gag are shown in Table 1. The first 3 inserts (inserts 1, 2 and 3) aligned to short segments of amino acid residues in HIV-1 gp120. Insert 4 aligned to HIV-1 Gag. Insert 1 (6 amino acid residues) and insert 2 (6 amino acid residues) in the spike glycoprotein of 2019-nCoV are 100% identical to the residues mapped to HIV-1 gp120. Insert 3 (12 amino acid residues) in 2019-nCoV maps to HIV-1 gp120 with gaps [see Table 1]. Insert 4 (8 amino acid residues) maps to HIV-1 Gag with gaps. Although the 4 inserts represent discontiguous short stretches of amino acids in the spike glycoprotein of 2019-nCoV, the fact that all 4 of them share amino acid identity or similarity with HIV-1 gp120 and HIV-1 Gag (among all annotated virus proteins) suggests that this is not a random fortuitous finding. In other words, one may sporadically expect a fortuitous match for a stretch of 6-12 contiguous amino acid residues in an unrelated protein.
However, it is unlikely that all 4 inserts in the 2019-nCoV spike glycoprotein fortuitously match with 2 key structural proteins of an unrelated virus (HIV-1). The amino acid residues of inserts 1, 2 and 3 of the 2019-nCoV spike glycoprotein that mapped to HIV-1 were part of the V4, V5 and V1 domains, respectively, in gp120 [Table 1]. Since the 2019-nCoV inserts mapped to variable regions of HIV-1, they were not ubiquitous in HIV-1 gp120 but were limited to selected sequences of HIV-1 [refer S.File1], primarily from Asia and Africa. The HIV-1 Gag protein enables interaction of the virus with the negatively charged host surface (Murakami, 2008), and a high positive charge on the Gag protein is a key feature for the host-virus interaction. On analyzing the isoelectric point (pI) values for each of the 4 inserts in 2019-nCoV and the corresponding stretches of amino acid residues from HIV-1 proteins, we found that a) the pI values were very similar for each pair analyzed and b) most of these pI values were 10±2 [refer Table 1]. Of note, despite the gaps in inserts 3 and 4, the pI values were comparable. This uniformity in the pI values for all 4 inserts merits further investigation. As none of these 4 inserts is present in any other coronavirus, the genomic region encoding these inserts represents an ideal candidate for designing primers that can distinguish 2019-nCoV from other coronaviruses.
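The pI comparison described above can be illustrated with a short calculation. The following is a minimal sketch, not the procedure used in this study: the text does not state which tool or pKa scale produced the values in Table 1, so the pKa table below is an assumption, and absolute pI values shift with the scale chosen. The function names (`net_charge`, `isoelectric_point`, `pi_difference`) and the example peptides are hypothetical and are not the inserts from Table 1. The pI is found by bisection as the pH at which the estimated net charge of the peptide is zero; gap characters from an alignment are ignored.

```python
# Assumed pKa values for ionizable groups (illustrative only; published
# scales differ, which changes the absolute pI of short peptides).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(seq, ph):
    """Estimate the net charge of a peptide at a given pH
    using the Henderson-Hasselbalch relation."""
    # Free amino terminus (positive) and free carboxyl terminus (negative).
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in seq.upper():
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
        # Non-ionizable residues and gap characters ("-") add no charge.
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Find the pH where the net charge crosses zero, by bisection.
    The net charge decreases monotonically as pH increases."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)


def pi_difference(seq_a, seq_b):
    """Absolute difference in estimated pI between two aligned segments."""
    return abs(isoelectric_point(seq_a) - isoelectric_point(seq_b))


if __name__ == "__main__":
    # Hypothetical peptides chosen only to show the calculation.
    for peptide in ("KKKK", "GGGG", "DDDD", "K-RK"):
        print(peptide, isoelectric_point(peptide))
    print("difference:", pi_difference("KKRK", "K-RK"))
```

For peptides of 6 to 12 residues, the terminal groups contribute a large share of the total charge, so an estimate of this kind is best read as a relative comparison between paired segments under one fixed pKa scale rather than as an absolute measurement.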