In numerous instances, tracking the biological significance of a nucleic acid sequence can be augmented through the identification of environmental niches in which the sequence of interest is present. Many metagenomic data sets are now available, with deep sequencing of samples from diverse biological niches. While any individual metagenomic data set can be readily queried using web-based tools, meta-searches through all such data sets are less accessible. In this brief communication, we demonstrate such a meta-metagenomic approach, examining close matches to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in all high-throughput sequencing data sets in the NCBI Sequence Read Archive accessible with the “virome” keyword. In addition to the homology to bat coronaviruses observed in descriptions of the SARS-CoV-2 sequence, we note a strong homology to numerous sequence reads in metavirome data sets generated from the lungs of deceased pangolins reported by Liu et al. (15). While analysis of these reads indicates the presence of a similar viral sequence in pangolin lung, the similarity is not sufficient to either confirm or rule out a role for pangolins as an intermediate host in the recent emergence of SARS-CoV-2. In addition to the implications for SARS-CoV-2 emergence, this study illustrates the utility and limitations of meta-metagenomic search tools in effective and rapid characterization of potentially significant nucleic acid sequences.
Meta-metagenomic searching can provide unique opportunities to understand the distribution of nucleic acid sequences in diverse environmental niches. As metagenomic data sets proliferate and as both the need and capability to identify pathogenic agents through sequencing increase, meta-metagenomic searching may prove extremely useful in tracing the origins and spread of causative agents. In the example we present in this paper, such a search identifies a number of niches with sequences matching the genome of the SARS-CoV-2 virus. These analyses raise a number of points relevant to the origin of SARS-CoV-2. Before describing the details of these points, however, it is important to stress that while environmental, clinical, and animal-based sequencing is valuable in understanding how viruses traverse the animal ecosphere, static sequence distributions cannot be used to construct the full transmission history of a virus among different biological niches. So even if the closest relative of a disease-causing virus in species X were to be found in species Y, we cannot define the source of the outbreak or the direction(s) of transmission. As some viruses may move more than once between hosts, the sequence of a genome at any time may reflect a history of selection and drift in several different host species. This point is also accentuated in the microcosm of our searches for this work. When we originally obtained the SARS-CoV-2 sequence from the posted work of Wu et al. (3), we recapitulated their result that bat-SL-CoVZC45 was the closest related sequence in NCBI’s nonredundant (nr/nt) database. In our screen of metavirome data sets, we observed several pangolin metavirome sequences, which were not in the NCBI nr/nt database at the time, that are more closely related to SARS-CoV-2 than bat-SL-CoVZC45.
An assumption that the closest relative of a sequence identifies the origin would at that point have shifted the extant model to zoonosis from pangolin instead of bat. To complicate such a model, an additional study from Zhou et al. (4) described a previously unpublished coronavirus sequence, designated RaTG13, with much stronger homology to SARS-CoV-2 than either bat-SL-CoVZC45 or the pangolin reads from Liu et al. (15). While this observation certainly shifts the discussion (legitimately) toward a possible bat-borne intermediate in the chain leading to SARS-CoV-2, it remains difficult to determine if any of these are true intermediates in the chain of infectivity.
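The kind of screen discussed above, in which reads from many data sets are checked for close matches to a query genome, can be illustrated with a minimal k-mer sketch. This is not the pipeline used in this study: the function names, the default `k=32`, and the toy sequences in the usage note are all assumptions made for illustration. Real data sets would be streamed from Sequence Read Archive files rather than held in memory.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def build_kmer_index(query, k):
    """Collect every k-mer from both strands of the query sequence."""
    query = query.upper()
    index = set()
    for strand in (query, reverse_complement(query)):
        for i in range(len(strand) - k + 1):
            index.add(strand[i:i + k])
    return index


def read_matches(read, index, k):
    """True if the read shares at least one exact k-mer with the query."""
    read = read.upper()
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


def screen_datasets(query, datasets, k=32):
    """Count matching reads per data set.

    `datasets` maps a (hypothetical) data set name to an iterable of reads.
    """
    index = build_kmer_index(query, k)
    return {
        name: sum(1 for read in reads if read_matches(read, index, k))
        for name, reads in datasets.items()
    }
```

For example, with a made-up 24-nucleotide query and `k=8`, `screen_datasets("ACGTTGCAAGGCTTAACCGGATCG", {"toy_a": ["GGGTGCAAGGCTTGGG", "AAGCCTTGCA"], "toy_b": ["CCCCCCCCCCCC"]}, k=8)` counts two matching reads in `toy_a` (one on each strand) and none in `toy_b`. An exact k-mer hit only flags a read for follow-up; as the text stresses, a match by itself says nothing about the direction of transmission.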
The match of SARS-CoV-2 to the pangolin coronavirus sequences also enables a link to substantial context on the pangolin samples from Liu et al. (15), with information on the source of the rescued animals (from smuggling activity), the nature of their deaths despite rescue efforts, the potential presence of other viruses in the same whole-lung tissue, and the accompanying gross pathology. The pangolins appear to have died from lung-related illness, which may have involved a SARS-CoV-2-homologous virus. Notably, however, two of the deceased pangolin lungs had much lower SARS-CoV-2 signals, while seven showed no signal, with sequencing depths in the various lungs roughly comparable. Although it remains possible that the SARS-CoV-2-like coronavirus was the primary cause of death for these animals, it is also possible (as noted by Liu et al. [15]) that the virus was simply present in the tissue, with mortality due to another virus, a combination of infectious agents, or other exposures.
During the course of this work, the homology between SARS-CoV-2 and pangolin coronavirus sequences in a particular genomic subregion was also noted and discussed in an online forum (“Virological.org”) with some extremely valuable analyses and insights. Matthew Wong and colleagues bring up the homology to the pangolin metagenomic data sets in this thread and appear to have encountered it through a more targeted search than ours (this study has since been posted online on bioRxiv [19]). As noted by Wong et al. (19), the spike region includes a segment of ∼200 nucleotides encompassing the receptor-binding domain (RBD) where the inferred divergence between RaTG13 and SARS-CoV-2 dramatically increases. This region is of interest, as it is a key determinant of viral host range and under heavy selection (20). The observed spike region divergence indeed includes a substantial set of nonsynonymous differences (Fig. S2 and S3). Notably, while reads from the pangolin lung data sets mapped to this region do not show a similar increase in variation relative to SARS-CoV-2, we also did not observe a significant drop in variation between SARS-CoV-2 and pangolin sequences in this region (Fig. S2 and S3). Instead, variation in the region is comparable to numerous other conserved regions of the spike and to the viral genome as a whole. While Wong et al. (19) and others (21–28) raised the model that recombination occurred in the RBD region in the derivation of SARS-CoV-2, the lack of a singular dip in the landscape of pangolin-SARS-CoV-2 variation in the region would seem counterintuitive were SARS-CoV-2 a result of a localized recombination between a close relative of RaTG13 and a close relative of the putative pangolin coronaviruses under consideration.
Thus, alternative models for the observed sequence variation seem important to consider and indeed parsimonious, including that of selection acting on the RaTG13 sequences in bats or another intermediate host, resulting in rapid variation of the amino acids at the highly critical virus-receptor interface. Overall, definitive conclusions regarding the origins of SARS-CoV-2 or other coronaviruses will remain difficult with limited sequencing data and without knowledge of evolutionary trajectories in different lineages.
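The notion of a local peak or dip in the "landscape" of variation between two genomes can be made concrete with a sliding-window mismatch profile. The sketch below is illustrative only and is not the analysis behind Fig. S2 and S3. It assumes the two sequences have already been aligned to equal length (gaps written as `-`); the function name and the default window and step sizes are arbitrary choices, with the window size loosely echoing the roughly 200-nucleotide segment discussed above.

```python
def window_divergence(aln_a, aln_b, window=200, step=50):
    """Fraction of mismatched positions in each window of a pairwise alignment.

    Positions where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped. Returns a list of (window_start, fraction) tuples; the
    fraction is None when a window contains no comparable positions.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aln_a, aln_b = aln_a.upper(), aln_b.upper()
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        compared = 0
        mismatches = 0
        for x, y in zip(aln_a[start:start + window], aln_b[start:start + window]):
            if x in "-N" or y in "-N":
                continue  # not a comparable position
            compared += 1
            if x != y:
                mismatches += 1
        fraction = mismatches / compared if compared else None
        profile.append((start, fraction))
    return profile
```

Under a localized-recombination model, one would expect the profile for the donor lineage to dip sharply within the exchanged segment while the profile for the other parent rises; a flat profile across the segment, as described above for the pangolin reads, does not show that signature.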
A number of literature contributions now discuss the potential role of bats, pangolins, and other possible progenitor/intermediate species in the derivation of SARS-CoV-2, with a diversity of approaches, perspectives, and interpretations in understanding the origin of the virus. In particular, there has been extensive discussion and debate about the possible pangolin origin of SARS-CoV-2. These studies provide useful insights into the evolution of SARS-CoV-2 but have limitations and uncertainty in drawing conclusions regarding the viral origin, as most were performed mainly through sequence-based comparison and simulation. Thus, better understanding of the current pandemic requires additional information at investigational, experimental, and epidemiological levels that may resolve questions of origin and help prevent the reemergence of SARS-CoV-2 and other pathogens. Nevertheless, the availability of numerous paths (both targeted and agnostic) toward identification of natural niches for pathogenic sequences, including our meta-metagenomic search, will remain useful to the scientific community and to public health, as will vigorous sharing of ideas, data, and discussion of potential origins and modes of spread for epidemic pathogens.
Reference & Source information: https://msphere.asm.org/