We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure, and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation, which has become increasingly prevalent as the pandemic has spread.
ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics.
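As an illustration of the entropy step described in the abstract, the following minimal sketch computes the Shannon entropy of each column of a toy alignment and keeps the high-entropy positions as an ISM. It is not the authors' pipeline: the function names, the threshold value, and the toy sequences are assumptions made for this example, and a real analysis would also need to handle gaps and ambiguous bases.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the bases observed at one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pick_informative_sites(aligned, threshold):
    """Return 0-based positions whose entropy exceeds the given threshold.

    `threshold` is a hypothetical tuning parameter for this sketch.
    """
    length = len(aligned[0])
    return [
        i for i in range(length)
        if site_entropy([seq[i] for seq in aligned]) > threshold
    ]


def ism_of(seq, sites):
    """Concatenate the bases of one sequence at the selected sites."""
    return "".join(seq[i] for i in sites)


# Toy aligned sequences, invented for illustration (not real viral data).
toy_alignment = ["ACGTA", "ACGTA", "ATGTC", "ATGTC"]
sites = pick_informative_sites(toy_alignment, threshold=0.5)
isms = [ism_of(seq, sites) for seq in toy_alignment]
```

In this toy case only the second and fifth columns vary, so each five-base sequence is reduced to a two-base label; the same idea, applied to a full alignment, is what compresses a roughly 30,000-base genome into a short marker.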
The global SARS-CoV-2 pandemic highlights the importance of tracking viral transmission dynamics in real time. Through June 2020, researchers have obtained genetic sequences of SARS-CoV-2 from over 47,000 samples from infected individuals worldwide. Since the virus readily mutates, each sequence from an infected individual contains useful information linked to the individual’s exposure location and sample date. However, there are over 30,000 bases in the full SARS-CoV-2 genome, so tracking genetic variants on a whole-sequence basis becomes unwieldy. We describe a method to instead efficiently identify and label genetic variants, or “subtypes”, of SARS-CoV-2. Applying this method results in a compact, 11-base-long compressed label, called an Informative Subtype Marker or “ISM”. We define viral subtypes for each ISM, and show how the regional distribution of subtypes tracks the progress of the pandemic. Major findings include (1) nucleotides covarying with the spike protein variant, which has spread rapidly, and (2) the emergence of a local subtype across the United States connected to Asia and distinct from the outbreak in New York, which is found to be connected to Europe.
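The regional distribution of subtypes mentioned above amounts to counting ISM labels per location. The sketch below shows one way this frequency analysis could look, assuming each sequence has already been reduced to a (region, ISM) pair; the function name, the record layout, and the toy records are hypothetical and are not taken from the authors' code.

```python
from collections import Counter, defaultdict


def subtype_frequencies(records):
    """Map each region to the relative frequency of each ISM label seen there.

    `records` is an iterable of (region, ism) pairs; this layout is an
    assumption made for the example.
    """
    counts = defaultdict(Counter)
    for region, ism in records:
        counts[region][ism] += 1
    return {
        region: {ism: n / sum(c.values()) for ism, n in c.items()}
        for region, c in counts.items()
    }


# Toy records, invented for illustration (not real metadata).
toy_records = [
    ("RegionA", "CA"),
    ("RegionA", "CA"),
    ("RegionA", "TC"),
    ("RegionB", "TC"),
]
freqs = subtype_frequencies(toy_records)
```

Grouping the same records by sample date instead of region would give the temporal view of subtype abundance.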
In this paper, we present a pipeline for subtyping SARS-CoV-2 viral genomes based on short sets of highly informative nucleotide sites (ISMs). Our results demonstrate the following key features of ISM-based subtyping. First, the ISM of a sequence preserves important nucleotide positions that can help to resolve different SARS-CoV-2 subtypes. ISMs provide a quick and easy way to track key sets of SNVs that covary as the SARS-CoV-2 pandemic spreads throughout the world. The spike protein variant, with which these SNVs consistently covary, has rapidly become prevalent throughout the world and may be linked to increased viral transmission [4, 19, 47]. Second, ISM-based subtypes are able to capture the majority of phylogenetic relationships between viral genomes that are represented in Nextstrain tree clades. ISM analysis shows promise as a complement to phylogenetic classification, particularly given the limits of phylogenetics at early stages in the pandemic (e.g., due to uncertainty regarding key assumptions, such as the rate of the molecular clock and confidence in branches), while also being more computationally efficient. Third, ISM subtyping can provide robust and informative insight regarding the geographic and temporal spread of the SARS-CoV-2 sequences, and may potentially offer a way to identify phenotypic variants of the virus. For example, in this paper, we show that the distribution of ISMs is an indicator of the geographical distribution of the virus as predicted by the flow of the virus from China, the initial European outbreak in Italy, and the subsequent development of local subtypes within individual European countries, as well as interregional differences in viral outbreaks in the United States. An important caveat of all viral analyses, including subtyping, is that they are limited by the number of viral sequences available.
Small and/or non-uniform sampling of sequences within and across populations may not accurately reflect the true diversity and distribution of viral subtypes. However, the ISM-based approach has the advantage of being scalable as sequence information grows, and with more information, it will become both more accurate and more precise for different geographic regions and within subpopulations. Using the ISM subtyping pipeline on continuously updated sequencing data, we can update subtypes as new sequences are identified and uploaded to global databases. We have made the pipeline and updated analyses available on GitHub at https://github.com/EESI/ISM and on an interactive website at https://covid19-ism.coe.drexel.edu/. In the future, as more data become available, ISM-based subtyping can be employed on subpopulations within geographical regions, demographic groups, and groups of patients with different clinical outcomes. Efficient subtyping of the massive amount of SARS-CoV-2 sequence data will therefore enable quantitative modeling and machine learning methods to develop improved containment and potential therapeutic strategies against SARS-CoV-2. Moreover, the ISM-based subtyping scheme and associated downstream analyses for SARS-CoV-2 are directly applicable to other viruses, enabling efficient subtyping and real-time tracking of potential future viral pandemics.
Reference & source information: https://www.biorxiv.org/