The 2019 novel coronavirus outbreak has significantly affected global health and society. Predicting biological function from pathogen sequence is therefore crucial and urgently needed. However, little work has been done to identify viruses by the enzymes they encode, even though these enzymes are key to pathogen propagation.
Results: We built a comprehensive scientific resource, SARS2020, that integrates coronavirus-related research, genomic sequences, and results of anti-viral drug trials. In addition, we built a consensus sequence-catalytic function model from which we identified the novel coronavirus as encoding the same proteinase as the Severe Acute Respiratory Syndrome virus. This data-driven sequence-based strategy will enable rapid identification of agents responsible for future epidemics.
Platform construction and data curation: We systematically collected reports of coronavirus-related research, genomic sequences, biochemical reactions, government policies, public opinion in the media, and anti-viral drugs in clinical trials (Table S1, Hu, et al., 2011; Khan, et al., 2020; Shu and McCauley, 2017). This information was used to build SARS2020, an integrated scientific resource about 2019-nCoV that provides foundation data for researchers in various fields. To ensure data quality, we imposed strict evaluation and validation criteria, and all 2019-nCoV related data were checked one by one for authenticity. In addition, we integrated a consensus sequence-function model (Zhang, et al., 2020), a genome browser (Ham, et al., 2012), and a catalytic function annotation tool (Dawson, et al., 2017) into the platform to assist research on novel viruses.
Sequence-function model: We adopted a consensus strategy to annotate enzymatic functions of biological sequences. For sequence function annotation, the family classification method captures common properties from the samples and extracts their feature vectors using machine learning algorithms, then merges the sequences into clusters or families. This consensus strategy enables efficient integration of these computational resources to maximize the accuracy and comprehensiveness of enzyme function prediction.
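One way to picture a consensus step of this kind is a simple vote across several annotation tools. The sketch below is an illustration only, not the SARS2020 implementation: the function name `consensus_ec`, the tool names, the `min_votes` threshold, and the scores are all hypothetical, and the actual model (Zhang, et al., 2020) may combine predictors quite differently.

```python
from collections import defaultdict


def consensus_ec(predictions, min_votes=2):
    """Rank EC numbers by how many tools predict them (hypothetical scheme).

    predictions: dict mapping a tool name to a dict of {EC number: score},
                 where each score is that tool's confidence in [0, 1].
    Returns a list of (ec_number, votes, mean_score), best first.
    """
    scores_by_ec = defaultdict(list)
    for calls in predictions.values():
        for ec, score in calls.items():
            scores_by_ec[ec].append(score)

    ranked = [
        (ec, len(scores), sum(scores) / len(scores))
        for ec, scores in scores_by_ec.items()
        if len(scores) >= min_votes  # keep only calls supported by several tools
    ]
    # More votes first, then higher mean score, then EC number for stable ties.
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Toy input: tool names and scores are invented for illustration.
toy_predictions = {
    "tool_a": {"3.4.22.69": 0.75, "2.7.7.48": 0.5},
    "tool_b": {"3.4.22.69": 0.5, "2.7.7.48": 0.25},
    "tool_c": {"3.4.22.69": 0.25, "2.4.1.109": 0.5},
}

for ec, votes, mean_score in consensus_ec(toy_predictions):
    print(ec, votes, round(mean_score, 3))
```

Requiring agreement between at least two tools trades some coverage for precision; lowering `min_votes` to 1 would keep every prediction and rely on the ranking alone.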
Identification of 2019-nCoV: We obtained the genome sequence of 2019-nCoV from NCBI (NC_045512) and constructed a gene model from the sequence based on an interpolated Markov model. To identify the coding regions, we used the long-orfs tool from Glimmer3 (Delcher, et al., 2007), a gene finder for bacterial, archaeal, and viral genomes. Protein translation of the coding regions was performed with Biopython (Cock, et al., 2009). We then used the consensus sequence-catalytic function model provided by SARS2020 to analyze the pathogen sequence for likely catalytic functions.
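The ORF-finding and translation steps can be illustrated with a small standard-library sketch. This is a simplified stand-in, not the pipeline used here: it does not call Glimmer3 or Biopython, it uses no interpolated Markov model, it scans only the forward strand, and it ignores ribosomal frameshifting. The function names `translate` and `find_orfs` and the `min_len` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(genome, min_len=90):
    """Find forward-strand ORFs (ATG ... stop) of at least min_len nucleotides.

    Returns a sorted list of (start, end, protein), where start and end are
    0-based, end-exclusive coordinates that include the stop codon, and the
    protein excludes the stop.
    """
    genome = genome.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, translate(genome[start:i])))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Toy sequence, not taken from NC_045512.
toy_genome = "CC" + "ATGGCTTGGAAATAA" + "GG"
print(find_orfs(toy_genome, min_len=9))
```

On a real genome, the predicted proteins would then be passed to the function annotation step; a dedicated gene finder handles overlapping genes and both strands, which this sketch does not.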
The SARS2020 system is an integrated scientific resource platform about 2019-nCoV. At present, the system includes ~60,000 units of information. It helps scientists follow the progress of 2019-nCoV research and share data. SARS2020 is also a platform to assist in the identification of new viruses. We analyzed the 2019-nCoV genome by the method described above. All predicted catalytic functions were derived from orf1ab (GeneID: 43740578), which seems to encode multiple proteins (Fig 1). The most likely predicted catalytic activity was SARS coronavirus main proteinase, whose Enzyme Commission (EC) number is 3.4.22.69. This prediction suggested that 2019-nCoV was most likely a SARS virus, a result consistent with the conclusion of the International Committee on Taxonomy of Viruses. We also predicted other possible catalytic functions in the 2019-nCoV genome, including RNA-directed RNA polymerase (EC: 2.7.7.48), dolichyl-phosphate-mannose-protein mannosyltransferase (EC: 2.4.1.109), NAD+ ADP-ribosyltransferase (EC: 2.4.2.30), and ubiquitinyl hydrolase 1 (EC: 3.4.19.12). These predicted functions will provide a valuable reference for further study of the biological activity and pathogenesis of 2019-nCoV.
We built an integrated platform to assist 2019-nCoV research, and we proposed a novel consensus sequence-function model for using genome sequence data to identify unknown species. Our data-driven sequence-based strategy will enable rapid identification of constantly emerging pathogens.
Reference & Source information: https://academic.oup.com/