We aimed to use data analysis tools that display the relative positions of data points in fewer dimensions while retaining as much of the variation in the original data set as possible, and to cluster countries according to their scores on the resulting dimensions.
Principal component analysis (PCA) and partitioning around medoids (PAM) clustering were used to analyze data from 56, 82, and 91 countries with COVID-19 at three time points. Countries were eligible for inclusion if they had 500 or more total cases and no missing data.
After performing PCA, we generated two scores: a Disease Magnitude score, which represents total cases, total deaths, total active cases, and critically ill cases, and a Mortality Recovery Ratio score, which represents the ratio of total deaths to total recoveries in a given country.
Accurate multivariate analyses can be of great value as they can simplify difficult concepts, explore and communicate findings from health datasets, and support the decision-making process.
An overwhelming number of studies have shed light on COVID-19 from various dimensions: its medical, biological, and epidemiological aspects, its social correlates and implications, and its impact on economic status worldwide and even at the micro level. A few studies focused on tracking COVID-19 data in order to summarize and organize them and to find ways of visualizing and presenting this huge amount of data in one or two representative graphs. Among the initial descriptive mathematical models for COVID-19 was the one introduced by N. E. Huang and F. Qiao, who aimed to track the disease course while detecting the efficacy of the local interventions made for disease containment. Despite being robust, the model did not provide real-time comment on the disease burden and progression across countries. Q. Lin et al. developed a conceptual model based on the modeling framework for the 1918 influenza pandemic in London, UK, taking into consideration governmental actions and individual reactions, to forecast the behavior patterns of COVID-19 under different scenarios. The model functioned well in forecasting COVID-19 behavior when applied to data from Wuhan, China, but it was built on a unidimensional dependent variable, total confirmed cases (Lin et al., 2020). Dey and colleagues made valuable efforts to gather and analyze epidemiological data on the COVID-19 outbreak from many open datasets. They applied visual exploratory data analysis procedures to the available datasets for certain provinces of China and for areas outside China, from 22 January to 16 February 2020. The datasets contained the numbers of confirmed cases, deaths, and recovered cases. They drew heat maps and heat-bar graphs for China and outside China; this was done for each indicator separately, and comparisons were made in a univariate manner (Dey et al., 2020).
Another study aimed to develop a predictive model for COVID-19 cases, deaths, and recoveries. The researchers used SEIR modelling to forecast the COVID-19 outbreak inside and outside China based on daily observations. According to the developed model, they assumed that the outbreak would reach its peak in late May 2020 and would start to drop around early July 2020. They also found that negative sentiments about the virus were more prevalent than positive ones. Positive sentiments were mainly reflected in articles about “collaboration and strength of individuals in facing this epidemic”, while negative articles were related to “uncertainty and poor outcomes of the disease such as deaths” (Binti Hamzah et al., 2020). Another modelling study tried to identify individuals at high risk of severe COVID-19 and how this risk varies between countries. The identification was based on an individual's age and sex, country-level disease prevalence data, multimorbidity fractions, and infection–hospitalization ratios. The study concluded that men are at higher risk than women, that older people are in the highest risk categories, and that, at the macro level, the share of the population in the highest risk categories is greatest in countries with older populations and in countries with a high prevalence of HIV/AIDS, chronic kidney disease, diabetes, cardiovascular disease, and chronic respiratory disease (Clark et al., 2020). It is clearly noticeable that all of the previous studies analyzing COVID-19 data items used univariate analysis techniques, either to forecast future outcomes or to relate them to other individual features or variables on a one-to-one basis. In other words, none of those studies handled COVID-19 data items using multivariate analysis techniques.
A real challenge has emerged: how to identify the proper time to escalate or de-escalate nationwide intervention measures over the course of the pandemic. There is a current need for a robust tool that incorporates the variables at hand, based on the available data, in a single multivariate analysis. The work presented here is an example of how visual representation can be enhanced using multivariate analysis techniques. The graphs available on the websites tracking COVID-19 status present data in a univariate way, showing the progression of confirmed cases or deaths as a function of time (CDC, 2020; Worldometer, 2020). Although these graphs are informative, advanced inference for better decision making needs a more advanced methodology that reduces high-dimensional data to fewer dimensions, which should facilitate the description and comparison of countries. To serve that purpose, we developed multivariate models, using PCA and cluster analysis, that aim to study and visualize the current situation of every country affected by COVID-19 in terms of disease burden against mortality/recovery ratio at a given time point. This will help further inference by governments and non-governmental organizations (NGOs) committed to responding to the COVID-19 burden in their countries, so that they can implement priority public health measures in support of national plans and interventions.
In the current study, each affected country was described by two numerical variables that efficiently store the information within the original five variables. The PCA algorithms were performed on the calculated Z-scores of the original variables, which is why the averages of the PC scores on the formed dimensions were consistently equal to zero (Table 3). Hence, countries with positive values of the disease magnitude score (PC-1 score > 0) had relatively higher confirmed cases, deaths, active cases, and/or critically ill cases. Similarly, countries with positive values of the mortality recovery ratio score had a relatively higher ratio of deaths to recovered cases, while negative values of the disease magnitude or mortality recovery ratio scores indicated a relatively controlled status. This can be illustrated by the PC scores of the USA in the first wave of cluster analysis (16.263, -0.113): despite being an extreme outlier in disease magnitude (PC-1 score, 16.263), its mortality recovery ratio was relatively controlled (PC-2 score, -0.113). This strongly suggests a well-established healthcare system that could absorb the relatively high disease magnitude without increasing the ratio of deaths to recovered cases.
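The link between Z-scores and zero-mean PC scores can be checked with a minimal sketch. It assumes a small, hypothetical table of three variables for five countries (not the study's data) and approximates the first principal component by power iteration on the correlation matrix; the helper names `zscores` and `first_component` are illustrative only and are not taken from the study.

```python
import math
import statistics

def zscores(column):
    """Standardize one variable to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

def first_component(rows, iterations=500):
    """Approximate the leading eigenvector of the correlation matrix by power iteration."""
    n, p = len(rows), len(rows[0])
    # For standardized data the correlation matrix is Z^T Z / (n - 1).
    corr = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]
    vec = [1.0] * p
    for _ in range(iterations):
        vec = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Hypothetical toy data: rows are countries, columns are
# (total cases, total deaths, active cases). These are NOT the study's values.
raw = [
    [900, 30, 600],
    [5000, 210, 3100],
    [1200, 25, 700],
    [40000, 2500, 30000],
    [700, 10, 300],
]

# Standardize column by column, then go back to one row per country.
columns = [zscores(list(col)) for col in zip(*raw)]
z_rows = [list(row) for row in zip(*columns)]

# PC-1 score of a country = dot product of its Z-scores with the loadings.
loadings = first_component(z_rows)
pc1_scores = [sum(l * z for l, z in zip(loadings, row)) for row in z_rows]

print("mean PC-1 score:", round(statistics.mean(pc1_scores), 10))
print("PC-1 scores:", [round(s, 3) for s in pc1_scores])
```

Because every standardized column sums to zero, any linear combination of the columns also averages to zero, so a positive score simply means "above the average country" on that dimension.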
On 25 April, the first wave of cluster analysis detected a meaningful number of noise clusters. The USA alone formed cluster 1, with the maximum disease magnitude score. Italy (3.416, 0.171) was the medoid of cluster 2, which had a relatively higher disease magnitude score than the main cluster 3 (84 of the 91 countries). Norway (-0.224, 8.546) alone formed cluster 4, standing far apart with a high score on mortality recovery ratio (PC-2). Of note, the second cluster, whose medoid is Italy, represents a group of countries with shared borders among Italy, Germany, France, and Spain, which may partly account for their grouping in one cluster.
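How PAM can isolate extreme countries as single-member clusters may be easier to see in a small sketch. The code below is a simplified PAM (greedy BUILD followed by SWAP) written for illustration and is not the implementation used in the study. The three named points use the PC scores reported above for the USA, Italy, and Norway; all other points are hypothetical stand-ins for the remaining countries.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Simplified PAM: greedy BUILD, then SWAP until no swap lowers the cost."""
    medoids = []
    # BUILD: repeatedly add the point that gives the lowest total cost.
    while len(medoids) < k:
        candidates = [i for i in range(len(points)) if i not in medoids]
        best = min(candidates, key=lambda i: total_cost(points, medoids + [i]))
        medoids.append(best)
    # SWAP: try replacing each medoid with each non-medoid point.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(points, trial) < total_cost(points, medoids) - 1e-12:
                    medoids, improved = trial, True
    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: math.dist(p, points[m])) for p in points]
    return medoids, labels

# (PC-1, PC-2) scores. Indices 0, 1 and 4 are values reported in the text;
# every other point is hypothetical.
points = [
    (16.263, -0.113),  # USA (reported)
    (3.416, 0.171),    # Italy (reported)
    (3.0, 0.4),        # hypothetical neighbour of Italy
    (3.9, -0.2),       # hypothetical neighbour of Italy
    (-0.224, 8.546),   # Norway (reported)
    (-0.5, -0.3), (-0.4, -0.1), (-0.6, 0.0),   # hypothetical main cluster
    (-0.3, -0.2), (-0.5, 0.1), (-0.2, 0.0),    # hypothetical main cluster
]

medoids, labels = pam(points, k=4)
print("medoid indices:", sorted(medoids))
print("cluster label per point:", labels)
```

Because a medoid is always an observed point, each cluster can be summarized by a real country, which is why the results are reported as "Italy was the medoid of cluster 2" rather than as an abstract centroid.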
Further PCA was performed on the data of the countries in cluster 3 of the previous model, followed by PAM cluster analysis. The detected changes in the correlations between the tested variables, and the subsequent changes in loading scores on the principal components, indicated that noise reduction was needed to recover information that had been masked by the noise clusters in the previous PCA. The number of clusters in this step was 2. Iran (7.801, -0.823) was the medoid of cluster 1, which contained Turkey, Iran, Russia, and Belgium, while Finland (-0.618, -0.346) was the medoid of the remaining 80 countries. Again, geographical proximity appears to contribute to the explanation of the data by our model. The final multivariate analysis of the 80 countries in cluster 2 of the previous model showed significant weak to moderate correlations between mortality recovery ratio and the rest of the variables on PC-1. It also showed subsequent changes in the contributions to each PC, denoting changes compared with the model initially performed on 91 countries. The 80 countries were then optimally clustered into 2 groups: Romania (1.249, 0.165) was the medoid of the first group, which contained 24 countries, and Cameroon (-1.015, -0.184) was the medoid of the second, which contained 56 countries. The change in correlations between mortality recovery ratio and the variables on PC-1, along with a pattern of signal homogeneity observed simultaneously and reciprocally in PC-1 and PC-2 across cluster 1 and cluster 2 in this wave of multivariate analysis, indicated that our model had reached a logical end point. Each cluster finally represents a disease pattern in which PC-1 (disease magnitude) changes in the same direction as PC-2 (mortality recovery ratio). This means that successive waves of PCA and cluster analysis were needed to properly group countries with similar disease patterns for better visualization and subsequent data extraction and projection.
Moreover, the results of this last step showed significant weak to moderate correlations between mortality recovery ratio on PC-2 and the rest of the variables on PC-1. This may indicate that mortality recovery ratio is more influenced by disease magnitude in the major 80-country cluster, meaning that health care systems in these countries are beginning to struggle to accommodate the increase in disease magnitude, or that these countries need to augment their capacity to restore the independence of PC-2 from PC-1 and thereby achieve more disease control.
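One plausible reason successive waves recover more structure is that Z-scores are recomputed once extreme countries are set aside. The sketch below assumes a single hypothetical variable (total cases) with one extreme value and compares the spread of the remaining countries before and after that value is removed; the numbers are invented for illustration and are not the study's data.

```python
import statistics

def zscores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical total-case counts; the last value plays the role of an
# extreme country that ends up alone in a noise cluster.
cases = [800, 1500, 2500, 4000, 6000, 9000, 900000]

wave1 = zscores(cases)        # first wave: all countries standardized together
wave2 = zscores(cases[:-1])   # next wave: extreme country set aside, then re-standardized

# Spread of the six non-extreme countries on the standardized scale.
spread1 = max(wave1[:-1]) - min(wave1[:-1])
spread2 = max(wave2) - min(wave2)

print("spread of remaining countries, wave 1:", round(spread1, 3))
print("spread of remaining countries, wave 2:", round(spread2, 3))
```

In the first wave the extreme value inflates the standard deviation, so the other countries are compressed into a narrow band; after re-standardizing, their differences occupy most of the scale and can drive the next round of PCA and clustering.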
The multivariate methodology used in this study is a powerful tool for describing and visualizing data at specific time points. It allows disease burden to be studied in terms of disease magnitude and outcome in each country using readily available data, in light of the dynamic attributes of the disease. The formed PCs are more convenient and informative when properly used as dependent variables in further predictive regression models. Using this methodology will enable both the scientific and the policy-making communities to better organize, analyze, and visualize these growing data.
Reference & Source information: https://www.cell.com/