International Journal of Industrial and Operations Research

Int J Ind Operations Res

2633-8947

VIBGYOR Online Publishers

313 Kd Tower, Cotterells, Hemel Hempstead, Hertfordshire, England, HP1 1AU

Predicting Total Number of Deaths using COVID-19 World Data: Application of Linear Regression Model

10.35840/2633-8947/6508

Siegrist M

Research Article

3 2

OPEN ACCESS

Micaela Siegrist, Department of Mathematics and Statistics, Florida International University, Miami, USA

BM Golam Kibria, Department of Mathematics and Statistics, Florida International University, Miami, USA

Corresponding author: BM Golam Kibria

Department of Mathematics and Statistics, Florida International University, Miami, FL 33199, USA.

04 November 2020

06 November 2020

Siegrist M, Kibria BMG

2020

2020

Siegrist M, et al

© This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Coronavirus Disease 19 (COVID-19) is a new and deadly disease that appeared at the end of 2019 in China and quickly spread worldwide. This paper analyzes which regressors influence the deaths caused by this disease. The variables considered were total deaths per million, population density, median age, people age 70 and older, GDP per capita, CVD death rate, diabetes prevalence, smokers-tobacco prevalence, hospital beds per 100,000 people, and total cases per million. After fitting three multiple linear regression models, we found that the variables that are significant when analyzing deaths by COVID-19 are median age, people 70 and older, tobacco prevalence, hospital beds per 100,000 people, and total cases of COVID-19 per million.

COVID-19, LSE, MSE, MAPE, Prediction, Regression model

Coronavirus Disease 19 (COVID-19) is a disease that emerged at the end of 2019, when it was first reported in Wuhan, China. This new and deadly disease spread very quickly, and by March 11, 2020 the World Health Organization had declared it a pandemic [1,2]. The virus had infected people in more than 114 countries, with 180,000 cases and over 4000 deaths. As a result, many schools and businesses closed, many countries closed their borders, and people in many countries began living under quarantine.

Even though the mortality rate among those infected is 2.3 percent, which is not high, there is currently no vaccine to stop the worldwide spread, and this is what frightens people most. It is clear that COVID-19 is very contagious and that everyone should take responsibility for their own safety. However, not every infected person has been admitted to a hospital, so it is important to identify the factors that can cause the disease to worsen. Little is known about this disease, but health professionals report that the symptoms include fever, dry cough, and chest distress, among others. The study "Prediction of Number of Cases of 2019 Novel Coronavirus (COVID-19) Using Social Media Search Index" by Qin, et al. [4] shows how web and social media data can be used to predict new or suspected cases of COVID-19 by tracking keywords such as the previously mentioned symptoms. This information is very helpful for preparing health institutions for possible outbreaks, for giving governments the opportunity to implement new policies such as a stricter quarantine, and for educating the population living in high-risk areas. Our analysis also focuses on prediction, in this case of the number of deaths caused by COVID-19, based on the available data.

This paper considers 126 countries, which have complete data for all variables and represent the world population well. We focus on determining which variables most influence deaths by COVID-19. The variables and brief descriptions are provided below:

Total deaths per million (Y): Because testing policies differ across countries, the total number of cases is not a reliable dependent variable. In this paper, the dependent variable is the number of COVID-19 deaths per million people in each country.

Population density (X1): Population density is a measurement of population per unit area. Considering that the virus is spread from person to person, how crowded a population is may influence the transmission of the disease.

Median age (X2): Age appears to be an important factor in the seriousness of the illness. Few children were hospitalized because of COVID-19, compared with many older adults. The median age of a population shows whether it is a young or an old population.

People age 70 and older (X3): Mortality from the virus tends to increase among people who are 70 and older, who are considered one of the risk groups.

GDP per capita (X4): GDP per capita is a country's gross domestic product divided by its population, expressed here in US dollars. A higher GDP per capita is generally associated with a higher quality of life, more years of education, and better health services.

CVD death rate (X5): Pre-existing conditions may influence the seriousness of the illness. Mortality rises to about 10 percent among COVID-19 patients with previous cardiovascular diseases.

Diabetes prevalence (X6): Patients who have diabetes and whose blood sugar levels are not well controlled may have a weakened immune system. Therefore, it is harder for them to get rid of the virus.

Smokers-tobacco prevalence (X7): Since COVID-19 is a respiratory disease, patients whose lungs and respiratory systems are already in poor condition may have a harder time fighting it.

Hospital beds per 100,000 people (X8): COVID-19 is a very contagious disease that infects people at a very high rate and has caused hospitals to collapse in some countries. How well a country is prepared to hospitalize a large number of people at the same time determines whether doctors are forced to choose whom to treat.

Total cases per million (X9): COVID-19 is a disease with a low mortality rate, but the total number of cases may help predict the number of deaths. More cases are expected to be associated with more deaths.

Ghosal, et al. [3] use SARS-CoV-2 data at 6 weeks from day 0 to predict the number of deaths in India. They explain that the virus's ability to undergo genetic recombination and its susceptibility to natural selection explain why COVID-19 spreads very quickly. An effective prediction may help to prevent future catastrophes. They consider the total number of infected cases, active cases, and recoveries as regressors, and total deaths and case fatality rate as response variables. Qin, et al. [4] use social media search index data to predict the number of COVID-19 cases. The literature on fitting regression models to predict the total number of deaths using COVID-19 data is limited. Many studies that fit multiple regression models for various purposes are available in the literature; to mention a few, Motulsky and Christopoulos [5], Montgomery, et al. [6], James, et al. [7], and very recently Guzman and Kibria [8] and Saleh, et al. [9], among others.

COVID-19 is a very severe disease, and the main objective of this paper is to identify significant variables that contribute to the death rate caused by the coronavirus. The paper is organized as follows: The data and the descriptive statistics are given in Section 2. Regression models are developed in Section 3. Cross validation and evaluation of the fitted model are outlined in Section 4. The paper ends with some concluding remarks in Section 5.

We started data collection by extracting the publicly available COVID-19 data up to May 29, 2020 for our analysis. First, we selected the last day of data for each country; we then removed some islands and small countries that were missing a lot of information (in this case regressors), which left 126 countries with nine regressors. We then verified that the data included countries from the five continents. Those countries are Albania, United Arab Emirates, Argentina, Armenia, Australia, Austria, Azerbaijan, Belgium, Benin, Burkina Faso, Bangladesh, Bulgaria, Bahrain, Bahamas, Bosnia and Herzegovina, Belarus, Brazil, Barbados, Brunei, Botswana, Canada, Switzerland, Chile, China, Colombia, Comoros, Cape Verde, Costa Rica, Cyprus, Czech Republic, Germany, Djibouti, Denmark, Dominican Republic, Algeria, Ecuador, Egypt, Eritrea, Spain, Estonia, Ethiopia, Finland, Fiji, France, United Kingdom, Georgia, Ghana, Gambia, Greece, Croatia, Haiti, Hungary, Indonesia, India, Ireland, Iran, Iceland, Israel, Italy, Jamaica, Japan, Kazakhstan, Kenya, Kyrgyzstan, Cambodia, South Korea, Kuwait, Laos, Lebanon, Liberia, Sri Lanka, Lithuania, Luxembourg, Latvia, Morocco, Moldova, Mexico, Mali, Malta, Myanmar, Montenegro, Mongolia, Mozambique, Mauritius, Malawi, Malaysia, Niger, Netherlands, Norway, Nepal, New Zealand, Oman, Pakistan, Panama, Philippines, Poland, Portugal, Paraguay, Qatar, Romania, Russia, Saudi Arabia, Singapore, El Salvador, Suriname, Slovakia, Slovenia, Sweden, Swaziland, Seychelles, Togo, Thailand, Timor, Tunisia, Turkey, Tanzania, Uganda, Ukraine, Uruguay, United States, Uzbekistan, Vietnam, Yemen, South Africa, Zambia, and Zimbabwe.

For this study we consider the following variables: total deaths per million (Y), population density (X1), median age (X2), people age 70 and older (X3), GDP per capita (X4), CVD death rate (X5), diabetes prevalence (X6), smokers-tobacco prevalence (X7), hospital beds per 100,000 people (X8), and total cases per million (X9). Table 1 shows the descriptive statistics for the dependent and independent variables. The range of the number of COVID-19 deaths per million is large, which indicates that the virus did not affect every country in the same way.

The objective of this study is to determine whether any of the nine regressors influence the number of deaths because of COVID-19. To determine which factors are significant, we construct a multiple linear regression model that relates the number of deaths to the nine regressors in the following section.

We carry out the regression analysis in this section. We use the first 80% of our data (approximately the first 101 countries) to fit the model. The remaining 20% at the bottom will be used later to evaluate the adequacy of the linear regression model. Therefore, the sample size used for fitting was reduced from 126 to 101.
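The positional split described above can be sketched as follows. This is only an illustration: the function name and the placeholder country labels are assumptions, and the rows are taken in the order in which they appear in the data set.

```python
def split_by_position(rows, train_fraction=0.8):
    """Split rows into a leading training part and a trailing test part."""
    n_train = round(len(rows) * train_fraction)
    return rows[:n_train], rows[n_train:]


# Placeholder labels standing in for the 126 countries (hypothetical names).
countries = [f"country_{i}" for i in range(1, 127)]
train, test = split_by_position(countries)
print(len(train), len(test))
```

With 126 rows and a fraction of 0.8, the rounded training size is 101, leaving rows 102 to 126 for evaluation.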

Now, we will consider the following linear regression model:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9 + ε (1)

where y = total deaths because of COVID-19 per million, x1 = population density, x2 = median age, x3 = people age 70 and older, x4 = GDP per capita, x5 = CVD death rate, x6 = diabetes prevalence, x7 = smokers-tobacco prevalence, x8 = hospital beds per 100,000 people, and x9 = total cases of COVID-19 per million. In order to fit the model, we assume that the regressors are independent of one another and that the errors are normally distributed with mean 0 and constant variance σ2.
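The paper fits model (1) in SPSS. As a rough illustration of what a least-squares fit computes, the sketch below solves the normal equations in plain Python on synthetic data. The function names and the synthetic data are assumptions, and this is not the SPSS procedure itself.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Least-squares estimates [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data with known coefficients (not the COVID-19 data).
rng = random.Random(0)
X = [[rng.uniform(0, 10), rng.uniform(0, 5)] for _ in range(50)]
y = [2.0 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X]
coefs = fit_ols(X, y)
print([round(b, 6) for b in coefs])
```

Because the synthetic response has no noise, the estimates should recover the generating coefficients up to floating-point error. Statistical software typically uses a more numerically stable decomposition instead of forming the normal equations directly.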

After fitting a regression model, from SPSS we get the results shown in Table 2.

Using Table 2, the first full fitted model is:

y = 19.808 + 0.011x1 + 2.682x2 + 12.602x3 - 0.001x4 - 0.177x5 - 6.218x6 - 0.282x7 - 14.044x8 + 0.025x9 (2)

We obtain an R Square of 0.423 (adjusted R Square 0.37), which means that about 42% of the total variation in deaths is explained by the nine variables. We can see from Table 2 that some of the regressors are not significant in the model.
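The relationship between R Square and adjusted R Square can be checked with the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below assumes n = 101 observations and p = 9 regressors, as stated for the full model; the function name is illustrative.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p regressors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Values reported for the full model: R^2 = 0.423, n = 101, p = 9.
value = adjusted_r_squared(0.423, n=101, p=9)
print(round(value, 3))
```

Under these assumptions the result is roughly 0.37, in line with the adjusted value reported for the full model.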

The normal Q-Q plot and Residuals vs. Fitted plot are shown in Figure 1 and Figure 2 respectively.

We can see from Figure 1 that the residuals are approximately normal, while Figure 2 shows that the constant variance assumption has not been met.

In order to get an adequate model, we tried various transformations of the dependent variable (Y). The log transformation of Y gave the best model, which is stated below.

y* = log (Y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + β7x7 + β8x8 + β9x9 + ε (3)

The normal Q-Q plot (Figure 3) shows that the residuals now approximately follow a normal distribution. Also, the Residuals vs. Fitted plot (Figure 4) shows points scattered approximately evenly around 0, which indicates that the constant variance assumption is satisfied. The SPSS regression analysis for the transformed model is provided in Table 3.
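For readers who want to reproduce a normal Q-Q plot outside SPSS, the points can be computed as in the sketch below. The plotting position (i - 0.5)/n is one common convention and an assumption here, as are the function name and the made-up residuals.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    standardized = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n, one common convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


# Made-up residuals for illustration only (not taken from the paper).
residuals = [-1.2, 0.3, 0.8, -0.4, 1.5, -0.9, 0.1, -0.2]
points = qq_points(residuals)
for t, s in points:
    print(f"{t:7.3f} {s:7.3f}")
```

If the residuals are approximately normal, the pairs fall close to the line with slope 1 through the origin.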

Using Table 3, the fitted transformed model is:

log (y) = -1.025 + 0.000x1 + 0.067x2 + 0.039x3 + 0.000x4 + 0.000x5 - 0.004x6 - 0.008x7 - 0.067x8 + 0.000x9 (4)

We obtain an R Square of 0.587 (adjusted R Square 0.544), which means that almost 60% of the total variation in log deaths is explained by the nine variables. Since the F-test statistic is 13.600 and its corresponding p-value is reported as 0.000 (that is, less than 0.001), we reject the null hypothesis that all regression coefficients are zero. Therefore, we conclude that at least one variable is significant in the model. Next, we reduce the model by backward elimination. Using SPSS, we arrive at the reduced model shown in Table 4 (R2 = 0.576).

Even though the p-value for x7 in Table 5 is greater than 0.05, we decided to keep it in the model, because x7 is tobacco prevalence, which has a documented relationship with deaths by COVID-19 [10]. The model resulting from regressor elimination is:

log (y) = -0.919 + 0.054x2 + 0.057x3 - 0.008x7 - 0.069x8 + 0.000x9 (5)

Our final model retains only five variables (x2, x3, x7, x8, and x9) as predictors of deaths by COVID-19 (R2 = 0.581).
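As an illustration of how the fitted equation (5) could be applied, the sketch below evaluates it for a made-up country. The input values, the function names, and the base-10 logarithm are assumptions (the paper does not state the log base). The coefficient of x9 is used as the rounded 0.000 printed in the equation, so its contribution vanishes here.

```python
# Coefficients as printed in equation (5); x9 is rounded to 0.000 in the paper.
COEFFICIENTS = {"x2": 0.054, "x3": 0.057, "x7": -0.008, "x8": -0.069, "x9": 0.000}
INTERCEPT = -0.919


def predict_log_deaths(values):
    """Evaluate the right-hand side of equation (5) for one country."""
    return INTERCEPT + sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)


def back_transform(log_value, base=10.0):
    """Convert a predicted log value to deaths per million (log base assumed)."""
    return base ** log_value


# Hypothetical inputs, not taken from the data set.
example = {"x2": 40.0, "x3": 10.0, "x7": 25.0, "x8": 3.0, "x9": 2000.0}
log_pred = predict_log_deaths(example)
print(round(log_pred, 3), round(back_transform(log_pred), 1))
```

Because the printed coefficients are rounded to three decimals, predictions from this sketch will differ from those produced with the unrounded SPSS estimates.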

Multicollinearity

A multiple linear regression model relies on several assumptions, one of which is that the regressors are not strongly related to one another. To check for multicollinearity in our model, we use the variance inflation factors (VIF) in Table 5. The VIFs for x2, x3, x7, x8, and x9 are 7.130, 5.819, 1.648, 2.120, and 1.239, respectively. All VIFs are less than ten, which suggests that multicollinearity is not a serious problem.
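The VIF of a regressor is defined as 1/(1 - R2_j), where R2_j is the R Square obtained by regressing that regressor on all the others. The sketch below inverts this relationship for the reported VIF values, to show how much of each regressor is explained by the rest. The function names are illustrative.

```python
def vif_from_r2(r2_j):
    """VIF of regressor j given the R^2 of regressing it on the other regressors."""
    return 1.0 / (1.0 - r2_j)


def r2_from_vif(vif):
    """Invert the VIF definition to recover the auxiliary R^2."""
    return 1.0 - 1.0 / vif


# VIF values reported in the text for the five retained regressors.
reported_vif = {"x2": 7.130, "x3": 5.819, "x7": 1.648, "x8": 2.120, "x9": 1.239}
for name, vif in reported_vif.items():
    print(name, round(r2_from_vif(vif), 3))
```

For example, a VIF of 7.130 corresponds to an auxiliary R Square of about 0.86, so median age is substantially, though not critically, explained by the other regressors.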

Cross validation is used to assess whether a model is adequate. In this case, we hold out the last 20% of our data, countries 102 to 126, and predict the corresponding y values. Using this test set, we compare the predicted values with their corresponding original values; if the differences are small, the model can be considered adequate for prediction. To determine which model is best suited for prediction, we consider the last two models, one with variables x2, x3, x7, x8, and x9 and the other without x7. We then calculate the following statistics to compare their performance: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Range-normalized RMSE (NRMSE), and Mean Absolute Percentage Error (MAPE).
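The four statistics can be computed as in the following sketch. The definitions used are the usual ones and are assumptions insofar as the paper does not spell them out: NRMSE divides the RMSE by the range of the observed values, and MAPE is returned as a fraction, not a percentage. The toy numbers are made up.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def nrmse(actual, predicted):
    """RMSE divided by the range (max - min) of the observed values."""
    return rmse(actual, predicted) / (max(actual) - min(actual))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; undefined if any actual is 0."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy values for illustration only (not from Table 6).
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]
print(mae(actual, predicted), rmse(actual, predicted))
print(nrmse(actual, predicted), mape(actual, predicted))
```

Lower values indicate better predictive performance on all four measures, which is the basis for comparing the two candidate models.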

For the model with variables x2, x3, x7, x8, and x9 (equation 5) we get:

MAE = 0.34819

RMSE = 0.45537

NRMSE = 0.00108

MAPE = 0.31924

For the model with variables x2, x3, x8, and x9 (Table 4) we get:

MAE = 0.35409

RMSE = 0.47142

NRMSE = 0.00112

MAPE = 0.31710

The MAE, RMSE, and NRMSE are slightly higher for the second model, although its MAPE is marginally lower, which indicates that the first model is better overall for our data. Thus, the final model is:

log (y) = -0.919 + 0.054x2 + 0.057x3 - 0.008x7 - 0.069x8 + 0.000x9

Using the above final model, we predict the total number of COVID-19 deaths per million for the last 24 countries; the results are provided in Table 6.

From Table 6, it appears that the differences between the predicted and original values are not very large. Therefore, we conclude that our model predicts the total number of deaths per million reasonably accurately.

This paper develops a predictive model for the total deaths of COVID-19 per million citizens. Nine regressors were considered in the model: population density, median age, people 70 and older, GDP per capita, CVD death rate, diabetes prevalence, tobacco prevalence, hospital beds per 100,000 people, and total cases of COVID-19 per million. After fitting a full model, a transformed model, and a reduced model using backward elimination, we concluded that only five variables were significant when analyzing the deaths because of COVID-19. Those variables were median age (x2), people 70 and older (x3), tobacco prevalence (x7), hospital beds per 100,000 people (x8), and total cases of COVID-19 per million (x9). This paper considers COVID-19 data until May 29, 2020. However, one can extend this work with updated data using the same or different models.

The authors are thankful to three referees for their valuable comments and suggestions, which certainly improved the presentation and quality of the paper. They wish to dedicate this paper to those who have lost their lives due to COVID-19 in the USA.

Figure 1 QQ plot of residuals.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-001.gif

Figure 2 Plot of residuals vs. Fitted values.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-002.gif

Figure 3 QQ plot of residuals.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-003.gif

Figure 4 Plot of residuals vs. Fitted values.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-004.gif

Table 1 Descriptive statistics.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table1.html

Table 2 Regression analysis of the COVID-19 data.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table2.html

Table 3 Regression analysis on transformed variable.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table3.html

Table 4 Regression analysis on transformed variable.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table4.html

Table 5 Regression analysis on transformed variable including x7.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table5.html

Table 6 Original and predicted values for the last 24 countries.

https://www.vibgyorpublishers.org/content/ijior/ijior-3-008-table6.html

References

[1] Francesco Di Gennaro, Damiano Pizzol, Claudia Marotta, Mario Antunes, Vincenzo Racalbuto, et al. (2020) Coronavirus diseases (COVID-19) current status and future perspectives: A narrative review. Int J Environ Res Public Health 17: 2690. https://pubmed.ncbi.nlm.nih.gov/32295188

[2] Hannah Ritchie, Esteban Ortiz-Ospina, Diana Beltekian, Edouard Mathieu, Joe Hasell, et al. (2020) Coronavirus pandemic (COVID-19) - statistics and research. Our World in Data. https://ourworldindata.org/coronavirus

[3] Ghosal S, Sengupta S, Majumder M, Sinha B (2020) Linear regression analysis to predict the number of deaths in India due to SARS-CoV-2 at 6 weeks from day 0 (100 cases - March 14th, 2020). Diabetes & Metabolic Syndrome: Clinical Research & Reviews 12: 311-315. https://www.sciencedirect.com/science/article/abs/pii/S1871402120300576

[4] Lei Qin, Qiang Sun, Yidan Wang, Ke-Fei Wu, Mingchih Chen, et al. (2020) Prediction of the number of new cases of 2019 novel coronavirus (COVID-19) using a social media search index. Int J Environ Res Public Health 17: 2365. https://pubmed.ncbi.nlm.nih.gov/32244425

[5] Motulsky HJ, Christopoulos A (2003) Fitting models to biological data using linear and nonlinear regression. A practical guide to curve fitting. GraphPad Software Inc., San Diego CA. https://www.facm.ucl.ac.be/cooperation/Vietnam/WBI-Vietnam-October-2011/Modelling/RegressionBook.pdf

[6] Douglas C Montgomery, Elizabeth A Peck, G Geoffrey Vining (2013) Introduction to linear regression analysis. Wiley-Blackwell.

[7] James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning. Springer, New York. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

[8] Guzman CI, Kibria BMG (2019) Developing multiple linear regression models for the number of citations: A case study of Florida International University professors. International Journal of Statistics and Reliability Engineering 6: 75-81. http://www.ijsreg.com/index.php/ijsre/article/view/274

[9] Saleh AK Md E, Arashi M, Kibria BMG (2019) Theory of ridge regression estimation with applications. Wiley, New York.

[10] Grundy EJ, Suddek T, Filippidis FT, Majeed A, Coronini SC (2020) Smoking, SARS-COV-2 and COVID-19: A review of reviews considering implications for public health policy and practice. Tobacco Induced Diseases 18: 58. http://www.tobaccoinduceddiseases.org/Smoking-SARS-CoV-2-and-COVID-19-A-review-of-reviews-considering-implications-for,124788,0,2.html