Vinn's Studio

COVID-19 Survival Analysis

2020/03/27

On March 11, 2020, the World Health Organization (WHO) declared COVID-19 a global pandemic. COVID-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It has now spread to more than 185 countries or territories, with more than 300,000 reported cases and more than 13,000 deaths. Amid the escalating fear over the spread of the disease, the mass media often report that vulnerability to COVID-19 is highly age-specific, with older adults the most vulnerable to its worst effects. Indeed, according to a recent study by the Chinese Center for Disease Control and Prevention (China CDC), the case fatality rate (CFR; the proportion of deaths among confirmed cases) for patients 70 years or older can be as high as 8%-14.8%, while that for patients under 50 years old stays below 0.4%.

To confirm the validity of such reports and findings, we aim to formally assess the age difference in the case fatality rate by analyzing publicly available COVID-19 epidemiological datasets using survival analysis methodology.

## Data Set

To that end, we use online data released by the Korea Centers for Disease Control & Prevention (Korea CDC) and prepared by the DS4C (Data Science for COVID-19) Project. The dataset of interest is PatientInfo.csv, which contains subject-level data for over 2,000 confirmed COVID-19 cases in South Korea. Key variables include:

- patient_id: the ID of the patient
- sex: the sex of the patient
- age: the age of the patient
- confirmed_date: the date of being confirmed
- released_date: the date of being released
- deceased_date: the date of being deceased (death)
- state: isolated / released / deceased

## Data Cleaning

To clean the data, I handle different groups of patients in different ways. The main ideas are as follows.

Missing State

I drop patients with a missing state variable, because state is essential in survival analysis: without it, we do not know whether a patient died or not. Dropping these patients avoids any error that an incorrectly imputed state would introduce.

Isolated Patients

For the isolated patients, I use the date on which the data set was last updated (2020-03-21) as the censoring date.

Released Patients

For the released patients, I use the released date as the censoring date. If the released date is missing but the confirmed date is available, I use "confirmed date + overall average isolation duration (about 14 days)" as the censoring date.

Deceased Patients

For the deceased patients, the deceased date is the death date. If the deceased date is missing but the confirmed date is available, I use "confirmed date + overall average time between confirmation and death" as the death date.
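
The three rules above (isolated, released, deceased) can be sketched in base R as follows. This is only a minimal illustration on made-up rows, not the code used for the analysis: the helper name `build_end_date`, the toy data frame, and the choice to estimate both averages from the rows with complete dates are assumptions.

```r
# Toy stand-in for PatientInfo.csv (made-up rows, for illustration only)
patients <- data.frame(
  state          = c("isolated", "released", "released", "deceased", "deceased"),
  confirmed_date = as.Date(c("2020-03-01", "2020-02-20", "2020-02-25",
                             "2020-02-22", "2020-03-02")),
  released_date  = as.Date(c(NA, "2020-03-05", NA, NA, NA)),
  deceased_date  = as.Date(c(NA, NA, NA, "2020-02-28", NA)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one end date (censoring or death) per patient
build_end_date <- function(df, last_update = as.Date("2020-03-21")) {
  # Average durations in days, estimated from rows where both dates are present
  avg_iso   <- round(mean(as.numeric(df$released_date - df$confirmed_date), na.rm = TRUE))
  avg_death <- round(mean(as.numeric(df$deceased_date - df$confirmed_date), na.rm = TRUE))

  end <- rep(last_update, nrow(df))   # isolated: censored at the last update
  rel <- df$state == "released"
  dec <- df$state == "deceased"
  end[rel] <- df$released_date[rel]   # released: censored at release
  end[dec] <- df$deceased_date[dec]   # deceased: event at death

  # Fall back to confirmed date + average duration when the date is missing
  miss_rel <- rel & is.na(end)
  miss_dec <- dec & is.na(end)
  end[miss_rel] <- df$confirmed_date[miss_rel] + avg_iso
  end[miss_dec] <- df$confirmed_date[miss_dec] + avg_death
  end
}

patients$end_date <- build_end_date(patients)
patients$event    <- as.integer(patients$state == "deceased")  # 1 = death, 0 = censored
print(patients[, c("state", "confirmed_date", "end_date", "event")])
```

In this toy example the only complete released row gives an average isolation of 14 days and the only complete deceased row gives 6 days, so the two rows with missing dates receive confirmed date + 14 and confirmed date + 6 respectively.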

Missing Confirmed Date

For those with a missing confirmed date, if the censoring/death date is available (either from the original data set or from the methods above), I use weighted sampling to pick a confirmed date between the first confirmed date among all patients and this patient's censoring/death date. The weights come from the number of confirmations on each date: the more patients were confirmed on a given date, the higher the weight that date receives.
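
A rough sketch of this weighted sampling in base R is shown below. The helper name `impute_confirmed`, the made-up dates, and the seed are assumptions for illustration. Dates on which nobody was confirmed get weight zero here, so they are simply left out of the candidate set.

```r
set.seed(741)  # arbitrary seed so the sketch is reproducible

# Made-up confirmed dates of patients whose date is known
known_confirmed <- as.Date(c("2020-02-18", "2020-02-20", "2020-02-20",
                             "2020-02-20", "2020-02-25", "2020-03-01"))

# Hypothetical helper: draw a confirmed date no later than the patient's end date
impute_confirmed <- function(end_date, known_dates) {
  counts     <- table(known_dates)        # number of patients confirmed on each date
  candidates <- as.Date(names(counts))
  weights    <- as.numeric(counts)
  keep       <- candidates <= end_date    # only dates up to the censoring/death date
  idx        <- which(keep)
  # Sampling an index avoids sample()'s special case for a length-one vector
  chosen <- idx[sample.int(length(idx), size = 1, prob = weights[keep])]
  candidates[chosen]
}

imputed <- impute_confirmed(as.Date("2020-02-26"), known_confirmed)
print(imputed)
```

With these toy dates, 2020-02-20 is three times as likely to be drawn as 2020-02-18 or 2020-02-25, and 2020-03-01 can never be drawn because it falls after the end date.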

Missing Age

For those with a missing age, if the birth year is available, I calculate the age from the birth year. If both age and birth year are missing, I drop the observation because of the importance of age in this analysis. (When calculating crude CFRs, patients with a missing age could also be kept as a separate category to give a fuller picture.)
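
The age rule might look like the following in base R. The sketch assumes that `age` is stored as a decade label such as "50s" (as the group names in the model output suggest), that a `birth_year` column exists, and that 2020 is the reference year; the rows are made up.

```r
# Made-up rows: `age` is a decade label, `birth_year` may fill the gaps
patients <- data.frame(
  age        = c("30s", NA, NA, "70s"),
  birth_year = c(1988, 1955, NA, 1947),
  stringsAsFactors = FALSE
)

# Fill a missing age from the birth year (reference year 2020 is an assumption)
need_age  <- is.na(patients$age) & !is.na(patients$birth_year)
age_years <- 2020 - patients$birth_year[need_age]
patients$age[need_age] <- paste0(floor(age_years / 10) * 10, "s")

# Drop rows where neither age nor birth year is available
patients <- patients[!is.na(patients$age), ]
print(patients)
```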

Missing Gender

For those with a missing gender, I randomly assign one, with equal probability of male and female. (Alternatively, patients with a missing gender could be kept as a separate category.)
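
As a small sketch of this step (made-up values, arbitrary seed), the random assignment can be done with `sample()`:

```r
set.seed(741)  # arbitrary seed, for reproducibility only

sex <- c("male", NA, "female", NA, NA)  # made-up values
missing_sex <- is.na(sex)

# Each missing value is independently male or female with probability 1/2
sex[missing_sex] <- sample(c("male", "female"), size = sum(missing_sex), replace = TRUE)
print(sex)
```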

Outcome of interest

The outcome of interest is the time from confirmation to censoring/death. I drop patients whose confirmed date is later than their deceased date, because they were confirmed only after their death.
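
A minimal base R sketch of how the outcome could be built from the cleaned dates is given below. The column names `end_date`, `time`, and `event` are assumptions (the model formulas later in this post use `time` and `state`), and the rows are made up.

```r
# Made-up rows; `end_date` is the censoring or death date from the cleaning steps
patients <- data.frame(
  confirmed_date = as.Date(c("2020-02-20", "2020-03-01", "2020-03-05")),
  end_date       = as.Date(c("2020-03-05", "2020-03-21", "2020-03-02")),
  state          = c("released", "isolated", "deceased"),
  stringsAsFactors = FALSE
)

# Follow-up time in days and event indicator (1 = death, 0 = censored)
patients$time  <- as.numeric(patients$end_date - patients$confirmed_date)
patients$event <- as.integer(patients$state == "deceased")

# Drop patients whose confirmed date falls after their end date
patients <- patients[patients$time >= 0, ]
print(patients)
```

The third toy patient has a confirmed date three days after the death date, so the row is removed and two censored patients remain.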

## Descriptive Analysis

I divide the study sample into 5 age groups: <=40s, 50s, 60s, 70s and >=80s. Figure 1 shows the age-specific CFRs calculated from both the cleaned data and the study by the China CDC. For younger patients (the <=40s and 50s groups), CFRs from this data set and from the China CDC study are very close to each other. For older patients (60s and above), however, CFRs from the China CDC study are higher than those from this data set.

Figure 1: Age-specific CFRs
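
For reference, the crude CFR of each age group is simply deaths divided by confirmed cases. The sketch below recomputes it from the group sizes and death counts that appear in the N and Observed columns of the log-rank output later in this post; whether Figure 1 was drawn from exactly these counts is an assumption.

```r
# Counts per age group, copied from the N and Observed columns of the log-rank output
age_group <- c("<=40s", "50s", "60s", "70s", ">=80s")
n_cases   <- c(1143, 394, 238, 119, 103)
n_deaths  <- c(2, 5, 6, 8, 8)

# Crude CFR = deaths / confirmed cases, in percent
cfr_pct <- round(100 * n_deaths / n_cases, 2)
names(cfr_pct) <- age_group
print(cfr_pct)

# Overall crude CFR across all 1997 patients, in percent
overall_cfr_pct <- round(100 * sum(n_deaths) / sum(n_cases), 2)
print(overall_cfr_pct)
```

The CFR rises from well under 1% in the youngest group to roughly 7-8% in the two oldest groups.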

## Survival Analysis

KM Curves & Log-rank Test

Figure 2 shows the gender-specific and age-specific Kaplan-Meier curves for the case survival probabilities.

Figure 2: KM Curves
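
To make the curves concrete, here is a small hand-rolled Kaplan-Meier estimator in base R. It is only an illustrative sketch on made-up data; the helper name `km_estimate` is hypothetical, and the curves in Figure 2 were presumably produced with a package routine rather than with this function.

```r
# Hypothetical helper: product-limit (Kaplan-Meier) estimate of S(t)
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))   # distinct death times
  surv <- numeric(length(event_times))
  s <- 1
  for (i in seq_along(event_times)) {
    t_i     <- event_times[i]
    at_risk <- sum(time >= t_i)                   # still under observation just before t_i
    deaths  <- sum(time == t_i & event == 1)      # deaths exactly at t_i
    s       <- s * (1 - deaths / at_risk)         # multiply by conditional survival
    surv[i] <- s
  }
  data.frame(time = event_times, surv = surv)
}

# Made-up follow-up times (days) and event indicators (1 = death, 0 = censored)
toy_time  <- c(2, 3, 3, 5, 8)
toy_event <- c(1, 0, 1, 1, 0)

km <- km_estimate(toy_time, toy_event)
print(km)
```

At each death time the running survival probability is multiplied by the fraction of at-risk patients who survive that time; censored patients leave the risk set without causing a drop in the curve.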

The gender-stratified log-rank test results can be seen as follows:

    Call:
    survdiff(formula = Surv(time, state) ~ age + strata(sex), data = covid)

                 N Observed Expected (O-E)^2/E (O-E)^2/V
    age=<=40s 1143        2    17.24   13.4701   33.3935
    age=50s    394        5     5.40    0.0292    0.0361
    age=60s    238        6     3.55    1.6964    1.9366
    age=70s    119        8     1.57   26.2214   27.7918
    age=>=80s  103        8     1.24   36.7229   38.5403

     Chisq= 78.6  on 4 degrees of freedom, p= 3e-16

The p-value is about 3e-16, suggesting that, after stratifying by gender, age has a highly significant effect on survival. Together with the KM curves in Figure 2, these results suggest that COVID-19 is more dangerous for older patients than for younger patients, and that females are somewhat more likely to survive than males.
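
As a quick sanity check, the reported p-value can be recovered from the chi-square statistic and its degrees of freedom, and the Observed and Expected columns can be compared directly. The numbers below are copied from the output above; the variable names are only illustrative.

```r
# Chi-square statistic and degrees of freedom reported by survdiff above
chisq_stat <- 78.6
df <- 4

# Upper-tail probability; lower.tail = FALSE keeps precision for tiny p-values
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
print(signif(p_value, 1))

# Observed vs expected deaths per age group, from the same table
observed <- c(2, 5, 6, 8, 8)
expected <- c(17.24, 5.40, 3.55, 1.57, 1.24)
ratio <- round(observed / expected, 2)
names(ratio) <- c("<=40s", "50s", "60s", "70s", ">=80s")
print(ratio)
```

The ratio of observed to expected deaths is far below 1 in the youngest group and well above 1 in the two oldest groups, which is what drives the large test statistic.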

Cox proportional hazards model

I also fitted a Cox proportional hazards model with age groups (with >=80s as reference group) and gender as covariates. The summary table of this model is as follows:

    Call:
    coxph(formula = Surv(time, state) ~ age + sex, data = covid)

      n= 1997, number of events= 29

                  coef exp(coef) se(coef)      z Pr(>|z|)
    age<=40s  -4.05347   0.01736  0.79237 -5.116 3.13e-07 ***
    age50s    -1.95918   0.14097  0.57081 -3.432 0.000599 ***
    age60s    -1.36863   0.25445  0.54248 -2.523 0.011638 *
    age70s    -0.25268   0.77671  0.50066 -0.505 0.613768
    sexfemale -1.46996   0.22993  0.41769 -3.519 0.000433 ***
    ---

              exp(coef) exp(-coef) lower .95 upper .95
    age<=40s    0.01736     57.597  0.003674   0.08205
    age50s      0.14097      7.094  0.046054   0.43153
    age60s      0.25445      3.930  0.087872   0.73683
    age70s      0.77671      1.287  0.291140   2.07215
    sexfemale   0.22993      4.349  0.101406   0.52137

    Concordance= 0.867  (se = 0.029 )
    Likelihood ratio test= 63.33  on 5 df,   p=2e-12
    Wald test            = 45.92  on 5 df,   p=9e-09
    Score (logrank) test = 82.33  on 5 df,   p=3e-16
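
To read the table: each exp(coef) is a hazard ratio relative to the reference group (>=80s for age, male for sex), and the 95% limits are exp(coef ± 1.96 × se(coef)). The sketch below reproduces the second half of the table from the first, using only the coefficients and standard errors printed above; the object names are illustrative.

```r
# Coefficients and standard errors copied from the coxph summary above
coef_hat <- c("age<=40s" = -4.05347, "age50s" = -1.95918, "age60s" = -1.36863,
              "age70s" = -0.25268, "sexfemale" = -1.46996)
se_hat   <- c(0.79237, 0.57081, 0.54248, 0.50066, 0.41769)

z <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval

hr_table <- data.frame(
  hazard_ratio = exp(coef_hat),
  lower_95     = exp(coef_hat - z * se_hat),
  upper_95     = exp(coef_hat + z * se_hat)
)
print(signif(hr_table, 4))
```

For example, the estimated hazard of death for females is about 0.23 times that for males of the same age group (95% CI roughly 0.10 to 0.52), while the interval for the 70s group includes 1, so its hazard is not clearly different from that of the >=80s reference group.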

From Figure 3, we can see that the points are mostly clustered around the identity line, indicating that overall the model fits the data reasonably well.

Figure 3: Nelson-Aalen estimates of Cox-Snell residuals
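
For readers unfamiliar with this diagnostic, the standard definitions are sketched below (the notation is mine, not taken from the original analysis). The Cox-Snell residual of patient $i$ with follow-up time $t_i$ and covariates $x_i$ is the fitted cumulative hazard at that time:

$$
r_i = \hat{H}_0(t_i)\,\exp\!\left(x_i^\top \hat{\beta}\right)
$$

If the model is correct, the $r_i$ behave like a censored sample from a unit exponential distribution, whose cumulative hazard is $H(r) = r$. The Nelson-Aalen estimate of the cumulative hazard of the residuals,

$$
\hat{H}_r(r) = \sum_{j:\, r_{(j)} \le r} \frac{d_j}{n_j},
$$

where $d_j$ is the number of deaths at the $j$-th ordered residual $r_{(j)}$ and $n_j$ is the number of patients with residuals at least that large, should therefore lie close to the identity line when plotted against $r$.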

I also conducted chi-square tests of the proportional hazards assumption. The test results are as follows:

                  rho  chisq     p
    age<=40s  -0.2459 1.7308 0.188
    age50s    -0.1142 0.3825 0.536
    age60s     0.0807 0.1862 0.666
    age70s    -0.1259 0.4570 0.499
    sexfemale -0.0432 0.0531 0.818
    GLOBAL         NA 3.1689 0.674

The p-value of the global test was 0.674, greater than 0.05, so the test is non-significant and gives no evidence against the proportional hazards assumption.
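
The table above has the layout printed by older versions of `cox.zph()` in the survival package, so it is reasonable to assume that function was used, although the post does not say so. A minimal sketch on stand-in data (the `lung` dataset bundled with survival, not the COVID-19 data) looks like this:

```r
library(survival)  # recommended package included with most R installations

# Stand-in data: the `lung` dataset bundled with survival, not the COVID-19 data
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Proportional hazards tests based on scaled Schoenfeld residuals,
# one row per covariate plus a global test
zph <- cox.zph(fit)
print(zph)

# The global p-value sits in the "GLOBAL" row of the result table
global_p <- zph$table["GLOBAL", "p"]
print(global_p)
```

Recent versions of the package print `chisq`, `df`, and `p` columns instead of `rho`, `chisq`, and `p`, but the interpretation of the p-values is the same.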

Wald Test

Finally, I conducted a Wald test of the overall effect of age (chi-square with 4 degrees of freedom). The resulting p-value was about 2.5e-7, which indicates that age has a significant effect on COVID-19 survival.

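    ### Wald test with 4 degrees of freedom
    # H_0: beta_1 = beta_2 = beta_3 = beta_4 = 0 (no age effect)
    # obj is assumed to be the coxph fit shown above
    beta_q <- obj$coefficients[1:4]   # the four age coefficients
    Sigma_q <- obj$var[1:4, 1:4]      # their estimated covariance matrix
    # chi-square statistic with 4 d.f.
    chisqStat <- t(beta_q) %*% solve(Sigma_q) %*% beta_q
    print(chisqStat) # 36.30768
    # p-value (upper tail of the chi-square distribution)
    pval <- 1 - pchisq(chisqStat, df = 4)
    print(pval) # 2.501175e-07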
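
In formula form, the code above computes the Wald statistic for the four age coefficients $\hat{\beta}_q$ with estimated covariance matrix $\hat{\Sigma}_q$:

$$
W = \hat{\beta}_q^\top\, \hat{\Sigma}_q^{-1}\, \hat{\beta}_q \;\sim\; \chi^2_4 \quad \text{under } H_0 .
$$

For 4 degrees of freedom the upper-tail probability has a closed form, which gives a quick check of the reported p-value:

$$
P\!\left(\chi^2_4 > w\right) = e^{-w/2}\left(1 + \frac{w}{2}\right),
\qquad
e^{-36.30768/2}\left(1 + \frac{36.30768}{2}\right) \approx 2.50 \times 10^{-7}.
$$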

## Summary

In conclusion, the above analysis indicates that survival among COVID-19 patients differs by age and gender. In general, younger patients are more likely to survive than older patients, and female patients are more likely to survive than male patients.

## Acknowledgement

BMI/STAT 741, UW-Madison



