1 Introduction

Socio-economic factors have been found to impact how epidemics spread. The availability of Covid-19 related data at a small geographic scale has enabled us to compare and contrast disease evolution across different geographic locations and study its relation to local socioeconomic factors.

Socio-economic factors measure different components and aspects of the social environment, which often include occupation, income, education, and facilities (cf. Krieger 2001). Emerging infectious diseases have been found to be greatly driven by socio-economic, environmental and ecological factors (Jones et al. 2008; Khalatbari-Soltani et al. 2020). Khalatbari-Soltani et al. (2020) argued that it is crucial to collect data on socio-economic determinants in COVID-19 studies. Some studies have examined the impact of socio-economic factors on COVID-19, e.g., income (Kim and Bostwick 2020), education (Kim and Bostwick 2020; Drefahl et al. 2020), living conditions (Raisi-Estabragh et al. 2020), and population density (Gu et al. 2020). Raisi-Estabragh et al. (2020) found that the factors underlying ethnic differences in COVID-19 relate to socio-economic status, e.g., living conditions. McLaren (2020) identified the socioeconomic roots of racial disparities in COVID-19 mortality. It was found that the ethnic differences in COVID-19 disappeared when factors related to education, occupation, and commuting patterns were controlled for. Kim and Bostwick (2020) created a social vulnerability index (SVI) using data on poverty, education, female-headed households with children, median household income, and employment ratio. They found that the SVI is associated with COVID-19 deaths in Chicago. Cordes and Castro (2020) conducted a similar spatial analysis for New York City. Gu et al. (2020) found that high population density was relevant to COVID-19 outcomes in Michigan. Additionally, geographic proximity to food outlets (fast-food restaurants, etc.) and to health facilities (e.g., fitness centers) has been studied extensively as part of research on environmental health and food security, among others (Aztsop and Joy 2013; Hilmers et al. 2012). The importance of access to food in Chicago (Kim and Bostwick 2020), New York (Cordes and Castro 2020) and Michigan (Gu et al. 2020) suggests that it should also be relevant to the development of COVID-19. The same premise can be applied to access to health facilities such as pharmacies and fitness centers.

Given the importance of understanding how particular socioeconomic factors could affect the spread of COVID-19, in this paper we focus on the evolution of COVID-19 confirmed cases and deaths in the counties of the state of New Jersey (NJ). The socio-economic factors considered in this study include: (a) demographic variables (i.e., population, percentage of population older than 80, percentage of low-income population), (b) a geographic variable (i.e., distance from each county to New York), and (c) variables about geographic access to food and health facilities (i.e., the number of fast-food outlets, restaurants, nursing homes, groceries, fitness centers, and pharmacies in the counties). We first examined the impact of these socio-economic factors on the total number of cases and deaths, and then their impact on the dissimilarity between the time profiles of the counties.

The rest of the paper is organized as follows. In Sect. 2 we discuss the sources of data and the statistical methodology that was employed. In Sect. 3 we present the results of the statistical analysis, followed by the discussion and conclusions in Sects. 4 and 5 respectively.

2 Methods

2.1 Data source

We collected the daily cumulative counts of confirmed COVID-19 cases and deaths in each county of NJ during the period March 1st to September 9th, 2020. The time-course data was collected from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) (Dong et al. 2020). The population in each county of NJ was also available. The percentage of population older than 80 was obtained from the 2019 population estimates for NJ counties (https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/asrh/cc-est2019-agesex-34.csv).

Annual total household income data were collected from the American Community Survey on the United States Census Bureau website (data.census.gov) (https://data.census.gov/cedsci/table?t=IncomeandPoverty&g=0400000US34,34.050000&tid=ACSDP1Y2018.DP03&moe=false&hidePreview=true). From these data, the estimated number and proportion of households within a given income range can be readily computed for each county. Families with household incomes less than 80% of the local median income are designated as low-income families. The proportion of local low-income households in each county was then obtained.

For the distance from each county to New York City (NYC), we first obtained the coordinates of the center of each county and of NYC using the Google Maps API. The distance was then computed with the geopy library (https://geopy.readthedocs.io/en/stable/).
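The distance step can be illustrated with a small sketch. It is a stand-in rather than the code used in the study: it computes a spherical great-circle (haversine) distance in plain Python, whereas geopy's geodesic distance uses an ellipsoidal Earth model and will give slightly different values. The function name `haversine_km` and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# Illustrative coordinates only (not the coordinates used in the study).
nyc = (40.7128, -74.0060)
example_county_center = (40.0, -74.5)
distance_km = haversine_km(example_county_center, nyc)
```

For distances of the order of tens to a few hundred kilometers, the spherical approximation typically differs from an ellipsoidal result by well under one percent, which is small relative to the spread of distances between counties.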

The food and health facilities considered in the study were McDonald’s, other fast-food restaurants, non-fast-food restaurants, grocery stores, fitness centers, and pharmacies. Since McDonald’s is an icon of fast food, we separated it from other fast-food restaurants. We collected data on geographic access to the facilities by querying the Yelp Fusion API (https://www.yelp.com/developers/documentation/v3), which searches for facilities around a given coordinate within a given radius. We divided NJ into a grid of 4000 m \(\times \) 4000 m cells, searched for facilities in each cell by providing the coordinates of the cell center and the radius of the circumscribed circle of the cell, and then removed duplicate results. The county in which each facility was located was determined from the address and zip code in the search result. Finally, we obtained the number of facilities in each county.
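The grid search described above can be sketched as follows. The sketch makes no API calls; it only shows the geometry and the de-duplication. It assumes the bounding box is given in projected meters and that each search result carries a unique identifier under a key named `id`; both are assumptions made for illustration. Because the circumscribed circles of neighboring cells overlap, the same facility can be returned by several queries, which is why duplicates must be removed.

```python
import math

CELL_M = 4000.0                              # grid cell side length, as in the text
SEARCH_RADIUS_M = CELL_M * math.sqrt(2) / 2  # circumscribed-circle radius, about 2828 m

def cell_centers(x_min, y_min, x_max, y_max, cell=CELL_M):
    """Centers of square cells covering a bounding box given in projected meters."""
    nx = math.ceil((x_max - x_min) / cell)
    ny = math.ceil((y_max - y_min) / cell)
    return [(x_min + (i + 0.5) * cell, y_min + (j + 0.5) * cell)
            for i in range(nx) for j in range(ny)]

def deduplicate(results, key="id"):
    """Keep the first record seen for each identifier (hypothetical 'id' field)."""
    seen = {}
    for record in results:
        seen.setdefault(record[key], record)
    return list(seen.values())

# Toy usage: a box of 8 km by 4 km needs two cells.
centers = cell_centers(0.0, 0.0, 8000.0, 4000.0)
unique = deduplicate([{"id": "a"}, {"id": "b"}, {"id": "a"}])
```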

To compile the list of all nursing homes, we searched the New Jersey Department of Health website (https://healthapps.state.nj.us/Facilities/fsSearch.aspx). We also included residential health care and assisted living facilities.

2.2 Explanatory variables

The explanatory variables included in this study are food-and-health access variables (i.e., the number of McDonald’s \(x_1\), other fast food \(x_2\), non-fast-food restaurants \(x_3\), groceries \(x_4\), fitness centers \(x_5\), pharmacies \(x_6\), and nursing homes \(x_7\)), demographic variables (i.e., population, p; the percentage of population older than eighty, o, referred to as elder for short; and the proportion of low-income households in each county, l, referred to as low-income for short), and the geographic variable (i.e., distance to NYC, d).

Some variables were highly correlated (see Fig. 4 in “Appendix A”, the matrix of scatter plots for the data). In particular, the food-and-health access variables, \(x_1,\ldots ,x_7\), are strongly related to population (p). To reduce this effect in the analysis, two new datasets were created. In the first one, each food-and-health access variable was divided by population. In the second one, we regressed each of these variables on population and distance to NYC (d), since the food-and-health variables were also strongly related to distance to NYC, and then replaced the variables with the corresponding residuals. The matrices of scatter plots of the two datasets are shown in “Appendix A”, in which the variables are comparatively independent.
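The construction of the second dataset can be illustrated with a minimal least-squares sketch in plain Python. It assumes an ordinary linear regression with an intercept on population and distance; the helper names and the toy numbers are made up for illustration and are not the study's data.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols_residuals(y, predictors):
    """Residuals of y after least-squares regression on an intercept plus predictors."""
    X = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    beta = solve(XtX, Xty)  # normal equations
    return [yi - sum(bj * xj for bj, xj in zip(beta, row)) for row, yi in zip(X, y)]

# Toy data (made up): population p, distance d, and one access variable x1.
p = [1.0, 2.0, 3.0, 4.0, 5.0]
d = [2.0, 1.0, 4.0, 3.0, 6.0]
x1 = [3.0, 4.5, 4.0, 7.0, 6.5]
x1_resid = ols_residuals(x1, [p, d])
```

By construction the residuals sum to zero and are uncorrelated with both population and distance, which is what makes them usable as "population- and distance-free" versions of the access variables.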

2.3 Statistical analysis

In this study, we analyzed both the total number of confirmed cases and the time-course data. The time-course data was of the form (\(t_i,c_{ij}\)) where \(t_i\) was the ith date (\(i=1,\ldots , m\)) and \(c_{ij}\) was the total number of cases up to date \(t_i\) in County j.

2.3.1 Analysis on total number of cases

Since the number of cases would clearly be related to population, we also used the proportion of cases within a county as response variable, i.e. \(y_j = c_{mj}\)/\(p_j\), where \(p_j\) is the population of County j. We ran a linear regression model and then backward stepwise regression to select variables.

The total number of cases was highly related to population and distance to NYC (see Fig. 4 in “Appendix A”). We investigated also how these socio-economic factors could be contributing to total cases after removing population and distance to NYC as factors. This was done by modeling the residuals from fitting {\(c_{mj}\)} vs {\(p_j\)} and {\(d_j\)}.

2.3.2 Analysis of the time-course data

The time-course data are shown in Fig. 1a. They were also analyzed as a proportion of the population, obtained by letting \(y_{ij}\) = \(c_{ij}\)/\(p_j\); these profiles are shown in Fig. 1b. The cumulative cases were also analyzed as a proportion of the total number of cases, that is, \(y_{ij}\) = \(c_{ij}\)/\(c_{mj}\), to measure the gradients of the time profiles; these profiles are shown in Fig. 1c. We performed the following analysis for the profiles.
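The two scalings of the cumulative counts can be written in a few lines. This is a sketch only; the function name and the toy numbers are illustrative, and it assumes every county has a positive final total.

```python
def scale_profiles(cumulative, population):
    """Scale cumulative counts per county.

    cumulative[j] is the list c_1j, ..., c_mj for county j; population[j] is p_j.
    Returns (c_ij / p_j, c_ij / c_mj) as two lists of profiles.
    """
    by_population = [[c / p for c in series]
                     for series, p in zip(cumulative, population)]
    by_total = [[c / series[-1] for c in series] for series in cumulative]
    return by_population, by_total

# Toy data (made up): two counties observed on three dates.
cumulative = [[0, 5, 10], [2, 2, 4]]
population = [100, 50]
by_population, by_total = scale_profiles(cumulative, population)
```

Every profile scaled by its own total ends at 1, so differences between such profiles reflect only the timing of the growth, not its magnitude.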

Fig. 1 The different time-course data relevant to cases. a The raw time-course data, {\(c_{ij}\)}; b the time-course data divided by population, {\(c_{ij}\)/\(p_j\)}; c the time-course data divided by total number of cases, {\(c_{ij}\)/\(c_{mj}\)}

(a) Dissimilarity matrix

The dissimilarity between the time profiles for counties j and k can be assessed by the area between the jth and kth curves. A simple way to define the area between the curves is by the formula:

$$\begin{aligned} D_{jk} = \sum _{i}|y_{ij}-y_{ik}|\delta _i \end{aligned}$$
(1)

where \(\delta _i=t_i-t_{i-1}\); since \(t_i\) is measured in days in our case, \(\delta _i=1\). Another dissimilarity measure based on the area between the curves is derived from Simpson’s rule:

Let \(z_{ijk}=|y_{ij}-y_{ik}|\) and let \(\delta _i=\delta \) for all i. Then, for an odd number of dates m,

$$\begin{aligned} D_{jk} = \frac{\delta }{3}(z_{1jk}+4z_{2jk}+2z_{3jk}+4z_{4jk}+2z_{5jk}+\cdots +4z_{(m-1)jk}+z_{mjk}) \end{aligned}$$
(2)

This latter definition, which we call the Simpson’s rule dissimilarity, gives a better approximation to the area between the curves. However, because our time series is quite long, the difference between the two formulas is small.
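Both definitions can be sketched in plain Python. The function names are illustrative. The rectangle version assumes the sum in Eq. (1) runs over all dates with a constant step, and the Simpson version requires an odd number of dates, as the composite rule does.

```python
def rectangle_dissimilarity(y_j, y_k, delta=1.0):
    """Eq. (1): sum of absolute differences times a constant time step."""
    return sum(abs(a - b) for a, b in zip(y_j, y_k)) * delta

def simpson_dissimilarity(y_j, y_k, delta=1.0):
    """Eq. (2): composite Simpson's rule applied to |y_ij - y_ik|."""
    z = [abs(a - b) for a, b in zip(y_j, y_k)]
    if len(z) < 3 or len(z) % 2 == 0:
        raise ValueError("composite Simpson's rule needs an odd number of points (at least 3)")
    # Weights 1, 4, 2, 4, ..., 2, 4, 1
    total = z[0] + z[-1] + 4 * sum(z[1:-1:2]) + 2 * sum(z[2:-1:2])
    return delta / 3 * total

def dissimilarity_matrix(profiles, measure=rectangle_dissimilarity):
    """Symmetric matrix D with D[j][k] = measure(profile j, profile k)."""
    n = len(profiles)
    return [[measure(profiles[j], profiles[k]) for k in range(n)] for j in range(n)]

# Toy profiles (made up): a quadratic curve against a flat one.
curve = [0.0, 1.0, 4.0, 9.0, 16.0]
flat = [0.0] * 5
D = dissimilarity_matrix([curve, flat], measure=simpson_dissimilarity)
```

On this toy example Simpson's rule returns the exact area 64/3 under the quadratic, while the rectangle sum gives 30; on long, slowly varying daily series the two are much closer.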

(b) Multidimensional scaling

Multidimensional scaling (MDS) can then be applied to find a set of points {\(a_j\)} such that the Euclidean distance between \(a_j\) and \(a_k\) approximates \(D_{jk}\). Figure 2a–c shows scatter plots of the points {\(a_j\)} for the different time profiles. Some outliers are marked in red.
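As an illustration of how such points can be obtained, the sketch below computes the first coordinate of classical (Torgerson) MDS by double-centering the squared dissimilarities and running power iteration. The text does not state which MDS variant was used, so the choice of classical MDS, the function name, the deterministic starting vector, and the use of the trace as the denominator of the explained share are all assumptions.

```python
import math

def classical_mds_first_axis(D, iters=500, tol=1e-12):
    """First classical-MDS coordinate of each item and its share of the trace."""
    n = len(D)
    sq = [[D[j][k] ** 2 for k in range(n)] for j in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double centering; D is symmetric, so column means equal row means.
    B = [[-0.5 * (sq[j][k] - row_mean[j] - row_mean[k] + grand)
          for k in range(n)] for j in range(n)]
    v = [float(j + 1) for j in range(n)]  # deterministic start, not constant
    lam = 0.0
    for _ in range(iters):
        w = [sum(B[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        new_lam = sum(w[j] * sum(B[j][k] * w[k] for k in range(n)) for j in range(n))
        converged = abs(new_lam - lam) < tol
        v, lam = w, new_lam
        if converged:
            break
    trace = sum(B[j][j] for j in range(n))
    coords = [math.sqrt(max(lam, 0.0)) * x for x in v]
    share = lam / trace if trace > 0 else float("nan")
    return coords, share

# Toy check: three points on a line at 0, 1 and 3 are recovered up to shift and sign.
D_line = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
coords, share = classical_mds_first_axis(D_line)
```

Power iteration finds the eigenvalue of largest magnitude, so the sketch is only appropriate when that eigenvalue is positive, which holds when a single axis dominates the dissimilarities.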

Fig. 2 Scatter plots of the first two components of multidimensional scaling for the dissimilarity matrix relevant to a the raw time-course case data, {\(c_{ij}\)}; b the time-course case data divided by population, {\(c_{ij}\)/\(p_j\)}; c the time-course case data divided by total number of cases, {\(c_{ij}\)/\(c_{mj}\)}; d combination of {\(c_{ij}\)/\(p_j\)} and {\(c_{ij}\)/\(c_{mj}\)}

In addition, the two sets of dissimilarities, denoted \(D_{jk}\)(P) for the profiles divided by population and \(D_{jk}\)(T) for the profiles divided by the total number of cases, can be combined to give an overall picture of the differences between counties that captures both the cumulative counts and the time profiles:

$$\begin{aligned} D_{jk}(O) = \frac{D_{jk}(P)}{\sqrt{\operatorname{var}(D_{jk}(P))}}+\frac{D_{jk}(T)}{\sqrt{\operatorname{var}(D_{jk}(T))}} \end{aligned}$$
(3)

Multidimensional scaling (MDS) of {\(D_{jk}\)(O)} generated the set of points for the counties in Fig. 2d. Two outlier counties, Passaic and Cumberland, were marked in red.
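The combination in Eq. (3) can be sketched as follows. The text does not say over which entries the variance is taken, so this sketch assumes the population variance over the distinct county pairs (j < k); the function name and toy matrices are illustrative.

```python
import math
from statistics import pvariance

def combine_dissimilarities(D_p, D_t):
    """Eq. (3): sum of two dissimilarity matrices, each divided by its standard deviation."""
    n = len(D_p)
    pairs = [(j, k) for j in range(n) for k in range(j + 1, n)]
    # Assumption: variance is taken over distinct pairs j < k.
    sd_p = math.sqrt(pvariance([D_p[j][k] for j, k in pairs]))
    sd_t = math.sqrt(pvariance([D_t[j][k] for j, k in pairs]))
    return [[D_p[j][k] / sd_p + D_t[j][k] / sd_t for k in range(n)] for j in range(n)]

# Toy matrices (made up): the second is the first on a ten times larger scale.
D_p = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
D_t = [[10.0 * value for value in row] for row in D_p]
D_o = combine_dissimilarities(D_p, D_t)
```

Dividing by the standard deviation puts the two dissimilarities on a common scale, so that neither dominates the sum merely because of its units; in the toy example both terms contribute equally.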

(c) The first component regression

Since a large proportion of the dissimilarity between counties could be accounted for by the first eigenvector of MDS (larger than 0.90), it was reasonable to regard the values {\(a_j\)} along this eigenvector as carrying the most information regarding differences between counties. These values, denoted {\(a_j^*\)}, were then used as the response variable and modeled against the explanatory variables. We also investigated how socio-economic factors could be contributing to differences in the spread of Covid-19 after removing population and distance to NYC as factors. This was done by modeling the residuals from fitting {\(a_j^*\)} vs {\(p_j\)} and {\(d_j\)}.

An identical analysis was done with the cumulative deaths data. The corresponding plots can be found in “Appendix B”.

3 Results

Table 1 shows the linear regression results using the data set where the food-and-health variables (i.e., \(x_1,x_2,\ldots ,x_7\)) were divided by population (p). Columns 2-5 give the regression results for the total number of cases and deaths (Columns C and D), as well as their proportions (Columns C/P and D/P). The effects were almost identical. The most important factors were McDonald’s, other fast-food, fitness, pharmacy, low-income and population. As expected, the total numbers (of cases and deaths) significantly increased with low-income and population density. For food factors, other fast-food restaurants had a positive effect with p-value < 0.001 for cases and p-value < 0.01 for deaths, while McDonald’s had a negative effect (with p-value < 0.01). This indicates that McDonald’s, in contrast to other fast-food and non-fast-food outlets, was associated with lower transmission of the Covid-19 virus. Surprisingly, the collected data indicated that non-fast-food restaurants and grocery stores were not significant. For health factors, fitness centers were positively correlated with p-value < 0.01 for cases and p-value < 0.05 for deaths. Further, the number of pharmacies was negatively correlated with p-value < 0.01 for cases and p-value < 0.05 for deaths. Surprisingly, our data indicated that the number of nursing homes was not significant. As for the analysis of the proportion of cases, the effects of the factors were almost the same as those on the total cases (Column C). However, the regression results for the proportion of deaths were quite different from the others. In particular, they were dominated by distance to NYC, while food-and-health factors were not significant except for the number of nursing homes.

Table 1 Regression results using the data set where the food-and-health variables (i.e., \(x_1,x_2,\ldots ,x_7\)) were divided by population (p)

The other columns in Table 1 give the regression analysis using the first component of MDS for the dissimilarity matrix. This analysis also showed that the dissimilarity in cases and deaths between counties could be explained by the socio-economic factors. The effects on the dissimilarity in cases (Column C) were almost identical to those in deaths (Column D). The most important factors were population, distance to NYC, and low-income. For food-and-health factors, only non-fast-food was significant, which indicated that a larger number of non-fast-food restaurants was associated with more cases and deaths. On the other hand, the results for the dissimilarity in the proportion of cases (deaths) (Columns C/P and D/P) were similar to those for the total number of cases (deaths) (Columns 3 and 5), except that fitness centers were not significant in Column C/P and nursing homes were not significant for D/P. Also, it is worth noting that the death proportion dissimilarity was influenced greatly by distance to NYC: the larger the distance, the lower the death proportion. Columns C/T (D/T) show the effects on the dissimilarity in the gradients between the counties, where the gradients were computed as the cumulative cases (deaths) divided by the total number of cases (deaths) over the time course of the curves. The dominant effect was the distance to NYC, as seen in Figs. 2c and 8 in “Appendix B”. Furthermore, we also combined the two sets of dissimilarities (i.e., C/P and C/T) and ran regressions for the first component of MDS. The results for cases and deaths are shown in Columns CC and CD respectively. Again, the response depended greatly on the distance to NYC, followed by low-income. For food-and-health factors, only the other fast-food outlets factor was significant.

Based on the results above, the most important factors were population and distance to NYC. Hence, we removed the effect of these two factors from the responses (total numbers and the first components of MDS) and from the food-and-health variables, by replacing each of them with the residuals obtained from fitting it vs the two factors. After that, we conducted the same analysis. Table 2 shows the results. Many socio-economic factors were still significant. In particular, more food-and-health factors emerged as significant.

We applied K-means clustering to the first two MDS components of the combined dissimilarity matrix {\(D_{jk}\)(O)} to cluster the counties (see Fig. 3). The clustering results for cases and deaths were almost identical, except for the county of Hunterdon.
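A minimal K-means (Lloyd's algorithm) sketch on two-dimensional points is given below. The initialization used in the study is not stated, so the deterministic farthest-point start used here is an assumption, as are the function name and the toy points.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Start from the first point, then repeatedly add the point farthest
    # from the centers chosen so far (an assumption, not the paper's choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: d2(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy points (made up): three well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
labels, centers = kmeans(points, k=3)
```

In the setting of the text, `points` would hold the first two MDS coordinates of each county, and the same call would be made once for cases and once for deaths so that the two label vectors can be compared.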

Table 2 Regression results using the dataset where variables \(x_1^\prime ,x_2^\prime ,\ldots ,x_7^\prime \) were the residuals obtained from fitting food-and-health variables \(x_1,x_2,\ldots ,x_7\) versus population (p) and distance to NYC (d)
Fig. 3 K-means clustering on the first two components of the combined dissimilarity matrix for a cumulative cases and b deaths

4 Discussion

One might expect the places where people obtain food, e.g., restaurants or grocery stores, to be the main locations where the virus spreads. Grocery stores should fare better than restaurants since they were less crowded. In our analysis, grocery stores were not significant. However, one interesting finding in this study was that the effects of different restaurant types were different. In most models, we found a contrast between the frequency of McDonald’s and that of other fast-food or non-fast-food restaurants. The frequency of McDonald’s was associated with fewer cases (deaths), while the frequency of other fast-food or non-fast-food restaurants was associated with more cases (deaths). One explanation might be that McDonald’s did a good job on hygiene or crowd control measures.

The effects of most factors were consistent across all the analyses. Pharmacies were consistently associated with fewer cases and deaths, whereas gyms and the percentage of low-income households were associated with more cases and deaths. However, the effects of some factors were not consistent, that is, they were negative in some analyses and positive in others, e.g., distance to NYC in Table 1 and nursing homes in Table 2. It should be noted that the different analyses represented different measurements of cases or deaths, and the effect of the same factor on different measurements could differ. For example, Column “C” in Table 1 represented the total number of cases, while Column “D/P” represented the proportion of deaths in the population. The results showed that a larger distance to NYC was associated with more cases but a lower death proportion.

Figure 3 shows a clear demarcation of three New Jersey regions: North Jersey, South Jersey and Central Jersey. This demarcation was suggested by the Covid-19 cases and deaths data and by the demographic and socioeconomic variables. (This study supports one side of the never-ending debate about the existence of Central Jersey!)

5 Conclusions

In this study, we investigated the effects of several socio-economic factors on the number (and number over time) of confirmed Covid-19 cases and Covid-19-related deaths for the twenty-one counties of New Jersey during the period March 1st to September 9th, 2020.

We found that counties could be clustered into three groups based on (a) the case totals, (b) the total number of deaths, (c) the time course of the cases and (d) the time course of the deaths. The four groupings were very similar to one another and could all be largely explained by the county population, the percentage of low-income population, and the distance of the county from New York, with the highly populated counties close to New York clustering together as the hardest hit counties and the less populated ones far from it clustering together as the least affected.

We also examined the effects of various food and health facilities, where people may gather and potentially cause more cases. We found that an increased number of McDonald’s, in contrast to other fast-food or non-fast-food restaurants, was associated with a reduction in the number of cases and deaths, regardless of whether or not it was adjusted by the total number of cases or by population. In addition, (a) a larger number of pharmacies was associated with fewer cases and fewer deaths and (b) greater access to fresh food (groceries) was associated with fewer deaths. However, the number of fitness centers was associated with an increased number of cases and deaths.

Moreover, our study found that these same socio-economic factors could be used to explain the differences in the evolution of the disease between different geographical locations (counties in New Jersey). Furthermore, these factors still emerged as significant even after regressing out the main factors (population and distance to New York).

Overall, we found that the evolution of Covid-19 in New Jersey has been influenced by certain socio-economic factors, which could be helpful for the formulation of public health policies.