Introduction
Survey methodologists have long acknowledged that the social environment can influence survey and census participation, both at the societal level and at the community level (Groves and Couper 1998; Johnson et al. 2006). In 2020, the COVID-19 pandemic affected daily life throughout the U.S., but to differing degrees depending on neighborhood or community (Orgera, McDermott, and Rae 2020; University of Minnesota 2020). Mandates to quarantine and wear face coverings in public varied across states and among localities within states (e.g., cities vs. more rural places; see Bunis and Rough 2020; Weise 2020). In addition, population density, public transportation usage, and phased reopenings of local economies varied from state to state and locale to locale (COVID-Local 2020; Lee et al. 2020; Moore and Lazar 2020; National Public Radio 2020; U.S. Chamber of Commerce 2020). Such factors are hypothesized to correlate with COVID-19 infection rates (Liu et al. 2020).
In early August 2020, the Census Bureau announced it would conclude nonresponse follow-up (NRFU) operations by the end of September, a reversal of earlier plans to extend NRFU through the end of October to make up for operational postponements caused by the pandemic. As a result, the agency looked for innovative ways to complete the census count on the accelerated schedule. The goal of this article is to quantify whether a community-level environmental variable, operationalized as the COVID-19 infection rate at the county level, was associated with participation in the 2020 U.S. Decennial Census.
Methodology
In mid-March 2020, the Census Bureau mailed materials to households with instructions to complete the census online, by phone, or by paper questionnaire. Soon after, the agency launched a public-facing, tract-level response rate map to track self-response to the 2020 Decennial Census in close to real time (U.S. Census Bureau 2020). Response rate data for this article were extracted as of July 25, 2020. At that point, the cumulative national response rate was 64.6%, with a standard deviation of 12.7 percentage points. Cumulative tract-level response rates were merged with other tract-level data from the Census Bureau's Planning Database (PDB), a data file containing a subset of 2014–2018 5-year American Community Survey (ACS) estimates as well as other census operational variables. The PDB variables included those documented to predict 2010 Census self-response (Erdman and Bates 2017) as well as those correlated with COVID-19 death rates (Knittel and Ozaltun 2020). These included socioeconomic, household, and population density variables (see the Appendix for the full list).
Finally, we included a variable indicating the mail implementation strategy each census tract was flagged to receive: either mailing flights that first encouraged online response without a paper questionnaire (Internet First) or flights that included a paper questionnaire in the first mailing (Internet Choice). Tracts were assigned to Internet Choice when they were expected to have lower Internet usage and thus more likely to benefit from an early paper questionnaire; specifically, tracts with lower self-response rates to the ACS and either low Internet response, a higher share of residents aged 65 or older, or low Internet subscribership. All other tracts were assigned to Internet First. The First/Choice tract-level indicator was available at the public-facing website. The cumulative self-response rates, PDB variables, and First/Choice variable were then merged by tract.
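To make the assignment rule concrete, here is a minimal sketch in R. All column names, thresholds, and data values are hypothetical illustrations; the Census Bureau's actual variables and cutoffs are not given in this article.

```r
# Hedged sketch of the Internet First/Choice assignment rule.
# Thresholds below are made-up placeholders, not the Bureau's cutoffs.
tracts <- data.frame(
  acs_self_response = c(0.35, 0.55, 0.38),
  internet_response = c(0.25, 0.60, 0.45),
  pct_age65plus     = c(0.15, 0.10, 0.25),
  internet_sub      = c(0.40, 0.80, 0.70)
)

low_acs  <- 0.40   # "lower self-response rates to the ACS"
low_resp <- 0.30   # "low Internet response"
high_65  <- 0.20   # "higher share of residents aged 65 or older"
low_sub  <- 0.50   # "low Internet subscribership"

tracts$internet_choice <- with(tracts,
  acs_self_response < low_acs &
    (internet_response < low_resp |
     pct_age65plus     > high_65  |
     internet_sub      < low_sub))
# Tracts not flagged Internet Choice default to Internet First
```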
Next, we merged these data with the latest cumulative COVID-19 infection rate data from the CDC. The most granular COVID-19 data available were at the county level, so we paired tracts with their cumulative county infection rate as of July 25, 2020. Rates were defined as the total number of positive COVID-19 tests since January in a given county divided by that county's total population. Thirty-nine counties contained no positive cases; the mean positive infection rate as of July 25, 2020 was 1.5%; and the maximum was McKinley County, NM, with a positive infection rate of 3.6%. Our analysis was limited to census tracts where all housing units within the tract were designated to receive 2020 Census materials in the mail by mid-March.[1] Each model used ordinary least squares (OLS) to predict the tract-level self-response rate as of July 25, 2020. In addition to the PDB predictors previously listed, a state fixed effect was included (output not shown).
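The sketch below illustrates this merge-and-model step. It is not the authors' production code: the file names, column names, and the handful of stand-in PDB covariates (pct_renter, pct_vacant, pop_density) are assumptions for illustration only.

```r
# Illustrative merge of tract response rates, PDB covariates, and
# county infection rates, followed by OLS with a state fixed effect.
library(dplyr)

tracts <- read.csv("tract_response_2020-07-25.csv")   # GEOID, state, county_fips, resp_rate, internet_choice
pdb    <- read.csv("pdb_2014_2018.csv")               # GEOID plus ACS covariates
covid  <- read.csv("cdc_county_rates_2020-07-25.csv") # county_fips, infection_rate

model_df <- tracts %>%
  inner_join(pdb,   by = "GEOID") %>%
  inner_join(covid, by = "county_fips")

# OLS: tract self-response as a function of county infection rate,
# stand-in PDB covariates, the First/Choice flag, and state dummies
fit <- lm(resp_rate ~ infection_rate + internet_choice + pct_renter +
            pct_vacant + pop_density + factor(state),
          data = model_df)
summary(fit)
```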
Results
Table 1 shows response rates broken down by COVID-19 infection rate quartiles and demonstrates that response rates are lower in counties with high COVID-19 infection rates. The county infection rate column shows the average infection rate among counties in each quartile. For example, the average county in the lowest infection quartile had 0.53% of its population test positive for COVID-19 since the start of the pandemic, and on average, 68.7% of households responded to the census. From this bivariate perspective, response and infection rates appear to be negatively correlated, with the lowest infection rate quartile having the highest response rate (68.7%) and the highest infection rate quartile having the lowest response rate (60.6%).
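A breakdown in the style of Table 1 can be sketched as follows, continuing with the hypothetical model_df from the Methodology sketch. Counties are binned into infection-rate quartiles and averaged within each bin; the published table's exact weighting may differ.

```r
# Table 1-style summary: mean county infection rate and mean tract
# response rate within each county infection-rate quartile.
library(dplyr)

model_df %>%
  group_by(county_fips) %>%
  summarise(infection_rate = first(infection_rate),
            resp_rate      = mean(resp_rate)) %>%
  mutate(quartile = ntile(infection_rate, 4)) %>%
  group_by(quartile) %>%
  summarise(mean_infection_rate = mean(infection_rate),
            mean_response_rate  = mean(resp_rate))
```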
Table 2 presents an OLS regression model predicting cumulative self-response (mail, online, and telephone response combined[2]). Results indicate that county-level COVID-19 infection rates were a significant (and negative) predictor of self-response, even when controlling for a variety of operational, socioeconomic, and demographic covariates known to be associated with census participation and COVID-19 infection rates. The R² indicated that the model accounted for a substantial portion of the variance in response rates (around 80%).
The cumulative response rate model shows a negative relationship between county-level COVID-19 cases and response: for every percentage point increase in a tract's county-level infection rate, the model expects response to fall about 1.3 percentage points (the standard deviation of the county infection rate was 1.1 percentage points). Figure 1 shows this relationship visually. Most tracts were in counties with a cumulative infection rate below 5%, meaning the predicted effect for most counties was under 3 percentage points. However, a sizable number of tracts, in areas such as New York City and parts of Arizona, had a predicted effect of over 5 percentage points.
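As a back-of-the-envelope illustration of this interpretation, the snippet below multiplies the reported slope by a few example infection rates (the rates themselves are arbitrary values within the observed range, not figures from the article).

```r
# Predicted response-rate drop at example county infection rates,
# using the reported slope of about -1.3 percentage points of
# response per percentage point of infection rate.
slope <- -1.3
rates <- c(0.5, 1.5, 3.0)   # example county infection rates (%)
data.frame(rate_pct = rates, predicted_drop_pp = slope * rates)
#   rate_pct predicted_drop_pp
# 1      0.5             -0.65
# 2      1.5             -1.95
# 3      3.0             -3.90
```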
Once we established a significant relationship between infection rates and response, we wanted to understand the importance of COVID-19's predicted effect relative to the other factors in the model likely to influence response. Much of the work on variable "importance" in modeling comes from machine learning classification methods. A general way to gauge a variable's contribution is to fit the model repeatedly with each variable alternately included and excluded, iterating over many different combinations of variable specifications. With each iteration, an accuracy measure is recorded and attributed to the variables included in that iteration. Variables included in models with high accuracy receive high importance scores, and those excluded from high-accuracy models receive low scores. With enough iterations, it is possible to estimate not only the directionality of each variable's effect but also its relative contribution to predictive accuracy.
We used two methods to measure covariate importance in our linear regression. The first was the filter variable importance function from the caret package in R (Kuhn 2008). Instead of using out-of-sample predictions, this method uses fit measures on the full model data frame. Although it cannot test for overfitting, it does provide a fair comparison of the explanatory power of each covariate (Kuhn and Johnson 2013). The method tests many iterations of the model using some or all of the normalized predictors shown in the regression tables. For each iteration, the function records the absolute R² value for each variable included. After many iterations, the R² values are summed to derive an overall score for each predictor, which is then normalized to a 0–1 scale so the predictors may be compared: a value of 1 indicates the largest possible contribution, and 0 the lowest. Table 3 shows the relative contribution of each variable in the cumulative response rate model.
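A minimal sketch of this step, again using the hypothetical model_df and stand-in covariate names, calls caret's filterVarImp(); the 0–1 rescaling shown here is our own approximation of the normalized score the text describes, not a caret option.

```r
# Filter-based variable importance with caret::filterVarImp().
library(caret)

x <- as.data.frame(scale(model_df[, c("infection_rate", "pct_renter",
                                      "pct_vacant", "pop_density")]))
y <- model_df$resp_rate

imp <- filterVarImp(x, y)   # one "Overall" score per predictor

# Rescale scores to the 0-1 range described in the text
imp$scaled <- (imp$Overall - min(imp$Overall)) /
              (max(imp$Overall) - min(imp$Overall))
imp[order(-imp$scaled), , drop = FALSE]
```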
Results show that the COVID-19 infection rate was among the most important variables in the model. By this importance score, the infection rate had predictive accuracy similar to that of the percent of female-led households in a tract or the percent of vacant households. Other variables, such as the tract percent identifying as Asian alone or in combination with another race, ranked low on relative importance.
The second method, the Boruta method for feature selection, also comes from machine learning (Kursa and Rudnicki 2010). Intended to assist with model specification, the method is effective as long as the estimation method is not too computationally intensive (Rudnicki, Wrzesień, and Paja 2015). Unlike the previous method, Boruta creates a threshold for variable inclusion. For example, we know tract percent Asian scored lowest in relative importance, but is that score too low to justify including the variable in the model? Boruta, using shuffled data as a baseline, determines whether a variable contributes enough predictive power to warrant inclusion. Over many iterations, Boruta shuffles the values of one variable at a time and compares the model's predictive performance against the version where no data are shuffled, repeating this across many different specifications. The method suggests inclusion for variables that predict better than 95% of their shuffled versions. Using the cumulative response rate model, the Boruta method suggests inclusion of each variable in the model, including the state fixed effects (results not shown). Based on results from these two additional tests, we conclude that our models are reasonably robust in explaining variation in tract response rates and that infection rates played a significant part.
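A sketch of the Boruta check on the same hypothetical model_df follows. Note that the Boruta package's default importance measure comes from a random forest rather than the OLS model itself, so this illustrates the selection procedure, not the authors' exact configuration.

```r
# Boruta feature-selection check (Kursa and Rudnicki 2010).
library(Boruta)

vars <- model_df[, c("resp_rate", "infection_rate", "pct_renter",
                     "pct_vacant", "pop_density", "state")]
vars$state <- factor(vars$state)

set.seed(2020)
bor <- Boruta(resp_rate ~ ., data = vars, maxRuns = 100)
print(bor)       # Confirmed / Tentative / Rejected for each variable
attStats(bor)    # per-attribute importance statistics and decisions
```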
Discussion
In this article, we explore self-response rates to the 2020 Decennial Census and whether an unprecedented environmental event, the COVID-19 pandemic, influenced participation in the census. The pandemic upended daily life, with quarantines, business and transportation closures, supply chain disruptions, and the like. In addition, in densely populated cities with high infection rates such as New York City, some residents exited the city seeking areas with lower infection rates. With the first mailing flight arriving at households in mid-March, the Census Bureau was concerned the timing of the pandemic would be particularly disruptive, with some residents temporarily displaced. Moreover, reporting on the pandemic overtook many media outlets, potentially overshadowing the carefully constructed communication campaign designed to raise awareness, educate the public, and encourage nationwide participation in the census. These events culminated in an unprecedented 2020 Census environment that was no longer "business as usual."
Our analysis carries several limitations. First, because the COVID-19 infection rates are measured at the county level, we lost some variability in this measure given that our unit of analysis was the census tract. Since the variance of the COVID-19 measure was reduced by aggregation, the variance of the estimate (error term) was, by definition, larger (King, Keohane, and Verba 1994). However, since the model contained more than 50,000 observations spanning more than 3,000 counties, the loss in variation on the covariate should have had a minimal effect on the hypothesis test, per the Law of Large Numbers (Agresti and Finlay 1986). Because our unit of analysis was the census tract, there was additional between-household variability within tracts not reflected in our models. Also, our measures were cumulative rather than a true time series, meaning some tracts had a surge in COVID-19 cases during the most crucial portion of the campaign (March), while others experienced surges during later "pushes" aimed at counting hard-to-survey areas. In addition, community resources and efforts to promote the census varied by state and county, but this exogenous variable is difficult, if not impossible, to operationalize and include in the models (ICF 2012). Finally, with the large number of census tracts under study (over 58,000), statistical power was high, and most predictor variables were statistically significant. Consequently, we applied several techniques to better understand the explanatory power of each predictor variable.
Our results suggest that even after controlling for variables associated with hard-to-survey populations (e.g., percent female-headed households, percent renter households, percent White persons with less than a college education), the higher the county-level rate of COVID-19 infections, the lower the tract-level self-response rates. We offer several hypotheses for this finding. First, in areas with extremely high infection rates, it is likely the topic dominated media reports, drowning out or reducing earned media that might otherwise have helped advertise the 2020 Census. Second, the pandemic changed media consumption, with fewer people performing out-of-home activities such as attending movies or using mass transit, and more consuming in-home media such as Netflix and cable television. In areas with high infection rates, these changes were likely magnified, potentially weakening the impact of a paid advertising campaign designed to increase awareness and, ultimately, response. Third, areas with high infection rates undoubtedly experienced high anxiety and uncertainty as a result of health concerns, job losses, school closures, fear of eviction, and other negative physical and mental health outcomes. As a result, the task of completing the census may simply have become a lower priority.
Nancy Bates
U.S. Census Bureau (retired)
Joseph Zamadics
PSB Insights