The tools, procedures, and designs for modifying strategies during data collection have proven valuable for the Survey of Income and Program Participation (SIPP) despite the continual decline in overall response rates. The SIPP is a complex survey whose potential value becomes more evident with the programmatic and policy questions raised during challenging times. While the survey environment for longitudinal household panel surveys has always been challenging, addressing increasing nonresponse, focusing on data quality measures, and remaining sensitive to respondent and interviewer burden are even more critical when the collection is dramatically interrupted.
The SIPP is a longitudinal government survey that collects information on a variety of topics, allowing for the study of the interaction between tax, transfer, and other government and private policies and multiple social and demographic characteristics. The SIPP is a computer-assisted personal interviewing (CAPI) survey conducted primarily in person. Respondents in the first wave remain in the sample for the next three data collection years regardless of whether they move to another address. First wave nonrespondents are dropped from the sample but are replaced with a new sample, so the total sample size remains at approximately 53,000 households. Further information about the survey, its sample design, content, and uses is available at http://www.census.gov/sipp.
Despite the uncertainty of funding for SIPP in 2018, a 35-day federal furlough during peak hiring and training for SIPP in 2019, and the unprecedented impacts of the coronavirus pandemic in 2020 and 2021, case prioritization efforts have allowed the SIPP program to adapt to unexpected conditions and to make well-informed decisions throughout data collection aimed at obtaining respondents who are representative of the population. This research examines the effect of case prioritization on representation during the last five years.
Case prioritization refers to targeting a subset of cases with pre-identified data collection features that differ from the typical features applied to the nontargeted population. SIPP targets cases that belong to subgroups with historically low response rates to help produce a representative respondent population. Interviewers are instructed to give targeted cases “first attention” and to “make at least one attempt per week.” The instructions are intentionally left vague so that interviewers have some flexibility in how they alter their contact strategy, but in practice they mean more contact attempts and more time spent on targeted cases, which reduces contact attempts and attention on the remaining cases. Increasing response in under-represented populations while decreasing response in over-represented populations reduces the variance in response rates across populations.
Case prioritization is a common way of allocating scarce resources to prioritized cases (Schouten, Peytchev, and Wagner 2017). Prior studies have shown that case prioritization has had varying effects on data quality, as surveys have applied it differently to serve their goals. The National Survey of Family Growth (NSFG), a CAPI in-person survey, conducted multiple prioritization experiments with differing goals in which targeted cases were to receive more calls; the prioritization rarely improved survey outcomes such as response rates (Tourangeau et al. 2017). Statistics Canada capped the number of calls on nontargeted cases with the goal of increasing representativeness in its surveys. While this prioritization did not lead to much increase in representativeness, it did reduce interviewing hours without affecting the response rate, and therefore reduced interviewing hours per response as well (Tourangeau et al. 2017).
The Community Advantage Panel Survey (CAPS), a CAPI in-person survey, offered larger interviewer incentives for completing low response propensity cases with the goal of increasing overall response. This reduced variation in the response propensities but did not reduce nonresponse bias as intended (Peytchev et al. 2010). The High School Longitudinal Study, a multi-mode survey with web and phone as the initial modes, targeted the lowest propensity quartile with an increase in in-person visits, which lowered the average relative bias by 0.4% (Schouten, Peytchev, and Wagner 2017). The National Survey of College Graduates (NSCG), a web and phone-mode government survey, found evidence that case prioritization, in the form of continuous monitoring and intervening protocols, can improve R-indicators (Coffey, Reist, and Miller 2020).
Past research on the SIPP found that prioritizing cases led to smaller attrition bias in unweighted key estimates and helped increase retention of movers through a pilot study first conducted in 2016 followed by a larger interviewer-level experiment in 2017 (Tolliver et al. 2019).
The effect of case prioritization in SIPP is reliant on interviewer compliance, intervention effectiveness, and proper targeting. It is documented that interviewers do not always comply with prioritization instructions (Nagle and Walejko 2018; Walejko and Miller 2015; Walejko and Wagner 2015). With full interviewer compliance, the effect still relies on the effectiveness of the intervention. We assume that the interviewer altering their contact strategy leads to higher response propensities. If, however, the altered contact strategy has no impact on responding, then our prioritization has no effect. Lastly, if we have interviewer compliance and an effective intervention but incorrectly target cases, then we may introduce bias instead of reducing bias.
The primary prioritization goal in SIPP is to obtain respondents representative of the United States population. Through continuous monitoring and intervention, SIPP instructs interviewers to reallocate their efforts to cases that may have the highest impact on data quality. Every case on the active workload has a priority status of high, medium, or low. Using a combination of known and unknown demographic and geographic predictors (see Appendix A2), SIPP identifies which cases are most at risk of being under-represented and designates them as high priority. Cases at risk of being overrepresented and cases that have been worked appropriately with low chances of responding are designated as low priority. All other cases are designated as medium priority.
Interviewers are instructed to give high priority cases first attention each day they work. They are also instructed to work their medium priority cases as they normally would and to not make any attempts on low priority cases until they have made sufficient effort on their high and medium priority SIPP cases.
Typically, at the start of data collection, about 20% of the workload is designated high priority and 80% is designated medium priority. Through the 2021 data collection, there have never been any low priority cases at the start of data collection.
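As a rough illustration of the three-way split described above, the sketch below encodes a priority assignment rule in Python. The thresholds, variable names, and the function itself are invented for illustration; they are not the survey's actual targeting criteria, which combine demographic and geographic predictors.

```python
# Hypothetical sketch of a high/medium/low assignment rule. Cases at risk of
# under-representation (low estimated propensity, not yet exhausted) become
# high priority; over-represented or sufficiently worked cases become low;
# everything else is medium. All cutoffs here are invented.
def assign_priority(est_propensity, attempts_made, max_effort=12,
                    low_cut=0.35, high_cut=0.75):
    if est_propensity > high_cut or attempts_made >= max_effort:
        return "low"     # at risk of over-representation, or worked appropriately
    if est_propensity < low_cut:
        return "high"    # at risk of under-representation: give first attention
    return "medium"      # worked as a typical case

print(assign_priority(0.20, 3))   # low-propensity, fresh case
```

In this toy rule the exhaustion check runs first, so even an under-represented case that has already received sufficient effort is demoted to low priority, mirroring the paper's description of low priority cases.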
The priority status of a case may change as often as every two weeks, though it usually changes no more than once during the entire data collection. SIPP continually monitors paradata, interviewing costs, and the distribution of respondents when considering priority changes. For simplicity, we categorize these prioritizations as early-stage (e), mid-stage (m), and late-stage (l) prioritizations.
Interviewers are required to check the case management systems regularly to see the most up-to-date priorities. When a case becomes high priority, interviewers are instructed to make a contact attempt within one week. Late in data collection, some low priority cases are stopped to ensure interviewers focus on their higher priority cases. Once a case is stopped, no additional attempts are allowed, and stopped cases are assumed to be eligible nonrespondents.
We assess whether interviewers comply with instructions by observing the percentage of cases attempted within the first seven days of the initial prioritization and the differences in weekly attempts per case among high, medium, and low priority cases.
Representativeness indicators (“R-indicators”) measure representation as one minus twice the standard deviation of the response propensities estimated from demographic and geographic characteristics (Schouten, Cobben, and Bethlehem 2009). Smaller variation indicates that the demographic and geographic characteristics contribute more equally to nonresponse; thus, we assume that samples with larger R-indicators are likely to have smaller nonresponse biases.
A secondary goal of case prioritization in the SIPP is to avoid sacrificing overall response while pursuing the most representative respondent population possible. The coefficient of variation (CV; De Heij, Schouten, and Shlomo 2015) of the response propensities measures representation in relation to response. In contrast to R-indicators, smaller CVs are associated with smaller nonresponse biases. Because this metric has the mean response propensity as a divisor, it is assumed that samples with smaller CVs are more likely to have smaller mean squared errors.
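The two metrics above can be sketched directly from a vector of estimated response propensities. The following is a minimal illustration using the Python standard library and made-up propensities; in practice the propensities would come from a model fit on demographic and geographic predictors.

```python
# R-indicator and CV of a set of response propensities, per the definitions
# in Schouten, Cobben, and Bethlehem (2009) and De Heij et al. (2015).
import statistics

def r_indicator(propensities):
    """R-indicator: one minus twice the standard deviation of propensities."""
    return 1 - 2 * statistics.pstdev(propensities)

def coefficient_of_variation(propensities):
    """CV: standard deviation divided by the mean response propensity."""
    return statistics.pstdev(propensities) / statistics.fmean(propensities)

# Hypothetical propensities for a small sample
rho = [0.55, 0.60, 0.52, 0.70, 0.48, 0.65]
print(round(r_indicator(rho), 3))              # -> 0.849
print(round(coefficient_of_variation(rho), 3)) # -> 0.129
```

Note the inverse relationship: as the propensities spread out, the R-indicator falls toward 0 while the CV grows, which is why the survey aims for a larger R-indicator and a smaller CV simultaneously.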
We assess if the intervention is effective by comparing R-indicators and CVs of the treated sample, where there was a mix of high, medium, and low priority cases, to the nontreated sample where all cases were displayed as medium priority. The R-indicator and CV are computed separately for the Wave 1 and Wave 2+ samples since the Wave 2+ sample has the benefit of having prior wave information as predictors of the R-indicator and CV.
While both the 2017 and 2018 data collections were experimental years, where we can simply compare the treatment to the control, the 2019–2021 data collections are considered nonexperimental years because there has been no formal control group. This means our comparisons of the treated sample to the nontreated sample are observational. This paper estimates what the R-indicator and CV might have been without prioritization by matching high and low priority cases to similar medium priority cases (cases not made high or low for evaluation purposes), then adjusting the propensities by an estimated effect. The steps for the 2019–2021 analyses are:
1. Compute the observed R-indicator, mean response propensity, and CV as
\[ R(\hat{\rho}) = 1 - 2S(\hat{\rho}), \qquad \bar{\rho} = \frac{1}{n}\sum_{i=1}^{n}\hat{\rho}_i, \qquad CV(\hat{\rho}) = \frac{S(\hat{\rho})}{\bar{\rho}}, \]
where \( S(\hat{\rho}) \) is the standard deviation of the response propensities.
2. Match any case that was high priority, low priority, or stopped to a similar medium priority case using greedy nearest-neighbor propensity matching on demographic, geographic, calendar year, stage, and interviewer caseload information (SAS Documentation; Stuart 2010). “Similar” in the context of this paper refers to medium priority cases that were matched to the high or low priority cases.
3. Estimate the response propensity effect \( \hat{\beta}_{p,t} \) by
\[ \hat{\beta}_{p,t} = \left(\frac{I}{I+NI}\right)_{\text{Treated data}} - \left(\frac{I}{I+NI}\right)_{\text{Matched data}}, \]
where \( I \) is the number of interviews, \( NI \) is the number of eligible noninterviews, \( p \) is the priority (H, M, L), and \( t \) is the time period when the case was made priority \( p \) (\( e \) = early-stage [first 4 weeks], \( m \) = mid-stage [5th week to 7th-to-last week], \( l \) = late-stage [last 6 weeks]).
4. Re-estimate the propensity scores as
\[ \tilde{\rho}_i = \hat{\rho}_i - \hat{\beta}_{p,t} \]
for each case \( i \) that held priority \( p \) at stage \( t \), leaving the propensities of the remaining medium priority cases unchanged.
5. Compute the expected R-indicator, mean response propensity, and CV if no prioritization had been used as
\[ R(\tilde{\rho}) = 1 - 2S(\tilde{\rho}), \qquad \bar{\tilde{\rho}} = \frac{1}{n}\sum_{i=1}^{n}\tilde{\rho}_i, \qquad CV(\tilde{\rho}) = \frac{S(\tilde{\rho})}{\bar{\tilde{\rho}}}. \]
6. Estimate the R-indicator, CV, and response rate effects, respectively, by
\[ \Delta_R = R(\hat{\rho}) - R(\tilde{\rho}), \qquad \Delta_{CV} = CV(\hat{\rho}) - CV(\tilde{\rho}), \qquad \Delta_{\bar{\rho}} = \bar{\rho} - \bar{\tilde{\rho}}. \]
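The matching and adjustment steps above can be sketched end to end on synthetic data. The greedy matching routine, the case records, and every number below are invented for illustration; the paper's actual matching was run in SAS on demographic, geographic, year, stage, and caseload covariates.

```python
# Toy counterfactual adjustment: match treated (high priority) cases to
# medium priority cases, estimate the response rate effect, subtract it
# from treated propensities, and compare observed vs. expected metrics.
import statistics

def greedy_match(treated, pool):
    """Greedily pair each treated case with the nearest unused pool case."""
    available = list(pool)
    pairs = []
    for t in treated:
        nearest = min(available, key=lambda c: abs(c["score"] - t["score"]))
        available.remove(nearest)
        pairs.append((t, nearest))
    return pairs

def response_rate(cases):
    """I / (I + NI): interviews over interviews plus eligible noninterviews."""
    return sum(c["responded"] for c in cases) / len(cases)

# Synthetic high priority (treated) and medium priority (pool) cases
high = [{"score": 0.30, "responded": 1}, {"score": 0.35, "responded": 0},
        {"score": 0.40, "responded": 1}, {"score": 0.32, "responded": 0}]
medium = [{"score": 0.31, "responded": 0}, {"score": 0.36, "responded": 1},
          {"score": 0.41, "responded": 0}, {"score": 0.50, "responded": 1},
          {"score": 0.33, "responded": 0}]

pairs = greedy_match(high, medium)
matched = [m for _, m in pairs]

# Estimated effect of being made high priority (treated minus matched rates)
beta = response_rate(high) - response_rate(matched)

# Remove the effect from treated propensities, then recompute the expected
# (no-prioritization) R-indicator and CV alongside the observed ones
adjusted = [c["score"] - beta for c in high] + [c["score"] for c in medium]
observed = [c["score"] for c in high + medium]
for label, rho in (("observed", observed), ("expected", adjusted)):
    s, mean = statistics.pstdev(rho), statistics.fmean(rho)
    print(label, "R =", round(1 - 2 * s, 3), "CV =", round(s / mean, 3))
```

In this toy example the treated cases respond at 0.50 versus 0.25 for their matches, so the estimated effect is 0.25 and the "expected" propensities are shifted down by that amount before the metrics are recomputed.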
The prioritization led to increased effort on higher priority cases and decreased effort on lower priority cases, though neither shift was as large as intended. Overall, there were more attempts on high priority cases. The difference in attempts between high priority and all other cases was largest at the start of data collection and narrowed as data collection progressed. Even though the instructions state “make at least one attempt every week,” fewer than half of the high priority cases received an attempt within the first seven days of becoming high priority. Low priority cases did not receive fewer attempts overall, but they did receive fewer attempts than similarly matched medium priority cases. Stopping low priority cases ultimately led to more attempts on the remaining cases: for every 1,000 cases stopped, the remaining workload received about 0.05 more weekly attempts per case. Similarly, fewer than half of the cases made high priority at the start of data collection were attempted within the first seven days, but many more were started in the first week than their medium priority counterparts (H = 37% vs. M = 27%). The median first attempt for cases prioritized at the start of data collection was day 22, compared with day 30 for all other cases.
Table 1 illustrates how priority status affected the number of weekly contact attempts per case during different periods of data collection. This metric is calculated as the total attempts divided by the total number of longitudinal case observations in each priority category during that period.
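The Table 1 metric is straightforward to compute from case-week paradata. The sketch below uses invented observations, one row per case per week, to show the division being described.

```python
# Weekly attempts per case by priority category: total attempts divided by
# total longitudinal case observations. Observations are made up.
from collections import defaultdict

# (priority, attempts_that_week) -- one record per case per week
observations = [
    ("high", 2), ("high", 1), ("high", 3), ("high", 0),
    ("medium", 1), ("medium", 0), ("medium", 2),
    ("low", 0), ("low", 1),
]

attempts = defaultdict(int)    # total attempts per priority
case_weeks = defaultdict(int)  # total case-week observations per priority
for priority, n in observations:
    attempts[priority] += n
    case_weeks[priority] += 1

for priority in ("high", "medium", "low"):
    rate = attempts[priority] / case_weeks[priority]
    print(priority, round(rate, 2))
```

With these invented records, high priority cases average 1.5 attempts per case-week, medium 1.0, and low 0.5, mirroring the ordering the paper reports.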
Interviewers with prior SIPP experience and/or with larger SIPP workloads were 10% to 20% more likely to work their high priority cases within the first seven days of becoming high priority. Table 2 summarizes how the differing effort affected the response propensities.
The observed Wave 1 R-indicator was on average 2.1% larger than the expected Wave 1 R-indicator if no prioritization was used, while having a CV that was on average 4.3 percentage points smaller than the expected Wave 1 CV. The observed Wave 2+ R-indicator was on average 5.8% larger than the expected Wave 2+ R-indicator, while having a CV that was on average 4.3 percentage points smaller than the expected Wave 2+ CV. The observed mean response propensity and estimated mean response propensity were nearly the same. Though these results may not be statistically significant, they are generally favorable. Table 3 gives a high-level summary of case prioritization spanning multiple years.
As noted in the introduction, each year the survey has faced a different set of challenges, making the estimation step in propensity matching imperfect. Funding uncertainty, government shutdowns, and a global pandemic have ensured that no two years are exactly alike (see Appendix A1 for details). Every case can be matched across years, but there is variability from one year to the next, and while we believe the data have benefited from case prioritization, our evaluation method is a limitation. All data are subject to errors arising from a variety of sources.
Despite this limitation, we believe that case prioritization benefits overall representation. Furthermore, by design, our case prioritization method has aided the survey’s nonresponse adjustments because we target cases using known covariates that are closely associated with those adjustments. Another way to evaluate the prioritization’s effectiveness might be to estimate its effect on the nonresponse adjustments.
Though response rates have continued to suffer in SIPP in the wake of numerous challenges, we believe that prioritization has helped mitigate some data quality issues. The prioritization effects are not large but are consistently positive. There has been a consistent increase in the R-indicator, and outside of the 2019 Wave 1 sample, where it was decided to stop most cases to focus on the 2018 Wave 2 sample, there has been a consistent decrease in the CV. The prioritization improved data quality indicators for the Wave 1 data (R = +2.1%, CV = −4.3 percentage points) and for the Wave 2+ data (R = +5.8%, CV = −4.3 percentage points). In the most recent years, where we have leveraged information from prior years, we believe that the prioritization helped the overall response rate (+3.6 for Wave 1, +1.2 for Wave 2+).
This paper provides evidence that dynamic case prioritizations can be a tool to improve data quality. While there is evidence of positive effects, the results leave room for improvement. As SIPP plans to continue use of case prioritization, the survey program will test new strategies to increase interviewer compliance, explore administrative data to improve estimation of response propensities, and consider the tradeoffs between the increase in representation and costs.
Any opinions and conclusions expressed herein are those of the author(s) and do not reflect the views of the U.S. Census Bureau.
We extend a thank you to Bennett Adelman, Carolyn Pickering, and Rachel Horwitz for the review of this paper.