Introduction
Survey researchers may seek to reduce costs by choosing not to offer an incentive. However, there is evidence that surveys offering guaranteed incentives tend to achieve higher response rates than surveys offering no incentive or a lottery/prize drawing (see, for example, Halpern et al. 2011; Dykema et al. 2024). Further, studies have shown that the provision of an incentive may increase response rates from hard-to-reach groups with little or no impact on data quality.
For example, a mail-administered consumer preference survey focused on ethanol fuel found that providing an incentive tended to increase the portion of respondents who were less educated and who were less familiar with the survey topic (Petrolia and Bhattacharjee 2009). Another study found that using an incentive was an effective and cost-efficient strategy to increase response rates for hard-to-reach caregivers in a phone survey (Fomby, Sastry, and McGonagle 2017).
Continued research is needed on the impact of incentives on data quality and sample composition (Singer and Ye 2013). Specifically, researchers have noted that the available evidence on the impacts of monetary incentives in web surveys is more limited compared to other administration modes (Abdelazeem et al. 2023).
In this study we consider whether an incentive impacts data quality or sample composition through the investigation of non-observation and observation effects. To measure non-observation effects, we compared the reported demographic makeup of an incentive group and a no incentive group. For observation effects, we investigated whether the presence of an incentive led to satisficing or affected data quality or response patterns. To operationalize satisficing, we used multiple indicator variables: for each group, we reviewed the amount of time spent on the survey, the length of open-ended responses, and the tendency to straightline responses to grid or matrix-style questions. Response patterns were examined by comparing the groups’ rates of non-substantive item responses, their satisfaction ratings, and how they answered attitudinal and behavioral questions.
Methods
In 2022 and 2023, utility customers in a Mid-Atlantic state in the United States were invited to a web survey via email. The purpose of the survey was to help inform an evaluation of a behavior modification or Home Energy Reports (HER) program.
The survey invitations offered either no incentive or a $5 incentive. Initially, the survey did not offer an incentive; a $5 incentive was added to improve the response rate and meet survey quota requirements. Incentive group respondents were sent $5 electronic gift cards through an online fulfillment company 5-10 days after survey completion. Respondents were informed that the gift card would be sent to them via email 5-10 days following survey submission and could be redeemed at one of more than 80 retailers.
We selected groups, or “lots,” of invitees for the survey from the HER program tracking file. Tracking data contained customer names, email addresses, and account information, as well as whether customers had received reports. The selection of lots was not random, but no underlying characteristics were considered in selection; that is, the lots were selected in a non-systematic fashion. For example, customers in rows 2 through 500 were selected to receive the survey invitation with no offer of an incentive. After initial lots did not result in a sufficient number of completes, later lots were offered incentives (for example, those in rows 9,530 through 12,500 were selected to be offered an incentive). Tracking data did not list customers in any discernible order (e.g., date of report enrollment, geographic area, alphabetically), and lots were pulled from tracking data in an ad hoc fashion. Incentive assignment was unrelated to HERs treatment group or date of HERs enrollment.
Program data did not include respondent characteristics, so it was not possible to determine or use demographic or background characteristics when assigning the incentive and no incentive groups. Therefore, the assignment process itself should not have introduced systematic differences that could affect the outcomes of this study. About the same portion of control and treatment group customers were offered an incentive.
All survey questions required a response before respondents could move forward. However, each of the demographic and background questions offered either “Prefer not to answer” or “Don’t know” as an option.
For nominal-level variables, the test statistics presented are from chi-square tests for categorical variables and from two-sample z-tests for the proportions of “Don’t know” and “Prefer not to answer” responses. In cases in which categorical variables had expected cell counts less than 5 and tests were significant, Fisher’s Exact Test was used (statistics for these tests are denoted with two hyphens). For ordinal-level variables, the test statistic is the z score for the Mann-Whitney U test. We recognize both Type I and Type II error concerns as legitimate; thus, we present individual p-values as well as 95% confidence intervals for differences between groups.
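For illustration, a minimal sketch of these tests in Python (using scipy and statsmodels) is shown below. The data frame, column names, and values are synthetic placeholders rather than the actual survey data.

```python
# Illustrative sketch only; the data below are synthetic placeholders, not survey results.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact, mannwhitneyu
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "incentive_group": rng.choice(["incentive", "no incentive"], size=500),
    "race_ethnicity": rng.choice(["White", "Asian", "Black", "Other"], size=500),
    "agreement_rating": rng.integers(0, 11, size=500),  # ordinal 0-10 scale
    "home_size": rng.choice(["1,000-1,999 sq ft", "Don't know"], size=500),
})
inc = df[df["incentive_group"] == "incentive"]
no_inc = df[df["incentive_group"] == "no incentive"]

# Nominal variable: chi-square test on the incentive-by-category table.
table = pd.crosstab(df["incentive_group"], df["race_ethnicity"])
chi2, p_chi2, dof, expected = chi2_contingency(table)

# If any expected cell count is below 5 (and the test is significant),
# fall back to Fisher's Exact Test on a 2x2 collapse of the table.
if (expected < 5).any():
    odds_ratio, p_fisher = fisher_exact(table.iloc[:, :2])

# Proportion choosing "Don't know": two-sample z-test for proportions.
counts = [(inc["home_size"] == "Don't know").sum(), (no_inc["home_size"] == "Don't know").sum()]
nobs = [len(inc), len(no_inc)]
z_stat, p_z = proportions_ztest(counts, nobs)

# Ordinal variable (0-10 rating): Mann-Whitney U test.
u_stat, p_u = mannwhitneyu(inc["agreement_rating"], no_inc["agreement_rating"], alternative="two-sided")
```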
Results
Table 1 shows the response rates for the survey. The offer of an incentive was associated with a 2.1 percentage point higher response rate (95% CI [1.77, 2.39]).[1]
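As a sketch of how a confidence interval for a difference in response rates can be computed, the snippet below applies a standard Wald interval for two independent proportions; the completion and invitation counts are hypothetical placeholders, not the counts from Table 1.

```python
# Wald 95% CI for a difference in response rates; counts are hypothetical placeholders.
import numpy as np
from scipy.stats import norm

completes_inc, invited_inc = 400, 9_000   # placeholder values
completes_no, invited_no = 220, 8_000     # placeholder values

p1, p2 = completes_inc / invited_inc, completes_no / invited_no
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / invited_inc + p2 * (1 - p2) / invited_no)
z = norm.ppf(0.975)
print(f"difference = {diff:.2%}, 95% CI = ({diff - z * se:.2%}, {diff + z * se:.2%})")
```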
Non-observation effects
No underlying customer characteristics were considered (or available to be considered) during incentive group assignment; thus, the groups should not have differed on any demographic or household characteristics.
The incentive group had a larger portion of respondents identify as Asian (9%, compared to 2% of the no incentive group). The ages of the groups also differed: the no incentive group skewed older, with a larger portion identifying as being over 55 years old compared to the incentive group. The incentive group tended to have larger household sizes: 42% indicated their household had three or more members, compared to 28% of the no incentive group. To sum up, the incentive group tended to be younger, have larger households, and be more likely to identify as Asian.
Because underlying or benchmark demographic and background information was not available, we cannot state whether survey respondents were demographically representative of the population. Although lots were not selected using demographic or background characteristics (this information was unavailable to the survey administration team), we must caveat these findings: observed differences between the groups may reflect population differences that we have no way of detecting. Complete tables with survey respondents’ reported demographic and household information are provided in the Appendix.
Observation Effects
We investigated whether the presence of an incentive affected data quality or response patterns.
Satisficing
To gauge whether the presence of an incentive was related to satisficing, we reviewed the amount of time spent on the survey for the groups, length of open-ended responses, and the tendency of each group to straightline their responses to grid or matrix-style questions.
We did not find evidence that providing an incentive led to satisficing or to less nuanced or thoughtful responses. The time to complete the survey, the length of open-ended responses, and straightlining rates were similar for the two groups. After removing outliers, the average survey completion time was 9:56 for the incentive group and 10:20 for the no incentive group. This difference was not significant at the p<0.05 level.
Customers who received Home Energy Reports and indicated they were dissatisfied with the reports or felt they were inaccurate were asked open-ended questions to get feedback on how to improve them. Although the groups had differing response lengths, these differences were not statistically significant, and the trends for the two questions went in opposite directions: the incentive sample gave shorter responses to the first question but longer ones to the second. We note that the sample sizes for these open-ended questions were smaller than those for the other analyses in this paper.
The survey contained five grid-style Likert questions, which we used to investigate straightlining. Each respondent was asked up to four of these grid-style questions and was asked to provide a scale rating for several sub-components of each. We assigned each respondent a straightlining score from 0% to 100% (count of straightlined grid questions divided by total grid questions asked), with each grid-style question considered independently. For example, respondents were asked to rate their level of agreement on a scale from 0 (strongly disagree) to 10 (strongly agree) for a grid with nine statements. If a respondent selected the same response for all nine statements (e.g., 9 for every statement, or 0 for every statement), they were considered to have provided a straightline response for that grid.
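A minimal sketch of this scoring is shown below, assuming each grid question is stored as a DataFrame of sub-item ratings indexed by respondent, with empty rows for respondents who were not asked that grid; the structure and names are illustrative, not the actual analysis code.

```python
# Straightlining score sketch: share of grid questions answered with one repeated
# value, out of the grid questions each respondent was asked (0% to 100%).
import pandas as pd

def straightline_scores(grids: dict[str, pd.DataFrame]) -> pd.Series:
    """Return each respondent's straightlining score as a percentage (0-100)."""
    flags = []
    for name, grid in grids.items():
        answered = grid.dropna(how="all")                   # respondents asked this grid
        is_straightlined = answered.nunique(axis=1).eq(1)   # same rating on every sub-item
        flags.append(is_straightlined.astype(float).rename(name))
    flag_df = pd.concat(flags, axis=1)                      # respondents x grid questions
    return flag_df.mean(axis=1) * 100                       # grids not asked drop out of the denominator
```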
The average straightline score for the incentive group was 15.1%, and the no incentive group had an average score of 17.9% (difference of -2.8 percentage points, 95% CI [-9.2, 3.6]). More intuitively, of all matrix or grid questions asked of the incentive group, 18.9% received straightlined responses, compared to 22.3% for the no incentive group (difference of -3.4 percentage points, 95% CI [-10.3, 3.5]). The difference between groups was not significant, and the result ran counter to the hypothesis that the incentive group would be more likely to straightline grid-style questions.
Item Non-Substantive Response
The incentive and no incentive groups had similar rates of item nonresponse for demographic questions (choosing “Prefer not to answer” or “Don’t know”). Respondents who received an incentive provided non-substantive responses for an average of 4.1% of demographic questions, compared to 3.6% for the no incentive group (difference of 0.5 percentage points, 95% CI [-2.7, 3.7]).
Satisfaction
Survey respondents who received reports were asked to rate their satisfaction with the reports, the reports’ accuracy, and the perceived value of their components. Eleven questions were asked in the first year and seven in the second year; four of the value ratings were removed in the second year because their subject overlapped with the satisfaction ratings. Although there were differences between the groups, none was significant, and the groups generally provided similar satisfaction ratings. For example, 70.1% of the no incentive group said the reports were accurate, compared to 65.8% of the incentive group. About 78% of both groups indicated they were satisfied with the reports overall.
Leverage-salience theory
Leverage-salience theory suggests that people respond at higher rates to surveys when the topic is of more interest (or more salient) to them (Groves, Singer, and Corning 2000). Without any other lever, the main motivation for taking a survey is typically the salience of the topic to a potential respondent.
Though we did not find evidence of satisficing or any significant differences in data quality, there was some evidence that the two groups had differing levels of interest in the survey topic. A larger portion of the no incentive group indicated that they read all the reports (see Table 3).
Each year, respondents were asked attitudinal questions, rating their level of agreement with up to nine statements. The portion of each group that agreed with each statement was similar. However, a higher portion of the no incentive group agreed that scarce energy supplies will be a major problem in the future and that it is possible to save energy without sacrificing comfort by being energy efficient.
The portions of the incentive and no incentive groups that reported purchasing or installing energy-efficient equipment in the past 12 months or taking energy-efficient actions were similar (see Table 5). This finding is consistent with the pattern in the table above: the two groups differed on the two statements about attitudes or beliefs, while they were similar on the items more closely tied to their actual behaviors or intentions:
- There is very little I can do to reduce …
- I know of steps I could take…
- I am interested in reducing…
- I intend to reduce…
These findings, in addition to the incentive group being more prone to marking “Don’t know” for home size and age questions (see Table A1), suggest that the no incentive group had more interest in and awareness of the survey topic. No incentive respondents tended to know more about their home and to be more interested in their energy efficiency than the incentive group (though the groups reported similar behaviors and intentions).
Controlling for multiple comparisons
It is important to recognize that making multiple comparisons elevates the risk of erroneously rejecting the null hypothesis: increasing the number of comparisons or statistical tests between groups raises the chance of false positives. For example, if we conducted 20 statistical tests and the null hypothesis were true for all of them, we would expect to reject the null hypothesis at the p<0.05 level for about one test on average, just due to chance. To account for this increased risk of a false positive (Type I error), some researchers call for corrections, such as controlling the false discovery rate, which effectively applies a stricter threshold for significance (McDonald 2014). Others caution that these adjustments could cause researchers to miss possibly important findings by increasing the number of false negatives (Rothman 1990).
Once we controlled for multiple comparisons using the Benjamini-Hochberg procedure to limit the false discovery rate, the evidence that the two groups differed diminished. Only two differences remained significant under the adjusted significance threshold of p<0.002: the groups’ reported ages and their levels of agreement with the statement about scarce energy supplies.
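A minimal sketch of this adjustment is shown below, assuming pvals collects the raw p-values from the comparisons reported in this paper; the values shown are placeholders.

```python
# Benjamini-Hochberg false discovery rate adjustment; p-values are placeholders.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.0015, 0.02, 0.04, 0.10, 0.30, 0.55, 0.80]  # placeholder raw p-values
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
# `reject` flags comparisons that remain significant after controlling the FDR at 5%;
# `p_adjusted` holds the BH-adjusted p-values.
```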
Discussion
Offering a small incentive increased the response rate. We did not find evidence that response quality differed between the groups, but there was some evidence that the offer of an incentive may affect which customers choose to take a survey and their response patterns. The groups provided similar satisfaction ratings, but the incentive group was younger, more likely to identify as non-white, more likely to rent, and more prone to marking “Don’t know” for home size and age questions (see Table A1). Further, responses to the level-of-agreement questions provided evidence that the no incentive group tended to be more engaged with or interested in the topic (though the groups reported similar intentions and actual behavior).
The incentive and no incentive groups were mutually exclusive, and there were no known differences between them. They received similar recruitment messages and completed the same survey instrument, and they were not known to have systematic differences or to have been chosen in a manner that could have affected results. The only potential issue is that the no incentive group was invited to take the survey before the incentive group. However, there is no compelling reason this would affect sample composition, response patterns, or data quality, as all invitees had ample time to complete the survey once invited and the two groups were distinct.
Recent research has questioned the validity of standard weighting models’ assumption that whether people respond and how they respond are uncorrelated, and it suggests that survey researchers in various fields should investigate and account for this correlation more actively (Bailey 2023). This paper provides evidence that relying on individuals willing to respond without an incentive may create non-ignorable nonresponse: we found several differences between the no incentive and incentive groups. Modeling selection and adjusting results so that the incentive group is weighted more heavily may produce more accurate estimates.
Additional research is needed to validate this paper’s findings. Other potential areas to explore include different geographies or target populations, survey topics, and incentive levels (or forms of incentives). This study could also be strengthened with larger sample sizes. The sample sizes of 387 and 203 provided reasonable power to detect moderate differences but less power to detect very small differences. Leaving the Type I error rate at the conventional 5%, these sample sizes deliver at least 76% power to detect a 10-percentage-point difference and at least 31% power to detect a 5-percentage-point difference.
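As an illustration of how such power figures can be computed, the sketch below uses statsmodels with the reported group sizes; the baseline proportion is an assumption, and the resulting power varies with that choice and with the calculation method.

```python
# Post-hoc power sketch for a two-sample comparison of proportions with
# group sizes 387 and 203; the baseline proportion is an assumed input.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_base = 0.70  # assumed baseline proportion; power varies with this choice
analysis = NormalIndPower()
for delta in (0.10, 0.05):  # 10- and 5-percentage-point differences
    effect = proportion_effectsize(p_base + delta, p_base)  # Cohen's h
    power = analysis.solve_power(effect_size=effect, nobs1=387,
                                 ratio=203 / 387, alpha=0.05, power=None)
    print(f"{delta:.0%} difference: power = {power:.2f}")
```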