Following the First Law of Studying Societal Change, the General Social Survey (GSS) strives for consistent measurement over time by employing constant measures. However, in certain cases measures have been changed for various reasons. When such alterations occur, the GSS has introduced the revised version in a controlled manner, typically using some combination of across-subjects experiments and within-subjects repetition. This procedure is important so that variation due to measurement effects is not confounded with true change. This report considers a possible change in the GSS measure of self-rated health.
Since 1972, the GSS has included a self-rated health measure (HEALTH – Would you say your own health, in general, is excellent, good, fair, or poor?). This simple item is widely used in health studies and is a notable predictor of mortality and other health outcomes, even controlling for other variables such as specific health history and medical evaluations. The GSS wording came from Gallup surveys in 1941 and 1950. In the 1970s about half of major US national studies measuring self-rated health employed a 4-category response scale and half used a 5-category version (Danchik and Drury 1986). When the National Health Interview Survey (NHIS) was redesigned in 1982, it switched from a 4-category version to a 5-category format. Consistent with that decision, virtually all US governmental health surveys now use 5-category versions (e.g. the National Health Examination Survey, the Health and Retirement Study, the Study of Assets and Health Dynamics, the Behavioral Risk Factors Surveillance Study), as do most other health scales (e.g. SF-36). Besides the GSS, relatively few studies continue to employ a 4-category version.
As Kovar and Poe (1985) note, the NHIS study switched to five categories in order “to improve the ability to differentiate among people” and others have preferred it for similar reasons. The unarticulated expectation was that the finer measurement would more accurately measure health status and produce stronger associations with health variables and demographics.
On the GSS and other studies, a variety of comparisons exist between the different response scales used for the subjective health measures. These include non-experimental comparisons and experiments using both inter-subject and intra-subject designs. Table 1 examines the impact of the 4- and 5-category response scales on marginals. Table 1A looks at non-experimental comparisons in which different surveys of similar populations were conducted at approximately the same time, and Table 1B covers intra- and inter-subject experiments. In the intra-subject design, people were asked both versions of the self-rated health question in different parts of the survey. In the inter-subject design, different random samples were given the 4- or 5-category version.
Table 1. Comparisons of the Distributions of Self-Rated Health Using 4 or 5 Response Options
B. Experimental (Intra- and Inter-Subject Designs)
Adding the fifth “very good” category takes responses from the more positive “excellent” option and the less positive “good” option and reduces both. The declines in “excellent” range from 4.9 to 16.8 percentage points, and the declines in “good” range from 15.4 to 21.2 points. The studies differ considerably on whether most of the “very good” responses appear to come from “excellent” or from “good”. The decline in “excellent” apparently contributes as little as 19.5% of the “very good” responses (Table 1A-2) and as much as 61.5% (Table 1A-3). The differences are notable even within the experimental studies. There is little impact on the distribution of “fair” and “poor” responses across response scales. An intra-subject design among employed adults on the 2002 GSS confirms the very limited impact on these two more negative responses. The impact of the changes in response scales on distributions is large but variable, making any simple comparison across the response scales difficult.
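The arithmetic behind these shares can be sketched in a few lines of Python. This is an illustration only: the marginals below are hypothetical and are not taken from Table 1, the function name is invented for the example, and it assumes that the share attributed to a category is its percentage-point decline divided by the “very good” percentage.

```python
def very_good_sources(four_cat, five_cat):
    """Apportion the "very good" share between "excellent" and "good".

    Both arguments map a response category to its percent of respondents.
    Assumption: a category's contribution is its percentage-point decline
    divided by the "very good" percentage on the 5-category scale.
    """
    drop_excellent = four_cat["excellent"] - five_cat["excellent"]
    drop_good = four_cat["good"] - five_cat["good"]
    very_good = five_cat["very good"]
    return {
        "drop_excellent": drop_excellent,
        "drop_good": drop_good,
        "pct_from_excellent": 100.0 * drop_excellent / very_good,
        "pct_from_good": 100.0 * drop_good / very_good,
    }


# Hypothetical marginals (percent), NOT the values reported in Table 1.
four = {"excellent": 32.0, "good": 45.0, "fair": 17.0, "poor": 6.0}
five = {"excellent": 24.0, "very good": 28.0, "good": 25.0,
        "fair": 17.0, "poor": 6.0}

result = very_good_sources(four, five)
print(result)
```

When “fair” and “poor” are unchanged across scales, as in this toy example, the two declines sum to the “very good” percentage and the two shares sum to 100%.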
Next, the correlates of health are examined (Table 2). The comparison examines whether the two items reveal the same structural relationships and tests the hypothesis that the finer scale yields stronger correlations. Overall there is no meaningful difference in the strength or statistical significance of associations. The average absolute correlations were 0.130 for the former and 0.132 for the latter.
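The summary statistic used here, an average of absolute Pearson correlations across a set of covariates, can be sketched as follows. The function names, the covariates, and the data are all hypothetical and chosen only to make the example run; they are not the GSS variables or results.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_r(health, covariates):
    """Average of |r| between a health item and each covariate."""
    rs = [abs(pearson_r(health, values)) for values in covariates.values()]
    return sum(rs) / len(rs)


# Hypothetical toy data: health coded 1 = best, covariates invented.
health4 = [1, 2, 2, 3, 4, 1, 3, 2]
health5 = [1, 3, 2, 4, 5, 2, 4, 3]
covariates = {
    "age": [25, 40, 35, 60, 72, 30, 55, 48],
    "years_of_schooling": [16, 12, 14, 10, 8, 16, 12, 12],
}

print(mean_abs_r(health4, covariates), mean_abs_r(health5, covariates))
```

Taking absolute values before averaging keeps positive and negative associations from cancelling, so the mean reflects strength of association rather than direction.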
Table 2. Correlates of 4-Category and 5-Category Health Self-Ratings (Pearson's r/probability)
The lack of any meaningful and consistent difference in correlations is not surprising, since several previous GSS studies showed little or no impact on associations from using response scales with more categories. It is also expected because on the 2002 GSS the correlation between the 4- and 5-category health items is 0.85. Moreover, if Excellent on the 4-category scale is considered consistent with Excellent or Very Good on the 5-category scale, and likewise Good with Very Good or Good, then 93.6% of the cases fall on the diagonal when the items are cross-tabulated. Also, as indicated above, there is little impact on the bottom two categories, and Singer (1994) argues that the “predictive value of self-rated health is driven by ratings of fair or poor health”.
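The consistency rule described above can be written out as a small sketch. The cell counts are hypothetical and are not the 2002 GSS cross-tabulation. The mapping for Excellent and Good follows the text, while treating Fair and Poor as consistent only with themselves is an assumption made for the example.

```python
# 4-category answer -> 5-category answers counted as consistent with it.
# The "fair" and "poor" rows are an assumption; the text specifies only
# the "excellent" and "good" rows.
CONSISTENT = {
    "excellent": {"excellent", "very good"},
    "good": {"very good", "good"},
    "fair": {"fair"},
    "poor": {"poor"},
}


def percent_consistent(crosstab):
    """Percent of cases whose two answers agree under CONSISTENT.

    crosstab maps (four_category_answer, five_category_answer) -> count.
    """
    total = sum(crosstab.values())
    consistent = sum(
        count
        for (answer4, answer5), count in crosstab.items()
        if answer5 in CONSISTENT[answer4]
    )
    return 100.0 * consistent / total


# Hypothetical cell counts, NOT the 2002 GSS data.
toy_crosstab = {
    ("excellent", "excellent"): 20,
    ("excellent", "very good"): 10,
    ("excellent", "good"): 2,
    ("good", "very good"): 18,
    ("good", "good"): 25,
    ("good", "fair"): 3,
    ("fair", "good"): 2,
    ("fair", "fair"): 15,
    ("poor", "poor"): 5,
}

print(percent_consistent(toy_crosstab))
```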
This evaluation of self-rated health items indicates that 1) there is no discernible difference in the explanatory power of the two scales, 2) major shifts in the distributions occur at the positive end, but little changes at the negative end, 3) the variation in the contributions from Excellent and Good to the added Very Good option would not allow trends in these categories to be reliably estimated across scales and, as a result, would restrict trend analysis combining both 4- and 5-category data points to comparing the bottom two responses with the combined top two or three categories, and 4) correlations across studies using the 4- and 5-category scales might be compared, since the scales do not produce different estimates.
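The restriction in point 3 amounts to collapsing both scales to a common dichotomy before pooling data points. A minimal sketch of that harmonization follows; the function names and the example responses are hypothetical, and the labels for the collapsed categories are invented for the illustration.

```python
BOTTOM = {"fair", "poor"}
TOP = {"excellent", "very good", "good"}


def collapse(response):
    """Map a 4- or 5-category answer onto a shared two-category scale."""
    answer = response.strip().lower()
    if answer in BOTTOM:
        return "fair/poor"
    if answer in TOP:
        return "good or better"
    raise ValueError(f"unrecognized response: {response!r}")


def pct_fair_poor(responses):
    """Percent of valid responses that fall in the bottom two categories."""
    collapsed = [collapse(r) for r in responses]
    return 100.0 * collapsed.count("fair/poor") / len(collapsed)


# Hypothetical responses from a 4-category and a 5-category survey year.
year_a = ["Excellent", "Good", "Good", "Fair", "Poor"]
year_b = ["Excellent", "Very good", "Good", "Fair", "Good"]

print(pct_fair_poor(year_a), pct_fair_poor(year_b))
```

Because "very good" exists only on the 5-category scale, folding it into the top group is what makes the two series comparable; any finer split of the positive end would reintroduce the scale effect.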
The large impact of the shift in response scales over part of the distribution and the unexpected nil impact on correlations underscore that survey researchers must be careful whenever changing methods. Changing methods should always be presumed to muddy, if not eviscerate, valid comparisons. Additionally, changes will often not yield the improvements expected. When modifications are introduced, experiments and other rigorous designs should be utilized, and any expected improvements need to be verified.
The GSS is the largest and longest-term project of the Sociology Program of the National Science Foundation. It has conducted 26 national, in-person, full-probability surveys of adults living in US households between 1972 and 2006 (Davis et al. 2007).
See Hardy and Pavalko 1986; Idler and Angel 1990; Siegel 1994; Perry et al. 1996; Idler and Benyamini 1997; Ferraro and Farmer 1999; Remle 2004.
The NHIS is the main, continuous health monitoring study of the household population conducted by the US government. For more information see www.cdc.gov/nchs/about/major/nhis/hisdesc.htm. On the switch see Kovar and Poe (1985).
On the meaning of the self-rated health measure and how evaluations are done by respondents see Groves et al. 1992; Mallinson 2002; Schechter 1993; Sehulster 1994; Singer 1994.
On the GSS, see Peterson 1985; Smith 1994a,b. Alwin (1992) found a slight increase in reliability moving from 4 to more than 4 categories, but Davis et al. (1996) found no gains between 4 categories and 5–6 categories.