Psychometric scale development remains an important part of survey methodology and social science research, requiring item generation, factor analysis, and reliability assessment (e.g., Krumm et al. 2024; Stefana et al. 2025). This rigorous process requires considerable expertise and training. Recent advances in Artificial Intelligence (AI), particularly Large Language Models (LLMs) such as ChatGPT, may shift how scales are created, allowing individuals without methodological expertise to create scales that function comparably to conventionally developed scales (Beghetto et al. 2025; Götz et al. 2024).
AI in Item and Scale Generation with Human Oversight
Early work, such as the AI-IP project and the Psychometric Item Generator, displays an ability to create large pools of related items that can then be reduced through expert intervention (Götz et al. 2024; Hernandez and Nie 2022). Such programs increase efficiency by reducing the time experts spend curating large numbers of verbally similar items, but they still require expert guidance for proper scale construction (Hernandez and Nie 2022; Wang and Chuang 2023). Moreover, most AI-generated instruments have required expert revision to ensure that the language generated for each item adhered to the underlying construct without becoming redundant (Franco-Martínez et al. 2023; Laverghetta and Licato 2023; Mussel 2025). These programs highlight the underlying question of whether LLM-based AI tools can create psychometrically valid scales without the intervention or oversight of experts.
Toward Democratization of Scale Development
Beyond streamlining scale creation, it can be argued that AI represents a shift toward methodological democratization, in which these tools support, but do not replace, expertise (Smith et al. 2025). Widespread adoption of increasingly capable tools like ChatGPT raises the possibility that non-experts could design functional scales with little methodological training. This could democratize access to, and reduce the costs of, scale creation for applied research and surveys, and may be particularly relevant for fields that rely on rapid, iterative approaches to research. However, this process raises serious concerns about the cultural bias of LLM tools, a point that may deeply affect the construction of scales designed to study rare or stigmatized facets of the human experience (Dong and Dumas 2025).
Current Study
The present study evaluates whether unedited, ChatGPT-generated scales are comparable to conventionally developed and well-established instruments. We set out to test these scales “as is” against established measures of self-esteem and right-wing authoritarianism. By focusing on factor structure, internal consistency, convergent validity, and Bland–Altman agreement, our work provides early guidance on whether LLMs can produce ready-to-use instruments without expert intervention. This work contributes to debates on the proper place, utilization, and abilities of LLM AI tools within the broader survey research space.
Method
Participants
A total of 140 undergraduates were recruited through the department’s Sona system at a medium-sized Mountain West university. Participants (76% female, 19% male, 5% non-binary) averaged 25.16 years of age (SD = 7.22). Race/ethnicity was reported as White/Caucasian (60%), Hispanic/Latino (40%), Black/African American (15%), Asian (13%), Native Hawaiian/Pacific Islander (6%), American Indian/Alaska Native (3%), and Other (20%).
Measures
AI-Generated Scales
GPT-4 Turbo was prompted to create a self-esteem scale (“create a scale for measuring self-esteem”) and a right-wing authoritarianism (RWA) scale (“construct a 15-item right-wing authoritarianism scale”). This work emphasizes short scales (≤ 15 items) because most non-experts are likely to seek brief instruments that facilitate quicker survey completion. The self-esteem scale was presented as unidimensional; the RWA scale included three proposed subscales: submission to authority, aggression toward outgroups, and conventionalism (OpenAI 2024). Items for both scales were rated from 1 (Strongly Disagree) to 5 (Strongly Agree).
Conventional Scales
For comparison, Rosenberg’s (1979) Self-Esteem Scale (RSES) was used, consisting of 10 items (five positively worded, five negatively worded) rated from 1 (Strongly Disagree) to 4 (Strongly Agree). For RWA, Altemeyer’s (2022) 10-item short form was used. Although originally designed and argued to be unidimensional, prior evidence strongly supports a three-factor structure (submission, aggression, conventionalism; Zakrisson 2005).
Analytic Approach
Principal components analyses were conducted on each conventional and AI-generated scale and subscale. Reliability was assessed with Cronbach’s α, with Feldt’s tests used to compare reliability between conventional and AI-generated scales. Convergent validity was examined using correlations and paired-samples t-tests. Agreement between the scales was assessed with Bland–Altman analyses.
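As a point of reference, the two core computations named above can be sketched in a few lines of Python. This is an illustrative sketch with toy data, not the study's actual analysis code; the function names are our own.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)   # per-item sample variances
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

def bland_altman(a, b):
    """Mean bias and 95% limits of agreement for paired score vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Toy illustration: 3 respondents, 2 perfectly aligned items.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])   # -> 1.0
bias, limits = bland_altman([1, 2, 3], [1, 2, 3])  # -> 0.0, (0.0, 0.0)
```

In a Bland–Altman analysis, a mean difference (bias) near zero with narrow limits of agreement indicates that two measures can be used interchangeably at the level of individual respondents, which is a stronger claim than a high correlation alone.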
Results
Initially, each scale was assessed for face validity (see Appendix for full item breakdown for both scales). Because LLMs are trained on text blocks that may include existing psychological scales, we explicitly examined the degree of overlap between the AI-generated items and the conventional comparison scales. Two authors independently compared each AI-generated item to items from the corresponding conventional scale and classified them as (a) verbatim or near-verbatim paraphrases, (b) conceptually similar but distinct in wording, or (c) novel content with no clear counterpart.
No AI-generated items were verbatim copies of existing items, and only two of 25 items were judged to be close paraphrases: Item 1 of the RSES with Item 2 of the AISES, and Item 6 of the RWA-10 with Item 3 of the AIRWA. The majority of items were classified as conceptually similar but distinct. Additionally, two items in the AISES (Items 6 and 7) had no counterparts in the RSES, and three items in the RWA-10 (7, 8, and 10) were judged to be conceptually distinct from any of the 15 items in the AIRWA. Thus, although the AI-generated scales were intentionally organized around the same conceptual domains and subdomains as the RSES and RWA-10 to facilitate comparison, the specific item wordings are not simple reproductions of those existing measures.
When the AI-generated scales were compared to items from other existing scales, however, item reproduction did occur. The AIRWA was compared to the items from four additional right-wing authoritarianism scales (Bizumic and Duckitt 2018; Funke 2005; Manganelli Rattazzi et al. 2007; Zakrisson 2005). The first item in the AIRWA appears to be a nearly verbatim copy of an item that appears in both the Funke (2005) and Manganelli Rattazzi et al. (2007) scales; Item 1 also appears in very similar form in the American National Election Studies (ANES) childrearing battery. Although there are no other verbatim copies, several items are linguistically very similar to items that appear in one or more of the five conventional scales. The items in the AISES were then compared to two self-esteem scales (Helmreich and Stapp 1974; Ryden 1978) and Ryff’s (1989) Psychological Well-Being Scale. Though no item was a verbatim copy, Items 1, 3, 4, 6, 7, and 10 were semantically very similar to items in the existing conventional scales.
As shown in Table 1, both self-esteem scales explained a similar amount of variance, though the AI version yielded higher internal consistency. The AI-generated scale produced a single-factor solution, as described by the model, whereas the RSES produced the familiar two-factor solution along lines of positive/negative item wording. The two self-esteem scales correlated strongly, and paired-samples t-tests revealed no mean differences between the standardized scales. Bland–Altman analysis indicated negligible bias and no proportional error, suggesting strong agreement between the conventional and AI-generated scales.[1]
For RWA, Tables 2 and 3 show analyses for the three subscales identified in prior research and by ChatGPT, with correlations ranging from modest (conventionalism) to strong (submission, aggression). When both RWA scales were forced into a single-factor solution (Table 4), the AI scale demonstrated stronger internal consistency and remained highly correlated with the RWA-10, indicating comparable coverage of the broader construct. Bland–Altman analyses revealed negligible bias at the full-scale level but wider limits of agreement for the subscales, suggesting greater individual variability in subscale scores (see Figures 1 and 2).
Discussion
This study assessed whether unedited AI-generated scales could perform comparably to established psychometric instruments. Across both self-esteem and right-wing authoritarianism, the AI-generated scales and subscales showed reliability, factor structures, and correlations similar to, or stronger than, those of the conventional scales and subscales. Bland–Altman analyses revealed negligible bias at the full-scale level, with no evidence of proportional error. Taken together, these findings suggest that, without expert refinement, AI-generated instruments may function as viable tools for research purposes.
However, our results also point to the potential limits of AI generation. Importantly, we do not claim that the LLM is generating scale items ex nihilo; outputs are almost certainly informed by exposure to text that overlaps with existing measures. Our focus here is on whether LLMs can efficiently produce item pools that (a) avoid verbatim reuse of existing items and (b) reproduce the psychometric structure and performance of established scales, thereby offering a practical tool for adapting or extending measurement in new contexts. In this area the results are admittedly mixed. As indicated, GPT-4 Turbo produced some verbatim or nearly semantically identical items. However, given the well-trodden territory both scales address, combined with the issue of content validity and what Haynes et al. (1995) call the “universe of items” that could reasonably be created, true item novelty may be difficult for experts and AI alike.
Additionally, analyses of the subscales show wider limits of agreement, indicating greater variability at more refined levels of measurement. This highlights the continued importance of expert review, particularly when instruments are intended to capture complex, multidimensional constructs or to provide greater measurement precision. Although our work addressed reliability and agreement, we did not examine cultural bias, redundancy, or conceptual drift, concerns well documented in prior research (Franco-Martínez et al. 2023; Mussel 2025). Nor does our work address perhaps the most fundamental issue with the use of LLMs in this domain: that the programmatic default for AI-generated scales is the development of new “jangles” (Kelley 1927) that gravitate toward linguistically newer but nearly identical versions of existing items within the nomological net of a construct.
Another limitation concerns scope and generalizability. Our study focused on two well-established measures for well-studied constructs (self-esteem and RWA) within a sample of undergraduates. Although these constructs offer strong test cases, broader replication across domains and populations is necessary. Additionally, survey practice contexts often require scales that function within defined time and budget constraints. AI-generated scales could provide a solution in such cases, but their use and viability should be tempered by individual project needs pertaining to precision and cultural bias.
Our limited work currently supports the democratizing potential of LLMs in psychometric scale development. These tools may lower the barriers to creating reliable, ready-to-use instruments, extending the ability to create valid measurement tools to non-experts, practitioners, and researchers working under resource constraints. However, democratization should not be mistaken for full automation or novelty. Even if AI-generated instruments demonstrate strong psychometric properties in initial tests, expert oversight remains critical for ensuring conceptual clarity, cultural sensitivity, and alignment with established theory.
Future research should extend this work by testing additional constructs, particularly novel scales without existing counterparts; by purposefully creating short-form scales; by applying cross-cultural samples; and by exploring real-world survey contexts where rapid scale development is needed. Ultimately, our findings provide cautious optimism: with limited vetting, AI-generated scales may serve as practical tools in many settings, marking a meaningful step toward more accessible, efficient, and inclusive measurement practices.
Implications for Practice
Findings from this study suggest that unedited AI-generated scales can perform comparably to established measures on key psychometric indices. For survey practitioners, this opens the possibility of rapidly generating instruments when time or resources are limited. Tools such as ChatGPT can help create scales that approximate conventional tools’ reliability and structure, reducing barriers to measurement in many applied settings and potentially democratizing the research process for organizations with limited analytic capacity.
At the same time, these tools should be used with caution. Wider variability at the subscale level, along with unresolved issues of cultural bias and conceptual drift, underscores the importance of expert review. As AI becomes more widely adopted—often without direct involvement from measurement specialists—it will be increasingly important to evaluate the quality of unedited, AI-generated scales, especially given the potential for significant linguistic similarity with existing scales. Practitioners may treat AI-generated items as an efficient starting point, but any resulting scale should be thoroughly vetted and, when necessary, revised before deployment. With responsible application, AI-based item generation can enhance accessibility, reduce costs, and meaningfully expand the toolkit of applied survey research.

