Artificial intelligence (AI) approaches have progressed rapidly in recent years and are now providing practical tools that are changing the way work is done in many fields of endeavor. One such widely accessible tool is ChatGPT (ChatGPT 2023), an AI internet chat program launched in 2022 by the company OpenAI (Wikipedia 2024).
Despite some concerns about the potential misuse of such AI methods (Federspiel et al. 2023), these new techniques are being used increasingly in various psychological and mental health applications, often with good success. For example, a study by Naher (2024) showed that ChatGPT generated accurate, high-quality responses to patient queries around a variety of mental health issues including anxiety and depression. Maurya et al. (2025) conducted a study in which two highly experienced mental health counselors evaluated ChatGPT responses to a set of queries related to depression, anxiety, substance abuse, and interpersonal issues. Their findings showed that the ChatGPT results were quite accurate, clear, relevant and appropriately empathic in tone.
Researchers and practitioners are often faced with the challenge of summarizing core themes or issues contained in text data collected from interviews or survey responses. However, a large commitment of time and resources is often required for thematic analysis of such open-ended data (Feuston and Brubaker 2021), particularly for large datasets (Castleberry and Nolen 2018). Unfortunately, as a result open-ended text responses in surveys are sometimes never fully analyzed or are treated only cursorily by researchers (Male 2016; Rouder et al. 2021). Typically, these responses must first be entered into some form of computer file, and then read and classified by one or more human coders. Depending on the nature of the data, this can be quite burdensome, leading some investigators to postpone the process, sometimes indefinitely. According to Terry et al. (2017), this is especially a challenge for junior investigators. In this context, it is worth considering if recent advances in AI can be applied to help in the thematic analysis of qualitative text data.
Following a surge of technical advances in AI and machine learning over the last 5-10 years, there are now multiple natural language AI programs available to researchers and the public alike. Most notably these include Microsoft’s Copilot, Google’s Gemini, OpenAI’s ChatGPT, and the Chinese entry DeepSeek (Ortiz 2024). Most recently, the company xAI has released Grok 3, claiming it to be the most powerful natural language chatbot available today (Duffy 2025).
A number of recent studies have examined ChatGPT for analyzing qualitative text data, concluding it is a promising tool in this regard. For example, Bijker et al. (2024) used ChatGPT as an aid in the analysis of text taken from a forum on reducing dietary sugar, which has been linked to several mental health problems including depression (Zhang et al. 2024). Using both inductive and deductive qualitative content analysis, Bijker et al. (2024) report high agreement between ChatGPT responses and those obtained from human coders. In another recent study, Morgan (2023) used ChatGPT to re-analyze text data from two previous studies to see whether it could reproduce the thematic categories that the author had originally identified. Here, the text consisted of new graduate students’ descriptions of the challenges and stressors they faced during the year. Results showed that ChatGPT was successful in identifying the concrete, descriptive themes previously identified by human coders. However, ChatGPT was somewhat less successful in identifying “subtle, interpretive” themes (Morgan 2023). A Japanese study also found that ChatGPT lacked sensitivity to the Japanese cultural context when identifying themes in text material (Sakaguchi et al. 2025). Finally, a comprehensive study by Zhang et al. (2025) examined the utility of ChatGPT in qualitative text analysis, concluding that “ChatGPT can effectively analyze qualitative data” (p. 24), particularly when prompts are carefully designed and constructed.
Thus, while ChatGPT shows considerable promise as a tool for analyzing text in a variety of studies, the need remains to evaluate its utility and accuracy with various types of data and conditions. The present study applies ChatGPT to the analysis of survey responses from U.S. Military Academy (West Point) cadets obtained shortly after the 9/11 terrorist attacks on the World Trade Center and Pentagon, and compares the ChatGPT results against those provided by human coders using a conventional thematic analysis approach (Braun and Clarke 2021).
Methods
The text material for the present study was taken from archival survey data obtained from West Point cadets following the 9/11 terrorist attacks. The U.S. Military Academy (USMA) at West Point is located just 50 miles north of the site of the New York City attacks. The survey was part of a research project conducted by the author while teaching at West Point and was reviewed and approved by the USMA Office of Institutional Research. Freshman cadets (N=134) who were enrolled in an introductory psychology course responded to the voluntary survey, which included several open-ended questions aimed at identifying cadet reactions to the terror attacks. The survey was administered during class, and cadets who volunteered to participate received course credit. Cadets provided short answers to the following two questions: (1) How were you personally affected by the 9/11 terrorist attacks on the World Trade Center and Pentagon? and (2) What are your feelings about these events and how the government has responded to them? Written responses were typed into a Microsoft Word document for possible later analysis.
Unfortunately, question #2 is recognized as a “double-barreled” question, which asks two questions within one. With a double-barreled question, it is likely that some respondents may respond to only one aspect of the question, ignoring the other. This was not expected to influence results of the present analysis, the main point of which was to compare ChatGPT results against those of human coders.
The author conducted a thematic analysis of responses to both questions, following steps as outlined by Braun and Clarke (2022) including (1) data familiarization, (2) production of initial codes, (3) search for general themes, (4) reviewing themes, and (5) defining and naming themes. In this case, our analytic strategy falls closest to the “coding reliability” thematic analysis approach as described by Braun and Clarke (2021). Our focus is on the manifest content of the text responses, as opposed to latent content analysis in which investigators seek to interpret the underlying, deeper meanings and themes represented in the text (Lune and Berg 2018). However, coders were not just counting specific words, a method which would more properly be considered quantitative rather than qualitative. Our approach follows what Hsieh and Shannon have described as “summative content analysis”, which goes beyond mere word counts to recognize the latent meaning within, for example, indirect or euphemistic phrasing (Hsieh and Shannon 2005).
Two independent coders read all responses and classified them into major themes in an inductive manner, with no pre-existing coding scheme or theory. Similar to Liu (2025), each individual response could receive multiple theme codes whenever more than one concept was mentioned within the response. However, each theme, if present, was coded at most one time per individual response. Coder reliability was assessed in terms of percent agreement between the two coders. For question 1, overall agreement was 85.26%, and for question 2, agreement was 78.66%. Discrepancies were resolved by consensus.
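The percent agreement figures reported above can be illustrated with a short calculation. The sketch below is only one plausible way to compute such a figure and is not necessarily the procedure used in this study: it assumes that agreement is tallied over every response-by-theme decision (theme present vs. absent). The function name `percent_agreement`, the theme labels, and the sample codes are all hypothetical.

```python
def percent_agreement(coder_a, coder_b, themes):
    """Percent of (response, theme) present/absent decisions on which two coders agree.

    coder_a, coder_b: lists of sets, one set of theme codes per response.
    themes: the full list of theme codes under consideration.
    This decision-level definition is an assumption made for illustration.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same number of responses")
    total = len(coder_a) * len(themes)
    if total == 0:
        raise ValueError("Need at least one response and one theme")
    agreed = 0
    for codes_a, codes_b in zip(coder_a, coder_b):
        for theme in themes:
            # Agreement means both coders marked the theme present, or both absent.
            if (theme in codes_a) == (theme in codes_b):
                agreed += 1
    return 100.0 * agreed / total


# Hypothetical codes for four responses (not the study data).
themes = ["emotional", "duty", "no_impact"]
coder_a = [{"emotional"}, {"duty", "emotional"}, {"no_impact"}, set()]
coder_b = [{"emotional"}, {"duty"}, {"no_impact"}, {"emotional"}]

print(round(percent_agreement(coder_a, coder_b, themes), 2))
```

Simple percent agreement does not correct for agreement expected by chance; a chance-corrected index such as Cohen's kappa could be reported alongside it where that matters.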
ChatGPT was next applied to thematic analysis of the same two questions. The ChatGPT program is accessed simply by opening a browser and navigating to the ChatGPT website (www.chatgpt.com). All of the present analyses were conducted using the (free) version 4.0 of ChatGPT.
Specific instructions or prompts were given to ChatGPT as detailed below and in the supplemental materials. In order to gauge the accuracy of the ChatGPT results, these were compared with results provided by the human coders.
Results
Table 1 below presents the key themes identified by human coders, examining cadet open-ended responses to the question: “How were you personally affected by the 9/11 terrorist attacks on the World Trade Center and Pentagon?” The counts listed in the tables represent the number of individuals who mentioned a given theme in their response (Liu 2025; Lune and Berg 2018).
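The counting rule, in which a theme is tallied at most once per respondent, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the tabulation actually used: the function name `theme_counts`, the theme labels, and the sample codes are hypothetical.

```python
from collections import Counter


def theme_counts(coded_responses):
    """Count how many respondents mentioned each theme.

    coded_responses: one iterable of theme codes per respondent.
    Converting each respondent's codes to a set ensures that a theme
    is counted at most once per respondent, even if coded repeatedly.
    """
    counts = Counter()
    for codes in coded_responses:
        counts.update(set(codes))
    return counts


# Hypothetical codes for four respondents (not the study data).
coded = [
    ["emotional", "duty"],
    ["emotional", "emotional"],  # duplicate code counts once
    ["no_impact"],
    [],                          # respondent with no codable theme
]
counts = theme_counts(coded)
print(counts["emotional"], counts["duty"], counts["no_impact"])
```

Because a single response can carry several theme codes, the counts across themes may sum to more than the number of respondents.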
Next, we analyzed the same responses using ChatGPT. With ChatGPT or any language-based AI tool, it is important to provide the program with detailed background information regarding the task, and clear instructions (“prompts”) as to the output that is desired (Zhang et al. 2025). The investigator first gives ChatGPT the necessary background on the nature of the text data, along with specific and detailed instructions. No restrictions were placed on ChatGPT regarding the number of themes to extract, or the desired format for the results, allowing the program to be completely guided by the data. In the final prompt, the text of cadet responses to be analyzed is provided to ChatGPT. The results for the ChatGPT analysis are summarized in Table 2 below. The full text of the author’s prompts or queries to ChatGPT for open-ended question 1 as well as the responses generated by ChatGPT are provided in Supplemental material (1).
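The three-part prompting sequence described above (background, instructions, then the text itself) can be sketched as follows. The prompts actually used in this study appear in the supplemental material; the wording below is hypothetical and serves only to show the structure. The function `build_prompts` is likewise an illustrative name, and the sketch only assembles text to be pasted into the chat window; it does not call any ChatGPT interface.

```python
def build_prompts(background, instructions, responses):
    """Assemble the three prompts in order: background, instructions, data.

    responses: list of response strings; they are numbered so that the
    output can be traced back to individual respondents.
    """
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(responses, start=1)
    )
    return [
        f"Background: {background}",
        f"Instructions: {instructions}",
        f"Responses to analyze:\n{numbered}",
    ]


# Hypothetical wording (not the author's prompts).
prompts = build_prompts(
    background="Short written survey answers from military academy cadets.",
    instructions="Identify the core themes and report how many responses mention each.",
    responses=["I was shocked.", "No effect."],
)
for prompt in prompts:
    print(prompt)
    print("---")
```

Keeping the background and instructions separate from the data makes it easier to reuse the same framing for a second question or a second batch of responses.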
Responses to the second question were also analyzed for core themes, first by human coders and then using ChatGPT. The question was: “What are your feelings about the 9/11 terrorist attacks on the World Trade Center and Pentagon, and how the government has responded to them?” Not all respondents addressed both aspects of this “double-barreled” question, with some choosing to only describe their feelings, while others focused only on government responses.
Once again for comparison, results from the human coders are presented first (Table 3), followed by the ChatGPT results (Table 4). The complete ChatGPT responses for question 2 are provided in Supplemental material (2).
Discussion
Researchers and clinicians often need to summarize open-ended text data obtained from a variety of sources, including patient satisfaction surveys, research surveys, focus groups, and case reports. While techniques exist for the human coding of such material, these approaches are both costly and time consuming, and so often are given limited attention (Rouder et al. 2021; Zhang et al. 2025). The present study sought to assess the utility of applying AI tools, in this case ChatGPT, to facilitate the rapid and cost-effective identification of core themes in open-ended text data. To this end, the reactions of first-year military academy cadets at West Point, New York to the 9/11 terrorist attacks provided the raw verbal data. In response to specific instructions, ChatGPT generated a detailed list of themes extracted from the data, while also providing frequencies and sample quotes.
In comparing the results from the human coding with those from ChatGPT, it is clear that the overlap is considerable, although not perfect. For the first question, both approaches identified emotional responses, such as shock, sadness, anger and disbelief as the primary theme. However, there were some differences. For example, ChatGPT separated anger and desire for retribution into two distinct categories, while the human analysis grouped anger under the more general “Emotional responses” category. Both approaches identified a theme of “Concern and sympathy for those affected,” including worries about one’s own friends and relatives. Both approaches also noted a common theme of increased sense of duty, purpose and commitment to one’s profession. In this regard, ChatGPT separated out three distinct categories, “Patriotism and Increased Sense of Duty”, “Sense of Purpose and Career Focus”, and “Sense of Clarity about Career and Calling”. While these different categories appear justifiable, a higher-level grouping would likely merge these together, as was done in the human coding.
The human coding identified two categories, “Increased awareness” and “Future orientation” which do not immediately appear in the ChatGPT results. However, a closer examination reveals considerable overlap with ChatGPT’s theme of “Personal reflection and life lessons”, which is about greater awareness of life’s finitude and uncertainty about the future. There is also commonality with ChatGPT’s “Vulnerability and realization of uncertainty”. While the labels attached to these themes by the human vs. ChatGPT approaches are not identical, it appears that both approaches have identified cadets’ personal concerns and reflections on their future (ChatGPT: “Personal reflection and life lessons”; human coding: “Future orientation”), treating this as a separate theme from concerns and reflections on a national or global level (ChatGPT: “Vulnerability and realization of uncertainty”; human coding: “Increased awareness”). Thus, on these dimensions, both approaches appear to have recognized the same underlying themes, while attaching somewhat different descriptive labels.
Finally, both analytic strategies identified a category of minimal to no personal impact, again with slightly different labels. On this factor, the human coders found quite a few more mentions (33) than did ChatGPT. This likely indicates that ChatGPT counted only those instances where the response was simple and unidimensional, for example “I was not directly affected” or “No effect”. On the other hand, the human coders scored a mention in this category even in more varied responses that covered several themes, as in “I wasn’t necessarily personally affected by this event, but I think many Americans realize the importance of our military.” This suggests that ChatGPT is less sensitive to subtleties or latent content in the text material, similar to what Morgan (2023) reported.
Overall, core themes obtained by the ChatGPT approach were quite similar to those resulting from the human coding. The major difference appears to be a somewhat greater level of detail provided in the ChatGPT results, which may afford some advantage over the human coding. In this case, it may even be that ChatGPT was actually somewhat more sensitive than the human coders for identifying fine distinctions within the data. However, it should be remembered that instructions to ChatGPT were fairly open ended, allowing the program to decide how many distinct themes could be discerned in the data.
In analyzing the second question, the ChatGPT results appear to correspond with those of the human coders even more closely, although attaching somewhat different labels to the themes (Tables 3 and 4). Both approaches found the most common theme to be approval of the government’s response to the attacks, while also noting a category of criticism of the government’s actions or failing to do enough. Both approaches also found themes of a need for swift and aggressive action against the terrorists, as well as positive effects in terms of increased patriotism and national unity. “Emotional responses” was recognized as a theme by both approaches, although ChatGPT separated out “Anger and outrage” as a distinct category, while the human coders included anger under a more general “Emotional reactions” theme. Both approaches also noted a theme of sorrow or grief over the attacks, again with somewhat different labels (ChatGPT: “Recognition of the tragedy and loss”; Human coders: “Sorrow and grief over the tragic loss”). Overall then, the results from the ChatGPT analysis show a fairly close correspondence to those of the human coders for this question.
One possible explanation for the differences observed between the human coders and ChatGPT is that, not surprisingly given how it was developed and trained, ChatGPT tends to be highly literal in identifying words or phrases that are alike and classifying them accordingly. This was more noticeable in the results for the first question, where ChatGPT identified 11 distinct categories or themes (Table 2), vs. the 8 found by the human coders. Although this issue might be mitigated somewhat by specifically limiting the number of categories in the instructions to ChatGPT, there remains the potential for ChatGPT and similar AI tools to overlook subtleties of meaning contained in the text material (Morgan 2023). Considering this, a useful approach may be a ChatGPT/human hybrid, wherein ChatGPT is allowed freedom to identify as many distinct themes or categories in the data as possible, with the human coders then carrying out the next steps of grouping these initial categories into fewer, more general categories or themes. In this way, the investigator remains intimately involved in the analysis, while considerable time is saved on the initial task of data processing and coding. This and similar ways to combine ChatGPT with human coding in qualitative content analysis of text material are discussed in some detail by H. Zhang and colleagues (Zhang et al. 2025).
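The grouping step in such a hybrid workflow can be sketched in code. This is a minimal illustration under stated assumptions, not a procedure used in the present study: the function `merge_themes` and the broad label are hypothetical, and the fine-grained labels are borrowed from the ChatGPT themes discussed above. Recounting from the per-response codes, rather than summing the fine-grained frequencies, avoids counting a respondent twice when two fine themes fall under the same broad category.

```python
from collections import Counter


def merge_themes(coded_responses, fine_to_broad):
    """Recount respondents per theme after a human groups fine themes.

    coded_responses: one set of fine-grained theme codes per respondent.
    fine_to_broad: mapping from fine-grained theme to broad category;
    themes absent from the mapping are kept unchanged.
    """
    counts = Counter()
    for codes in coded_responses:
        # A set keeps each broad category to one count per respondent.
        broad = {fine_to_broad.get(code, code) for code in codes}
        counts.update(broad)
    return counts


# Hypothetical broad label chosen by a human coder.
DUTY = "Duty and commitment"
fine_to_broad = {
    "Patriotism and Increased Sense of Duty": DUTY,
    "Sense of Purpose and Career Focus": DUTY,
    "Sense of Clarity about Career and Calling": DUTY,
}

# Hypothetical per-response codes (not the study data).
coded = [
    {"Patriotism and Increased Sense of Duty", "Sense of Purpose and Career Focus"},
    {"Sense of Clarity about Career and Calling"},
    {"Anger"},
]
merged = merge_themes(coded, fine_to_broad)
print(merged[DUTY], merged["Anger"])
```

This requires that the AI tool return theme codes for each individual response, not only aggregate frequencies, which would need to be requested explicitly in the prompt.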
Advantages of ChatGPT
There are a number of advantages that make ChatGPT a promising tool for thematic analysis of text data. One concerns the speed with which initial processing of text data might be accomplished. Thematic analysis is often time consuming and labor intensive, especially when dealing with large datasets (Nowell et al. 2017). This can be frustrating for investigators, and sometimes may lead to incomplete analysis of text material from surveys (Rouder et al. 2021). Once the ChatGPT program has been primed with important background and context information, and given clear and appropriate prompts, results can be seen in seconds. This makes it possible to conduct at least preliminary processing of large amounts of text data in a fairly short time. Based on the present study, as well as several others in the literature, the ChatGPT program appears effective in identifying the central themes in a body of text data (Bijker et al. 2024; Morgan 2023; Wachinger et al. 2024). As discussed in the next section below, AI tools clearly have their limitations; however, there are many instances in research and clinical practice where they may be applied to obtain a tentative picture of the key themes or issues within an open-ended text data set, and where otherwise the time and resources needed for analysis by human coders may not be available.
Nevertheless, at this early stage in the development of AI tools such as ChatGPT, in order to preserve research integrity and trust in the results, AI should not be used as a substitute for human investigators and coders, but more as an adjunct or supplement. There are several ways this can be done. For example, ChatGPT could be applied to help uncover patterns in the data, providing initial insights that human researchers can then investigate further and seek to validate (Zhang et al. 2025). Such a hybrid approach to thematic analysis should lead to greater efficiency and time savings, while maintaining the confidence and accuracy provided by context- and culture-sensitive human researchers (Sakaguchi et al. 2025).
Limitations of ChatGPT
Despite its potential advantages for text data analyses, ChatGPT has a number of limitations that investigators and practitioners must take into account. One such issue is a general concern across all AI platforms. That is, these programs are only as good as the information used in their training, which could introduce biases of various types (Wikipedia 2024). As the use of programs like ChatGPT continues to grow, the language material available to them increases as well, presumably improving their knowledge base and reducing opportunities for bias.
Another significant concern relates to privacy. Under the terms of the present user agreements, all information entered into ChatGPT can be stored and used by the program for its training base. While most versions of ChatGPT offer an “opt-out” feature in this regard, it is nevertheless prudent for the investigator to ensure that no personal identifiers are contained in any text data submitted to ChatGPT, in order to protect the privacy of respondents (Paul et al. 2023). However, stripping personal identifiers from text data is not an absolute guarantee of respondent privacy, as the possibility of reidentification of individuals based on other information still exists. This underscores the need for greater transparency in how these AI programs are actually working.
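A first pass at removing obvious identifiers can be automated before any text is submitted. The sketch below is illustrative only and was not part of the present study: the function `redact` and its two patterns are assumptions, covering only email addresses and phone numbers written in a common U.S. format. It does not detect names, places, or other indirect identifiers, so manual review remains necessary.

```python
import re

# Simple illustrative patterns; real data will contain identifiers these miss.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace email addresses and U.S.-style phone numbers with placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


# Hypothetical response text (not the study data).
sample = "Contact me at jane.doe@example.com or 845-555-0100."
print(redact(sample))
```

Pattern-based redaction of this kind reduces, but does not eliminate, the reidentification risk noted above.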
Another potential limitation in the use of ChatGPT for text analysis concerns the need for clear instructions, including background and context. It is incumbent upon the investigator to state precisely the context in which the text data were collected, and to give at least a general idea of the population segment(s) that contributed to the data. Instructions or prompts to ChatGPT must be clear and explicit as to what it is expected to do. It should also be noted that the free version of ChatGPT has a word input limit (3,000 words as of this writing), limiting the size of text material that can be analyzed. Larger data sets can be analyzed using paid versions of ChatGPT, or other AI tools that are available (Somoye 2024).
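One way to work within an input limit is to split the responses into batches that each stay under the word cap. The sketch below is a hypothetical workaround, not a procedure used in the present study: the function `chunk_responses` is an illustrative name, word counts are approximated by splitting on whitespace, and the default of 3,000 words simply mirrors the limit cited above. Note that themes extracted from separate batches would still need to be reconciled by the investigator.

```python
def chunk_responses(responses, max_words=3000):
    """Group whole responses into batches of at most max_words words.

    Responses are never split mid-text. A single response longer than
    max_words is placed in a batch of its own and would need manual handling.
    """
    chunks = []
    current = []
    current_words = 0
    for response in responses:
        n_words = len(response.split())
        # Start a new batch if adding this response would exceed the cap.
        if current and current_words + n_words > max_words:
            chunks.append(current)
            current = []
            current_words = 0
        current.append(response)
        current_words += n_words
    if current:
        chunks.append(current)
    return chunks


# Toy example with a small cap so the batching is visible.
batches = chunk_responses(["a b c", "d e", "f g h i", "j"], max_words=5)
print(batches)
```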
A final limitation that should be mentioned concerns the secrecy with which AI companies maintain their internal algorithms and program workings; this material is considered proprietary by AI firms. This means that investigators cannot know precisely how AI programs like ChatGPT are analyzing data and making decisions, a situation that makes it difficult if not impossible to exactly replicate such studies. While OpenAI initially claimed to provide open-source language analysis programs, it has failed to do so (Bommasani et al. 2025). This lack of transparency needs to be addressed by AI companies, as it is fundamentally incompatible with the goals of objective scientific endeavor.
Conclusions and Future Directions
This study adds to a growing body of evidence that AI programs such as ChatGPT can be useful supplements in the thematic analysis of open-ended text material. In the present study, when given clear and detailed instructions, ChatGPT succeeded in identifying an appropriate set of core themes contained in West Point cadets’ reported reactions to the 9/11 terrorist attacks on the World Trade Center and Pentagon. Compared to the human coding, the ChatGPT codes were somewhat more concrete and detailed, while still showing considerable overlap with the themes identified by human coders.
The capabilities of AI tools are improving rapidly. Future research should be directed toward exploring the utility of newer versions of ChatGPT and similar AI tools to aid in the timely, accurate, and cost-effective processing of qualitative text data. Bearing in mind their limitations, AI tools like ChatGPT can, when properly used, provide a useful time- and cost-saving adjunct to human coders for processing and summarizing text-based data from open-ended survey questions, case studies, progress notes and various other sources.
Corresponding author contact information
Paul T. Bartone, bartonep@gmail.com
12019 E. ML Ave, Climax, MI 49034 USA
ACKNOWLEDGMENTS
Jocelyn Bartone, MA, assisted in coding of cadet responses. The author is also grateful to an anonymous reviewer for many helpful suggestions.
DISCLOSURES
None.
BRIEF BIOGRAPHY
Paul T. Bartone is a retired U.S. Army Colonel, an Adjunct Assistant Professor in Psychiatry at the Uniformed Services University of the Health Sciences, and a former Professor and Senior Research Fellow at the National Defense University in Washington, DC. He holds a Master’s and Ph.D. in Human Development from the University of Chicago, and was a Fulbright Scholar in Norway (2006-07). Bartone has taught leadership and information management at the National Defense University and at the U.S. Military Academy, West Point, where he also was Director of the West Point Leader Development Research Center.