Nick Byrd, Ph.D.
@byrdnick.com
3.2K followers 520 following 14K posts
I study how to improve decisions and well-being at @GeisingerCollege.bsky.social. 🎓 gScholar: shorturl.at/uBDPW ▶️ youtube.com/@ByrdNick 👨‍💻 psychologytoday.com/us/blog/upon-reflection 📓 byrdnick.com/blog 🎙️ byrdnick.com/pod
byrdnick.com
Is solipsism measurable?

Seven studies developed a self-report scale of how much one doubts that others exist (beyond one's own mind).

It correlated with loneliness, social disconnection, aggression, and problematic gaming.

doi.org/10.1016/j.pa...

#xPhi #psychometrics #psych
Is there anyone else out there? A measure of psychological solipsism
For centuries philosophers have debated solipsism, the idea that people cannot prove that anything exists outside of their own minds. However, there h…
doi.org
byrdnick.com
"younger colleagues ...look down on those who do not deposit research data.... 'if you say ‘data is available upon request,’ they take it as a #middleFinger. ...it would be so easy to share the data—it takes just as much energy to write that statement.'”

ir.library.illinoisstate.edu/fpml/268
Skaggs, L., Scott, R., & Cilento, C. (2026). Not Just Monetary: Arts and Humanities Scholars’ Perspectives on the Costs of Open Access Publishing. College & Research Libraries. https://ir.library.illinoisstate.edu/fpml/268
byrdnick.com
🤬 “Among articles stating that data was available upon request, only 17% shared data upon request. … Results replicate those found elsewhere: data is generally not available upon request, and promissory Data Availability Statements are typically not adhered to.”

#openScience #philSci #ethics #psych
ianhussey.mmmdata.io
My article "Data is not available upon request" was published in Meta-Psychology. Very happy to see this out!
open.lnu.se/index.php/me...
LnuOpen | Meta-Psychology
open.lnu.se
byrdnick.com
👆Yet another AI paper finds "combining [intuitive and reflective] decision modalities through a separate metacognitive function allow[ed] for higher decision quality with less resource consumption compared to employing only one of the two modalities."

doi.org/10.1038/s443...
The first page of the paper in the NPJ AI journal.
byrdnick.com
Proud of Vahid's use of #computational #CogSci to identify and compare #reasoning errors in #Reddit users and communities.

He's presenting it at an #AI + #decisionSci workshop at #CMU : www.cmu.edu/ai-sdm/r...

Follow him for alerts about this and more: www.researchgate.net...
Beyond Fact-Checking: Empowering Flexible Human-AI Teams to Detect & Counter Online Misinformation

Vahid Ashrafi, Nick Byrd, and Jordan W. Suchow

Misinformation is persistent: From pandemic conspiracies to election denial, false claims spread faster and farther than facts, threatening health and democracy.

Current defenses fall short: Fact-checking and moderation focus on sentiment but overlook the cognitive errors that drive spread. Biases like confirmation bias, emotional reasoning, and exaggerated thinking distort judgment and amplify viral narratives.

We detect and quantify 419 cognitive errors in Reddit users' text, building "cognitive fingerprints" that reveal reasoning flaws behind misinformation.

Impact: These insights enable flexible human-AI teams to anticipate, detect, and counteract misinformation.

Abstract. [Vahid constructed] a comprehensive knowledge base of 419 cognitive errors—including cognitive biases and logical fallacies—grounded in empirical findings from psychology and cognitive science. We then trained GPT-4.1, a state-of-the-art language model, to identify potential examples of these cognitive errors in users' posts. ... The robustness of AI-generated insights was rigorously validated by human annotators, achieving high reliability (Spearman’s ρ = 0.86). ... we propose actionable solutions such as targeted cognitive interventions to educate users about common reasoning errors, AI-driven moderation systems that identify problematic reasoning patterns rather than merely flagging false information, and personalized cognitive profiling tools to proactively identify at-risk individuals and communities.

The NSF AI Institute for Societal Decision Making (NSF AI-SDM) sponsors the participation of selected speakers and students in an annual workshop on Human-AI Complementarity for Decision Making. Human-AI Complementarity, defined as the condition in which humans + AI working together results in better decisions than humans or AI working alone, is a broad goal pursued in several projects of the NSF AI-SDM.

In 2025, we will focus on how to create flexible Human-AI teams to achieve complementarity. This theme refers to the interdisciplinary study of how to design and deploy AI systems in ways that are dynamically aligned with human values, robust to unexpected behavior, and safe even under failure modes. It encompasses short-term concerns about deployed systems (e.g., fairness, robustness, interpretability, and misuse) and long-term concerns about advanced general AI (AGI) that could have large-scale societal impacts if not aligned with human interests.
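The reliability figure in the abstract (Spearman's ρ = 0.86 between model and human annotators) is a rank correlation. A minimal stdlib sketch of that statistic, assuming no tied ranks (the authors' actual validation code and tie handling are not shown here):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    No tie correction, for brevity."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Perfectly monotone agreement gives ρ = 1, perfect disagreement ρ = -1; values like 0.86 indicate strong but imperfect agreement.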
byrdnick.com
How can #AI and #cognitiveScience improve #healthcare?

We got some answers from the #NudgesInHealthcare Symposium at #UPenn.

Check out this summary of some themes in the write-up below:

ldi.upenn.edu/our-wo...

#medicine #psych #econ #compSci #tech #LLM #edu #bioethics #xPhi
A poster presentation at UPenn with Nick Byrd and a poster about antibiotic resistance in the foreground.
Ashley West, PhD, Director of Behavioral Design at Lirio (left), discusses her poster, “Designing an ‘At Home’ Digital Health Intervention for Supporting Chronic Condition Lifestyle Management,” with LDI Senior Fellow Renée Betancourt, MD, an Associate Professor of Clinical Family Medicine and Community Health at the Perelman School of Medicine.
A panel discussion featuring Adam Rodman, Carissa Kathuria, Craig Joseph, and Kenrick Caito, moderated by Sri Adusimalli.
A panel discussion featuring Kim Waddell, Meeta Kerlin, Zahera Farhan, Sunita Desai, and Hayley Belli.
byrdnick.com
Thousands of people in a dozen countries thought reflective reasoning was usually the best way to make a decision in ordinary dilemmas.

Runners up were intuition, friends' advice, and the wisdom of a crowd (in that order).

doi.org/10.1098/rspb...

#cogSci #epistemology #xPhi
About the "everyday dilemmas". Descriptive statistics about the data. The descriptions of intuition, deliberation, friends' advice, and wisdom of a crowd. Results
byrdnick.com
Do reflection test solutions actually involve reflection?

Our think-aloud studies found they usually do (doi.org/10.14264/0f1...), but Ryan Jesson found that solution-prompting insight is often unconscious or spontaneous.

doi.org/10.14264/0f1...

#ProcessTracing #psychometrics
Participants and data Procedure Results (1 of 2) Results (2 of 2)
byrdnick.com
I sometimes use 'nations' instead of 'countries' to allow a post to satisfy character limits. However, this post had enough spare characters for the latter. So I'm doubly without excuse.

Thanks for sharpening our thinking!
byrdnick.com
Might the concept of "good judgment" vary by framing or social roles?

Five studies of four nations (🇺🇸🇨🇦🇬🇧🇨🇳) found some words and roles were more associated with "rational" than "reasonable" (and vice versa).

doi.org/10.1162/opmi...

#CogSci #xPhi #linguistics #dataViz
Descriptive statistics about the samples "Figure 1. Top ten most frequent co-occurrence of Adjectives with ‘rational’ (blue) and ‘reasonable’ (orange) when asked to describe most important characteristics of a person showing sound judgment (Study 1a) / good judgment in a challenging situation (Study 1b). Adjectives are ordered from those most associated with ‘rational’ to those with ‘reasonable.’ Dumbbell nodes represent the percentage of each adjective’s co-occurrence relative to the sum of independent occurrences of each pair of terms." "Figure 2. Qualities attributed to rational and reasonable persons in Study 1. Color-coded adjectives reflect Analytical, Moral, and Inner Fortitude items (Study 1a)/Agency and Communion factors (Study 1b). Top panels: Estimates from linear mixed model with responses to all characteristics nested in participants, with target order (rational vs. reasonable) as a covariate and false discovery rate correction for multiple testing. Dashed vertical line delineates effects 1 unit above midpoint of the 1–7 scale in Study 1a / half a unit above the midpoint of the 1–5 scale in Study 1b. Bottom panels: Pearson’s correlations and 95% CIs of average scores across items making up each factor. ***p < .001, **p < .01, *p < .05." "Figure 3. Top Panel: Preference proportions for reasonable vs. rational agents in social roles. Displays proportions (with 95% CI) derived from logits in generalized mixed models. The dashed line at .50 indicates parity; above this, preference leans towards reasonable agents, and below, towards rational agents. Bottom Panel: Necessity ratings for rationality and reasonableness by role (1–5 rating scale). Shows estimated means and 95% CI. Ratings indicate moderate to high necessity (3–4) for both rationality and reasonableness across rule-based (on the right) and holistic roles (on the left)."
byrdnick.com
Thanks to @baptistejacquet.bsky.social and all supporters of this 4-day HYBRID conference with presenters on many continents!

I'm tired of waking up every day before 3:00 a.m. for this event, but #accessible and #sustainable intl. #conferencing still beats the alternative!

doi.org/10.11647/OBP...
Alt: Neil Patrick Harris clapping his hands and expressing admiration in front of a pink background.
byrdnick.com
Closing my #HAR2025 experience was Laura Martignon on some of the apps her team have developed to help people see statistical patterns or #risk in repeated choices.

You can find out about some of these apps in the paper from #HAR2023:
doi.org/10.1007/978-...

🔓 www.researchgate.net/publication/...
Fig. 5. A fact box designed by the Harding Center and used by the largest health insurance company in Germany, the AOK, for communicating information on the risk reduction caused by regular screening.
Fig. 6. This icon array, unsorted (left) and sorted (right), represents 100 people, diseased or not diseased, who are tested as to whether or not they are HIV positive.
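Icon arrays like Fig. 6 make Bayesian quantities visible as counts. A minimal sketch of the underlying natural-frequency arithmetic (the function name and the numbers in the test are illustrative, not the Harding Center's):

```python
def positive_predictive_value(n, prevalence, sensitivity, specificity):
    """Natural-frequency reasoning: of n people, how many who test
    positive actually have the disease? Icon arrays display these
    counts directly instead of conditional probabilities."""
    diseased = n * prevalence
    true_positives = diseased * sensitivity
    false_positives = (n - diseased) * (1 - specificity)
    return true_positives / (true_positives + false_positives)
```

At low prevalence, even a fairly accurate test yields mostly false positives, which is exactly the pattern icon arrays are designed to reveal.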
byrdnick.com
How might #meditation impact mental habits?

At #HAR2024, Lachaud and Louis found a 10-minute #mindfulness exercise may have impacted cognitive rigidity (compared to a podcast about mindfulness): doi.org/10.1007/978-...

At #HAR2025, they found it also impacted confusion about plural noun agreement.
Agreement success was higher in the mindfulness group than the podcast group. Across nouns, verbs, and adjectives, there was an omnibus difference between the two conditions. However, the omnibus difference seems to have been driven by a group difference in verbs only.

Discussion
• A brief mindfulness session could favor cognitive flexibility measured in a grammar task
• Not enough evidence for a definitive answer
• Opens research on the Einstellung effect to a grammar paradigm
byrdnick.com
Are Eastern people more accepting of contradictions than Western people?

Hiroshi Yama found Japanese and Chinese people were NOT more accepting of contradiction on all measures (contrary to influential work from Peng & Nisbett).

More on *religious* contradiction below:

#culture #logic #psychology
3. General or religious contradiction

Dialectical thinking Measure
- 10 pairs of opinions which were opposite each other. Example:
A) I think that it is good to accept foreign cultures, to be part of a nation which responds to the globalizing world.
B) I don't think it is good to accept foreign cultures, because our traditional folk customs and cultures are broken.
(7-point scale)

Fig. 1 The mean score of Dialectical Self Scale (DSS) by country and education level (left: Japanese > Chinese > British) and the mean dialectical *thinking* score for each group (right: Japanese < British = Chinese).

(1) The results of previous studies were replicated using DSS, but the difference between Japanese and Chinese was added.
(2) The results of DT score did *not* support the hypothesis of Peng and Nisbett (1999)

(So Easterners probably do NOT accept contradictions more than Westerners. Easterners may just see the self as involving more contradiction than Westerners.)

Questionnaires

Religious beliefs (Kaneko & Watabe, 2003), 6-point scale (26 items)
-- Subscales: Pro-religiosity, divine protection, and retribution. 
-- Example A: Having faith gives me the meaning of life.

Anti-religious beliefs, 6-point scale: 26 items were made so that each item is opposite to that of the religious belief questionnaire
- Example B: Having faith does not give me the meaning of life.

If you agree or disagree with both A and B, then you are classified as a "dialectical thinker" Dialectical religious belief subscale scores of Japanese people were higher than those of British or French people, on average.
byrdnick.com
Remember the viral studies inferring some people are less likely to think visually?

Well some DECISIONS are also less likely to involve #visualization — e.g., #finance versus #recreation: doi.org/10.1080/2044...

And visual vividness predicted #risk taking: doi.org/10.1016/j.co...

#cogSci #xPhi #edu
Financial decisions were about 4 times more likely to involve analytic reasoning (45%) than visual imagery (12%).

Recreational decisions were about 3 times more likely to involve visual imagery (31%) than analytic reasoning (12%).

Mental imagery was similarly easy to generate and similarly vivid across financial and recreational decisions. The vividness of imagery during decision-making predicts risk-taking.

Conclusions
• Mental imagery seems to be a distinct decision-making mode that complements other established modes (calculation, affect, recognition).
• Its application is context-dependent:
• Recreational decisions (experiential, concrete) - imagery use is more natural, images are more vivid, and their valence more strongly predicts risk-taking willingness.
• Financial decisions (abstract, analytical) - imagery use is less frequent and less influential; calculation seems to dominate.
• Implication: Decision-making frameworks should include imagery-based processing as a mode that bridges cognition and emotion, particularly in experiential domains.
byrdnick.com
It's the final day of the 2025 Human & Artificial Rationalities conference.

First @oriplonsky.bsky.social shared experiments finding that people preferred advice that aligned with their own biases, even if the advice was from an algorithm — contrary to #algorithmAversion.

bsky.app/profile/byrd...
byrdnick.com
Ori Plonsky et al. found people liked biased advice in #expectedValue experiments.

Contrary to #algorithmAversion, people liked advice that aligned with their #biases (even if it came from an algorithm).

To learn when these data are published, follow Dr. Plonsky: scholar.google.com/citations?hl...
The presentation abstract. The setup of experiment 3. Results of experiment 3: "Biased humans like biased advice" (not human advice per se). Summary (pasted from slide):

• Experts gaining expertise by experience often give biased advice: they both choose, and recommend others to choose, options that are better most of the time
• People prefer advisors that recommend options that are better most of the time
• Biased human advisors are preferred over unbiased algorithmic advisors
• But it is simple to design even more biased algorithmic advisors that accommodate human biases and are liked more than human advisors
byrdnick.com
Can #AI aid literature reviews?

#LanguageModels can scan and summarize text WAY faster than humans, but are they any good?

Hocine Kadi et al. tried screening the #pharmacy literature.
- 94% of articles correctly identified
- Mostly neutral to positive user feedback

www.linkedin.com/in/hocine-kadi
Why AI in Pharmacovigilance? Literature Monitoring

Currently most of the steps are conducted manually leading to:
- High resource consumption including multiple profiles / person
- Acute need for coordination
- Lack of scalability - any additional journal included will incur additional cost
- Prone to errors since the processes are manual

Can it be
automated?
- Content collection? Yes
- Read to Identify insight? Yes
- Build report tracker? Yes

Automation in this context doesn't mean there is 0% manual activity; the user will still be involved - the specialist will validate inputs, select templates, etc.

Preliminary Results - Performance & Outcomes

- 40.5% False positive = Sensitivity offset
- 1.1% Objective risk of non detection

Literature Screening Summary (on a total of 89 articles)
- 94% of articles were correctly identified as relevant or non-relevant according to the client's extended criteria.
- 100% alignment with GVP criteria — no missed valid PV
cases under formal regulatory definitions.

Excluded from analysis:
- 20% of articles mention the molecule of interest only in bibliographic references (not in the main text).
- 1% of articles could not be assessed due to PDF access restrictions (e.g., password-protected files requiring manual login).

Pre-Use questionnaire: Trust in AI Scale

Trust in the Local Literature AI Tool (5-point Likert: Strongly Disagree → Strongly Agree)
- Decision support: comfortable using to aid screening decisions
- Benchmark: more effective than a novice PV reviewer
- Over-reliance caution (reverse-coded)
- Efficiency: screens/classifies quickly
- Dependability: secure relying on initial classifications 
- Reliability: identifies relevant safety content
- Consistency: produces predictable results
- Accuracy: confident it accurately classifies safety-relevant literature 

Expectations of Explanations (Likert)
- Curiosity to explore how decisions are made
- Sufficient level of detail
- Clear & understandable explanations 

Perceived Risks & Bias (Likert)
- Concern about human-like biases (familiar drugs / well-known AEs)
- Concern about missing critical safety information - 5%

Where we are?
- Tool Development
- Feedback collection (here)
- Analysis
- Tool Improvement
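Slide percentages like the 94% correct identification and 40.5% false positives above come from a confusion matrix. A minimal sketch of the standard metrics (the counts in the test are made up, not from the study):

```python
def screening_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for literature screening. In
    pharmacovigilance, sensitivity is prioritized: a false positive
    costs reviewer time, but a false negative may miss a safety signal."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }
```

This is why the slide frames a high false-positive rate as a "sensitivity offset": the tool is tuned to avoid missing valid PV cases even at the cost of extra manual review.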
byrdnick.com
How can #AI enhance #communication, #medicine, and #policy?

Darya Filatova et al. used #LLMs to correct alignment errors in a European medical regulation corpus with 25 parallel languages, yielding better results than existing machine #translation systems.

www.linkedin.com/in/delnouty

#linguistics
• Corpus Construction: Extracted and structured SmPC documents from EMA PDFs.
• Semantic Chapter Alignment: Used LaBSE embeddings to match chapters across languages.
• Sentence Alignment: Applied BERTAlign for one-to-one and multi-sentence matching.
• Refinement: Used Claude 3.5 Sonnet to correct low-similarity alignments.
• Asymmetric Strategy: English as source, 24 EMA languages as targets.
• Evaluation: Automated scoring with LLaMA 3.2 using expert and intuitive protocols.

Corpus Construction:
• Source: 432 SmPCs from the EMA database.
• Selection: 4 SmPCs randomly selected for multilingual alignment.
• Languages: 24 target languages aligned with English.
• Alignment Method: semantic similarity models and alignment algorithms.
• Sentence Pairs: ~ 700 aligned pairs per language, totaling 16,800 bilingual pairs.
• Quality Assurance:
• Manual verification and semi-automatic refinement.
• Ensured high-quality correspondence between source and target.
• Purpose: Used for alignment evaluation and enhancement experiments.

Comparison of Translation Approaches across Metrics

Key Achievements:
• Built a high-quality aligned corpus of SmPCs in 25 EMA languages.
• Applied BERTAlign and LLMs for robust sentence alignment and correction.
• Achieved superior EN-FR translation performance via context-aware strategies.
• Demonstrated the value of domain-specific corpora in regulatory NLP.
Impact:
• Reduces time and cost of multilingual regulatory documentation.
• Supports pharmaceutical market expansion across Europe.
Future Work:
• Improve alignment for low-performing languages (e.g., BG, MT, HR, IS, ET, EL).
• Develop a custom machine translation engine tailored to EMA regulatory content.
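A minimal sketch of the chapter-matching step described above: score candidate pairs by cosine similarity and greedily match. The real pipeline uses LaBSE sentence embeddings and BERTAlign; here stand-in vectors and a greedy matcher only illustrate the idea (function names and the threshold are mine, not the authors'):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def align_chapters(src_vecs, tgt_vecs, threshold=0.5):
    """Greedy one-to-one chapter matching by cosine similarity.
    In the actual pipeline the vectors would be multilingual (LaBSE)
    embeddings; any numeric vectors work for this sketch."""
    pairs, used = [], set()
    for i, s in enumerate(src_vecs):
        scored = [(cosine(s, t), j) for j, t in enumerate(tgt_vecs) if j not in used]
        if not scored:
            break
        score, j = max(scored)
        if score >= threshold:  # low-similarity pairs go to LLM refinement instead
            pairs.append((i, j, score))
            used.add(j)
    return pairs
```

Pairs falling below the threshold are exactly the "low-similarity alignments" the talk says were handed to an LLM for correction.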
byrdnick.com
Interested in #computerScience, #decisionScience, *and* #philosophy?

@mircomusolesi.bsky.social's keynote was for you.

Their Machine Intelligence Lab has been studying cognitive biases, moral decision-making, #philosophyOfScience, and more.

#cogSci #AI #RL #psychology #economics #ethics #morality
Moral Decision-Making

Images of Aristotle, Jeremy Bentham, Immanuel Kant

Citation: Elizaveta Tennant, Stephen Hailes and Mirco Musolesi. Modeling Moral Choices in Social Dilemmas with Multi-agent Reinforcement Learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023). Macao, August 2023.
Tennant, E., Hailes, S., & Musolesi, M. (2025). Hybrid Approaches for Moral Value Alignment in AI Agents: A Manifesto (No. arXiv:2312.01818). arXiv. https://doi.org/10.48550/arXiv.2312.01818
Macmillan-Scott, O., & Musolesi, M. (2024). (Ir)rationality and cognitive biases in large language models. Royal Society Open Science, 11(6), 240255. https://doi.org/10.1098/rsos.240255
Macmillan-Scott, O., & Musolesi, M. (2025). (Ir)rationality in AI: State of the Art, Research Challenges and Open Questions. Artificial Intelligence Review, 58(11), 352. https://doi.org/10.1007/s10462-025-11341-4
byrdnick.com
How can #philosophy improve #banking?

Loan decisions are often automated and #AI chatbots are increasingly used to "explain" the decisions.

So Christine Howes et al. are studying how to improve #LLM counterfactual reasoning — Socratic dialogue helped?

www.researchgate.net/profile/Chri...

#cogSci
Counterfactual reasoning

The process of considering how events might have turned out differently if conditions had been different
- central in human language and thought
- contrastive (why event X occurs rather than some other event Y)
- actionable (what can be done to change the outcome)

Examples of counterfactual explanations (CFEs)
- "if you had been two years younger, you would get the loan"
- "when you become two years older, you will get the loan"

Method
• Simulated credit approval chatbot
• Embedded algorithms define eligibility
• Collected responses → annotated for CFEs
• Alignment measured against human notions of actionability Prompt
You are a chatbot deployed by a bank to help customers get credit from the bank.

Credit is granted if the following condition concerning the applicant is met:

<Algorithm>

If the customer is currently not eligible, but the customer could potentially become eligible through a change in circumstances, you communicate what such a change in circumstances would look like.

Follow-up experiments

Hypothesis 1: Too implicit actionability cue in prompt
- replace "what such a change in circumstances would look like" with "what the customer would need to do to get credit" → same pattern of results

Hypothesis 2: Positivity bias
- add "or monthly_income >= 2000"; user income is €1800 → improves results in monotonicity condition (GPT3.5 still misaligned in 40% of cases). No effect on causal dependencies

Hypothesis 3: Generation problem
- Socratic "elenchus" follow-up questions → alignment after questioning is perfect
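A minimal sketch of how a contrastive, actionable counterfactual explanation can be computed for a disjunctive threshold rule like the one in Hypothesis 2 (e.g., monthly_income >= 2000). The function name and rule encoding are illustrative, not from the talk:

```python
def counterfactual_explanation(applicant, thresholds):
    """thresholds: attribute -> minimum required value; meeting any one
    attribute grants credit (disjunctive rule). Returns None if the
    applicant is already eligible, else the smallest gap to eligibility,
    i.e., the most actionable counterfactual."""
    gaps = {}
    for attr, minimum in thresholds.items():
        if applicant.get(attr, 0) >= minimum:
            return None  # eligible: no counterfactual needed
        gaps[attr] = minimum - applicant[attr]
    attr = min(gaps, key=gaps.get)  # smallest change wins
    return f"If your {attr} were {gaps[attr]} higher, you would get credit."
```

The experiments above test whether an LLM's free-text answers align with this kind of minimal, actionable change, rather than citing immutable attributes like age.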
byrdnick.com
Some argue #AI language models are incapable of #rationality because they violate axioms of #decisionTheory.

In the jargon, #LLMs fall prey to "Dutch books" and "money pumps".

Alina Chadwick et al. shared methods to rectify such vulnerabilities.

Follow Alina @ www.researchgate.net/profile/Alin...
Motivation: Can LLMs be Rational?

LLMs used as (rational) decision making agents.

Rational agents should have:
1. Probabilistically coherent judgments
2. Transitive preferences

Do LLMs adhere to 1 and 2?
Simon Goldstein argued that they cannot (https://philpapers.org/rec/GOLLCN):

A. Token prediction is structurally different from predicting the likelihood of an event

B. When prompting an LLM to choose between actions, the model's preferences will violate the axioms of decision theory

Summary

Rectify LLM vulnerabilities to Dutch books by ensuring probabilistic coherence via
1. Normalization pipeline
2. Quadratic program

Rectify LLM vulnerabilities to money pumps by leveraging voting rules
• Introduced (x-)IMDC as a method to calculate and explain a transitive ranking
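A money pump is just an intransitive preference cycle: an agent who prefers a over b, b over c, and c over a will pay to trade around the circle forever. A minimal sketch of the diagnostic (the authors' (x-)IMDC repair method is not reproduced here; this only detects 3-cycles):

```python
from itertools import permutations

def has_money_pump(prefs):
    """prefs: dict (a, b) -> True if a is strictly preferred to b.
    Returns True if any three options form a preference cycle
    a > b > c > a, making the agent exploitable as a money pump."""
    options = {x for pair in prefs for x in pair}
    for a, b, c in permutations(options, 3):
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            return True
    return False
```

Rectification then amounts to replacing the cyclic pairwise judgments with a transitive ranking, which is what the voting-rule approach above provides.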
byrdnick.com
Like humans, #AI language models are influenced by the number and order of response options.

Jonathan Erhardt & Michael Messerli shared a method that reduced order effects.

Like humans (👇), indifference was preferred when it was an option.

Jonathan's on #LinkedIn: www.linkedin.com/in/jonathan-...
Measuring Preferences: Improvements

Our attempt at improving the method:

1. Create 2 data sets to test preferences: one with item pairs where an LLM plausibly doesn't have strong preferences and one with items where an LLM plausibly does have strong preferences.

2. Offer the model a third "I am indifferent" option.

3. Use the token probability of the choice token as a proxy for the strength of a preference.
"B": 54.85, "A": 44.77 (Option A: seeing a meteor shower, Option B: seeing a lunar eclipse)

Table 1. Evaluation: Neutral vs Alignment-Biased (LLAMA 3 70b)

Table 2. Neutral vs Alignment-Biased with Indifference Option (LLAMA 3 70b)
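A minimal sketch of the aggregation such a method implies: average the choice-token probabilities across both prompt orderings so position effects cancel, then read off the strongest option. The data and function name are illustrative; the authors' exact pipeline is not shown:

```python
def preference_strength(runs):
    """runs: one dict per prompt ordering, mapping option label ->
    probability of that option's choice token. Averaging across
    orderings (A/B swapped) cancels order effects; an explicit
    'indifferent' label absorbs weak preferences."""
    labels = runs[0].keys()
    avg = {k: sum(run[k] for run in runs) / len(runs) for k in labels}
    best = max(avg, key=avg.get)
    return best, avg[best]
```

If "indifferent" wins the average, the model plausibly has no stable preference at all, which is the case the two test data sets above are designed to separate.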
byrdnick.com
Can the people who use #AI for #mentalHealth counseling get an approximate #diagnosis?

Yuriy Mikheev found a customized #LLM generated depression scores that correlated strongly with Beck #Depression Inventory (BDI) scores (r = 0.76, p < 0.001).

Find Yuri on @orcid.org at orcid.org/0000-0002-76...
Methodology & Study Design

Participants & Design
• 97 recruited → 70 complete data (72% completion rate)
• Demographics: Mean age 31.4 years, 67% female
• Randomized order: BDI-II and ChatGPT interview via Telegram bot

ChatGPT-4 Protocol
• Empathic, neutral tone with open-ended questions per BDI-II item
• Autonomous scoring using detailed BDI-II manual guidelines
• Safety protocols: Crisis helpline information for suicidal ideation

Statistical Analysis Framework
• Primary: Pearson correlation (target r ≥ 0.70)
• Secondary: Linear regression, Bland-Altman agreement analysis
• Tools: Python (pandas, NumPy, SciPy, scikit-learn)

Sample Characteristics & Score Distributions

Final Sample (N = 70)
• Completion rate: 72% from initial recruitment
• Age: Mean 31.4 years (SD = 9.2), 67% female
• Recruitment: Social media and university mailing lists

Primary Finding - Strong Convergent Validity

Clinical Significance
• Significantly exceeds target threshold of r ≥ 0.70
• Large effect size by conventional standards
• Comparable to established correlations: BDI-II vs Hamilton (r = 0.71), BDI-II vs PHQ-9 (r = 0.77-0.84)

Compelling evidence that conversational LLM interviews can elicit clinically meaningful symptom information aligned with established screening criteria.

Agreement Analysis & Clinical Utility

Bland-Altman Agreement Results
• Mean difference: -0.99 points (ChatGPT tends to underestimate)
• 95% limits of agreement: -11.54 to +9.56 points
• Clinical interpretation: ≤ 10 points for most participants

Clinical Utility Assessment
• Suitable for: Preliminary screening, triage, first-line assessment
• Caution needed: individual-level substitution decisions
• Recommendation: Confirm critical decisions with human-administered measures

Error Pattern Analysis
• Underestimations: Vague symptom descriptions, minimized emotional expression
• Overestimations: Expressive narratives, dramatic language patterns
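The agreement figures above (mean difference -0.99, limits -11.54 to +9.56) follow from the standard Bland-Altman computation: mean of the paired differences ± 1.96 times their standard deviation. A minimal stdlib sketch (toy data, not the study's):

```python
import statistics

def bland_altman(method_a, method_b):
    """Bland-Altman agreement between two measurement methods:
    mean paired difference and the 95% limits of agreement
    (mean difference ± 1.96 * SD of the differences)."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return mean_diff, (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)
```

A negative mean difference, as in the study, means the first method (here, the ChatGPT interview) tends to score lower than the second (the BDI-II).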