Nick Byrd, Ph.D.
@byrdnick.com
3.2K followers 520 following 14K posts
I study how to improve decisions and well-being at @GeisingerCollege.bsky.social. 🎓 gScholar: shorturl.at/uBDPW ▶️ youtube.com/@ByrdNick 👨‍💻 psychologytoday.com/us/blog/upon-reflection 📓 byrdnick.com/blog 🎙️ byrdnick.com/pod
byrdnick.com
Is solipsism measurable?

Seven studies developed a self-report scale of how much one doubts that others exist (beyond one's own mind).

It correlated with loneliness, social disconnection, aggression, and problematic gaming.

doi.org/10.1016/j.pa...

#xPhi #psychometrics #psych
Is there anyone else out there? A measure of psychological solipsism
For centuries philosophers have debated solipsism, the idea that people cannot prove that anything exists outside of their own minds. However, there h…
doi.org
byrdnick.com
"younger colleagues ...look down on those who do not deposit research data.... 'if you say ‘data is available upon request,’ they take it as a #middleFinger. ...it would be so easy to share the data—it takes just as much energy to write that statement.'”

ir.library.illinoisstate.edu/fpml/268
Skaggs, L., Scott, R., & Cilento, C. (2026). Not Just Monetary: Arts and Humanities Scholars’ Perspectives on the Costs of Open Access Publishing. College & Research Libraries. https://ir.library.illinoisstate.edu/fpml/268
byrdnick.com
🤬 “Among articles stating that data was available upon request, only 17% shared data upon request. … Results replicate those found elsewhere: data is generally not available upon request, and promissory Data Availability Statements are typically not adhered to.”

#openScience #philSci #ethics #psych
ianhussey.mmmdata.io
My article "Data is not available upon request" was published in Meta-Psychology. Very happy to see this out!
open.lnu.se/index.php/me...
LnuOpen | Meta-Psychology
open.lnu.se
byrdnick.com
👆Yet another AI paper finds "combining [intuitive and reflective] decision modalities through a separate metacognitive function allow[ed] for higher decision quality with less resource consumption compared to employing only one of the two modalities."

doi.org/10.1038/s443...
The first page of the paper in the NPJ AI journal.
byrdnick.com
Proud of Vahid's use of #computational #CogSci to identify and compare #reasoning errors in #Reddit users and communities.

He's presenting it at an #AI + #decisionSci workshop at #CMU : www.cmu.edu/ai-sdm/r...

Follow him for alerts about this and more: www.researchgate.net...
Beyond Fact-Checking: Empowering Flexible Human-AI Teams to Detect & Counter Online Misinformation

Vahid Ashrafi, Nick Byrd, and Jordan W. Suchow

Misinformation is persistent: From pandemic conspiracies to election denial, false claims spread faster and farther than facts, threatening health and democracy.

Current defenses fall short: Fact-checking and moderation focus on sentiment but overlook the cognitive errors that drive spread. Biases like confirmation bias, emotional reasoning, and exaggerated thinking distort judgment and amplify viral narratives.

We detect and quantify 419 cognitive errors in Reddit users' text, building "cognitive fingerprints" that reveal reasoning flaws behind misinformation.

Impact: These insights enable flexible human-AI teams to anticipate, detect, and counteract misinformation.

Abstract. [Vahid constructed] a comprehensive knowledge base of 419 cognitive errors—including cognitive biases and logical fallacies—grounded in empirical findings from psychology and cognitive science. We then trained GPT-4.1, a state-of-the-art language model, to identify potential examples of these cognitive errors in users' posts. ... The robustness of AI-generated insights was rigorously validated by human annotators, achieving high reliability (Spearman’s ρ = 0.86). ... we propose actionable solutions such as targeted cognitive interventions to educate users about common reasoning errors, AI-driven moderation systems that identify problematic reasoning patterns rather than merely flagging false information, and personalized cognitive profiling tools to proactively identify at-risk individuals and communities.

The NSF AI Institute for Societal Decision Making (NSF AI-SDM) sponsors the participation of selected speakers and students in an annual workshop on Human-AI Complementarity for Decision Making. Human-AI Complementarity, defined as the condition in which humans + AI working together results in better decisions than humans or AI working alone, is a broad goal pursued in several projects of the NSF AI-SDM.

In 2025, we will focus on how to create flexible Human-AI teams to achieve complementarity. This theme refers to the interdisciplinary study of how to design and deploy AI systems in ways that are dynamically aligned with human values, robust to unexpected behavior, and safe even under failure modes. It encompasses short-term concerns about deployed systems (e.g., fairness, robustness, interpretability, and misuse) and long-term concerns about advanced general AI (AGI) that could have large-scale societal impacts if not aligned with human interests.
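The reliability figure in the abstract (Spearman's ρ = 0.86 between model and human annotators) is a rank correlation. A minimal stdlib sketch of that statistic, assuming no tied ranks (the authors' actual validation code and tie handling are not shown here):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    No tie correction, for brevity."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Perfectly monotone agreement gives ρ = 1, perfect disagreement ρ = -1; values like 0.86 indicate strong but imperfect agreement.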
byrdnick.com
How can #AI and #cognitiveScience improve #healthcare?

We got some answers from the #NudgesInHealthcare Symposium at #UPenn.

Check out this summary of some themes in the write-up below:

ldi.upenn.edu/our-wo...

#medicine #psych #econ #compSci #tech #LLM #edu #bioethics #xPhi
A poster presentation at UPenn with Nick Byrd and a poster about antibiotic resistance in the foreground.
Ashley West, PhD, Director of Behavioral Design at Lirio (left), discusses her poster, “Designing an ‘At Home’ Digital Health Intervention for Supporting Chronic Condition Lifestyle Management,” with LDI Senior Fellow Renée Betancourt, MD, an Associate Professor of Clinical Family Medicine and Community Health at the Perelman School of Medicine.
A panel discussion featuring Adam Rodman, Carissa Kathuria, Craig Joseph, and Kenrick Caito, moderated by Sri Adusimalli.
A panel discussion featuring Kim Waddell, Meeta Kerlin, Zahera Farhan, Sunita Desai, and Hayley Belli.
byrdnick.com
Thousands of people in a dozen countries thought reflective reasoning was usually the best way to make a decision in ordinary dilemmas.

Runners up were intuition, friends' advice, and the wisdom of a crowd (in that order).

doi.org/10.1098/rspb...

#cogSci #epistemology #xPhi
About the "everyday dilemmas". Descriptive statistics about the data. The descriptions of intuition, deliberation, friends' advice, and wisdom of a crowd. Results
byrdnick.com
Do reflection test solutions actually involve reflection?

Our think-aloud studies found they usually do (doi.org/10.14264/0f1...), but Ryan Jesson found that solution-prompting insight is often unconscious or spontaneous.

doi.org/10.14264/0f1...

#ProcessTracing #psychometrics
Participants and data Procedure Results (1 of 2) Results (2 of 2)
byrdnick.com
I sometimes use 'nations' instead of 'countries' to allow a post to satisfy character limits. However, this post had enough spare characters for the latter. So I'm doubly without excuse.

Thanks for sharpening our thinking!
byrdnick.com
Might the concept of "good judgment" vary by framing or social roles?

Five studies of four nations (🇺🇸🇨🇦🇬🇧🇨🇳) found some words and roles were more associated with "rational" than "reasonable" (and vice versa).

doi.org/10.1162/opmi...

#CogSci #xPhi #linguistics #dataViz
Descriptive statistics about the samples "Figure 1. Top ten most frequent co-occurrence of Adjectives with ‘rational’ (blue) and ‘reasonable’ (orange) when asked to describe most important characteristics of a person showing sound judgment (Study 1a) / good judgment in a challenging situation (Study 1b). Adjectives are ordered from those most associated with ‘rational’ to those with ‘reasonable.’ Dumbbell nodes represent the percentage of each adjective’s co-occurrence relative to the sum of independent occurrences of each pair of terms." "Figure 2. Qualities attributed to rational and reasonable persons in Study 1. Color-coded adjectives reflect Analytical, Moral, and Inner Fortitude items (Study 1a)/Agency and Communion factors (Study 1b). Top panels: Estimates from linear mixed model with responses to all characteristics nested in participants, with target order (rational vs. reasonable) as a covariate and false discovery rate correction for multiple testing. Dashed vertical line delineates effects 1 unit above midpoint of the 1–7 scale in Study 1a / half a unit above the midpoint of the 1–5 scale in Study 1b. Bottom panels: Pearson’s correlations and 95% CIs of average scores across items making up each factor. ***p < .001, **p < .01, *p < .05." "Figure 3. Top Panel: Preference proportions for reasonable vs. rational agents in social roles. Displays proportions (with 95% CI) derived from logits in generalized mixed models. The dashed line at .50 indicates parity; above this, preference leans towards reasonable agents, and below, towards rational agents. Bottom Panel: Necessity ratings for rationality and reasonableness by role (1–5 rating scale). Shows estimated means and 95% CI. Ratings indicate moderate to high necessity (3–4) for both rationality and reasonableness across rule-based (on the right) and holistic roles (on the left)."
byrdnick.com
Thanks to @baptistejacquet.bsky.social and all supporters of this 4-day HYBRID conference with presenters on many continents!

I'm tired of waking up every day before 3:00 a.m. for this event, but #accessible and #sustainable intl. #conferencing still beats the alternative!

doi.org/10.11647/OBP...
Alt: Neil Patrick Harris clapping his hands and expressing admiration in front of a pink background.
byrdnick.com
Closing my #HAR2025 experience was Laura Martignon on some of the apps her team have developed to help people see statistical patterns or #risk in repeated choices.

You can find out about some of these apps in the paper from #HAR2023:
doi.org/10.1007/978-...

🔓 www.researchgate.net/publication/...
Fig. 5. A fact box designed by the Harding Center and used by the largest health insurance company in Germany, the AOK, for communicating information on the risk reduction caused by regular screening.
Fig. 6. This icon array, unsorted (left) and sorted (right), represents 100 people, diseased or not diseased, who are tested as to whether or not they are HIV positive.
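Icon arrays like Fig. 6 make Bayesian quantities visible as counts. A minimal sketch of the underlying natural-frequency arithmetic (the function name and the numbers in the test are illustrative, not the Harding Center's):

```python
def positive_predictive_value(n, prevalence, sensitivity, specificity):
    """Natural-frequency reasoning: of n people, how many who test
    positive actually have the disease? Icon arrays display these
    counts directly instead of conditional probabilities."""
    diseased = n * prevalence
    true_positives = diseased * sensitivity
    false_positives = (n - diseased) * (1 - specificity)
    return true_positives / (true_positives + false_positives)
```

At low prevalence, even a fairly accurate test yields mostly false positives, which is exactly the pattern icon arrays are designed to reveal.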
byrdnick.com
How might #meditation impact mental habits?

At #HAR2024, Lachaud and Louis found a 10-minute #mindfulness exercise may have impacted cognitive rigidity (compared to a podcast about mindfulness): doi.org/10.1007/978-...

At #HAR2025, they found it also impacted confusion about plural noun agreement.
Agreement success was higher in the mindfulness group than the podcast group. Across nouns, verbs, and adjectives, there was an omnibus difference between the two conditions. However, the omnibus difference seems to have been driven by a group difference in verbs only.

Discussion
• A brief mindfulness session could favor cognitive flexibility measured in a grammar task
• Not enough evidence for a definitive answer
• Opens research on the Einstellung effect to a grammar paradigm
byrdnick.com
Are Eastern people more accepting of contradictions than Western people?

Hiroshi Yama found Japanese and Chinese people were NOT more accepting of contradiction on all measures (contrary to influential work from Peng & Nisbett).

More on *religious* contradiction below:

#culture #logic #psychology
3. General or religious contradiction

Dialectical thinking Measure
- 10 pairs of opinions which were opposite each other. Example:
A) I think that it is good to accept foreign cultures, to be part of a nation which responds to the globalizing world.
B) I don't think it is good to accept foreign cultures, because our traditional folk customs and cultures are broken.
(7-point scale)

Fig. 1 The mean score of Dialectical Self Scale (DSS) by country and education level (left: Japanese > Chinese > British) and the mean dialectical *thinking* score for each group (right: Japanese < British = Chinese).

(1) The results of previous studies were replicated using DSS, but the difference between Japanese and Chinese was added.
(2) The results of DT score did *not* support the hypothesis of Peng and Nisbett (1999)

(So Easterners probably do NOT accept contradictions more than Westerners. Easterners may just see the self as involving more contradiction than Westerners.)

Questionnaires

Religious beliefs (Kaneko & Watabe, 2003), 6-point scale (26 items)
-- Subscales: Pro-religiosity, divine protection, and retribution. 
-- Example A: Having faith gives me the meaning of life.

Anti-religious beliefs, 6-point scale: 26 items were made so that each item is opposite to that of the religious belief questionnaire
- Example B: Having faith does not give me the meaning of life.

If you agree or disagree with both A and B, then you are classified as a "dialectical thinker" Dialectical religious belief subscale scores of Japanese people were higher than those of British or French people, on average.
byrdnick.com
Remember the viral studies inferring some people are less likely to think visually?

Well some DECISIONS are also less likely to involve #visualization — e.g., #finance versus #recreation: doi.org/10.1080/2044...

And visual vividness predicted #risk taking: doi.org/10.1016/j.co...

#cogSci #xPhi #edu
Financial decisions were about 4 times more likely to involve analytic reasoning (45%) than visual imagery (12%).

Recreational decisions were about 3 times more likely to involve visual imagery (31%) than analytic reasoning (12%).

Mental imagery was similarly easy to generate and similarly vivid across financial and recreational decisions. The vividness of imagery during decision-making predicts risk-taking.

Conclusions
• Mental imagery seems to be a distinct decision-making mode that complements other established modes (calculation, affect, recognition).
• Its application is context-dependent:
• Recreational decisions (experiential, concrete) - imagery use is more natural, images are more vivid, and their valence more strongly predicts risk-taking willingness.
• Financial decisions (abstract, analytical) - imagery use is less frequent and less influential; calculation seems to dominate.
• Implication: Decision-making frameworks should include imagery-based processing as a mode that bridges cognition and emotion, particularly in experiential domains.
byrdnick.com
It's the final day of the 2025 Human & Artificial Rationalities conference.

First @oriplonsky.bsky.social shared experiments finding that people preferred advice that aligned with their own biases, even if the advice was from an algorithm — contrary to #algorithmAversion.

bsky.app/profile/byrd...
byrdnick.com
Ori Plonsky et al. found people liked biased advice in #expectedValue experiments.

Contrary to #algorithmAversion, people liked advice that aligned with their #biases (even if it came from an algorithm).

To learn when these data are published, follow Dr. Plonsky: scholar.google.com/citations?hl...
The presentation abstract. The setup of experiment 3. Results of experiment 3: "Biased humans like biased advice" (not human advice per se). Summary (pasted from slide):

• Experts gaining expertise by experience often give biased advice: they both choose, and recommend others to choose, options that are better most of the time
• People prefer advisors that recommend options that are better most of the time
• Biased human advisors are preferred over unbiased algorithmic advisors
• But it is simple to design even more biased algorithmic advisors that accommodate human biases and are liked more than human advisors
byrdnick.com
Can #AI aid literature reviews?

#LanguageModels can scan and summarize text WAY faster than humans, but are they any good?

Hocine Kadi et al. tried screening the #pharmacy literature.
- 94% of articles correctly identified
- Mostly neutral to positive user feedback

www.linkedin.com/in/hocine-kadi
Why AI in Pharmacovigilance? Literature Monitoring

Currently most of the steps are conducted manually leading to:
- High resource consumption including multiple profiles / person
- Acute need for coordination
- Lack of scalability - any additional journal included will incur additional cost
- Prone to errors since the processes are manual

Can it be
automated?
- Content collection? Yes
- Read to Identify insight? Yes
- Build report tracker? Yes

Automation in this context doesn't mean there is 0% manual activity; the user will still be involved - the specialist will validate inputs, select templates, etc.

Preliminary Results - Performance & Outcomes

- 40.5% False positive = Sensitivity offset
- 1.1% Objective risk of non detection

Literature Screening Summary (on a total of 89 articles)
- 94% of articles were correctly identified as relevant or non-relevant according to the client's extended criteria.
- 100% alignment with GVP criteria — no missed valid PV
cases under formal regulatory definitions.

Excluded from analysis:
- 20% of articles mention the molecule of interest only in bibliographic references (not in the main text).
- 1% of articles could not be assessed due to PDF access restrictions (e.g., password-protected files requiring manual login).

Pre-Use questionnaire: Trust in AI Scale

Trust in the Local Literature AI Tool (5-point Likert: Strongly Disagree → Strongly Agree)
- Decision support: comfortable using to aid screening decisions
- Benchmark: more effective than a novice PV reviewer
- Over-reliance caution (reverse-coded)
- Efficiency: screens/classifies quickly
- Dependability: secure relying on initial classifications 
- Reliability: identifies relevant safety content
- Consistency: produces predictable results
- Accuracy: confident it accurately classifies safety-relevant literature 

Expectations of Explanations (Likert)
- Curiosity to explore how decisions are made
- Sufficient level of detail
- Clear & understandable explanations 

Perceived Risks & Bias (Likert)
- Concern about human-like biases (familiar drugs / well-known AEs)
- Concern about missing critical safety information - 5%

Where we are?
- Tool Development
- Feedback collection (here)
- Analysis
- Tool Improvement
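Slide percentages like the 94% correct identification and 40.5% false positives above come from a confusion matrix. A minimal sketch of the standard metrics (the counts in the test are made up, not from the study):

```python
def screening_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for literature screening. In
    pharmacovigilance, sensitivity is prioritized: a false positive
    costs reviewer time, but a false negative may miss a safety signal."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }
```

This is why the slide frames a high false-positive rate as a "sensitivity offset": the tool is tuned to avoid missing valid PV cases even at the cost of extra manual review.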
byrdnick.com
How can #AI enhance #communication, #medicine, and #policy?

Darya Filatova et al. used #LLMs to correct alignment errors in a European medical regulation corpus with 25 parallel languages, yielding better results than existing machine #translation systems.

www.linkedin.com/in/delnouty

#linguistics
• Corpus Construction: Extracted and structured SmPC documents from EMA PDFs.
• Semantic Chapter Alignment: Used LaBSE embeddings to match chapters across languages.
• Sentence Alignment: Applied BERTAlign for one-to-one and multi-sentence matching.
• Refinement: Used Claude 3.5 Sonnet to correct low-similarity alignments.
• Asymmetric Strategy: English as source, 24 EMA languages as targets.
• Evaluation: Automated scoring with LLaMA 3.2 using expert and intuitive protocols.

Corpus Construction:
• Source: 432 SmPCs from the EMA database.
• Selection: 4 SmPCs randomly selected for multilingual alignment.
• Languages: 24 target languages aligned with English.
• Alignment Method: semantic similarity models and alignment algorithms.
• Sentence Pairs: ~ 700 aligned pairs per language, totaling 16,800 bilingual pairs.
• Quality Assurance:
• Manual verification and semi-automatic refinement.
• Ensured high-quality correspondence between source and target.
• Purpose: Used for alignment evaluation and enhancement experiments.

Comparison of Translation Approaches across Metrics

Key Achievements:
• Built a high-quality aligned corpus of SmPCs in 25 EMA languages.
• Applied BERTAlign and LLMs for robust sentence alignment and correction.
• Achieved superior EN-FR translation performance via context-aware strategies.
• Demonstrated the value of domain-specific corpora in regulatory NLP.
Impact:
• Reduces time and cost of multilingual regulatory documentation.
• Supports pharmaceutical market expansion across Europe.
Future Work:
• Improve alignment for low-performing languages (e.g., BG, MT, HR, IS, ET, EL).
• Develop a custom machine translation engine tailored to EMA regulatory content.
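A minimal sketch of the chapter-matching step described above: score candidate pairs by cosine similarity and greedily match. The real pipeline uses LaBSE sentence embeddings and BERTAlign; here stand-in vectors and a greedy matcher only illustrate the idea (function names and the threshold are mine, not the authors'):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def align_chapters(src_vecs, tgt_vecs, threshold=0.5):
    """Greedy one-to-one chapter matching by cosine similarity.
    In the actual pipeline the vectors would be multilingual (LaBSE)
    embeddings; any numeric vectors work for this sketch."""
    pairs, used = [], set()
    for i, s in enumerate(src_vecs):
        scored = [(cosine(s, t), j) for j, t in enumerate(tgt_vecs) if j not in used]
        if not scored:
            break
        score, j = max(scored)
        if score >= threshold:  # low-similarity pairs go to LLM refinement instead
            pairs.append((i, j, score))
            used.add(j)
    return pairs
```

Pairs falling below the threshold are exactly the "low-similarity alignments" the talk says were handed to an LLM for correction.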
byrdnick.com
Interested in #computerScience, #decisionScience, *and* #philosophy?

@mircomusolesi.bsky.social's keynote was for you.

Their Machine Intelligence Lab has been studying cognitive biases, moral decision-making, #philosophyOfScience, and more.

#cogSci #AI #RL #psychology #economics #ethics #morality
Moral Decision-Making

Images of Aristotle, Jeremy Bentham, Immanuel Kant

Citation: Elizaveta Tennant, Stephen Hailes and Mirco Musolesi. Modeling Moral Choices in Social Dilemmas with Multi-agent Reinforcement Learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023). Macao, August 2023.
Tennant, E., Hailes, S., & Musolesi, M. (2025). Hybrid Approaches for Moral Value Alignment in AI Agents: A Manifesto (No. arXiv:2312.01818). arXiv. https://doi.org/10.48550/arXiv.2312.01818
Macmillan-Scott, O., & Musolesi, M. (2024). (Ir)rationality and cognitive biases in large language models. Royal Society Open Science, 11(6), 240255. https://doi.org/10.1098/rsos.240255
Macmillan-Scott, O., & Musolesi, M. (2025). (Ir)rationality in AI: State of the Art, Research Challenges and Open Questions. Artificial Intelligence Review, 58(11), 352. https://doi.org/10.1007/s10462-025-11341-4
byrdnick.com
How can #philosophy improve #banking?

Loan decisions are often automated and #AI chatbots are increasingly used to "explain" the decisions.

So Christine Howes et al. are studying how to improve #LLM counterfactual reasoning — Socratic dialogue helped?

www.researchgate.net/profile/Chri...

#cogSci
Counterfactual reasoning

The process of considering how events might have turned out differently if conditions had been different
- central in human language and thought
- contrastive (why event X occurs rather than some other event Y)
- actionable (what can be done to change the outcome)

Examples of counterfactual explanations (CFEs)
- "if you had been two years younger, you would get the loan"
- "when you become two years older, you will get the loan"

Method
• Simulated credit approval chatbot
• Embedded algorithms define eligibility
• Collected responses → annotated for CFEs
• Alignment measured against human notions of actionability Prompt
You are a chatbot deployed by a bank to help customers get credit from the bank.

Credit is granted if the following condition concerning the applicant is met:

<Algorithm>

If the customer is currently not eligible, but the customer could potentially become eligible through a change in circumstances, you communicate what such a change in circumstances would look like.

Follow-up experiments

Hypothesis 1: Too implicit actionability cue in prompt
- replace "what such a change in circumstances would look like" with "what the customer would need to do to get credit" → same pattern of results

Hypothesis 2: Positivity bias
- add "or monthly_income >= 2000"; user income is €1800 → improves results in monotonicity condition (GPT3.5 still misaligned in 40% of cases). No effect on causal dependencies

Hypothesis 3: Generation problem
- Socratic "elenchus" follow-up questions → alignment after questioning is perfect
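A minimal sketch of how a contrastive, actionable counterfactual explanation can be computed for a disjunctive threshold rule like the one in Hypothesis 2 (e.g., monthly_income >= 2000). The function name and rule encoding are illustrative, not from the talk:

```python
def counterfactual_explanation(applicant, thresholds):
    """thresholds: attribute -> minimum required value; meeting any one
    attribute grants credit (disjunctive rule). Returns None if the
    applicant is already eligible, else the smallest gap to eligibility,
    i.e., the most actionable counterfactual."""
    gaps = {}
    for attr, minimum in thresholds.items():
        if applicant.get(attr, 0) >= minimum:
            return None  # eligible: no counterfactual needed
        gaps[attr] = minimum - applicant[attr]
    attr = min(gaps, key=gaps.get)  # smallest change wins
    return f"If your {attr} were {gaps[attr]} higher, you would get credit."
```

The experiments above test whether an LLM's free-text answers align with this kind of minimal, actionable change, rather than citing immutable attributes like age.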
byrdnick.com
Some argue #AI language models are incapable of #rationality because they violate axioms of #decisionTheory.

In the jargon, #LLMs fall prey to "Dutch books" and "money pumps".

Alina Chadwick et al. shared methods to rectify such vulnerabilities.

Follow Alina @ www.researchgate.net/profile/Alin...
Motivation: Can LLMs be Rational?

LLMs used as (rational) decision making agents.

Rational agents should have:
1. Probabilistically coherent judgments
2. Transitive preferences

Do LLMs adhere to 1 and 2?
Simon Goldstein argued that they cannot (https://philpapers.org/rec/GOLLCN):

A. Token prediction is structurally different from predicting the likelihood of an event

B. When prompting an LLM to choose between actions, the model's preferences will violate the axioms of decision theory

Summary

Rectify LLM vulnerabilities to Dutch books by ensuring probabilistic coherence via
1. Normalization pipeline
2. Quadratic program

Rectify LLM vulnerabilities to money pumps by leveraging voting rules
• Introduced (x-)IMDC as a method to calculate and explain a transitive ranking
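A money pump is just an intransitive preference cycle: an agent who prefers a over b, b over c, and c over a will pay to trade around the circle forever. A minimal sketch of the diagnostic (the authors' (x-)IMDC repair method is not reproduced here; this only detects 3-cycles):

```python
from itertools import permutations

def has_money_pump(prefs):
    """prefs: dict (a, b) -> True if a is strictly preferred to b.
    Returns True if any three options form a preference cycle
    a > b > c > a, making the agent exploitable as a money pump."""
    options = {x for pair in prefs for x in pair}
    for a, b, c in permutations(options, 3):
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            return True
    return False
```

Rectification then amounts to replacing the cyclic pairwise judgments with a transitive ranking, which is what the voting-rule approach above provides.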
byrdnick.com
Like humans, #AI language models are influenced by the number and order of response options.

Jonathan Erhardt & Michael Messerli shared a method that reduced order effects.

Like humans (👇), indifference was preferred when it was an option.

Jonathan's on #LinkedIn: www.linkedin.com/in/jonathan-...
Measuring Preferences: Improvements

Our attempt at improving the method:

1. Create 2 data sets to test preferences: one with item pairs where an LLM plausibly doesn't have strong preferences and one with items where an LLM plausibly does have strong preferences.

2. Offer the model a third "I am indifferent" option.

3. Use the token probability of the choice token as a proxy for the strength of a preference.
"B": 54.85, "A": 44.77 (Option A: seeing a meteor shower, Option B: seeing a lunar eclipse)

Table 1. Evaluation: Neutral vs Alignment-Biased (LLAMA 3 70b)

Table 2. Neutral vs Alignment-Biased with Indifference Option (LLAMA 3 70b)
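A minimal sketch of the aggregation such a method implies: average the choice-token probabilities across both prompt orderings so position effects cancel, then read off the strongest option. The data and function name are illustrative; the authors' exact pipeline is not shown:

```python
def preference_strength(runs):
    """runs: one dict per prompt ordering, mapping option label ->
    probability of that option's choice token. Averaging across
    orderings (A/B swapped) cancels order effects; an explicit
    'indifferent' label absorbs weak preferences."""
    labels = runs[0].keys()
    avg = {k: sum(run[k] for run in runs) / len(runs) for k in labels}
    best = max(avg, key=avg.get)
    return best, avg[best]
```

If "indifferent" wins the average, the model plausibly has no stable preference at all, which is the case the two test data sets above are designed to separate.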
byrdnick.com
Can the people who use #AI for #mentalHealth counseling get an approximate #diagnosis?

Yuriy Mikheev found a customized #LLM generated depression scores that correlated strongly with Beck #Depression Inventory (BDI) scores (r = 0.76, p < 0.001).

Find Yuri on @orcid.org at orcid.org/0000-0002-76...
Methodology & Study Design

Participants & Design
• 97 recruited → 70 complete data (72% completion rate)
• Demographics: Mean age 31.4 years, 67% female
• Randomized order: BDI-II and ChatGPT interview via Telegram bot

ChatGPT-4 Protocol
• Empathic, neutral tone with open-ended questions per BDI-II item
• Autonomous scoring using detailed BDI-II manual guidelines
• Safety protocols: Crisis helpline information for suicidal ideation

Statistical Analysis Framework
• Primary: Pearson correlation (target r ≥ 0.70)
• Secondary: Linear regression, Bland-Altman agreement analysis
• Tools: Python (pandas, NumPy, SciPy, scikit-learn)

Sample Characteristics & Score Distributions

Final Sample (N = 70)
• Completion rate: 72% from initial recruitment
• Age: Mean 31.4 years (SD = 9.2), 67% female
• Recruitment: Social media and university mailing lists

Primary Finding - Strong Convergent Validity

Clinical Significance
• Significantly exceeds target threshold of r ≥ 0.70
• Large effect size by conventional standards
• Comparable to established correlations: BDI-II vs Hamilton (r = 0.71), BDI-II vs PHQ-9 (r = 0.77-0.84)

Compelling evidence that conversational LLM interviews can elicit clinically meaningful symptom information aligned with established screening criteria.

Agreement Analysis & Clinical Utility

Bland-Altman Agreement Results
• Mean difference: -0.99 points (ChatGPT tends to underestimate)
• 95% limits of agreement: -11.54 to +9.56 points
• Clinical interpretation: ≤ 10 points for most participants

Clinical Utility Assessment
• Suitable for: Preliminary screening, triage, first-line assessment
• Caution needed: individual-level substitution decisions
• Recommendation: Confirm critical decisions with human-administered measures

Error Pattern Analysis
• Underestimations: Vague symptom descriptions, minimized emotional expression
• Overestimations: Expressive narratives, dramatic language patterns
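The agreement figures above (mean difference -0.99, limits -11.54 to +9.56) follow from the standard Bland-Altman computation: mean of the paired differences ± 1.96 times their standard deviation. A minimal stdlib sketch (toy data, not the study's):

```python
import statistics

def bland_altman(method_a, method_b):
    """Bland-Altman agreement between two measurement methods:
    mean paired difference and the 95% limits of agreement
    (mean difference ± 1.96 * SD of the differences)."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return mean_diff, (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)
```

A negative mean difference, as in the study, means the first method (here, the ChatGPT interview) tends to score lower than the second (the BDI-II).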