Jesper N. Wulff
@jnwulff.bsky.social
660 followers 950 following 37 posts
Professor @AarhusUni doing research on organizational research methods and teaching deep neural networks in our MSc BI program. https://sites.google.com/view/jesperwulff/bio
Reposted by Jesper N. Wulff
lakens.bsky.social
New blog post on Gelman's recent claim that Type S and M errors are intended as a 'rhetorical tool', and on whether I was wrong to believe they were recommended more routinely in our recent preprint criticizing the idea of Type S and M errors. daniellakens.blogspot.com/2025/09/type...
Type S and M errors as a “rhetorical tool”
We recently posted a preprint criticizing the idea of Type S and M errors ( https://osf.io/2phzb_v1 ). From our abstract: “While these conce...
daniellakens.blogspot.com
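The Type S/M framing under debate above can be made concrete with a small simulation in the spirit of Gelman & Carlin's retrodesign calculation. This is a sketch with hypothetical numbers (true effect 0.1, standard error 0.3), not code from the preprint or the blog post:

```python
import random

random.seed(1)

def retrodesign(true_effect, se, sims=100_000):
    """Monte Carlo version of the Type S / Type M calculation:
    simulate estimates ~ Normal(true_effect, se), keep the significant
    ones, and ask how often their sign is wrong (Type S) and how
    exaggerated their magnitude is on average (Type M)."""
    z_crit = 1.959963984540054  # two-sided 5% critical value
    estimates = [random.gauss(true_effect, se) for _ in range(sims)]
    significant = [e for e in estimates if abs(e) > z_crit * se]
    power = len(significant) / sims
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    type_m = sum(abs(e) for e in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = retrodesign(true_effect=0.1, se=0.3)
print(power, type_s, type_m)  # low power, noticeable Type S, large exaggeration
```

With these hypothetical inputs the study is badly underpowered, so significant estimates frequently have the wrong sign and overestimate the true effect several-fold — the pattern the Type S/M framework is meant to highlight.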
Reposted by Jesper N. Wulff
joachimbaumann.bsky.social
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**.

Paper: arxiv.org/pdf/2509.08825
We present our new preprint titled "Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation".
We quantify LLM hacking risk through systematic replication of 37 diverse computational social science annotation tasks.
For these tasks, we use a combined set of 2,361 realistic hypotheses that researchers might test using these annotations.
Then, we collect 13 million LLM annotations across plausible LLM configurations.
These annotations feed into 1.4 million regressions testing the hypotheses. 
For a hypothesis with no true effect (ground truth p > 0.05), different LLM configurations yield conflicting conclusions.
Checkmarks indicate correct statistical conclusions matching ground truth; crosses indicate LLM hacking, i.e., incorrect conclusions due to annotation errors.
Across all experiments, LLM hacking occurs in 31-50% of cases even with highly capable models.
Since minor configuration changes can flip scientific conclusions from correct to incorrect, LLM hacking can be exploited to present anything as statistically significant.
jnwulff.bsky.social
What corresponds to the Z-test in this analogy? If the P-curve is the W-test then what is the Z-test?
Reposted by Jesper N. Wulff
jnwulff.bsky.social
Absolutely! I'm planning on getting MET (minimum-effect testing) into the stats curriculum in our undergrad business adm program. My favorite resource is Lakens' online book.
jnwulff.bsky.social
If it makes sense to test a hypothesis, do minimum effect testing and/or set alpha as a function of sample size.
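One concrete version of "set alpha as a function of sample size" is to standardize the threshold to a reference sample size, a rule often attributed to I. J. Good. The scaling below is one choice among several, shown only as a sketch:

```python
import math

def alpha_for_n(n, alpha0=0.05, n0=100):
    """Shrink the significance threshold as n grows, standardizing
    to a reference sample size n0: alpha_n = alpha0 * sqrt(n0 / n)."""
    return alpha0 * math.sqrt(n0 / n)

for n in (100, 1000, 10000):
    print(n, round(alpha_for_n(n), 4))
# 100 -> 0.05, 1000 -> 0.0158, 10000 -> 0.005
```

The point of such a rule is that with very large samples, tiny and practically meaningless effects clear a fixed 0.05 bar; scaling alpha down keeps the evidential standard roughly comparable across sample sizes.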
Reposted by Jesper N. Wulff
mvugt.bsky.social
"OpenAI is making “small steps that are good, but I don’t think we’re anywhere near where we need to be”, says Mark Steyvers, a cognitive science and AI researcher at UC Irvine. “It’s not frequent enough that GPT says ‘I don’t know’.”" www.nature.com/articles/d41...
Can researchers stop AI making up citations?
OpenAI’s GPT-5 hallucinates less than previous models do, but cutting hallucination completely might prove impossible.
www.nature.com
Reposted by Jesper N. Wulff
p-hunermund.com
➡️ Deadline approaching—only one month left to send in your papers and presentation proposals for #CDSM2025!

🚨 𝗖𝗮𝗹𝗹 𝗳𝗼𝗿 𝗣𝗮𝗽𝗲𝗿𝘀: 𝗖𝗮𝘂𝘀𝗮𝗹 𝗗𝗮𝘁𝗮 𝗦𝗰𝗶𝗲𝗻𝗰𝗲 𝗠𝗲𝗲𝘁𝗶𝗻𝗴 𝟮𝟬𝟮𝟱 🚨
📅 𝗡𝗼𝘃 𝟭𝟮–𝟭𝟯, 𝟮𝟬𝟮𝟱 (𝗩𝗶𝗿𝘁𝘂𝗮𝗹)
📥 Submission Deadline: 𝗦𝗲𝗽𝘁 𝟯𝟬, 𝟮𝟬𝟮𝟱
Reposted by Jesper N. Wulff
dingdingpeng.the100.ci
Ever stared at a table of regression coefficients & wondered what you're doing with your life?

Very excited to share this gentle introduction to another way of making sense of statistical models (w @vincentab.bsky.social)
Preprint: doi.org/10.31234/osf...
Website: j-rohrer.github.io/marginal-psy...
Models as Prediction Machines: How to Convert Confusing Coefficients into Clear Quantities

Abstract
Psychological researchers usually make sense of regression models by interpreting coefficient estimates directly. This works well enough for simple linear models, but is more challenging for more complex models with, for example, categorical variables, interactions, non-linearities, and hierarchical structures. Here, we introduce an alternative approach to making sense of statistical models. The central idea is to abstract away from the mechanics of estimation, and to treat models as “counterfactual prediction machines,” which are subsequently queried to estimate quantities and conduct tests that matter substantively. This workflow is model-agnostic; it can be applied in a consistent fashion to draw causal or descriptive inference from a wide range of models. We illustrate how to implement this workflow with the marginaleffects package, which supports over 100 different classes of models in R and Python, and present two worked examples. These examples show how the workflow can be applied across designs (e.g., observational study, randomized experiment) to answer different research questions (e.g., associations, causal effects, effect heterogeneity) while facing various challenges (e.g., controlling for confounders in a flexible manner, modelling ordinal outcomes, and interpreting non-linear models).
Figure illustrating model predictions. On the x-axis, the predictor: annual gross income in euros. On the y-axis, the outcome: predicted life satisfaction. A solid line marks the curve of predictions, on which individual data points are marked as model-implied outcomes at incomes of interest. Comparing two such predictions gives us a comparison. We can also fit a tangent to the line of predictions, which illustrates the slope at any given point of the curve.

A figure illustrating various ways to include age as a predictor in a model. On the x-axis, age (the predictor); on the y-axis, the outcome (model-implied importance of friends, including confidence intervals).

Illustrated are
1. age as a categorical predictor, resulting in predictions that bounce around a lot with wide confidence intervals,
2. age as a linear predictor, which forces a straight line through the data points and has a very tight confidence band, and
3. age splines, which lie somewhere in between: they smoothly follow the data but have more uncertainty than the straight line.
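The "prediction machine" workflow from the preprint above can be sketched without any special package: fit a model whose coefficients are hard to read directly, then query it for predictions, a comparison, and a slope. The data, variable names, and income values here are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: life satisfaction as a nonlinear function of income.
income = rng.uniform(10_000, 100_000, 500)
satisfaction = 2 + 1.2 * np.log(income) + rng.normal(0, 0.5, 500)

# Fit a cubic polynomial -- its raw coefficients are hard to interpret.
model = np.polynomial.Polynomial.fit(income, satisfaction, 3)

# Query the fitted model as a counterfactual prediction machine:
# 1) model-implied outcomes at incomes of interest,
# 2) a comparison (difference between two predictions),
# 3) a slope (numerical derivative, the tangent in the figure).
lo, hi = model(30_000), model(60_000)
comparison = hi - lo
eps = 1.0
slope_at_30k = (model(30_000 + eps) - model(30_000 - eps)) / (2 * eps)
print(comparison, slope_at_30k)
```

The same three queries (predictions, comparisons, slopes) apply unchanged to logistic, ordinal, or hierarchical models, which is the model-agnostic point the abstract makes; the marginaleffects package automates exactly this, with standard errors.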
Reposted by Jesper N. Wulff
theissbendixen.bsky.social
"Being Bayesian in a Frequentist World"

New post on "Bayesian dynamic borrowing" in R 📚

Link 👇
Reposted by Jesper N. Wulff
lakens.bsky.social
If you are preparing your bachelor statistics course and would like to add optional material for students to better understand statistics on a conceptual level (see topics in the screenshot) my free textbook provides a state of the art overview. lakens.github.io/statistical_...
Reposted by Jesper N. Wulff
cfiesler.bsky.social
My video about how LLMs are not search engines has led to many, MANY comments telling me that I should be using Perplexity. Some insisting that Perplexity does not hallucinate.

Out of a list of 26 papers it just provided me (in "Research" mode), only 4 were real. FOUR. An 85% hallucination rate.
Reposted by Jesper N. Wulff
scientificdiscovery.dev
TIL the original paper describing CRISPR, by Francisco Mojica, was rejected by 4 journals and took 2 years to be published
CRISPR as a microbial immune system

In 2003, Mojica wrote the first paper suggesting that CRISPR was an innate microbial immune system. The paper was rejected by a series of high-profile journals, including Nature, Proceedings of the National Academy of Sciences, Molecular Microbiology and Nucleic Acids Research, before finally being accepted by Journal of Molecular Evolution in February, 2005.[3][4]
Reposted by Jesper N. Wulff
carlbergstrom.com
Just in case there was any doubt, ChatGPT 5.0 still makes up completely random citations that don't exist and should not be used for literature search.
1. David Ackerly (UC Berkeley)

While his most-cited work is on leaf size and SLA, he also wrote explicitly about plasticity in leaf traits, including shape, in the context of ecological strategies.

Example: Ackerly (1997), “Allocation, leaf display, and growth in fluctuating light environments: A comparative study of deciduous and evergreen species” (Oecologia). This emphasizes how plasticity in leaf traits mediates adaptation to light.
Reposted by Jesper N. Wulff
carlislerainey.bsky.social
‼️Cool new paper‼️

Finds that journal data policies in psychology boost sharing statements to ~100%, but only about half of the datasets are complete, understandable, and reusable.

Open: open.lnu.se/index.php/me...
Reposted by Jesper N. Wulff
zetaof1.bsky.social
5. Most frequentist methods are just *fine* and there's no need to always go full-luxury Bayesian in every application.
jnwulff.bsky.social
Will the videos be released to a broad audience after the conference?
Reposted by Jesper N. Wulff
p-hunermund.com
Inspiring PDW on using sensitivity analysis in empirical management research. My contribution is to present the sensemakr package by Cinelli & Hazlett (2020) for observational designs. Thanks a lot to the organizers for putting this fantastic session together. #AOM2025
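The core quantity that sensemakr reports can be computed by hand from standard regression output. Below is a sketch of the robustness value formula from Cinelli & Hazlett (2020), not the package itself; the t-statistic and degrees of freedom are hypothetical:

```python
import math

def partial_r2(t, df):
    """Partial R2 of the treatment with the outcome, from its t-statistic."""
    return t**2 / (t**2 + df)

def robustness_value(t, df, q=1.0):
    """Robustness value RV_q (Cinelli & Hazlett 2020): the strength of
    association an unobserved confounder would need with both treatment
    and outcome to reduce the estimate by 100*q percent."""
    f = q * abs(t) / math.sqrt(df)
    return 0.5 * (math.sqrt(f**4 + 4 * f**2) - f**2)

# Hypothetical regression output: t = 4.2 on 783 degrees of freedom.
print(round(partial_r2(4.2, 783), 4), round(robustness_value(4.2, 783), 4))
```

A robustness value of, say, 0.14 means a confounder would need to explain about 14% of the residual variance of both treatment and outcome to wipe out the estimate — a single number that makes the sensitivity discussion concrete.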
Reposted by Jesper N. Wulff
cfiesler.bsky.social
This is fascinating: www.reddit.com/r/OpenAI/s/I...

Someone “worked on a book with ChatGPT” for weeks and then sought help on Reddit when they couldn’t download the file. Redditors helped them realize ChatGPT had just been roleplaying/lying and there was no file/book…
Reposted by Jesper N. Wulff
khoavuumn.bsky.social
Using time series graphs to make causal claims be like
jnwulff.bsky.social
Thick vs thin causality
dingdingpeng.the100.ci
It's a distinction from @kph3k.bsky.social's Genetic Lottery. I used it before even without talking about genetics; it catches a lot of the misconceptions that students have about what it means to call something a cause. Those misconceptions also led to a blog post: www.the100.ci/2024/06/26/s...
Thick and Thin Causation In the course of ordinary social science and medicine, we are quite comfortable calling something a cause, even when (a) we don’t understand the mechanisms by which the cause exerts its effects, (b) the cause is probabilistically but not deterministically associated with effects, and (c) the cause is of uncertain portability across time and space. “All” that is required to assert that you have identified a cause is to demonstrate evidence that the average outcome for a group of people would have been different if they had experienced X instead of Not-X. And the most convincing evidence that you know what might have been is to assign people randomly to X or Not-X. (The word “all” is in scare quotes here, because as any scientist of human behavior and society knows, actually isolating the variable of interest from the web of potential confounds, so that one can make an inference about causation, turns out to be an incredibly difficult and delicate operation.) I’m going to call this a “thin” model of causation.

We can contrast the “thin” model of causation with the type of “thick” causation we see in monogenic genetic disorders or chromosomal abnormalities. Take Down’s syndrome, for instance. Down’s syndrome is defined by a single, deterministic, portable cause. To have three copies of chromosome 21, instead of two, is the necessary, sufficient, and sole cause of Down’s syndrome. The causal relationship between having three copies of chromosome 21 and Down’s is one-to-one, with the result that forward and reverse inferences work equally well. The cause of Down’s is chromosome 21 trisomy; the effect of chromosome 21 trisomy is Down’s. Having three copies of chromosome 21 doesn’t raise your probability of having Down’s; it is deterministic of the condition.
And this causal relationship operates as a “law of nature,” in the sense that we expect the trisomy-Down’s relationship to operate more or less in the same way, regardless of the social mili…