Lightnews — Scholar-powered news

Gordon Forbes @gforb.bsky.social · Jul 30

How is AI going to make this all easier?

And relatedly, how can we stop the increase in productivity from AI leading to an overwhelming amount of methodologically questionable research?

1

Gordon Forbes @gforb.bsky.social · Jul 8

To me this is what age as categorical means - e.g. age 0-18, 19-65, 65+.

I would describe using individual integer ages (i.e. 1, 2, 3, 4, 5, 6, ...) as discrete age.

The 'big category' approach is used worryingly often - sometimes due to restrictions on the data.
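
To make the distinction concrete, here is a minimal Python sketch of collapsing integer (discrete) ages into the coarse bands mentioned above. The function name, the labels, and the choice to place age 65 in the middle band are assumptions for illustration; the post's "65+" overlaps with the 19-65 band, so the top band is written here as "66+".

```python
from bisect import bisect_left

# Hypothetical band definitions, loosely following the post's example.
# Assumption: age 65 belongs to the middle band, so the top band starts at 66.
UPPER_BOUNDS = [18, 65]            # inclusive upper bounds of the first two bands
LABELS = ["0-18", "19-65", "66+"]


def age_band(age: int) -> str:
    """Map a discrete integer age to a coarse categorical band."""
    # bisect_left returns how many upper bounds are strictly below `age`,
    # which is exactly the index of the band the age falls into.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


discrete_ages = [3, 18, 19, 40, 65, 66, 90]
categorical_ages = [age_band(a) for a in discrete_ages]
```

Seven distinct discrete ages collapse to three categories, which is the information loss the 'big category' approach accepts.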

1

Reposted by Gordon Forbes

Giles @gdeejay.bsky.social · Jul 1

Hey #rstats,

What's your rule for splitting R scripts that form part of a wider analysis pipeline / project?

I usually write a single script which includes sections for each step from data cleaning to the final results, but it can become unwieldy when the script becomes long.
...

13 8 26

Reposted by Gordon Forbes

Brennan Kahan @brennankahan.bsky.social · Jul 1

New paper posted on Arxiv: "When do composite estimands answer non-causal questions?"

This can happen more often than you think, and can have a dramatic impact on trial results (e.g. a false-positive rate of almost 90%)

arxiv.org/abs/2506.22610 @timpmorris.bsky.social

When do composite estimands answer non-causal questions?

Under a composite estimand strategy, the occurrence of the intercurrent event is incorporated into the endpoint definition, for instance by assigning a poor outcome value to patients who experience th...

arxiv.org

6 12

Gordon Forbes @gforb.bsky.social · May 22

On tabular health data, time and time again, I see linear (or generalised linear) models perform as well as or better than machine learning algorithms that avoid linearity assumptions.

I am surprised by this, as the linearity assumption is unlikely to be true.

Does anyone else see this? Why is this?

The optimal machine learning model was linear regression

1

Gordon Forbes @gforb.bsky.social · May 9

Simpson's paradox

Jarrett Byrnes @jebyrnes.bsky.social · May 9

Could. Not. Resist.

Simpson’s paradox with the Simpsons graph

2

Reposted by Gordon Forbes

Julia M. Rohrer @dingdingpeng.the100.ci · Apr 9

Georgi Baklicharov asks: can treatment effect testing in trials with intercurrent events be nearly assumption-free? #EuroCIM2025

1 3 5

Reposted by Gordon Forbes

Rory Lawless @rorylawless.com · Mar 31

I wrote a short post on the benefits I have found using DuckDB and duckplyr in my day to day workflow. rorylawless.github.io/posts/r-duck... #rstats #data #duckdb #databs

R, DuckDB and Me

Over the past year, DuckDB has gradually become an important part of my data science workflow - at first clumsily, then seamlessly. I don’t typically work with large datasets, however, integrating Duc...

rorylawless.github.io

5 6 27

Gordon Forbes @gforb.bsky.social · Mar 28

Hard disagree! It is acceptable and probably preferable for a group to write a paper without understanding the details of each other's work.

E.g. I want to be able to write "the model was estimated with restricted maximum likelihood" in a stats section without explaining REML to my collaborators.

1 1

Reposted by Gordon Forbes

Robert (Bob) Kubinec @rmkubinec.bsky.social · Mar 25

Happy to see that ordered beta regression reached 100 citations on Google Scholar!

The model has citations from work in climate science, ecology, medicine, psychology, & political science, just to name a few.

Thanks to all of you for using ordbetareg (or glmmTMB)!!

#rstats

4 30

Reposted by Gordon Forbes

Julia M. Rohrer @dingdingpeng.the100.ci · Mar 21

I'm working with data that I'm not allowed to share -- I'd like to generate synthetic data so that others at least have something that they can run my code on!

Any pointers to tutorials, favorite packages etc.?

a man wearing a mask is playing a keyboard with the words love synths written above him

media.tenor.com

15 6 29

Reposted by Gordon Forbes

Julia Silge @juliasilge.com · Mar 19

Check out my new screencast, where I walk through how I use #Positron for #rstats package development work. I decided to release a new version of an R package to CRAN ✨live✨ this time around!

youtu.be/uL3NZQIMrpk

Release an R package with Positron

YouTube video by Julia Silge

youtu.be

4 28 110

Reposted by Gordon Forbes

Norm Matloff (你有冇諗清楚呀?) @matloff.bsky.social · Mar 13

In coursework, the contrast between ridge regression and the LASSO is really emphasized. After all, the latter actually does feature selection, by virtue of having a sparse solution to the minimum l1 problem, versus ridge's nonsparse solution in l2, pretty cool. 1/2
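
A small Python sketch of the sparse versus non-sparse contrast, under a simplifying assumption that is not in the post: the design matrix has orthonormal columns, so both penalised solutions have a closed form in terms of the ordinary least squares coefficients. The function names, coefficient values, and penalty value are made up for illustration.

```python
def soft_threshold(b: float, lam: float) -> float:
    """LASSO solution for one coefficient under an orthonormal design.

    Minimises 0.5 * (b - beta)**2 + lam * abs(beta).
    """
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b: float, lam: float) -> float:
    """Ridge solution for one coefficient under an orthonormal design.

    Minimises 0.5 * (b - beta)**2 + 0.5 * lam * beta**2.
    """
    return b / (1.0 + lam)  # proportional shrinkage, never exactly zero


# Hypothetical OLS coefficients and penalty strength.
ols_coefficients = [3.0, -0.4, 0.2, -2.0]
penalty = 0.5

lasso_coefficients = [soft_threshold(b, penalty) for b in ols_coefficients]
ridge_coefficients = [ridge_shrink(b, penalty) for b in ols_coefficients]
```

With these values the LASSO zeroes the two small coefficients, while ridge shrinks all four and leaves every one non-zero.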

1 3 9

Gordon Forbes @gforb.bsky.social · Mar 13

I totally agree that, theoretically, it makes no sense to shrink parameters to zero.

In practice, if it means a model can be applied without collecting mostly irrelevant data, this can be a huge win. Especially in health when that extra data can involve invasive or expensive tests.

1 1 2

Gordon Forbes @gforb.bsky.social · Mar 12

R packages for consort diagrams

Crystal Lewis @cghlewis.bsky.social · Mar 12

Lots of Word Doc templates out there if you just search "consort diagram template".

Also, several R packages to help develop these.
www.riinu.me/2024/02/cons...

Reposted by Gordon Forbes

Ben Van Calster @benvancalster.bsky.social · Mar 12

We tried to look at ways to obtain flexible calibration plots in clustered (e.g. multicenter) validation studies.
Work with @lasaibarrenada.bsky.social @laurewynants.bsky.social @bavodccampo.bsky.social

arxiv.org/abs/2503.08389

Clustered Flexible Calibration Plots For Binary Outcomes Using Random Effects Modeling

Evaluation of clinical prediction models across multiple clusters, whether centers or datasets, is becoming increasingly common. A comprehensive evaluation includes an assessment of the agreement betw...

arxiv.org

3 12

Reposted by Gordon Forbes

Darren Dahly @statsepi.bsky.social · Mar 12

The term "digital twin" as it is now used in medicine has no real relationship to how the term is used in engineering. Yet every paper on the former talks about the success of the latter as if that's relevant. They are not the same!

2 7 41

Gordon Forbes @gforb.bsky.social · Mar 11

My spell check is trying to change bootstrap to Boomer.

As in Boomer p-values.

I didn't know the generation wars had made it to statistical inference. What next? Millennial credible intervals?

Gordon Forbes @gforb.bsky.social · Mar 7

Surely, do what is computationally feasible.

For a random forest on a medium-sized data set, there is no reason not to use a resampling approach or cross-validation.

If you have just developed an LLM using data scraped from the whole Internet, you are not going to be running cross-validation.
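
For reference, a minimal sketch of what k-fold cross-validation involves, which also shows why the cost grows with the number of refits. The function names are hypothetical, and a "predict the training mean" model stands in for the random forest purely to keep the example self-contained.

```python
import random


def k_fold_indices(n: int, k: int, seed: int = 0):
    """Yield (train, test) index lists for k roughly equal, disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cv_mse_of_mean(y, k: int, seed: int = 0) -> float:
    """Cross-validated mean squared error of a model that predicts the training mean."""
    fold_errors = []
    for train, test in k_fold_indices(len(y), k, seed):
        prediction = sum(y[j] for j in train) / len(train)  # "fit" on training folds
        fold_errors.append(sum((y[j] - prediction) ** 2 for j in test) / len(test))
    return sum(fold_errors) / k  # the model is refitted k times in total


splits = list(k_fold_indices(10, 5))
constant_outcome_error = cv_mse_of_mean([1.0] * 10, k=5)
```

Each of the k folds requires a full refit, which is cheap for a model on a medium-sized data set and impractical when a single fit is already extremely expensive.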

1

Reposted by Gordon Forbes

Darren Dahly @statsepi.bsky.social · Mar 6

I think you mean "non-specific".

5 9 77

Gordon Forbes @gforb.bsky.social · Mar 6

When I've worked with REDCap databases I've always had to rely on a data manager pulling an extract and emailing it to me.

This could be a game changer.

1 1

Gordon Forbes @gforb.bsky.social · Mar 5

It's amazing how far back ideas go.

This paper from 1984 by @f2harrell.bsky.social discusses the need for train/test data splits, the importance of assessing model calibration and discrimination, and instability in variable selection methods.

onlinelibrary.wiley.com/doi/epdf/10....

Regression modelling strategies for improved prognostic prediction

Regression models such as the Cox proportional hazards model have had increasing use in modelling and estimating the prognosis of patients with a variety of diseases. Many applications involve a larg...

onlinelibrary.wiley.com
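
As a rough illustration of the two performance ideas named in the post, here is a Python sketch of one common discrimination measure (the c-statistic) and one simple calibration summary (the observed/expected ratio) for a binary outcome. The choice of these two metrics, the function names, and the toy data are assumptions for illustration and are not taken from the paper.

```python
def c_statistic(y_true, p_hat) -> float:
    """Discrimination: share of (event, non-event) pairs ranked correctly.

    Ties in predicted risk count as half a concordant pair.
    """
    events = [p for y, p in zip(y_true, p_hat) if y == 1]
    non_events = [p for y, p in zip(y_true, p_hat) if y == 0]
    concordant = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return concordant / (len(events) * len(non_events))


def observed_over_expected(y_true, p_hat) -> float:
    """Calibration summary: observed events divided by the sum of predicted risks."""
    return sum(y_true) / sum(p_hat)


# Made-up held-out outcomes and predicted risks.
outcomes = [0, 0, 1, 1]
predicted_risks = [0.1, 0.4, 0.35, 0.8]

c_stat = c_statistic(outcomes, predicted_risks)
oe_ratio = observed_over_expected(outcomes, predicted_risks)
```

A c-statistic of 0.5 means ranking no better than chance, and an observed/expected ratio above 1 means the model under-predicts risk on average. Both would normally be computed on data not used to fit the model.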

1 6 22

Reposted by Gordon Forbes

Tim Morris @timpmorris.bsky.social · Feb 28

A minor stylistic preference I’ve recently found myself using: When introducing a key initialism or acronym in a paper, put the compressed version in the text and its expansion in parentheses.

Instead of ‘under missing at random (MAR)’, use ‘under MAR (missing at random)’.
1/

3 1 10

Reposted by Gordon Forbes

Hadley Wickham @hadley.nz · Feb 27

What advice do folks have for organising projects that will be deployed to production? How do you organise your directories? What do you do if you're deploying multiple "things" (e.g. an app and an api) from the same project?

27 29 100

Reposted by Gordon Forbes

Richard Riley (R²) @richarddriley.bsky.social · Feb 12

Just published a couple of pre-prints for those interested in sample size calculations for precise and fair individual-level predictions ... (not the end of the story, but a useful contribution we hope):

Binary outcomes: arxiv.org/abs/2407.09293

Survival outcomes: arxiv.org/abs/2501.14482

A decomposition of Fisher's information to inform sample size for developing fair and precise clinical prediction models -- part 1: binary outcomes

When developing a clinical prediction model, the sample size of the development dataset is a key consideration. Small sample sizes lead to greater concerns of overfitting, instability, poor performanc...

arxiv.org

1 8 26