Lightnews — Scholar-powered news

Blas M. Benito

@blasbenito.com

Blas M. Benito

@blasbenito.com

In the context of spatialRF, thinning is used to define the centers of contiguous training folds used in spatial cross-validation (shown as blue dots in the figure).

These fold centers ensure that the training data represents the spatial correlation structure of the full dataset.

Global scatter plot of geographic point locations shown on a world map using longitude (x-axis) and latitude (y-axis). Blue hollow circles labeled “Training” are concentrated mostly in the Southern Hemisphere, covering South America, southern Africa, Australia, and parts of Southeast Asia. Red hollow circles labeled “Testing” are concentrated mainly in the Northern Hemisphere, covering North America, Europe, northern Africa, and large parts of Asia. The two datasets are largely separated by latitude, with minimal overlap around the equatorial region.
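A minimal sketch of how a fold center can split the data into a contiguous training fold and a testing fold. Python is used only for illustration, and this is not spatialRF's code: the function name `spatial_fold`, the `training_fraction` parameter, and the "nearest points to the center" rule are all assumptions made for this example.

```python
import math


def spatial_fold(points, center, training_fraction=0.75):
    """Split points into (train, test) index lists for one fold center.

    Hypothetical rule: the training fold is the block of points closest
    to the center (planar distance), everything else is the testing fold.
    """
    n_train = max(1, int(len(points) * training_fraction))
    # Indices of all points, ordered from nearest to farthest from the center.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    train = sorted(order[:n_train])
    test = sorted(order[n_train:])
    return train, test


# Four points on a line, fold centered on the first one.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
train, test = spatial_fold(points, center=(0.0, 0.0), training_fraction=0.5)
print(train, test)
```

Repeating the split for each thinned center would give one training/testing pair per fold.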

January 11, 2026 at 9:42 AM

Blas M. Benito

@blasbenito.com

In case "thinning" doesn't ring a bell:

spatialRF::thinning() controls spatial clustering in point data to mitigate spatial autocorrelation and sampling bias.

The ugly figure shows the before and after of an extreme thinning run with a distance of 5 degrees on a global dataset with 30k points.

Two-panel world map illustrating spatial thinning. The top panel shows a very dense global distribution of points clustered across continents. The bottom panel shows the same data after spatial thinning with a 5° minimum distance, resulting in far fewer, more evenly spaced points that preserve broad geographic coverage while reducing clustering.
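For readers who prefer code to maps, here is a minimal sketch of greedy distance-based thinning. It is written in Python for illustration only and is not the spatialRF implementation: the function name `thin_points`, the planar (Euclidean) distance in degrees, and the "first come, first kept" rule are assumptions.

```python
import math


def thin_points(points, min_distance):
    """Keep a point only if it lies at least min_distance away
    from every point that has already been kept."""
    kept = []
    for x, y in points:
        if all(math.hypot(x - kx, y - ky) >= min_distance for kx, ky in kept):
            kept.append((x, y))
    return kept


# Longitude/latitude pairs in degrees, thinned with a 5-degree distance.
points = [(0, 0), (1, 1), (6, 0), (6, 4), (12, 0)]
print(thin_points(points, 5))
```

Every candidate is compared against all kept points, so the cost grows roughly with the square of the number of points, which is why a naive version gets slow on a 30k-point dataset.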

January 11, 2026 at 9:17 AM

Blas M. Benito

@blasbenito.com

I've been optimizing an #rstats function, as one usually does at 7AM on Sunday.

The benchmark uses 30k points to compare spatialRF::thinning() (plain R), its C++ version, and an optimized C++ algorithm using spatial indexing.

Result: ~500x speed-up 🚀

Additional outcome: I didn't waste my morning!

Results of the R function microbenchmark on three versions of a thinning algorithm: the naive algorithm written in plain R shows a median runtime of 49 seconds, a C++ version shows a median runtime of 11 seconds, and a C++ version with an optimized algorithm shows a median runtime of 0.1 seconds.
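One common way to get this kind of speed-up from spatial indexing is a uniform grid. The sketch below is in Python for illustration and is not the package's C++ code: the name `thin_points_grid`, the grid-cell approach, and the planar distance in degrees are assumptions, and wrap-around at the antimeridian is ignored.

```python
import math
from collections import defaultdict


def thin_points_grid(points, min_distance):
    """Greedy thinning that only checks kept points in neighboring grid cells.

    Cells are min_distance wide, so any kept point closer than min_distance
    must sit in the candidate's own cell or in one of the 8 cells around it.
    """
    if min_distance <= 0:
        raise ValueError("min_distance must be positive")
    grid = defaultdict(list)  # (cell_x, cell_y) -> kept points in that cell
    kept = []
    for x, y in points:
        cx = math.floor(x / min_distance)
        cy = math.floor(y / min_distance)
        too_close = any(
            math.hypot(x - kx, y - ky) < min_distance
            for i in range(cx - 1, cx + 2)
            for j in range(cy - 1, cy + 2)
            for kx, ky in grid.get((i, j), ())
        )
        if not too_close:
            kept.append((x, y))
            grid[(cx, cy)].append((x, y))
    return kept


print(thin_points_grid([(0, 0), (1, 1), (6, 0), (6, 4), (12, 0)], 5))
```

Each candidate is compared with a handful of nearby kept points instead of all of them, so the work per point stays roughly constant as the dataset grows.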

January 11, 2026 at 8:50 AM

Blas M. Benito

@blasbenito.com

Claude Code running several independent agents (each with its own context) in parallel to fix small bugs throughout a full #rstats package.

This feature sounded like scifi BS to me just weeks ago.

Claude Code spinning up five instances of an agent specialized in debugging and testing one R function at a time.

Claude Code showing a report of the bugs fixed by the bug-hunting agents.

December 30, 2025 at 8:17 PM

Blas M. Benito

@blasbenito.com

At first I thought that having more than one instance of Claude Code running on the same codebase was a bit silly, but here we are.

Snapshot of the kitty terminal showing two instances of Claude Code, one working on code optimization, and the other working on a test suite.

December 30, 2025 at 11:37 AM

Blas M. Benito

@blasbenito.com

Another one bites the dust.

The agrometeorological dataset AgERA5 (URL: cds.climate.copernicus.eu/datasets/sis...) is losing its technical support.

Screenshot of an ECMWF forum announcement titled ‘Scientific and technical support for AgERA5 ending from 31-12-2025.’ A staff member named Michela states that scientific and technical support for the Agrometeorological indicators from 1979 to present derived from reanalysis (AgERA5) dataset will no longer be available from 01-01-2026 and is not recommended for production use. The post notes that users will be informed if support is reestablished and provides a link to the ECMWF Support portal for enquiries.

December 10, 2025 at 10:19 AM

Blas M. Benito

@blasbenito.com

4/5 tidymodels integration 🧩

The new function step_collinear() lets you add multicollinearity filtering directly into your {recipes} pipelines.

This integration omits target-encoding, as it doesn’t fit well with how recipes work.

RStudio editor showing R code that builds a recipe with step_collinear(), defines a linear regression model using parsnip, constructs a workflow, fits it to vi_smol, and then calls broom::tidy() on extract_fit_engine(vi_workflow). Below the code, the Viewer displays a tibble with model coefficients.

December 9, 2025 at 8:04 AM

Blas M. Benito

@blasbenito.com

3/5 Enriched output 🚀

The output of collinear() now takes you from raw data to model-ready output:

✅ Filtered data frame
✅ Ranking of predictors resulting from preference_order()
✅ Names of the selected predictors
✅ Model formulas to kickstart exploratory modelling

Screenshot showing R code calling collinear::collinear() with a data frame, response "vi_numeric", and numeric predictors. The printed output shows: response name, filtered data frame dimensions (610 rows, 12 cols), preference order with the ranking method (f_numeric_glm) and top predictors (growing_season_length, soil_ph, swi_mean, etc.), the final selection of 11 predictors after filtering, and ready-to-use model formulas for linear (lm) and smooth (GAM) models.

December 9, 2025 at 8:04 AM

Blas M. Benito

@blasbenito.com

2/5 Adaptive thresholds 📈

By default, {collinear} analyzes the data's correlation structure to configure multicollinearity thresholds automatically.

Learn more here: blasbenito.github.io/collinear/ar...

Scatterplot showing results from 10,000 simulations validating collinear's adaptive threshold system. X-axis shows input correlation (75th percentile), Y-axis shows output maximum VIF after filtering. Points are colored by number of input predictors (25-100). Two horizontal dashed lines mark VIF = 2.5 and VIF = 7.5. The plot shows that output VIF stays bounded within this range across diverse correlation structures, with higher input correlations producing higher but still bounded output VIF values.

December 9, 2025 at 8:04 AM

Blas M. Benito

@blasbenito.com

Here is my honest comment about this: LOL

LinkedIn snapshot showing a data science lead job in Germany paying 14 dollars per hour. It's no longer accepting applicants, and shows that zero people clicked "apply".

November 20, 2025 at 1:39 PM

Blas M. Benito

@blasbenito.com

Version 3.0 of the #rstats package {collinear} (coming soon) has a comprehensive test suite (~800 tests, >96% coverage).

Writing it was a long and rather boring effort (no LLMs were harmed), but it's helping me catch internal inconsistencies and quickly identify the splash area of new features.

Results of devtools::test() showing the 799 successful unit tests of the R package {collinear}.

November 17, 2025 at 9:15 AM

Blas M. Benito

@blasbenito.com

From the same job advert: ZERO applicants

November 14, 2025 at 11:37 AM

Blas M. Benito

@blasbenito.com

Somebody asked "how can we enshittify the shit out of the job market?" and someone else came up with this:

November 14, 2025 at 11:35 AM

Blas M. Benito

@blasbenito.com

I’ve been writing R since 2008 and somehow NEVER noticed that functions can literally call themselves in some sort of evil self-recursion.
I don't remember ever seeing this thing in the wild!
#rstats

R function named wtf() that calls itself in a self-recursion pattern.
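The function in the screenshot is not reproduced here. As a generic illustration of the pattern (in Python, with a hypothetical `factorial` example), a recursive function calls itself on a smaller input until it reaches a base case that stops the chain:

```python
def factorial(n):
    """Return n! by having the function call itself."""
    if n <= 1:
        # Base case: without it the calls would never stop.
        return 1
    # Recursive case: the same function, applied to a smaller problem.
    return n * factorial(n - 1)


print(factorial(5))  # 5 * 4 * 3 * 2 * 1 = 120
```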

November 6, 2025 at 11:36 AM

Blas M. Benito

@blasbenito.com

On the other hand, the R package {collinear} (URL: blasbenito.github.io/collinear/) saw an increase in downloads after release 2.0, a version with no breaking changes.

Version 3.0 is coming soon, with a few significant improvements and some changes, so we'll see how things go after that.

#rstats

Barplot showing the monthly downloads of the R package collinear, with a monthly average of 384 downloads, and a clear increase after November 2024, resulting from release 2.0.

November 1, 2025 at 8:55 AM

Blas M. Benito

@blasbenito.com

The download history of the R package {distantia} (URL: blasbenito.github.io/distantia/) isn't half bad for such a niche tool!

There is a clear drop after releasing v2.0 in Jan 2025, but I get it: this was a full rewrite with no backward compatibility.

Breaking changes break trust!

#rstats

Bar plot showing the download history of the R package distantia since summer 2019. It shows a steady average of 480 downloads per month, with a slight decrease after the release of version 2.0.0 in January 2025.

November 1, 2025 at 7:58 AM

Blas M. Benito

@blasbenito.com

Here's mine!

A rubber duck wearing sunglasses and drinking a mojito or whatever.

September 24, 2025 at 12:29 PM

Blas M. Benito

@blasbenito.com

Equivalence between pairwise correlation and VIF in multicollinearity filtering.

Experiment:

- Subset df (30k rows, 249 cols) to random dimensions.
- Filter using a random max correlation.
- Find VIF producing the most similar result to the step above.
- Repeat 10k times.

#rstats 📦 {collinear}

Scatterplot showing the relationship between pairwise Pearson correlation and variance inflation factors across 10000 repetitions on randomized datasets. The points follow a sigmoidal curve and are colored according to the Jaccard similarity between the variables selected by multicollinearity filtering with a given correlation and with a given variance inflation factor.
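Two building blocks of this comparison are easy to sketch. The Python below is for illustration only and is not code from {collinear}: the function names are hypothetical. The first function covers only the textbook special case of exactly two predictors, where VIF = 1 / (1 - r^2); with more predictors the VIF depends on the multiple R-squared of each predictor against all the others, which is why the experiment has to find the equivalence empirically. The second function is the standard Jaccard similarity between two sets of selected variables.

```python
def vif_two_predictors(r):
    """VIF of either predictor when a model has exactly two predictors
    with Pearson correlation r (special case only)."""
    if abs(r) >= 1:
        raise ValueError("VIF is undefined for perfectly correlated predictors")
    return 1.0 / (1.0 - r * r)


def jaccard(a, b):
    """Jaccard similarity between two collections of variable names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Two empty selections are treated as identical.
    return len(a & b) / len(a | b)


print(vif_two_predictors(0.5))
print(jaccard(["a", "b", "c"], ["b", "c", "d"]))
```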

September 15, 2025 at 11:52 AM

Blas M. Benito

@blasbenito.com

After a bit of fiddling, I finally have a functional Jenkins job to back up ~1TB of Dropbox data to my old but trusty NAS whenever I start my computer.

I've used Jenkins at work before, but now that I am using it for my own stuff, I can say this out loud: Jenkins is pretty cool!

Snapshot of a Jenkins job running a Dropbox backup.

September 5, 2025 at 1:10 PM

Blas M. Benito

@blasbenito.com

I still have snapshots of some of the Kepler workflows I worked with during those years.

These franken-workflows combined Bash, GRASS GIS, R, and even Octave.

And ran simulations for months on a few of my lab's computers!

Snapshot of a workflow built with the defunct software Kepler.

September 5, 2025 at 11:09 AM

Blas M. Benito

@blasbenito.com

And a zoom on the southern populations here.

Topographic map showing the location of several plant populations, and lines between them representing potential dispersal routes.

September 5, 2025 at 11:01 AM

Blas M. Benito

@blasbenito.com

Aha, I found the whole figure!

3D map of eastern Andalusia showing the locations of several plant populations, and lines between them representing potential dispersal routes.

September 5, 2025 at 11:00 AM

Blas M. Benito

@blasbenito.com

I found this old 3D representation of a dispersal simulation between populations that I coded during my PhD.

It combined species distribution models, cellular automata, and least-cost paths. If my memory doesn't fail me, I used OpenModeller for the SDM, and GRASS GIS for the simulation.

Fun times!

September 5, 2025 at 10:56 AM

Blas M. Benito

@blasbenito.com

In case you didn't know:

You can run your package's {testthat} unit tests in parallel in two simple steps (see testthat.r-lib.org/articles/par...):

tldr:
1. Add `Config/testthat/parallel: true` to DESCRIPTION.
2. Add `TESTTHAT_CPUS=8` to your .Renviron and restart R.

#rstats

Running testthat unit tests in parallel with RStudio.

August 18, 2025 at 7:24 AM

Blas M. Benito

@blasbenito.com

I also have a machine named 'razorback' (r u seeing the theme already?).

It runs Ubuntu 24.04 on a 16-core i9 and 62 GB RAM, with a total storage of 5 TB.

I installed RStudio Server there last week and bookmarked the server's address in my laptop's browser.

It took me 5 minutes!

Snapshot of RStudio Server running unit tests in parallel. The upper left panel shows an R function, the upper right panel shows htop on the console, the 'build' panel on the lower left shows the testing progress, and the lower right panel shows the file explorer.

August 18, 2025 at 7:07 AM
