Lightnews — Scholar-powered news

Matteo Di Cristofaro

@matteodic.bsky.social

740 followers 1.2K following 44 posts

Researcher in Corpus Linguistics and Digital Humanities @ UniMoRe. Corpus and Cognitive Linguist, Python & R user. Overall nerd (posts not representative of employers). Website: https://infogrep.it Online materials: https://catlism.github.io

Pinned

Matteo Di Cristofaro @matteodic.bsky.social · Jul 3

Is 😵‍💫 one token or two?
To a human, it's one. To a corpus tool, it’s often split (😵 + 💫).
And 𝙊𝙉𝙇𝙄𝙉𝙀 ≠ online.
This preprint shows how emojis & homoglyphs challenge tokenisation and distort linguistic evidence.
🔍 arxiv.org/abs/2507.01764

#Emoji #Homoglyphs #CorpusLinguistics #AcademicSky #LangSky

Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative met...

arxiv.org
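
The two effects named in the post can be reproduced with a minimal Python sketch. It is an illustration under stated assumptions, not the preprint's method: `group_zwj` is a hypothetical helper that only glues code points joined by a zero-width joiner (a simplification of full Unicode grapheme segmentation), and NFKC normalisation is just one possible way to fold styled homoglyphs back to plain letters.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner

# The emoji from the post: U+1F635 + ZWJ + U+1F4AB, rendered as one glyph.
face = "\U0001F635" + ZWJ + "\U0001F4AB"

# A tool that works code point by code point sees three separate items.
code_points = list(face)


def group_zwj(text):
    """Hypothetical helper: keep code points joined by ZWJ in one unit.

    Simplified on purpose; it ignores skin-tone modifiers, variation
    selectors, and the other rules of full grapheme segmentation.
    """
    units = []
    glue = False  # True when the previous code point was a ZWJ
    for ch in text:
        if units and (ch == ZWJ or glue):
            units[-1] += ch
        else:
            units.append(ch)
        glue = ch == ZWJ
    return units


# Homoglyphs: "ONLINE" built from Mathematical Sans-Serif Bold Italic
# capitals (the block starts at U+1D63C for "A").
styled = "".join(chr(0x1D63C + ord(c) - ord("A")) for c in "ONLINE")

# NFKC maps these styled letters to ASCII; casefold then lowercases them.
folded = unicodedata.normalize("NFKC", styled).casefold()

print(len(code_points), len(group_zwj(face)))
print(styled == "online", folded == "online")
```

NFKC also rewrites other compatibility characters (ligatures, superscripts, full-width forms), so whether to apply it to a corpus is a design decision rather than a neutral clean-up step.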

Reposted by Matteo Di Cristofaro

Dr N. Brodie @brodiegal.bsky.social · 22d

I see so much of this in academic funding calls ‘we are looking for projects that explore how AI can help to solve … hunger, violence against women and children, poverty, etc.’ But there’s no space in there to say: ‘um, what if AI is not the right tool for this’

Dr Abeba Birhane @abeba.bsky.social · 23d

AI is the wrong tool to tackle complex societal & systemic problems. AI4SG is more about PR victories, boosting AI adoption (regardless of merit/usefulness) & laundering accountability for harmful tech, extractive practices, abetting atrocities. yours truly
www.project-syndicate.org/magazine/ai-...

The False Promise of “AI for Social Good”

Abeba Birhane refutes industry claims about the technology's potential to solve complex social problems.

www.project-syndicate.org

Reposted by Matteo Di Cristofaro

DAIHUM: Digital & AI Humanities @daihum.bsky.social · 26d

📖 ToolFindr - Lightweight Explorer for Discovering Research Tools in Digital Humanities

ToolFindr - Lightweight Explorer for Discovering Research Tools in Digital Humanities

[AI Summary]: ToolFindr is an open, community-curated platform for discovering Digital Humanities research tools, built on the Tool Registry Framework and integrating data from Wikidata and the...

ai-humanities.com

Matteo Di Cristofaro @matteodic.bsky.social · Sep 2

I recently found "What are embeddings" by @vickiboykis.com, and I think it should become a must-read introductory book for #corpuslinguistics and #digitalhumanities. Plus it's free under CC BY-NC-SA!

vickiboykis.com/what_are_emb...

What are embeddings?

A deep-dive into machine learning embeddings.

vickiboykis.com

Reposted by Matteo Di Cristofaro

Tech Policy Press @techpolicypress.bsky.social · Aug 22

Extreme speech thrives in encrypted spaces, but killing encryption won’t stop it, says a group of researchers who have studied the problem from multiple angles. We need context-driven governance, not backdoors, they say.

Policy Directions on Encrypted Messaging and Extreme Speech | TechPolicy.Press

Encryption, disinformation, and democracy: rethinking policy for messaging apps with rights-based safeguards.

buff.ly

Matteo Di Cristofaro @matteodic.bsky.social · Aug 14

wow, sounds super, thanks!

Reposted by Matteo Di Cristofaro

Craig Silverman @craigsilverman.bsky.social · Aug 10

“Wikipedia editors have had to deal with an onslaught of AI-generated content filled with false information and phony citations. Already, the community of Wikipedia volunteers has mobilized to fight back against AI slop”

www.theverge.com/report/75681...

How Wikipedia is fighting AI slop content

Wikipedians are wading through the muck.

www.theverge.com

Matteo Di Cristofaro @matteodic.bsky.social · Aug 9

5nxd redeemed, thanks

Matteo Di Cristofaro @matteodic.bsky.social · Aug 9

8s25 redeemed, thanks a lot

Matteo Di Cristofaro @matteodic.bsky.social · Aug 8

I love this!

Reposted by Matteo Di Cristofaro

Daniel van Strien @danielvanstrien.bsky.social · Aug 1

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

Screenshot of the app showing a page from a book + different views of existing and new ocr.

Matteo Di Cristofaro @matteodic.bsky.social · Jul 26

rvya redeemed, thanks

Matteo Di Cristofaro @matteodic.bsky.social · Jul 26

b83m redeemed, thanks a lot

Matteo Di Cristofaro @matteodic.bsky.social · Jul 26

fsdc redeemed, thanks!

Reposted by Matteo Di Cristofaro

Randall Munroe @xkcd.com · Jul 21

Replication Crisis

xkcd.com/3117/

4-panel comic. (1) [Person 1 with ponytail flanked by person with short hair and another person speaking into microphone at podium] PERSON 1: In the early 2010s, researchers found that many major scientific results couldn’t be reproduced. (2) PERSON 1: Over a decade into the replication crisis, we wanted to see if today’s studies have become more robust. (3) PERSON 1: Unfortunately, our replication analysis has found exactly the same problems that those 2010s researchers did. (4) [newspaper with image of speakers from previous panels] Headline: Replication Crisis Solved

Reposted by Matteo Di Cristofaro

Philipp Markolin, PhD @philippmarkolin.bsky.social · Jul 11

I believe it is worth interrogating the fundamental forces re-shaping our information spheres away from liberal democracy towards myth, manipulation and magical thinking empowering autocracy and nihilism.

Here’s how it all falls apart—a 🧵 in 6 figures ⬇️
www.protagonist-science.com/p/how-social...

How social media destroys democratic discourse, explained in 6 easy figures

Where we all went wrong

www.protagonist-science.com

Reposted by Matteo Di Cristofaro

Hypervisible @hypervisible.blacksky.app · Jul 6

Stuffing ai into everything “isn’t just a forecast, it’s a libidinal fantasy — a capitalist dream of replacing relationships with code and scalable software, while public institutions are gutted in the name of ‘innovation.’”

Regulating AI Isn’t Enough. Let’s Dismantle the Logic That Put It in Schools.

AI in schools isn’t progress — it’s a sign of how far we’ve strayed from the purpose of education.

truthout.org

Reposted by Matteo Di Cristofaro

Hypervisible @hypervisible.blacksky.app · Jul 6

🤷🏿‍♂️

Companies That Tried to Save Money With AI Are Now Spending a Fortune Hiring People to Fix Its Mistakes

Companies that rushed to replace human labor with AI are now shelling out to have IRL workers to fix the technology's screwups.

futurism.com

Reposted by Matteo Di Cristofaro

Ken Burnside @kenburnside.bsky.social · Jul 3

"The problem with AI isn't that it can do your job. It can't. The problem with AI is that your MBA-brained boss's boss doesn't know how your job works and thinks AI can do your job at fractions of a penny on the dollar, and hears the siren song of 'maximize shareholder value'."

MBA-brain is real.

David Ho @davidho.bsky.social · Jul 3

They’re literally advertising it.

CEOs Start Saying the Quiet Part Out Loud: AI Will Wipe Out Jobs

Ford chief predicts AI will replace “literally half of all white-collar workers.”

www.wsj.com

Matteo Di Cristofaro @matteodic.bsky.social · Jul 2

wow, many thanks!

Matteo Di Cristofaro @matteodic.bsky.social · Jul 2

Fellow academics, can anyone help with obtaining an #endorsement on arXiv?
I have a preprint I'd like to upload to Computer Science > Computation and Language (cs.CL), but need someone to endorse my account.
Here's the endorsement link: arxiv.org/auth/endorse...

#corpuslinguistics #linguistics

arXiv user login

arxiv.org