Matteo Di Cristofaro
@matteodic.bsky.social
740 followers 1.2K following 44 posts
Researcher in Corpus Linguistics and Digital Humanities @ UniMoRe. Corpus and Cognitive Linguist, Python & R user. Overall nerd (posts not representative of employers). Website: https://infogrep.it Online materials: https://catlism.github.io
Pinned
matteodic.bsky.social
Is 😵‍💫 one token or two?
To a human, it's one. To a corpus tool, it’s often split (😵 + 💫).
And 𝙊𝙉𝙇𝙄𝙉𝙀 ≠ online.
This preprint shows how emojis & homoglyphs challenge tokenisation and distort linguistic evidence.
🔍 arxiv.org/abs/2507.01764

#Emoji #Homoglyphs #CorpusLinguistics #AcademicSky #LangSky
Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results
Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative met...
arxiv.org
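A minimal Python sketch of the two effects described in the post above, not the preprint's own method. The function names `naive_split` and `zwj_aware_split` are hypothetical, and the ZWJ handling is deliberately simplified: it ignores variation selectors and skin-tone modifiers. The emoji 😵‍💫 is stored as three code points (U+1F635, the zero-width joiner U+200D, and U+1F4AB), which is why a tool that ignores the joiner sees two tokens. The styled 𝙊𝙉𝙇𝙄𝙉𝙀 is made of Mathematical Alphanumeric Symbols, which NFKC normalisation maps back to plain ASCII letters.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner

# 😵‍💫 written as explicit code points: U+1F635 + ZWJ + U+1F4AB
face = "\U0001F635\u200D\U0001F4AB"

# 𝙊𝙉𝙇𝙄𝙉𝙀 in mathematical sans-serif bold italic capitals
styled = "\U0001D64A\U0001D649\U0001D647\U0001D644\U0001D649\U0001D640"


def naive_split(text):
    """Hypothetical ZWJ-unaware tokeniser: drops the joiner, one token per code point."""
    return [ch for ch in text if ch != ZWJ]


def zwj_aware_split(text):
    """Simplified alternative: keep code points linked by a ZWJ in one token."""
    tokens = []
    join_next = False
    for ch in text:
        if ch == ZWJ and tokens:
            tokens[-1] += ch      # attach the joiner to the current token
            join_next = True
        elif join_next:
            tokens[-1] += ch      # attach the code point that follows the joiner
            join_next = False
        else:
            tokens.append(ch)
    return tokens


def fold_homoglyphs(text):
    """Map compatibility characters to their plain forms, then casefold."""
    return unicodedata.normalize("NFKC", text).casefold()


print(len(naive_split(face)))       # number of tokens without ZWJ handling
print(len(zwj_aware_split(face)))   # number of tokens with ZWJ handling
print(styled.lower() == "online")   # plain lowercasing does not match
print(fold_homoglyphs(styled) == "online")
```

NFKC only covers characters with a Unicode compatibility mapping. Look-alikes borrowed from other scripts (for example Cyrillic "о" for Latin "o") are not folded by it and would need a separate confusables mapping.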
Reposted by Matteo Di Cristofaro
brodiegal.bsky.social
I see so much of this in academic funding calls ‘we are looking for projects that explore how AI can help to solve … hunger, violence against women and children, poverty, etc.’ But there’s no space in there to say: ‘um, what if AI is not the right tool for this’
abeba.bsky.social
AI is the wrong tool to tackle complex societal & systemic problems. AI4SG is more about PR victories, boosting AI adoption (regardless of merit/usefulness) & laundering accountability for harmful tech, extractive practices, abetting atrocities. yours truly
www.project-syndicate.org/magazine/ai-...
The False Promise of “AI for Social Good”
Abeba Birhane refutes industry claims about the technology's potential to solve complex social problems.
www.project-syndicate.org
matteodic.bsky.social
I have recently found "What are embeddings" by @vickiboykis.com, and I think it should become a #corpuslinguistics and #digitalhumanities must-read starting book. Plus it's free under CC by-nc-sa!

vickiboykis.com/what_are_emb...
What are embeddings?
A deep-dive into machine learning embeddings.
vickiboykis.com
Reposted by Matteo Di Cristofaro
techpolicypress.bsky.social
Extreme speech thrives in encrypted spaces, but killing encryption won’t stop it, says a group of researchers who have studied the problem from multiple angles. We need context-driven governance, not backdoors, they say.
Policy Directions on Encrypted Messaging and Extreme Speech | TechPolicy.Press
Encryption, disinformation, and democracy: rethinking policy for messaging apps with rights-based safeguards.
buff.ly
matteodic.bsky.social
wow, sounds super, thanks!
Reposted by Matteo Di Cristofaro
craigsilverman.bsky.social
“Wikipedia editors have had to deal with an onslaught of AI-generated content filled with false information and phony citations. Already, the community of Wikipedia volunteers has mobilized to fight back against AI slop”

www.theverge.com/report/75681...
How Wikipedia is fighting AI slop content
Wikipedians are wading through the muck.
www.theverge.com
matteodic.bsky.social
8s25 redeemed, thanks a lot
Reposted by Matteo Di Cristofaro
danielvanstrien.bsky.social
Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule
Screenshot of the app showing a page from a book + different views of existing and new ocr.
matteodic.bsky.social
b83m redeemed, thanks a lot
Reposted by Matteo Di Cristofaro
philippmarkolin.bsky.social
I believe it is worth interrogating the fundamental forces re-shaping our information spheres away from liberal democracy towards myth, manipulation and magical thinking empowering autocracy and nihilism.

Here’s how it all falls apart—a 🧵 in 6 figures ⬇️
www.protagonist-science.com/p/how-social...
How social media destroys democratic discourse, explained in 6 easy figures
Where we all went wrong
www.protagonist-science.com
Reposted by Matteo Di Cristofaro
hypervisible.blacksky.app
Stuffing ai into everything “isn’t just a forecast, it’s a libidinal fantasy — a capitalist dream of replacing relationships with code and scalable software, while public institutions are gutted in the name of ‘innovation.’”
Regulating AI Isn’t Enough. Let’s Dismantle the Logic That Put It in Schools.
AI in schools isn’t progress — it’s a sign of how far we’ve strayed from the purpose of education.
truthout.org
Reposted by Matteo Di Cristofaro
kenburnside.bsky.social
"The problem with AI isn't that it can do your job. It can't. The problem with AI is that your MBA-brained boss's boss doesn't know how your job works and thinks AI can do your job at fractions of a penny on the dollar, and hears the siren song of 'maximize shareholder value'."

MBA-brain is real.
matteodic.bsky.social
Fellow academics, can anyone help with obtaining an #endorsement on arXiv?
I have a preprint I'd like to upload to Computer Science > Computation and Language (cs.CL), but need someone to endorse my account.
Here's the endorsement link: arxiv.org/auth/endorse...

#corpuslinguistics #linguistics