Lightnews — Scholar-powered news

Amit Chaudhary

@amitness.com

For pre-training data, this thread has good paper recommendations
bsky.app/profile/mari...

Maria Antoniak @mariaa.bsky.social · May 9

Has anyone written anything about *scraping and text processing* for internet pretraining data? Practical details, which tools are used, which webpage elements are considered, how HTML to text conversion is done?

(I know about work on quality filters, relevant but not quite what I'm looking for)

June 4, 2025 at 7:27 AM

Amit Chaudhary

@amitness.com

Not academic work, but for evals and data, these survey articles are quite in-depth with links to papers.

LLM judge survey: eugeneyan.com/writing/llm-...
Synthetic pre-training/post-training survey: eugeneyan.com/writing/synt...

Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

eugeneyan.com

June 4, 2025 at 7:26 AM

Amit Chaudhary

@amitness.com

I watched a talk by @thomwolf.bsky.social of @hf.co; they use trafilatura.readthedocs.io for the HTML-to-text conversion in their library datatrove (github.com/huggingface/...).

The talk focuses more on filtering, but here it is:
www.youtube.com/watch?v=2-SP...

A Python package & command-line tool to gather text on the Web — Trafilatura 2.0.0 documentation

Trafilatura is a Python package and command-line tool designed to gather text on the Web. Its main applications are web crawling, downloads, scraping, and extraction of main texts, comments and metada...

trafilatura.readthedocs.io

May 10, 2025 at 8:02 AM

Amit Chaudhary

@amitness.com

👋

February 19, 2025 at 3:08 PM

Amit Chaudhary

@amitness.com

Sure, SFT is simulating annotators from those countries

But you see this a lot on reddit/linkedin, where people downvote a comment and call it out as "sounds like chatgpt" because it contains a phrase or syntax from the antislop list.

Not accurate, as you pointed out, but that's what a layperson uses as a proxy

February 14, 2025 at 1:21 PM

Amit Chaudhary

@amitness.com

Picking a few keywords from this antislop list:

github.com/sam-paech/an...

github.com

February 14, 2025 at 12:57 PM

Amit Chaudhary

@amitness.com

You actually don't need multiple --with flags. A comma-separated list of packages also works (though it looks a bit uglier)

uvx --with llm,sqlite-utils ipython

February 14, 2025 at 12:52 PM

Amit Chaudhary

@amitness.com

You can do it with skyfeed + running your custom logic on github actions

bsky.app/profile/amit...

Amit Chaudhary @amitness.com · Dec 1

Wrote down the process to build your own custom feeds for Bluesky programmatically in Python and run it 100% free

Uses @skyfeed.app + @github.com actions to do periodic filtering and re-ranking and @cloudflare.social static pages to provide data to @bsky.app

Zero-Cost Custom Feeds on Bluesky

A simple stack for generating custom feeds for Bluesky programmatically without a backend server

amitness.com

January 6, 2025 at 1:48 PM

Amit Chaudhary

@amitness.com

Same energy (h/t @hamel.bsky.social )

x.com/HamelHusain/...

December 30, 2024 at 3:04 PM

Amit Chaudhary

@amitness.com

Reminded me of this paper: arxiv.org/abs/2310.06816

Text Embeddings Reveal (Almost) As Much As Text

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding \textit{inversion}, reconstructing the full text represented in dense text embed...

arxiv.org

December 25, 2024 at 12:49 PM

Amit Chaudhary

@amitness.com

I just rely on these:
- alphasignal for daily updates
- email subs to blogs (eugeneyan, simonw, hamel, jasonliu)
- read O'Reilly books for bird's-eye surveys (chip huyen's ai eng, jay's hands-on llm etc.)
- deeplearning.ai "short" courses to know what's out there (topics I don't touch at work e.g. agents)

December 19, 2024 at 10:10 AM

Amit Chaudhary

@amitness.com

how are you tackling the last 2 points?

December 19, 2024 at 9:48 AM

Amit Chaudhary

@amitness.com

That's super cool, I'll give it a try and thank you for building Skyfeed!

December 3, 2024 at 8:03 AM

Amit Chaudhary

@amitness.com

cc: @pfrazee.com
@simonwillison.net (another git scraping avenue)

December 1, 2024 at 2:43 PM

Amit Chaudhary

@amitness.com

Would this be stance detection? A controversial post would have a high entropy of stance distribution in replies/quotes aka the "1M posts" drama.

paperswithcode.com/task/stance-...

Mutes on a post might also be a good proxy for downvotes, but those are private and can't be accessed via the API.

Papers with Code - Stance Detection

Stance detection is the extraction of a subject's reaction to a claim made by a primary actor. It is a core part of a set of approaches to fake news assessment. Example: * Source: "Apples are the mo...

paperswithcode.com

November 29, 2024 at 3:38 AM
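
The entropy idea in the post above can be illustrated with a short sketch. It assumes each reply or quote has already been assigned a stance label by some classifier (the labels "favor" and "against" and the function name stance_entropy are illustrative, not from the post) and only shows the entropy step.

```python
import math
from collections import Counter


def stance_entropy(stances):
    """Shannon entropy, in bits, of the stance labels on a post's replies."""
    counts = Counter(stances)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up label sets: a mostly one-sided thread vs. an evenly split one.
one_sided = ["favor"] * 9 + ["against"]
split = ["favor"] * 5 + ["against"] * 5

one_sided_score = stance_entropy(one_sided)
split_score = stance_entropy(split)
```

Under this measure the evenly split thread scores higher than the one-sided thread, which is the sense in which a controversial post has high stance entropy.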

Amit Chaudhary

@amitness.com

It's also why the feed loads super fast. When you open the feed, Bluesky simply makes a request to this static endpoint on Cloudflare, fetches the JSON with the post ids, and loads that into their UI.

bluesky-1tj.pages.dev/xrpc/app.bsk...

bluesky-1tj.pages.dev

November 28, 2024 at 9:41 AM

Amit Chaudhary

@amitness.com

Thanks; the trick is in how the Bluesky protocol operates. It makes GET requests to 3 endpoints and expects JSON

So, instead of running a server 24/7, you can offload indexing to @skyfeed.app, periodically filter the feed via github actions and just dump that into cloudflare pages with correct paths

November 28, 2024 at 9:37 AM
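
The static-hosting idea in the post above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the post only says "3 endpoints", so the three paths below, the payload shapes, the helper name write_static_feed, and the example hostname and URIs are all assumptions that should be checked against the Bluesky feed generator documentation.

```python
import json
import tempfile
from pathlib import Path

# ASSUMPTION: these are the three paths Bluesky requests from a feed generator.
# The original post does not name them; verify before relying on this.
DID_PATH = ".well-known/did.json"
DESCRIBE_PATH = "xrpc/app.bsky.feed.describeFeedGenerator"
SKELETON_PATH = "xrpc/app.bsky.feed.getFeedSkeleton"


def write_static_feed(out_dir, hostname, feed_uri, post_uris):
    """Write one JSON file per endpoint so a static host can serve them.

    Hypothetical helper: payload shapes are an assumption based on the
    public feed generator conventions, not taken from the post.
    """
    did = f"did:web:{hostname}"
    payloads = {
        DID_PATH: {
            "@context": ["https://www.w3.org/ns/did/v1"],
            "id": did,
            "service": [
                {
                    "id": "#bsky_fg",
                    "type": "BskyFeedGenerator",
                    "serviceEndpoint": f"https://{hostname}",
                }
            ],
        },
        DESCRIBE_PATH: {"did": did, "feeds": [{"uri": feed_uri}]},
        # The skeleton is just an ordered list of post URIs.
        SKELETON_PATH: {"feed": [{"post": uri} for uri in post_uris]},
    }
    out = Path(out_dir)
    for rel_path, payload in payloads.items():
        target = out / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload))
    return out


# Demo with placeholder values (not real accounts or posts).
demo_dir = tempfile.mkdtemp()
write_static_feed(
    demo_dir,
    hostname="feeds.example.com",
    feed_uri="at://did:example:alice/app.bsky.feed.generator/papers",
    post_uris=["at://did:example:alice/app.bsky.feed.post/1"],
)
skeleton = json.loads((Path(demo_dir) / SKELETON_PATH).read_text())
```

A scheduled job would regenerate the skeleton file with the filtered post ids and publish the directory; the static host may also need to be configured to serve the extensionless files as JSON.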

Amit Chaudhary

@amitness.com

I fetch the feed created by skyfeed using the bluesky sdk and, for posts with arxiv links, use the pyarxiv library to fetch the category, keeping only items in these categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA

Here is the relevant code

The filtering runs every 30m for free via github actions

November 27, 2024 at 8:45 AM
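
The category filter described above might look something like this. It is a sketch under stated assumptions rather than the author's code: the category lookup is passed in as a plain callable (lookup_categories is a hypothetical stand-in for whatever arXiv client is used), the paper ids and their categories in the demo are made up, and posts without an arXiv link are simply dropped here for brevity.

```python
import re

# Categories listed in the post.
ALLOWED = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA"}

# Matches new-style arXiv ids in abs/pdf links, e.g. arxiv.org/abs/2310.06816
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


def arxiv_id(text):
    """Return the first arXiv id found in the post text, or None."""
    match = ARXIV_ID.search(text)
    return match.group(1) if match else None


def keep_post(text, lookup_categories):
    """Keep a post only if its arXiv paper falls in an allowed category.

    lookup_categories is a hypothetical callable: paper id -> list of
    category strings. In practice it would wrap an arXiv API client.
    """
    paper_id = arxiv_id(text)
    if paper_id is None:
        return False  # simplification: non-arXiv posts are dropped
    return bool(ALLOWED & set(lookup_categories(paper_id)))


# Demo with made-up ids and categories.
fake_categories = {
    "0000.00001": ["cs.CL", "cs.LG"],
    "0000.00002": ["math.CO"],
}
posts = [
    "New paper arxiv.org/abs/0000.00001",
    "See arxiv.org/abs/0000.00002",
    "no link here",
]
kept = [p for p in posts if keep_post(p, lambda pid: fake_categories.get(pid, []))]
```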

Amit Chaudhary

@amitness.com

Hey @mariaa.bsky.social, I got it working. Here you go

bsky.app/profile/amit...

Amit Chaudhary @amitness.com · Nov 27

Built a custom feed that shows latest arxiv+acl papers that belong to AI/ML/NLP/Computer vision categories. No bots/random papers belonging to other fields now.

bsky.app/profile/amit...

Generated in python but runs 100% free without a server; I'll do a write-up soon
github.com/amitness/blu...

November 27, 2024 at 8:30 AM

Amit Chaudhary

@amitness.com

The most interesting part is the filtering and ranking; you can do a bunch of stuff. I went with hackernews ranking as a start to balance recency vs popularity.

You could even train your own classifiers to make it more personalized; bluesky seems super hackable, love it!

November 27, 2024 at 8:28 AM
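
For readers unfamiliar with it, a Hacker News style score can be sketched as below. This uses a commonly cited form of the formula, points divided by (age in hours + 2) raised to a gravity of 1.8; the post does not say which variant or which engagement signal was used, so the use of likes as "points", the gravity value, and the demo data are assumptions.

```python
def hn_score(likes, age_hours, gravity=1.8):
    """Popularity decayed by age: higher gravity favors newer posts."""
    return likes / (age_hours + 2) ** gravity


# Made-up posts: an old popular one and a fresh one with fewer likes.
posts = [
    {"id": "old_popular", "likes": 100, "age_hours": 48},
    {"id": "fresh", "likes": 10, "age_hours": 1},
]
ranked = sorted(
    posts,
    key=lambda p: hn_score(p["likes"], p["age_hours"]),
    reverse=True,
)
```

With these numbers the fresh post ranks first, since the age penalty on the two-day-old post outweighs its tenfold lead in likes.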
