Amit Chaudhary
amitness.com
@amitness.com
ai engineer | blog: amitness.com

past: cogsci, low-res nlp, multimodality
For pre-training data, this thread has good paper recommendations
bsky.app/profile/mari...
Has anyone written anything about *scraping and text processing* for internet pretraining data? Practical details, which tools are used, which webpage elements are considered, how HTML to text conversion is done?

(I know about work on quality filters, relevant but not quite what I'm looking for)
June 4, 2025 at 7:27 AM
Not academic work, but for evals and data, these survey articles are quite in-depth with links to papers.

LLM judge survey: eugeneyan.com/writing/llm-...
Synthetic pre-training/post-training survey: eugeneyan.com/writing/synt...
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.
eugeneyan.com
June 4, 2025 at 7:26 AM
I watched a talk from @thomwolf.bsky.social of @hf.co; they use trafilatura.readthedocs.io for the HTML to text conversion in their library datatrove (github.com/huggingface/...).

The talk focuses more on filtering, but here it is:
www.youtube.com/watch?v=2-SP...
A Python package & command-line tool to gather text on the Web — Trafilatura 2.0.0 documentation
Trafilatura is a Python package and command-line tool designed to gather text on the Web. Its main applications are web crawling, downloads, scraping, and extraction of main texts, comments and metada...
trafilatura.readthedocs.io
May 10, 2025 at 8:02 AM
👋
February 19, 2025 at 3:08 PM
Sure, SFT is simulating annotators from those countries

But you see this repeatedly on reddit/linkedin, where people downvote a comment and call it out as "sounds like chatgpt" because it contains a phrase or syntax from the antislop list.

Not accurate, as you pointed out, but that's what a layperson uses as a proxy
February 14, 2025 at 1:21 PM
Picking a few keywords from this antislop list:

github.com/sam-paech/an...
github.com
February 14, 2025 at 12:57 PM
You actually don't need multiple --with. A comma separated list of packages also works (though looks a bit uglier)

uvx --with llm,sqlite-utils ipython
February 14, 2025 at 12:52 PM
You can do it with skyfeed + running your custom logic on github actions

bsky.app/profile/amit...
Wrote down the process to build your own custom feeds for Bluesky programmatically in Python and run it 100% free

Uses @skyfeed.app + @github.com actions to do periodic filtering and re-ranking and @cloudflare.social static pages to provide data to @bsky.app
Zero-Cost Custom Feeds on Bluesky
A simple stack for generating custom feeds for Bluesky programmatically without a backend server
amitness.com
January 6, 2025 at 1:48 PM
December 30, 2024 at 3:04 PM
I just rely on these:
- alphasignal for daily updates
- email subs to blogs (eugeneyan, simonw, hamel, jasonliu)
- read O'Reilly books for bird's-eye surveys (chip huyen's ai eng, jay's hands-on llm etc.)
- deeplearning.ai "short" courses to know what's out there (topics I don't touch at work e.g. agents)
December 19, 2024 at 10:10 AM
how are you tackling the last 2 points?
December 19, 2024 at 9:48 AM
That's super cool, I'll give it a try and thank you for building Skyfeed!
December 3, 2024 at 8:03 AM
cc: @pfrazee.com
@simonwillison.net (another git scraping avenue)
December 1, 2024 at 2:43 PM
Would this be stance detection? A controversial post would have a high entropy of stance distribution in replies/quotes aka the "1M posts" drama.

paperswithcode.com/task/stance-...

Mutes on a post might also be a good proxy for downvotes, but those are private and can't be accessed via the API.
Papers with Code - Stance Detection
Stance detection is the extraction of a subject's reaction to a claim made by a primary actor. It is a core part of a set of approaches to fake news assessment. Example: * Source: "Apples are the mo...
paperswithcode.com
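A minimal sketch of the entropy idea, assuming each reply or quote has already been given a stance label by some classifier (the labels `favor` and `against` and the function name `stance_entropy` are hypothetical). A post whose replies all agree scores 0 bits; an evenly split one scores the maximum.

```python
import math
from collections import Counter


def stance_entropy(stances):
    """Shannon entropy (in bits) of the stance labels on a post's replies/quotes."""
    counts = Counter(stances)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed stance labels.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical labels, for illustration only.
unanimous = ["favor"] * 10
split = ["favor"] * 5 + ["against"] * 5

print(stance_entropy(unanimous))  # 0.0
print(stance_entropy(split))      # 1.0
```

A threshold on this score would be one way to flag a post as controversial; where to set it is a judgment call.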
November 29, 2024 at 3:38 AM
It's also why the feed loads super fast. When you open the feed, Bluesky simply makes a request to this static endpoint on cloudflare, fetches the JSON with the post ids, and loads that into its UI.

bluesky-1tj.pages.dev/xrpc/app.bsk...
bluesky-1tj.pages.dev
November 28, 2024 at 9:41 AM
Thanks; the trick is in how the bluesky protocol operates. It makes GET requests to 3 endpoints and expects JSON

So, instead of running a server 24/7, you can offload indexing to @skyfeed.app, periodically filter the feed via github actions and just dump that into cloudflare pages with correct paths
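A rough sketch of the "dump into static pages with correct paths" step, not the actual code from the repo. It assumes the skeleton endpoint is the `app.bsky.feed.getFeedSkeleton` method from the AT Protocol feed generator docs and that the response has the shape `{"feed": [{"post": <at-uri>}]}`; the function name `write_feed_skeleton` and the example URI are made up. The other two endpoints would be written out as static JSON files the same way.

```python
import json
import tempfile
from pathlib import Path


def write_feed_skeleton(out_dir, post_uris):
    """Write the feed skeleton JSON as a static file under out_dir."""
    # Assumed path: mirrors the XRPC method name so a static host can serve it.
    target = Path(out_dir) / "xrpc" / "app.bsky.feed.getFeedSkeleton"
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"feed": [{"post": uri} for uri in post_uris]}
    target.write_text(json.dumps(payload))
    return target


# Demo in a throwaway directory; a real run would write into the folder
# that gets deployed to the static host.
with tempfile.TemporaryDirectory() as tmp:
    path = write_feed_skeleton(
        tmp, ["at://did:plc:example/app.bsky.feed.post/abc123"]  # made-up URI
    )
    print(path.relative_to(tmp))
    print(path.read_text())
```

Since the file is static, every request gets the same list, so ranking and filtering have to happen before the file is written.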
November 28, 2024 at 9:37 AM
I fetch the feed created by skyfeed using the bluesky sdk, and for posts with arxiv links, I use the pyarxiv library to fetch the category and filter items to these categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA

Here is the relevant code

The filtering runs every 30m for free via github actions
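For readers without the original snippet, the category filter might look roughly like the sketch below. It is an illustration under assumptions, not the repo's code: the category lookup is passed in as a plain function (`lookup_category`, hypothetical) rather than calling pyarxiv directly, posts are plain dicts with a `text` field, and posts without an arXiv link are passed through untouched.

```python
import re

ALLOWED = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA"}

# Matches new-style arXiv ids in abs/pdf links, e.g. arxiv.org/abs/2401.00001
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")


def filter_posts(posts, lookup_category, allowed=ALLOWED):
    """Keep arXiv posts whose category is allowed; pass other posts through."""
    kept = []
    for post in posts:
        match = ARXIV_ID.search(post["text"])
        if match is None:
            kept.append(post)  # assumption: non-arXiv posts are not filtered
        elif lookup_category(match.group(1)) in allowed:
            kept.append(post)
    return kept


# Made-up ids and categories standing in for a real arXiv lookup.
fake_categories = {"2401.00001": "cs.CL", "2401.00002": "math.AG"}
posts = [
    {"text": "New paper https://arxiv.org/abs/2401.00001"},
    {"text": "Algebraic geometry https://arxiv.org/abs/2401.00002"},
    {"text": "ACL anthology link, no arXiv id"},
]
for post in filter_posts(posts, fake_categories.get):
    print(post["text"])
```

Passing the lookup in as a function keeps the filter testable without network access.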
November 27, 2024 at 8:45 AM
Hey @mariaa.bsky.social, I got it working. Here you go

bsky.app/profile/amit...
Built a custom feed that shows latest arxiv+acl papers that belong to AI/ML/NLP/Computer vision categories. No bots/random papers belonging to other fields now.

bsky.app/profile/amit...

Generated in python but runs 100% free without a server; I'll do a write-up soon
github.com/amitness/blu...
November 27, 2024 at 8:30 AM
The most interesting part is the filtering and ranking; you can do a bunch of stuff. I went with hackernews ranking as a start to balance recency vs popularity.

You could even train your own classifiers to make it more personalized; bluesky seems super hackable, love it!
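As a sketch of the recency vs popularity trade-off, here is the commonly cited form of the Hacker News formula, score = (points - 1) / (age_hours + 2)^gravity with gravity 1.8. Treating likes as points is an assumption, and the exact variant used for the feed isn't shown in this thread.

```python
def hn_score(points, age_hours, gravity=1.8):
    """Commonly cited Hacker News ranking score: popularity decayed by age."""
    return (points - 1) / (age_hours + 2) ** gravity


def rank_posts(posts):
    """Sort posts (dicts with 'likes' and 'age_hours') by descending score."""
    return sorted(
        posts,
        key=lambda p: hn_score(p["likes"], p["age_hours"]),
        reverse=True,
    )


# Made-up posts: a fresh post with few likes vs a day-old popular one.
posts = [
    {"id": "old-popular", "likes": 100, "age_hours": 24},
    {"id": "fresh", "likes": 10, "age_hours": 1},
]
for post in rank_posts(posts):
    print(post["id"])
```

With these numbers the fresh post ranks first, because the age term grows faster than the like count; lowering `gravity` shifts the balance toward popularity.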
November 27, 2024 at 8:28 AM