Cooper
banner
afedercooper.bsky.social
Cooper
@afedercooper.bsky.social
We extracted large portions of 12 books across all of our experiments. But we didn’t always extract training data.

Here’s some generated text from GPT-4.1 when we try to extract A Game of Thrones (GoT). On vibes, this text might look like GoT, but it isn’t near-verbatim.

This _isn’t_ extraction.
January 7, 2026 at 8:31 PM
(1) Instruct the LLM to continue a short prefix from a book (using Best-of-N jailbreaking for Claude 3.7 Sonnet + GPT-4.1)

(2) If successful (it wasn’t always), repeatedly query to continue the book

In our main results, the short prefix in (1) is the only ground-truth text provided to the LLM
January 7, 2026 at 8:31 PM
We extracted 4.0% of Harry Potter from jailbroken GPT-4.1.

For Gemini 2.5 Pro and Grok 3, we _didn't_ need to jailbreak, and got 76.8% and 70.3% from each.

It was relatively simple to evade guardrails with two steps:
January 7, 2026 at 8:31 PM
We extracted (parts of) 12 books in experiments with 4 frontier-lab, production LLMs.

We prompted the LLMs with a short prefix of a book and asked them to complete the rest. For Harry Potter and the Sorcerer’s Stone, we extracted 95.8% of the book from jailbroken Claude 3.7 Sonnet.
January 7, 2026 at 8:31 PM
Using just the first line of chapter 1 (60 tokens), we can deterministically generate a near-exact copy of the entire ~300 page book (!!!).

(~300 book-length pages of basically no diff! Cosine similarity of 0.9999; greedy approx. of word-level LCS of 0.992)

4/8
July 14, 2025 at 2:37 PM
made a friend
June 9, 2025 at 8:26 PM
oh to be a seal in a tide pool, instead of a grouch on the internet
June 8, 2025 at 5:46 PM
We’ve been receiving a bunch of questions about a CFP for GenLaw 2025.

We wanted to let you know that we chose not to submit a workshop proposal this year (we need a break!!). We’ll be at ICML though and look forward to catching up there!

You can watch our prior videos!
March 9, 2025 at 8:33 PM
Excited to announce that our paper on training-data memorization and copyright has been accepted for presentation at ACM CSLaw '25!

The law review version is in press, forthcoming in early 2025.

arxiv.org/abs/2404.12590
December 4, 2024 at 12:31 AM