jedbrown.org
Jed Brown
@jedbrown.org
Prof developing fast algorithms, reliable software, and healthy communities for computational science. Opinions my own. https://hachyderm.io/@jedbrown

https://PhyPID.org | aspiring killjoy | against epistemicide | he/they
This preprint is a big caveat to (1) above. It suggests that plagiarism is common in LLM responses to organic prompts. If plagiarism detectors aren't flagging it, it may be because the passages are smaller or because they aren't checking against the original content.
"Chatbots are routinely breaching the ethical standards that humans are normally held to."

People often ask how frequently organic prompting returns near-verbatim content in the responses. This preprint shows it's very common, especially with expository writing and code.

arxiv.org/abs/2411.10242
January 10, 2026 at 12:42 AM
"Chatbots are routinely breaching the ethical standards that humans are normally held to."

People often ask how frequently organic prompting returns near-verbatim content in the responses. This preprint shows it's very common, especially with expository writing and code.

arxiv.org/abs/2411.10242
January 10, 2026 at 12:37 AM
Great contextualization of this work. When we let financial interests choose terminology and accept corporate testimony as though it were an honest and accurate depiction of the technology, we are perpetuating a lie to the public and abetting bad court rulings.
January 10, 2026 at 12:22 AM
Unsourced and improperly-sourced claims are rampant, as seen in the deluge of slop papers and legal briefs and government/Deloitte reports that people are constantly getting caught trying to fraudulently pass off as human work. And note that these are not the crime, but merely evidence of the crime.
January 9, 2026 at 4:40 AM
I think it's a bad question for informing decisions (like "what's the chance I get stopped for speeding in this school zone?"), but the answer is that we really don't know. Only a subset of organic LLM interactions are checked for that purpose and current checkers are fallible in many ways.
January 9, 2026 at 4:40 AM
We know that:
1. organic prompting for content that is routinely run through plagiarism detectors (which access a subset of the LLM's training data) does not frequently turn red, and
2. some prompting elicits extensive verbatim content.

This is a recipe for lulling people into complacency.
January 9, 2026 at 3:29 AM
Ghost authorship and paraphrased plagiarism are rarely detected/enforced without other evidence (contracts, confessions/bragging, other process records), but the prohibition is still a clear professional norm, while a lot of people want to normalize LLMs as somehow being an exemption card for such norms.
January 9, 2026 at 3:29 AM
There is no consistent procedure for assessing plagiarism. Journals and institutions have internal protocols, but it's a subjective standard and not a legal matter (no court, no jury; that's only for copyright infringement). But it's still misconduct even if you don't get caught.
January 9, 2026 at 3:29 AM
If you trust an LLM's "summary" (it isn't really a summary), you may commit misconduct by misstating the sources' actual claims. If you take LLM output as a sort of fuzzy search/idea generator and track down original sources (don't trust LLM output), read them, and then write your own paper, that's fine.
January 9, 2026 at 3:29 AM
To "see that's not true" would be like wearing gloves when firing a gun and seeing that you didn't leave fingerprints on the weapon. You still pulled the trigger, but may be less likely to be caught. LLMs are like wearing gloves with holes: you never know if it's going to leave that evidence.
I think it's reasonable to be concerned about prompts containing passages from existing writing, but I can just imagine a university taking a very conservative approach and saying "don't use LLMs, you're likely to get plagiarized responses", people seeing that's not true, and ignoring the warning.
January 9, 2026 at 1:37 AM
I think you were using "plagiarism" colloquially to mean "near-verbatim" or "red on a plagiarism detector", while I was using the academic definition: an epistemic/process violation (in which ghost authorship very much is plagiarism and verbatim/similarity is merely circumstantial evidence).
Everyone should read The Two Victims of Plagiarism from @plagiarismtoday.com in the context of LLMs.

LLMs provide plausible deniability unless we recognize what it means to choose to use the plagiarism machine: non-consensual ghost authorship in a blender.

www.plagiarismtoday.com/2019/08/01/t...
January 9, 2026 at 1:37 AM
University admins are typically LLM boosters who make FOMO-driven commitments without understanding what the products are. Universities *should* recognize that LLM use in scholarly work is misconduct akin to ghost authorship because it misrepresents the author's epistemic relation to the work.
January 9, 2026 at 12:49 AM
My interpretation of that thread is that Git core maintainers believe you can't sign the DCO for a contribution derived from an LLM (a position I agree with and wrote about last summer 👇). One contributor using an LLM for a docs-only PR two years ago would be exempt, no?
hachyderm.io/@jedbrown/11...
Jed Brown (@jedbrown@hachyderm.io)
What does the Signed-off-by tag mean? It is certifying the Developer Certificate of Origin (DCO). https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin I claim you cannot c...
hachyderm.io
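For reference, the mechanics are just a commit trailer: `git commit -s` appends a line like the one below (placeholder identity, shown only for illustration), and that line is what certifies, per the DCO, that you wrote the contribution or otherwise have the right to submit it under the project's open source license.

```
# placeholder identity for illustration (what `git commit -s` appends):
Signed-off-by: Jane Developer <jane@example.com>
```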
January 9, 2026 at 12:40 AM
I appreciate your writing and I want to be constructive here. It's very confusing and involves fighting cognitive biases like the ELIZA Effect (Weizenbaum 1966) to accurately describe a synthetic text extruder as having no intent or basis in reality, but the anthropomorphization is so corrosive.
January 8, 2026 at 10:55 PM
This may seem like a pedantic point, but "Grok" did not "confirm". It was prompted by a user and replied like autocomplete. The response does not reflect awareness or a recounting of facts (the mechanism is indifferent to facts). It's confusing the public that so many journalists make this mistake.
This is a thread of major media outlets falsely anthropomorphising the "Grok" chatbot program and in doing so, actively and directly removing responsibility and accountability from individual people working at X who created a child pornography generator (Elon Musk, Nikita Bier etc)

#1: Reuters
January 8, 2026 at 10:49 PM
There is an ongoing natural experiment in which students use LLMs to generate papers and usually the plagiarism detector is green. (It's still plagiarism akin to ghost authorship, but hard to prove.) And that space of prompts and relevant training data may not be representative of professional uses.
January 8, 2026 at 10:41 PM
Thanks. It's hard to be confident that a given prompt (which might incidentally or intentionally contain a phrase appearing in the training data, such as a book that also quotes an attributed verbatim passage) won't elicit near-verbatim content.
January 8, 2026 at 10:41 PM
This focus on near-verbatim matches has already assumed the premise that we just don't want to get caught, not that we think copyright infringement or plagiarism are bad or dishonest practices. Meanwhile, OpenAI is arguing in court that the legality of a prompt depends on what is returned by their model.
It's notable that OpenAI lawyers tried this because it undermines the indemnification clause in their services agreement. That indemnification clause, backed up by heaps of money and hubris, has been key to lawyers for business users allowing widespread use of "AI"-generated content.
January 8, 2026 at 10:18 PM
Any time you prompt an LLM, you get text of unknown provenance. Software may be the most direct and mature "continue this" application, but spicy-autocomplete for prose is also a thing. There is no simple/reliable rule to prevent an LLM from producing near-verbatim results.
January 8, 2026 at 10:18 PM
Software devs do it routinely: e.g., type `//sparse matrix transpose` and auto-complete a page of near-verbatim code with namespaces intact and copyright stripped. devclass.com/2022/10/17/g...
That litigation is ongoing: githubcopilotlitigation.com/case-updates...

And "verbatim" isn't the standard.
January 8, 2026 at 9:58 PM