Author | Lightnews

Antislop: A framework for eliminating repetitive patterns in language models | Hacker News

Paper @paper.bsky.social · 8h

(2/2) 89 Likes, 85 Comments, 23 Oct 2025, Hacker News

news.ycombinator.com

From the SillyTavernAI community on Reddit: Holy hell, one of you guys wrote an anti-slop paper

Paper @paper.bsky.social · 8h

(1/2) 208 Likes, 24 Comments, 23 Oct 2025, Reddit

Explore this post and more from the SillyTavernAI community

Paper @paper.bsky.social · 8h

[12/30] 297 Likes, 109 Comments, 2 Posts
2510.15061, cs.LG | cs.CL, 21 Oct 2025

🆕Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models

Samuel Paech, Allen Roush, Judah Goldfeder, Ravid Shwartz-Ziv

Widespread LLM adoption has introduced characteristic repetitive phraseology, termed "slop," which degrades output quality and makes AI-generated text immediately recognizable.

We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns.

Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace.

We demonstrate that some slop patterns appear over 1,000x more frequently in LLM output than human text.

The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000.

Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks.

In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression.

We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
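
The backtracking idea behind the Antislop Sampler can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the `toy_model` lookup table, the greedy decoding rule, and the choice to hard-block the first token of a banned phrase at the rewound position are all hypothetical simplifications.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example

def generate_with_backtracking(
    next_token_scores: Callable[[Sequence[str]], Dict[str, float]],
    banned_phrases: Sequence[Tuple[str, ...]],
    max_tokens: int,
) -> List[str]:
    """Greedy decoding that rewinds whenever a banned phrase is completed."""
    tokens: List[str] = []
    blocked: Dict[int, Set[str]] = {}  # position -> tokens disallowed there
    while len(tokens) < max_tokens:
        pos = len(tokens)
        scores = next_token_scores(tokens)
        candidates = {t: s for t, s in scores.items()
                      if t not in blocked.get(pos, set())}
        if not candidates:
            break  # nothing left to try at this position
        choice = max(candidates, key=candidates.get)
        if choice == EOS:
            break
        tokens.append(choice)
        for phrase in banned_phrases:
            n = len(phrase)
            if n and tuple(tokens[-n:]) == tuple(phrase):
                start = len(tokens) - n
                # Block the phrase's first token only at this position;
                # it stays available everywhere else in the vocabulary.
                blocked.setdefault(start, set()).add(phrase[0])
                # Blocks recorded for later positions belong to a prefix
                # that no longer exists, so discard them.
                for p in [p for p in blocked if p > start]:
                    del blocked[p]
                del tokens[start:]  # rewind to where the phrase began
                break
    return tokens

def toy_model(tokens: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model: scores by last token."""
    table = {
        None: {"a": 0.6, "the": 0.4},
        "a": {"rich": 0.7, "vivid": 0.3},
        "rich": {"tapestry": 0.9, "history": 0.1},
        "vivid": {"scene": 1.0},
    }
    last = tokens[-1] if tokens else None
    return table.get(last, {EOS: 1.0})

print(generate_with_backtracking(toy_model, [], 10))
print(generate_with_backtracking(toy_model, [("rich", "tapestry")], 10))
```

Unlike a global token ban, the block here applies at a single position, so the words "rich" and "tapestry" remain usable elsewhere in the output. A real sampler would presumably work on model logits and sample stochastically rather than decode greedily.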

Paper @paper.bsky.social · 1d

Top 30 most popular arXiv papers in the last 30 days.
[1/30] [2/30] [3/30] [4/30] [5/30] [6/30] [7/30] [8/30] [9/30] [10/30] [11/30] [12/30] [13/30] [14/30] [15/30] [16/30] [17/30] [18/30] [19/30] [20/30] [21/30] [22/30] [23/30] [24/30] [25/30] [26/30] [27/30] [28/30] [29/30] [30/30]

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Paper @paper.bsky.social · 1d

2510.16888
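Instruction-based image editing has achieved remarkable progress; however, models trained solely via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions.

To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization.

Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training.

Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks.

To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback.

Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization.

UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively.

Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability.

Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.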

Paper @paper.bsky.social · 1d

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and ...

Paper page - Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Paper @paper.bsky.social · 1d

(2/2) 17 Likes, 2 Comments, 21 Oct 2025, Hugging Face

Join the discussion on this paper page

huggingface.co

From the StableDiffusion community on Reddit: UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback - ( Finetuned versions of FluxKontext and Qwen-I...

Paper @paper.bsky.social · 1d

(1/2) 176 Likes, 16 Comments, 21 Oct 2025, Reddit

Explore this post and more from the StableDiffusion community

Paper @paper.bsky.social · 1d

[20/30] 193 Likes, 18 Comments, 2 Posts
2510.16888, cs.CV, 21 Oct 2025

🆕Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu,...

Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions.

To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization.

Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training.

Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks.

To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback.

Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization.

UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively.

Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability.

Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.

Glyph: Scaling Context Windows via Visual-Text Compression

Paper @paper.bsky.social · 1d

2510.17800
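Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning.

However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs.

In this work, we take a different perspective, visual context scaling, to tackle this challenge.

Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs).

This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression.

Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks.

This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training.

Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.

In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding.

Our code and model are released at https://github.com/thu-coai/Glyph.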

Paper @paper.bsky.social · 1d

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the milli...

Paper page - Glyph: Scaling Context Windows via Visual-Text Compression

Paper @paper.bsky.social · 1d

(2/2) 52 Likes, 4 Comments, 21 Oct 2025, Hugging Face

Join the discussion on this paper page

huggingface.co

From the LocalLLaMA community on Reddit

Paper @paper.bsky.social · 1d

(1/2) 94 Likes, 22 Comments, 21 Oct 2025, Reddit

Explore this post and more from the LocalLLaMA community

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Paper @paper.bsky.social · 1d

[29/30] 146 Likes, 26 Comments, 2 Posts
2510.17800, cs.CV | cs.CL | cs.LG, 21 Oct 2025

🆕Glyph: Scaling Context Windows via Visual-Text Compression

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongni...

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning.

However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs.

In this work, we take a different perspective, visual context scaling, to tackle this challenge.

Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs).

This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression.

Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks.

This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training.

Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks.

In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding.

Our code and model are released at https://github.com/thu-coai/Glyph.
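
The compression figures in the abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an illustration under stated assumptions: it treats "128K" as 128,000 and "1M" as 1,000,000 tokens, assumes the whole context window holds rendered text, and uses hypothetical helper names that do not come from the Glyph code base.

```python
def effective_text_tokens(vlm_context_tokens: int, compression_ratio: float) -> int:
    """Text tokens representable if each visual token stands for
    `compression_ratio` text tokens (assumed to be uniform)."""
    return int(vlm_context_tokens * compression_ratio)

def required_ratio(text_tokens: int, vlm_context_tokens: int) -> float:
    """Compression ratio needed to fit `text_tokens` into the VLM context."""
    return text_tokens / vlm_context_tokens

VLM_CONTEXT = 128_000   # assumption: "128K" read as 128,000 tokens
TARGET_TEXT = 1_000_000  # assumption: "1M" read as 1,000,000 tokens

for ratio in (3, 4):
    print(ratio, effective_text_tokens(VLM_CONTEXT, ratio))

print(required_ratio(TARGET_TEXT, VLM_CONTEXT))
```

Under these assumptions, 3-4x compression covers roughly 384K to 512K text tokens, so reaching the 1M-token level would need a ratio near 8x, which is consistent with the abstract describing that regime as extreme compression.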



Paper @paper.bsky.social · 2d

2510.15742
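Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data.

We introduce Ditto, a holistic framework designed to tackle this fundamental challenge.

At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models.

To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer.

Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale.

Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples.

We trained our model, Editto, on Ditto-1M with a curriculum learning strategy.

The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.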

Paper @paper.bsky.social · 2d

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holist...

Paper page - Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Paper @paper.bsky.social · 2d

(2/2) 43 Likes, 2 Comments, 20 Oct 2025, Hugging Face

Join the discussion on this paper page

huggingface.co

From the StableDiffusion community on Reddit: EDitto -a video editing model released ( safetensors available on huggingface ) ; lot of examples on project page.

Paper @paper.bsky.social · 2d

(1/2) 205 Likes, 11 Comments, 20 Oct 2025, Reddit

Explore this post and more from the StableDiffusion community

Paper @paper.bsky.social · 2d

[15/30] 248 Likes, 13 Comments, 2 Posts
2510.15742, cs.CV, 17 Oct 2025

🆕Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Sh...

Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data.

We introduce Ditto, a holistic framework designed to tackle this fundamental challenge.

At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models.

To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence.

Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale.

Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples.

We trained our model, Editto, on Ditto-1M with a curriculum learning strategy.

The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

VISTA: A Test-Time Self-Improving Video Generation Agent

Paper @paper.bsky.social · 2d

2510.15831
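Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts.

Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video.

In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation by refining prompts in an iterative loop.

VISTA first decomposes a user idea into a structured temporal plan.

After generation, the best video is identified through a robust pairwise tournament.

This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity.

Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle.

Experiments on single-scene and multi-scene video generation scenarios show that, while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to a 60% pairwise win rate against state-of-the-art baselines.

Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.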
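
The pairwise tournament step can be pictured with a small selection routine. This is a hedged sketch, not VISTA's actual procedure: the sequential champion-versus-challenger format, the rule that a challenger must win under both presentation orders, and the `prefers_first` judge are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def pairwise_tournament(
    candidates: Sequence[T],
    prefers_first: Callable[[T, T], bool],
) -> T:
    """Pick a winner by sequential pairwise comparisons.

    A challenger replaces the champion only if the judge prefers it in
    both presentation orders, which guards against a judge that simply
    favors whichever item is shown first.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    champion = candidates[0]
    for challenger in candidates[1:]:
        wins_shown_first = prefers_first(challenger, champion)
        wins_shown_second = not prefers_first(champion, challenger)
        if wins_shown_first and wins_shown_second:
            champion = challenger
    return champion

# Hypothetical candidates: (video id, quality score from some judge model).
videos = [("v1", 0.4), ("v2", 0.9), ("v3", 0.7)]

def score_judge(a, b) -> bool:
    """Toy judge that prefers the higher score."""
    return a[1] > b[1]

def biased_judge(a, b) -> bool:
    """Toy judge that always prefers whichever item is shown first."""
    return True

print(pairwise_tournament(videos, score_judge))
print(pairwise_tournament(videos, biased_judge))
```

With the score-based judge the highest-scoring video wins; with the position-biased judge no challenger ever wins both orders, so the first candidate is kept rather than being displaced by noise.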

Paper @paper.bsky.social · 2d

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, s...