Paper Radar Daily | 2026-05-12

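One-line takeaway: today's main thread at the paper level is inference cost and KV-cache redesign (MELT, SlimSpec, ConQuR, and GCAD all landing on the same day), plus agent benchmarks and multi-agent collaboration (TMAS, AssayBench, PhoneSafety). The papers at HF Daily Ranks 2, 6, 12, 13, and 15 are all Top picks. Because S2 similar_papers and the HF affiliation field are both missing, the extended-reading and new-author sections are downgraded.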

Summary

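Inference work arrives in a cluster: the looped Transformer MELT decouples reasoning depth from KV-cache memory, SlimSpec adds low-rank compression to the draft LM head in speculative decoding, ConQuR repairs activation quantization with corner-aligned rotations, and GCAD decouples activation steering from KV-cache contamination. All four share one goal: reason deeper without paying linearly in memory or speed.
Agent benchmarks move into a cross-domain phase: TMAS uses multi-agent synergy for test-time scaling, AssayBench takes LLM agents to virtual-cell phenotypic screens, and PhoneSafety splits "avoiding harm" into two cases, genuine understanding versus mere incapability. Together they push agent evaluation from abstract tasks toward concrete settings (biology, mobile, collaboration).
All three sources returned normally: arXiv supplied 141 candidates with full abstracts, HF Daily supplied 50 papers with upvote signals, and S2 returned 10 tldrs but no similar_papers at all. 5 of the Top 8 are in HF Daily's trending top 15 for the day.
The candidates do not overlap with any of the 40 Top picks from the last 5 issues (2026-05-06 → 2026-05-11); the 14-day rolling seen-pool window has a 0% hit rate against today's 8 Top picks.
Gaps: the HF JSON carries no affiliation, S2 returned no similar_papers, and the tracked_authors list has no hits. The new-author section is therefore empty, and the extended-reading section is downgraded to "no high-confidence incremental signal".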
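The seen-pool check mentioned in the summary can be illustrated with a small sketch. It assumes the pool is a mapping from arXiv ID to the date the paper was last reported; that shape, the function name `seen_pool_hit_rate`, and the placeholder IDs are all hypothetical and are not taken from the actual pipeline.

```python
from datetime import date, timedelta


def seen_pool_hit_rate(top_pick_ids, seen_pool, today, window_days=14):
    """Fraction of today's Top picks already reported inside the rolling window.

    seen_pool maps an arXiv ID to the date it was last reported
    (a hypothetical data shape chosen for this sketch).
    """
    cutoff = today - timedelta(days=window_days)
    # Keep only IDs whose last report falls inside [cutoff, today).
    recent = {pid for pid, seen_on in seen_pool.items() if cutoff <= seen_on < today}
    if not top_pick_ids:
        return 0.0
    hits = sum(1 for pid in top_pick_ids if pid in recent)
    return hits / len(top_pick_ids)


# Placeholder IDs, for illustration only.
pool = {"id-a": date(2026, 5, 6), "id-b": date(2026, 4, 1)}
print(seen_pool_hit_rate(["id-c", "id-d"], pool, date(2026, 5, 12)))  # 0.0
```

A hit rate of 0.0 corresponds to the "0%" reported above: none of the day's picks were found in the window.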

📌 Top picks (cross-source hits)

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy (HF Daily Rank 12; 42 upvotes | George Wu, Nan Jing, Qing Yi et al., 10 authors | ranking hf_trending_rank:12, watchlist_keyword:reasoning,agent,inference, nice_to_have:benchmark)
- One-line read: multi-agent synergy scales test-time compute to improve reasoning performance robustly.
- Why selected: HF Daily Rank 12 with 42 upvotes, the highest upvote count among today's Top picks; hits three watchlist keywords (agent, reasoning, inference) and ships its own benchmark.
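ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs (cs.LG | Chayne Thrash, Ali Abbasi, Soheil Kolouri | ranking watchlist_keyword:reasoning,quantization,inference, nice_to_have:benchmark)
- One-line read: corner-aligned rotation matrices make low-bit activation quantization nearly lossless.
- Why selected: high score within a single category (cs.LG); hits both quantization and inference on the watchlist; a corner-alignment improvement that follows the RTN (round-to-nearest) line of work.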
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding (HF Daily Rank 13; 5 upvotes | Anton Plaksin, Sergei Krutikov, Sergei Skvortsov et al., 4 authors | ranking hf_trending_rank:13, watchlist_keyword:inference,speculative decoding, nice_to_have:benchmark)
- One-line read: a low-rank LM head halves draft-side overhead in speculative decoding.
- Why selected: HF Daily Rank 13; hits the speculative decoding and inference keywords, and offers a first-order cost-reduction path for the draft-side LM head.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents (cs.LG / cs.AI / q-bio.QM | Edward De Brouwer, Carl Edwards, Alexander Wu et al., 12 authors | ranking watchlist_keyword:agent,dpo, nice_to_have:benchmark,fine-tuning,evaluation)
- One-line read: an LLM evaluation benchmark for in silico phenotypic screening of virtual cells.
- Why selected: an agent-benchmark hit and a cross-disciplinary biology benchmark spanning cs.LG / cs.AI; continues the agent-benchmark thread from the 05-06 report.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents (cs.CL | Shijue Huang, Hangyu Guo, Chenxin Li et al., 10 authors | ranking watchlist_keyword:reasoning,agent, nice_to_have:benchmark,sft,fine-tuning)
- One-line read: on-policy data evolution drives visual-native multimodal deep search agents.
- Why selected: high score within a single category (cs.CL); matches both on-policy and multimodal deep search, and comes with SFT and a benchmark.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models (HF Daily Rank 15; 18 upvotes | Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo et al., 6 authors | ranking hf_trending_rank:15, watchlist_keyword:reasoning,kv cache)
- One-line read: MELT decouples reasoning depth from memory in looped Transformers.
- Why selected: HF Daily Rank 15 with 18 upvotes; one of the two Top picks that come with an s2_tldr; a KV-cache-themed derivative of the Ouro line.
- S2 tldr：Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption, is proposed and shown that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro’s.
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents (HF Daily Rank 6; 0 upvotes | Zhengyang Tang, Yi Zhang, Chenxin Li et al., 21 authors | ranking hf_trending_rank:6, watchlist_keyword:agent, nice_to_have:benchmark,evaluation)
- One-line read: PhoneSafety probes whether a phone agent's safety reflects genuine understanding or mere incapability.
- Why selected: HF Daily Rank 6 with an s2_tldr available; on the evaluation thread, it decouples agent safety metrics from capability metrics.
- S2 tldr：PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps, reveals two main patterns: stronger general phone-use ability does not reliably imply safer choices at risky moments, and models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions (HF Daily Rank 2; 7 upvotes | Diancheng Kang, Zheyuan Liu, Ningshan Ma et al., 6 authors | ranking hf_trending_rank:2, watchlist_keyword:inference, nice_to_have:benchmark)
- One-line read: GCAD fixes the KV-cache contamination problem in activation steering.
- Why selected: HF Daily Rank 2, the highest HF rank among today's Top picks; hits the inference watchlist keyword and fixes KV-cache contamination.
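The `ranking` strings attached to each entry above follow a visible pattern: signal groups separated by a comma plus space, and values inside a group separated by a bare comma. A minimal parsing sketch follows, assuming that pattern holds for every entry; the function name `parse_ranking` is hypothetical and the format is inferred from this page rather than from any documented contract.

```python
def parse_ranking(tag_string):
    """Split a ranking string into {signal_name: [values]}.

    Assumes groups are separated by ", " and values inside a group by ","
    (inferred from the entries on this page, not a documented format).
    """
    signals = {}
    for group in tag_string.split(", "):
        name, _, values = group.partition(":")
        signals[name.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return signals


example = (
    "hf_trending_rank:13, "
    "watchlist_keyword:inference,speculative decoding, "
    "nice_to_have:benchmark"
)
print(parse_ranking(example))
```

Multi-word values such as `speculative decoding` survive because the split only happens on a comma followed by a space, never on a space alone.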

🏷 Watchlist hits by category

Grouped by arXiv primary category (candidates that hit a watchlist keyword today but did not make the Top picks, at most 4 per category).

cs.LG

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices - sparse MoE that keeps near-dense performance on end-side devices. (HF Rank 19 | 1 upvote | hf_trending_rank:19, watchlist_keyword:moe,inference)
Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities - uses likelihood scoring to measure how plausible a mathematical continuation is. (watchlist_keyword:reasoning,inference, nice_to_have:benchmark)
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization - MASS-DPO strengthens preference alignment through active selection of multiple negative samples. (watchlist_keyword:dpo,preference optimization, nice_to_have:benchmark)
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning (HF Rank 36 | 11 upvotes | watchlist_keyword:agent,inference)

cs.AI

The Generalized Turing Test: A Foundation for Comparing Intelligence - a generalized Turing test as a framework for comparing intelligent agents. (watchlist_keyword:reasoning,agent, nice_to_have:benchmark,evaluation)
MaD Physics: Evaluating information seeking under constraints in physical environments - MaD Physics examines information seeking under constraints. (watchlist_keyword:reasoning,agent, nice_to_have:benchmark)
The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning - studies the nonlinear impact of misleading information on long-context reasoning. (watchlist_keyword:reasoning,agent)
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD (watchlist_keyword:reasoning, nice_to_have:benchmark,fine-tuning)

cs.CV

Is Your Driving World Model an All-Around Player? - a comprehensive evaluation of the all-around capabilities of driving world models. (watchlist_keyword:agent,world model, nice_to_have:benchmark,evaluation)
PhyGround: Benchmarking Physical Reasoning in Generative World Models - PhyGround: a physical-reasoning benchmark for generative world models. (watchlist_keyword:reasoning,world model, nice_to_have:benchmark,evaluation)
Personal Visual Context Learning in Large Multimodal Models - personal visual context learning extends multimodal models. (watchlist_keyword:agent,inference, nice_to_have:benchmark)
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models (HF Rank 14 | 0 upvotes | hf_trending_rank:14, watchlist_keyword:vla, nice_to_have:sft)

cs.CL

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization - DGPO uses directional consistency to go beyond pairwise preference comparison. (watchlist_keyword:reasoning,preference optimization, nice_to_have:benchmark)
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement - training-free cultural alignment achieved through persona disagreement. (watchlist_keyword:agent,inference, nice_to_have:fine-tuning)
Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding - more from less data: exploiting counterfactuals for chart understanding. (watchlist_keyword:preference optimization, nice_to_have:benchmark,sft,fine-tuning)
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation (watchlist_keyword:agent, nice_to_have:benchmark,evaluation)
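The grouping rule stated at the top of this section (watchlist hits only, Top picks excluded, at most 4 per arXiv primary category) can be sketched as follows. The record fields `id`, `title`, `primary_category`, and `watchlist_keywords` are assumed names for illustration, and the sketch assumes the candidates arrive already sorted by score.

```python
from collections import defaultdict


def group_watchlist_hits(candidates, top_pick_ids, per_category=4):
    """Group non-Top-pick watchlist hits by primary category, capped per category.

    candidates is assumed to be pre-sorted by descending score; the field
    names used here are hypothetical.
    """
    grouped = defaultdict(list)
    for paper in candidates:
        # Skip Top picks and papers with no watchlist keyword.
        if paper["id"] in top_pick_ids or not paper["watchlist_keywords"]:
            continue
        bucket = grouped[paper["primary_category"]]
        if len(bucket) < per_category:
            bucket.append(paper["title"])
    return dict(grouped)


# Placeholder records, for illustration only.
sample = [
    {"id": "p1", "title": "Paper 1", "primary_category": "cs.LG", "watchlist_keywords": ["agent"]},
    {"id": "p2", "title": "Paper 2", "primary_category": "cs.LG", "watchlist_keywords": []},
    {"id": "p3", "title": "Paper 3", "primary_category": "cs.CV", "watchlist_keywords": ["vla"]},
]
print(group_watchlist_hits(sample, top_pick_ids={"p1"}))  # {'cs.CV': ['Paper 3']}
```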

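🔗 Extended reading (Semantic Scholar similar papers)

No high-confidence incremental signal in this section today (S2 returned no similar papers): 0 of today's 141 candidates received a similar_papers field, and among the Top picks only 2605.07721 (MELT) and 2605.07630 (PhoneSafety) obtained an S2 tldr, likewise without a similar-paper list. Under the hard constraint, today's extended reading is extended_reading: [], and expansion waits until the S2 index catches up in the next batch. See the coverage gaps section.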

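🧑‍🔬 Newly appearing authors / teams

Today's discovery scan found no qualifying candidates: the HF Daily JSON carries no affiliation, S2 returned no affiliation field, and arXiv abstracts have no structured institution labels. Under the hit rule in references/discovery_rules.md (institution + cross-source verification), no candidate today satisfies "first watchlist hit within the past 48h with ≥1 piece of affiliation evidence". The tracked_authors list has no hits either. See the coverage gaps section.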
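The quoted criterion can be read as a simple predicate. The sketch below paraphrases only the sentence quoted above; the actual contents of references/discovery_rules.md are not shown on this page, the cross-source verification step is not modelled, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def qualifies_as_new_author(first_hit_at, affiliation_evidence, now, window_hours=48):
    """True if the first watchlist hit is within the window and evidence exists.

    A paraphrase of the quoted rule only; cross-source verification is
    deliberately left out of this sketch.
    """
    age = now - first_hit_at
    within_window = timedelta(0) <= age <= timedelta(hours=window_hours)
    return within_window and len(affiliation_evidence) >= 1


now = datetime(2026, 5, 12, 12, 0)
# With no affiliation evidence from any source, nobody qualifies.
print(qualifies_as_new_author(datetime(2026, 5, 11, 12, 0), [], now))  # False
```

This matches the situation described above: because every source returned an empty affiliation list, the predicate fails for all candidates regardless of when they first hit the watchlist.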

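📉 Coverage gaps and uncertainty

s2_similar_unavailable - the Semantic Scholar Graph API returned no similar_papers field for any of today's 141 candidates, leaving the extended-reading section without seeds to expand from. The same problem appeared in the 2026-05-06, 05-07, and 05-08 issues and is suspected to be S2 index throttling within the batch; it should recover automatically once S2 backfills.
hf_affiliation_missing - the HF Daily JSON contract still returns author names only, with no affiliation field. Combined with the lack of structured extraction of institutions from arXiv abstracts, this leaves the new_authors and tracked_labs_seen sections downgraded to empty.
s2_tldr_partial:8/141 - S2 returned a tldr for only 8 papers (2 of them Top picks); tldr_en is empty for the remaining candidates, so the tldr_en field is sparse across this issue's Top picks (6/8 left empty), in line with the constraint "do not translate or compose S2 tldrs yourself".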
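A flag of the form `s2_tldr_partial:8/141` can be derived by counting candidates with a non-empty `tldr_en`. The sketch below shows one way to do that, assuming each candidate is a dict with an optional `tldr_en` key; the function name `tldr_coverage_flag` and the choice to return `None` on full coverage are assumptions of this example.

```python
def tldr_coverage_flag(candidates):
    """Return 's2_tldr_partial:<with>/<total>' when some tldrs are missing.

    Returns None when every candidate has a tldr (an assumption of this
    sketch; the real pipeline may behave differently).
    """
    total = len(candidates)
    with_tldr = sum(1 for c in candidates if c.get("tldr_en"))
    if with_tldr == total:
        return None
    return f"s2_tldr_partial:{with_tldr}/{total}"


# 8 candidates with a tldr and 133 without, mirroring the counts in the flag above.
batch = [{"tldr_en": "text"}] * 8 + [{"tldr_en": ""}] * 133
print(tldr_coverage_flag(batch))  # s2_tldr_partial:8/141
```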

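Sources and cross-validation notes

arXiv primary (all 141 entries carry abstracts) + HF Daily curated (50 entries with upvotes, the top 25 ranked as trending) + Semantic Scholar metadata (10 entries returned an s2_tldr, 0 returned similar_papers). Conclusions are anchored on arXiv abstracts; HF serves only as a popularity signal and S2 only as a supplement for tldr and venue.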

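All Top picks take their abstracts and author lists from arXiv primary; 5 also appear in the HF Daily curated list (Ranks 2/6/12/13/15), which gives cross-source validation; 2 (2605.07721, 2605.07630) obtained an S2 tldr as a secondary metadata check. No paper appears in the 14-day seen-pool (seen_before=false for all), and no ID overlaps with the historical Top picks from 2026-05-06 → 2026-05-11 (the arxiv_id ranges come from different fetch batches).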

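confidence_flags: no_seen_pool_overlap, primary_anchored:all_top_picks, no_frontier_lab_named_signal. All Top picks were re-checked against their arXiv abstracts with no source conflicts; HF Daily is used only as a popularity reference, and trending rank is not treated as evidence for a paper's results. Watchlist keyword counts: agent×36 / reasoning×32 / inference×22 / vla×6 / world model×4, consistent with the main threads of the last 5 issues.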

Hanzhi's BLOG