Paper Radar Daily | 2026-05-18

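One-line takeaway: the Top 8 of today's 45 candidates put three threads on the table on the same day: "reasoning evaluation under audit" (ProofGrid + Physics-R1 + CoRD), "inference hardware adaptation" (GQLA + Follow the Mean), and "the agent / world-model layer" (Pinductor + MMSkills). The strongest single observation is that ProofGrid moves LLM reasoning evaluation from "is the answer right" to "can the process be machine-verified".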

Summary

After cross-referencing today's 45 HF Daily curated candidates against arXiv (primary) and Semantic Scholar (metadata), the Top 8 cluster around three threads:

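Reasoning evaluation under audit: ProofGrid (2605.12524) replaces final-answer scoring with machine-checkable proofs; Physics-R1 (2605.14040) runs a three-stage audit that reveals 134 near-duplicate examples between UGPhysics-Train / SciInstruct / MMK12 and public evals; CoRD (2605.02290) distills long-chain reasoning through collaborative step-wise multi-teacher decoding. Appearing on the same day, the three papers bring both the credibility of reasoning evaluation and the quality of synthetic traces into the open.
Inference hardware adaptation: GQLA (2605.15250) goes straight at the pain point that MLA in DeepSeek-V2/V3 can only take the absorbed-MQA path on H100, restoring head-axis tensor parallelism and MTP gains, which is directly relevant to export-controlled hardware such as the H20; Follow the Mean (2605.10302) proposes a new interface for controllable generation, "swap the reference set instead of updating parameters", which bypasses the usual fine-tuning path.
Agent / world-model layer: Pinductor (2605.13740) uses an LLM as a structural prior to propose POMDP candidates; MMSkills (2605.13527) extends reusable procedural knowledge to multimodal visual agents; Dense Metric Depth (2605.15876) has a VLM output dense metric depth directly, closing the gap that text-only supervision leaves in 3D perception.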

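All 45 candidates have seen_before=false, but affiliations are empty across the board and the S2 similar_papers field is missing, so the further-reading section and the new authors / teams section are both downgraded. None of the three primary source channels failed: no arxiv_unavailable, hf_daily_unavailable, or s2_unavailable was raised.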

📌 Top picks (cross-source hits)

Sorted by ranking_score (top 8). Each reason cites ranking_reasons; evidence_links use the URLs carried in the candidate JSON.

2605.12524 Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism - stress-tests LLM reasoning with machine-checkable proofs.
- reason: HF Daily Rank 3, near the top of the day's trending list, plus a double watchlist hit (reasoning/inference) and a built-in benchmark/evaluation bonus; it changes the question from "is the answer right" to "can the process be verified automatically", a direct challenge to current reasoning-evaluation methodology.
- tldr_en: ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone, is introduced and an instrumented proof-checking pipeline… improving measurement resolution and separating proof planning from low-level execution noise.
- links: arXiv · HF · S2
2605.10302 Follow the Mean: Reference-Guided Flow Matching - steers flow matching by swapping the reference set, with no fine-tuning.
- reason: HF Daily Rank 10 plus a watchlist hit (inference/dpo); it proposes a controllable-generation interface that relies on the dataset rather than parameter updates, bypassing the fine-tune / adapter path, with methodological value for preference alignment and test-time control.
- tldr_en: This work instantiates a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows, and points to a broader direction: generative models that adapt through data, not parameter updates.
- links: arXiv · HF · S2
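2605.15250 GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding - group-query latent attention decouples MLA so that decoding adapts to a range of hardware.
- reason: HF Daily Rank 17 plus a watchlist hit (inference/kv cache); it directly targets the breakdown of DeepSeek-V2/V3's MLA on non-H100 hardware (such as H20) and restores head-axis tensor parallelism and MTP gains, with a clear effect on deployment economics.
- tldr_en: (S2 returned no tldr)
- links: arXiv · HF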
2605.13740 Learning POMDP World Models from Observations with Language-Model Priors - an LLM prior proposes POMDP models and refines them iteratively from a small amount of interaction.
- reason: HF Daily Rank 18 plus a watchlist hit (agent/world-model); it uses an LLM as a structural prior to generate POMDP candidates and refines them iteratively against a belief-based score, a methodological increment for sample-efficient world-model learning.
- tldr_en: This work asks whether language-model priors can reduce costly interaction by leveraging prior knowledge, and introduces Pinductor (POMDP-inductor): an LLM proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score.
- links: arXiv · HF · S2
2605.14040 Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning - audits multimodal physics evaluation, exposing training contamination and translation drift.
- reason: HF Daily Rank 4 plus a watchlist hit (reasoning) and an evaluation bonus; a three-stage audit reveals 134 near-duplicate examples between UGPhysics-Train / SciInstruct / MMK12 and public evals, directly challenging the credibility of current multimodal physics evaluation.
- tldr_en: The multimodal-physics evaluation pipeline is audited end-to-end and three undetected construction practices that distort how the field measures vision-language reasoning are documented: train-eval contamination, translation drift, and MCQ saturation.
- links: arXiv · HF · S2
2605.13527 MMSkills: Towards Multimodal Skills for General Visual Agents - reusable multimodal skills that support runtime decisions in visual agents.
- reason: HF Daily 99 upvotes, a hot curated entry for the day, plus a double watchlist hit (agent/inference); it extends "reusable procedural knowledge" from text/code to multimodal inputs, a methodological increment on the skill-library route for visual agents.
- tldr_en: MMSkills are introduced, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making and a branch-loaded multimodal skill agent is introduced…
- links: arXiv · HF · S2
2605.02290 Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding - distills long-chain reasoning through collaborative step-wise multi-teacher decoding.
- reason: HF Daily Rank 21 with 32 upvotes plus a double watchlist hit (reasoning/inference); it addresses the waste in post-hoc trace selection by proposing collaborative step-wise decoding, a methodological increment for the data-synthesis path in long-chain distillation.
- tldr_en: CoRD is introduced, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search…
- links: arXiv · HF · S2
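2605.15876 Unlocking Dense Metric Depth Estimation in VLMs - has a VLM produce dense metric depth directly.
- reason: HF Daily Rank 26 plus a watchlist hit (reasoning/inference) and a benchmark bonus; without distilling from an external vision model, the VLM learns dense metric depth directly, closing the gap that text-only supervision leaves in 3D perception.
- tldr_en: (S2 returned no tldr)
- links: arXiv · HF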
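
The selection rule stated for this section (sort by ranking_score, keep the first 8) can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the function name `top_picks`, the `arxiv_id` key, and the sample scores are hypothetical; only the `ranking_score` field name comes from this digest.

```python
from typing import Any


def top_picks(candidates: list[dict[str, Any]], k: int = 8) -> list[dict[str, Any]]:
    """Return the k highest-scoring candidates, best first."""
    # sorted() is stable, so candidates with equal scores keep their input order.
    ranked = sorted(candidates, key=lambda c: c["ranking_score"], reverse=True)
    return ranked[:k]


# Hypothetical scores, for illustration only.
sample = [
    {"arxiv_id": "a", "ranking_score": 0.4},
    {"arxiv_id": "b", "ranking_score": 0.9},
    {"arxiv_id": "c", "ranking_score": 0.7},
]
picked = [c["arxiv_id"] for c in top_picks(sample, k=2)]
```

With the sample above, `picked` holds the two best-scoring ids in descending score order.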

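🏷 Watchlist category hits

The three main themes, inference / agent / reasoning, each have 9–10 hits; world-model forms a small cluster of 3; the remaining keywords have only scattered supporting hits.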

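inference (10): already in Top picks are ProofGrid, Follow the Mean, GQLA, and CoRD; MMSkills also hits. The shared concerns are test-time efficiency and controllability.
agent (10): already in Top picks are Pinductor and MMSkills; supporting hits are MetaAgent-X (end-to-end RL breaking through the multi-agent ceiling), Known By Their Actions (fingerprinting LLM browser agents from UI trajectories), and Solvita (agentic evolution that improves LLM competitive programming).
reasoning (9): already in Top picks are ProofGrid, Physics-R1, CoRD, and Dense Metric Depth; supporting hit: Solvita. Today's main theme is moving reasoning evaluation onto a machine-verifiable, audited track.
world model (3): already in Top picks is Pinductor; supporting hit: WorldAct (activating an entire 3D world as an interactive, object-centric scene); ReactiveGWM (a reactive-NPC game world model) is a marginal hit.
kv cache (2): main hit is GQLA; FashionChameleon is only a marginal hit (real-time human-garment video).
moe / quantization / dpo / vla: single supporting hits only, namely HodgeCover (sparse MoE compression driven by higher-order topological coverage), InsightTok (discrete tokenization, quantization), Follow the Mean (dpo), and PhysBrain 1.0 / MobileEgo Anywhere (vla).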

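🔗 Further reading (Semantic Scholar similar papers)

This section has no high-confidence incremental signal today (Semantic Scholar returned no similar papers).

S2 returned s2_tldr and s2_url for 26 of the 45 candidates this run, but the similar_papers field is missing for all of them. Under the SKILL hard constraint, no extra WebFetch / curl calls are issued, so the further-reading section is downgraded to empty.
Recorded coverage_gaps: s2_similar_unavailable. To expand manually, start from the s2_url of each Top pick.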

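🧑‍🔬 Newly appearing authors / teams

Today's discovery scan found no candidates that meet the bar.

The affiliations field is empty for all 45 candidates in this fetch. Per discovery_rules.md, institutional affiliation is the key signal for assigning a paper to team groups such as frontier-labs / oss-ai-labs, so the rules cannot be applied while it is missing.
Recorded coverage_gaps: affiliations_missing. No new-author entries are forced in just to fill the section. If arXiv affiliation parsing recovers on the next paper_fetch.py run, this section will come back into effect automatically.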

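📉 Coverage gaps and uncertainty

s2_similar_unavailable: S2 metadata returned s2_tldr / s2_url but no similar_papers; the further-reading section is downgraded to empty.
affiliations_missing: the affiliations field is empty for all 45 candidates; the new authors / teams section is downgraded. Top picks ordering is unaffected (ranking_score does not depend on the institutional signal).
No primary channel failure: arXiv / HF Daily / Semantic Scholar are all healthy, and none of arxiv_unavailable / hf_daily_unavailable / s2_unavailable was triggered. The candidate JSON is 111 KB with 45 candidates, all seen_before=false, and there are no collisions against the 254 entries in the 14-day seen-pool.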
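
The two gap flags recorded above could be derived from the candidate list roughly as follows. This is a hedged sketch rather than the pipeline's real logic: the field names `affiliations`, `similar_papers`, and `s2_url` appear in this digest, but the function `coverage_gaps` and the exact trigger conditions (every candidate affected) are assumptions made for illustration.

```python
from typing import Any


def coverage_gaps(candidates: list[dict[str, Any]]) -> list[str]:
    """Derive coverage-gap flags from a candidate list (assumed rules)."""
    gaps: list[str] = []
    # Assumed rule: flag only when affiliations are empty for every candidate.
    if candidates and all(not c.get("affiliations") for c in candidates):
        gaps.append("affiliations_missing")
    # Assumed rule: among candidates that S2 did resolve (they carry an s2_url),
    # flag when none of them has a similar_papers payload.
    resolved = [c for c in candidates if c.get("s2_url")]
    if resolved and all(not c.get("similar_papers") for c in resolved):
        gaps.append("s2_similar_unavailable")
    return gaps


# Hypothetical records shaped like this run: no affiliations, no similar_papers.
sample = [
    {"affiliations": [], "s2_url": "placeholder-url", "similar_papers": None},
    {"affiliations": [], "s2_url": None},
]
gaps = coverage_gaps(sample)
```

Keeping each flag as a separate check means a section can be downgraded on its own without touching the Top picks ordering.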

Sources and cross-validation notes

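Tier	Source	Role
primary	arXiv	Evidence for the paper itself; all 45 abs/pdf URLs resolved this run
curated	HF Daily Papers	Popularity signal (`hf_upvotes`/`hf_trending_rank`); not used as evidence for methodological conclusions
metadata	Semantic Scholar	Provides `s2_tldr` and `s2_url` for 26 of 45 papers; the `similar_papers` field is missing this run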

Conflict priority: primary > metadata > curated > other. Today's Top picks conclusions are all anchored on arXiv primary (abstract) plus Semantic Scholar metadata (s2_tldr); HF Daily is used only for ranking and exposure signals.
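
The priority order above can be read as a simple tie-breaking rule when sources disagree about a field. The sketch below is illustrative only: `TIER_PRIORITY` mirrors the order stated in this digest, while the `resolve` function, the claim dictionaries, and the choice to rank unknown tiers last are hypothetical.

```python
from typing import Any

# Order taken from the digest: primary > metadata > curated > other.
TIER_PRIORITY = ["primary", "metadata", "curated", "other"]


def resolve(claims: list[dict[str, Any]]) -> dict[str, Any]:
    """Pick the claim that comes from the highest-priority tier."""
    rank = {tier: i for i, tier in enumerate(TIER_PRIORITY)}
    # Assumption: a tier not in the list ranks below every known tier.
    return min(claims, key=lambda c: rank.get(c["tier"], len(TIER_PRIORITY)))


# Hypothetical conflicting values for one field of one paper.
claims = [
    {"tier": "curated", "value": "from HF Daily"},
    {"tier": "metadata", "value": "from s2_tldr"},
    {"tier": "primary", "value": "from abstract"},
]
winner = resolve(claims)
```

Here `winner` is the claim tagged `primary`, matching the rule that conclusions are anchored on the arXiv abstract first.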

Hanzhi's BLOG

[Papers · 2026-05-18]