Paper Radar Daily ｜ 2026-05-08

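One-line takeaway: HF Daily ∩ Semantic Scholar yields 43 candidates (degraded run: the arXiv listing timed out). The main threads are reasoning (13) and agent (9), and the top 3 all use agents to rewire the retrieval / tool / world loop: RemoteZero pushes GRPO into coordinate-free geospatial reasoning, DCI lets an agent query the raw corpus from the command line and bypass embeddings, and CreativityBench shows that affordance repurposing remains a weak spot for current models.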

Summary

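Candidate pool: 43 entries, all cross-hits from HF Daily ∩ Semantic Scholar (the arXiv listing endpoint timed out).
S2 coverage: 43/43 have a paper_id; 40/43 carry an s2_tldr.
Seen-pool: 0 hits in the 14-day rolling pool, so there is no dedup pressure today.
Topic distribution: reasoning 13, agent 9, inference 6, dpo 2, preference optimization 1, quantization 1, moe 1.
Key observation: the top 3 (RemoteZero / DCI / CreativityBench) all use agents as the vehicle for rewriting a traditional interface (annotation, retrieval, tools). In today's HF Daily trending, the "add a reranker or SFT data on top of a fixed interface" paradigm is giving way to this approach.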
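
The candidate-pool and seen-pool figures above can be read as a set intersection followed by a rolling-window filter. The sketch below is only an illustration under assumptions: the function name `build_candidates`, the shape of `seen_pool` (a mapping from paper ID to the date it was last reported), and the inclusive window boundary are all hypothetical and are not taken from the actual pipeline.

```python
from datetime import date, timedelta


def build_candidates(hf_ids, s2_ids, seen_pool, today, window_days=14):
    """Intersect two ID lists, then split by a rolling seen-pool window.

    hf_ids / s2_ids: iterables of paper IDs from each channel (assumed shape).
    seen_pool: dict mapping paper ID -> date it was last reported (assumed shape).
    Returns (fresh, seen_hits): IDs to report and IDs suppressed as duplicates.
    """
    # Cross-hits are papers present in both channels.
    cross_hits = sorted(set(hf_ids) & set(s2_ids))

    # Only entries reported within the last `window_days` count as seen.
    cutoff = today - timedelta(days=window_days)
    recent_seen = {pid for pid, seen_on in seen_pool.items() if seen_on >= cutoff}

    seen_hits = [pid for pid in cross_hits if pid in recent_seen]
    fresh = [pid for pid in cross_hits if pid not in recent_seen]
    return fresh, seen_hits


# Hypothetical IDs, for illustration only.
fresh, seen_hits = build_candidates(
    hf_ids=["a", "b", "c"],
    s2_ids=["a", "b", "d"],
    seen_pool={"a": date(2026, 5, 1), "b": date(2026, 4, 1)},
    today=date(2026, 5, 8),
)
```

With an empty `seen_hits` list, as in today's run, every cross-hit passes through to ranking unchanged.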

📌 Top picks (cross-hits)

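2605.04451 RemoteZero: Geospatial Reasoning with Zero Human Annotations ｜ Annotation-free geospatial reasoning framework; semantic self-verification drives GRPO self-evolution.
- reason：HF trending #13 + hits on three keywords (reasoning/inference/dpo); extends RL post-training to Earth observation, a field that has long depended on hand-labeled coordinates, and the methodology has spillover value for spatial agents.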
- score 7.7 ｜ ranking_reasons：hf_trending_rank:13 + watchlist_keyword:reasoning,inference,dpo
- tldr_en：RemoteZero is introduced, a box-supervision-free framework for geospatial reasoning that replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations, and supports iterative self-evolution.
- links：arXiv ｜ HF ｜ S2
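2605.05242 Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction ｜ The agent queries the raw corpus directly with terminal tools, bypassing embedding retrieval.
- reason：HF trending #3 + reasoning/agent + benchmark; overturns the "top-k first, then reason" paradigm from the retrieval-interface side, a head-on challenge to the agentic search engineering stack.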
- score 7.2 ｜ ranking_reasons：hf_trending_rank:3 + watchlist_keyword:reasoning,agent + nice_to_have:benchmark
- tldr_en：Direct corpus interaction (DCI) is studied, where an agent searches the raw corpus directly with general-purpose terminal tools, without any embedding model, vector index, or retrieval API.
- links：arXiv ｜ HF ｜ S2
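2605.02910 CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing ｜ Benchmark for creative tool repurposing that exposes a weak spot in current models.
- reason：three watchlist keywords hit at once (reasoning/agent/inference) + benchmark/evaluation; 18 HF upvotes, the most for any single paper today; a candidate standard evaluation for agent planning modules.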
- score 7.0 ｜ ranking_reasons：watchlist_keyword:reasoning,agent,inference + nice_to_have:benchmark,evaluation
- tldr_en：The results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence.
- links：arXiv ｜ HF ｜ S2
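2605.05758 BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models ｜ Biomedical tool-calling dataset; expert evaluation shows a significant gain in answer quality.
- reason：HF trending #5 + agent + fine-tuning/evaluation; brings the general-purpose tool-calling dataset approach down to biomedicine, uses expert rather than automatic evaluation, and fills a gap in scarce vertical-domain agent training data.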
- score 5.5 ｜ ranking_reasons：hf_trending_rank:5 + watchlist_keyword:agent + nice_to_have:fine-tuning,evaluation
- tldr_en：Human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage.
- links：arXiv ｜ HF ｜ S2
2605.04956 KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels ｜ Triton kernel generation benchmark covering 176 tasks in 15 categories.
- reason：HF trending #7 + watchlist:quantization + benchmark/evaluation; categorizes the failure modes of LLM-generated GPU kernels, and is one of the few public testbeds for the code → compiler → hardware chain.
- score 5.3 ｜ ranking_reasons：hf_trending_rank:7 + watchlist_keyword:quantization + nice_to_have:benchmark,evaluation
- tldr_en：KernelBench-X, a benchmark designed to answer where Triton kernel generation breaks down through category-aware evaluation of correctness and hardware efficiency.
- links：arXiv ｜ HF ｜ S2
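2510.04142 Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments ｜ Treats inter-model reasoning divergence as negative constraints for alignment, with an accompanying benchmark of about 170k trajectories.
- reason：watchlist:reasoning + preference optimization + benchmark + early citations already present (citation_velocity:0.195); brings the concept-drift framework into multi-source MLLM alignment, and CXR-MAX is one of the few public datasets that releases reasoning trajectories from 7 MLLMs at once.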
- score 4.695 ｜ ranking_reasons：watchlist_keyword:reasoning,preference optimization + nice_to_have:benchmark + citation_velocity:0.195
- tldr_en：Autonomous Preference Optimization (APO) treats inter-model divergences as dynamic negative constraints, releasing CXR-MAX (170,982 reasoning trajectories from seven MLLMs).
- links：arXiv ｜ HF ｜ S2
2602.00095 EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions ｜ MLLM evaluation benchmark built on handwritten university-level STEM solutions.
- reason：HF trending #15 + watchlist:reasoning + benchmark/evaluation; fills an evaluation dimension that GPT-4V-style benchmarks miss: real handwriting mixed with formulas and diagrams.
- score 4.5 ｜ ranking_reasons：hf_trending_rank:15 + watchlist_keyword:reasoning + nice_to_have:benchmark,evaluation
- tldr_en：(S2 returned no tldr)
- links：arXiv ｜ HF ｜ S2
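2605.05922 Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling ｜ Video reward model that decouples reasoning from scoring, so CoT quality gains translate directly into reward accuracy.
- reason：HF trending #25 + reasoning/inference; targets the reward-model bottleneck in generative video post-training with a training framework that decouples CoT from the score, more efficient than simply regressing on added MLLM features.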
- score 4.5 ｜ ranking_reasons：hf_trending_rank:25 + watchlist_keyword:reasoning,inference
- tldr_en：DeScore is a training-efficient and generalizable video reward model that independently refines CoT reasoning quality and calibrates the final reward.
- links：arXiv ｜ HF ｜ S2

🏷 Watchlist hits by category

Topic distribution: reasoning 13 / agent 9 / inference 6 / dpo 2 / preference optimization 1 / quantization 1 / moe 1. Reasoning and agent are both hot; inference hits mostly accompany planning and world-model work.

reasoning (additional hits beyond the top picks)

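2605.04330 The Scaling Properties of Implicit Deductive Reasoning in Transformers ｜ With sufficient depth and a bidirectional prefix mask, implicit reasoning can approach explicit CoT performance. (score 4.4)
2605.04077 Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO ｜ Takes token-level means separately within the positive and negative sample subsets of GRPO; a drop-in fix for aggregation bias. (score 4.3)
2605.05566 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration ｜ Lorem-style perturbation as a simple training framework for breaking the exploration bottleneck. (score 2.8)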
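
To make the Balanced Aggregation entry above concrete, here is a minimal sketch of what "token-level means taken separately within the positive and negative subsets" could look like, contrasted with a single pooled token-level mean. It is an illustration under assumptions, not the paper's implementation: the function names, the treatment of zero-advantage samples (ignored), and the rule for combining the two subset means (a plain sum) are all hypothetical.

```python
def pooled_token_mean(token_terms):
    """Baseline: one token-level mean over every token of every response."""
    total = sum(sum(row) for row in token_terms)
    count = sum(len(row) for row in token_terms)
    return total / count if count else 0.0


def balanced_aggregate(token_terms, advantages):
    """Token-level mean computed separately for positive and negative samples.

    token_terms: one list of per-token objective terms per response (assumed
                 to be already weighted by that response's advantage).
    advantages:  one scalar advantage per response.
    Assumption: the two subset means are combined by a plain sum; the paper's
    exact combination rule may differ.
    """
    positive = [row for row, adv in zip(token_terms, advantages) if adv > 0]
    negative = [row for row, adv in zip(token_terms, advantages) if adv < 0]
    return pooled_token_mean(positive) + pooled_token_mean(negative)


# One short positive response and one long negative response.
terms = [[1.0, 1.0], [-1.0] * 8]
advs = [1.0, -1.0]

pooled = pooled_token_mean(terms)          # dominated by the longer response
balanced = balanced_aggregate(terms, advs)  # each subset weighs equally
```

In this toy case the pooled mean is pulled toward the long negative response purely because it has more tokens, while the balanced version gives both subsets equal weight regardless of length.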

agent (additional hits beyond the top picks)

2605.06614 SkillOS: Learning Skill Curation for Self-Evolving Agents ｜ Consistently beats memory-free and memory-based baselines on multi-turn agent tasks and single-turn reasoning tasks. (score 4.2)
2605.06200 A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping ｜ Keeps the IG intrinsic signal but redesigns turn-level clipping. (score 3.2)

inference

2605.06222 When to Trust Imagination: Adaptive Action Execution for World Action Models ｜ The FFDC verifier jointly reasons over future actions, visual dynamics, and real observations. (score 4.1)
2605.04647 ReflectDrive-2: RL-Aligned Self-Editing for Discrete Diffusion Planners ｜ Discrete diffusion planner for autonomous driving with a separate action expert. (score 2.6)

dpo / preference optimization

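2605.06196 The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles ｜ Social-role granularity is a structured, causally steerable direction in latent space. (score 4.0)
2510.04142 (already in Top picks) ｜ APO models inter-model divergence as dynamic negative constraints.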

quantization / GPU kernel

2605.04956 (already in Top picks) ｜ KernelBench-X.

moe

2605.06665 UniPool: A Globally Shared Expert Pool for Mixture-of-Experts ｜ A globally shared expert pool replaces per-layer independent routers. (score 2.4)

🔗 Further reading (Semantic Scholar similar papers)

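No high-confidence incremental signal in this section today (S2 returned no similar papers). See coverage_gaps: ["s2_similar_unavailable"]; no external backfill is done just to pad the section.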

🧑‍🔬 Newly surfaced authors / teams

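Today's discovery scan found no qualifying candidates. The affiliations field in the candidate JSON is empty (HF Daily provides no institutional metadata, and S2 often lacks affiliations for new preprints), so strict matching against tracked_authors / tracked_affiliations is not possible; no author entries are forced in to fill the section.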

Low-confidence observations (leads for the next scan only; not counted as official discoveries):

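The author list of the DCI paper (2605.05242) includes names such as Pan Lu and Shangbin Feng, who have been active in retrieval-augmented LLM work; worth a priority follow-up once affiliation metadata is available.
KernelBench-X authors Jianfei Chen and Jun Zhu are frequent Tsinghua-affiliated bylines; the pairing of GPU kernels with LLM evaluation is rare, so it is worth watching at the next sync-seed.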

📉 Coverage gaps and uncertainty

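arxiv_unavailable: the arXiv listing fetch hit a read timeout this run (see /tmp/paper_fetch.err). The candidate pool degraded to the single HF Daily ∩ S2 channel and may miss same-day arXiv papers that never reached HF Daily; if the next fetch recovers, it should look back over a 24-hour window to pick up the missed entries.
s2_similar_unavailable: S2 returned no similar_papers field this run, so the further-reading section has no cross-link additions; per the SKILL convention, no external backfill.
Some candidate IDs carry non-current timestamps such as 2602.xxxxx / 2510.xxxxx (older papers resurfaced by HF Daily). This does not affect the seen-pool (no hits within the 14-day window), but it shows that today's HF trending list has a sizable long-tail component.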
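
The resurfaced-paper observation above relies on the fact that new-style arXiv identifiers start with a YYMM prefix. A minimal sketch of that check follows; the helper names `submission_month` and `is_resurfaced` are hypothetical, and comparing against the report's calendar month is a simplifying assumption (a paper posted on the last day of the previous month would also be flagged).

```python
import re
from datetime import date

# New-style arXiv IDs: YYMM.NNNN (older) or YYMM.NNNNN (2015 onward).
ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.\d{4,5}$")


def submission_month(arxiv_id):
    """Return (year, month) encoded in a new-style arXiv identifier."""
    match = ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in arXiv ID: {arxiv_id!r}")
    return year, month


def is_resurfaced(arxiv_id, report_date):
    """True when the ID's YYMM prefix differs from the report's month."""
    return submission_month(arxiv_id) != (report_date.year, report_date.month)


report_day = date(2026, 5, 8)
flags = {pid: is_resurfaced(pid, report_day)
         for pid in ["2605.04451", "2602.00095", "2510.04142"]}
```

Counting the flagged IDs per run would give a rough measure of how much of the day's trending list is long-tail rather than new work.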

Sources and cross-validation notes

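Primary: arXiv preprints. The listing endpoint timed out this run, but the arxiv_url of every top pick was confirmed to exist through the HF Daily ∩ S2 cross-check.
Curated: HuggingFace Daily Papers. Supplies the trending rank and upvote signals, used only as a popularity indicator and not as evidence for paper results.
Metadata: Semantic Scholar. Supplies paper_id / s2_tldr / citation_velocity; 43/43 candidates have a paper_id and 40/43 carry an s2_tldr; similar_papers was not returned this run.
Conflict priority: primary > metadata > curated > other. All of today's conclusions are anchored to the arXiv preprints (abstract / s2_tldr); HF popularity only weights ranking_score and makes no content claims.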
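
The conflict-priority rule above (primary > metadata > curated > other) can be sketched as a small resolver that picks the claim from the highest-priority source tier. This is an illustrative sketch only; the `PRIORITY` table, the `resolve` function, and the (tier, value) claim shape are assumptions and do not describe the actual pipeline code.

```python
# Lower number = higher priority, mirroring primary > metadata > curated > other.
PRIORITY = {"primary": 0, "metadata": 1, "curated": 2, "other": 3}


def resolve(claims):
    """Return the value backed by the highest-priority source tier.

    claims: list of (tier, value) pairs; unknown tiers are treated as "other".
    Ties within a tier keep the first claim encountered.
    """
    if not claims:
        return None
    _tier, value = min(
        claims, key=lambda claim: PRIORITY.get(claim[0], PRIORITY["other"])
    )
    return value


# Hypothetical conflicting titles for one paper.
title = resolve([
    ("curated", "title as listed on HF Daily"),
    ("primary", "title from the arXiv abstract page"),
    ("metadata", "title from Semantic Scholar"),
])
```

Keeping the ordering in one table means that adding a new source tier only requires a new entry, not changes to the resolver.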

Hanzhi's BLOG

[Papers · 2026-05-08]