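Paper Radar Daily | 2026-05-08
One-line takeaway: HF Daily ∩ Semantic Scholar yields 43 candidates (degraded mode: the arXiv listing timed out). The main threads are reasoning (13) and agent (9), and the top 3 all use agents to rewrite the retrieval / tool / world loop: RemoteZero pushes GRPO into geospatial reasoning without coordinate annotations, DCI lets an agent query the corpus directly from the command line and bypass embeddings, and CreativityBench shows that affordance reuse remains a weak spot for current models.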
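Summary
- Candidate pool: 43 entries, all cross-hits from HF Daily ∩ Semantic Scholar (the arXiv listing endpoint timed out).
- S2 coverage: 43/43 have a paper_id; 40/43 carry an s2_tldr.
- Seen-pool: 0 hits in the 14-day rolling pool, so there is no deduplication pressure today.
- Topic distribution: reasoning 13, agent 9, inference 6, dpo 2, preference optimization 1, quantization 1, moe 1.
- Strong signal: the top 3 (RemoteZero / DCI / CreativityBench) all treat the agent as the vehicle for rewriting a traditional interface (annotation, retrieval, tools). Judging by today's HF Daily attention, the paradigm of adding a reranker or SFT data on top of a fixed interface is losing ground.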
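The cross-hit and seen-pool steps summarized above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `seen_pool` layout (arXiv ID mapped to the date it was last reported), and the ID `2605.99999` are assumptions made up for the example.

```python
from datetime import date, timedelta


def cross_hits(hf_ids, s2_ids):
    """Return IDs present in both sources, keeping HF Daily order."""
    s2 = set(s2_ids)
    return [pid for pid in hf_ids if pid in s2]


def drop_seen(candidates, seen_pool, today, window_days=14):
    """Drop candidates that were already reported inside the rolling window.

    seen_pool is a hypothetical dict: arXiv ID -> date last reported.
    """
    cutoff = today - timedelta(days=window_days)
    return [
        pid for pid in candidates
        if not (pid in seen_pool and seen_pool[pid] >= cutoff)
    ]


# 2605.99999 is a placeholder ID that only HF Daily knows about.
hf_daily = ["2605.04451", "2510.04142", "2605.99999"]
s2_known = ["2510.04142", "2605.04451"]
today = date(2026, 5, 8)

# Reported long before the 14-day window: the entry stays in the pool.
old_seen = {"2510.04142": date(2026, 4, 1)}
pool_old = drop_seen(cross_hits(hf_daily, s2_known), old_seen, today)

# Reported one week ago: the entry is deduplicated.
recent_seen = {"2510.04142": date(2026, 5, 1)}
pool_recent = drop_seen(cross_hits(hf_daily, s2_known), recent_seen, today)
```

With an empty or fully stale seen-pool, as reported today, `drop_seen` returns the cross-hit list unchanged.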
📌 Top picks (cross-hits)
- 2605.04451 RemoteZero: Geospatial Reasoning with Zero Human Annotations - annotation-free geospatial reasoning framework; semantic self-verification drives GRPO self-evolution.
  - reason: HF trending #13, plus hits on three watchlist keywords (reasoning / inference / dpo). It extends RL post-training to Earth observation, a field that has long depended on human-annotated coordinates, and the methodology has spillover value for spatial agents.
  - score 7.7 | ranking_reasons: hf_trending_rank:13 + watchlist_keyword:reasoning,inference,dpo
  - tldr_en: RemoteZero is introduced, a box-supervision-free framework for geospatial reasoning that replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations, and supports iterative self-evolution.
  - links: arXiv | HF | S2
- 2605.05242 Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction - an agent searches the raw corpus directly with terminal tools, bypassing embedding-based retrieval.
  - reason: HF trending #3, plus reasoning/agent and benchmark. It attacks the "retrieve top-k first, then reason" paradigm from the retrieval-interface side, which confronts the agentic search engineering stack head-on.
  - score 7.2 | ranking_reasons: hf_trending_rank:3 + watchlist_keyword:reasoning,agent + nice_to_have:benchmark
  - tldr_en: Direct corpus interaction (DCI) is studied, where an agent searches the raw corpus directly with general-purpose terminal tools, without any embedding model, vector index, or retrieval API.
  - links: arXiv | HF | S2
- 2605.02910 CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing - a benchmark for creative tool repurposing that exposes a weakness of current models.
  - reason: all three watchlist keywords hit at once (reasoning / agent / inference), plus benchmark/evaluation. With 18 HF upvotes it is the most-upvoted single paper of the day, and it is a candidate standard evaluation for agent planning modules.
  - score 7.0 | ranking_reasons: watchlist_keyword:reasoning,agent,inference + nice_to_have:benchmark,evaluation
  - tldr_en: The results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence.
  - links: arXiv | HF | S2
- 2605.05758 BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models - biomedical tool-calling dataset; expert evaluation shows a significant improvement in answer quality.
  - reason: HF trending #5, plus agent and fine-tuning/evaluation. It carries the general tool-calling dataset approach down into biomedicine, relies on expert rather than automatic evaluation, and fills a gap in scarce training data for vertical-domain agents.
  - score 5.5 | ranking_reasons: hf_trending_rank:5 + watchlist_keyword:agent + nice_to_have:fine-tuning,evaluation
  - tldr_en: Human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage.
  - links: arXiv | HF | S2
- 2605.04956 KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels - Triton kernel generation benchmark covering 176 tasks in 15 categories.
  - reason: HF trending #7, plus watchlist:quantization and benchmark/evaluation. It categorizes the failure modes of LLM-generated GPU kernels and is one of the few public testbeds for evaluating the code-compile-hardware chain.
  - score 5.3 | ranking_reasons: hf_trending_rank:7 + watchlist_keyword:quantization + nice_to_have:benchmark,evaluation
  - tldr_en: KernelBench-X, a benchmark designed to answer where Triton kernel generation breaks down through category-aware evaluation of correctness and hardware efficiency.
  - links: arXiv | HF | S2
- 2510.04142 Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments - treats reasoning divergence across multiple models as negative constraints for alignment, with an accompanying benchmark of about 170,000 trajectories.
  - reason: watchlist:reasoning and preference optimization, plus benchmark, and already picking up early citations (citation_velocity:0.195). It brings the concept-drift framing into multi-source MLLM alignment; CXR-MAX is one of the few public datasets that releases reasoning trajectories from seven MLLMs at once.
  - score 4.695 | ranking_reasons: watchlist_keyword:reasoning,preference optimization + nice_to_have:benchmark + citation_velocity:0.195
  - tldr_en: Autonomous Preference Optimization (APO) treats inter-model divergences as dynamic negative constraints, releasing CXR-MAX (170,982 reasoning trajectories from seven MLLMs).
  - links: arXiv | HF | S2
- 2602.00095 EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions - benchmark for evaluating MLLMs on handwritten university-level STEM solutions.
- 2605.05922 Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling - a video reward model that decouples reasoning from scoring, so CoT quality gains translate directly into reward accuracy.
  - reason: HF trending #25, plus reasoning/inference. It targets the reward-model bottleneck in post-training for generative video and offers a training framework that decouples CoT from scoring, which is more efficient than plain regression on MLLM features.
  - score 4.5 | ranking_reasons: hf_trending_rank:25 + watchlist_keyword:reasoning,inference
  - tldr_en: DeScore is a training-efficient and generalizable video reward model that independently refines CoT reasoning quality and calibrates the final reward.
  - links: arXiv | HF | S2
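The ranking_reasons strings in the entries above follow a visible `key:value,value + key:value` pattern. A minimal parser for that pattern might look like the sketch below; the function name is hypothetical, and the report does not state how the pipeline turns these signals into a score, so no weighting is attempted here.

```python
def parse_ranking_reasons(raw):
    """Parse 'key:v1,v2 + key2:v3' into {'key': ['v1', 'v2'], 'key2': ['v3']}.

    Values are kept as strings; callers decide how to interpret them.
    """
    parsed = {}
    for part in raw.split("+"):
        part = part.strip()
        if not part:
            continue
        # partition splits on the first colon only.
        key, _, values = part.partition(":")
        parsed[key.strip()] = [v.strip() for v in values.split(",") if v.strip()]
    return parsed


reasons = parse_ranking_reasons(
    "watchlist_keyword:reasoning,preference optimization"
    " + nice_to_have:benchmark + citation_velocity:0.195"
)
```

Multi-word keywords such as `preference optimization` survive intact because the split points are `+`, `:` and `,` only.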
🏷 Watchlist category hits
Topic distribution: reasoning 13 / agent 9 / inference 6 / dpo 2 / preference optimization 1 / quantization 1 / moe 1. Reasoning and agent are both hot; inference hits mostly accompany the planning and world-model directions.
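The per-topic counts do not need to add up to the 43 candidates, because one paper can hit several watchlist keywords and some hit none. A small sketch of that tally, assuming a hypothetical `watchlist_keywords` field on each candidate record and using the keyword hits listed for the first three top picks:

```python
from collections import Counter


def topic_distribution(candidates):
    """Count how many papers hit each watchlist keyword.

    Each paper contributes at most once per keyword, but may contribute
    to several keywords, so the totals can differ from len(candidates).
    """
    counts = Counter()
    for paper in candidates:
        counts.update(set(paper.get("watchlist_keywords", [])))
    return counts


# 'watchlist_keywords' is an assumed field name for this illustration.
sample = [
    {"id": "2605.04451", "watchlist_keywords": ["reasoning", "inference", "dpo"]},
    {"id": "2605.05242", "watchlist_keywords": ["reasoning", "agent"]},
    {"id": "2605.02910", "watchlist_keywords": ["reasoning", "agent", "inference"]},
]
dist = topic_distribution(sample)
```

For these three papers the tally has eight keyword hits in total, which is why the distribution is read as per-keyword coverage rather than as a partition of the pool.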
reasoning (additional hits beyond the top picks)
- 2605.04330 The Scaling Properties of Implicit Deductive Reasoning in Transformers - with a sufficiently deep bidirectional prefix mask, implicit reasoning can approach explicit CoT performance. (score 4.4)
- 2605.04077 Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO - takes token-level means separately within the positive and negative sample subsets in GRPO, a drop-in fix for aggregation bias. (score 4.3)
- 2605.05566 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration - Lorem-style perturbation as a simple training framework that breaks the exploration bottleneck. (score 2.8)
agent (additional hits beyond the top picks)
- 2605.06614 SkillOS: Learning Skill Curation for Self-Evolving Agents - consistently beats memory-free / memory-based baselines on multi-turn agent and single-turn reasoning tasks. (score 4.2)
- 2605.06200 A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping - keeps the IG intrinsic signal but redesigns turn-level clipping. (score 3.2)
inference
- 2605.06222 When to Trust Imagination: Adaptive Action Execution for World Action Models - an FFDC verifier jointly reasons over future actions / visual dynamics / real observations. (score 4.1)
- 2605.04647 ReflectDrive-2: RL-Aligned Self-Editing for Discrete Diffusion Planners - a discrete diffusion planner for autonomous driving with an independent action expert. (score 2.6)
dpo / preference optimization
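- 2605.06196 The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles - social-role granularity is a structured, causally manipulable direction in latent space. (score 4.0)
- 2510.04142 (already in Top picks) - APO models multi-model divergence as dynamic negative constraints.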
quantization / GPU kernel
- 2605.04956 (already in Top picks) - KernelBench-X.
moe
- 2605.06665 UniPool: A Globally Shared Expert Pool for Mixture-of-Experts - a globally shared expert pool replaces per-layer independent routers. (score 2.4)
🔗 Further reading (Semantic Scholar similar papers)
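No high-confidence incremental signal in this section today (S2 returned no similar papers). See coverage_gaps: ["s2_similar_unavailable"]; no external fetching was done to fill the gap.
🧑🔬 New authors / teams
Today's discovery scan found no qualifying candidates: the affiliations field in the candidate JSON is empty (HF Daily provides no institution metadata, and S2 often lacks affiliations for new preprints), so a strict match against tracked_authors / tracked_affiliations is not possible. No author entries are added just to fill the section.
Low-confidence observations (leads for the next scan only; not counted as official discoveries):
- The author list of the DCI paper (2605.05242) includes names such as Pan Lu and Shangbin Feng, who have been active in retrieval-augmented LLM work; worth prioritizing for follow-up once affiliation metadata is available.
- KernelBench-X authors Jianfei Chen / Jun Zhu are frequent Tsinghua-affiliated bylines; a group that works on both GPU kernels and LLM evaluation is rare, so they are worth watching during sync-seed.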
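📉 Coverage gaps and uncertainty
- arxiv_unavailable: the arXiv listing endpoint hit a read timeout on this fetch (see /tmp/paper_fetch.err). The candidate pool degraded to the single HF Daily ∩ S2 channel and may miss same-day arXiv papers that did not reach HF Daily; if the next fetch recovers, it should look back over a 24-hour window to pick up the missed entries.
- s2_similar_unavailable: S2 did not return the similar_papers field this time, so the further-reading section has no cross-link increment; per the SKILL convention, no external fetching is done.
- The candidate IDs include entries with non-current timestamps such as 2602.xxxxx / 2510.xxxxx (older papers resurfaced by HF Daily). This does not affect the seen-pool (no hits inside the 14-day window), but it shows that today's HF trending list has a strong long-tail component.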
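Resurfaced entries like these can be flagged from the ID alone, since new-style arXiv identifiers begin with a two-digit year and two-digit month (YYMM). The sketch below is an illustration under that assumption; the function names are hypothetical, and a paper submitted at the very end of the previous month would also be flagged, so the result is a hint rather than a verdict.

```python
import re
from datetime import date

# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, optionally with a version suffix.
_ARXIV_ID = re.compile(r"^(\d{2})(\d{2})\.\d{4,5}(v\d+)?$")


def id_year_month(arxiv_id):
    """Return (year, month) encoded in a new-style arXiv ID, or None."""
    match = _ARXIV_ID.match(arxiv_id)
    if not match:
        return None
    year, month = 2000 + int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return (year, month)


def is_resurfaced(arxiv_id, report_date):
    """True when the ID's month is earlier than the report's month."""
    year_month = id_year_month(arxiv_id)
    return year_month is not None and year_month < (report_date.year, report_date.month)


report_day = date(2026, 5, 8)
flags = {
    pid: is_resurfaced(pid, report_day)
    for pid in ["2605.04451", "2602.00095", "2510.04142"]
}
```

IDs that do not match the pattern return `None` from `id_year_month` and are never flagged, which keeps malformed input from being reported as long-tail.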
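Sources and cross-validation notes
- Primary: arXiv preprints. The listing endpoint timed out this run, but the arxiv_url of every top pick was confirmed to exist through HF Daily ∩ S2 two-channel cross-validation.
- Curated: HuggingFace Daily Papers. Provides trending rank and upvote signals, used only as an attention indicator and not as evidence for paper results.
- Metadata: Semantic Scholar. Provides paper_id / s2_tldr / citation_velocity; 43/43 candidates have a paper_id and 40/43 carry an s2_tldr; similar_papers was not returned this run.
- Conflict priority: primary > metadata > curated > other. All conclusions today are anchored in arXiv preprints (abstract / s2_tldr); HF attention only feeds ranking_score weighting and is not used for content claims.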
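The conflict priority above amounts to picking, for each field, the value from the highest-ranked source that has one. A minimal sketch of that rule follows; the function name and the input layout (source tier mapped to a value) are assumptions for illustration, and the sample strings are placeholders rather than real metadata.

```python
# Tier order taken from the stated rule: primary > metadata > curated > other.
PRIORITY = ["primary", "metadata", "curated", "other"]


def resolve(field_values):
    """Return (tier, value) from the highest-priority tier with a value.

    field_values maps a source tier to its value for one field; missing
    tiers and None values are skipped. Returns (None, None) if all are empty.
    """
    for tier in PRIORITY:
        value = field_values.get(tier)
        if value is not None:
            return tier, value
    return None, None


# Placeholder values: primary is unavailable, so metadata wins over curated.
winner = resolve({
    "primary": None,
    "metadata": "summary from s2_tldr",
    "curated": "summary from HF Daily card",
})
```

When the primary source is unavailable for a field, this falls through to metadata and then to the curated source.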