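Paper Radar Daily | 2026-05-06
One-line takeaway: today's candidate pool is dominated by an "agent benchmark explosion". Five of the Top 8 are benchmark/eval papers (iWorld-Bench, ESARBench, Workspace-Bench, Healthcare AI GYM, WindowsWorld), spanning five tracks: interactive world models, UAV search and rescue, cross-application GUI, hospitals, and workspace file dependencies. Together with one VLM active-exploration method (What You Think is What You See) and one skill-loop GRPO paper (Skills-Coach), they suggest that agent evaluation is shifting from isolated single tasks toward process-level, cross-application, and long-horizon skill evolution.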
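Summary
- All three sources returned normally, yielding 132 candidates: 32 hf+s2 double hits, 97 arxiv-only, and 3 arxiv+hf. None of them matched the 14-day rolling seen-pool.
- Top picks are determined by ranking_score. All of the top 8 carry watchlist_keyword:reasoning|agent|world model; 6/8 come from HF Daily trending (ranks 3 to 21), and 1/8 was picked up only by arXiv (the Audio-Visual Intelligence survey).
- Topics are highly concentrated: the benchmark keyword hit 51 times (5 of them within the Top 8), agent hit 18 times, and world model was an explicit hit for 2 papers (iWorld-Bench, What You Think is What You See).
- S2 returned no similar_papers for any candidate, so the further-reading section carries no incremental signal; HF JSON omits affiliation, so the new-author scan lacks the institution dimension.
- No candidate is explicitly attributed to a big-company frontier lab (the affiliation field is empty for Google DeepMind / OpenAI / Anthropic / Meta and the like); the main thread comes from Chinese universities and independent groups on HF Daily.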
📌 Top picks (cross-source hits)
- A Benchmark for Interactive World Models with a Unified Action Generation Framework (HF #17 / 1 upvote / cs.CV+cs.AI) → iWorld-Bench proposes a unified action generation framework for evaluating interactive world models.
Selection rationale: ranking_score=8.3; hf_trending_rank:17 + watchlist_keyword:reasoning,agent,world model + nice_to_have:benchmark,evaluation. It is today's only benchmark explicitly tagged as a cross hit on "world model + unified action generation".
- ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue (HF #6 / 1 upvote) → the first benchmark for MLLM-driven UAV search-and-rescue scenarios.
Selection rationale: ranking_score=7.9; high HF rank plus an embodied + benchmark double hit; one of the five Top 8 papers that come with an S2 TLDR.
tldr_en (S2): This work proposes the novel task of Embodied Search and Rescue, which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to execute informed decision-making, and presents ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios.
- Audio-Visual Intelligence in Large Foundation Models (cs.CV / survey) → a unified survey of methods and benchmarks for joint audio-visual foundation models.
Selection rationale: ranking_score=7.5; the three keywords reasoning + agent + preference optimization hit together, plus benchmark/embodied weighting; arXiv-only, not on HF Daily.
- Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies (HF #18 / 2 upvotes) → a workspace-dependency benchmark whose 100-task Lite subset cuts evaluation cost by about 70%.
Selection rationale: ranking_score=6.2; comes with a full S2 TLDR; targets the "cross-file, long-context agent" setting and complements the OSWorld series.
tldr_en (S2): Workspace-Bench is introduced, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies and Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%.
- What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity (cs.AI) → visual-linguistic curiosity drives VLM agents to actively explore sparse-reward tasks.
Selection rationale: ranking_score=6.0; triple hit on reasoning + agent + world model; a method paper rather than a benchmark, and one of the few purely algorithmic contributions in the Top 8.
- Healthcare AI GYM for Medical Agents (HF #15 / 1 upvote) → a multi-turn clinical RL training environment plus Turn-level Truncated On-Policy Distillation.
Selection rationale: ranking_score=6.0; high HF rank + benchmark + medical vertical-agent theme; S2 TLDR attached (it focuses on the TT-OPD distillation method).
tldr_en (S2): To improve training efficiency and stability, this work proposes Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn.
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments (HF #21 / 9 upvotes) → a process-centric GUI agent benchmark for cross-application workflows.
Selection rationale: ranking_score=5.9; its 9 HF upvotes are the highest count listed among the Top 8; it explicitly pushes OSWorld's "single-application" limitation toward "cross-application, profession-specific pipelines".
tldr_en (S2): A computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities, is presented.
- Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO (HF #3 / 0 upvotes) → training-free GRPO that lets LLM agent skills self-evolve.
Selection rationale: ranking_score=5.7; HF trending #3; agent + benchmark/evaluation hits; together with HeavySkill it forms a small "skill-as-inner-loop" cluster.
tldr_en (S2): Experimental results demonstrate that Skills-Coach achieves significant performance improvements in skill capability across a wide range of categories, highlighting its potential to advance the development of more robust and adaptable LLM-based agents.
🏷 Watchlist category hits
The candidates below fall outside the Top picks but hit today's watchlist keywords (at most 4 per category, sorted by ranking_score in descending order).
cs.CL
- Safety and accuracy follow different scaling laws in clinical large language models (score 5.5, agent + inference + benchmark + scaling law): safety and accuracy of clinical LLMs follow different scaling laws.
- Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in the Era of LLMs (score 5.0, reasoning + benchmark): revisits reasoning-intensive retrieval tasks and improves retrievers.
- Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to Modern LLMs (score 3.5): a teaching-oriented NLP survey, from tokenisation to modern LLMs.
- MCJudgeBench: Constraint-Level Judge Evaluation in Multi-Constraint Scenarios (score 3.0): evaluates LLM-as-judge in multi-constraint scenarios.
cs.AI
- OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Tasks (score 4.6, HF #19, 36 upvotes): advances search agents on high-difficulty tasks via SFT/fine-tuning.
- Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems (score 4.5): contextualizes multi-objective optimization for frontier AI systems.
- An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval (score 4.0): a pluggable RAG skill for experience-driven retrieval.
- QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs (score 4.0): quantized KV-cache handoff between multiple on-device agents.
cs.CV
- StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning (score 5.0): a state-aware VLM for robotic affordance reasoning.
- RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation (score 4.5): a recurrent-depth ViT that reduces segmentation cost.
- Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Generation (score 4.5): strengthens VQA with a Chain-of-Question (CoQ) approach.
- Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures (score 3.0): revisits forgetting from the angle of foundational learning failures.
cs.RO
- RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models (score 5.0): distilled multimodal reward alignment for robot video world models.
- Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision-Language Reasoning (score 4.5): uses VLM reasoning to configure scanning parameters for inspection robots.
- Evaluating Generative Models as Interactive Emergent Representations of Human-Like Agency (score 2.5): evaluates generative models as interactive emergent representations.
cs.LG
- GEM-FI: Gated Evidential Mixtures with Fisher Modulation (score 2.5): gated evidential mixtures with Fisher modulation.
- From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline (score 2.0): a process-aware risk estimation pipeline.
- Spatiotemporal Convolutions on EEG signal — A Representation Learning Perspective (score 2.0): representation learning with spatiotemporal convolutions on EEG.
- Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers (score 2.0): the task-vector geometry behind the dual modes of task inference in Transformers.
cs.MA
- Physics-Grounded Multi-Agent Architecture for Traceable, Risk-Aware Human-AI Decisions (score 2.5): a physics-grounded, traceable multi-agent decision architecture.
cs.CR
- MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents (score 3.0): measures compositional vulnerability induction in coding agents.
cs.HC
- Deco: Extending Personal Physical Objects into Pervasive AI Companion (score 2.0): extends personal objects into a pervasive AI companion.
cs.DC
- Implementing True MPI Sessions and Evaluating MPI Initialization Scalability (score 2.0): a true MPI Sessions implementation and an evaluation of initialization scalability.
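The selection rule stated at the top of this section (exclude Top picks, sort by ranking_score in descending order, keep at most 4 per category) can be sketched roughly as follows. This is a minimal sketch, not the radar's actual code: the record keys `title` and `category`, the function name `watchlist_by_category`, and the sample records are assumptions made for illustration; only `ranking_score` and the cap of 4 come from the report.

```python
from collections import defaultdict


def watchlist_by_category(candidates, top_pick_titles, cap=4):
    """Group non-Top-pick candidates by category and keep the best `cap` per category."""
    grouped = defaultdict(list)
    for cand in candidates:
        if cand["title"] in top_pick_titles:
            continue  # Top picks are reported in their own section
        grouped[cand["category"]].append(cand)
    # Sort each category by ranking_score, highest first, then truncate to the cap.
    return {
        category: sorted(items, key=lambda c: c["ranking_score"], reverse=True)[:cap]
        for category, items in grouped.items()
    }


# Hypothetical records, only to exercise the function.
sample = [
    {"title": f"paper-{i}", "category": "cs.LG", "ranking_score": float(i)}
    for i in range(6)
] + [{"title": "top-pick", "category": "cs.LG", "ranking_score": 9.0}]

result = watchlist_by_category(sample, top_pick_titles={"top-pick"})
print([c["title"] for c in result["cs.LG"]])
```

Categories with fewer than 4 qualifying candidates, such as cs.RO or cs.MA today, simply come back shorter; the cap never pads a list.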
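🔗 Further reading (Semantic Scholar similar papers)
This section has no high-confidence incremental signal today (S2 returned no similar papers): 0 of the 132 candidates carry a similar_papers field, so the section is empty and citation-graph neighbors cannot be added without external fetching. See s2_similar_unavailable in the "📉 Coverage gaps" section.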
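🧑‍🔬 Newly appearing authors / teams
Today's discovery scan found no qualifying candidates: HF Daily JSON omits author affiliation, and the author lists fetched from arXiv cannot be linked to any of the watchlist rules tracked_authors / tracked_affiliations / tracked_labs; none of today's 132 candidates has a tracked_* hit in ranking_reasons. Following the discovery_rules.md rule "state explicitly when there are no candidates; do not pad", this section is left empty.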
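📉 Coverage gaps and uncertainty
- s2_similar_unavailable: S2 metadata returned no similar_papers for any candidate this run, so the "Further reading" section is empty. Handling: state the reason for the degradation and do not fetch externally.
- affiliation_unavailable: HF Daily JSON omits affiliation, and the arXiv fetch layer returns an empty array in the affiliations field (empty for all 132), so tracked_affiliations / tracked_labs cannot match and the new-author discovery section is degraded.
- frontier_lab_signal_low: none of today's Top picks or Watchlist hits is explicitly attributed to a frontier lab (Google / Anthropic / OpenAI / Meta / DeepSeek / Mistral / xAI, etc.); the signal leans toward agent benchmark submissions from Chinese universities and independent PI groups trending on HF.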
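The first two flags above follow mechanically from fields that are empty across the whole batch. A minimal sketch of that derivation is below, assuming each candidate is a dict; the function name `coverage_gaps` and the sample batch are hypothetical, while the field names `similar_papers` and `affiliations` and the flag names are taken from the report. frontier_lab_signal_low is left out because it depends on a lab-matching rule the report does not spell out.

```python
def coverage_gaps(candidates):
    """Return the degradation flags implied by fields that are empty for every candidate."""
    gaps = []
    # Raised when no candidate carries a non-empty similar_papers list.
    if not any(c.get("similar_papers") for c in candidates):
        gaps.append("s2_similar_unavailable")
    # Raised when no candidate carries a non-empty affiliations list.
    if not any(c.get("affiliations") for c in candidates):
        gaps.append("affiliation_unavailable")
    return gaps


# Hypothetical batch mirroring today's situation: both fields empty everywhere.
batch = [{"similar_papers": [], "affiliations": []} for _ in range(132)]
print(coverage_gaps(batch))
```

A single candidate with a non-empty field is enough to clear the corresponding flag, which matches the "0 of 132" wording used in this issue.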
Sources and cross-validation notes
Candidate composition for this issue (132): arxiv-only 97 / hf+s2 32 / arxiv+hf 3. All three sources returned normally, with no endpoint degradation.
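The composition line is a tally of source combinations, and the three buckets should sum to the candidate total (97 + 32 + 3 = 132). A minimal sketch of such a tally, assuming each candidate carries a hypothetical `sources` list; the key name, the function name `source_breakdown`, and the synthetic records are illustrative, and the sketch labels the arxiv-only bucket simply as "arxiv".

```python
from collections import Counter


def source_breakdown(candidates):
    """Count candidates per source combination, e.g. 'arxiv', 'hf+s2', 'arxiv+hf'."""
    return Counter("+".join(sorted(c["sources"])) for c in candidates)


# Synthetic records reproducing the counts reported for this issue.
candidates = (
    [{"sources": ["arxiv"]}] * 97
    + [{"sources": ["hf", "s2"]}] * 32
    + [{"sources": ["arxiv", "hf"]}] * 3
)

breakdown = source_breakdown(candidates)
print(dict(breakdown), sum(breakdown.values()))
```

Sorting the source names before joining keeps "hf+s2" and "s2+hf" in the same bucket regardless of fetch order.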
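Weighting principle (see source_policy.md): conclusions anchor to primary (arXiv preprint) → metadata (Semantic Scholar TLDR / venue) → curated (HF Daily trending + upvotes) → other. In this issue the trending signal for 6/8 Top picks comes from HF (curated), but every factual statement in the conclusions is anchored back to the arXiv abstract / S2 TLDR. citation_count==0 is not used as grounds for down-weighting (S2 indexing has not yet caught up with papers submitted in May 2026).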
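One way to read the tier order above is as a fallback chain: take the statement from the highest tier that has any text. The sketch below illustrates only that reading. The tier names come from the report, but the function `anchor_statement`, the dict shape, and the example texts are assumptions, and source_policy.md may define the rule differently.

```python
# Tier order as stated in the weighting principle; earlier tiers win.
SOURCE_PRIORITY = ("primary", "metadata", "curated", "other")


def anchor_statement(statements):
    """Pick the statement from the highest-priority tier that has non-empty text.

    `statements` maps a tier name to its text; returns (tier, text) or None.
    """
    for tier in SOURCE_PRIORITY:
        text = statements.get(tier)
        if text:
            return tier, text
    return None


# Hypothetical example: no primary text available, so metadata wins over curated.
example = {"metadata": "S2 TLDR text", "curated": "HF trending blurb"}
print(anchor_statement(example))
```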
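Conflict note: the abstract of Healthcare AI GYM centers on a multi-turn clinical RL environment, while the S2 TLDR emphasizes the Turn-level Truncated On-Policy Distillation (TT-OPD) method. The two do not contradict each other (TT-OPD is the paper's training-algorithm contribution), but tldr_cn condenses the environment-level storyline from the abstract and leaves TT-OPD to tldr_en, which keeps the S2 original text.
The seen-pool (14 days) contains 241 keys with 0 hits today, so the candidates do not duplicate anything in the daily reports of the past 14 days.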
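The seen-pool check amounts to pruning a rolling window and then testing membership. A minimal sketch follows, assuming the pool maps a paper key to the date it was first reported; the function names, the key format, and the sample dates are hypothetical, and the exact window boundary used by the radar is not stated in the report.

```python
from datetime import date, timedelta


def prune_seen_pool(seen_pool, today, window_days=14):
    """Keep only keys first seen within the last `window_days` days (today included)."""
    cutoff = today - timedelta(days=window_days)
    return {key: seen_on for key, seen_on in seen_pool.items() if seen_on > cutoff}


def split_by_seen(candidate_keys, seen_pool):
    """Split candidate keys into (already seen, fresh) against the pool."""
    hits = [k for k in candidate_keys if k in seen_pool]
    fresh = [k for k in candidate_keys if k not in seen_pool]
    return hits, fresh


# Hypothetical keys and dates.
pool = {"paper-a": date(2026, 5, 1), "paper-b": date(2026, 4, 20)}
pool = prune_seen_pool(pool, today=date(2026, 5, 6))
hits, fresh = split_by_seen(["paper-a", "paper-c"], pool)
print(hits, fresh)
```

In today's run the equivalent of `hits` is empty, which is what "0 hits" against the 241-key pool means.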