Paper Radar Daily | 2026-05-19

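Bottom line: today's paper layer is an "agent evaluation benchmark burst". 6 of the Top 8 picks are new benchmarks, and together they point to credibility gaps in agents' closed-loop tool use, long-horizon memory, and cross-view reasoning. The hardest data point: on MM-ToolBench, Claude Opus 4.6 reaches only a 32.0% task success rate, far below the 94.0% human baseline.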

Summary

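All three sources delivered for today's 135 candidates (arXiv + HF Daily + Semantic Scholar); 0 hit the 14-day seen-pool; the Top picks cap of 8 is fully used.
Main thread: a burst of evaluation benchmarks. 6 of the 8 Top picks are new benchmarks (ChildAgentEval / MM-ToolBench / GIM / CrossViewBench / LongMINT, plus VideoSeeker, which ships with an evaluation), and together they quantify the credibility gap in agent capabilities.
Hard data points: on MM-ToolBench, Claude Opus 4.6 scores only 32.0% vs. 94.0% for humans; the 7 memory systems tested on LongMINT average 27.9%; GIM finds that thinking budget and quantization matter as much as model choice.
Efficiency side: Measuring Maximum Activations gives a deployment rule of thumb that MoE peak activations are 14–23x lower than those of dense models at the same scale.
S2 similar_papers was not returned for any of the 135 candidates, so the Further reading section is empty (see Coverage gaps).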
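
The freshness filter and the Top picks cap mentioned in the summary can be sketched as follows. This is a minimal illustration, not the radar's actual code: the candidate schema (`id`, `score`), the `seen_pool` mapping, and the assumption that Top picks are simply the highest-scoring fresh candidates are all hypothetical; only the 14-day window and the cap of 8 come from the digest.

```python
from datetime import date, timedelta

SEEN_WINDOW_DAYS = 14  # from the digest: 14-day seen-pool
TOP_PICKS_CAP = 8      # from the digest: Top picks cap


def select_top_picks(candidates, seen_pool, today):
    """Drop recently shown papers, then keep the highest-scoring ones.

    candidates: list of dicts with hypothetical keys 'id' and 'score'.
    seen_pool:  dict mapping paper id -> date the paper was last shown.
    Returns (top_picks, remaining_fresh), both sorted by descending score.
    """
    cutoff = today - timedelta(days=SEEN_WINDOW_DAYS)
    fresh = [
        c for c in candidates
        # A candidate is stale only if it was shown inside the window.
        if not (c["id"] in seen_pool and seen_pool[c["id"]] >= cutoff)
    ]
    fresh.sort(key=lambda c: c["score"], reverse=True)
    return fresh[:TOP_PICKS_CAP], fresh[TOP_PICKS_CAP:]
```

With an empty or fully expired seen-pool, as in today's run, every candidate counts as fresh and the cap alone decides how many picks are shown.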

📌 Top picks (cross-source hits)

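Evaluating Cognitive Age Alignment in Interactive AI Agents (2605.17894 | HF#1 / ▲1 / score 7.4 | hf_trending_rank:1 + watchlist_keyword:reasoning,agent; ranked first on HF today, the first psychometrically grounded interactive benchmark for cognitive age.) → Psychometric interactive benchmark that quantifies the cognitive-age gap of MLLM agents
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents (2605.16909 | HF#6 / ▲2 / score 7.4 | hf_trending_rank:6 + watchlist_keyword:reasoning,agent + nice_to_have:benchmark,evaluation; closed-loop omni-modal tool-use benchmark that gives a hard data point for the strongest model vs. humans.) → Omni-modal closed-loop tool-use benchmark; the strongest model reaches only 32%, far behind humans
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation (2605.16079 | HF#4 / ▲1 / score 7.1 | hf_trending_rank:4 + watchlist_keyword:reasoning,agent; internalizes agentic reasoning into instance-level video understanding, with a fully automated four-stage data synthesis pipeline.) → Instance-level video understanding via native tool invocation, surpassing GPT-4o
GIM: Evaluating models via tasks that integrate multiple cognitive domains (2605.18663 | score 7.0 | watchlist_keyword:reasoning,quantization,test-time compute + nice_to_have:benchmark,evaluation; IRT-calibrated integrative benchmark with the most detailed study of test-time compute trade-offs.) → Integrative reasoning benchmark that quantifies the trade-off between test-time compute and model capability
CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark (2605.18621 | score 7.0 | watchlist_keyword:reasoning,agent,inference + nice_to_have:benchmark,evaluation; a complete dataset/model/benchmark trio for cross-view spatial intelligence.) → Cross-view spatial intelligence: 1.6M dataset + aligned model + benchmark
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems (2605.18565 | score 7.0 | watchlist_keyword:reasoning,agent,long context + nice_to_have:benchmark,evaluation; high-interference long-horizon memory benchmark that systematically exposes retrieval and construction weaknesses in memory systems.) → Interference-resistant long-horizon memory benchmark; 7 systems average only 27.9%
MA²P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion (2605.18572 | score 6.5 | watchlist_keyword:reasoning,agent,inference + nice_to_have:evaluation; autonomous multi-agent framework whose meta-cognitive configurator mitigates cross-domain performance fluctuations.) → Meta-cognitive multi-agent framework that selects strategies across domains to raise persuasion success rates
Measuring Maximum Activations in Open Large Language Models (2605.15572 | HF#33 / ▲14 / score 6.0 | watchlist_keyword:moe,quantization,inference; deployment-oriented systematic measurement of maximum activation magnitudes, giving an empirical rule for MoE vs. dense.) → Measures maximum activations in open LLMs; MoE peaks are an order of magnitude lower than dense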

🏷 Watchlist hits by category

Fresh papers with rank ≥ 9 that hit a watchlist keyword but did not make Top picks, grouped by arXiv primary category (at most 4 per category):

cs.CL

Code as Agent Harness (2605.18747 | score 5.8 | hf_trending_rank:22 + watchlist_keyword:reasoning,agent + nice_to_have:evaluation,embodied) → Treats code as the agent harness, unifying the perception-reasoning-action loop
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL (2605.18703 | score 5.0 | watchlist_keyword:reasoning,agent + nice_to_have:benchmark,sft) → Executable environment synthesis + robust RL to scale up training of tool-use agents
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention (2605.18753 | score 4.0 | watchlist_keyword:long context,inference) → Differentiable, adaptive sparse hierarchical attention that improves top-k KV block selection
Forecasting Downstream Performance of LLMs With Proxy Metrics (2605.18607 | score 2.5 | watchlist_keyword:reasoning + nice_to_have:evaluation) → Predicts LLM downstream performance from proxy metrics to inform architecture and corpus decisions

cs.AI

Latent Action Reparameterization for Efficient Agent Inference (2605.18597 | score 4.5 | watchlist_keyword:agent,inference + nice_to_have:benchmark) → Latent action reparameterization compresses the decision horizon and lowers agent inference cost
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents (2605.18693 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,evaluation) → Evaluates agents' ability to generate reusable, executable skills from repositories and documentation
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science (2605.18630 | score 3.0 | watchlist_keyword:reasoning + nice_to_have:benchmark,evaluation) → Evaluates LLMs' ability to ask multi-turn clarifying questions in computational science tasks
When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State (2605.18580 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,evaluation) → Trace-based evaluation under hidden competitor state, exposing unsafe policies that outcome-only checks miss

cs.LG

Post-Trained MoE Can Skip Half Experts via Self-Distillation (2605.18643 | score 4.8 | hf_trending_rank:27 + watchlist_keyword:moe,inference + nice_to_have:benchmark) → Post-trained MoE can skip half of its experts via self-distillation without losing accuracy
General Preference Reinforcement Learning (2605.18721 | score 4.0 | watchlist_keyword:reasoning,preference optimization) → General preference RL that unifies the verifiable-reward and preference-based alignment routes
Distilling Tabular Foundation Models for Structured Health Data (2605.18702 | score 2.0 | watchlist_keyword:inference) → Distills tabular foundation models into lightweight models to cut inference cost on structured health data
Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts (2605.18606 | score 2.0 | watchlist_keyword:inference) → Physics-aligned equivariant Fourier neural operator under symmetry-induced distribution shifts

cs.CV

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop (2605.18746 | score 5.0 | watchlist_keyword:reasoning,agent + nice_to_have:benchmark,embodied) → ESI-Bench, an embodied spatial intelligence benchmark that closes the perception-action loop
Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory (2605.18733 | score 5.0 | watchlist_keyword:quantization,inference + nice_to_have:benchmark,evaluation) → Training-free identity-aware memory that improves consistency in narrative long video generation
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents (2605.18652 | score 4.8 | hf_trending_rank:2 + watchlist_keyword:agent) → Learns agentic multimodal memory control to support long-horizon GUI agents
LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation (2605.18739 | score 4.6 | hf_trending_rank:29 + watchlist_keyword:inference,kv cache + nice_to_have:benchmark) → NVFP4 parallel infrastructure that accelerates training and inference for long video generation

cs.RO

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction (2605.18729 | score 5.0 | watchlist_keyword:agent,world model + nice_to_have:evaluation,embodied) → Self-evolving embodied agent with dual-grain cognitive memory + autonomous knowledge induction
DexHoldem: Playing Texas Hold’em with Dexterous Embodied System (2605.18727 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,embodied) → Plays Texas Hold’em with a dexterous-hand embodied system to evaluate closed-loop embodied decision-making
Dexora: Open-source VLA for High-DoF Bimanual Dexterity (2605.18722 | score 3.0 | watchlist_keyword:vla + nice_to_have:benchmark,embodied) → Dexora, an open-source VLA model for high-DoF bimanual dexterous manipulation

cs.DC

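Ranking Opinions with Few States in Population Protocols (2605.18707 | score 4.0 | watchlist_keyword:agent,scheduler) → Few-state opinion ranking in population protocols; distributed computation under an adversarial scheduler
EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet (2605.18683 | score 2.0 | watchlist_keyword:inference) → Abstraction and polymorphism for in-network collectives on Ethernet, optimizing training/inference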

stat.ML

SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate (2605.18745 | score 5.0 | watchlist_keyword:mixture of experts,inference + nice_to_have:benchmark,evaluation) → Approximation-free, training-free particle filter for inference-time guidance of diffusion surrogates
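
The grouping rule stated at the top of this section (rank ≥ 9, at least one watchlist keyword hit, at most 4 papers per arXiv primary category) could be implemented roughly like this. The field names (`rank`, `primary_category`, `watchlist_hits`, `title`) are hypothetical, and the sketch assumes `rank` is the 1-based position in the day's overall ranking, so ranks 1 to 8 are the Top picks.

```python
from collections import defaultdict

MIN_RANK = 9          # from the digest: only papers ranked 9 or lower in the list
PER_CATEGORY_CAP = 4  # from the digest: at most 4 papers per category


def group_watchlist_hits(papers):
    """Group non-Top-pick watchlist hits by arXiv primary category.

    papers: list of dicts with hypothetical keys 'rank', 'primary_category',
            'watchlist_hits' (list of matched keywords) and 'title'.
    Returns {category: [titles]} with best-ranked papers first in each list.
    """
    groups = defaultdict(list)
    for paper in sorted(papers, key=lambda p: p["rank"]):
        if paper["rank"] < MIN_RANK or not paper["watchlist_hits"]:
            continue  # either a Top pick or no keyword match
        bucket = groups[paper["primary_category"]]
        if len(bucket) < PER_CATEGORY_CAP:
            bucket.append(paper["title"])
    return dict(groups)
```

Because papers are visited in rank order, the cap keeps the four best-ranked hits in each category and silently drops the rest.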

🔗 Further reading (Semantic Scholar similar papers)

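No high-confidence incremental signal in this section today (S2 similar papers were not returned). Semantic Scholar returned no similar_papers field for any of the 135 candidates; per the skill constraints, no separate external search was run to fill the gap, and coverage_gaps records s2_similar_unavailable.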

🧑‍🔬 Newly surfaced authors / teams

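Mohit Bansal (UNC Chapel Hill (affiliation not provided in candidate metadata; labeled from public knowledge) | group: oss-ai-labs | cross_checked=false): senior/last author of today's Top pick #6, LongMINT, the interference-resistant long-horizon memory benchmark; not in the tracked_authors list, which satisfies the "corresponding author of a Top pick and not yet tracked" rule. Evidence: https://arxiv.org/abs/2605.18565
Aditya Tanna (unknown (candidate metadata lists no affiliation; same group as Vinay Kumar Sankarapu / Pratinav Seth) | group: oss-ai-labs | cross_checked=false): on the same team as Nassim Bouarour / Mohamed Bouadi / Pratinav Seth / Vinay Kumar Sankarapu; appears on 4 same-day preprints today (2605.18702/18696/18635/18654, on tabular foundation models and interpretability), which satisfies the "appears on ≥2 different papers" rule. Evidence: https://arxiv.org/abs/2605.18702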
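
The "appears on ≥2 different papers" rule used above amounts to counting distinct paper IDs per author and dropping anyone already tracked. A minimal sketch, assuming each candidate is a dict with hypothetical `id` and `authors` keys and that `tracked_authors` is a plain set of names:

```python
from collections import defaultdict


def repeated_untracked_authors(papers, tracked_authors, min_papers=2):
    """Find untracked authors who appear on at least `min_papers` papers.

    papers:          list of dicts with hypothetical keys 'id' and 'authors'.
    tracked_authors: set of author names that are already being followed.
    Returns {author: sorted list of paper ids}.
    """
    papers_by_author = defaultdict(set)
    for paper in papers:
        for author in paper["authors"]:
            # A set avoids double counting if a paper is listed twice.
            papers_by_author[author].add(paper["id"])
    return {
        author: sorted(ids)
        for author, ids in papers_by_author.items()
        if len(ids) >= min_papers and author not in tracked_authors
    }
```

Exact string matching on names is the weakest part of this sketch; name variants or homonyms would need disambiguation that the candidate metadata, as described in this issue, does not supply.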

📉 Coverage gaps and uncertainty

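s2_similar_unavailable: the Semantic Scholar similar-paper graph returned nothing for any of the 135 candidates, so the Further reading section is empty.
s2_tldr_sparse: only 9 of the 135 candidates carry an S2 tldr, all of them low-scoring non-Top picks; tldr_en is empty for all 8 Top picks, so their quick-read summaries are condensed from the abstracts.
affiliations_empty: neither the arXiv listing nor the HF JSON includes affiliations, so affiliation-level discovery and tracked_affiliations matching were not triggered in this issue; tracked_labs_seen is empty.
arXiv and HF Daily both ran normally in this issue; no source failed entirely.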

Sources and cross-validation notes

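This issue uses arXiv preprints as the primary anchor for conclusions; HF Daily trending (curated) is used for popularity and hit ranking; Semantic Scholar (metadata) supplied only a handful of tldr entries, and similar_papers was empty across the board. Conflict priority is primary > metadata > curated > other; no source failed entirely.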
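
The conflict priority above (primary > metadata > curated > other) can be read as a simple tie-breaking rule when sources disagree on a field. The sketch below is one possible reading, not the radar's actual logic: the `(tier, value)` input shape and the treatment of unknown tiers as `other` are assumptions.

```python
# Lower number wins; the order follows the digest: primary > metadata > curated > other.
PRIORITY = {"primary": 0, "metadata": 1, "curated": 2, "other": 3}


def resolve_field(observations):
    """Pick the value reported by the highest-priority source tier.

    observations: list of (tier, value) tuples, e.g. ("primary", "some title").
                  Values of None mean the source did not report the field.
    Returns the winning value, or None if no source reported anything.
    """
    usable = [
        (PRIORITY.get(tier, PRIORITY["other"]), value)
        for tier, value in observations
        if value is not None
    ]
    if not usable:
        return None
    # min() keeps the first entry among equal priorities, so input order breaks ties.
    return min(usable, key=lambda pair: pair[0])[1]
```

For example, if the arXiv abstract (primary) and an HF Daily blurb (curated) report different numbers for the same result, this rule keeps the arXiv value; a missing primary value falls through to the next tier.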

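Top picks conclusions are all anchored to the arXiv abstract text (e.g., MM-ToolBench's 32.0% vs. 94.0%, LongMINT's 27.9%); HF trending rank serves only as a ranking signal, not as evidence of results; citation_count is empty everywhere because new preprints are not yet indexed, and it is not used to down-weight papers.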

Hanzhi's BLOG

[Papers · 2026-05-19]