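Paper Radar Daily | 2026-05-19
One-line takeaway: today's paper layer is an "agent evaluation benchmark explosion day". 6 of the Top 8 are new benchmarks, and together they point to a trustworthiness gap for agents in closed-loop tool use, long-horizon memory, and cross-view reasoning. The hardest data point: on MM-ToolBench, Claude Opus 4.6 reaches only a 32.0% task success rate, far below the 94.0% human baseline.
Summary
- All three sources are available for today's 135 candidates (arXiv + HF Daily + Semantic Scholar); 0 candidates hit the 14-day seen-pool; the Top picks cap of 8 is filled.
- Main thread: a surge of evaluation benchmarks. 6 of the 8 Top picks are new benchmarks (ChildAgentEval / MM-ToolBench / GIM / CrossViewBench / LongMINT, plus VideoSeeker, which ships with an evaluation), and collectively they quantify the trustworthiness gap in agent capabilities.
- Hard data points: on MM-ToolBench, Claude Opus 4.6 scores only 32.0% vs 94.0% for humans; on LongMINT, 7 memory systems average 27.9%; GIM finds that thinking budget and quantization matter as much as model choice.
- Efficiency side: Measuring Maximum Activations gives an empirical deployment rule that MoE peak activations are 14-23x lower than those of same-scale dense models.
- S2 similar_papers returned nothing for all 135 candidates, so the further-reading section is empty (see coverage gaps).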
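The selection step described in the summary (drop anything already in the 14-day seen-pool, then fill at most 8 Top picks) can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the `Candidate` fields, the `select_top_picks` name, the shape of `seen_pool`, and ranking by score alone are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

SEEN_WINDOW_DAYS = 14  # from the report: 14-day seen-pool
TOP_PICKS_CAP = 8      # from the report: Top picks cap of 8


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; the real candidate schema is not shown in the report.
    arxiv_id: str
    score: float


def select_top_picks(candidates, seen_pool, today):
    """Return up to TOP_PICKS_CAP fresh candidates, highest score first.

    seen_pool maps arxiv_id -> date the paper was last reported (assumed shape).
    """
    cutoff = today - timedelta(days=SEEN_WINDOW_DAYS)
    fresh = [
        c for c in candidates
        # Keep papers never seen, or last seen before the 14-day window.
        if c.arxiv_id not in seen_pool or seen_pool[c.arxiv_id] < cutoff
    ]
    fresh.sort(key=lambda c: c.score, reverse=True)
    return fresh[:TOP_PICKS_CAP]
```

With 135 candidates and an empty overlap with the seen-pool, as reported today, the filter keeps everything and the cap alone decides the cut.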
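📌 Top picks (cross-source hits)
- Evaluating Cognitive Age Alignment in Interactive AI Agents (2605.17894 | HF#1 / ▲1 / score 7.4 | hf_trending_rank:1 + watchlist_keyword:reasoning,agent; ranked first on HF for the day, and the first psychometrically grounded interactive cognitive-age benchmark.) → Psychometric interactive benchmark that quantifies the cognitive-age gap of MLLM agents
- TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents (2605.16909 | HF#6 / ▲2 / score 7.4 | hf_trending_rank:6 + watchlist_keyword:reasoning,agent + nice_to_have:benchmark,evaluation; closed-loop omni-modal tool-use benchmark that gives a hard data point for the strongest model vs humans.) → Omni-modal closed-loop tool-use benchmark; the strongest model reaches only 32%, far behind humans
- VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation (2605.16079 | HF#4 / ▲1 / score 7.1 | hf_trending_rank:4 + watchlist_keyword:reasoning,agent; internalizes agentic reasoning into instance-level video understanding, with a four-stage fully automated data synthesis pipeline.) → Native tool invocation for instance-level video understanding, surpassing GPT-4o
- GIM: Evaluating models via tasks that integrate multiple cognitive domains (2605.18663 | score 7.0 | watchlist_keyword:reasoning,quantization,test-time compute + nice_to_have:benchmark,evaluation; IRT-calibrated integrative benchmark with the most detailed study of test-time compute trade-offs.) → Integrative reasoning benchmark that quantifies the trade-off between test-time compute and model capability
- CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark (2605.18621 | score 7.0 | watchlist_keyword:reasoning,agent,inference + nice_to_have:benchmark,evaluation; a complete dataset / model / benchmark trio for cross-view spatial intelligence.) → Cross-view spatial intelligence: 1.6M dataset + aligned model + benchmark
- LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems (2605.18565 | score 7.0 | watchlist_keyword:reasoning,agent,long context + nice_to_have:benchmark,evaluation; high-interference long-horizon memory benchmark that systematically exposes retrieval and construction weaknesses in memory systems.) → Long-horizon memory benchmark under interference; 7 systems average only 27.9%
- MAP: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion (2605.18572 | score 6.5 | watchlist_keyword:reasoning,agent,inference + nice_to_have:evaluation; autonomous multi-agent framework whose meta-cognitive configurator mitigates cross-domain performance fluctuation.) → Meta-cognitive multi-agent framework that selects strategies across domains to raise persuasion success rates
- Measuring Maximum Activations in Open Large Language Models (2605.15572 | HF#33 / ▲14 / score 6.0 | watchlist_keyword:moe,quantization,inference; deployment-oriented systematic measurement of maximum activation magnitudes, giving an empirical rule that compares MoE and dense models.) → Measures maximum activations in open LLMs; MoE peaks are an order of magnitude lower than dense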
🏷 Watchlist category hits
Fresh papers with rank≥9 that hit a watchlist keyword and did not make Top picks, grouped by arXiv primary category (at most 4 per category):
cs.CL
- Code as Agent Harness (2605.18747 | score 5.8 | hf_trending_rank:22 + watchlist_keyword:reasoning,agent + nice_to_have:evaluation,embodied) → Treats code as the agent harness, unifying the perception-reasoning-action loop
- EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL (2605.18703 | score 5.0 | watchlist_keyword:reasoning,agent + nice_to_have:benchmark,sft) → Executable environment synthesis plus robust RL to scale up tool-use agent training
- DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention (2605.18753 | score 4.0 | watchlist_keyword:long context,inference) → Differentiable, adaptive sparse hierarchical attention that improves top-k KV block selection
- Forecasting Downstream Performance of LLMs With Proxy Metrics (2605.18607 | score 2.5 | watchlist_keyword:reasoning + nice_to_have:evaluation) → Predicts LLM downstream performance from proxy metrics to support architecture and corpus decisions
cs.AI
- Latent Action Reparameterization for Efficient Agent Inference (2605.18597 | score 4.5 | watchlist_keyword:agent,inference + nice_to_have:benchmark) → Latent action reparameterization compresses the decision horizon and lowers agent inference cost
- SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents (2605.18693 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,evaluation) → Evaluates how well agents generate reusable, executable skills from repositories and documentation
- SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science (2605.18630 | score 3.0 | watchlist_keyword:reasoning + nice_to_have:benchmark,evaluation) → Evaluates LLMs' ability to ask multi-turn clarifying questions in computational science tasks
- When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State (2605.18580 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,evaluation) → Trace-based evaluation under hidden competitor state, exposing unsafe policies that outcome-only evaluation misses
cs.LG
- Post-Trained MoE Can Skip Half Experts via Self-Distillation (2605.18643 | score 4.8 | hf_trending_rank:27 + watchlist_keyword:moe,inference + nice_to_have:benchmark) → Post-trained MoE can skip half of its experts via self-distillation without losing accuracy
- General Preference Reinforcement Learning (2605.18721 | score 4.0 | watchlist_keyword:reasoning,preference optimization) → General preference RL that unifies the verifiable-reward and preference-based alignment routes
- Distilling Tabular Foundation Models for Structured Health Data (2605.18702 | score 2.0 | watchlist_keyword:inference) → Distills tabular foundation models into lightweight models to cut inference cost on structured health data
- Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts (2605.18606 | score 2.0 | watchlist_keyword:inference) → Physics-aligned equivariant Fourier neural operator under symmetry-induced distribution shifts
cs.CV
- ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop (2605.18746 | score 5.0 | watchlist_keyword:reasoning,agent + nice_to_have:benchmark,embodied) → ESI-Bench, an embodied spatial intelligence benchmark that closes the perception-action loop
- Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory (2605.18733 | score 5.0 | watchlist_keyword:quantization,inference + nice_to_have:benchmark,evaluation) → Training-free identity-aware memory that improves consistency in narrative long video generation
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents (2605.18652 | score 4.8 | hf_trending_rank:2 + watchlist_keyword:agent) → Learns agentic multimodal memory control to support long-horizon GUI agents
- LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation (2605.18739 | score 4.6 | hf_trending_rank:29 + watchlist_keyword:inference,kv cache + nice_to_have:benchmark) → NVFP4 parallel infrastructure that speeds up training and inference for long video generation
cs.RO
- Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction (2605.18729 | score 5.0 | watchlist_keyword:agent,world model + nice_to_have:evaluation,embodied) → Self-evolving embodied agent with dual-grain cognitive memory and autonomous knowledge induction
- DexHoldem: Playing Texas Hold’em with Dexterous Embodied System (2605.18727 | score 3.0 | watchlist_keyword:agent + nice_to_have:benchmark,embodied) → Plays Texas Hold'em with a dexterous-hand embodied system to evaluate closed-loop embodied decision-making
- Dexora: Open-source VLA for High-DoF Bimanual Dexterity (2605.18722 | score 3.0 | watchlist_keyword:vla + nice_to_have:benchmark,embodied) → Dexora, an open-source VLA model for high-DoF bimanual dexterous manipulation
cs.DC
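- Ranking Opinions with Few States in Population Protocols (2605.18707 | score 4.0 | watchlist_keyword:agent,scheduler) → Ranks opinions with few states in population protocols; distributed computation under an adversarial scheduler
- EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet (2605.18683 | score 2.0 | watchlist_keyword:inference) → Abstraction and polymorphism for in-network collective communication on Ethernet, optimizing training and inference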
stat.ML
- SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate (2605.18745 | score 5.0 | watchlist_keyword:mixture of experts,inference + nice_to_have:benchmark,evaluation) → Approximation-free, training-free particle filter for inference-time guidance of diffusion surrogates
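The grouping rule stated at the top of this section (rank≥9, at least one watchlist keyword, not already a Top pick, at most 4 per arXiv primary category) can be sketched as follows. The dictionary keys, the function name, and the choice to order each group by descending score are assumptions for illustration; the input is assumed to contain only fresh papers.

```python
from collections import defaultdict

MIN_RANK = 9          # from the report: rank >= 9
MAX_PER_CATEGORY = 4  # from the report: at most 4 per category


def group_watchlist_hits(papers, top_pick_ids):
    """Group eligible papers by primary category, highest score first.

    Each paper is a dict with hypothetical keys: arxiv_id, rank, score,
    watchlist_keywords (list of str), primary_category.
    """
    groups = defaultdict(list)
    for paper in sorted(papers, key=lambda p: p["score"], reverse=True):
        if paper["rank"] < MIN_RANK:
            continue  # ranks 1-8 belong to Top picks
        if not paper["watchlist_keywords"]:
            continue  # no watchlist keyword hit
        if paper["arxiv_id"] in top_pick_ids:
            continue  # already listed as a Top pick
        bucket = groups[paper["primary_category"]]
        if len(bucket) < MAX_PER_CATEGORY:
            bucket.append(paper)
    return dict(groups)
```

Sorting once before bucketing means the per-category cap keeps the highest-scoring entries, which matches the descending score order visible in each group above.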
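🔗 Further reading (Semantic Scholar similar papers)
No high-confidence incremental signal for this section today (S2 similar papers were not returned). Semantic Scholar returned no similar_papers field for any of the 135 candidates; per the skill constraints, no separate external search is run to fill the gap, and coverage_gaps records s2_similar_unavailable.
🧑‍🔬 Newly appearing authors / teams
- Mohit Bansal (UNC Chapel Hill (affiliation not provided in the candidate metadata; labeled from public knowledge) | group: oss-ai-labs | cross_checked=false): senior/last author of today's Top pick #5, LongMINT, a long-horizon memory benchmark under interference. Not in the tracked_authors list, so the rule "corresponding author of a Top pick and not yet tracked" applies. Evidence: https://arxiv.org/abs/2605.18565
- Aditya Tanna (unknown (candidate metadata has no affiliation; same group as Vinay Kumar Sankarapu / Pratinav Seth) | group: oss-ai-labs | cross_checked=false): on the same team as Nassim Bouarour / Mohamed Bouadi / Pratinav Seth / Vinay Kumar Sankarapu, and appears today in 4 same-day preprints (2605.18702/18696/18635/18654, on tabular foundation models and interpretability), which satisfies the rule "appears in ≥2 different papers". Evidence: https://arxiv.org/abs/2605.18702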
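The second rule used above (an untracked author who appears in ≥2 different papers on the same day) reduces to counting distinct papers per author. A minimal sketch, assuming each paper is available as an (arxiv_id, author list) pair; the function name and input shape are hypothetical.

```python
from collections import defaultdict

MIN_PAPERS = 2  # from the report: appears in >= 2 different papers


def repeated_untracked_authors(papers, tracked_authors):
    """Map each untracked author to the sorted IDs of the papers they appear in.

    papers: iterable of (arxiv_id, list_of_author_names) pairs (assumed shape).
    Only authors with at least MIN_PAPERS distinct papers are returned.
    """
    papers_by_author = defaultdict(set)
    for arxiv_id, authors in papers:
        for name in authors:
            # A set ignores duplicate listings of the same paper.
            papers_by_author[name].add(arxiv_id)
    return {
        name: sorted(ids)
        for name, ids in papers_by_author.items()
        if len(ids) >= MIN_PAPERS and name not in tracked_authors
    }
```

Exact string matching on names is the weakest part of this sketch; name variants would need normalization before counting.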
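📉 Coverage gaps and uncertainty
- s2_similar_unavailable: the Semantic Scholar similar-paper graph returned nothing for all 135 candidates, so the further-reading section is empty.
- s2_tldr_sparse: only 9 of the 135 candidates carry an S2 tldr, all of them low-scoring non-Top picks; tldr_en is empty for all 8 Top picks, so their one-line summaries are condensed from the abstracts.
- affiliations_empty: neither the arXiv listing nor the HF JSON carries affiliations, so affiliation-level discovery and tracked_affiliations matching did not trigger this issue; tracked_labs_seen is empty.
- arXiv / HF Daily were both normal this issue; no single source was fully down.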
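Sources and cross-validation notes
This issue uses arXiv preprints as the primary anchor for conclusions. HF Daily trending (curated) is used for popularity and hit ordering, and Semantic Scholar (metadata) supplied only a few tldr entries, with similar_papers entirely empty. Conflict priority is primary > metadata > curated > other; no single source was fully down.
Top picks conclusions are all anchored to the original arXiv abstract text (e.g. MM-ToolBench's 32.0% vs 94.0%, LongMINT's 27.9%). HF trending rank serves only as an ordering signal, not as evidence for results. citation_count is empty throughout because new preprints are not yet indexed, and this is not used to down-weight any paper.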
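The conflict priority above (primary > metadata > curated > other) can be read as a simple ordering over source tiers. The sketch below shows one way to apply it when several sources report different values for the same field; the function name and the (tier, value) input shape are assumptions, and unknown tiers are treated as "other".

```python
# Lower number wins; the order is taken from the report:
# primary > metadata > curated > other.
SOURCE_PRIORITY = {"primary": 0, "metadata": 1, "curated": 2, "other": 3}


def resolve_conflict(claims):
    """Pick the value reported by the highest-priority source tier.

    claims: non-empty list of (source_tier, value) pairs (assumed shape).
    Ties within a tier resolve to the first claim listed.
    """
    if not claims:
        raise ValueError("no claims to resolve")
    _tier, value = min(
        claims,
        key=lambda claim: SOURCE_PRIORITY.get(claim[0], SOURCE_PRIORITY["other"]),
    )
    return value
```

Under this ordering an arXiv abstract figure always overrides an HF trending blurb, which is consistent with HF rank being used only as an ordering signal.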