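Paper Radar Daily | 2026-04-26

One-line takeaway: today's Top picks cluster around agent meta-capabilities and joint generation-understanding evaluation. If you read only one paper, DR-Venus (a 4B on-device deep research agent) has the lowest engineering barrier and the highest reuse value.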

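Summary

Today there are 32 candidates (each hit by both HF and S2); none appears in the seen-pool of the past 14 days, and the top 8 by ranking_score go into Top picks.
Top picks themes: self-evolving compression for long-context agents, meta-evaluation of LLM scientists, generative spatial intelligence, end-to-end RL website generation, a unified view of KD, a 4B edge deep research agent, and an EEG TTA benchmark. Three forces run through them: agent, reasoning, and multimodal evaluation.
Watchlist hits are grouped by theme: agent (OpenMobile / CluE / DeVI / CreativeGame), reasoning (MMCORE / Abstain-R1 / visual representation), and inference (Flash-SemiCRF kernel, audio fine-tuning safety).
Today's candidate JSON carries no S2 similar_papers, so the Further reading section is empty and logged in coverage_gaps; the arXiv categories field is also empty for every candidate, which is recorded in confidence_flags.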
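
The selection rule described in this summary (skip anything in the 14-day seen-pool, then keep the top 8 by ranking_score) can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `paper_id` key, the shape of `seen_pool` (paper id mapped to last-seen date), and the function name are all assumptions.

```python
from datetime import date, timedelta

def select_top_picks(candidates, seen_pool, today, window_days=14, top_k=8):
    """Drop candidates seen within the window, then keep the top_k by ranking_score.

    `candidates` is a list of dicts with hypothetical keys `paper_id` and
    `ranking_score`; `seen_pool` maps paper_id -> date last seen (assumed shape).
    """
    cutoff = today - timedelta(days=window_days)
    fresh = [
        c for c in candidates
        # A paper counts as seen only if its last-seen date falls inside the window.
        if not (c["paper_id"] in seen_pool and seen_pool[c["paper_id"]] >= cutoff)
    ]
    # sorted() is stable, so candidates with equal scores keep their input order.
    return sorted(fresh, key=lambda c: c["ranking_score"], reverse=True)[:top_k]

# Toy data for illustration only.
candidates = [
    {"paper_id": "a", "ranking_score": 4.5},
    {"paper_id": "b", "ranking_score": 3.5},
    {"paper_id": "c", "ranking_score": 4.1},
]
seen_pool = {"b": date(2026, 4, 20)}  # seen 6 days ago, so excluded
picks = select_top_picks(candidates, seen_pool, today=date(2026, 4, 26), top_k=2)
print([p["paper_id"] for p in picks])
```

With today's data the seen-pool filter removes nothing, so the step reduces to a plain sort-and-truncate over the 32 candidates.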

📌 Top picks (cross-source hits)

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression (HF↑18 / trend#33 / score 4.5 / watchlist_keyword:reasoning,agent; nice_to_have:benchmark)
→ Self-evolving compression rules compact terminal-agent trajectories and improve long-horizon performance
- tldr_en: TACO is proposed, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents and consistently improves performance across mainstream agent frameworks and strong backbone models.
- Why selected: hit by both HF and S2, watchlist:reasoning,agent; self-evolving compression targets a core pain point of long-context agents
- Links: HF · S2
- Authors: Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng …
AI scientists produce results without reasoning scientifically (HF↑4 / trend#35 / score 4.5 / watchlist_keyword:reasoning,agent; nice_to_have:evaluation)
→ 25K runs reveal that LLM scientific agents lack self-correcting reasoning
- tldr_en: LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, are evaluated through more than 25,000 agent runs and two complementary lenses, observing that the base model is the primary determinant of both performance and behavior.
- Why selected: hit by both HF and S2, watchlist:reasoning,agent; a large-scale 25K-run evaluation of the meta-capabilities of LLM research agents
- Links: HF · S2
- Authors: Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani …
Encoder-Free Human Motion Understanding via Structured Motion Descriptions (HF↑1 / trend#6 / score 4.4 / hf_trending_rank:6; watchlist_keyword:reasoning)
→ Converts joint sequences into structured language descriptions for encoder-free motion understanding
- tldr_en: Structured Motion Description (SMD) is proposed, a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory and enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules.
- Why selected: HF trending #6 + watchlist:reasoning; an encoder-free alternative worth tracking as a multimodal fusion path
- Links: HF · S2
- Authors: Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning (HF↑3 / trend#9 / score 4.1 / hf_trending_rank:9; watchlist_keyword:agent)
→ WebGen-R1: end-to-end RL website generation that rivals DeepSeek-R1
- tldr_en: This work proposes WebGen-R1, an end-to-end RL framework tailored for project-level website generation that not only consistently outperforms heavily scaled open-source models, but also rivals the state-of-the-art DeepSeek-R1 in functional success, while substantially exceeding it in valid rendering and aesthetic alignment.
- Why selected: HF trending #9 + watchlist:agent; end-to-end RL serves as a useful baseline for project-level website generation
- Links: HF · S2
- Authors: Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li …
Hybrid Policy Distillation for LLMs (HF↑10 / trend#10 / score 4.0 / hf_trending_rank:10; watchlist_keyword:reasoning)
→ Unifies KD as a token-level reweighted log-likelihood objective and proposes hybrid policy distillation
- tldr_en: This work breaks down the design of existing KD methods and presents a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level, and proposes Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking.
- Why selected: HF trending #10 + watchlist:reasoning; the unified view of KD is methodologically significant for distillation research
- Links: HF · S2
- Authors: Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
Exploring Spatial Intelligence from a Generative Perspective (HF↑21 / trend#43 / score 3.5 / watchlist_keyword:reasoning; nice_to_have:benchmark,fine-tuning,evaluation)
→ GSI-Syn: generative spatial intelligence that also strengthens spatial understanding
- tldr_en: Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding, providing the first clear evidence that generative training can tangibly strengthen spatial reasoning.
- Why selected: HF upvotes 21 + watchlist:reasoning + benchmark/eval; GSI evaluates generative spatial capability
- Links: HF · S2
- Authors: Muzhi Zhu, Shunyao Jiang, Huanyi Zheng, Zekai Luo, Hao Zhong, Anzhou Li …
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data (HF↑47 / trend#48 / score 3.5 / watchlist_keyword:agent; nice_to_have:benchmark,sft,fine-tuning)
→ DR-Venus: a 4B edge deep research agent trained on 10K open data
- tldr_en: This work studies how to train a strong small deep research agent under limited open data by improving both data quality and data utilization, and presents DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data.
- Why selected: HF upvotes 47 (today's top tier) + watchlist:agent; a 4B on-device deep research agent with high engineering value
- Links: HF · S2
- Authors: Venus Team, Sunhao Dai, Yong Deng, Jinzhen Lin, Yusheng Song, Guoqing Wang …
Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts (HF↑2 / trend#22 / score 3.3 / hf_trending_rank:22; watchlist_keyword:inference; nice_to_have:benchmark)
→ NeuroAdapt-Bench: a systematic TTA benchmark for EEG foundation models
- tldr_en: NeuroAdapt-Bench is introduced, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts and shows that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation.
- Why selected: HF trending #22 + watchlist:inference + benchmark; a systematic diagnosis of TTA for EEG foundation models
- Links: HF · S2
- Authors: Gabriel Jason Lee, Jathurshan Pradeepkumar, Jimeng Sun

🏷 Watchlist hits by category

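The arXiv categories field is empty for every candidate in this issue's JSON (the HF+S2 path does not backfill arXiv category metadata; recorded in confidence_flags), so papers are grouped by watchlist keyword theme instead of strict cs.* categories.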
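
As a rough sketch of this fallback, grouping could prefer the arXiv category when it exists and drop back to a watchlist keyword otherwise. The field names `categories` and `watchlist_keywords`, the "first entry wins" rule, and the `ungrouped` bucket are assumptions made for illustration, not the pipeline's confirmed behaviour.

```python
from collections import defaultdict

def group_candidates(candidates):
    """Group titles by arXiv category, falling back to the first watchlist keyword."""
    groups = defaultdict(list)
    for c in candidates:
        if c.get("categories"):
            # Strict grouping path: use the primary arXiv category (e.g. cs.CL).
            key = c["categories"][0]
        elif c.get("watchlist_keywords"):
            # Fallback path used when category metadata is unpopulated.
            key = c["watchlist_keywords"][0]
        else:
            key = "ungrouped"
        groups[key].append(c["title"])
    return dict(groups)

# Toy input mirroring three of today's hits; categories are empty as in this issue.
grouped = group_candidates([
    {"title": "OpenMobile", "categories": [], "watchlist_keywords": ["agent"]},
    {"title": "Abstain-R1", "categories": [], "watchlist_keywords": ["reasoning"]},
    {"title": "Flash-SemiCRF", "categories": [], "watchlist_keywords": ["inference"]},
])
print(grouped)
```

A paper that matches several keywords lands in only one group under this rule, which is one possible design choice among several.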

agent / multi-step tool use

Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks (score 3.0 / HF↑6 / watchlist_keyword:agent; nice_to_have:benchmark,evaluation)
→ CluE uses cluster-based self-evolution to synthesize cross-task LLM memory-extraction data
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis (score 2.6 / HF↑27 / hf_trending_rank:29; watchlist_keyword:agent; nice_to_have:benchmark)
→ OpenMobile: an open-source framework for synthesizing tasks and trajectories for mobile agents
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation (score 2.0 / HF↑24 / watchlist_keyword:agent)
→ DeVI: synthetic video guides physically plausible dexterous human-object interaction policies
CreativeGame: Toward Mechanic-Aware Creative Game Generation (score 2.0 / HF↑2 / watchlist_keyword:agent)
→ CreativeGame: multi-agent iterative generation of mechanic-aware HTML5 games

reasoning / reasoning and multimodal

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings (score 3.0 / HF↑2 / watchlist_keyword:reasoning; nice_to_have:benchmark,evaluation)
→ MMCORE: an aligned latent space improves text-to-image generation and editing
Seeing Fast and Slow: Learning the Flow of Time in Videos (score 2.9 / HF↑15 / hf_trending_rank:21; watchlist_keyword:reasoning)
→ Self-supervised learning of the flow of time and playback speed in videos
Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL (score 2.5 / HF↑8 / watchlist_keyword:reasoning; nice_to_have:fine-tuning)
→ Abstain-R1: verifiable RL for calibrated abstention and clarification
Image Generators are Generalist Vision Learners (score 2.0 / HF↑9 / watchlist_keyword:reasoning)
→ Image-generation training doubles as general-purpose visual representation pretraining

inference / inference optimization

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs (score 2.7 / HF↑1 / hf_trending_rank:28; watchlist_keyword:inference; nice_to_have:fine-tuning)
→ Benign fine-tuning alone can break the safety alignment of Audio LLMs
Streaming Structured Inference with Flash-SemiCRF (score 2.3 / HF↑2 / hf_trending_rank:27; watchlist_keyword:inference)
→ Flash-SemiCRF: fused Triton kernels accelerate semi-CRF inference

RL & preference optimization

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training (score 3.2 / HF↑10 / watchlist_keyword:preference optimization; nice_to_have:benchmark; citation_velocity:0.7)
→ Adaptive hybrid post-training improves both semantics and expressiveness in spoken dialogue

🔗 Further reading (Semantic Scholar similar papers)

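No high-confidence incremental signal for this section today (S2 returned no similar papers; the candidate JSON did not prefetch the similar_papers field, and per the SKILL convention no separate request is issued to backfill it). s2_similar_unavailable has been written to coverage_gaps.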

🧑‍🔬 Newly emerging authors / teams

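Today's discovery scan found no qualifying candidates. Note: affiliations are empty for all of today's Top picks and hit candidates (the HF+S2 sources attach no institutions), so the "cross-institution check + repeated appearance" rule in discovery_rules.md cannot be applied for cross-validation; there are also no must-read venue hits and no tracked_authors hits. The list stays unchanged; another attempt will follow once a later arXiv-direct fetch restores affiliation data.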

📉 Coverage gaps and uncertainty

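s2_similar_unavailable: the candidate JSON contains no similar_papers field, so the Further reading section is empty.
arxiv_only_tier_absent: the source of all 32 candidates is hf+s2, with no "niche but category-matching" papers fetched purely via arXiv-direct, so work with low community exposure but strong watchlist keyword matches may be missed.
arxiv_category_metadata_unpopulated (confidence_flag): categories are empty for all candidates, which prevents strict grouping by cs.CL/cs.LG; this issue falls back to grouping by watchlist keyword theme.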

Sources and cross-validation notes

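primary (arXiv): all arXiv URLs/PDFs in the candidate JSON resolve; conclusions are cited against the arXiv abs page per source_policy.md.
curated (HuggingFace Daily Papers): every candidate in this issue carries hf_url and hf_upvotes/hf_trending_rank, so the community curation signal is complete; trending rank is not treated as evidence of results, only as a ranking weight.
metadata (Semantic Scholar): S2 provides s2_tldr and citation_velocity but returned no similar_papers this run, so the Further reading section is downgraded.
stderr for all three sources shows no WARN, and none of the hard downgrades arxiv_unavailable / hf_daily_unavailable / s2_unavailable was triggered; the conflict priority primary > metadata > curated > other stays at its default.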
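
One way to read the conflict priority above is as a simple ordered lookup: when sources disagree on a field, take the value from the highest-priority source that actually supplies one. The sketch below assumes that interpretation; the function name and the per-field dict shape are hypothetical, and only the priority order comes from this report.

```python
# Priority order as stated in this report: primary > metadata > curated > other.
SOURCE_PRIORITY = ["primary", "metadata", "curated", "other"]

def resolve_field(values_by_source):
    """Return (source, value) from the highest-priority source with a non-empty value."""
    for source in SOURCE_PRIORITY:
        value = values_by_source.get(source)
        if value:
            return source, value
    # No source supplied a usable value.
    return None, None

# Toy conflict: S2 (metadata) and HF (curated) disagree on a title, arXiv is silent.
winner = resolve_field({"curated": "Title from HF", "metadata": "Title from S2"})
print(winner)
```

Treating empty strings and missing keys the same way keeps an unpopulated primary field from masking a usable lower-priority value.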

Hanzhi's BLOG

[Papers · 2026-04-26]