Researchers propose new benchmarks to test AI agents

Category: AI

A wave of new AI papers is pushing for more realistic ways to evaluate—and improve—large language model (LLM) “agents” and multimodal systems, especially when they must use tools, interact with users, and operate under uncertainty. Several studies warn that today’s evaluations can mislead: user simulators can be too “nice,” unlearning can look successful until queries get more complex, and safety or reliability can erode in high-stakes settings where outputs are hard to verify [3][4][7]. The same papers also offer practical ways forward: generate more diverse, executable tool-use tasks to boost out-of-distribution generalization; add verification and replanning loops to multi-agent orchestration; and build benchmarks that capture real trade-offs, like false positives in code review or how closely a simulated user matches actual people [2][6][17]. Together, the work points to a field that’s maturing—from “does it answer?” to “does it keep working reliably, safely, and efficiently in the messy real world?” [4][49].
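The verification-and-replanning loop recommended by several of these papers can be sketched in a few lines. This is a minimal illustration only, assuming hypothetical `plan`, `execute`, and `verify` callables standing in for real LLM and tool calls; it is not the implementation from any of the cited papers [6]:

```python
# Minimal plan-execute-verify-replan loop. The plan/execute/verify
# callables are hypothetical stand-ins for real LLM or tool calls.
from typing import Callable, List, Optional

def solve(query: str,
          plan: Callable[[str, list], List[str]],
          execute: Callable[[str], str],
          verify: Callable[[str, List[str]], bool],
          max_rounds: int = 3) -> Optional[List[str]]:
    failures: list = []              # feedback carried into replanning
    for _ in range(max_rounds):
        steps = plan(query, failures)
        results = [execute(step) for step in steps]
        if verify(query, results):
            return results           # verified answer
        failures.append(results)     # replan with the failed attempt as context
    return None                      # give up after max_rounds

# Toy usage: "verification" just checks that every step produced output.
out = solve(
    "sum 1..3",
    plan=lambda q, f: ["1", "2", "3"],
    execute=lambda s: s,
    verify=lambda q, r: len(r) == 3,
)
print(out)  # ['1', '2', '3']
```

The key design point is that failed attempts feed back into the planner rather than being discarded, so each round replans with more context.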

Did you know: One study found that simply increasing inference-time samples can shift jailbreak success scaling from polynomial to exponential under prompt injection—suggesting that “try more times” can sharply raise attack risk in some regimes [23].
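The intuition behind "try more times" can be illustrated with a simple independent-trials model. Note this is an illustrative assumption of my own, not the scaling analysis from the paper: if each sampled attempt independently succeeds with probability p, success over n samples compounds quickly.

```python
# Toy independent-trials model of repeated jailbreak attempts.
# Assumption: each inference-time sample succeeds independently with
# probability p; this is an illustration, not the paper's model.

def success_rate(p: float, n: int) -> float:
    """Probability that at least one of n samples succeeds."""
    return 1.0 - (1.0 - p) ** n

# Even a weak per-attempt attack compounds fast with more samples.
for n in (1, 10, 100, 1000):
    print(n, round(success_rate(0.01, n), 3))
```

With a 1% per-attempt success rate, 100 samples already push overall success above 60%, which is why sampling budgets matter for attack risk.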

Perspectives

Helicoid dynamics authors: They argue that when outputs can’t be reliably checked (for example, irreversible clinical or financial choices under fundamental uncertainty), frontier systems can fall into a “helicoid” loop—performing competently, drifting into error, correctly explaining what went wrong, then repeating the pattern at a higher level of sophistication. In their view, reliability can degrade most in the settings where verification is hardest [7].

Sources: arXiv

Sim2Real user-simulation authors: They argue that LLM-based user simulators create an “easy mode” because they’re stylistically uniform and overly cooperative. They call for human validation when simulations shape agent development and evaluation [4].

Sources: arXiv
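The stylistic uniformity the Sim2Real authors describe can be probed with simple lexical-diversity statistics. A sketch using toy messages (my own hypothetical data, not the paper's corpus or method) comparing type-token ratios of simulated versus human user turns:

```python
# Compare lexical diversity of simulated vs. human user messages.
# The message lists below are toy data, not the paper's corpus.

def type_token_ratio(messages):
    """Unique tokens / total tokens across a list of messages."""
    tokens = [t.lower() for m in messages for t in m.split()]
    return len(set(tokens)) / len(tokens)

simulated = ["Sure, that works for me!", "Sure, that works for me!",
             "Sure, that sounds great!"]
human = ["uhh can u redo that", "no. wrong file", "ok but make it shorter??"]

# A repetitive, overly cooperative simulator scores lower diversity.
print(type_token_ratio(simulated) < type_token_ratio(human))  # True
```

A low ratio is only a crude proxy for the "easy mode" problem, which is why the authors call for validation against actual humans rather than surface statistics alone.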

ReadSecBench authors: They frame README/documentation prompt injection as a structural “Trusted Executor Dilemma,” and argue the core issue isn’t just isolated attacks but gaps in evaluation coverage. To address that, they add a three-axis taxonomy (linguistic disguise, structural obfuscation, semantic abstraction) and a 500-README benchmark. They then show that a wide range of defenses (12 rule-based and 6 LLM-based) still struggles to detect attacks without triggering unacceptable false positives [35].

Sources: arXiv
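The detection-versus-false-positive trade-off they measure maps onto standard classification metrics. A small sketch with hypothetical labels (not ReadSecBench data) showing how the two rates are computed for a defense:

```python
# Detection rate vs. false-positive rate for an injection detector.
# Labels and predictions below are hypothetical, not ReadSecBench data.

def detection_metrics(labels, preds):
    """labels/preds: 1 = malicious / flagged, 0 = benign / passed."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg   # (detection rate, false-positive rate)

labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [1, 1, 0, 1, 1, 0, 0, 0]   # one missed attack, one false alarm
dr, fpr = detection_metrics(labels, preds)
print(dr, fpr)  # 0.75 0.25
```

A defense can trivially reach a perfect detection rate by flagging every README, which is exactly why the benchmark reports false positives alongside detection.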

ResearchGym authors: They report a capability–reliability gap in end-to-end research agents. A GPT-5-based agent improved over repository baselines in only 1 of 15 evaluations and completed just 26.5% of subtasks on average, though it occasionally surpassed a strong result on a Spotlight-level task [49].

Sources: arXiv


Additional Sources: arxiv.org, confident-ai.com, google.com, twitter.com

Metadata: Cluster #3, 4 unique domains, 112 articles

View Full JSON Data
{
  "articles": [
    {
      "date": "2026-03-13T03:56:22+00:00",
      "domain": "google.com",
      "image": "",
      "image_caption": "",
      "link": "https://news.google.com/atom/articles/CBMijAFBVV95cUxQeGpEY3VSTDlWZWxKamFoN1JqdXhRM1JOM05jZXdFS1hXOGpzWTNDTjFmN2dxMTBqcENCMEJKMFZVTzQ0UDBfT2xXajI0b3JvMjhQajd4UjlWZU94Um9QSjV5NXdEaVFaaWRKcjhCRlVTRzNwWE5sOUFyTkM2T0M1RnZvSUdoX0NZNUhBZQ",
      "title": "Evaluating Large Language Models with Scientific Literature - BIOENGINEER.ORG"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11076",
      "title": "DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11245",
      "title": "Mind the Sim2Real Gap in User Simulation for Agentic Tasks"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11266",
      "title": "The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11279",
      "title": "AI Psychometrics: Evaluating the Psychological Reasoning of Large Language Models with Psychometric Validities"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11445",
      "title": "Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11559",
      "title": "AI Knows What\u0027s Wrong But Cannot Fix It: Helicoid Dynamics in Frontier LLMs Under High-Stakes Decisions"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11601",
      "title": "See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11631",
      "title": "VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11679",
      "title": "LLMs can construct powerful representations and streamline sample-efficient supervised learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11689",
      "title": "Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11709",
      "title": "Scaling Laws for Educational AI Agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12056",
      "title": "XSkill: Continual Learning from Experience and Skills in Multimodal Agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12109",
      "title": "On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12246",
      "title": "Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11067",
      "title": "Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11078",
      "title": "CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11099",
      "title": "Graph Tokenization for Bridging Graphs and Transformers"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11110",
      "title": "ResWM: Residual-Action World Model for Visual RL"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11211",
      "title": "A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11228",
      "title": "Markovian Generation Chains in Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11321",
      "title": "Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11331",
      "title": "Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11351",
      "title": "Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11395",
      "title": "ARROW: Augmented Replay for RObust World models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11542",
      "title": "ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11545",
      "title": "One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11558",
      "title": "RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11583",
      "title": "UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11597",
      "title": "Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11625",
      "title": "MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11677",
      "title": "From Control to Foresight: Simulation as a New Paradigm for Human-Agent Collaboration"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11682",
      "title": "Entropy-Preserving Reinforcement Learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11687",
      "title": "SemBench: A Universal Semantic Framework for LLM Evaluation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11862",
      "title": "You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11896",
      "title": "Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11914",
      "title": "Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11935",
      "title": "MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12094",
      "title": "Human-Centred LLM Privacy Audits: Findings and Frictions"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12110",
      "title": "Taming the Adversary: Stable Minimax Deep Deterministic Policy Gradient via Fractional Objectives"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12117",
      "title": "SommBench: Assessing Sommelier Expertise of Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12145",
      "title": "Automatic Generation of High-Performance RL Environments"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12151",
      "title": "IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12176",
      "title": "BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.18607",
      "title": "From Entity-Centric to Goal-Oriented Graphs: Enhancing LLM Knowledge Retrieval in Minecraft"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2511.12254",
      "title": "Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.04634",
      "title": "WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.15112",
      "title": "ResearchGym: Evaluating Language Model Agents on Real-World AI Research"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.08561",
      "title": "RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.09151",
      "title": "Deep Tabular Research via Continual Experience-Driven Execution"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.09203",
      "title": "Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2402.03627",
      "title": "Partially Recentralization Softmax Loss for Vision-Language Models Robustness"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2503.23830",
      "title": "OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.10900",
      "title": "Tuning-Free LLM Can Build A Strong Recommender Under Sparse Connectivity And Knowledge Gap Via Extracting Intent"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.13820",
      "title": "Structured Agent Distillation for Large Language Model"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.16211",
      "title": "AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.18675",
      "title": "ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2505.19240",
      "title": "LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2506.16584",
      "title": "Measuring Intent Comprehension in LLMs"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2506.21599",
      "title": "Refine-POI: Reinforcement Fine-Tuned Large Language Models for Next Point-of-Interest Recommendation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2507.16083",
      "title": "Efficient Compositional Multi-tasking for On-device Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2508.04604",
      "title": "TURA: Tool-Augmented Unified Retrieval Agent for AI Search"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2510.18632",
      "title": "Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2511.00617",
      "title": "Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2511.16846",
      "title": "ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2601.06550",
      "title": "LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.05474",
      "title": "LLM-driven Multimodal Recommendation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.07075",
      "title": "LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.20197",
      "title": "Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.23653",
      "title": "ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.04459",
      "title": "Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.05598",
      "title": "On the Value of Tokeniser Pretraining in Physics Foundation Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.09731",
      "title": "EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.09982",
      "title": "AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11295",
      "title": "Temporal Text Classification with Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11415",
      "title": "BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11495",
      "title": "Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11665",
      "title": "Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11780",
      "title": "Large Language Models for Biomedical Article Classification"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11838",
      "title": "DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11957",
      "title": "CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12105",
      "title": "To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12152",
      "title": "LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12191",
      "title": "Long-Context Encoder Models for Polish Language Understanding"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11126",
      "title": "Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11220",
      "title": "Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11327",
      "title": "Meta-Reinforcement Learning with Self-Reflection for Agentic Search"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11947",
      "title": "Resurfacing Paralinguistic Awareness in Large Audio Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12252",
      "title": "EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2506.20793",
      "title": "Multi-lingual Functional Evaluation for Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2507.11412",
      "title": "Seq vs Seq: An Open Suite of Paired Encoders and Decoders"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2601.02907",
      "title": "Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2601.03464",
      "title": "Prompting Underestimates LLM Capability for Time Series Classification"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2601.07796",
      "title": "Learning Through Dialogue: Engagement and Efficacy Matter More Than Explanations"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2601.22511",
      "title": "Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.01716",
      "title": "Mechanistic Indicators of Steering Effectiveness in Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.04509",
      "title": "Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2602.23440",
      "title": "Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.10000",
      "title": "Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought"
    },
    {
      "date": "2026-03-12T05:40:19+00:00",
      "domain": "confident-ai.com",
      "image": "",
      "image_caption": "",
      "link": "https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation",
      "title": "LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11139",
      "title": "H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11600",
      "title": "Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11653",
      "title": "Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11784",
      "title": "Language Generation with Replay: A Learning-Theoretic View of Model Collapse"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11901",
      "title": "FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12087",
      "title": "Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11212",
      "title": "Security-by-Design for LLM-Based Code Generation: Leveraging Internal Representations for Concept-Driven Steering Mechanisms"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.11332",
      "title": "On the Computational Hardness of Transformers"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.12055",
      "title": "Continual Learning with Vision-Language Models via Semantic-Geometry Preservation"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2509.06322",
      "title": "Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics, Revealing a Three-Stage In-Context Learning Mechanism"
    },
    {
      "date": "2026-03-13T04:00:00+00:00",
      "domain": "arxiv.org",
      "image": "",
      "image_caption": "",
      "link": "https://arxiv.org/abs/2603.09427",
      "title": "Impact of Markov Decision Process Design on Sim-to-Real Reinforcement Learning"
    },
    {
      "date": "2026-03-12T06:55:20+00:00",
      "domain": "twitter.com",
      "image": "",
      "image_caption": "",
      "link": "https://twitter.com/ChristosTzamos/status/2031845134577406426",
      "title": "LLMs as Computers? Executing programs inside transformers exponentially faster"
    }
  ],
  "category": "AI",
  "cluster_number": 3,
  "did_you_know": "One study found that simply increasing inference-time samples can shift jailbreak success scaling from polynomial to exponential under prompt injection\u2014suggesting that \u201ctry more times\u201d can sharply raise attack risk in some regimes [arxiv.org#22].",
  "domains": [
    {
      "favicon": "https://kagiproxy.com/img/rsRQfRVs0p6xK_waK6LvebRrUFzZzQhrRPV11xbW0n3HQrNhlK0vy8hwiKciJ0pnz_8KqKHgmHDAqzwHZLdP6wl2zfJs3iWxiLszH2Wy-A",
      "name": "arxiv.org"
    },
    {
      "favicon": "https://kagiproxy.com/img/fD4MEjImWIddb1Eb1fBWMjsmBP9qYxw2vC2g8PED_UGbC6PkfQp0-kbto60fxBQ-Y6l4QQyP7yZrv7-0Ag5IouqGMQe4s3aFB5XalUVSkrRh5lrDncE",
      "name": "confident-ai.com"
    },
    {
      "favicon": "https://kagiproxy.com/img/wYIAidQ_mQ3bHmLt17CmfzwCKCYwFE9N0Ilf-PS2pwA7aQoVQGBVSmO3DBvqN9uBlnBVSHRTD-Cw0JbLimVAIsveglOc8fXGS_-RtQ77u1c",
      "name": "google.com"
    },
    {
      "favicon": "https://kagiproxy.com/img/V9778ui_NSvs1ppqp-YHpIMFvhHwztrY0kKWRBIep_bHqSYyTYhwMdO6e7MSqDkgCHSI0PVpSkjzKkgKYNPuI64RhSRB_kpXbqSBuZJkJBzY",
      "name": "twitter.com"
    }
  ],
  "economic_implications": "",
  "emoji": "\ud83e\uddea",
  "feed_category": "AI",
  "future_outlook": "",
  "geopolitical_context": "",
  "heading_level": 1,
  "historical_background": null,
  "humanitarian_impact": "",
  "industry_impact": [],
  "international_reactions": [],
  "item_category": "Llm Evaluation",
  "key_players": [],
  "location": "",
  "number_of_titles": 112,
  "perspectives": [
    {
      "sources": [
        {
          "name": "arXiv",
          "url": "https://arxiv.org/abs/2603.11559"
        }
      ],
      "text": "Helicoid dynamics authors: They argue that when outputs can\u2019t be reliably checked (for example, irreversible clinical or financial choices under fundamental uncertainty), frontier systems can fall into a \u201chelicoid\u201d loop\u2014performing competently, drifting into error, correctly explaining what went wrong, then repeating the pattern at a higher level of sophistication. In their view, reliability can degrade most in the settings where verification is hardest [arxiv.org#6]."
    },
    {
      "sources": [
        {
          "name": "arXiv",
          "url": "https://arxiv.org/abs/2603.11245"
        }
      ],
      "text": "Sim2Real user-simulation authors: They argue that LLM-based user simulators create an \u201ceasy mode\u201d because they\u2019re stylistically uniform and overly cooperative. They call for human validation when simulations shape agent development and evaluation [arxiv.org#3]."
    },
    {
      "sources": [
        {
          "name": "arXiv",
          "url": "https://arxiv.org/abs/2603.11862"
        }
      ],
      "text": "ReadSecBench authors: They frame README/documentation prompt injection as a structural \u201cTrusted Executor Dilemma,\u201d and argue the core issue isn\u2019t just isolated attacks but gaps in evaluation coverage. To address that, they add a three-axis taxonomy (linguistic disguise, structural obfuscation, semantic abstraction) and a 500-README benchmark. They then show that a wide range of defenses (12 rule-based and 6 LLM-based) still struggles to detect attacks without triggering unacceptable false positives [arxiv.org#34]."
    },
    {
      "sources": [
        {
          "name": "arXiv",
          "url": "https://arxiv.org/abs/2602.15112"
        }
      ],
      "text": "ResearchGym authors: They report a capability\u2013reliability gap in end-to-end research agents. A GPT-5-based agent improved over repository baselines in 1 of 15 evaluations and completed only 26.5% of subtasks on average, while occasionally surpassing a strong Spotlight-level task [arxiv.org#48]."
    }
  ],
  "primary_image": null,
  "published": 1773397297,
  "quote": "",
  "quote_attribution": "",
  "quote_author": "",
  "scientific_significance": [
    "Meaning for AI research: Across multiple papers, the core message is that evaluation quality\u2014not just model scale\u2014often sets the pace of progress. Overly cooperative simulators, static unlearning tests, and blunt success metrics can systematically overestimate real-world agent capability [arxiv.org#2][arxiv.org#3][arxiv.org#16].",
    "Evidence strength: Several studies lean on large or structured evaluations, including \u03c4-bench with 451 human participants and 165 tasks to gauge user-simulation realism, as well as expert-curated query sets to evaluate orchestration [arxiv.org#2][arxiv.org#5].",
    "Limitations and future work: The papers repeatedly point to the need for stress tests under distribution shift (new toolsets, multi-hop queries, long-horizon tasks) and for defenses that reduce instruction-following exploitation in high-privilege agents [arxiv.org#1][arxiv.org#3][arxiv.org#34]."
  ],
  "source_urls": [
    "https://arxiv.org/abs/2603.12151",
    "https://arxiv.org/abs/2508.04604",
    "https://arxiv.org/abs/2603.04459",
    "https://arxiv.org/abs/2603.09731",
    "https://arxiv.org/abs/2511.16846",
    "https://arxiv.org/abs/2603.12094",
    "https://arxiv.org/abs/2603.11395",
    "https://arxiv.org/abs/2603.12176",
    "https://arxiv.org/abs/2603.11709",
    "https://arxiv.org/abs/2603.11935",
    "https://arxiv.org/abs/2603.11600",
    "https://arxiv.org/abs/2603.09151",
    "https://news.google.com/atom/articles/CBMijAFBVV95cUxQeGpEY3VSTDlWZWxKamFoN1JqdXhRM1JOM05jZXdFS1hXOGpzWTNDTjFmN2dxMTBqcENCMEJKMFZVTzQ0UDBfT2xXajI0b3JvMjhQajd4UjlWZU94Um9QSjV5NXdEaVFaaWRKcjhCRlVTRzNwWE5sOUFyTkM2T0M1RnZvSUdoX0NZNUhBZQ",
    "https://arxiv.org/abs/2506.21599",
    "https://arxiv.org/abs/2602.23440",
    "https://twitter.com/ChristosTzamos/status/2031845134577406426",
    "https://arxiv.org/abs/2602.04509",
    "https://arxiv.org/abs/2603.05598",
    "https://arxiv.org/abs/2602.23653",
    "https://arxiv.org/abs/2603.12117",
    "https://arxiv.org/abs/2505.13820",
    "https://arxiv.org/abs/2603.12087",
    "https://arxiv.org/abs/2602.20197",
    "https://arxiv.org/abs/2603.11914",
    "https://arxiv.org/abs/2603.11679",
    "https://arxiv.org/abs/2603.11545",
    "https://arxiv.org/abs/2603.11445",
    "https://arxiv.org/abs/2603.11896",
    "https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation",
    "https://arxiv.org/abs/2603.09982",
    "https://arxiv.org/abs/2603.12055",
    "https://arxiv.org/abs/2603.11583",
    "https://arxiv.org/abs/2603.11542",
    "https://arxiv.org/abs/2603.11067",
    "https://arxiv.org/abs/2602.05474",
    "https://arxiv.org/abs/2510.18632",
    "https://arxiv.org/abs/2505.19240",
    "https://arxiv.org/abs/2603.12109",
    "https://arxiv.org/abs/2511.12254",
    "https://arxiv.org/abs/2603.11687",
    "https://arxiv.org/abs/2603.11351",
    "https://arxiv.org/abs/2603.12152",
    "https://arxiv.org/abs/2603.11597",
    "https://arxiv.org/abs/2603.11126",
    "https://arxiv.org/abs/2603.11245",
    "https://arxiv.org/abs/2506.20793",
    "https://arxiv.org/abs/2603.12246",
    "https://arxiv.org/abs/2603.11957",
    "https://arxiv.org/abs/2603.09427",
    "https://arxiv.org/abs/2603.11495",
    "https://arxiv.org/abs/2602.04634",
    "https://arxiv.org/abs/2602.07075",
    "https://arxiv.org/abs/2603.11212",
    "https://arxiv.org/abs/2601.02907",
    "https://arxiv.org/abs/2603.11677",
    "https://arxiv.org/abs/2603.11665",
    "https://arxiv.org/abs/2507.11412",
    "https://arxiv.org/abs/2603.11901",
    "https://arxiv.org/abs/2505.18675",
    "https://arxiv.org/abs/2603.11327",
    "https://arxiv.org/abs/2603.12145",
    "https://arxiv.org/abs/2505.10900",
    "https://arxiv.org/abs/2603.12105",
    "https://arxiv.org/abs/2603.11682",
    "https://arxiv.org/abs/2603.11266",
    "https://arxiv.org/abs/2601.06550",
    "https://arxiv.org/abs/2603.12056",
    "https://arxiv.org/abs/2603.11947",
    "https://arxiv.org/abs/2601.22511",
    "https://arxiv.org/abs/2603.11558",
    "https://arxiv.org/abs/2603.11211",
    "https://arxiv.org/abs/2603.11625",
    "https://arxiv.org/abs/2507.16083",
    "https://arxiv.org/abs/2503.23830",
    "https://arxiv.org/abs/2603.12252",
    "https://arxiv.org/abs/2402.03627",
    "https://arxiv.org/abs/2603.11139",
    "https://arxiv.org/abs/2603.11559",
    "https://arxiv.org/abs/2603.11862",
    "https://arxiv.org/abs/2603.11838",
    "https://arxiv.org/abs/2603.11784",
    "https://arxiv.org/abs/2603.11332",
    "https://arxiv.org/abs/2505.18607",
    "https://arxiv.org/abs/2603.11076",
    "https://arxiv.org/abs/2505.16211",
    "https://arxiv.org/abs/2603.11780",
    "https://arxiv.org/abs/2603.11110",
    "https://arxiv.org/abs/2603.11099",
    "https://arxiv.org/abs/2603.11279",
    "https://arxiv.org/abs/2601.07796",
    "https://arxiv.org/abs/2603.09203",
    "https://arxiv.org/abs/2511.00617",
    "https://arxiv.org/abs/2603.11295",
    "https://arxiv.org/abs/2603.12191",
    "https://arxiv.org/abs/2603.11631",
    "https://arxiv.org/abs/2603.11228",
    "https://arxiv.org/abs/2603.12110",
    "https://arxiv.org/abs/2603.11415",
    "https://arxiv.org/abs/2603.11078",
    "https://arxiv.org/abs/2602.15112",
    "https://arxiv.org/abs/2603.11321",
    "https://arxiv.org/abs/2602.01716",
    "https://arxiv.org/abs/2603.08561",
    "https://arxiv.org/abs/2506.16584",
    "https://arxiv.org/abs/2603.11220",
    "https://arxiv.org/abs/2603.10000",
    "https://arxiv.org/abs/2603.11601",
    "https://arxiv.org/abs/2603.11331",
    "https://arxiv.org/abs/2603.11653",
    "https://arxiv.org/abs/2603.11689",
    "https://arxiv.org/abs/2509.06322",
    "https://arxiv.org/abs/2601.03464"
  ],
  "suggested_qna": [],
  "summary": "A wave of new AI papers is pushing for more realistic ways to evaluate\u2014and improve\u2014large language model (LLM) \u201cagents\u201d and multimodal systems, especially when they must use tools, interact with users, and operate under uncertainty. Several studies warn that today\u2019s evaluations can mislead: user simulators can be too \u201cnice,\u201d unlearning can look successful until queries get more complex, and safety or reliability can erode in high-stakes settings where outputs are hard to verify [arxiv.org#2][arxiv.org#3][arxiv.org#6].\n\nThe same papers also offer practical ways forward: generate more diverse, executable tool-use tasks to boost out-of-distribution generalization; add verification and replanning loops to multi-agent orchestration; and build benchmarks that capture real trade-offs, like false positives in code review or how closely a simulated user matches actual people [arxiv.org#1][arxiv.org#5][arxiv.org#16]. Together, the work points to a field that\u2019s maturing\u2014from \u201cdoes it answer?\u201d to \u201cdoes it keep working reliably, safely, and efficiently in the messy real world?\u201d [arxiv.org#3][arxiv.org#48].",
  "talking_points": [
    "Tool-use diversity: DIVE derives tasks from real tool-execution traces across 373 tools. It reports that training Qwen3-8B on its data improves performance by +22 points on nine OOD benchmarks, and that scaling diversity can beat scaling quantity even with 4\u00d7 less data [arxiv.org#1].",
    "Human realism gap: A Sim2Real study ran the full \u03c4-bench protocol with 451 participants and found that 31 LLM user simulators act overly cooperative and deliver uniformly positive feedback. That \u201ceasy mode\u201d can push agent success above the human baseline; the authors introduce the User-Sim Index (USI) to quantify realism [arxiv.org#2].",
    "Unlearning stress tests: \u201cThe Unlearning Mirage\u201d argues that unlearning can look effective on static tests yet fail after small query tweaks. It proposes dynamically generating structured probe sets (including semantically equivalent variants and controlled multi-hop chains) that both align with prior evaluations and reveal additional failures\u2014especially on harder, multi-step queries [arxiv.org#3].",
    "Agent oversight loop: Verified Multi-Agent Orchestration (VMAO) breaks complex queries into sub-questions, checks for completeness, and replans as needed. On 25 expert-curated market research queries, it reports higher completeness (3.1\u21924.2) and source quality (2.6\u21924.1) than a single-agent baseline [arxiv.org#5].",
    "Security reality check: ReadSecBench tests README-embedded instruction injection for high-privilege agents. It reports end-to-end exfiltration success rates as high as 85% in tests of a commercially deployed computer-use agent, alongside poor detection in a 15-participant study [arxiv.org#34]."
  ],
  "technical_details": [
    "Evidence-driven task derivation: DIVE generates supervision by executing real tools, then reverse-deriving the tasks implied by those traces. It controls diversity through tool-pool coverage and per-task toolset variety, and reports a 48k SFT + 3.2k RL training recipe for Qwen3-8B [arxiv.org#1].",
    "Dynamic unlearning probes: The \u201cUnlearning Mirage\u201d framework elicits pre-unlearning knowledge to auto-construct probe sets ranging from single-hop to multi-hop chains. It also uses activation analyses to argue that multi-hop queries can route through alternative computation pathways that remain intact after unlearning [arxiv.org#3].",
    "DAG plan-execute-verify-replan: VMAO represents a query as a directed acyclic graph of sub-questions, runs dependency-aware parallel execution with context propagation, and uses an LLM-based verifier as a coordination signal for adaptive replanning with configurable stop conditions [arxiv.org#5]."
  ],
  "timeline": [],
  "title": "Researchers propose new benchmarks to test AI agents",
  "unique_domains": 4,
  "url": "https://news.google.com/atom/articles/CBMijAFBVV95cUxQeGpEY3VSTDlWZWxKamFoN1JqdXhRM1JOM05jZXdFS1hXOGpzWTNDTjFmN2dxMTBqcENCMEJKMFZVTzQ0UDBfT2xXajI0b3JvMjhQajd4UjlWZU94Um9QSjV5NXdEaVFaaWRKcjhCRlVTRzNwWE5sOUFyTkM2T0M1RnZvSUdoX0NZNUhBZQ",
  "user_action_items": [
    "Audit your agent evaluation setup for \u201ceasy mode\u201d bias: If you use LLM user simulators, compare results against real-user runs or add realism checks like USI-style behavior and feedback metrics described in \u03c4-bench human studies [arxiv.org#2].",
    "Stress-test safety measures with harder queries: For unlearning or \u201cright to be forgotten\u201d workflows, add multi-hop and aliasing probes using dynamic, structured query generation\u2014not just static Q\u0026A sets [arxiv.org#3].",
    "Harden documentation-to-execution pipelines: If you run high-privilege coding or computer-use agents, treat READMEs and docs as untrusted input, and evaluate against README-embedded injection scenarios like ReadSecBench before deployment [arxiv.org#34]."
  ]
}