CMU researchers are presenting 127 papers at the Forty-Second International Conference on Machine Learning (ICML 2025), held July 13th-19th at the Vancouver Convention Center. Here’s a quick overview of the areas our researchers are working on:
Here are our most frequent collaborator institutions:
Oral Papers
Expected Variational Inequalities
This paper introduces expected variational inequalities (EVIs), a relaxed version of variational inequalities (VIs) where the goal is to find a distribution that satisfies the VI condition in expectation. While VIs are generally hard to solve, the authors show that EVIs can be solved efficiently, even under challenging, non-monotone conditions, by leveraging ideas from game theory. EVIs generalize the concept of correlated equilibria and unify various results across smooth games, constrained games, and settings with non-concave utilities, making them broadly applicable beyond traditional game-theoretic contexts.
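For reference, one standard way to write the two problems (notation ours, simplified): a VI asks for a single point at which the operator $F$ admits no improving direction, while an EVI only asks for this on average over a distribution:

$$\text{(VI)}\quad \text{find } x^{*} \in \mathcal{X} \ \text{ such that } \ \langle F(x^{*}),\, x - x^{*} \rangle \ge 0 \quad \forall x \in \mathcal{X}$$

$$\text{(EVI)}\quad \text{find } \sigma \in \Delta(\mathcal{X}) \ \text{ such that } \ \mathbb{E}_{x \sim \sigma}\big[\langle F(x),\, y - x \rangle\big] \ge 0 \quad \forall y \in \mathcal{X}$$

Roughly, instantiating $F$ with the players’ utility gradients recovers correlated-equilibrium-style conditions, which is the sense in which EVIs generalize correlated equilibria (sign conventions vary across the literature).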
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
This paper shows that voting-based benchmarks for evaluating LLMs (such as Chatbot Arena) can be vulnerable to adversarial manipulation if proper defenses aren’t in place. The authors show that an attacker can identify which model generated a response and then strategically vote to boost or demote specific models, altering the leaderboard with only around a thousand votes in a simulated environment. They collaborate with Chatbot Arena’s developers to propose and implement security measures, such as reCAPTCHA and login requirements, that significantly raise the cost of such attacks and improve the platform’s robustness.
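A minimal sketch of why a modest number of targeted votes can move a leaderboard, using a toy Elo-style rating update (the update rule and numbers are ours, not the paper’s):

```python
import random

def elo_update(r_a, r_b, score_a, k=4.0):
    """Standard Elo update; score_a is 1.0 if model A wins the vote, else 0.0."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

random.seed(0)
target, rival = 1000.0, 1000.0

# Honest phase: the two models are genuinely evenly matched.
for _ in range(5000):
    win = 1.0 if random.random() < 0.5 else 0.0
    target, rival = elo_update(target, rival, win)

# Attack phase: the adversary identifies which model wrote each response
# (the paper shows this is feasible) and always votes for the target.
for _ in range(1000):
    target, rival = elo_update(target, rival, 1.0)

print(target, rival)  # ~1,000 targeted votes open a large rating gap
```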
High-Dimensional Prediction for Sequential Decision Making
This paper presents a new algorithmic framework for making reliable, multi-dimensional forecasts in adversarial, nonstationary environments. Unlike existing online learning methods, this approach provides simultaneous performance guarantees for many agents, even when they face different objectives, act over large action spaces, or care about specific conditions (e.g. weather or route choice). The algorithm ensures low bias across many conditional events and enables each agent to achieve strong guarantees like diminishing regret. Applications include efficient solutions for online combinatorial optimization and multicalibration.
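For intuition, a central guarantee of this type is multicalibration (our notation): the forecast $p$ should have low bias conditioned on every event $G$ in a large collection $\mathcal{G}$,

$$\Big|\, \mathbb{E}\big[(y - p)\,\mathbf{1}\{G\}\big] \,\Big| \;\le\; \alpha \qquad \forall\, G \in \mathcal{G},$$

so that an agent who best-responds to the forecast, as if it were correct on the events it cares about, inherits guarantees such as diminishing regret.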
LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
This paper introduces LLM-SRBench, a new benchmark designed to rigorously evaluate the ability of LLMs to discover scientific equations (rather than simply recall them from training data). Existing tests often rely on well-known equations, making it hard to tell whether models are truly reasoning or just memorizing. LLM-SRBench addresses this by including 239 challenging problems across four scientific domains, split into two categories: one that disguises familiar physics equations (LSR-Transform) and another that features fully synthetic, reasoning-driven tasks (LSR-Synth). Evaluations show that even the best current models achieve only 31.5% accuracy, highlighting the difficulty of the task and establishing LLM-SRBench as a valuable tool for driving progress in LLM-based scientific discovery.
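As a rough illustration of how equation-discovery benchmarks of this kind are scored, one can check symbolic equivalence first and fall back to numeric error on held-out inputs (an illustrative recipe; LLM-SRBench’s exact metrics may differ):

```python
import numpy as np
import sympy as sp

def score_equation(candidate: str, reference: str, test_inputs: np.ndarray) -> dict:
    """Score a discovered equation against the hidden ground-truth equation."""
    x = sp.symbols("x")
    cand, ref = sp.sympify(candidate), sp.sympify(reference)

    # 1) Symbolic check: do the two expressions simplify to the same thing?
    if sp.simplify(cand - ref) == 0:
        return {"symbolic_match": True, "nmse": 0.0}

    # 2) Numeric fallback: normalized MSE on held-out input points.
    y_pred = sp.lambdify(x, cand, "numpy")(test_inputs)
    y_true = sp.lambdify(x, ref, "numpy")(test_inputs)
    nmse = float(np.mean((y_pred - y_true) ** 2) / (np.var(y_true) + 1e-12))
    return {"symbolic_match": False, "nmse": nmse}

print(score_equation("2*x + sin(x)", "sin(x) + 2*x", np.linspace(0.1, 1.0, 100)))
# {'symbolic_match': True, 'nmse': 0.0}
```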
On Differential Privacy for Adaptively Solving Search Problems via Sketching
This paper explores how to use differential privacy to protect against information leakage in adaptive search queries, a harder problem than traditional private estimation tasks. Unlike prior work that only returns numerical summaries (e.g., cost), the authors design algorithms that return actual solutions, like nearest neighbors or regression vectors, even when the inputs or queries change over time. They show how key problem parameters (like the number of approximate near neighbors or the condition number of the data matrix) affect the performance of these private algorithms. This work has practical implications for AI systems that rely on private database searches or real-time regression, enabling them to provide useful results while safeguarding sensitive information from attackers.
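For flavor, here is a textbook output-perturbation baseline for private regression using the Gaussian mechanism. This is emphatically not the paper’s sketching-based algorithm; the `sensitivity` bound below is assumed, and deriving it is exactly where quantities like the condition number enter:

```python
import numpy as np

def private_regression(X, y, eps, delta, sensitivity=1.0):
    """Least squares with Gaussian-mechanism output perturbation.

    Assumes the exact solution's L2 sensitivity to any one record is bounded
    by `sensitivity`; establishing that bound is the hard part, and it
    degrades as the data matrix X becomes ill-conditioned.
    """
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w + np.random.default_rng(0).normal(0.0, sigma, size=w.shape)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=200)
print(private_regression(X, y, eps=1.0, delta=1e-5))  # noisy but useful weights
```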
Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
This paper proposes a set of simple, abstract tasks designed to probe the creative limits of today’s language models in a controlled and measurable way. These tasks mimic real-world open-ended challenges like generating analogies or designing puzzles, where success requires discovering new connections or constructing novel patterns. The authors show that standard next-token prediction tends to be short-sighted and overly reliant on memorization, while alternative approaches like teacherless training and diffusion models produce more diverse, original outputs. They also introduce a technique called seed-conditioning, which adds randomness at the input rather than the output and can improve coherence without sacrificing creativity.
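A toy illustration of the seed-conditioning idea: inject randomness into the input and decode deterministically, instead of sampling noisy outputs token by token. The hash-based “model” below is a stand-in for greedy decoding, not the paper’s setup:

```python
import hashlib
import random

def toy_greedy_model(prompt: str, length: int = 6) -> str:
    """Deterministic stand-in for greedy decoding: output depends only on the prompt."""
    state, tokens = prompt, []
    for _ in range(length):
        token = hashlib.sha256(state.encode()).hexdigest()[:2]  # the 'argmax' token
        tokens.append(token)
        state += token
    return "-".join(tokens)

prompt = "invent a new analogy:"
random.seed(0)
for _ in range(3):
    seed = str(random.getrandbits(32))          # randomness enters at the input
    print(toy_greedy_model(f"[seed={seed}] {prompt}"))
# Same prompt and fully deterministic decoding, yet three distinct outputs:
# the random seed string, not output sampling, supplies the diversity.
```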
Training a Generally Curious Agent
This paper introduces Paprika, a fine-tuning method that equips language models with general decision-making and exploration strategies, enabling them to adapt to new tasks through interaction alone (i.e. without further training). Paprika trains models on synthetic environments requiring different exploration behaviors, encouraging them to learn flexible strategies rather than memorizing solutions. To improve efficiency, it uses a curriculum-learning-based approach that prioritizes tasks with high learning value, making the most of limited interaction data. Models trained with Paprika show strong transfer to completely new tasks, suggesting a promising direction for building AI agents that can learn to solve unfamiliar, sequential problems with minimal supervision.
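A minimal sketch of the curriculum idea: prioritize tasks by an estimate of how much the model is currently learning from them. The learning-progress heuristic below is a common choice for this, not necessarily Paprika’s exact criterion:

```python
import math
import random

def sample_task(tasks, history, temperature=0.1):
    """Sample a task with probability increasing in estimated learning value.

    `history[t]` holds recent success rates for task t; tasks whose success
    rate is moving fastest (neither mastered nor hopeless) score highest.
    """
    def learning_value(task):
        rates = history[task]
        if len(rates) < 2:
            return 1.0  # optimistic priority for unexplored tasks
        return abs(rates[-1] - rates[0])  # magnitude of recent progress

    weights = [math.exp(learning_value(t) / temperature) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

history = {"bandit": [0.2, 0.5], "wordle": [0.9, 0.9], "maze": [0.1, 0.1]}
print(sample_task(["bandit", "wordle", "maze"], history))  # usually "bandit"
```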
Spotlight Papers
GMAIL: Generative Modality Alignment for generated Image Learning
Generative models can create realistic images that could help train machine learning models, but using them as if they were real images can lead to problems because of differences between the two. This paper introduces a method called GMAIL that treats real and generated images as separate types (or modalities) and aligns them in a shared latent space during training, rather than just mixing them at the pixel level. The approach fine-tunes models on generated data using a dedicated loss to bridge the gap, then uses these aligned models to improve training on tasks like image captioning and retrieval. The results show that GMAIL improves performance on several vision-language tasks and scales well as more generated data is added.
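A sketch of the modality-alignment idea in PyTorch: treat paired real/generated embeddings as two modalities and pull them together in the shared latent space (the cosine-distance loss here is an illustrative choice, not necessarily GMAIL’s exact objective):

```python
import torch
import torch.nn.functional as F

def alignment_loss(real_emb: torch.Tensor, gen_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine distance between paired real/generated embeddings."""
    real = F.normalize(real_emb, dim=-1)
    gen = F.normalize(gen_emb, dim=-1)
    return (1.0 - (real * gen).sum(dim=-1)).mean()

# Toy paired batch: each generated image's embedding is a shifted version
# of its real counterpart's, standing in for the modality gap.
real_emb = torch.randn(32, 512)
gen_emb = real_emb + 0.1 * torch.randn(32, 512)

# In practice this term would be added to the task loss during fine-tuning:
# total = task_loss + lam * alignment_loss(real_emb, gen_emb)
print(float(alignment_loss(real_emb, gen_emb)))
```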
LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D
LOCATE 3D is a model that can find specific objects in 3D scenes based on natural language descriptions (like “the small coffee table between the couch and the lamp”). It achieves state-of-the-art performance on standard benchmarks and works well in real-world settings, like on robots or AR devices, by using RGB-D sensor data. A key component is 3D-JEPA, a new self-supervised learning method that uses features from 2D vision models (like CLIP or DINO) to understand 3D point clouds through masked prediction tasks. The model is trained on a newly released large dataset (130K+ examples), helping it generalize better across different environments.
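Schematically, the masked-prediction recipe predicts features at masked 3D locations and supervises them with frozen 2D foundation-model features lifted onto the point cloud. The modules and shapes below are placeholders, not LOCATE 3D’s actual architecture:

```python
import torch
import torch.nn as nn

points = torch.randn(1, 1024, 3)        # xyz point cloud
lifted_2d = torch.randn(1, 1024, 768)   # frozen CLIP/DINO features per point
mask = torch.rand(1, 1024) < 0.5        # half the points are masked out

encoder = nn.Linear(3, 768)             # stand-in for a point-cloud encoder
predictor = nn.Linear(768, 768)         # predicts features at masked points

context = encoder(points) * (~mask).unsqueeze(-1)   # visible points only
pred = predictor(context)

# JEPA-style objective: match predicted features to the frozen 2D targets
# at masked locations only (targets come from 2D vision models, no labels).
loss = nn.functional.smooth_l1_loss(pred[mask], lifted_2d[mask])
loss.backward()
```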
Masked Autoencoders Are Effective Tokenizers for Diffusion Models
This paper introduces MAETok, a masked autoencoder designed to create a high-quality, semantically meaningful latent space for diffusion models. The authors show that a well-structured latent space, meaning fewer Gaussian modes and more discriminative features, leads to better image generation without needing complex variational autoencoders. MAETok outperforms existing methods on ImageNet using just 128 tokens, and it’s also much faster: 76× faster to train and 31× faster at inference. The key takeaway is that the structure of the latent space, not variational constraints, is what really matters for high-quality diffusion-based generation.
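A schematic training step for an MAE-style tokenizer: encode only the visible patches into a plain (non-variational) latent, then reconstruct the masked content (stand-in modules, not MAETok’s networks):

```python
import torch
import torch.nn as nn

patches = torch.randn(8, 256, 768)      # batch, num_patches, patch_dim
mask = torch.rand(8, 256) < 0.75        # MAE-style high mask ratio

encoder = nn.Linear(768, 128)           # plain AE latent: no KL / variational term
decoder = nn.Linear(128, 768)

latent = encoder(patches * (~mask).unsqueeze(-1))   # encode visible content only
recon = decoder(latent)

# Reconstruction loss on the masked patches is what drives a structured,
# discriminative latent space here, without any variational constraint.
loss = nn.functional.mse_loss(recon[mask], patches[mask])
loss.backward()
```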
This paper highlights the lack of robust systems for identifying and reporting flaws in general-purpose AI (GPAI), especially compared to mature fields like software security. The authors propose three key solutions: (1) standardized reporting formats and engagement rules to streamline flaw reporting and triaging, (2) formal disclosure programs with legal protections for researchers (similar to bug bounties), and (3) better infrastructure for distributing flaw reports to relevant stakeholders. These steps aim to address emerging risks like jailbreaks and cross-system vulnerabilities, ultimately improving the safety and accountability of GPAI systems.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
This paper explores how best to scale test-time compute for large language models (LLMs), comparing two strategies: (1) distilling search traces (verifier-free, or VF) and (2) using verifiers or rewards to guide learning (verifier-based, or VB). The authors show, both theoretically and through experiments, that VB methods significantly outperform VF ones when working with limited compute or data. They explain that this performance gap grows as models and tasks get more complex, especially when solution paths vary in style or quality. Ultimately, the paper argues that verification is essential for effectively scaling LLM performance, especially on reasoning tasks.
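A toy simulation of the gap: given the same base model, a verifier-based best-of-n strategy converts extra test-time samples into accuracy in a way a single verifier-free trace cannot (the numbers are illustrative; the paper’s analysis is far more general):

```python
import random

random.seed(0)
P_CORRECT = 0.2   # base model solves a problem with 20% probability per sample
P_FLIP = 0.1      # verifier mislabels a solution 10% of the time

def sample_solution() -> bool:
    """Draw one candidate solution; True means it is actually correct."""
    return random.random() < P_CORRECT

def verifier_approves(is_correct: bool) -> bool:
    """Imperfect reward signal that still correlates with correctness."""
    return (not is_correct) if random.random() < P_FLIP else is_correct

def verifier_free() -> bool:
    return sample_solution()                 # one distilled trace, no reranking

def verifier_based(n: int = 8) -> bool:
    samples = [sample_solution() for _ in range(n)]
    approved = [s for s in samples if verifier_approves(s)]
    return approved[0] if approved else samples[0]

trials = 10_000
print("VF:", sum(verifier_free() for _ in range(trials)) / trials)   # ~0.20
print("VB:", sum(verifier_based() for _ in range(trials)) / trials)  # ~0.65
```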
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
As long-context LLMs become more common, their growing memory demands during inference slow down performance, largely due to the expanding key-value (KV) cache. This paper introduces ShadowKV, a system that significantly improves throughput by compressing the key cache using low-rank representations and offloading the value cache without major latency costs. It reconstructs only the necessary KV pairs during decoding to maintain speed and accuracy. Experiments show ShadowKV supports much larger batch sizes (up to 6×) and improves throughput by over 3× on standard hardware, all while preserving model quality across multiple LLMs and benchmarks.
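The core compression idea in NumPy: store a low-rank factorization of the key cache and rebuild only the keys needed at each decode step (a schematic of the technique, not ShadowKV’s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 4096, 128, 16

# Toy key cache that is exactly low-rank (the paper observes that pre-RoPE
# keys are approximately low-rank in practice).
K = rng.normal(size=(seq_len, rank)) @ rng.normal(size=(rank, head_dim))

# Compress: keep two thin factors instead of the full seq_len x head_dim cache.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :rank] * S[:rank]      # (seq_len, rank)
B = Vt[:rank]                   # (rank, head_dim)
print(f"memory ratio: {(A.size + B.size) / K.size:.3f}")   # ~0.13 here

# Decode step: rebuild only the keys attention actually needs right now.
needed = np.array([0, 17, 42, 4095])
K_rebuilt = A[needed] @ B
print(np.allclose(K_rebuilt, K[needed]))   # True (exact because K is low-rank)
```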