Introducing ARFBench: A time series question-answering benchmark based on real incidents

April 30, 2026


More than a trillion dollars are lost annually due to system failures. To resolve them, engineers must troubleshoot outages quickly.

An essential task in incident response involves analyzing observability metrics, or time series data that snapshot the health of software systems. For example, an engineer for a service might use Datadog to answer questions like “When did latency start rising?” and “What metrics outside of latency are also behaving abnormally?” to localize the root cause of the anomalous behavior. These time series question-answering (TSQA) tasks are essential for engineers, and present challenging and important tasks for SRE models and agents to perform. In this work, we explore the degree to which AI models can perform TSQA tasks.

To this end, we are excited to introduce the Anomaly Reasoning Framework Benchmark (ARFBench), a TSQA benchmark derived from real internal incidents at Datadog, using Datadog’s own internal telemetry (Figure 1). In this blog post, we present three key takeaways from our benchmarking experiments:

  1. Existing models struggle: Leading LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) have substantial room for improvement on ARFBench.
  2. Hybrid models help: We introduce a new hybrid TSFM-VLM model that yields overall performance comparable to top frontier models on ARFBench, demonstrating promising new approaches to TSQA modeling.
  3. Human–AI complementarity: We observe markedly different error profiles between our top TSFM-VLM model and human experts on ARFBench, suggesting that their strengths are complementary. We introduce a model–expert oracle that establishes a new superhuman frontier for LLMs, VLMs, and TSFMs.
Figure 1: A. Workflow of ARFBench question-answer generation. Engineers use commercial messaging platforms to respond to incidents, where they typically send time series widgets that visualize relevant metrics. Time series and incident timelines from internally monitored incidents are used as input to an LLM pipeline and fit to eight different question templates testing various aspects of anomalies. The resulting multiple-choice question-answer pairs can be used to evaluate various predictive models.

ARFBench: Using real-world incident data to create a TSQA benchmark

ARFBench is a TSQA benchmark based on real incidents internal to Datadog, using our own internal telemetry. Compared to existing benchmarks, ARFBench differs in three key aspects. First, it uses real time series data from production systems. Second, every question-answer (QA) example is grounded in expert annotations and additional context. And third, tasks are designed to test compositional reasoning: questions are organized into three tiers of increasing difficulty, with higher-tier tasks relying on correct reasoning on lower tiers (Figure 2).

Figure 2: Example questions from each tier of ARFBench. ARFBench questions are designed in three tiers of increasing difficulty, with higher-tier tasks relying on correct reasoning on lower tiers.

ARFBench consists of 750 QA pairs drawn from 142 time series and 63 incidents. Time series in ARFBench have a maximum of 2283 variates (or dimensions) and 40k time steps, which presents a challenging setting for context-limited models.
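To get a sense of scale, here is a rough back-of-the-envelope estimate of the text-serialization cost of the largest series. The per-value token figure is an illustrative assumption, not a measured property of any particular tokenizer:

```python
# Rough estimate of the text-serialization cost of ARFBench's largest series.
# Assumes ~5 tokens per numeric value (digits plus separators); this is an
# illustrative assumption, not a measured tokenizer property.
num_variates = 2283      # maximum variates reported for ARFBench
num_steps = 40_000       # maximum time steps reported for ARFBench
tokens_per_value = 5     # assumed serialization cost per value

total_values = num_variates * num_steps
total_tokens = total_values * tokens_per_value

print(f"values: {total_values:,}")          # ~91 million values
print(f"approx. tokens: {total_tokens:,}")  # ~457 million tokens
# Even a 1M-token context window covers well under 1% of such a series,
# so a context-limited model cannot ingest the full series verbatim.
```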

To create ARFBench, we built a VLM pipeline for extracting the time series widgets from internal incident discussion threads to help generate and filter question-answer pairs. We then manually verified every generated question for correctness and privacy concerns, and discarded questions that we found unsuitable.

Reasoning about time series and anomalies requires using meaningful context across data modalities. ARFBench enriches time series with two kinds of context: time series captions, which describe what the time series represent, and multivariate groupings, which contextualize each channel relative to a larger related collection of time series channels. For instance, while it may not always matter that a single pod fails and restarts in a service, the combination of many pods failing and restarting simultaneously might indicate a significant anomaly. This level of complexity reflects real-world scenarios that many existing unimodal, synthetic datasets fail to capture (Figure 3).
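To make the two kinds of context concrete, here is a minimal sketch of what a single enriched QA record could look like. The field names and values are invented for illustration and do not reflect the actual ARFBench schema:

```python
# Hypothetical illustration of an ARFBench-style QA record enriched with context;
# field names and values are invented and do not reflect the real dataset schema.
example_record = {
    "question": "Which variates show anomalous behavior during the incident window?",
    "choices": ["pod-1 only", "pod-3 only", "pods 1, 3, and 7", "none"],
    "answer": "pods 1, 3, and 7",
    "tier": 2,  # difficulty tier (1 = easiest, 3 = hardest)
    "time_series": {
        "values": [[0.98, 0.97, 0.51], [0.99, 0.98, 0.49]],  # truncated toy values
        "variate_names": ["pod-1.restarts", "pod-3.restarts"],
    },
    # Caption: describes what the series represents.
    "caption": "Container restart counts per pod for the checkout service.",
    # Multivariate grouping: situates each channel within a larger collection,
    # so that "many pods restarting at once" can be recognized as anomalous.
    "grouping": "All restart-count variates for pods backing the checkout service",
}
```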

Figure 3: When analyzed alone, variates of a time series may not be anomalous. However, in the context of a grouping of variates, the same variate may be considered anomalous. The multivariate time series in this figure is based on the average remaining TLS certificate lifetime across different clusters and IDs of a particular service.

Leading LLMs, VLMs, and TSFMs have substantial room for improvement

We evaluated three categories of existing models on ARFBench, which differ in how the time series is presented to the model (see the input-preparation sketch after this list):

  • LLMs, which take time series as text input
  • VLMs, which take time series plots as image input
  • Time series LLMs, which use a time series encoder with an LLM backbone
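The sketch below illustrates the first two input formats on toy data: serializing values as text for an LLM and rendering a plot image for a VLM. The helper functions and formatting choices are assumptions for illustration, not our actual evaluation harness:

```python
# Sketch of the text and image input formats on toy data; illustrative only.
import io
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
series = rng.normal(100.0, 5.0, size=(3, 200))  # 3 variates, 200 steps
names = ["latency.p95", "error.rate", "cpu.usage"]

# (1) LLM input: serialize values as text (here, rounded and comma-separated).
def to_text(series: np.ndarray, names: list[str]) -> str:
    lines = []
    for name, values in zip(names, series):
        lines.append(f"{name}: " + ", ".join(f"{v:.1f}" for v in values))
    return "\n".join(lines)

llm_prompt = to_text(series, names)

# (2) VLM input: render a plot and pass the image; with thousands of variates
# this becomes unreadable, a limitation discussed later in the post.
fig, ax = plt.subplots(figsize=(8, 3))
for name, values in zip(names, series):
    ax.plot(values, label=name)
ax.legend()
buf = io.BytesIO()
fig.savefig(buf, format="png")
vlm_image_bytes = buf.getvalue()
```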

We compared the models to two human baselines: observability experts, and time series researchers without extensive observability experience. The human experts were evaluated on a randomly sampled 25% subset of ARFBench.

Figure 4: Overall accuracy and F1 of various baselines and foundation models on ARFBench. Models are sorted by decreasing accuracy. Toto-1.0-QA-Experimental achieves the top accuracy on ARFBench and yields F1 comparable to top frontier models.

Among existing models, GPT-5 (VLM) yielded the top performance at 62.7% accuracy and 51.9% F1 (Figure 4). This is much higher than the random-choice baseline at 22.5%, but still underperforms domain experts and falls far below a model-expert oracle at 87.2% accuracy / 82.8% F1 (see below for further discussion). As expected, model performance tends to worsen as tier difficulty increases.

We also observe several trends in our evaluations on ARFBench. Corroborating earlier work in time series classification and QA such as Daswani et al. 2024, we find that VLMs outperform LLMs. The top proprietary models and open-source models also showed a substantial gap in performance. However, we find that some open-source models perform better than many older proprietary models or models from the Claude family.

Hybrid TSFM-VLM models show promise for specialized TSQA modeling

Figure 5: Architecture diagram of the Toto-1.0-QA-Experimental (Toto-Qwen3-VL) model. Frozen weights are denoted with a snowflake, while trainable weights are marked with a flame. With a small number of trainable parameters, we can align TSFMs and VLMs and yield novel abilities.

Though VLMs yielded the best accuracy and F1 score among existing models, we found that plotting and input representation were a challenge for both VLMs and LLMs. For example, due to the high number of variates, we often could not plot the time series without repeating colors for or occluding different variates. This motivated a native time series approach alongside the VLM model, in which we could utilize time series, plots, and text as joint input.

To test this, we trained a hybrid model (Figure 5) by combining Toto, a state-of-the-art observability TSFM, with Qwen3-VL 32B, a leading open-source VLM. We used both synthetic (Figure 6) and real multimodal data in a multi-stage post-training pipeline incorporating both supervised fine-tuning (SFT) and reinforcement learning (RL). The resulting model, Toto-1.0-QA-Experimental, yielded the top accuracy score of 63.9% and F1 comparable to top frontier models (48.9%). In the Anomaly Identification task category, where a model selects anomalous variates in the time series, Toto-1.0-QA-Experimental outperforms all models by at least 8.8 percentage points in F1 and achieves the best per-category accuracy, suggesting that TSFM-VLM modeling can greatly benefit performance on particular tasks. Additionally, Toto-1.0-QA-Experimental’s parameter count is several orders of magnitude lower than that of frontier models, offering potential efficiency gains at inference time.
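As a rough illustration of the hybrid design in Figure 5, the sketch below wires a frozen time series encoder and a frozen VLM together through a small trainable projection, so that only the adapter learns during post-training. Module names, dimensions, and the calling convention are placeholders; this is not the actual Toto or Qwen3-VL code:

```python
# Minimal PyTorch sketch of a frozen-TSFM + frozen-VLM hybrid with a small
# trainable projector. All module names and sizes are placeholders.
import torch
import torch.nn as nn

class HybridTSQAModel(nn.Module):
    def __init__(self, tsfm_encoder: nn.Module, vlm: nn.Module,
                 ts_dim: int = 768, vlm_dim: int = 4096):
        super().__init__()
        self.tsfm_encoder = tsfm_encoder   # frozen (snowflake)
        self.vlm = vlm                     # frozen (snowflake)
        for p in self.tsfm_encoder.parameters():
            p.requires_grad = False
        for p in self.vlm.parameters():
            p.requires_grad = False
        # Trainable projector (flame): maps time series embeddings into the
        # VLM's token embedding space so they can be consumed alongside
        # image and text tokens.
        self.projector = nn.Sequential(
            nn.Linear(ts_dim, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim),
        )

    def forward(self, time_series, image_tokens, text_tokens):
        ts_embeds = self.tsfm_encoder(time_series)   # (batch, ts_len, ts_dim)
        ts_tokens = self.projector(ts_embeds)        # (batch, ts_len, vlm_dim)
        # Concatenate native time series tokens with the VLM's usual inputs
        # and let the (placeholder) VLM consume the joint embedding sequence.
        joint = torch.cat([ts_tokens, image_tokens, text_tokens], dim=1)
        return self.vlm(joint)
```

With only the projector trainable, the SFT and RL stages touch a small fraction of the total parameters, which is one reason such adapters are attractive for aligning pretrained models.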

Figure 6: Synthetic data generation flow for post-training hybrid TSFM-VLM and TSFM-LLM models. Time series are generated by first sampling different lengths and scales and then sampling each datapoint from a normal distribution. To add variation, we add seasonality and drift components to the time series, yielding different base time series (top right). For each base time series, we apply question templates and inject different anomalies (e.g., level shift, change in seasonality) at various points in the time series (bottom right). Finally, we generate time series captions and reasoning for the question-answer pair using a VLM.
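The generation flow in Figure 6 can be approximated in a few lines of NumPy. The sketch below follows the described recipe (sampled length and scale, normal noise, seasonality and drift, injected level-shift anomaly), but the specific distributions and parameters are illustrative assumptions rather than the pipeline’s actual settings:

```python
# Illustrative sketch of the synthetic base-series + anomaly-injection recipe;
# distributions and parameters are assumptions, not the actual pipeline config.
import numpy as np

rng = np.random.default_rng(42)

def sample_base_series() -> np.ndarray:
    length = rng.integers(500, 2000)           # sample a length
    scale = rng.uniform(1.0, 100.0)            # sample a scale
    noise = rng.normal(0.0, 1.0, size=length)  # per-step normal samples
    t = np.arange(length)
    seasonality = np.sin(2 * np.pi * t / rng.integers(24, 168))  # periodic component
    drift = rng.uniform(-0.002, 0.002) * t                        # slow trend
    return scale * (noise + seasonality + drift)

def inject_level_shift(series: np.ndarray) -> tuple[np.ndarray, int]:
    """Inject a level-shift anomaly at a random point; return series and index."""
    start = rng.integers(len(series) // 4, 3 * len(series) // 4)
    shifted = series.copy()
    shifted[start:] += 5.0 * series.std()      # abrupt change in level
    return shifted, int(start)

base = sample_base_series()
anomalous, anomaly_start = inject_level_shift(base)
# A question template (e.g., "When does the level shift begin?") plus the known
# anomaly_start yields a multiple-choice QA pair; captions and reasoning are
# then generated with a VLM in a later step.
```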

We refer readers to the paper for more experimental details, error analysis, and case studies.

Domain experts complemented with models set a new superhuman frontier

The current aggregate gap on ARFBench between the best models (Toto-1.0-QA-Experimental & GPT-5) and the two human domain experts is only 8.8 percentage points in accuracy and 12.7 percentage points in F1. However, at the individual question level, we observe noticeably different behavior between GPT-5 and the human experts. GPT-5 correctly answers 48% of the questions that both experts get wrong; on these questions, the human experts tend to make errors in instruction-following or fine-grained perception. Meanwhile, at least one expert correctly answers 79% of the questions that GPT-5 gets wrong; on these questions, model errors tend to involve hallucination and incorrect domain knowledge. We provide examples of both groups of errors in the paper.

Given the large difference in error distributions, we hypothesize that when experts are complemented with models, their joint capability becomes much higher than that of any single expert or model alone. To establish this, we compute a model-expert oracle, a best-of-2 metric in which an oracle perfectly chooses the better answer between the model and the expert; it yields 87.2% accuracy and 82.8% F1 on our data. This is far above current model capabilities and sets a new superhuman frontier for LLMs, VLMs, and TSFMs to achieve.
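Concretely, the best-of-2 oracle can be computed per question by crediting a correct answer if either the model or the expert got it right. A minimal sketch on toy labels (not our evaluation code) looks like this:

```python
# Toy sketch of the best-of-2 model-expert oracle; labels are illustrative.
def oracle_accuracy(model_answers, expert_answers, gold_answers):
    """Count a question as correct if either the model or the expert is right."""
    correct = sum(
        (m == g) or (e == g)
        for m, e, g in zip(model_answers, expert_answers, gold_answers)
    )
    return correct / len(gold_answers)

model_answers  = ["A", "C", "B", "D", "A"]
expert_answers = ["B", "C", "C", "D", "B"]
gold_answers   = ["B", "C", "B", "A", "A"]
print(oracle_accuracy(model_answers, expert_answers, gold_answers))  # 0.8
```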

What’s next: time series reasoning as a core component of agents

In the broader scope of incident response, ARFBench only contains questions targeting analysis and reasoning. However, we envision that strong analysis and reasoning abilities will play a large part in end-to-end agentic systems (e.g., SRE or incident response agents) that require time series reasoning as a subroutine for understanding an incident. While ARFBench can be used to evaluate time series agents, it is not currently a multi-turn benchmark. Nevertheless, we believe that future agents and models that perform well on the single-turn ARFBench will ultimately perform better on end-to-end tasks.

Getting started with ARFBench

If you’re interested in testing your model on ARFBench, you can find the benchmark, leaderboard, and model weights on Hugging Face, and the code on GitHub. To learn more, read our paper.
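As a starting point, a benchmark hosted on Hugging Face can typically be loaded with the `datasets` library. The repository id, split name, and field name below are placeholders, so check the actual Hugging Face page for the correct identifiers:

```python
# Hedged example: "your-org/ARFBench", the "test" split, and the "question"
# field are placeholders, not the real identifiers; consult the Hugging Face
# page linked above for the actual repository name and schema.
from datasets import load_dataset

arfbench = load_dataset("your-org/ARFBench")
example = arfbench["test"][0]
print(example["question"])
```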
