Automated Failure Diagnosis for Multi-Agent Systems: A Step-by-Step Guide
Introduction
Multi-agent systems powered by large language models (LLMs) are increasingly used to tackle complex collaborative tasks, but failures remain common and notoriously difficult to debug. When your multi-agent system malfunctions, you're left sifting through endless interaction logs to determine which agent caused the failure and at what point—a process researchers have dubbed “automated failure attribution.” A team from Penn State, Duke, Google DeepMind, and other institutions developed a benchmark dataset (Who&When) and several automated attribution methods, accepted as a spotlight at ICML 2025. This guide walks you through applying those methods to your own multi-agent system, helping you pinpoint failures quickly and move from manual log archaeology to efficient, data-driven diagnosis.

What You Need
- Python 3.8+ environment
- Access to multi-agent system logs – interaction records in JSON format (example format provided with the dataset)
- Git and Hugging Face account (to download the Who&When dataset)
- PyTorch and Transformers libraries
- Basic understanding of LLM multi-agent architectures (e.g., agent roles, message passing)
Step-by-Step Guide
Step 1: Download the Who&When Dataset and Code
Begin by obtaining the benchmark dataset and open-source code from the official repositories. This ensures you have the right reference data and attribution tools.
- Clone the GitHub repository:
git clone https://github.com/mingyin1/Agents_Failure_Attribution
- Install required dependencies:
pip install -r requirements.txt
- Download the dataset from Hugging Face:
huggingface-cli download Kevin355/Who_and_When --local-dir ./data
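Once the download finishes, a quick sanity check confirms the logs are readable before you go further. A minimal sketch, assuming the dataset unpacks as JSON files somewhere under ./data (the exact file layout may differ):

```python
import json
from pathlib import Path

def load_logs(folder):
    """Parse every .json file under `folder` into a list of log dicts."""
    logs = []
    for path in sorted(Path(folder).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            logs.append(json.load(f))
    return logs

# Example (assumed layout): logs = load_logs("./data")
```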
Step 2: Understand the Benchmark Structure
The Who&When dataset contains multi-agent interaction logs annotated with ground-truth failure points: which agent was responsible and at which turn the failure originated. Study the dataset structure to map your own logs appropriately.
- Each log entry includes: task description, agent identities, sequence of agent messages, and a binary failure label with the responsible agent and turn index.
- Familiarize yourself with the three baseline attribution methods provided: DirectPrompt (ask an LLM to explain the failure), TraceEval (evaluate each agent’s contribution), and the proposed AttributionLM (a model fine-tuned for this task).
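As an illustration, a single annotated entry might look like the dict below. The key names here are assumptions based on the fields listed above, not the dataset's exact schema; check a real file from ./data before relying on them.

```python
# Hypothetical annotated log entry (key names are illustrative assumptions).
entry = {
    "task": "Plan a three-day trip itinerary",
    "agents": ["planner", "researcher", "critic"],
    "turns": [
        {"agent": "planner", "message": "Day 1: museum visit."},
        {"agent": "researcher", "message": "The museum is closed on Mondays."},
        {"agent": "planner", "message": "Ignoring that; keeping Day 1 as is."},
    ],
    "failure": True,
    "responsible_agent": "planner",
    "failure_turn": 2,  # 0-based index into `turns`
}

def ground_truth(entry):
    """Return the annotated (agent, turn) failure point, or None if the run succeeded."""
    if not entry["failure"]:
        return None
    return entry["responsible_agent"], entry["failure_turn"]
```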
Step 3: Prepare Your Own Multi-Agent Logs
To diagnose failures in your system, you need to format its logs in the same schema as the benchmark. This ensures the attribution methods can process them.
- Export your system’s interaction data: each turn should capture the agent that spoke, the message content, and a timestamp or turn number.
- For each task execution, create a JSON object with keys: task, agents (list of agent names), turns (list of turn objects with agent and message), and failure (boolean; set to false for unlabeled data).
- Save your logs in a folder (e.g., ./my_logs/).
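Converting your own records into that schema usually takes only a few lines. In this sketch, raw_turns is a stand-in for whatever your system exports; the (speaker, text) tuple layout is an assumption about your internal format:

```python
def to_benchmark_schema(task, raw_turns):
    """Map raw (speaker, text) records onto the benchmark's JSON layout."""
    return {
        "task": task,
        "agents": sorted({speaker for speaker, _ in raw_turns}),
        "turns": [{"agent": s, "message": m} for s, m in raw_turns],
        "failure": False,  # unlabeled data: no ground-truth annotation yet
    }

record = to_benchmark_schema(
    "Summarize a report",
    [("reader", "Here is the text."), ("writer", "Draft summary.")],
)
```

Serialize each such record with json.dump into its own file under ./my_logs/.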
Step 4: Run Automated Failure Attribution
Use the provided scripts to apply the attribution method of your choice to your logs. The code includes a command-line interface for each method.

- Choose a method: DirectPrompt is lightweight but less accurate; AttributionLM offers the best performance if you can run inference on a GPU.
- Run the attribution script:
python run_attribution.py --method AttributionLM --logs ./my_logs --output ./results
- The script returns a JSON file per log with responsible_agent and failure_turn predictions.
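With predictions in hand, it helps to aggregate across runs to see which agent fails most often. A sketch that tallies blame counts, assuming each result file yields a dict with responsible_agent and failure_turn as described above:

```python
from collections import Counter

def summarize(predictions):
    """Rank agents by how often they are blamed across a batch of result dicts."""
    blame = Counter(p["responsible_agent"] for p in predictions)
    return blame.most_common()

preds = [
    {"responsible_agent": "planner", "failure_turn": 2},
    {"responsible_agent": "planner", "failure_turn": 5},
    {"responsible_agent": "critic", "failure_turn": 1},
]
# summarize(preds) ranks agents by blame count, most-blamed first
```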
Step 5: Interpret the Results and Debug
Now you have a clear hypothesis: which agent likely caused the failure and at what step. Use this to focus your debugging efforts.
- Check the predicted failure turn in your original logs: examine the messages exchanged around that point.
- Verify whether the identified agent made an incorrect reasoning step or misunderstood another agent’s output.
- If the prediction seems off, cross-reference with the attribution confidence score (output by the model). Consider re-running with a different method (e.g., TraceEval) to triangulate.
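To inspect the messages exchanged around the predicted failure point, a small helper that slices the surrounding turns keeps the review focused (this assumes 0-based turn indices):

```python
def context_window(turns, failure_turn, radius=2):
    """Return the turns within `radius` of the predicted failure turn."""
    lo = max(0, failure_turn - radius)
    hi = min(len(turns), failure_turn + radius + 1)
    return turns[lo:hi]

turns = [{"agent": f"a{i}", "message": f"m{i}"} for i in range(10)]
window = context_window(turns, failure_turn=4, radius=1)  # turns 3..5
```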
Step 6: Improve Your System Based on Findings
The ultimate goal is to fix the failure and prevent recurrence. The attribution gives you a starting point for system iteration.
- Adjust the problematic agent’s prompt, role description, or information-processing pipeline.
- Add guardrails: require agents to confirm intermediate results before proceeding.
- Re-run your multi-agent system on the same task and verify the failure no longer occurs.
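One way to implement the confirmation guardrail is a wrapper that re-asks an agent when a validator rejects its output. This is a minimal sketch, not part of the released code; agent_fn and validator are placeholders for your own agent call and acceptance check:

```python
def with_confirmation(agent_fn, validator, max_retries=1):
    """Wrap an agent call so rejected outputs trigger a retry with feedback."""
    def wrapped(prompt):
        result = agent_fn(prompt)
        for _ in range(max_retries):
            if validator(result):
                break
            # Feed the rejection back so the agent can revise its answer.
            result = agent_fn(prompt + "\nYour previous answer was rejected; please revise.")
        return result
    return wrapped
```

In practice the validator might check that a cited fact appears in the shared context, or that a plan step references a real tool, before the pipeline proceeds.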
Tips for Success
- Start with the benchmark logs to test your pipeline before analyzing your own data.
- Use the AttributionLM model if you have access to a GPU; it significantly outperforms prompt-based methods.
- Log everything: the more context you capture per turn (e.g., internal reasoning), the better the attribution will work.
- Combine with manual review for critical systems; the model provides a hypothesis, not a guarantee.
- Contribute back: if you discover new failure patterns, consider augmenting the Who&When dataset to help the community.