Everyday AI use comes with a familiar frustration: sometimes the model does not follow the original instruction, and repeated correction does not help. In the worst cases, the answer drifts farther away from the task, accumulates unnecessary changes, or becomes harder to trust turn after turn.
That failure mode is the motivation behind iterative-collapse-detection, a project for studying whether multi-round refinement actually improves model behavior or whether it instead triggers a gradual collapse in quality. The codebase brings together a standardized multi-turn benchmark, reproduction runners, dataset preparation, evaluation, logging, and analysis utilities in one place. The project repository is here: https://github.com/Haoyu-Hu/iterative-collapse-detection.
Why this project started
This project grew out of an earlier collaboration on vibe coding, where we studied how humans and AI systems jointly iterate on code generation. In that work, AI-led interaction patterns consistently underperformed human-led ones, and the gap widened over repeated rounds.
The core result was not just that humans helped. It was that repeated interaction could actively hurt when the wrong agent controlled the loop.
That raises a broader question: if we already have a reasonable answer at round one, what happens when we keep asking the model to improve it? Does the answer become more accurate and stable, or does refinement itself create new failure modes?
The core idea
The basic intuition is simple: start from an initial task, get an answer, then repeatedly ask for improvement while preserving the growing interaction history. Instead of treating each generation as independent, the benchmark treats refinement as a behavioral process that unfolds across rounds.
The simplest version of the pipeline isolates the self-improvement loop. A model receives the original task, produces an answer, then receives an improvement prompt conditioned on both the task and its prior output. This process repeats for multiple rounds, letting us measure whether the trajectory looks like refinement or collapse.
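The loop described above can be sketched in a few lines. Everything here is illustrative: `generate` is a stand-in for whatever chat-completion call the benchmark actually makes, and the improvement prompt is not the project's real template.

```python
def generate(messages):
    """Placeholder for a chat-model call; returns an answer string."""
    return "answer"  # replace with a real API call


def refinement_loop(task, rounds=5):
    """Run the self-improvement loop, preserving the growing history."""
    history = [{"role": "user", "content": task}]
    answers = []
    for _ in range(rounds):
        answer = generate(history)
        answers.append(answer)
        # Each round is conditioned on the task AND all prior outputs.
        history.append({"role": "assistant", "content": answer})
        history.append({
            "role": "user",
            "content": "Please improve your previous answer to the original task.",
        })
    return answers  # one answer per round, for trajectory analysis
```

The key design point is that nothing is reset between rounds: the trajectory of `answers` is the object of study, not any single generation.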
This structure matters because many real interactions with AI are not one-shot. People revise, clarify, and push for better versions. A benchmark that only checks the first answer misses what can happen after several rounds of “try again”.
Moving closer to real use with a user simulator
The second pipeline is designed to look more like an actual feedback loop between a model and a user. Instead of generic improvement prompts, the project uses a persona-based user simulator that reacts to the current answer and produces feedback that feels closer to real-world interaction.
The simulated user is designed to do two things. First, it injects emotional and practical feedback that resembles how people actually respond when an answer is unclear, unsafe, incomplete, or off-target. Second, it preserves continuity across rounds, so the model is not only revising an answer but adapting to a changing social context. Around that loop, an internal quality check tracks logic, stability, and stylistic quality instead of relying on a single surface metric.
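As a rough sketch of that feedback loop (the persona text and the `model_answer` and `simulate_user` functions are hypothetical stand-ins, not the repo's actual components):

```python
# Illustrative persona; the real simulator conditions on richer persona state.
PERSONA = "You are a busy user who values clear, complete, and safe answers."


def model_answer(messages):
    """Placeholder for the model under test."""
    return "draft answer"


def simulate_user(persona, task, answer, history):
    """Placeholder: returns persona-conditioned feedback on the answer."""
    return "Step 2 is still unclear; please be more specific."


def feedback_loop(task, rounds=4):
    """Alternate model answers with simulated-user feedback across rounds."""
    history = [{"role": "user", "content": task}]
    transcript = []
    for _ in range(rounds):
        answer = model_answer(history)
        # The simulator sees the full history, so feedback stays continuous
        # across rounds rather than reacting to each answer in isolation.
        feedback = simulate_user(PERSONA, task, answer, history)
        transcript.append((answer, feedback))
        history.append({"role": "assistant", "content": answer})
        history.append({"role": "user", "content": feedback})
    return transcript
```

The important difference from the first pipeline is that the improvement signal now comes from a reactive, persona-conditioned source rather than a fixed prompt.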
This makes the benchmark useful for a more realistic question: not only whether a model can revise an answer, but whether it can remain stable while handling repeated human feedback over time.
A first experiment on Natural Plan
The first pilot experiment uses the user-simulator setting on Natural Plan and already shows why this direction is worth taking seriously.
Several patterns stand out even in this preliminary run:
- Accuracy does not improve monotonically with more interaction. Some models peak early and then flatten or degrade.
- User ratings drop sharply after the early rounds for all four tested models, suggesting that the answers feel worse to interact with even when other metrics move more slowly.
- Progress and planning scores often improve at first but then plateau or drift downward, which is consistent with a refinement loop that starts helpful and then loses focus.
- Answer similarity rises across rounds, indicating that later answers become increasingly anchored to earlier generations rather than reflecting a genuine rethinking of the task.
This is still an early result, but it already suggests that “more rounds” should not be treated as automatically beneficial. Repeated interaction can help, but it can also lock the model into a bad trajectory.
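One simple way to make the rising-similarity pattern concrete is Jaccard similarity over word sets between consecutive rounds. This is an illustrative metric, not necessarily the one the benchmark computes, and the sample answers are made up.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def round_to_round_similarity(answers):
    """Similarity of each answer to the previous round's answer."""
    return [jaccard(prev, cur) for prev, cur in zip(answers, answers[1:])]


# Hypothetical trajectory: each round mostly appends to the last answer.
answers = [
    "take the 9am train then visit the museum",
    "take the 9am train then visit the museum and eat lunch",
    "take the 9am train then visit the museum and eat lunch nearby",
]
print(round_to_round_similarity(answers))
```

When this sequence trends upward, later rounds are re-emitting earlier text with small additions, which is exactly the anchoring behavior the benchmark flags as a collapse signal.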
Why I find this exciting
I like this project because it turns an everyday intuition into something measurable. Many people have felt that an AI assistant can become less useful after too many corrective turns, but that feeling is usually anecdotal. Iterative-collapse-detection tries to make it observable, comparable, and eventually benchmarkable across models, prompting strategies, and user-feedback regimes.
If that works, the project could help answer a broader design question for human-AI systems: when should we keep refining, and when should we stop, restart, or change the interaction policy entirely?
Closing note
This is the first step rather than the final version of the project. The benchmark design, user simulation, and evaluation framing are all intended to grow from here.
If you are interested in this project and want to discuss it, feel free to reach out.
References
@misc{hu2026iterativecollapse,
  author = {Haoyu Hu},
  title  = {iterative-collapse-detection},
  year   = {2026},
  url    = {https://github.com/Haoyu-Hu/iterative-collapse-detection}
}

@misc{hu2026whyhumanguidance,
  author = {Haoyu Hu and Raja Marjieh and Katherine M. Collins and Chenyi Li and Thomas L. Griffiths and Ilia Sucholutsky and Nori Jacoby},
  title  = {Why Human Guidance Matters in Collaborative Vibe Coding},
  year   = {2026},
  url    = {https://arxiv.org/abs/2602.10473}
}
Project link: https://github.com/Haoyu-Hu/iterative-collapse-detection