[Project Notes] Auto-claude-code-research-in-sleep (ARIS)
TL;DR
Auto-claude-code-research-in-sleep, or ARIS, is a repository of Claude Code skills for running machine learning research workflows with a strong bias toward automation. Its most important design choice is not full autonomy by itself, but cross-model role separation: Claude Code executes, while a second model, usually reached through Codex MCP, reviews, scores, and challenges the work.
That makes the repository more interesting than a typical “agent does research” demo. The project is really a workflow system built out of SKILL.md files. Those skills encode checkpoints, review loops, GPU-budget rules, state persistence, paper-writing steps, and fallback behavior. The result is less like an application and more like a library of procedural research operators.
What The Repo Actually Contains
This is a lightweight repository. The core logic does not live in a Python package or a backend service. It lives mostly in the skill files under skills/. The top-level structure is enough to reveal the intent:
- orchestration skills such as idea-discovery, auto-review-loop, paper-writing, and research-pipeline
- support skills such as research-lit, novelty-check, run-experiment, monitor-experiment, paper-plan, paper-figure, and paper-compile
- a few practical assets and paper templates, especially under the paper-writing stack
That structure matters. ARIS is not pretending that prompt programs are incidental. The prompt programs are the product.
The Core Bet: Cross-Model Collaboration
The strongest conceptual claim in the repository is that single-model self-review is structurally weak. The README argues that if the same model both executes and critiques, it tends to reproduce its own blind spots. ARIS answers this by splitting the loop into two roles.
Claude Code acts as the fast executor. It edits files, launches jobs, monitors outputs, compiles papers, and keeps the workflow moving. A second model, usually GPT-5.4 through Codex MCP, acts as the critic. That critic is meant to be slower, more deliberate, and more adversarial.
This is not a cosmetic detail. Many of the skills are really wrappers around that asymmetry. They are designed to make execution and criticism collide repeatedly until weak claims are either repaired, softened, or killed.
The Three Main Workflows
The repository organizes itself around three main workflows, and this is the cleanest way to understand what it is trying to automate.
How the skills call each other
At the highest level, the repository’s composition looks like this:
```mermaid
flowchart TD
    A["research-lit"] --> B["idea-creator"]
    B --> C["novelty-check"]
    C --> D["research-review"]
    D --> E["idea-discovery"]
    E --> F["implement"]
    F --> G["run-experiment"]
    G --> H["auto-review-loop"]
    H --> I["paper-plan"]
    I --> J["paper-figure"]
    J --> K["paper-write"]
    K --> L["paper-compile"]
    L --> M["auto-paper-improvement-loop"]
    M --> N["research-pipeline"]
```
The important detail is that research-pipeline is not a monolithic engine. It is a coordinator that delegates to smaller skills and passes forward artifacts like IDEA_REPORT.md, AUTO_REVIEW.md, PAPER_PLAN.md, and NARRATIVE_REPORT.md.
Workflow 1: idea discovery
The idea-discovery pipeline chains literature review, brainstorming, novelty checking, and external critique. The research-lit skill is broader than a normal web-search helper: it can use Zotero, Obsidian, local PDFs, and web sources in a priority order. That already signals one of the repo’s best instincts, which is that useful research automation should attach to a researcher’s actual memory systems rather than pretending the open web is enough.
The rest of the workflow filters ideas through feasibility, compute cost, quick novelty checks, and optional pilot experiments. It is trying to prevent the common failure mode where an agent produces many interesting-sounding ideas and almost none of them survive contact with empirical reality.
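The filtering gates reduce to checks like the following. This is a hypothetical sketch, assuming ideas carry a cost estimate and a novelty score; the real skill expresses these gates as prompt instructions, and the cap and threshold values are invented for illustration.

```python
def filter_ideas(ideas, gpu_hour_cap=8.0, min_novelty=0.6):
    """Keep only ideas that fit the pilot budget and clear a novelty bar."""
    survivors = []
    for idea in ideas:
        if idea["est_gpu_hours"] > gpu_hour_cap:
            continue  # too expensive for a pilot experiment
        if idea["novelty"] < min_novelty:
            continue  # likely already done; fails the novelty-check gate
        survivors.append(idea)
    return survivors
```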
```mermaid
flowchart TD
    A["research-lit"] --> B["idea-creator"]
    B --> C["novelty-check"]
    C --> D["research-review"]
    D --> E["IDEA_REPORT.md"]
    E --> F{"human checkpoint"}
    F -->|approve| G["next stage"]
    F -->|refine scope| A
    F -->|regenerate ideas| B
```
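The checkpoint's routing logic is simple enough to state directly. This is a hypothetical sketch of the branch, not repo code; the three verdicts mirror the diagram, and the function name is invented.

```python
def checkpoint(verdict: str) -> str:
    """Map a human decision at the IDEA_REPORT.md checkpoint to the next stage."""
    routes = {
        "approve": "next stage",
        "refine scope": "research-lit",      # loop back to literature review
        "regenerate ideas": "idea-creator",  # keep the scope, redo brainstorming
    }
    if verdict not in routes:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return routes[verdict]
```

The interesting design choice is that the two rejection paths re-enter the loop at different depths: a scope problem restarts the literature pass, while a weak idea only restarts brainstorming.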
Workflow 2: auto review loop
This is probably the most distinctive part of ARIS. The auto-review-loop skill turns criticism into an explicit procedure:
- get an external review
- parse score, verdict, and minimum fixes
- implement the fixes
- run or monitor experiments if needed
- document the round
- repeat up to a capped number of times
What makes this more than “ask the LLM again” is the operational detail. The skill defines MAX_ROUNDS, a stop threshold, a REVIEW_STATE.json for recovery, and a requirement to store reviewer responses verbatim. It explicitly forbids pretending fixes were made when they were not. This is the kind of detail that only appears when someone has tried to keep a long-running agent loop from drifting or silently failing.
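The loop the skill describes can be sketched as a small driver. Everything here is illustrative: `MAX_ROUNDS` and the threshold are invented values, and `get_review` / `apply_fixes` stand in for the Codex MCP critic and Claude Code's fixes. What the sketch preserves is the operational shape: a hard round cap, a score-based stop, verbatim storage of reviewer output, and a state file written after every round so a crash or context compaction can resume mid-run.

```python
import json
from pathlib import Path

MAX_ROUNDS = 5        # hypothetical cap; the skill's actual value may differ
STOP_THRESHOLD = 8.0  # stop once the external reviewer's score reaches this

def review_loop(get_review, apply_fixes, state_path: Path) -> dict:
    """Run the capped review loop, persisting state so a crash can resume."""
    state = (json.loads(state_path.read_text()) if state_path.exists()
             else {"round": 0, "score": 0.0, "history": []})
    while state["round"] < MAX_ROUNDS and state["score"] < STOP_THRESHOLD:
        review = get_review()                        # external critic's response
        state["history"].append(review["verbatim"])  # store the review verbatim
        apply_fixes(review["minimum_fixes"])         # actually do the fixes
        state["score"] = review["score"]
        state["round"] += 1
        state_path.write_text(json.dumps(state))     # checkpoint after every round
    return state
```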
```mermaid
flowchart TD
    A["research-review via Codex MCP"] --> B["parse score and weaknesses"]
    B --> C["implement fixes"]
    C --> D["run-experiment"]
    D --> E["monitor-experiment"]
    E --> F["append AUTO_REVIEW.md"]
    F --> G["update REVIEW_STATE.json"]
    G --> H{"score >= threshold or max rounds?"}
    H -->|no| A
    H -->|yes| I["stop and summarize"]
```
Workflow 3: paper writing
The paper-writing stack takes a NARRATIVE_REPORT.md and pushes it toward a paper directory with LaTeX source and compiled PDF. The flow is:
paper-plan → paper-figure → paper-write → paper-compile → auto-paper-improvement-loop
This part of the repo is less flashy than the autonomous-review pitch, but arguably more practical. It includes venue templates, bibliography cleanup, page-count checks, figure generation, and a second review loop focused on writing quality and formatting compliance. It is built by someone who clearly understands that “write the paper” is not one task but a stack of brittle subtasks.
```mermaid
flowchart TD
    A["NARRATIVE_REPORT.md"] --> B["paper-plan"]
    B --> C["PAPER_PLAN.md"]
    C --> D["paper-figure"]
    D --> E["figures/"]
    E --> F["paper-write"]
    F --> G["paper/ LaTeX source"]
    G --> H["paper-compile"]
    H --> I["main.pdf"]
    I --> J["auto-paper-improvement-loop"]
    J --> K["revised PDF + log"]
```
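One of those brittle subtasks, bibliography cleanup, reduces to checks like this hypothetical duplicate-key scan. The real skill expresses it as prompt instructions rather than Python, and the regex here is a simplification of full BibTeX parsing.

```python
import re

def duplicate_bib_keys(bibtex: str) -> list[str]:
    """Return citation keys that appear more than once in a .bib file."""
    keys = re.findall(r"@\w+\{([^,\s]+),", bibtex)
    seen, dups = set(), []
    for k in keys:
        if k in seen and k not in dups:
            dups.append(k)  # report each duplicated key once
        seen.add(k)
    return dups
```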
Why The Repo Feels More Serious Than Most Agent Demos
A lot of agent repositories describe a large ambition and then mostly provide search wrappers or broad prompts. ARIS feels more serious because it contains operational constraints almost everywhere.
Some of the clearest examples are:
- idea-discovery sets pilot budgets, timeouts, and total GPU-hour caps
- auto-review-loop persists state so context compaction does not destroy long runs
- research-pipeline forces a human checkpoint before committing to an idea
- paper-writing includes explicit checkpoints between plan, figure, writing, compilation, and improvement
This is what makes the project feel like workflow engineering rather than prompt ornamentation. The repository is constantly trying to convert vague research optimism into explicit rules about when to stop, when to wait, when to escalate, and when to spend compute.
What I Find Most Practical
The most practical design choice is that the repo does not actually trust full autonomy as much as the slogan suggests. ARIS keeps adding ways to reintroduce human control. The AUTO_PROCEED option makes that explicit: users can let the workflows continue automatically, or they can require explicit approval at key steps.
That is a good tradeoff. In real research work, the expensive mistakes are often not local coding bugs but narrative pivots, evaluation choices, and compute commitments. The repo seems to understand that.
Another practical strength is graceful degradation. The literature skill becomes richer if Zotero and Obsidian are configured, but it still functions without them. Feishu notifications are optional and default-off. The README also advertises alternative model combinations rather than hard-binding the whole system to one exact API setup. This makes the repository feel built for messy real environments instead of polished demo conditions.
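The degradation order amounts to a priority list with a guaranteed floor. A minimal sketch, assuming the availability checks are simple path tests (the real research-lit skill describes the priority in prose, and the function name is invented):

```python
from pathlib import Path

def available_sources(zotero_dir=None, obsidian_vault=None, pdf_dir=None):
    """Return literature sources in priority order, skipping unconfigured ones."""
    sources = []
    if zotero_dir and Path(zotero_dir).exists():
        sources.append("zotero")
    if obsidian_vault and Path(obsidian_vault).exists():
        sources.append("obsidian")
    if pdf_dir and Path(pdf_dir).exists():
        sources.append("local-pdfs")
    sources.append("web")  # always-available fallback, never removed
    return sources
```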
What Is Especially Clever
The cleverest part of the project is not any single skill. It is the way the repository treats research artifacts as interfaces.
IDEA_REPORT.md, AUTO_REVIEW.md, REVIEW_STATE.json, PAPER_PLAN.md, and NARRATIVE_REPORT.md are not just outputs. They are handoff objects between skills. That means the workflow is not only model-to-model, but file-to-file. The repo keeps externalizing state so that the next skill has something concrete to operate on.
This is a better design than relying entirely on hidden conversational context. Once a workflow becomes long or expensive, explicit artifacts matter more than a long prompt ever will.
Limitations
The repository is strong as outer-loop automation and weaker as inner-loop scientific judgment. It can organize literature, enforce critique, run experiments, rewrite narratives, and package output into a paper. It cannot guarantee that the underlying idea is important, that the benchmarks are the right ones, or that the resulting narrative is intellectually honest rather than merely polished.
It is also a better fit for empirical ML projects than for research that depends heavily on tacit lab practice, unusual infrastructure, or deep theoretical invention. ARIS assumes that enough of the work can be represented as files, logs, scripts, prompts, and reports. That is often true, but not universally true.
There is also an unavoidable risk that the paper-writing pipeline lowers the cost of making average work look submission-ready. The repo partly counters this with adversarial review and claim-killing, but the risk does not disappear just because the workflow is well designed.
Takeaways
I think ARIS is most valuable when read as a statement about research workflow design, not as proof that autonomous research is solved. The repository shows how much the surrounding scaffolding matters: role separation, explicit checkpoints, file-based state, budget controls, and iterative criticism may count for more than any single frontier-model capability.
The slogan is “do research while you sleep,” but the real contribution is more grounded. ARIS is a structured library for making research workflows explicit enough that another agent can inhabit them, push on them, and sometimes improve them.
