The Problem

You have AI.
Why are you still glued to the screen?

Copilot, Cursor, Claude Code. The tools are here. Somehow, your Wednesday still looked like this:

[Figure: Your Wednesday, 9:00 to 18:00. Actual productive work: ~48 min. Timeline segments in order: plan, guide, it quit, babysit, it quit, restart, work, babysit, plan, babysit, work, babysit. Legend: planning + guiding, babysitting AI, your actual work.]

Even with the best AI tools, you still got only ~48 minutes of real work done.

The Problem

You’ve seen all three this week

AI can code.
Enterprise execution is the part that breaks.

The failure usually starts after the code is written — internal data, private packages, remote GPUs, auth, queues, logs. The moment the task leaves your laptop, the agent needs company-specific moves it doesn’t know how to make.

claude-code 14:22

wrote PySpark query for /data/prod/events

ran on local sample fixtures

hdfs dfs -ls /data/prod/events

-bash: hdfs: command not found

Please install the HDFS client and ensure it is on your PATH.

— session ended —

QUIT

Case A - Tue: prod HDFS is firewalled off from laptops; ssh to the edge node and run kinit.

claude-code 11:41

refactored feature pipeline

added acme-feature-store==0.4.1

pip install -r requirements.txt

ERROR: No matching distribution found for acme-feature-store

The package does not exist on PyPI.

— session ended —

QUIT

Case B - Wed: internal packages live on the company mirror; pip needs --index-url, not public PyPI.

claude-code 16:45

wrote train.py + unit tests

dataset shape verified

python train.py --epochs 10

RuntimeError: No CUDA GPUs are available

torch.cuda.is_available() == False

Please run on a GPU host.

— session ended —

QUIT

Case C - Thu: no GPU on the laptop; submit to the cluster with pyflyte run --remote, don’t run locally.

Different walls. Same ending: the agent stops, and you become the runtime.

The Solution

What “retry” actually means

Same wall. Agent vs. Engineer

Take Case A. The agent stops at “command not found.” The engineer treats that as a clue, not a conclusion — and finds the machine where the command exists.

[Figure: a two-row timeline comparing an engineer and an AI agent, both attempting to read /data/prod/events. The engineer's row shows seven events across fourteen minutes: try, search docs, read wiki, new plan, verify, scp and run, done. The agent's row has only two events at minute zero (try, quit) and is empty for the rest of the timeline. Top: engineer, 7 moves, 14 min, done. Bottom: AI agent, 1 move, 0:30, no second move.]

The difference isn’t intelligence. It’s persistence after the first failed move. Retry doesn’t mean running the broken command twice. It means learning why it broke, then trying a different path.
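To make that distinction concrete, here is a minimal Python sketch built from Case A. Everything in it is an assumption for illustration: the `Attempt` type, the `RECOVERIES` table, and the `next_move` helper are hypothetical names, and the single rerouting rule (wrap the command in `ssh edge-01`) only mirrors the edge-node move described above. It is not Super Team's actual recovery logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    command: str
    exit_code: int
    stderr: str


# Hypothetical mapping from a failure signature to a *different* route.
# Each entry rewrites the command instead of repeating it unchanged.
RECOVERIES: list[tuple[str, Callable[[str], str]]] = [
    # Tool missing locally: run the same command on a host that has it.
    ("command not found", lambda cmd: f"ssh edge-01 '{cmd}'"),
]


def next_move(failed: Attempt) -> Optional[str]:
    """Return a new command to try, or None if no known route applies."""
    for signature, reroute in RECOVERIES:
        if signature in failed.stderr:
            candidate = reroute(failed.command)
            # Never propose the exact command that just failed.
            if candidate != failed.command:
                return candidate
    return None


failed = Attempt(
    command="hdfs dfs -ls /data/prod/events",
    exit_code=127,
    stderr="-bash: hdfs: command not found",
)
print(next_move(failed))  # ssh edge-01 'hdfs dfs -ls /data/prod/events'
```

The point of the sketch is the shape of the loop: the error text is an input to the next decision, and a failure with no known route returns `None` so it can be escalated instead of silently repeated.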

The Approach

The move that makes all of this go away

Stop saying “next step.”
Start saying “end state.”

Today

You describe the path.

One bad assumption breaks the chain. You step back in.

you say: run this, then this, then this…

Super Team

You describe the destination.

The team plans a route, hits a blocked road, recalculates, and keeps going until the destination checks pass.

Key Innovation

Done is not a status message.
It is a gate.

You describe the outcome in English. The PM turns it into checks — shell commands, status queries, log scans, artifact assertions. The team can keep working, but it cannot declare victory unless the checks pass.

# end-state.yaml: auto-generated by the PM from your goal
checks:
  # Case A: runs on the edge node, lands data in HDFS
  - exec: ssh edge-01 'hdfs dfs -test -e /out/features/_SUCCESS'  # marker exists, exit 0

  # Case B: the new feature actually runs end-to-end (proves deps resolved)
  - exec: pytest tests/test_feature_pipeline.py -q  # exit 0

  # Case C: the agent must produce ${EXEC_ID} of a Flyte run on the cluster
  - exec: flytectl get execution ${EXEC_ID} -o json | jq -e '.phase=="SUCCEEDED"'  # exit 0
  - exec: flytectl get logs ${EXEC_ID} | grep -q 'Training complete'  # success signal
  - exec: "! flytectl get logs ${EXEC_ID} | grep -qE 'ERROR|Traceback'"  # no errors
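The gate idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual Super Team runner: `CheckResult`, `run_gate`, and `is_done` are hypothetical names, the checks are assumed to be shell command strings like the `exec:` entries above, and the demo uses the POSIX `true` and `false` commands as stand-ins so it runs without a cluster.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    command: str
    passed: bool


def run_gate(commands: list[str]) -> list[CheckResult]:
    """Run every check in a shell; a check passes only on exit code 0."""
    results = []
    for command in commands:
        completed = subprocess.run(command, shell=True, capture_output=True)
        results.append(CheckResult(command, completed.returncode == 0))
    return results


def is_done(results: list[CheckResult]) -> bool:
    """'Done' is the conjunction of all checks, never a self-report."""
    return len(results) > 0 and all(r.passed for r in results)


# Stand-in commands so the sketch runs on any POSIX shell; real checks
# would be the exec entries from end-state.yaml.
results = run_gate(["true", "false"])
for r in results:
    print("PASS" if r.passed else "FAIL", r.command)
print("done:", is_done(results))  # done: False
```

An empty check list deliberately counts as not done, so a run cannot pass the gate by having nothing to verify.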

Why Current Approaches Fail

a single agent in a loop

You said: don’t stop until done.
It heard: find a reason to stop.

➀ buried

“…what were we building again? (scroll up 40,000 tokens… )”

the original ask is at turn 3 of 86.
it doesn’t scroll back.

➁ anxious

“Approaching context limit. Let me prioritize wrapping up the core functionality.”

no one asked it to wrap up.
compaction already ran. it’s still anxious.

➂ self-judge

“Looks correct to me. Marking task as complete.” ✓

a 0.2-second unit test would have caught it.
it didn’t run one.

➃ blame the box

“This appears to be an environmental issue beyond my control.”

typo on line 47.
the “environment” is fine.

➄ negotiates done

“Significant progress has been made on the core functionality.”

1 of 7 contracts passing.
it calls this done.

One head cannot reliably plan, build, recover, and grade itself for hours — that’s why a loop alone doesn’t make an agent persistent.

Introducing Super Team

the answer to the loop trap

Hand off to Super Team.
Wake up with work done.

Super Team isn’t a bigger prompt. It’s a harness system around the model: separate roles, persistent state, objective gates, and a manager that keeps the run alive.

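[Figure: autonomy ladder. Marker on the left: “you are here”; arrow to the right: “this unlocks overnight runs”. Labels in source order: Tab-complete, 30s, Cursor autocomplete, 5 min, Cursor Agent, 20 min, Claude Code, hours, Super Team. Scope axis, left to right: a line, a function, a feature, a small PR, end-to-end delivery.]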

Autocomplete gave us lines. Agents gave us small tasks.
The next leap is delegated delivery — work that continues while you’re gone.

Architecture

You only talk to one.
A whole team handles the rest.

You talk to the PM. The PM turns intent into acceptance criteria. Behind it, specialist sessions research, plan, build, and verify, just as you've been doing manually all week.

You define the outcome. The team handles the route.
If it finishes, you get the PR and the evidence. If it can’t, you get the one question only you can answer. Less babysitting. More delegated work.

Under The Hood

three ideas behind reliable delegation

The harness
is the product.

The same model produces dramatically different outcomes depending on what surrounds it. Context engineering, adversarial evaluation, and compounding memory are what turn a capable model into a reliable system — not a bigger prompt.

01 Context

“Context is the scarcest resource.”

Every token filling the window with irrelevant history is a token stolen from reasoning. Super Team uses progressive disclosure — agents receive exactly what their role requires, nothing accumulated from phases they aren’t part of. The Manager re-reads state files from scratch on every 270-second cycle rather than carrying a growing context across the run.
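A rough sketch of the "re-read state from scratch" idea, in Python. The file layout (one JSON file per work unit with a `status` field) and the names `load_state` and `manager_cycle` are assumptions made for this example; only the 270-second cycle length comes from the text above.

```python
import json
import tempfile
from pathlib import Path

CYCLE_SECONDS = 270  # cycle length stated in the text


def load_state(state_dir: Path) -> dict:
    """Rebuild the manager's view from disk; nothing is carried in memory."""
    state = {}
    for path in sorted(state_dir.glob("*.json")):
        state[path.stem] = json.loads(path.read_text())
    return state


def manager_cycle(state_dir: Path) -> list[str]:
    """One cycle: read fresh state, return the units that still need work."""
    state = load_state(state_dir)
    return [name for name, unit in state.items() if unit.get("status") != "passed"]


# Demo with a throwaway directory; a real run would sleep CYCLE_SECONDS
# between cycles instead of calling the function back to back.
with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp)
    (state_dir / "unit-1.json").write_text(json.dumps({"status": "passed"}))
    (state_dir / "unit-2.json").write_text(json.dumps({"status": "failed"}))
    print(manager_cycle(state_dir))  # ['unit-2']
    # Another process updates the file; the next cycle picks it up from disk.
    (state_dir / "unit-2.json").write_text(json.dumps({"status": "passed"}))
    print(manager_cycle(state_dir))  # []
```

Because each cycle starts from the files, the manager's working context stays the same size no matter how long the run has been going.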

02 Evaluation

“Self-evaluation is inherently lenient.”

A model anchored to its own reasoning approves its own mistakes. The Evaluator reads only the contract and the Generator’s outputs — never the Generator’s thinking. This single design choice makes evaluation adversarial rather than confirmatory, without requiring a different model or extra prompt engineering.
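One way to enforce that separation is at the type level, so the Evaluator's input simply has no field for the Generator's thinking. The sketch below is illustrative only: `GeneratorResult`, `EvaluatorInput`, and `build_evaluator_input` are hypothetical names, and the contract and artifact are made-up examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorResult:
    reasoning: str            # the Generator's private thinking
    outputs: dict[str, str]   # artifacts: file name -> contents


@dataclass(frozen=True)
class EvaluatorInput:
    contract: str
    outputs: dict[str, str]


def build_evaluator_input(contract: str, result: GeneratorResult) -> EvaluatorInput:
    """Pass along only the contract and the artifacts; drop the reasoning."""
    return EvaluatorInput(contract=contract, outputs=dict(result.outputs))


result = GeneratorResult(
    reasoning="I am confident the parser handles empty input.",
    outputs={"parser.py": "def parse(text): return text.split(',')"},
)
evaluator_view = build_evaluator_input("parse('') must return []", result)
print(hasattr(evaluator_view, "reasoning"))  # False
```

In this made-up example the Generator's confidence is misplaced (`''.split(',')` returns `['']`, not `[]`), which is exactly the kind of claim an evaluator should test against the contract instead of reading.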

03 Freshness

“Accumulated context leads to drift.”

Context that grows across a long run degrades reliability. Generator and Evaluator pairs are fresh per increment — when a unit fails, only that unit restarts, not the whole pipeline. The frozen contract substitutes for context: it carries exactly what a new agent needs to reproduce prior work or judge it honestly.
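The restart behavior described above can be sketched as a loop that builds a new Generator and Evaluator for every attempt and retries only the unit that failed. The function name `run_units`, the factory arguments, and the `max_attempts` limit are assumptions for this example, and the "agents" are plain Python callables standing in for model sessions.

```python
def run_units(contracts, make_generator, make_evaluator, max_attempts=3):
    """Run each unit with a fresh pair per attempt; retry only the unit that fails."""
    finished = {}
    for contract in contracts:
        for attempt in range(1, max_attempts + 1):
            generator = make_generator()  # fresh: no history from earlier units
            evaluator = make_evaluator()
            output = generator(contract)  # the contract is the only input
            if evaluator(contract, output):
                finished[contract] = attempt
                break
        else:
            raise RuntimeError(f"unit failed after {max_attempts} attempts: {contract}")
    return finished


calls = {"count": 0}


def make_generator():
    def generate(contract):
        calls["count"] += 1
        # Simulate one bad attempt: the second call overall returns a wrong output.
        return "wrong" if calls["count"] == 2 else f"ok:{contract}"
    return generate


def make_evaluator():
    return lambda contract, output: output == f"ok:{contract}"


print(run_units(["unit-a", "unit-b", "unit-c"], make_generator, make_evaluator))
# {'unit-a': 1, 'unit-b': 2, 'unit-c': 1}
```

Only `unit-b` is attempted twice; `unit-a` is never redone and `unit-c` starts clean, which is the isolation the paragraph describes.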

Reference — Anthropic Engineering: Harness Design for Long-Running Agent Applications

Global Wiki & Local Warm Start
Inspired by Andrej Karpathy: LLM Wiki

Two tiers of knowledge

~/.superteam/        ← global (all projects)
  index.md           ← the hot cache
  knowledge/
.superteam/          ← local (this project)
  knowledge/
  index.md
  …

The local wiki holds project-specific discoveries — architecture quirks, undocumented APIs, test patterns, integration gotchas. The global wiki travels with you: toolchain tricks, company conventions, reusable gate scripts. The Explorer reads it before touching the codebase. The Curator writes it at the end of every successful session.

How knowledge compounds

Session 1: cold start. Explorer surveys codebase from scratch.
End of session: Curator runs. Findings promoted to ~/.superteam/
Session 2: warm start. Explorer loads global wiki first, surveys only what’s new or missing.
Session N: near-instant context. Patterns, toolchain, conventions already loaded before a line of code is read.

The first session is cold. Every session after that is warmer. The wiki is agent-maintained — written by the Curator, read by the Explorer, and it compounds across every project you run Super Team on. Knowledge that would otherwise be re-derived each time becomes permanent.
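A small Python sketch of how a two-tier wiki could compound across projects. The details are assumptions for illustration: notes are taken to be Markdown files under a `knowledge/` directory in each tier, `load_wiki` and `promote` are hypothetical helpers for the Explorer and Curator steps, and the rule that a local note overrides a global one with the same name is a design choice of this sketch, not something the text specifies.

```python
import tempfile
from pathlib import Path


def load_wiki(global_root: Path, local_root: Path) -> dict[str, str]:
    """Read global notes first, then local ones; local wins on a name clash."""
    notes = {}
    for root in (global_root, local_root):
        for path in sorted((root / "knowledge").glob("*.md")):
            notes[path.name] = path.read_text()
    return notes


def promote(local_root: Path, global_root: Path, name: str) -> None:
    """Curator step: copy one local finding into the global tier."""
    target = global_root / "knowledge"
    target.mkdir(parents=True, exist_ok=True)
    (target / name).write_text((local_root / "knowledge" / name).read_text())


with tempfile.TemporaryDirectory() as tmp:
    global_root = Path(tmp) / "home" / ".superteam"
    project_a = Path(tmp) / "project-a" / ".superteam"
    project_b = Path(tmp) / "project-b" / ".superteam"
    for root in (global_root, project_a, project_b):
        (root / "knowledge").mkdir(parents=True)

    # Session 1 (cold): project A discovers something reusable.
    note = project_a / "knowledge" / "pip-mirror.md"
    note.write_text("Use --index-url for internal packages.")
    promote(project_a, global_root, "pip-mirror.md")

    # Session 2 (warm): project B starts with that note already available.
    print(sorted(load_wiki(global_root, project_b)))  # ['pip-mirror.md']
```

Project B never saw project A's codebase, yet it starts with the finding loaded, which is the warm start the section describes.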

PM Workflow & Incremental Delivery

01 — init

You describe the outcome

One sentence or a paragraph. No plan required — the team figures out the route.

only time you type

02 — classify

PM asks until confident

Targeted questions grounded in codebase reality — scope, edge cases, integration points — until the spec has no ambiguity left.

03 — gate

Hard gates before code

Acceptance scripts are written and reviewed before a line of implementation. “Done” is checkable, not a feeling.

you review & approve

04 — execute

Incremental, fresh pairs

Each work unit gets a new Generator and Evaluator. Failure is isolated, not cascading. Context never accumulates across units.

fully autonomous

The thesis, in one sentence

Today’s AI tools stop where the model stops.

The next product frontier is everything around the model: persistence, recovery, memory, orchestration, and verification.

Super Team treats that frontier as a systems problem. Contracts make “done” checkable. Specialist agents keep roles separate. Shared memory preserves what the run learns. That’s what turns a helpful coding assistant into a team you can delegate to.
