github.com/Crysple/superteam
Super Team
v0.1.0
The Problem

You have AI.
Why are you still glued to the screen?

Copilot, Cursor, Claude Code. The tools are here. Somehow, your Wednesday still looked like this:

YOUR WEDNESDAY, 9:00 to 18:00. Actual productive work: ~48 min.
[Timeline, 9:00 to 18:00, blocks in order: plan, guide, it quit, babysit, it quit, restart, work, babysit, plan, babysit, work, babysit. Legend: planning + guiding, babysitting AI, your actual work.]

You still got only ~48 minutes of real work, even with the best AI tools.

The Problem
You’ve seen all three this week

AI can code.
Enterprise execution is the part that breaks.

The failure usually starts after the code is written — internal data, private packages, remote GPUs, auth, queues, logs. The moment the task leaves your laptop, the agent needs company-specific moves it doesn’t know how to make.

claude-code 14:22
wrote PySpark query for /data/prod/events
ran on local sample fixtures
hdfs dfs -ls /data/prod/events

-bash: hdfs: command not found
  please install the HDFS client and
  ensure it is on your PATH.
— session ended —
QUIT
Case A (Tue): prod HDFS is firewalled off from laptops. The move: ssh to the edge node, kinit.

claude-code 11:41
refactored feature pipeline
added acme-feature-store==0.4.1
pip install -r requirements.txt

ERROR: No matching distribution found for acme-feature-store
  The package does not exist on PyPI.
— session ended —
QUIT
Case B (Wed): internal packages live on the company mirror. pip needs --index-url, not public PyPI.

claude-code 16:45
wrote train.py + unit tests
dataset shape verified
python train.py --epochs 10

RuntimeError: No CUDA GPUs are available
   torch.cuda.is_available() == False
  Please run on a GPU host.
— session ended —
QUIT
Case C (Thu): no GPU on the laptop. Submit to the cluster with pyflyte run --remote; don’t run locally.

Different walls. Same ending: the agent stops, and you become the runtime.

The Solution
What “retry” actually means

Same wall. Agent vs. Engineer

Take Case A. The agent stops at “command not found.” The engineer treats that as a clue, not a conclusion — and finds the machine where the command exists.

[Figure: a two-row timeline comparing an engineer and an AI agent, both attempting to read /data/prod/events. The engineer's row shows seven events across fourteen minutes: try, search docs, read wiki, new plan, verify, scp + run, done. The agent's row has only two events at minute zero (try, quit) and is empty for the rest of the timeline.]
Top: engineer — 7 moves, 14 min, done. Bottom: AI agent — 1 move, 0:30, no second move.

The difference isn’t intelligence. It’s persistence after the first failed move. Retry doesn’t mean running the broken command twice. It means learning why it broke, then trying a different path.
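One way to picture "learning why it broke" is a lookup from error signature to a different next move. The sketch below is illustrative only: the signatures and moves come from Cases A, B, and C above, while the `Recovery` type, the `RECOVERIES` table, and `next_move` are hypothetical names, not part of Super Team.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recovery:
    signature: str  # substring to look for in the failed command's output
    next_move: str  # a different path to try, not the same command again


# Hypothetical table; the entries mirror Cases A, B, and C.
RECOVERIES = [
    Recovery("hdfs: command not found", "ssh to the edge node, kinit"),
    Recovery("No matching distribution found", "point pip at the company mirror with --index-url"),
    Recovery("No CUDA GPUs are available", "submit to the cluster with pyflyte run --remote"),
]


def next_move(error_output: str) -> Optional[str]:
    """Return a different path for a known failure, or None for an unknown wall."""
    for recovery in RECOVERIES:
        if recovery.signature in error_output:
            return recovery.next_move
    return None  # unknown wall: research it, or ask the human
```

A static table is the simplest possible stand-in. The point is only that a failure produces a new move rather than a stop.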

The Approach
The move that makes all of this go away

Stop saying “next step.”
Start saying “end state.”

Today

You describe the path.

One bad assumption breaks the chain. You step back in.

You say: run this, then this, then this…

Super Team

You describe the destination.

The team plans a route, hits a blocked road, recalculates, and keeps going until the destination checks pass.

Key Innovation

Done is not a status message.
It is a gate.

You describe the outcome in English. The PM turns it into checks — shell commands, status queries, log scans, artifact assertions. The team can keep working, but it cannot declare victory unless the checks pass.

# end-state.yaml — auto-generated by the PM from your goal
checks:
  # Case A: runs on the edge node, lands data in HDFS
  # (-e tests existence; _SUCCESS markers are normally zero-byte, so -s would fail)
  - exec: ssh edge-01 'hdfs dfs -test -e /out/features/_SUCCESS'   # exit 0

  # Case B: the new feature actually runs end-to-end (proves deps resolved)
  - exec: pytest tests/test_feature_pipeline.py -q   # exit 0

  # Case C: the agent must produce ${EXEC_ID} of a Flyte run on the cluster
  - exec: flytectl get execution ${EXEC_ID} -o json | jq -e '.phase=="SUCCEEDED"'   # exit 0
  - exec: flytectl get logs ${EXEC_ID} | grep -q 'Training complete'   # success signal
  # quoted so YAML does not read the leading "!" as a tag
  - exec: "! flytectl get logs ${EXEC_ID} | grep -qE 'ERROR|Traceback'"   # no errors

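To make "done is a gate" concrete, here is a minimal sketch of a gate runner, assuming each check is a shell command that passes only when it exits 0. `CheckResult`, `run_checks`, and `is_done` are hypothetical names for illustration, not Super Team's actual API.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    command: str
    passed: bool


def run_checks(commands: List[str]) -> List[CheckResult]:
    """Run every check through the shell; a check passes only on exit code 0."""
    results = []
    for command in commands:
        completed = subprocess.run(command, shell=True, capture_output=True)
        results.append(CheckResult(command, completed.returncode == 0))
    return results


def is_done(results: List[CheckResult]) -> bool:
    """The gate: done means at least one check exists and every check passed."""
    return bool(results) and all(result.passed for result in results)
```

An empty check list counts as not done, so a run with no gates cannot pass by default.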
Why Current Approaches Fail
a single agent in a loop

You said: don’t stop until done.
It heard: find a reason to stop.

  buried
“…what were we building again?” (scroll up 40,000 tokens…)
the original ask is at turn 3 of 86.
it doesn’t scroll back.
  anxious
“Approaching context limit. Let me prioritize wrapping up the core functionality.”
no one asked it to wrap up.
compaction already ran. it’s still anxious.
  self-judge
“Looks correct to me. Marking task as complete.”
a 0.2-second unit test would have caught it.
it didn’t run one.
 blame the box
“This appears to be an environmental issue beyond my control.”
typo on line 47.
the “environment” is fine.
 negotiates done
“Significant progress has been made on the core functionality.”
1 of 7 contracts passing.
it calls this done.

One head cannot reliably plan, build, recover, and grade itself for hours — that’s why a loop alone doesn’t make an agent persistent.

Introducing Super Team
the answer to the loop trap

Hand off to Super Team.
Wake up with work done.

Super Team isn’t a bigger prompt. It’s a harness system around the model: separate roles, persistent state, objective gates, and a manager that keeps the run alive.

← you are here | this unlocks overnight runs →

Tab-complete: 5s, a line
Cursor autocomplete: 30s, a function
Cursor Agent: 5 min, a feature
Claude Code: 20 min, a small PR
Super Team: hours, end-to-end delivery

Autocomplete gave us lines. Agents gave us small tasks.
The next leap is delegated delivery — work that continues while you’re gone.

Architecture

You only talk to one.
A whole team handles the rest.

You talk to the PM. The PM turns intent into acceptance criteria. Behind it, specialist sessions research, plan, build, verify, just like you've been doing manually all week.

Your side (the only line you touch)
  YOU: the human

The team (Super Team)
  PM: your interface
  ORCHESTRATOR: drives the pipeline
  ARCHITECT: plan + contracts
  MANAGER: stateless monitor
  EXPLORER: research + wiki
  GENERATOR: fresh + writes code
  EVALUATOR: runs hard gates
  CURATOR: session + wiki

You define the outcome. The team handles the route.
If it finishes, you get the PR and the evidence. If it can’t, you get the one question only you can answer. Less babysitting. More delegated work.

Under The Hood
three ideas behind reliable delegation

The harness
is the product.

The same model produces dramatically different outcomes depending on what surrounds it. Context engineering, adversarial evaluation, and compounding memory are what turn a capable model into a reliable system — not a bigger prompt.

01 Context

“Context is the scarcest resource.”

Every token filling the window with irrelevant history is a token stolen from reasoning. Super Team uses progressive disclosure — agents receive exactly what their role requires, nothing accumulated from phases they aren’t part of. The Manager re-reads state files from scratch on every 270-second cycle rather than carrying a growing context across the run.
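A stateless monitoring pass can be sketched as a function that rebuilds its picture from disk on every call. The 270-second interval comes from the text. The rest is an assumption for illustration: one JSON file per work unit with a `status` field, and the names `read_state` and `manager_cycle`.

```python
import json
from pathlib import Path
from typing import Dict, List

CYCLE_SECONDS = 270  # interval stated in the text; a scheduler would sleep this long between passes


def read_state(state_dir: Path) -> Dict[str, dict]:
    """Rebuild the full picture from disk; nothing is carried over from earlier cycles."""
    state = {}
    for path in sorted(state_dir.glob("*.json")):
        state[path.stem] = json.loads(path.read_text())
    return state


def manager_cycle(state_dir: Path) -> List[str]:
    """One monitoring pass: return the names of units that have not passed yet."""
    state = read_state(state_dir)
    return [name for name, unit in state.items() if unit.get("status") != "passed"]
```

Because `manager_cycle` keeps no variables between calls, its input size depends on the state files, not on how long the run has lasted.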

02 Evaluation

“Self-evaluation is inherently lenient.”

A model anchored to its own reasoning approves its own mistakes. The Evaluator reads only the contract and the Generator’s outputs — never the Generator’s thinking. This single design choice makes evaluation adversarial rather than confirmatory, without requiring a different model or extra prompt engineering.
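The isolation described here can be expressed as a type boundary: the Evaluator's input simply has no field for the Generator's reasoning. This is a sketch under assumed names (`GeneratorResult`, `EvaluatorInput`, `build_evaluator_input`), not the project's real data model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GeneratorResult:
    outputs: Dict[str, str]  # artifact name -> content
    reasoning: str           # the Generator's thinking; must never reach the Evaluator


@dataclass(frozen=True)
class EvaluatorInput:
    contract: str
    outputs: Dict[str, str]


def build_evaluator_input(contract: str, result: GeneratorResult) -> EvaluatorInput:
    """Copy only the contract and the outputs; the reasoning field is dropped on purpose."""
    return EvaluatorInput(contract=contract, outputs=dict(result.outputs))
```

Leaving the field out of the type means the reasoning cannot leak across by accident.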

03 Freshness

“Accumulated context leads to drift.”

Context that grows across a long run degrades reliability. Generator and Evaluator pairs are fresh per increment — when a unit fails, only that unit restarts, not the whole pipeline. The frozen contract substitutes for context: it carries exactly what a new agent needs to reproduce prior work or judge it honestly.
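As a rough sketch of "fresh per increment", the loop below builds a new Generator and Evaluator for every attempt and passes in only the frozen contract. The factory parameters, `run_units`, and the `max_attempts` limit are assumptions made for illustration.

```python
from typing import Callable, Dict, List


def run_units(
    units: List[str],
    contracts: Dict[str, str],
    make_generator: Callable[[], Callable[[str], str]],
    make_evaluator: Callable[[], Callable[[str, str], bool]],
    max_attempts: int = 3,
) -> Dict[str, int]:
    """Run each unit with a fresh Generator/Evaluator pair per attempt.

    Returns the number of attempts each unit needed. A failing unit is
    retried alone; units that already passed are never rerun.
    """
    attempts_used = {}
    for unit in units:
        contract = contracts[unit]       # the frozen contract is the only carried context
        for attempt in range(1, max_attempts + 1):
            generate = make_generator()  # new instance, no memory of earlier attempts
            evaluate = make_evaluator()
            output = generate(contract)
            if evaluate(contract, output):
                attempts_used[unit] = attempt
                break
        else:
            raise RuntimeError(f"unit {unit!r} failed {max_attempts} attempts")
    return attempts_used
```

The factories are called inside the retry loop, so a failed attempt cannot pass its context on to the next one.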

Reference — Anthropic Engineering: Harness Design for Long-Running Agent Applications

Global Wiki & Local Warm Start
Inspired by Andrej Karpathy: LLM Wiki

Two tiers of knowledge

~/.superteam/        ← global (all projects)
  index.md           ← the hot cache
  knowledge/
.superteam/          ← local (this project)
  knowledge/
  index.md
  …

The local wiki holds project-specific discoveries — architecture quirks, undocumented APIs, test patterns, integration gotchas. The global wiki travels with you: toolchain tricks, company conventions, reusable gate scripts. The Explorer reads it before touching the codebase. The Curator writes it at the end of every successful session.

How knowledge compounds

Session 1: cold start
  Explorer surveys codebase from scratch
End of session: Curator runs
  Findings promoted to ~/.superteam/
Session 2: warm start
  Explorer loads global wiki first, surveys only what’s new or missing
Session N: near-instant context
  Patterns, toolchain, conventions already loaded before a line of code is read

The first session is cold. Every session after that is warmer. The wiki is agent-maintained — written by the Curator, read by the Explorer, and it compounds across every project you run Super Team on. Knowledge that would otherwise be re-derived each time becomes permanent.
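The warm-start idea can be sketched as set arithmetic over two wiki directories: whatever either tier already knows, the Explorer skips. The one-Markdown-file-per-topic layout under `knowledge/` and the function names below are assumptions for illustration. The text specifies only the two directories and their `index.md` and `knowledge/` entries.

```python
from pathlib import Path
from typing import Iterable, List, Set


def known_topics(wiki_dir: Path) -> Set[str]:
    """Topics a wiki tier already covers (assumed layout: knowledge/<topic>.md)."""
    knowledge = wiki_dir / "knowledge"
    if not knowledge.is_dir():
        return set()
    return {path.stem for path in knowledge.glob("*.md")}


def topics_to_survey(needed: Iterable[str], local_dir: Path, global_dir: Path) -> List[str]:
    """Explorer side: survey only what neither tier knows yet."""
    known = known_topics(global_dir) | known_topics(local_dir)
    return sorted(set(needed) - known)


def promote(topic: str, local_dir: Path, global_dir: Path) -> None:
    """Curator side: copy a local finding into the global wiki."""
    source = local_dir / "knowledge" / f"{topic}.md"
    target = global_dir / "knowledge" / f"{topic}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(source.read_text())
```

After `promote` runs, a brand-new project with an empty local wiki already skips the promoted topic.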

PM Workflow & Incremental Delivery
01 — init
You describe the outcome

One sentence or a paragraph. No plan required — the team figures out the route.

only time you type
02 — classify
PM asks until confident

Targeted questions grounded in codebase reality — scope, edge cases, integration points — until the spec has no ambiguity left.

03 — gate
Hard gates before code

Acceptance scripts are written and reviewed before a line of implementation. “Done” is checkable, not a feeling.

you review & approve
04 — execute
Incremental, fresh pairs

Each work unit gets a new Generator and Evaluator. Failure is isolated, not cascading. Context never accumulates across units.

fully autonomous
The thesis, in one sentence

Today’s AI tools stop where the model stops.

The next product frontier is everything around the model: persistence, recovery, memory, orchestration, and verification.

Super Team treats that frontier as a systems problem. Contracts make “done” checkable. Specialist agents keep roles separate. Shared memory preserves what the run learns. That’s what turns a helpful coding assistant into a team you can delegate to.