Dark Software Factory

A dark software factory is a non-interactive software development workflow where coding agents write, test, and maintain both code and specifications with no human writing or reviewing the code. Humans express intent through prompts, feature requests, bug reports, and acceptance criteria, and shape agent behavior through harness engineering.

Details

The approach inverts the traditional human-in-the-loop model: instead of authoring specifications and reviewing code, humans express intent through prompts and feature requests, provide acceptance criteria, and verify outcomes. Agents translate that intent into formal specifications and maintain them alongside the code. The shift reframes software quality from "does the code look correct?" to "does the system satisfy its constraints and pass behavioral validation?"

The repository is the single source of truth for everything agents need: specifications, architectural decisions, quality principles, and agent instructions all live as markdown files alongside the code. Specifications are agent-maintained artifacts that co-evolve with the implementation - a feature request or bug report triggers agents to update both the specification and the code. No external wikis, no tribal knowledge, no context that is not checked in - agents operate from a checkout alone. This reduces the entire development surface to text files manipulated through filesystem operations, which is tractable for current LLMs and compatible with standard version control tooling.
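A structural test can make the co-location rule mechanical. The sketch below assumes a hypothetical layout in which each feature directory carries a spec.md next to its implementation; the directory name and file convention are illustrative, not prescribed by the approach.

```python
# Hypothetical structural test (pytest-style): every feature package must
# carry a co-located spec.md, so specification and implementation change in
# the same commits.
from pathlib import Path

FEATURES_DIR = Path("features")  # illustrative layout convention

def test_every_feature_has_a_spec():
    missing = [
        str(feature)
        for feature in sorted(FEATURES_DIR.iterdir())
        if feature.is_dir() and not (feature / "spec.md").is_file()
    ]
    assert not missing, f"features missing a co-located spec.md: {missing}"
```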

Git provides the coordination layer. Branches isolate concurrent agent work, each agent operates on a full repository checkout inside its own sandbox, pull requests define the review and merge boundary, and commit history provides full auditability of every agent decision.
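The isolation mechanism can be sketched at the git layer alone, assuming plain worktrees stand in for sandbox provisioning; branch names and paths below are illustrative.

```python
# Sketch of per-task isolation at the git layer: each agent task gets its own
# branch and full working copy. Real sandboxes add filesystem and network
# isolation on top; names and paths here are illustrative.
import subprocess
from pathlib import Path

def provision_agent_checkout(repo: Path, task_id: str, base: str = "main") -> Path:
    branch = f"agent/{task_id}"
    workdir = repo.parent / f"sandbox-{task_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(workdir), base],
        check=True,
    )
    return workdir
```

Tearing the worktree down after the pull request merges keeps each sandbox ephemeral.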

Application legibility enables agents to validate outcomes directly. The application is made bootable so each agent can run it inside its sandbox. Browser automation (Chrome DevTools Protocol) lets agents drive the UI, reproduce bugs, and verify fixes. A local observability stack exposes logs, metrics, and traces to agent queries, making prompts like "ensure no span in these critical user journeys exceeds two seconds" tractable. Each sandbox's observability data is ephemeral and torn down after the task completes.
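An agent-side check of that prompt might look like the sketch below, assuming a local Prometheus-compatible endpoint and a hypothetical span_duration_seconds histogram carrying a journey label; none of these names come from the source.

```python
# Sketch of an agent-side observability check against a local, ephemeral
# Prometheus-compatible endpoint. The metric name and "journey" label are
# hypothetical.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090/api/v1/query"

def max_p99_span_seconds(journey: str) -> float:
    # p99 span latency for one user journey over the last 5 minutes
    query = (
        'histogram_quantile(0.99, sum by (le) ('
        f'rate(span_duration_seconds_bucket{{journey="{journey}"}}[5m])))'
    )
    url = f"{PROM_URL}?{urllib.parse.urlencode({'query': query})}"
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

assert max_p99_span_seconds("checkout") <= 2.0, "critical journey exceeds 2s budget"
```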

Because agents write both implementation and tests, conventional test suites alone cannot serve as the primary quality signal - an agent that writes assert true passes its own tests. Dark software factories address this through harness engineering: mechanical enforcement via custom linters, structural tests, and dependency-direction rules that prevent classes of error deterministically, without relying on the agent to remember rules. The harness is self-improving - when an agent struggles with a task, the team treats the failure as a signal that some capability is missing (a tool, a guardrail, documentation, or an abstraction) and feeds it back into the environment, typically by having the agent itself write the fix.
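A dependency-direction rule, for example, reduces to a small custom linter. The sketch below assumes a hypothetical three-layer Python codebase; the layer names and source root are illustrative.

```python
# Hypothetical dependency-direction linter: lower layers must never import
# from higher ones. The rule fails deterministically, whether or not the
# agent remembered it.
import ast
import sys
from pathlib import Path

LAYER_ORDER = ["domain", "services", "api"]  # illustrative, lowest to highest

def imported_modules(path: Path) -> set[str]:
    names: set[str] = set()
    for node in ast.walk(ast.parse(path.read_text())):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names

def check_layering(src_root: Path) -> list[str]:
    violations = []
    for rank, layer in enumerate(LAYER_ORDER):
        for path in (src_root / layer).rglob("*.py"):
            for module in imported_modules(path):
                top = module.split(".")[0]
                if top in LAYER_ORDER and LAYER_ORDER.index(top) > rank:
                    violations.append(f"{path}: {layer} may not import from {top}")
    return violations

if __name__ == "__main__":
    problems = check_layering(Path("src"))  # hypothetical source root
    print("\n".join(problems) or "layering OK")
    sys.exit(1 if problems else 0)
```

Run in CI, a check like this turns an architectural convention into a deterministic gate rather than a rule an agent must recall.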

Agent-to-agent review replaces human code review. An agent implements a change, reviews its own work locally, requests additional agent reviews, responds to feedback, and iterates until all reviewers are satisfied. Agents can squash and merge their own pull requests. Humans may review but are not required to - escalation to a human occurs only when judgment is needed. This enables single-prompt end-to-end feature development: given a task, the agent can implement, validate, open a PR, handle review feedback, detect and remediate build failures, and merge.
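The loop itself is simple control flow. In the hedged sketch below, Implementer, Reviewer, and PullRequest are placeholders for whatever harness API a team actually exposes.

```python
# Hedged sketch of the agent-to-agent review loop; all interfaces here are
# placeholders, not a real harness API.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Feedback:
    approved: bool
    comments: list[str]

class PullRequest(Protocol):
    def squash_and_merge(self) -> None: ...

class Implementer(Protocol):
    def implement(self, task: str) -> PullRequest: ...
    def self_review(self, pr: PullRequest) -> None: ...
    def address(self, pr: PullRequest, feedback: list[Feedback]) -> None: ...

class Reviewer(Protocol):
    def review(self, pr: PullRequest) -> Feedback: ...

MAX_ROUNDS = 5  # illustrative escalation threshold

def run_task(task, implementer, reviewers, escalate_to_human):
    pr = implementer.implement(task)       # code + spec changes, PR opened
    implementer.self_review(pr)            # local pass before requesting reviews
    for _ in range(MAX_ROUNDS):
        feedback = [r.review(pr) for r in reviewers]
        if all(f.approved for f in feedback):
            pr.squash_and_merge()          # agents may merge their own PRs
            return pr
        implementer.address(pr, feedback)  # respond to feedback and iterate
    return escalate_to_human(pr)           # judgment needed: hand off to a human
```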

Continuous refactoring

Agent-generated code drifts over time. Agents replicate patterns already present in the repository - including suboptimal ones - leading to entropy accumulation. Naming conventions diverge, deprecated patterns spread, and architectural boundaries erode. Manual cleanup does not scale - one team reported spending 20% of engineering time on weekly cleanup before automating the process.

The alternative is encoding opinionated, mechanical rules ("golden principles") directly into the repository and running background agent tasks that scan for deviations on a regular cadence. These agents open small, focused refactoring pull requests - each reviewable in under a minute and often eligible for automerge. Quality grades per product domain and architectural layer track gaps over time, giving both agents and humans visibility into where drift is accumulating. Human taste is captured once in these principles and then enforced continuously on every line of code, replacing periodic manual cleanup with systematic maintenance.
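One such background pass, reduced to a minimal sketch: the deprecated call, its replacement, and the branch naming below are invented for illustration.

```python
# Hypothetical golden-principle pass: detect one mechanical deviation and
# prepare a small, single-file branch per fix so each PR reviews in under a
# minute. The deprecated call, replacement, and branch names are invented.
import re
import subprocess
from pathlib import Path

DEPRECATED = re.compile(r"\blegacy_log\(")  # the encoded principle
REPLACEMENT = "structured_log("

def scan_and_fix(src_root: Path) -> None:
    for path in src_root.rglob("*.py"):
        text = path.read_text()
        if not DEPRECATED.search(text):
            continue
        subprocess.run(
            ["git", "switch", "-c", f"refactor/logging-{path.stem}"], check=True
        )
        path.write_text(DEPRECATED.sub(REPLACEMENT, text))
        subprocess.run(
            ["git", "commit", "-am", f"refactor: apply logging principle to {path}"],
            check=True,
        )
        subprocess.run(["git", "switch", "-"], check=True)  # back for the next file
        # a harness step would now open the PR, often eligible for automerge
```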

This treats technical debt as a continuous maintenance concern rather than a periodic cleanup project. Small, frequent corrections prevent bad patterns from compounding across the codebase over days or weeks. The economics favor this model when agent throughput is high: the cost of a background refactoring run is low relative to the cost of a large manual cleanup, and each correction captured as a golden principle applies to all future code automatically. This differs from cognitive debt, which describes erosion of human understanding - here the concern is codebase-level entropy that can be addressed mechanically.
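The break-even logic can be made explicit with toy numbers; every figure in the sketch below is an assumption rather than a measurement.

```python
# Illustrative break-even arithmetic for continuous vs. periodic cleanup.
# Every figure here is an assumption made for the sake of the comparison.
team_size = 5
weekly_manual_cleanup_hours = 0.20 * team_size * 40  # 20% of a 40h week, per person
principle_authoring_hours = 4.0                      # one-time cost to encode a rule
weekly_automation_overhead_hours = 2.0               # reviewing PRs, tending the bot

weeks_to_break_even = principle_authoring_hours / (
    weekly_manual_cleanup_hours - weekly_automation_overhead_hours
)
print(f"break-even after ~{weeks_to_break_even:.2f} weeks")  # ~0.11 with these numbers
```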

Dark software factories depend on cloud coding agents operating in sandboxed environments, often as multi-agent systems with distinct agent roles - implementation, review, security scanning, and structural maintenance - each with different tools, permissions, and validation criteria. Token costs are substantial. Beyond inference cost, the approach requires significant upfront investment in harness infrastructure. This makes the approach economically viable only when the value of the software produced justifies both the ongoing inference cost and the initial harness construction cost.
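The role separation can be captured declaratively; the roles and tool names below are hypothetical.

```python
# Hypothetical role definitions: each agent role carries only the tools and
# permissions its job requires. Role and tool names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    tools: frozenset[str]
    can_merge: bool = False

ROLES = (
    AgentRole("implementer", frozenset({"edit_files", "run_tests", "open_pr"}), can_merge=True),
    AgentRole("reviewer", frozenset({"read_files", "run_tests", "comment_pr"})),
    AgentRole("security_scanner", frozenset({"read_files", "run_sast"})),
    AgentRole("structural_maintainer", frozenset({"edit_files", "run_linters", "open_pr"})),
)
```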

Examples

  • A three-engineer team produces roughly one million lines of code in five months with zero manually written lines, averaging 3.5 PRs per engineer per day, with throughput increasing as the team grows to seven engineers.
  • A short AGENTS.md serves as the repository's table of contents, pointing to deeper specification sources - design documents, execution plans, architecture maps, and quality scores - enabling progressive context disclosure.
  • Agents boot the application in their sandbox, drive it via Chrome DevTools Protocol to reproduce bugs and validate fixes, and query a local observability stack (LogQL for logs, PromQL for metrics) to verify performance and reliability requirements.

Counterarguments

  • The approach shifts the hard problem from writing code to building and maintaining the harness. An incomplete or poorly maintained harness produces unreliable output, and building a sufficient harness requires deep engineering judgment about architectural boundaries, enforcement mechanisms, and quality signals. Early evidence suggests months of dedicated effort before agent-driven development becomes reliably productive, and the investment may not generalize - the OpenAI team that demonstrated the approach explicitly cautions that their results "should not be assumed to generalize without similar investment." The difficulty is real but different in kind: harness investment compounds and improves all future agent work, whereas hand-written code does not.
  • Agents writing both implementation and validation creates a systemic monoculture risk: if the underlying LLM has a consistent blind spot, the same flaw can appear in both the implementation and the agent-driven review that validates it, passing all checks while harboring a shared deficiency. Mechanical enforcement (linters, structural tests) partially mitigates this by catching error classes independently of the agent's reasoning, but cannot cover all failure modes.
  • Regulated environments (healthcare, finance, aviation) often mandate human sign-off on code changes. When no human has reviewed the code, accountability and legal liability are unresolved - the organization cannot point to a responsible reviewer, and regulatory frameworks that assume human oversight do not have clear analogs for agent-only development.
  • When code no human has read breaks in production, diagnosing and fixing it is difficult. Delegating debugging to agents creates a trust recursion - the same verification gap that required the dark factory model now applies to incident response. Making observability data directly legible to agents partially mitigates this (agents can query logs, metrics, and traces to diagnose issues), but each layer of agent-driven diagnosis still inherits the reliability limits of the layer it is trying to fix.
  • Human intent passes through two agent-mediated translations: from prompts and feature requests into formal specifications, and from specifications into code. Each translation can lose or distort meaning, and agents may resolve ambiguity in ways that diverge from human intent without detection until outcome verification. Harness rules can only enforce what someone anticipated, so bugs may shift from implementation errors to specification misinterpretations or enforcement gaps, trading one class of defect for another.
  • The entire approach depends on frontier model capabilities remaining at or above current levels at sustainable cost. A harness is designed around the strengths and limitations of specific models; if model capabilities regress, pricing changes dramatically, or the vendor relationship changes, the harness investment may not transfer. The organization has no fallback - it has optimized for a workflow that cannot function with weaker models, creating a strategic dependency on a capability level it does not control.
  • If engineers only build harnesses and review agent output, their ability to implement, debug, and reason about code erodes over time. When the dark factory approach fails for a particular problem - an edge case the harness does not cover, a production incident requiring deep system understanding - there may be nobody left who can do the work manually. The dependency is self-reinforcing: the more work agents handle, the less capable the human fallback becomes.
  • Golden principles for continuous refactoring may be premature for fast-evolving codebases where the "right" pattern has not yet stabilized. Encoding opinionated rules too early locks in conventions that the team may need to revise, turning the refactoring agent into an obstacle rather than an aid.
  • Refactoring agents modifying code that other agents are actively working on creates merge conflicts and can invalidate in-progress work, requiring coordination mechanisms that add complexity to the background automation.
  • The 20% cleanup time cited as motivation for continuous refactoring may shift to a different activity - writing and maintaining golden principles, reviewing refactoring PRs, debugging the automation - rather than being eliminated. The net time savings depend on the ratio of rule-authoring cost to repeated manual cleanup cost.

Confidence

Medium. Demonstrated at scale by OpenAI - a small team shipped a real product with hundreds of internal users and roughly one million lines of agent-generated code in five months. However, the harness construction cost is high and generalizability to other teams and domains is uncertain. The monoculture risk from agent-to-agent review remains structurally unaddressed, and the approach has not been tested in regulated environments or over multi-year maintenance horizons.

External references