The AI Picbreeder Experiment

Best Paper nominee, GECCO 2026.

Sam Earle* NYU

Kai Arulkumaran Sakana AI

Andrew Dai* Independent

Akarsh Kumar* MIT

Julian Togelius NYU

Sebastian Risi Sakana AI

Jul 6

2026

Paper

Code

Data

* Work done during an internship or residency at Sakana AI.

Can AI agents be creative? It's a question on the lips and fingertips of many. Sure, AI is the new medium, but could it ever be the one making the message? We ask whether the agents we use today as tools might be re-purposed to spontaneously speak among themselves. The end result would be a model organism of cultural production.

What should we ask of the agents to make this real? The tension is that we cannot ask for anything at all—it has to be left up to them—yet modern agents demand our asking by design. They are trained, evaluated, and orchestrated as goal-following entities. But creativity elides planning, and the end products of cultural processes are not conceived of at the outset but forged in serendipity. This is true not only of the arts, but of math and science as well, where the discovery of new questions is at least as important as their resolution.

To this end, we envision agents capable of intentional aimless wandering—agents that can surprise themselves, and in response throw their plans and preconceptions out the window in pursuit of the unexpected. To study whether AI systems might have this capacity, we turn to a minimal substrate where we know it to be possible—even necessary. Picbreeder was a website which gained a modest cult following in the 2010s. Here, you might imagine us trying to re-animate it with something like the mechanical ghosts of its past users: replacing humans with large models trained on their collective output, and asking if these models are capable of the same level of creative discovery as we were.

The Picbreeder website — At **picbreeder.org**, users participated in a process of generating collaborative art by interactively evolving images. The images began abstract, but over multiple sessions with different users, genealogies complexified and familiar forms emerged, often by surprise. (Image retrieved via the Internet Archive.) (Click to access an homage to the original site, populated with aggregate results over multiple replays of AI-driven reproductions of the collective human event.)

Before frontier models allowed us to speak into existence a vast distribution of possible images, Picbreeder had users breed images by representing them indirectly. Each image in Picbreeder is encoded as a Compositional Pattern-Producing Network (CPPN), an initially small, protean neural network that takes as input some coordinates in pixel space, and outputs that pixel's color. Because they can take arbitrary continuous coordinates as input, these networks effectively encode infinite-resolution images. Elsewhere, they've been used to represent videos, terrain in game-like environments, and even the weights of larger downstream neural networks with more regular topologies. You can see how coordinate values are turned into color pixels in the animation below:

**Compositional Pattern-Producing Networks (CPPNs).** At every pixel, the x and y coordinates and the radius r from the center are fed into the network, which returns hue, saturation and value (HSV), i.e. a color. In this way, each CPPN encodes an image of arbitrary resolution. Click the input grid to pin the probe in place; press 'r' to reset.
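To make the coordinate-to-color mapping concrete, here is a minimal sketch in plain Python. The network `tiny_cppn` is a hypothetical, hand-written stand-in (its weights and activation functions are ours, not taken from any Picbreeder genome), and the choice to normalize coordinates to [-1, 1] is an assumption. What it illustrates is only the general idea: the same function can be queried on a grid of any density.

```python
import colorsys
import math


def tiny_cppn(x, y, r):
    """Hypothetical CPPN: inputs (x, y, r), two hidden nodes, outputs (h, s, v)."""
    hidden_a = math.sin(3.0 * x + 1.5 * r)               # periodic node
    hidden_b = math.exp(-4.0 * (y * y + r * r))          # gaussian node
    hue = 0.5 * (math.tanh(hidden_a - hidden_b) + 1.0)   # squashed into (0, 1)
    sat = 1.0 / (1.0 + math.exp(-2.0 * hidden_b))        # sigmoid, in (0, 1)
    val = abs(hidden_a)                                   # in [0, 1]
    return hue, sat, val


def render(cppn, width, height):
    """Query the network once per pixel; return rows of 8-bit RGB triples."""
    image = []
    for row in range(height):
        y = 2.0 * (row + 0.5) / height - 1.0     # pixel centre mapped to [-1, 1]
        line = []
        for col in range(width):
            x = 2.0 * (col + 0.5) / width - 1.0
            r = math.hypot(x, y)                 # radius from the image centre
            h, s, v = cppn(x, y, r)
            red, green, blue = colorsys.hsv_to_rgb(h, s, v)
            line.append((round(red * 255), round(green * 255), round(blue * 255)))
        image.append(line)
    return image


# The same network rendered at two resolutions: only the sampling grid changes.
small = render(tiny_cppn, 8, 8)
large = render(tiny_cppn, 64, 64)
```

Because the image lives in the function rather than in a pixel buffer, rendering at a higher resolution needs no upscaling step, only a denser set of queries.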

In practice, we pass all the input coordinates through the network at once. Each node in the network has some grid-shaped activation we can render as a greyscale image, and at the network's output, three such outputs are combined to produce a single color image. In the tool below, you can see how changing the weights or activation functions of various nodes and edges changes the intermediary activations and final output of a CPPN:

DNA editor. Edit a CPPN and view the real-time effect on intermediary and final outputs. Click a node or edge to open its control ring, drag a node to move it (or drag the ⊕ handle on its border onto another node to link the two), or click and drag on an edge to scrub its weight. (See also the analog on the interactive breeding site.)

Playing with the weights for a moment, you get the sense that building target images by building CPPNs from scratch out of nodes and edges would be a thankless endeavor. Instead, in Picbreeder, images are evolved. This is done interactively with the user in the loop: from a grid of random initial CPPNs (which generally look like smooth blobs or gradients), a user selects one or several which they'd like to grow further.

These individuals (each little brain containing a single picture) are then bred together and mutated to produce a new set of candidate images, and the process repeats. Mutation and crossover operations are drawn from the Neuroevolution of Augmenting Topologies (NEAT) algorithm, which, unlike traditional gradient descent, optimizes not only the weights of the network but also its shape or topology, i.e. the layout of its nodes and edges.

Breeding console. Click to select · space to evolve · z to undo · r to reset · p for new parent. (See the full version.)
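As a rough illustration of the two kinds of change NEAT makes, the sketch below shows a weight perturbation and the standard add-node mutation, which splits an existing connection. The `Genome` and `Connection` classes, the noise scale `sigma`, and the example genome are hypothetical; crossover and NEAT's innovation-number bookkeeping are left out for brevity, so this is not the operator set used on the site.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Connection:
    src: int
    dst: int
    weight: float
    enabled: bool = True


@dataclass
class Genome:
    num_nodes: int                                   # nodes are numbered 0..num_nodes-1
    connections: list = field(default_factory=list)


def mutate_weights(genome, rng, sigma=0.5):
    """Parametric mutation: jitter every connection weight with Gaussian noise."""
    for conn in genome.connections:
        conn.weight += rng.gauss(0.0, sigma)


def mutate_add_node(genome, rng):
    """Structural mutation: split one enabled connection with a new node."""
    enabled = [c for c in genome.connections if c.enabled]
    if not enabled:
        return
    old = rng.choice(enabled)
    old.enabled = False
    new_node = genome.num_nodes
    genome.num_nodes += 1
    # Weight 1.0 into the new node and the old weight out of it, so the
    # mutation initially disturbs the network's output as little as possible.
    genome.connections.append(Connection(old.src, new_node, 1.0))
    genome.connections.append(Connection(new_node, old.dst, old.weight))


rng = random.Random(0)
genome = Genome(
    num_nodes=4,  # three inputs (x, y, r) and one output, purely illustrative
    connections=[Connection(0, 3, 0.7), Connection(1, 3, -1.2), Connection(2, 3, 0.3)],
)
mutate_add_node(genome, rng)
mutate_weights(genome, rng)
```

The add-node step is what lets genomes grow over generations: each application adds one node and two connections while disabling the connection it replaces.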

In this way, images are created not by design but by a prolonged process of natural selection. And as the human in this loop, it's nearly impossible to plan for a particular long-term goal: for example to imagine an image in advance then successfully breed it. Not within this interaction scheme, not on any reasonable kind of timescale, at least. But Picbreeder shows us that these kinds of achievements can be attained anyway if the process of interactive evolution is distributed across a large number of users. Here, the artifacts materialize by accident, appearing to the user as objectives-in-retrospect.

In other words, the Picbreeder users played with the system, wandering, and generated art not by conceiving it against marble or canvas, but by catching it like a fish from a stream. In the AI Picbreeder experiment, we ask: can AI catch similar fish? And if it can, how? We can then try to identify which aspects of how it looks at and moves through the water matter to its success.

To get at these questions, we conceive of a simplified interaction loop with the Picbreeder website, automating users with large vision-language models. In this loop, we sample from an ever-growing shared archive of publications, and feed these snapshots to various parallel VLM agents. These agents can select an image to "branch", breed its offspring across generations, then name and publish their favorite child. Intermittently, agents are also asked to evaluate the quality and novelty of such snapshots from the archive. Future samples are drawn partially at random, and partially according to recency of publication, mean ratings from past agents, and branching popularity.

System overview: a single Picbreeder-VLM agent session and the shared archive it co-evolves with. The archive is sampled into a 100-image branching sample (5 mutually-exclusive subsets), which feeds both Evolution (VLM breeder, mutation/crossover, and the generation grid looping 20 times, after which a VLM publication step sends the chosen image back to the archive) and Evaluation (a fresh VLM rates archive entries 1-5). — AI Picbreeder: mimicking a pared-down version of human interaction with the original Picbreeder website, we use VLM instances to simulate users, growing unbounded archives of collaborative art produced by agents under varying conditions. In parallel, VLM agents interactively evolve and evaluate candidates saved to an online archive. As the archive grows unboundedly, candidates are selected according to their past evaluation scores and their placement in the phylogenetic tree (recency, number of offspring). In practice, we run 10 breeding agents in parallel continuously, branching from and publishing to the shared archive, and launch a critic agent whenever the archive grows by 5 publications. (Click to animate.)
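The archive sampling described above can be sketched as follows. This is a loose illustration, not the released implementation: the `Entry` fields, the equal 25% quota per criterion, and the rule of taking the top entries by each criterion are all assumptions, and the sketch uses three ranked subsets plus a random fill rather than the five subsets shown in the figure.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: int
    published_at: int     # publication order; larger means more recent
    mean_rating: float    # mean critic rating on the 1-5 scale
    num_children: int     # how many times this entry has been branched from


def sample_for_branching(archive, sample_size, rng, quota_fraction=0.25):
    """Pick disjoint subsets by recency, rating and popularity, then fill at random."""
    sample_size = min(sample_size, len(archive))
    quota = int(sample_size * quota_fraction)
    chosen = {}
    criteria = (
        lambda e: e.published_at,   # recency of publication
        lambda e: e.mean_rating,    # mean rating from past critics
        lambda e: e.num_children,   # branching popularity
    )
    for key in criteria:
        # Exclude entries already chosen so the subsets stay mutually exclusive.
        remaining = [e for e in archive if e.entry_id not in chosen]
        for entry in sorted(remaining, key=key, reverse=True)[:quota]:
            chosen[entry.entry_id] = entry
    # Whatever is left of the budget is drawn uniformly at random.
    remaining = [e for e in archive if e.entry_id not in chosen]
    for entry in rng.sample(remaining, sample_size - len(chosen)):
        chosen[entry.entry_id] = entry
    return list(chosen.values())


rng = random.Random(0)
archive = [
    Entry(i, published_at=i, mean_rating=rng.uniform(1, 5), num_children=rng.randint(0, 8))
    for i in range(500)
]
sample = sample_for_branching(archive, sample_size=100, rng=rng)
```

Keeping a uniformly random share in the sample means that older, unrated, or rarely branched entries retain some chance of being revisited.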

This resembles Evolution through Large Models, which involves the use of large language models as mutation operators within existing evolutionary algorithms, and has been variously leveraged to discover robot control policies, novel two-player board games, and solutions to combinatorial optimization problems. The key difference here is that in place of MAP-Elites or a generic genetic algorithm, we deploy an algorithm modeled after the website: its selection mechanisms, fitness function, and archive structure are all downstream of the design of that former user interface.

Timelapse: the growth of an archive. As agents branch from previous creations, we track the central (most-branched) nodes to get a sense of the aesthetic and semantic trends of the archive over time. Here, we see surges in popularity of legs, ducks, hawks, bulls, and finally an especially popular "anatomical study"—an impressionist, human-like form. (Double click to open in the archive viewer.)

Such an experiment in open-endedness cannot have a pre-defined objective. But while developing the system, examining its outputs, and reconstructing the phylogeny of the human archive, we concluded that qualitatively, the human archive has a certain je ne sais quoi. There is a kind of discernment among the forms there, a refinement lacking from some of the AI galleries, as if the AI agents never got over the initial spark of resemblance to find something better yet.

Indeed the agents' work projects a certain je sais exactly what. They often get sucked into their own attractors. Where the humans seem to capitalize on luck and imbue it with attention to create something refined, the AI often seems to experience it more passively, if perceptively. For what it's worth, though, the attractors that show up are different enough between seeds and hyperparameters to make repeatedly running the system a relatively entertaining endeavor.

Have a look at the galleries by clicking the thumbnails below:

A representative grid of the human-bred Picbreeder archive — **Human archive**

A coverage-maximizing sample of one VLM-driven Picbreeder archive — **VLM archive**

With some idea of the shape of this qualitative difference in mind, we devise some metrics to express it numerically. In sum, we try to measure how visually different the images in the archive look, how closely they resemble a predefined set of objects, and, beyond this predefined set, the diversity among the things they depict (i.e. in terms of the captions a model would generate for them).

The three diversity metrics, each measured in its own embedding space. The archive passes through SigLIP2 image embedding once; that single image embedding is reused by both the Visual Coverage plane and the joint Semantic Recall plane. The archive is also captioned by a VLM and text-embedded for Semantic Coverage, and a set of nouns is text-embedded into the same SigLIP2 space for Semantic Recall. Visual and Semantic Coverage are the k-covering radii of the archives embedded in their respective spaces; Semantic Recall is the mean nearest-noun distance. — **Evaluation metrics.** Visual and Semantic Coverage embed images and (VLM-generated) captions to a shared embedding space and measure each archive's propensity to cover that space in terms of k-covering radii. Semantic Recall pre-defines a set of nouns corresponding to images we might hope to recover, maps images and labels to a joint text and image embedding space, and measures the extent to which generated images approximate the set of labels, via max per-label cosine similarity. In our experiments, we use these metrics to evaluate the archives produced by individual runs of the system. Use the arrows above to see them visualized on the aggregate over all of these experiments, for the sake of illustration.

A 2D UMAP projection of a uniform sample of images pooled from every experiment (all VLM sweep runs and the full human Picbreeder archive) in SigLIP2 image space, softly partitioned into nine regions, with a representative image called out from each region and labeled by the experiment that produced it.
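For readers who want a feel for these quantities, here is a small sketch on toy 2D vectors standing in for real embeddings. Several things are assumptions on our part: the covering radius is estimated with a greedy farthest-point heuristic under cosine distance, and Semantic Recall averages the per-label maxima. The paper's exact estimator and aggregation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def k_covering_radius(points, k):
    """Greedy farthest-point estimate of the radius k centres need to cover all points."""
    # dist[i] is the cosine distance from point i to its nearest centre so far;
    # the first centre is arbitrarily taken to be points[0].
    dist = [1.0 - cosine(points[0], p) for p in points]
    for _ in range(k - 1):
        farthest = max(range(len(points)), key=dist.__getitem__)
        dist = [min(d, 1.0 - cosine(points[farthest], p)) for d, p in zip(dist, points)]
    return max(dist)


def semantic_recall(image_embeddings, label_embeddings):
    """Mean over labels of the best cosine similarity achieved by any image."""
    best = [max(cosine(label, img) for img in image_embeddings) for label in label_embeddings]
    return sum(best) / len(best)


# Toy stand-ins for image and noun embeddings.
archive_points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.1)]
radius_k1 = k_covering_radius(archive_points, k=1)
radius_k2 = k_covering_radius(archive_points, k=2)
recall = semantic_recall([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (1.0, 1.0)])
```

At a fixed k, an archive whose points are more spread out needs a larger radius to be covered, which is the sense in which a larger value indicates broader coverage in this sketch.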

We also develop a choice set of knobs for the system. We're inspired by the idea that serendipity requires chance, a kind of local narrative flow, and a broader personal context; and design interventions intended to serve as simplified computational models of these effects. In particular, we inject random noise into the agents' decision making process, we play with their memories—erasing everything but the present moment or overloading them with a full view of everything they've ever seen—and we seed them with simple personality traits intended to indirectly influence their interaction with the system.

Three variations on the inner evolution loop from the system figure, one per knob. In each, a selection stage feeds mutation/crossover, which produces a grid of offspring whose selected cell is tinted blue; the loop runs twenty times back to the selection stage. Noise: the loop returns to a gate that draws a uniform random variate u in zero to one; if u is at most epsilon it routes to a random pick, otherwise to the VLM's pick, both feeding mutation. Memory: faded copies of the candidate grid, each a past turn with a different pick, are stacked behind the current grid along a depth diagonal braced and labeled CL. Agents: a single trait is sampled once from a pool of NA personality traits (the pool braced and labeled NA) and prepended to the VLM system prompt. — **Agent interventions** applied during interactive evolution. **Noise:** at each selection step, we draw u ∼ 𝐔(0,1); when u ≤ ε, a random candidate image is selected. **Memory:** the last CL turns—past generations, breeding selections, and the original archive sample—sit stacked behind the current grid in the VLM's chat history. **Agents:** a single trait is sampled once, at session start, from a pool of NA personality traits and prepended to the breeder's system prompt.
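The three interventions are simple enough to sketch in a few lines each. The function names, the trait strings, and the prompt layout below are hypothetical; only the logic (the ε gate, the window of the last CL turns, one trait sampled per session) follows the caption above.

```python
import random


def select_candidate(candidates, vlm_pick, epsilon, rng):
    """Noise: when u <= epsilon, override the VLM's pick with a random candidate."""
    u = rng.random()                 # u ~ U(0, 1)
    if u <= epsilon:
        return rng.choice(candidates)
    return vlm_pick


def visible_history(turns, context_length):
    """Memory: keep only the last `context_length` turns of the session."""
    return turns[-context_length:] if context_length > 0 else []


def build_system_prompt(base_prompt, traits, rng):
    """Agents: prepend one sampled trait; an empty pool leaves the prompt unchanged."""
    if not traits:
        return base_prompt
    return f"{rng.choice(traits)}\n\n{base_prompt}"


rng = random.Random(0)
candidates = ["img_0", "img_1", "img_2", "img_3"]
noisy_pick = select_candidate(candidates, vlm_pick="img_2", epsilon=1.0, rng=rng)
recent = visible_history(["turn_1", "turn_2", "turn_3", "turn_4"], context_length=2)
prompt = build_system_prompt("You are breeding images.", ["You are drawn to symmetry."], rng)
```

In a session, `build_system_prompt` would be called once at the start, while `select_candidate` and `visible_history` would be applied at every one of the twenty generations.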

We find that the historical human archive tends to dominate VLM runs in terms of our evaluation metrics. We additionally measure the tree balance of generated phylogenies to gain a sense of the extent to which agents and users fixate on a select few prolific parents when branching. Without interventions, VLMs tend to be more discriminatory/myopic/obsessive than humans by this measure, while the random baseline is, by definition, the most balanced.

When augmenting the VLM loop with agent diversity, the system—on separate occasions—catches up in terms of semantic recall and semantic coverage. Adding agents also tends to steadily increase the degree of balance in the tree.

When we remove agents' memories, and they only have context of the current generation, they're prone to reproducing the same image ad nauseam. Even one previous generation of history is enough to break this pathology, making the agent less likely to knowingly repeat a selection. Too much memory seems to overload the agent's context and degrade its capabilities, with archives at longer memory-lengths often filled with overly abstract or repetitive, simple forms.

Injecting noise in agents' breeding selections can also—if a little crudely—push the agents out of problematic basins, at the risk of filling the archive with a greater proportion of intermittent stepping-stone slop (though interestingly, recognizable forms eventually do emerge even when agents are restricted solely to branching decisions; and indeed we can see this dial as a second-order mutation rate).

| Sweep | Setting | Semantic Recall | Visual Coverage | Semantic Coverage | Tree Balance (J¹) |
|---|---|---|---|---|---|
| Noise (ε) | 0.0 | 0.087 ±0.001 | 0.614 ±0.002 | 0.696 ±0.004 | 0.235 ±0.026 |
| | 0.05 | 0.086 ±0.001 | 0.619 ±0.011 | 0.702 ±0.006 | 0.246 ±0.023 |
| | 0.25 | 0.088 ±0.001 | 0.638 ±0.009 | 0.717 ±0.004 | 0.249 ±0.022 |
| | 0.5 | 0.085 ±0.001 | 0.633 ±0.007 | 0.709 ±0.005 | 0.260 ±0.013 |
| | 0.75 | 0.084 ±0.001 | 0.639 ±0.004 | 0.706 ±0.003 | 0.303 ±0.010 |
| | 1.0 | 0.082 ±0.001 | 0.610 ±0.013 | 0.700 ±0.003 | 0.275 ±0.030 |
| Memory (CL) | 0 | 0.082 ±0.001 | 0.527 ±0.009 | 0.632 ±0.006 | 0.305 ±0.027 |
| | 1 | 0.087 ±0.001 | 0.614 ±0.002 | 0.696 ±0.004 | 0.235 ±0.026 |
| | 2 | 0.083 ±0.001 | 0.583 ±0.004 | 0.675 ±0.006 | 0.339 ±0.033 |
| | 10 | 0.079 ±0.003 | 0.512 ±0.021 | 0.661 ±0.010 | 0.331 ±0.040 |
| | 20 (full) | 0.083 ±0.001 | 0.595 ±0.010 | 0.697 ±0.003 | 0.350 ±0.018 |
| Agents (NA) | 0 | 0.087 ±0.001 | 0.614 ±0.002 | 0.696 ±0.004 | 0.235 ±0.026 |
| | 10 | 0.086 ±0.002 | 0.605 ±0.012 | 0.698 ±0.011 | 0.373 ±0.030 |
| | 100 | 0.089 ±0.001 | 0.659 ±0.006 | 0.710 ±0.009 | 0.473 ±0.022 |
| | 1000 | 0.088 ±0.001 | 0.665 ±0.004 | 0.734 ±0.013 | 0.476 ±0.013 |
| Random | | 0.080 ±0.001 | 0.612 ±0.005 | 0.692 ±0.002 | 0.540 ±0.003 |
| Human | | 0.089 | 0.681 | 0.730 | 0.363 |

Quantitative results. The human archive emerges as a strong upper bound relative to our VLM-driven experiments, where we measure the effects of noise, memory and agents on Picbreeder. The random baseline provides a lower bound. Among VLM experiments, agent diversity via the injection of subtle breeder system prompts is particularly effective in closing the VLM-human gap. A minimal amount of memory is crucial in reducing repetitive actions while avoiding context bloat in this token-hungry multimodal regime. A bit of noise in the selection process can also encourage exploration. Uniformly random branching produces maximally balanced trees in expectation. Humans are much less balanced; VLMs similarly so, though they can be made somewhat more or less balanced under varying hyperparameters. Mean ± std over 6 seeds; 2,000 sessions each. The Human archive consists of a single historical "run" (and thus carries no interval).

Elsewhere, it’s been argued that with respect to training large models like LLMs, open-ended search might be crucial to learning good representations. For example, a model trained only to predict the next token in language corpora might not have the “mental model” of abstract, symbolic arithmetic operations that we might expect (e.g. it might be able to compute quantities over one class of objects but not another). Maybe, if we can reproduce the open-endedness of something like Picbreeder, we could also apply it to processes of training large models themselves. Further still, we might hope that the agents at the helm of this model cultural organism might instead form a research culture concerning their own growth, in turn refining their understanding, bootstrapping their models of the world, and perhaps leading to recursive self-improvement.

We see some evidence of that here, perhaps, in a skull form produced by AI Picbreeder without our asking (though the AI variant is plainly lesser than the human one). Sweeping the most consequential CPPN weights of this image does not explode it, as happens with a CPPN optimized via gradient descent to produce such a skull. It remains to be seen, however, how much of this representational robustness is owing to CPPNs and NEAT, and how much to the VLMs' judgment and the Picbreeder interaction loop at large. But these experiments would seem to suggest that at least part of one potentially important emergent byproduct of open-endedness can recur in a synthetic system without being optimized for directly.

Weight-sweep visualization of an SGD-trained skull CPPN from Kumar et al. (2025). — **Why open-ended agent-based evolution might be used to train (AI with) better representations.** Both VLMs and humans found something that looked like a skull. (The human skull is better.) When one instead optimizes a CPPN to depict that skull (left), the result has a fractured, noisy representation. The human skull appears to have more meaningful representations. Open-ended search by VLM agents seems to result in representations somewhere in between.

Weight-sweep visualization of a VLM-evolved skull CPPN.

But this doesn't mean that synthetic open-endedness is necessarily solved. One could contend, for example, that VLM-driven Picbreeder is not engaged in a process of open-ended discovery so much as of rediscovery: circuitously and arduously recovering its vanilla visual priors through a molasses of neural image representations. Though here we'd counter that a large part of human creativity in this space could be cast in the same light, and that this is by design a space where creativity operates under and is characterized by strong constraints.

No, maybe in spite of our efforts to engineer a pipeline for collaborative design, the agents driving it are plagued by an irreconcilable rigidity. We can see traces, perhaps, of their striver pathology in the transcripts. See below, where two vehicles are arrived at via more or less winding roads; the VLMs tend to remain in the semantic space of cars in general (moving from the driver's seat, to the dashboard, to the hood, to a full profile) while humans traverse a more eclectic and unpredictable sequence, exploring eyes and (alien) faces before suddenly landing on the automobile. In this light, one could argue the VLMs show a lack of imagination, where even the apparently serendipitous leaps in semantic space are in fact predictable according to an implicit theme or predetermined objective.

Human lineage step 1 — **Human selections and publications.**

Human lineage step 2 — **Human selections and publications.**

We make the code and data available for follow-up questions. Though here we’ve distilled from the design of Picbreeder.org a kind of simple, bespoke evolutionary algorithm, there might be some benefit in taking a more unstructured, true-to-life approach. Our homage to the former website acts as a first step in that direction: a number of agents could be allowed to interact with the site's various pages open-endedly, browsing, searching, and jumping between sessions at their own pace.

Citation

For attribution in academic contexts, please cite this work as

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi, "In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models", GECCO 2026.

BibTeX citation

@inproceedings{earle2026picbreedervlm,
  title     = {In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models},
  author    = {Earle, Sam and Arulkumaran, Kai and Dai, Andrew and Kumar, Akarsh and Togelius, Julian and Risi, Sebastian},
  booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '26)},
  year      = {2026}
}

Open Source Code

We release our code here. The full paper is on arXiv.