Can AI be creative? It's a question on the lips and fingertips of many. The culture is filled with slop, fine art is fueled by new AI-driven forms, and every day we encounter all manner of strange objects falling somewhere between. All told, humans deploy AI as a material in cultural conversation. But could AI ever meaningfully become the speaker? That is, could agents self-organize to create culture with a trajectory as open-ended as our own, a kind of self-contained living cultural entity?
This is our criterion for considering agents creative. But what does creativity look like at the agent level?
One thing we can safely say is that it's distinct from the process of pursuing a well-defined plan, or working linearly toward an objectively verifiable goal. In a creative process, the end product is often not conceivable from the outset. Instead, it is discovered serendipitously; its shape emerges from the process itself. It's a trope that bears repeating: in the studio, on the canvas, and on the stage, that which appears brilliant in retrospect is often attributable to some happy accident that lights a spark of inspiration in the moment.
For an AI to be creative, then, it should be capable of discovering novel forms during a kind of semi-aimless wandering. It should be capable of surprising itself, and in response, throwing its plans and preconceptions out the window in pursuit of some unexpected possibility.
To study whether the capacity for such serendipitous discovery exists in AI systems, we turn to a minimal substrate where we know it to be possible.
The original Picbreeder website
That substrate is Picbreeder, an online service where users collaboratively evolved images with no goal beyond following whatever caught their eye. Each image is encoded by a compositional pattern-producing network (CPPN), whose structure is grown over generations by NEAT. Picbreeder became the standard demonstration that dropping a fixed objective — searching for novelty and diversity rather than optimizing toward a target — can reach forms that goal-directed search never would, the thesis later popularized as Why Greatness Cannot Be Planned and now central to work on open-endedness and machine creativity.
Compositional Pattern-Producing Networks (CPPNs). At every pixel, the x and y coordinates and the radius from center r are fed into the network, which returns hue, saturation and value (HSV), i.e. a color. In this way, each CPPN encodes an image of arbitrary resolution.
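As a rough illustration of how a coordinate-to-color network yields an image at any resolution, here is a minimal sketch. The fixed three-node topology, the activation choices, the sigmoid squashing of the outputs, and the weight values in `HIDDEN_W` and `OUT_W` are all assumptions made for this example; in Picbreeder the topology is grown by NEAT rather than fixed by hand.

```python
import colorsys
import math

# Hypothetical toy topology: three hidden nodes, one activation each.
ACTIVATIONS = [math.sin, lambda z: math.exp(-z * z), math.tanh]


def sigmoid(z):
    """Squash a raw output into (0, 1) so it can serve as an HSV channel."""
    return 1.0 / (1.0 + math.exp(-z))


def cppn_pixel(x, y, hidden_w, out_w):
    """Map one pixel's (x, y, r, bias) inputs to an RGB color via HSV."""
    r = math.hypot(x, y)  # radius from the image center
    inputs = (x, y, r, 1.0)  # the constant 1.0 acts as the bias input
    hidden = [
        act(sum(w * v for w, v in zip(row, inputs)))
        for act, row in zip(ACTIVATIONS, hidden_w)
    ]
    h, s, v = (sigmoid(sum(w * a for w, a in zip(row, hidden))) for row in out_w)
    return colorsys.hsv_to_rgb(h, s, v)


def render(hidden_w, out_w, size):
    """Sample the network on a size x size grid spanning [-1, 1] on both axes."""
    coords = [2.0 * i / (size - 1) - 1.0 for i in range(size)]
    return [[cppn_pixel(x, y, hidden_w, out_w) for x in coords] for y in coords]


# Illustrative weights only; they do not come from any archive.
HIDDEN_W = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0], [1.0, -1.0, 0.0, 0.5]]
OUT_W = [[1.0, 0.5, 0.0], [0.0, 2.0, 0.0], [0.5, 0.0, 1.5]]

small = render(HIDDEN_W, OUT_W, 9)   # 9 x 9 rendering
large = render(HIDDEN_W, OUT_W, 17)  # same network, denser sampling
```

Because the network is a function of continuous coordinates, `small` and `large` are two samplings of the same underlying image: pixels that share coordinates, such as the center, receive identical colors.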
Explore a CPPN. This is the picture's DNA: the network that paints it, drawn bottom-to-top from its inputs (x, y, d, bias) up to its colour outputs, with the intermediate pattern at every node shown as the node itself. Pick a human-evolved original from Picbreeder or one of our VLM-evolved networks. Click a node to enlarge its output and open a ring of controls (change its activation, add a link, edit, delete); click a connection for its own ring, or just drag a link up/down to scrub its weight and watch every pattern shift. It's the same editor as the full interactive site.
Human archive: 9,375 images. AI archives: 13 runs (noise, memory & agents).
Click either to browse the full archive inline. Switch between the human and AI archives — and across AI runs — from the viewer's top bar; sort by recency, similarity, or phylogeny.
Our experiment swaps the human selector for a vision–language model — here Gemini 2.5 — which views a grid of candidate images and chooses which to breed, just as a Picbreeder user would. The idea echoes earlier innovation engines, in which a trained network's own responses drive an open-ended evolutionary search. To quantify what each archive covers, we embed every published image with SigLIP 2 and compare its semantic spread against human similarity judgments from the THINGS dataset.
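The selection loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the genome is assumed to be a flat list of weights (NEAT also mutates topology), `stub_selector` is a hypothetical placeholder standing in for the VLM call, and the grid size, mutation scale, and generation count are arbitrary.

```python
import random


def mutate(genome, rng, sigma=0.3):
    """Return a copy of the genome with Gaussian noise added to every weight."""
    return [w + rng.gauss(0.0, sigma) for w in genome]


def stub_selector(grid, rng):
    """Hypothetical stand-in for the VLM: return indices of candidates to breed.

    A real selector would render each candidate, show the grid to the model,
    and parse its choice; here one index is simply picked at random.
    """
    return [rng.randrange(len(grid))]


def run_session(seed_genome, select, generations=5, grid_size=9, seed=0):
    """Breed a grid of candidates, let the selector choose, and repeat."""
    rng = random.Random(seed)
    parents = [seed_genome]
    for _ in range(generations):
        # Each candidate in the grid is a mutated copy of one chosen parent.
        grid = [mutate(rng.choice(parents), rng) for _ in range(grid_size)]
        chosen = select(grid, rng)
        parents = [grid[i] for i in chosen]
    return parents[0]


final_genome = run_session([0.0, 0.0, 0.0, 0.0], stub_selector)
```

Passing the selector in as a function keeps the evolutionary loop independent of whoever does the choosing, so the same loop could in principle be driven by a human, a VLM, or a random baseline.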
VLMs replace human users in Picbreeder, growing an unbounded archive of novel images.
Phylogeny: each node is a published image; an edge connects a parent to the child branched from it, so paths through the tree trace lines of evolutionary descent. Laid out force-directed (sfdp): related lineages cluster apart, tight clumps are oft-rebranched hubs, which swell into legible thumbnails so you can watch the most-branched images in their place in the tree.
Most-branched leaderboard: the same images ranked by their number of direct descendants so far.
The archive, as it fills: all 3,123 publications in order, each rising in at the bottom as the feed scrolls up.
One archive, three views (long-context run, CL = 10, seed 5). The full run — all 3,123 publications — advancing in step: left, the branching tree grows as related lineages cluster apart and the most-branched hubs swell into view; centre, a running leaderboard of the most-branched images; right, the archive fills publication by publication. Drag the bar to scrub all three together.
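For readers who want to reproduce the tree views from raw data, the sketch below shows how a most-branched ranking and a line of descent could be derived from parent-child edges. The edge list and image names are made up for illustration; the actual archive format is not specified here.

```python
from collections import Counter

# Hypothetical (parent, child) publication edges.
edges = [
    ("root", "a"), ("root", "b"),
    ("a", "c"), ("a", "d"), ("a", "e"),
    ("b", "f"),
]


def most_branched(edges, top=3):
    """Rank images by number of direct descendants, ties broken by name."""
    counts = Counter(parent for parent, _ in edges)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


def lineage(edges, node):
    """Walk parent links from a node back to its earliest ancestor."""
    parent_of = {child: parent for parent, child in edges}
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path[::-1]  # earliest ancestor first


leaderboard = most_branched(edges)  # [("a", 3), ("root", 2), ("b", 1)]
ancestry = lineage(edges, "e")      # ["root", "a", "e"]
```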
The results, with the historical human baseline for reference (green = overall best, grey = default setting, bold = best within a sweep).
Evaluation metrics. The same archive is embedded into three spaces, drawn as a stack of planes. Visual Coverage and Semantic Coverage are the k-covering radius of the archive in SigLIP2 image space and in the text space of its VLM captions respectively — how much of the space the archive spreads across. Semantic Recall embeds image and text jointly and measures the mean distance from each THINGS noun to its nearest archive image — how close the archive comes to depicting a fixed vocabulary of concepts.
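To make the two kinds of metric concrete, here is a small sketch under stated assumptions. Cosine distance is assumed as the distance measure, and the k-covering radius is computed with a greedy farthest-point heuristic, which is only one plausible reading of the definition above. The 2-D vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity (an assumed choice of distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def semantic_recall(noun_embs, image_embs):
    """Mean distance from each noun embedding to its nearest archive image."""
    nearest = [min(cosine_distance(n, im) for im in image_embs) for n in noun_embs]
    return sum(nearest) / len(nearest)


def k_covering_radius(points, k):
    """Greedy farthest-point estimate of the radius needed for k centers,
    chosen from the points themselves, to cover every point."""
    dist = [cosine_distance(p, points[0]) for p in points]  # first center: points[0]
    for _ in range(min(k, len(points)) - 1):
        far = max(range(len(points)), key=dist.__getitem__)  # farthest uncovered point
        dist = [min(d, cosine_distance(p, points[far])) for d, p in zip(dist, points)]
    return max(dist)


# Toy 2-D "embeddings", purely illustrative.
images = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
nouns = [(1.0, 0.0), (-1.0, 0.0)]

recall = semantic_recall(nouns, images)
radius_k1 = k_covering_radius(images, 1)
radius_k2 = k_covering_radius(images, 2)
```

In this toy case the first noun coincides with an archive image and the second sits at distance 1 from its nearest image, so the mean is 0.5. The covering radius shrinks as k grows, since more centers leave every point closer to one of them.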
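| Sweep | Setting | Semantic Recall | Visual Coverage | Semantic Coverage | Tree Balance (J¹) |
|---|---|---|---|---|---|
| Noise (ε) | 0.0 | 0.087 | 0.614 | 0.696 | 0.235 |
| Noise (ε) | 0.05 | 0.086 | 0.619 | 0.702 | 0.246 |
| Noise (ε) | 0.25 | 0.088 | 0.638 | 0.717 | 0.249 |
| Noise (ε) | 0.5 | 0.085 | 0.633 | 0.709 | 0.260 |
| Noise (ε) | 0.75 | 0.084 | 0.639 | 0.706 | 0.303 |
| Noise (ε) | 1.0 | 0.082 | 0.610 | 0.700 | 0.275 |
| Memory (CL) | 0 | 0.082 | 0.527 | 0.632 | 0.305 |
| Memory (CL) | 1 | 0.087 | 0.614 | 0.696 | 0.235 |
| Memory (CL) | 2 | 0.083 | 0.583 | 0.675 | 0.339 |
| Memory (CL) | 10 | 0.079 | 0.512 | 0.661 | 0.331 |
| Memory (CL) | 20 (full) | 0.083 | 0.595 | 0.697 | 0.350 |
| Agents (NA) | 0 | 0.087 | 0.614 | 0.696 | 0.235 |
| Agents (NA) | 10 | 0.086 | 0.605 | 0.698 | 0.373 |
| Agents (NA) | 100 | 0.089 | 0.659 | 0.710 | 0.473 |
| Agents (NA) | 1000 | 0.088 | 0.665 | 0.734 | 0.476 |
| Baselines | Random | 0.080 | 0.612 | 0.692 | 0.540 |
| Baselines | Human | 0.089 | 0.681 | 0.730 | 0.363 |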
Mean over 6 seeds; 2,000 sessions each. (Random's high Tree Balance is expected: uniformly random branching produces maximally balanced trees in expectation.)
Human lineage of a car. Published images along the ancestry of a car in the human Picbreeder archive, earliest at left.
VLM lineage of a car. Published images along the ancestry of a VLM-evolved car, earliest at left; each labeled with the title the model gave it on publishing.
Soda-can pull-tabs (emerges at long context).
Foxes.
Top-rated leaderboard (long-context run, CL = 10). Published images ranked by the mean VLM rating each carried at that point in the run, ties broken by number of ratings, replayed from the agents' branching-snapshot logs. The bar at top fills as publications accrue. Every slot is a top-down soda-can lid (“Aluminum Can Top,” “Pop Top,” “Photorealistic Can Top”), each rated 5.00.
A region of the 1,000-agent archive. High-frequency, low-interpretability patterns that recur across many agents.
These recurring high-frequency regions recall the evolved “fooling” images that a network scores with high confidence yet a human finds unrecognizable — a reminder that the VLM's preferences, like any learned objective, can be satisfied by textures as readily as by objects.
SGD-trained CPPN. Perturbing individual weights produces chaotic, skull-destroying distortions: the "fractured, entangled" regime. From , Fig. 6b.
VLM-evolved CPPN (ours). Perturbations stay skull-like and change relatively smoothly: less fractured than SGD, but without crisp semantic factors.
Human-evolved Picbreeder CPPN. Individual weights cleanly factor into semantic controls such as "Mouth Opening" and "Eye Winking". From , Fig. 6a.
Sweeping individual weights of skull CPPNs across three regimes. Each row varies one weight from δw = −1 to δw = +1 while holding the rest fixed. SGD optimization yields entangled representations whose perturbations are destructive; human-driven Picbreeder yields cleanly factorized ones; VLM-driven Picbreeder sits in between: smooth and skull-preserving, but not yet semantically labelable.
Archive traversal. Sixty-four cells, each morphing a CPPN along a path through one run's branching tree, from one published image to the next. [placeholder — cells will be ordered by visual similarity]
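The single-weight sweep used in the skull comparison (one weight varied from δw = −1 to δw = +1, all others held fixed) can be sketched as below. The flat weight list, the number of steps, and `toy_render` are assumptions for illustration; `toy_render` is a hypothetical stand-in for actually rendering a CPPN to an image.

```python
import math


def sweep_weight(weights, index, render, steps=5):
    """Render the network at evenly spaced offsets of one weight in [-1, 1]."""
    frames = []
    for step in range(steps):
        delta = -1.0 + 2.0 * step / (steps - 1)
        perturbed = list(weights)  # copy, so all other weights stay fixed
        perturbed[index] += delta
        frames.append((delta, render(perturbed)))
    return frames


def toy_render(weights):
    """Hypothetical placeholder: a real version would return an image."""
    return [round(math.sin(w), 3) for w in weights]


base_weights = [0.5, -0.2, 1.0]
frames = sweep_weight(base_weights, 1, toy_render)
deltas = [delta for delta, _ in frames]  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Repeating the call for each `index` gives one row per weight, which is how a figure of this kind would be assembled.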
More broadly, using foundation models as the engine of evolutionary search — and as stand-ins for human notions of interestingness — points toward systems that generate and pursue their own goals, much as an agent shapes its own behavior from a reward signal in reinforcement learning.
Citation
For attribution in academic contexts, please cite this work as
Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi, "In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models", GECCO 2026.
BibTeX citation
@inproceedings{earle2026picbreedervlm,
title = {In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models},
author = {Earle, Sam and Arulkumaran, Kai and Dai, Andrew and Kumar, Akarsh and Togelius, Julian and Risi, Sebastian},
booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '26)},
year = {2026}
}
Open Source Code
We release our code here. The full paper is on arXiv.