Blog
Post · 2026-06-02

Categorization via Embedding Clustering

On June 2, 2026, the reader at hadleylab.org/library stopped needing a person to decide its shelves. A method called k-means, first written down by Stuart Lloyd in 1957, sorted all 128 published documents into 8 categories from the math of their embeddings alone. The obvious reading is that a library needs a librarian to choose its categories. The working reading is that the categories were already sitting in the writing, latent in the text, and the only job left was to measure them.

The 128 documents are everything the library publishes: 95 blog posts, 11 papers, 6 books, and 16 decks. For months they carried exactly one label apiece, the source they came from. A blog was a blog. A deck was a deck. That tells a reader where a document was filed, never what it is about.

What the corpus already knew

Every one of those 128 documents had already been turned into a vector. When the library became its own corpus, each rendered page was embedded by a model named bge-m3 into a list of 1024 numbers. That list is a coordinate. Two documents that talk about the same things land near each other in the space those coordinates describe, even when they never share a single word.
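To make "land near each other" concrete, here is a minimal sketch. It assumes cosine similarity as the measure of nearness, which this post does not specify, and it uses made-up 4-number vectors in place of the real 1024-number embeddings. The document names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-ins for embeddings; real ones hold 1024 numbers each.
doc_clinical_a = [0.9, 0.1, 0.0, 0.2]
doc_clinical_b = [0.8, 0.2, 0.1, 0.3]
doc_compiler = [0.1, 0.9, 0.7, 0.0]

sim_same_topic = cosine_similarity(doc_clinical_a, doc_clinical_b)
sim_cross_topic = cosine_similarity(doc_clinical_a, doc_compiler)

# Two documents on the same subject score higher than two on different ones.
print(sim_same_topic > sim_cross_topic)
```

Nothing in the comparison looks at words. It only looks at where the two coordinates point.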

So the categories did not have to be invented. They were a property of where the documents already sat. The embeddings were built for search, so the reader could ask a question and get back passages instead of filenames. The same coordinates that answer a search also answer a different question: which documents belong together?

How k-means draws the lines

k-means is the plainest answer to that question. You tell it how many groups you want. It drops that many center points into the space at random, assigns every document to its nearest center, moves each center to the middle of the documents it caught, and repeats until nothing moves. Lloyd's procedure always converges to a stable partition, though which partition it reaches depends on where the centers start. The scikit-learn library does the arithmetic.
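The loop described above can be sketched in plain Python. This is an illustration of Lloyd's procedure, not the library's actual code: the function names are hypothetical, the starting centers are simply k data points picked at random, and the toy points are two-dimensional. scikit-learn's own default start is the smarter k-means++ scheme.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, k, seed, max_iter=100):
    """Return (assignments, centers) for a list of equal-length points."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the partition is stable
        assignments = new_assignments
        # Update step: move each center to the mean of the points it caught.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centers

# Two obvious clumps of toy points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = lloyd_kmeans(points, k=2, seed=42)
print(labels)
```

The first three points share one label and the last three share the other. Which clump gets called 0 and which gets called 1 depends on the random start.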

Two choices make the result reproducible. The number of groups is fixed at 8 and written into governance, not guessed at run time. The random starting positions are pinned with a seed of 42, so the same corpus, run through the same library versions, produces the same partition on every machine. Run it twice and the output is identical to the byte. A category scheme that drifted every time you rebuilt it would be worse than no scheme at all.
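Here is a minimal sketch of those two pinned choices using scikit-learn's KMeans. The count of 8 and the seed of 42 come from this post. Everything else is an assumption for illustration: the embeddings are synthetic random numbers, n_init is set to 10 by choice, and the helper name cluster is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

N_CATEGORIES = 8  # the fixed category count
SEED = 42         # the pinned seed

# Synthetic stand-in for the corpus: 128 rows of 1024 numbers each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(128, 1024))

def cluster(vectors):
    """Fit k-means with the pinned settings and return one label per row."""
    model = KMeans(n_clusters=N_CATEGORIES, random_state=SEED, n_init=10)
    return model.fit_predict(vectors)

labels_first = cluster(embeddings)
labels_second = cluster(embeddings)

# With random_state fixed, a second run reproduces the first.
print(np.array_equal(labels_first, labels_second))
```

Remove random_state and the second run is free to start its centers elsewhere, which is exactly the drift the seed is there to prevent.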

The eight categories that fell out

Here is what the corpus separated into, with each group named by the words that distinguish it. The names are not chosen. They are the highest-scoring terms in each group under a measure called TF-IDF, which rewards a word for being common inside a group and rare everywhere else.
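A small sketch of that naming step follows. It assumes one common TF-IDF variant, in which each group is treated as a single bag of words and a term is scored by its share of the group's words times the log of how few groups contain it. The post does not say which variant the library uses, and the function name and toy tokens are hypothetical.

```python
import math
from collections import Counter

def top_terms_per_group(groups, top_n=2):
    """groups maps a group id to a list of token lists, one list per document."""
    # Term frequency: count each word across every document in the group.
    term_counts = {
        g: Counter(tok for doc in docs for tok in doc)
        for g, docs in groups.items()
    }
    n_groups = len(groups)
    # Group frequency: how many groups contain the word at all.
    group_freq = Counter(tok for counts in term_counts.values() for tok in counts)
    names = {}
    for g, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            tok: (n / total) * math.log(n_groups / group_freq[tok])
            for tok, n in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        names[g] = ranked[:top_n]
    return names

groups = {
    0: [["clinical", "trial", "the"], ["compliance", "clinical", "the"]],
    1: [["compiler", "coin", "the"], ["compiler", "economy", "the"]],
}
group_names = top_terms_per_group(groups)
print(group_names)
```

A word like "the" appears in every group, so its score is zero and it never becomes a name. A word like "clinical" is frequent in one group and absent from the other, so it rises to the top.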

The largest group, 33 documents, is the clinical and compliance writing. The second, 29, is the work about the coin economy and the compiler. A group of 15 covers sessions, the canon, and git history. Another 15 hold the cancer and Caribbean case material. Twelve are about transcripts and the agent. Then three groups of 8: the neutral-theory essays, the vocabulary and scope-structure notes, and the writing about data and care. No group was empty, and none swallowed the corpus.
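The balance claim is easy to check with arithmetic on the sizes listed above. This sketch only restates those numbers.

```python
# Group sizes as listed in this post, largest first.
group_sizes = [33, 29, 15, 15, 12, 8, 8, 8]

total = sum(group_sizes)
smallest = min(group_sizes)
largest_share = max(group_sizes) / total

print(total, smallest, round(largest_share, 3))  # prints: 128 8 0.258
```

The sizes add up to all 128 documents, the smallest group still holds 8, and the largest holds about a quarter of the corpus.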

Why a low score is the honest answer

There is a standard way to ask whether a clustering is any good, a number called the silhouette score that runs from minus one to one. Ours is 0.036. That is low, and saying so is the point. A score near zero means the groups overlap: a typical document sits about as close to a neighboring group as to its own. Text embeddings spread documents across 1024 dimensions, and real writing does not fall into hard, well-separated balls. A category like the clinical writing shades into the compliance writing, which shades into the enterprise writing, because the documents themselves do.
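For readers who want to see what the number measures, here is a minimal sketch of the silhouette score in plain Python. It assumes Euclidean distance and at least two groups, and it is an illustration only. scikit-learn ships its own implementation as sklearn.metrics.silhouette_score, and the post does not say which distance the library's run used.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two distinct labels."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for a single-member cluster
            continue
        # a: mean distance to the rest of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

labels = [0, 0, 1, 1]
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
overlapping = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (1.5, 0.0)]

score_separated = silhouette_score(separated, labels)
score_overlapping = silhouette_score(overlapping, labels)
print(score_separated > score_overlapping)
```

The well-separated toy points score close to one. The interleaved ones score below zero, because some points sit nearer the other group than their own. A value like 0.036 falls between the two: groups that exist but blur at the edges.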

The silhouette is printed for a person to read, never used to pick the number of categories on its own. The count of 8 is a governance decision, informed by the score and by whether the groups stay balanced, then frozen. Letting an algorithm chase the highest silhouette would hand the shape of the library to a metric that does not know what a reader wants.

The unlock is one backend, scoped

The reason this was a small job and not a new system is the part worth keeping. The library does not run its own embedding model or its own clustering code. Retrieval already existed. Embedding already existed. Clustering was the one missing piece, and it is the same fifteen lines of arithmetic over the same coordinates, pointed at the library instead of at a medical corpus. The real change is that there is now one backend for reading the corpus and one knob, the category count, per collection.

That is why each document also carries an era, a date bucket from the library's own timeline, and a short list of tags drawn from the same scoring that named the groups. A reader can now narrow the shelf to the compiler writing from the founding months, or the clinical material from this spring, without anyone having tagged a single document by hand.
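Narrowing by category and era can be sketched as a plain filter. Everything named here is hypothetical: the Doc record, the era names and their date boundaries, and the sample shelf. The post does not list the library's real eras or its data model.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Doc:
    title: str
    category: int      # cluster label assigned by k-means
    published: date

# Hypothetical era buckets as (name, first day, last day).
ERAS = [
    ("early", date(2025, 1, 1), date(2025, 12, 31)),
    ("recent", date(2026, 1, 1), date(2026, 12, 31)),
]

def era_of(day):
    """Return the name of the era containing the date."""
    for name, start, end in ERAS:
        if start <= day <= end:
            return name
    return "unbucketed"

def narrow(docs, category=None, era=None):
    """Keep documents matching every filter that was supplied."""
    return [
        d for d in docs
        if (category is None or d.category == category)
        and (era is None or era_of(d.published) == era)
    ]

shelf = [
    Doc("doc-a", 1, date(2025, 3, 4)),
    Doc("doc-b", 1, date(2026, 5, 20)),
    Doc("doc-c", 0, date(2026, 4, 2)),
]
early_cat1 = narrow(shelf, category=1, era="early")
print([d.title for d in early_cat1])  # prints: ['doc-a']
```

Both filters are derived values. The category comes from the clustering and the era comes from the publication date, so neither needs a person to assign it.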

What this means for the reader

The library you browse now shows three new filters above the search box, and every one of them was computed, not typed. Pick a category and you see the documents that cluster together. Pick an era and you see a slice of the timeline. Pick a tag and you see the documents that share a distinguishing term. The shelves arranged themselves, and the arrangement will rebuild itself, by the same fixed procedure, the next time a document is published.

Sources

Claim | Source
Lloyd's procedure converges to a stable partition | k-means clustering
scikit-learn does the arithmetic for the partition | scikit-learn KMeans
bge-m3 embedded each document into 1024 numbers | BGE-M3 embedding model
TF-IDF rewards a distinguishing word in each group | tf–idf
The silhouette score runs from minus one to one | Silhouette (clustering)

The library stopped needing a librarian to file its work the moment its own writing could be measured instead of labeled.
