Gary Wu

Compounding Research Corpora

Concept-shaped knowledge bases with dense bidirectional links compound across articles; topic-shaped ones produce excellent stand-alones. Here is the survey that forces the question.


We watched two researchers work.

The first is Jon Y at Asianometry. He starts every video from scratch: pull sources, read until claims repeat, move big blocks first, write the script end-to-end. The video is extraordinary — high density, original framing, specific citations, none of the AI-slop tells. But when he starts the next video, the slate is clean. Research notes sit in Google Docs. One topic per folder. Nothing in topic B knows what topic A discovered. The per-video brilliance does not compound into inter-video intelligence.

The second is Wikipedia. Wikipedia’s semiconductor articles are individually mundane compared to Jon’s best work. But Wikipedia’s emergent property is different: when you read the article on TSMC, dozens of inline links take you to foundries, to photolithography, to supply chain geography, to EUV optics, to ASML. The links are not decorations. They are the mechanism by which the corpus keeps getting richer. Add a new fact to one article and it instantly acquires context from every article it links to. The more Wikipedia grows, the richer each new article is at birth, because the linking layer delivers pre-existing context that no single article could carry.

That gap — between Jon Y’s per-video brilliance and Wikipedia’s corpus-level compounding — is the thing this article is about.

What you’ll learn:


The Thesis

A research corpus compounds at the linking layer, not the article layer. Topic-shaped knowledge bases — one directory per topic, sources and synthesis per topic — produce excellent stand-alone artifacts but do not compound across topics. Concept-shaped knowledge bases — atomic notes, dense bidirectional links, typed entity extraction — compound: the Nth article is richer than the first because the linking layer surfaces patterns across the corpus. The open empirical question for AI-mediated niche-specific research is which shape fits which use case.


1. Topic-Shape and Concept-Shape

The distinction is a question of what you index on.

Topic-shape: One directory per research question. Sources, synthesis, draft, and final artifact live together. The canonical file-system signature looks like this:

kb/
  houseplants/
    pothos/
      topic.md
      sources/
        rhs-care-guide.md
        university-extension-paper.md
      synthesis.md
      draft.md
      final/
        article.md

Topic-shape privileges depth-per-artifact. Each research session is optimized for producing one great piece on one topic. Cross-topic pattern detection is not a first-class operation; it requires someone to look across directories by hand. The linkage graph is thin. Two pothos articles and one snake-plant article have no machine-visible relationship unless someone manually writes one.

Concept-shape: One page (or node) per concept. Sources feed many concept pages simultaneously. The canonical signature:

wiki/
  concepts/
    leaf-variegation.md         ← cited by 4 articles
    low-light-tolerance.md      ← cited by 7 articles
    root-rot-causes.md          ← cited by 6 articles
  entities/
    pothos.md                   ← links to leaf-variegation, low-light-tolerance
    snake-plant.md              ← links to low-light-tolerance, root-rot-causes
    dracaena.md                 ← links to low-light-tolerance

Concept-shape privileges cross-topic visibility. When you write the snake-plant article, the low-light-tolerance concept page already has findings from the pothos article embedded in it. The Nth article arrives with N-1 articles’ worth of context pre-loaded via the concept network. Compounding is the default behavior, not an afterthought.
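One way to picture the concept network is as a backlink index: for every concept page, the set of entity pages that link to it. The following is a minimal sketch, not any surveyed system's implementation. It assumes a hypothetical `[[concept-name]]` link syntax (the article does not specify one) and uses in-memory page bodies instead of files.

```python
import re
from collections import defaultdict

# Hypothetical wiki-link syntax: [[concept-name]]. This is an assumption.
WIKI_LINK = re.compile(r"\[\[([a-z0-9-]+)\]\]")

def build_backlinks(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each link target to the set of pages that link to it."""
    backlinks: dict[str, set[str]] = defaultdict(set)
    for name, body in pages.items():
        for target in WIKI_LINK.findall(body):
            backlinks[target].add(name)
    return dict(backlinks)

# Illustrative page bodies mirroring the entities/ directory above.
pages = {
    "pothos": "Tolerates shade [[low-light-tolerance]]; shows [[leaf-variegation]].",
    "snake-plant": "See [[low-light-tolerance]] and [[root-rot-causes]].",
    "dracaena": "Another [[low-light-tolerance]] plant.",
}
backlinks = build_backlinks(pages)
print(sorted(backlinks["low-light-tolerance"]))
# ['dracaena', 'pothos', 'snake-plant']
```

When the snake-plant page is written, the `low-light-tolerance` entry already lists pothos, which is the "context pre-loaded" effect in miniature.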

The two shapes also privilege different human intuitions. Topic-shape matches how publishers, editors, and most knowledge workers think: “I’m writing about Topic X.” Concept-shape matches how researchers and engineers think: “What does the system know about this claim?” The tension is real and not easily resolved — which is exactly why the empirical question matters.


2. Prior Art Comparison

Nine systems, each verified against a primary source. The comparison table asks five questions per system: What shape? Does the corpus grow over time or operate on a fixed set? Does it produce narrative output? Single or multi-operator? And what compounding mechanism does it specifically validate?

| System | Shape | Corpus growth over time? | Narrative output? | Operator model | Compounding mechanism validated |
|---|---|---|---|---|---|
| gbrain (Garry Tan) | Concept | Yes: every write enriches | No: structured pages + graph | Single-user | Self-wiring typed-link extraction on every write; zero LLM calls; hybrid BM25 + vector retrieval |
| Andy Matuschak Evergreen Notes | Concept | Yes: every session | No: personal reference | Single-user | Dense bidirectional links + associative ontology; link-pressure forces relational thinking |
| Wikipedia | Concept | Yes: continuous | No: encyclopedic | Multi-author | Internal-link graph as the integration layer; every new article enters pre-linked; article value proportional to backlink density |
| Asianometry workflow | Topic | No: per-video clean slate | Yes: scripted video | Single operator | Source ladder + read-until-repeat comprehension threshold; no cross-topic compounding in design |
| Karpathy LLM Wiki | Concept | Yes: every session leaves it richer | No: entity pages | Single-user | Agent maintains and cross-references on ingest; synthesis compounds rather than being re-derived per query |
| Connected Papers / Litmaps / ResearchRabbit | Graph (citation) | Depends on user corpus | No: graph visualization | Single or multi | Citation-graph traversal surfaces prior art; fixed once corpus is set; no narrative generation |
| Elicit | Topic (per query) | No: answers per query | Partial: structured synthesis per query | Single-user | Sentence-level citations against 138M papers; validates retrieval quality but not cross-session accumulation |
| NotebookLM | Topic (fixed corpus) | No: fixed at session start | Yes: Audio Overview, Q&A | Single-user | Multi-doc synthesis across a fixed corpus; validates AI-mediated cross-document reasoning; does not validate corpus growth |
| Roam / Logseq / Tana | Concept | Yes: grows with use | No: notes + graph | Single-user | Bidirectional block-level links; graph view shows link density; note value increases with network age |

A note on gbrain’s row. gbrain appears in the table as one row alongside eight others, but it is not a peer in our internal architecture — it is the upstream pattern source. RFC-ATLAS-008 names it explicitly: “gbrain | Pattern source — gbrain’s Compiled Truth + Timeline architecture formalized here.” We adopt the pattern at the architecture level (Compiled Truth + Timeline page body, medallion bronze/silver/gold tiering, self-wiring zero-LLM typed-link extraction); we do not run gbrain’s runtime. gbrain’s runtime is single-user PGLite; ours is Mnemosi — the same pattern run multi-tenant on Cloudflare Workers, D1, and Vectorize. Both must hold simultaneously: we sit on top of gbrain’s architecture, and we do not run gbrain.

What this table reveals:

The systems that compound are all concept-shaped with persistent, growing corpora. The systems that produce excellent output without compounding are either topic-shaped (Asianometry, Elicit per-query) or operate on a fixed corpus (NotebookLM). The two axes, shape and growth, are largely independent: a topic-shaped corpus can keep growing in size (Asianometry produces more videos over time) and still not compound, because adding more topic folders creates no cross-topic linkage.

Citation-graph tools (Connected Papers, Litmaps, ResearchRabbit) validate a third mechanism: the external link graph of academic citations can be traversed to discover prior art. These tools work with fixed corpora and don’t produce narrative output, but they do demonstrate that graph-traversal over a citation network surfaces genuine intellectual relationships that keyword search misses. That’s the academic-literature validation of the same compounding mechanism Matuschak and Wikipedia implement internally.

Source citations inline

gbrain: github.com/garrytan/gbrain README, verified April 2026. P@5 49.1% / R@5 97.9% on a 240-page Opus corpus. Compiled Truth + Timeline pattern. Internal canon: ~/Work/_readme/wiki/business/llm-memory-landscape-2026-04.md; S66 entanglement survey.

Matuschak: notes.andymatuschak.org/Evergreen_notes_should_be_concept-oriented and .../Evergreen_notes_should_be_densely_linked, verified April 2026. “With this approach, there’s no accumulation… Your new thoughts on the concept don’t combine with the old ones to form a stronger whole.” Luhmann (luhmann.surge.sh/communicating-with-slip-boxes, verified April 2026): “the importance of what has actually been noted is secondary” — what matters is that the slip box “provides combinatorial possibilities which were never planned, never preconceived.”

Wikipedia: en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking, verified April 2026. “Internal links bind the project together into an interconnected whole.”

Asianometry: hearthisidea.com/episodes/asianometry/, verified April 2026. “I pick the biggest textbook first… can I understand this? No? Next textbook.” “First, you move the big blocks around, then the little blocks.” On parallelized solo workflow: “I build the video and write the script at the same time, so collecting assets, making graphs, and charts all happens at the same time.”

Karpathy LLM Wiki: gist.github.com/karpathy/442a6bf555914893e9891c11519de94f, April 2026, verified directly. “the LLM doesn’t just index it — it reads it, extracts the key information, and integrates it into the existing wiki.” “the wiki is a persistent, compounding artifact.” “The wiki keeps getting richer with every source you add and every question you ask.”

Citation tools: effortlessacademic.com/litmaps-vs-researchrabbit-vs-connected-papers-the-best-literature-review-tool-in-2025/, verified April 2026. Bibliographic coupling and co-citation analysis; ResearchRabbit v2 launched October 30, 2025.

Elicit: elicit.com/solutions/search, April 2026. 138M papers, sentence-level citations; per-query, not persistent.

NotebookLM: notebooklm.google/, April 2026. Up to 50 sources per notebook; Gemini 2.x; corpus is set at session start.

Roam / Logseq / Tana: primeproductiv4.com/apps-tools/roam-research-review/, April 2026. “The more you use it, the more valuable it becomes.”

Vannevar Bush: “As We May Think,” The Atlantic, July 1945. worrydream.com/refs/Bush%20-%20As%20We%20May%20Think%20(Life%20Magazine%209-10-1945).pdf. Bush argues that the mind does not retrieve by descending a tree of classification: “The human mind does not work that way. It operates by association.” The associative trail is the conceptual precursor of the link graph.


3. The Compounding Signals

How do you tell whether a corpus is actually compounding? Three observable signals, measurable with or without AI tooling.

Signal 1: Link density growth rate. A compounding corpus shows increasing average link-degree per article over time, not flat or decreasing. In a topic-shaped corpus, each article has roughly the same number of cross-links because cross-links are not the primary mechanism. In a concept-shaped corpus, as the concept network matures, new articles can link to more pre-existing concepts, so average link-degree increases as a function of corpus age. Measure: track mean outgoing links per article every N articles. A flat or declining curve is diagnostic of topic-shape behavior regardless of the filing system used.
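The Signal 1 measurement can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the input is a list of per-article outgoing-link counts in publication order, the function names are hypothetical, and "rising" is simplified to mean that every window's mean exceeds the previous one.

```python
from statistics import mean

def link_degree_curve(outgoing_links: list[int], window: int) -> list[float]:
    """Mean outgoing links per article for each consecutive window of articles.

    outgoing_links[i] is the link count of the i-th article in publication
    order. The last window may hold fewer than `window` articles.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        mean(outgoing_links[i:i + window])
        for i in range(0, len(outgoing_links), window)
    ]

def is_rising(curve: list[float]) -> bool:
    """True when there are at least two windows and each mean exceeds the last."""
    return len(curve) >= 2 and all(b > a for a, b in zip(curve, curve[1:]))

# Illustrative counts, not measured data.
concept_like = link_degree_curve([2, 3, 4, 5, 7, 8], window=2)
topic_like = link_degree_curve([3, 3, 3, 3], window=2)
print(concept_like, is_rising(concept_like))  # [2.5, 4.5, 7.5] True
print(topic_like, is_rising(topic_like))      # [3, 3] False
```

A strict window-over-window increase is a deliberately crude test; on real data a trend fit over more windows would likely be more robust to noise.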

Signal 2: Citation backflow. Does article N cite articles 1..N-1? In a topic-shaped corpus, article N typically cites external sources; it does not cite earlier corpus entries, because those entries are organized by topic, not concept, and are not naturally surfaced during research on Topic N. In a concept-shaped corpus, article N uses shared concept pages that were already touched by articles 1..N-1, so the shared ancestry is automatically expressed. Measure: for each article, track the fraction of its citations that point to earlier corpus articles rather than to external sources. A compounding corpus has a rising internal-citation ratio over time.
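A sketch of the citation-backflow ratio, assuming each article is recorded as an identifier plus a flat list of citation targets, and that a citation counts as internal when its target is the identifier of an earlier article. The `external:` labels are placeholders, not real sources.

```python
def internal_citation_ratios(
    articles: list[tuple[str, list[str]]],
) -> list[float]:
    """For each article, in publication order, return the fraction of its
    citations that point to an earlier article in the same corpus."""
    seen: set[str] = set()
    ratios: list[float] = []
    for article_id, citations in articles:
        internal = sum(1 for target in citations if target in seen)
        ratios.append(internal / len(citations) if citations else 0.0)
        seen.add(article_id)
    return ratios

# Illustrative corpus, not measured data.
corpus = [
    ("pothos", ["external:a", "external:b"]),
    ("snake-plant", ["pothos", "external:c", "external:d", "external:e"]),
    ("dracaena", ["pothos", "snake-plant", "external:f", "external:g"]),
]
ratios = internal_citation_ratios(corpus)
print(ratios)  # [0.0, 0.25, 0.5]
```

The first article always scores zero because nothing precedes it; the signal is the trend across later articles.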

Signal 3: Cross-topic insight surfacing. The strongest compounding signal is the one you cannot measure without running the loop: does reading article N ever produce a claim that references a finding from article M on a different topic? In Jon Y’s per-video workflow, this does not happen structurally because topic M and topic N are in different folders with no machine-visible relationship. In a well-linked concept corpus, it happens naturally because the concept network creates adjacency between topics that share concept nodes. Measure: in the synthesis step, track whether synthesis.md for Topic N contains citations to sources/*.md files that originated in a different topic directory. Cross-topic citation in synthesis = compounding signal.
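The cross-topic check can be sketched as a path scan over the synthesis text. This assumes citations appear in synthesis.md as literal paths following the kb/ layout shown earlier (kb/niche/topic/sources/file.md); that citation format is an assumption, and a different format would need a different pattern.

```python
import re

# Assumed citation form: a literal path such as
# kb/houseplants/pothos/sources/rhs-care-guide.md inside synthesis.md.
SOURCE_PATH = re.compile(r"kb/([\w-]+)/([\w-]+)/sources/([\w.-]+\.md)")

def cross_topic_citations(topic: str, synthesis_text: str) -> list[str]:
    """Return cited source paths whose topic directory differs from `topic`."""
    return [
        match.group(0)
        for match in SOURCE_PATH.finditer(synthesis_text)
        if match.group(2) != topic
    ]

# Illustrative synthesis text for the snake-plant topic.
text = (
    "Low light is tolerated (kb/houseplants/pothos/sources/rhs-care-guide.md). "
    "Overwatering is the main risk "
    "(kb/houseplants/snake-plant/sources/university-extension-paper.md)."
)
hits = cross_topic_citations("snake-plant", text)
print(hits)  # ['kb/houseplants/pothos/sources/rhs-care-guide.md']
```

A non-empty result for Topic N is the structural trace of a finding from Topic M being reused.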

These three signals are observable in a calibration run. None require you to pre-commit to a shape choice; they are shape-agnostic diagnostics.


4. What the Prior Art Agrees On

Across the nine systems, four things are consistent regardless of shape:

Source quality is the floor. Asianometry reads until claims repeat; Elicit searches 138M academic papers; Wikipedia requires verifiable third-party sources. Every system that produces reliable output enforces a source quality gate. No linking architecture rescues bad sources.

Atomic units are better than monolithic ones. gbrain and Matuschak both insist on single-idea pages. Karpathy’s LLM Wiki uses entity pages. Wikipedia has one article per concept. Even Asianometry’s block-assembly method (move big blocks, then small blocks) is an implicit acknowledgement that the article is a composition of atomic claims, not a monolith. The atomic unit is the level at which links operate; linking a monolith to another monolith is low-bandwidth.

Periodic synthesis is necessary. gbrain runs enrichment cycles. Karpathy’s wiki agent integrates on ingest. Wikipedia has an editorial cycle that continuously improves article quality. Asianometry’s “read until claims repeat” is a per-article synthesis cycle, not a cross-article one. Without periodic synthesis — something that compresses raw source material into organized compiled truth — the corpus becomes a pile of citations rather than a body of knowledge.

The link is the unit of intelligence, not the page. Matuschak’s Luhmann citation makes this most explicit: in a well-linked system the content of individual pages is secondary to the structure of possibilities the links create. Wikipedia’s emergent intelligence is not in any single article; it is in the link graph. gbrain’s zero-LLM self-wiring is designed specifically so the link graph writes itself continuously rather than being a manual afterthought. In every high-compounding system surveyed, link density is explicitly engineered, not incidental.

What is genuinely in dispute is whether concept-shape is strictly better than topic-shape for all use cases. The prior art does not settle this, because the comparison is not symmetric:

For a high-volume niche-content pipeline running 30+ topics per niche, the accumulation benefits of concept-shape are large in theory. For a single-operator deep-research workflow producing 1–2 artifacts per month, topic-shape may be the right shape because the corpus never gets large enough for the link graph to deliver meaningful ROI. The prior art gives us the mechanism; it does not give us the volume threshold at which concept-shape pays.


5. What AI-Mediated Research Changes

Three things are genuinely new in 2025–2026, and they change the shape question from philosophical to empirical.

First: Zero-LLM typed-link extraction makes concept-shape cheap. In a human Zettelkasten, maintaining the link graph is a manual discipline — you have to remember to link every new note back to existing ones, and you have to do it every time, under time pressure. gbrain (v0.12+, verified April 2026) demonstrated that this can be automated: every write extracts entity references and creates typed links (works_at, invested_in, founded, advises) with zero LLM calls, using a deterministic extractor. P@5 49.1% / R@5 97.9% on a 240-page Opus corpus, verified reproducible. If the entity extraction generalizes to niche-content domains (plants, care-tips, species taxonomy), the human cost of maintaining concept-shape falls to near zero. If it does not generalize, the cost stays high.
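To make "deterministic extractor, zero LLM calls" concrete, here is a toy rule-based sketch. It is not gbrain's extractor, whose internals are not described here. The phrase patterns, the capitalized-name heuristic, and the example names are all hypothetical. Only the four link types come from the text above.

```python
import re
from typing import NamedTuple

class TypedLink(NamedTuple):
    source: str
    link_type: str
    target: str

# Crude heuristic: a name is a run of capitalized words. This is an assumption.
NAME = r"([A-Z]\w*(?: [A-Z]\w*)*)"

# Hypothetical phrase-to-type rules; a real extractor would need many more.
RULES = [
    (re.compile(rf"{NAME} works at {NAME}"), "works_at"),
    (re.compile(rf"{NAME} invested in {NAME}"), "invested_in"),
    (re.compile(rf"{NAME} founded {NAME}"), "founded"),
    (re.compile(rf"{NAME} advises {NAME}"), "advises"),
]

def extract_links(text: str) -> list[TypedLink]:
    """Apply every rule to the text; no model calls, same input gives same output."""
    links: list[TypedLink] = []
    for pattern, link_type in RULES:
        for source, target in pattern.findall(text):
            links.append(TypedLink(source, link_type, target))
    return links

links = extract_links("Alice Chen founded Example Labs. Bob Ruiz advises Example Labs.")
for link in links:
    print(link)
```

Whether rules of this kind carry over to plant-care text (species, symptoms, care conditions) is exactly the generalization question raised above; the sketch only shows why the per-write cost can be near zero when they do.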

Second: Multi-doc AI synthesis validates cross-document reasoning, but not corpus growth. NotebookLM (Gemini 2.x, up to 50 sources, cross-document Q&A) and Elicit (138M papers, sentence-level citations) both demonstrate that AI can synthesize across multiple documents in a single session. This validates the query side of concept-shape’s value proposition: given a well-linked corpus, the AI can reason across nodes. But neither validates the growth side: NotebookLM’s corpus is set at session start with no mechanism for answers to enrich the source set; Elicit’s per-query synthesis does not persist. The capability is proven; the persistence mechanism is not.

Third: Agentic research loops make the shape question testable in hours, not months. Jon Y’s workflow is a manual end-to-end pipeline — text-file topic list, script written in parallel with asset collection, edit-and-record in a single tightly-optimized block — and his synthesis is his cognitive output, not a logged artifact. An AI-mediated research loop — /research-sources, /synthesize-sources, /coverage-check, /draft-skeleton, /draft-coherence, /quality-check operating on a kb/ directory — can produce an artifact in hours and leave a machine-readable trail in the filesystem. The question “does article N cite findings from article M?” becomes checkable with grep rather than a deep reading of two separate essays. This is the architectural leverage that makes calibration runs possible.

What this means is that the shape question has moved from “which philosophy of knowledge organization is correct?” to “which shape produces higher cross-article citation density on the specific corpus we’re running?” That is an empirical question with a measurable answer.


6. The Empirical Test

For a calibration run on a niche-content pipeline, the following signals constitute a verdict on whether the corpus is compounding:

Signal A — Internal citation rate in synthesis.md: After N articles on the same niche, what fraction of claims in synthesis.md for article N+1 cite sources/*.md files from articles 1..N (cross-topic citations) vs only cite newly fetched external sources? A rising internal-citation rate across articles is compounding. A flat or zero rate means topic-shape behavior regardless of the filing system.

Signal B — Concept page reference count growth: If the corpus uses a concept layer (wiki/concepts/ or equivalent), track the mean number of incoming references per concept page as articles are added. In a compounding corpus, frequently-mentioned concepts accumulate references and become richer with each article. Measure at article 5, 10, 15, 20. A flat curve means concepts are not being shared across article boundaries.
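A sketch of the Signal B curve, assuming each article is represented by the set of concept pages it references and that only concept pages referenced at least once so far count toward the mean. Both the representation and that denominator choice are assumptions.

```python
def concept_reference_curve(
    article_concepts: list[set[str]], checkpoints: list[int]
) -> dict[int, float]:
    """Mean incoming references per concept page after the first n articles,
    for each n in checkpoints. article_concepts is in publication order."""
    curve: dict[int, float] = {}
    for n in checkpoints:
        counts: dict[str, int] = {}
        for concepts in article_concepts[:n]:
            for concept in concepts:
                counts[concept] = counts.get(concept, 0) + 1
        curve[n] = sum(counts.values()) / len(counts) if counts else 0.0
    return curve

# Illustrative data, not measured results.
articles = [
    {"low-light-tolerance", "leaf-variegation"},
    {"low-light-tolerance", "root-rot-causes"},
    {"low-light-tolerance"},
    {"root-rot-causes", "low-light-tolerance"},
]
curve = concept_reference_curve(articles, checkpoints=[2, 4])
print(curve)  # roughly {2: 1.33, 4: 2.33}
```

In a real run the checkpoints would be 5, 10, 15, and 20 as described above; the short example uses 2 and 4 only to keep the data readable.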

Signal C — Cross-niche synthesis quality: For a corpus with multiple niches (e.g., houseplants + tropical plants + succulents), does the /synthesize-sources step for niche B ever reference findings that only exist in niche A’s corpus? In RFC-NM-001’s hierarchical inheritance design (~/Work/niche-media/hq/rfcs/RFC-NM-001-content-production-pipeline.md §5.2), parent-niche synthesis is explicitly loaded for context. Measure whether child-niche synthesis actually uses parent-niche content by tracking citation provenance.

Threshold suggestion: After 10 articles on one niche, if Signal A (internal citation rate) is below 10%, the corpus is not compounding. A corpus that has run 10 articles and still draws only on newly fetched external sources is operating as topic-shape regardless of directory structure. If Signal A exceeds 30% by article 10, the corpus is compounding meaningfully.
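The threshold rule can be written down directly. In this sketch each claim is represented by the set of topic names its cited sources originate from, which is an assumed representation. The 10% and 30% cut-offs come from the suggestion above; the "inconclusive" label for the band between them is an assumption, since the text does not name that case.

```python
def signal_a_rate(claim_source_topics: list[set[str]], current_topic: str) -> float:
    """Fraction of claims citing at least one source from another topic."""
    if not claim_source_topics:
        return 0.0
    internal = sum(
        1 for topics in claim_source_topics if topics - {current_topic}
    )
    return internal / len(claim_source_topics)

def compounding_verdict(rate: float, low: float = 0.10, high: float = 0.30) -> str:
    """Apply the suggested thresholds to an internal citation rate in [0, 1]."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate < low:
        return "not compounding"
    if rate > high:
        return "compounding"
    return "inconclusive"  # assumed label; the article leaves this band open

# Illustrative claims for a snake-plant synthesis, not measured data.
claims = [
    {"snake-plant"},
    {"snake-plant", "pothos"},
    {"pothos"},
    {"snake-plant"},
    {"snake-plant"},
]
rate = signal_a_rate(claims, current_topic="snake-plant")
print(rate, compounding_verdict(rate))  # 0.4 compounding
```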

What a calibration run does not settle: Whether the quality of cross-topic references improves downstream content quality in the eyes of a reader. Signals A–C are structural measures. Reader quality is a downstream metric that requires publication, traffic, and engagement data. The calibration run gives you the structural verdict on compounding; the editorial verdict requires shipping.


7. What We Are Testing

The houseplants calibration run in ~/Work/niche-media/ is the first test of this framework. RFC-NM-001 uses a topic-shaped kb/ structure by default. The question is whether the hierarchical inheritance mechanism (§5.2: load parent-niche synthesis for context) acts as a lightweight concept-shape substitute, or whether it is genuinely topic-shape behavior with extra steps.

This article will be revised once the calibration run produces data on Signals A–C. What is not known yet:

The honest position is: the framework is ready, the test is not yet run, and the verdict is open.


The substrate question in this article — where does the knowledge live, and what shape should it take? — connects to three sibling articles:


9. Sources

Primary — verified directly:

Internal canon:

[unverified]:


Verification Log

Cited 16 URLs across 9 systems plus 4 internal canon files. Verified directly by fetch: Matuschak (2 pages), Karpathy gist, gbrain README, Wikipedia linking manual, Asianometry hearthisidea interview, Elicit, NotebookLM, citation-tool comparison, Roam Research review. Could not verify: Jon Y “7 sources minimum” (secondary source only); Karpathy exact publish date (content confirmed, date not).

