Knowledge an Agent Can Read

TLCTC, rendered into the Open Knowledge Format

Bernhard Kreinz

A taxonomy is a promise that two people — or two systems — will name the same thing the same way. In 2026, more and more of those "people" are agents. So the honest question for any framework is no longer just "can a CISO read this?" It is: can the machine your CISO is about to delegate to read it too?

TLCTC, the Top Level Cyber Threat Clusters, was built to fix a language problem. It classifies threats by why a compromise happens (the generic vulnerability exploited) rather than by what happens afterwards ("ransomware," "data breach," "DDoS"). Ten clusters, ten axioms, a handful of classification rules, and an attack-path notation that lets you write an incident down as a causal sequence. The whole point is to be a shared language: one that governance, security operations, and secure development can all speak without translation loss.

But a shared language has a delivery problem. Ours lived where human authors put it: a citable core paper, an extended white paper, JSON schemas, a glossary, mapping files, a clutch of single-file HTML tools. Excellent for a person willing to read. Considerably less excellent for an LLM agent asked, at 02:00, to deconstruct an alert into clusters and tell the on-call which control objective should have caught it. The knowledge was all there. It just wasn't packaged for the reader who is increasingly doing the reading.

So we packaged it. TLCTC now ships as an Open Knowledge Format bundle — and this is the story of why that small, unglamorous act of rendering matters more than it looks.

What the Open Knowledge Format actually is

The Open Knowledge Format (OKF) is a small, deliberately minimal specification from Google's knowledge-catalog project. Its premise is almost aggressively unfashionable: represent knowledge as a tree of plain markdown files with YAML frontmatter. No database. No proprietary index. No embedding store you have to stand up before anyone can read a sentence. A directory of text files that both a human and a language model can open and understand.

The whole spec fits in your head:

  • Every document carries a YAML frontmatter block. The one required field is type — a short string that says what kind of concept this is. title, description, resource, tags, and timestamp are recommended; producers may add their own keys, and consumers must preserve the ones they don't recognise.
  • A couple of reserved files do navigation: index.md for browsing, log.md for a changelog. These carry no frontmatter.
  • Concepts link to each other with ordinary, bundle-relative markdown links — /clusters/cluster-7.md. The meaning of a link lives in the prose around it, not in some typed edge schema.

That is essentially the entire format. Its genius is in what it refuses to do. By staying as flat text, an OKF bundle is consumable by anything that can read a file — a RAG pipeline, a coding agent, a human with grep — without first adopting a vendor's worldview. It is the lowest common denominator, chosen on purpose.
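
To make the one hard requirement concrete, here is a minimal sketch of how a consumer might check a document for a non-empty type. It is an illustration, not part of the OKF tooling: the function names are hypothetical, and the line-by-line scan stands in for a real YAML parser, which a production consumer should use instead.

```javascript
// Hypothetical helper names; a flat "key: value" scan is an assumption made
// for brevity. Real frontmatter can hold nested YAML and needs a YAML parser.
function splitFrontmatter(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0] !== '---') return { frontmatter: null, body: text };
  const end = lines.indexOf('---', 1);
  if (end === -1) return { frontmatter: null, body: text };
  return {
    frontmatter: lines.slice(1, end),
    body: lines.slice(end + 1).join('\n'),
  };
}

// Return the value of the `type` key with surrounding quotes removed, or ''.
function readType(frontmatterLines) {
  for (const line of frontmatterLines) {
    const match = /^type:\s*(.*)$/.exec(line);
    if (match) return match[1].trim().replace(/^["']|["']$/g, '');
  }
  return '';
}

function hasRequiredType(text) {
  const { frontmatter } = splitFrontmatter(text);
  return frontmatter !== null && readType(frontmatter).length > 0;
}

const sample = [
  '---',
  'type: "cluster"',
  'title: "#7 Malware"',
  '---',
  '# #7 Malware',
].join('\n');

console.log(hasRequiredType(sample));       // true
console.log(hasRequiredType('# Index\n'));  // false: no frontmatter block at all
```

The second call returns false for a frontmatter-free file, which is why a checker has to treat the reserved index.md and log.md separately from ordinary concept documents.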

A format that any reader can open is worth more than a clever one that only your reader can.

Why a cause-oriented taxonomy wants to be agent-readable

There is a particular reason TLCTC, of all things, belongs in a format like this. The framework's core claim is that it bridges three audiences with one vocabulary: governance reasoning about risk appetite, security operations triaging alerts, and secure development deciding what to build. Three audiences who, today, increasingly hand work to agents — a SOC copilot, a code-review bot, a control-mapping assistant.

If the shared language only exists as prose those agents can't reliably parse, the bridge is theoretical. Every agent re-derives its own ad-hoc threat ontology from whatever it scraped, and you are back to semantic diffusion — the exact disease TLCTC was built to cure, now reintroduced one inference call at a time. Rendering the taxonomy into a bundle an agent can load is not a side project. It is the framework keeping its own promise in the medium where the promise now has to hold.

[Figure: Canonical sources (JSON schemas, white paper, tools / matrix) pass through build-okf.js, a deterministic render, into the okf/ bundle of 412 markdown docs, which is consumed by agents and humans. The single source of truth stays upstream; the bundle is a view.]

A rendered view, not a fork

The single most important design decision is also the most boring: the bundle is generated, never hand-maintained. A small Node script, build-okf.js, reads the canonical sources — the Layer-1 framework JSON, the white paper, the Layer-2 registry, the attack-path files, the mappings, the control-matrix tool — and renders them into markdown. The JSON and the papers remain the single source of truth. The bundle is a view.

This matters because the alternative — a second, lovingly hand-written copy of the taxonomy "for the agents" — is a slow-motion disaster. The day after you write it, it starts drifting from the source. Six months later you have two TLCTCs that disagree in small, expensive ways, and no one can tell you which is authoritative. A generated bundle cannot drift: rebuild it and any divergence is overwritten. We even wired it so the rebuild is deterministic — run the generator and git shows no diff unless a real source changed.
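
What makes a rebuild diff-free is mostly discipline about ordering. The sketch below is not build-okf.js; it is a hypothetical renderer, with assumed names (KEY_ORDER, renderFrontmatter, renderBundle) and an assumed file name (axiom-1.md), that shows the two habits that matter: emit frontmatter keys in a fixed order, and emit documents sorted by path rather than in whatever order the input arrived.

```javascript
// Hypothetical renderer illustrating deterministic output only.
const KEY_ORDER = ['type', 'title', 'description', 'resource', 'tags'];

function renderFrontmatter(fields) {
  // Known keys first in a fixed order, then any producer-specific keys sorted.
  const known = KEY_ORDER.filter((key) => key in fields);
  const extra = Object.keys(fields)
    .filter((key) => !KEY_ORDER.includes(key))
    .sort();
  // JSON.stringify gives quoted strings and flow-style arrays.
  const lines = [...known, ...extra].map(
    (key) => `${key}: ${JSON.stringify(fields[key])}`
  );
  return ['---', ...lines, '---'].join('\n');
}

function renderBundle(docs) {
  // Sort a copy by path so output never depends on input order.
  return [...docs]
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
    .map((doc) => [
      doc.path,
      `${renderFrontmatter(doc.fields)}\n# ${doc.fields.title}\n`,
    ]);
}

// Same content, different document order and different key order.
const docsA = [
  { path: '/clusters/cluster-7.md', fields: { type: 'cluster', title: '#7 Malware', strategic_id: '#7' } },
  { path: '/axioms/axiom-1.md', fields: { type: 'axiom', title: 'Axiom 1' } },
];
const docsB = [
  { path: '/axioms/axiom-1.md', fields: { title: 'Axiom 1', type: 'axiom' } },
  { path: '/clusters/cluster-7.md', fields: { strategic_id: '#7', title: '#7 Malware', type: 'cluster' } },
];

const sameBytes =
  JSON.stringify(renderBundle(docsA)) === JSON.stringify(renderBundle(docsB));
console.log(sameBytes); // true
```

Anything that varies between runs, such as a build timestamp written into every file, would break this property, so a generator with this goal has to take such values from the source rather than from the clock.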

A companion script, validate-okf.js, enforces OKF conformance: every non-reserved document has a non-empty type, the reserved files stay frontmatter-free, and intra-bundle links are checked. The two scripts are stitched into one command:

npm run validate
#  → validate-framework   (Layer-1 JSON still valid against its schema)
#  → build-okf            (regenerate the okf/ bundle)
#  → validate-okf         (assert OKF SPEC conformance)
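
For readers who want to see the shape of those three checks, here is a rough sketch over an in-memory bundle. It is not validate-okf.js. The function names, the sample documents, and the glossary file name are hypothetical, and it assumes that "checked" means each bundle-relative link must resolve to a document in the bundle.

```javascript
// Hypothetical conformance checker over a Map of bundle-relative path -> text.
const RESERVED = new Set(['index.md', 'log.md']);

function frontmatterOf(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0] !== '---') return null;
  const end = lines.indexOf('---', 1);
  return end === -1 ? null : lines.slice(1, end);
}

function typeOf(frontmatter) {
  const line = frontmatter.find((l) => /^type:/.test(l));
  return line ? line.slice(5).trim().replace(/^["']|["']$/g, '') : '';
}

// Collect targets of markdown links that start with "/" and end in ".md".
function bundleLinks(text) {
  const pattern = /\]\((\/[^)\s#]+\.md)(?:#[^)\s]*)?\)/g;
  return [...text.matchAll(pattern)].map((m) => m[1]);
}

function validateBundle(bundle) {
  const errors = [];
  for (const [path, text] of bundle) {
    const frontmatter = frontmatterOf(text);
    const name = path.split('/').pop();
    if (RESERVED.has(name)) {
      if (frontmatter !== null) errors.push(`${path}: reserved file has frontmatter`);
    } else if (frontmatter === null || typeOf(frontmatter) === '') {
      errors.push(`${path}: missing or empty type`);
    }
    for (const target of bundleLinks(text)) {
      if (!bundle.has(target)) errors.push(`${path}: broken link ${target}`);
    }
  }
  return errors;
}

// Tiny sample bundle with two deliberate defects.
const bundle = new Map([
  ['/index.md', '# Index\n- [#7 Malware](/clusters/cluster-7.md)\n'],
  ['/clusters/cluster-7.md', '---\ntype: "cluster"\n---\nSee [controls](/controls/cluster-7.md).\n'],
  ['/glossary/fec.md', '---\ntitle: "FEC"\n---\nForeign Executable Content.\n'],
]);

const bundleErrors = validateBundle(bundle);
console.log(bundleErrors);
```

On this sample the checker reports one broken link (the controls document is missing from the sample) and one document without a type.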

What's in the bundle

The render produces 412 concept documents across nine sections. Each is a small, single-purpose file an agent can pull on demand and follow by its links:

  • clusters/ — the ten threat clusters, rendered from the white paper's full six-field definitions (Definition, Generic Vulnerability, Attacker's View, Developer's View, Boundary Tests, Topology), each linking out to its governing axioms, rules, controls, and mappings.
  • axioms/ · rules/ — the ten axioms and the classification rules (R-EXEC, R-CRED, R-ROLE, the unresolved-step rules), one per file.
  • spheres/ · contexts/ — the Layer-2 registry: responsibility spheres and boundary contexts used by the notation.
  • glossary/ — 247 term documents.
  • attack-paths/ — 51 real incidents, each rendered with its reconstructed notation string, a step-by-step table, and prose notes.
  • mappings/ — MITRE ATT&CK, MITRE CWE, and SigmaHQ, grouped by cluster so an agent reading #7 Malware can follow one link to every technique, weakness, and detection rule that lands there.
  • controls/ — the governance layer, which deserves its own section below.

A single document looks like this — structure for the machine in the frontmatter, meaning for both readers in the body:

---
type: "cluster"
title: "#7 Malware"
description: "Execution of Foreign Executable Content through the
  environment's designed execution capabilities."
resource: "tlctc:cluster:#7"
tags: ["taxonomy", "cluster", "internal"]
strategic_id: "#7"
generic_vulnerability: "The environment's intended capability to
  execute potentially untrusted executable content."
topology: "internal"
---
# #7 Malware

**Definition:** Execution of Foreign Executable Content (FEC) ...
**Boundary Tests (normative):** If FEC executes → #7 (per R-EXEC) ...

# Relationships
- Control objectives: /controls/cluster-7.md
- Mapped techniques: /mappings/attack/cluster-7.md

The controls section is the point

If the rest of the bundle serves operations and development, the controls/ section is what finally serves governance — the audience a clusters-and-mappings export quietly under-serves. It renders the full TLCTC × NIST CSF matrix: ten clusters across six functions (GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER), giving sixty control-objective cells, each tagged to its bow-tie side — preventive on the cause side, mitigating on the consequence side.

And every one of those sixty cells is populated, not stubbed. We rendered them from the project's Control Matrix tool, which maps each cluster-and-function cell to concrete ISO 27001:2022 Annex A controls — local and umbrella, organizational and technical. A SOC analyst classifying an incident as #9 → #4 → #1 can now walk each cluster's row and read which Annex A control should have prevented, detected, or contained each step. The cause-side taxonomy and the control standard meet in one table.
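
As a small illustration of that walk, the sketch below turns a simple attack path into the list of control documents an agent would open. The helper names are hypothetical. It assumes the path contains only cluster references of the form #N joined by arrows, and that control documents follow the /controls/cluster-N.md pattern seen in the sample document; the full TLCTC notation is richer than this and would need a proper parser.

```javascript
// Hypothetical helpers; handles only plain "#N → #N" sequences.
function clustersInPath(notation) {
  const ids = [...notation.matchAll(/#(\d+)/g)].map((m) => Number(m[1]));
  for (const id of ids) {
    // TLCTC defines ten clusters, so anything outside 1..10 is rejected.
    if (id < 1 || id > 10) throw new RangeError(`unknown cluster #${id}`);
  }
  return ids;
}

function controlDocsFor(notation) {
  return clustersInPath(notation).map((id) => `/controls/cluster-${id}.md`);
}

console.log(controlDocsFor('#9 → #4 → #1'));
// [ '/controls/cluster-9.md', '/controls/cluster-4.md', '/controls/cluster-1.md' ]
```

The order of the result follows the order of the attack path, so the agent reads the controls for each step in the sequence the incident actually unfolded.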

The section also carries the framework's control-effectiveness model — the part that turns a control inventory into a measurement. Three layers per control:

  • CDE_max — the design-effectiveness ceiling (never 1.0; no control is perfect).
  • CDE_fitness — whether the control can structurally act inside the attack's velocity (Δt) window at all.
  • COE — measured operational performance from typed metrics.

ECR  =  COE × CDE_max × fitness_factor          # effective control rating
DCS  =  MTTD / Δt                              # detection coverage score
#      DCS bands: < 0.5 effective · 0.8–1.0 marginal · > 2.0 structurally failed

The Detection Coverage Score is the sharp edge: the same four-hour mean-time-to-detect is excellent against a 14-day APT transition and useless against a 10-minute ransomware one. An agent reading the cell now has the formula and the velocity context to say which. Alongside it sit the KRI / KCI / KPI indicator definitions, so the matrix is measurable rather than merely admirable.
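
The four-hour example can be worked through directly. The sketch below implements the two formulas as stated and applies only the three bands the framework names. The function names are hypothetical, all durations are in minutes, and a score that falls between the named bands is reported as such rather than guessed at. The COE, CDE_max, and fitness values in the last line are invented purely for illustration.

```javascript
// DCS = MTTD / Δt, with both durations in the same unit (minutes here).
function detectionCoverageScore(mttdMinutes, deltaTMinutes) {
  if (!(deltaTMinutes > 0)) throw new RangeError('Δt must be positive');
  return mttdMinutes / deltaTMinutes;
}

// Only the three named bands are classified; the gaps are left open.
function dcsBand(score) {
  if (score < 0.5) return 'effective';
  if (score >= 0.8 && score <= 1.0) return 'marginal';
  if (score > 2.0) return 'structurally failed';
  return 'outside the published bands';
}

// ECR = COE × CDE_max × fitness_factor. CDE_max is never 1.0 by definition.
function effectiveControlRating(coe, cdeMax, fitnessFactor) {
  if (cdeMax >= 1) throw new RangeError('CDE_max must be below 1.0');
  return coe * cdeMax * fitnessFactor;
}

const MTTD = 4 * 60;               // four-hour mean time to detect
const APT_WINDOW = 14 * 24 * 60;   // 14-day transition
const RANSOMWARE_WINDOW = 10;      // 10-minute transition

console.log(dcsBand(detectionCoverageScore(MTTD, APT_WINDOW)));        // effective
console.log(dcsBand(detectionCoverageScore(MTTD, RANSOMWARE_WINDOW))); // structurally failed

// Illustrative inputs only, not values from the framework.
console.log(effectiveControlRating(0.5, 0.8, 1));
```

The same MTTD gives a score of roughly 0.012 against the 14-day window and 24 against the 10-minute one, which is the whole argument in two numbers.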

Honesty about provenance

One discipline runs through the whole bundle, and it is worth stating plainly because it is the difference between a knowledge base and a confident liar. Generated content is labelled as generated.

The ISO 27001 Annex A placements are starter guidance — AI-assisted, useful as a scaffold, but not a certified control set — and every control document says so. The CWE mapping is marked experimental for the same reason. The taxonomy itself — clusters, axioms, rules — is frozen and authored; the operational scaffolding around it is honest about being scaffolding. An agent that ingests this bundle inherits not just the knowledge but the confidence calibration that should travel with it. In a world of fluent, unsourced machine output, a provenance note is a safety control.

How to use it

The bundle lives in the repository at okf/. There is nothing to install and nothing to call. Point a RAG pipeline or an agent at the directory; let it read index.md and follow links from there exactly as a person would. Because it is flat markdown with stable, bundle-relative links, progressive disclosure is free: the agent loads the cluster it cares about, follows one link to the controls, another to the mappings, and stops when it has what it needs — no whole-corpus embedding required to answer a narrow question.
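
Here is a rough sketch of what that progressive disclosure looks like in code. It uses an in-memory stand-in for the okf/ directory so that it runs anywhere; the document contents, the followChain helper, and the prefix-matching strategy are all assumptions for illustration, and a real agent would choose links by reading the prose around them.

```javascript
// In-memory stand-in for the bundle; contents are invented for the example.
const docs = new Map([
  ['/index.md', '- [#1](/clusters/cluster-1.md)\n- [#7 Malware](/clusters/cluster-7.md)\n'],
  ['/clusters/cluster-1.md', 'Cluster 1 body.\n'],
  ['/clusters/cluster-7.md', 'Controls: [objectives](/controls/cluster-7.md)\nMapped: [ATT&CK](/mappings/attack/cluster-7.md)\n'],
  ['/controls/cluster-7.md', 'Control objectives for #7.\n'],
  ['/mappings/attack/cluster-7.md', 'Mapped techniques for #7.\n'],
]);

// Count reads so the cost of answering a narrow question is visible.
let reads = 0;
const read = (path) => {
  reads += 1;
  return docs.get(path) ?? '';
};

function linksIn(text) {
  const pattern = /\]\((\/[^)\s#]+\.md)(?:#[^)\s]*)?\)/g;
  return [...text.matchAll(pattern)].map((m) => m[1]);
}

// At each step, read the current document and follow the first link whose
// path starts with the next prefix. Stops early if no link matches.
function followChain(readDoc, start, prefixes) {
  const trail = [start];
  let current = start;
  for (const prefix of prefixes) {
    const next = linksIn(readDoc(current)).find((p) => p.startsWith(prefix));
    if (!next) break;
    trail.push(next);
    current = next;
  }
  return { trail, answer: readDoc(current) };
}

const { trail, answer } = followChain(read, '/index.md', [
  '/clusters/cluster-7',
  '/controls/',
]);

console.log(trail);
console.log(answer.trim());
console.log(`${reads} of ${docs.size} documents read`);
```

Three documents are opened to reach the control objectives for #7, and the other two are never touched, which is the practical meaning of not needing a whole-corpus embedding for a narrow question.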

To regenerate after any change upstream, run npm run validate. The bundle rebuilds from source, validates, and — if nothing real changed — leaves your working tree spotless.

The small act that matters

Rendering a taxonomy into markdown is not a research breakthrough. It is plumbing. But plumbing is what decides whether the water actually reaches the tap. TLCTC spent its first life arguing that cybersecurity needs a stable, cause-oriented vocabulary so that humans stop talking past each other. The bundle extends that argument exactly one step into the present: the machines we are handing the work to need to speak it too — and they will only speak it cleanly if we hand it to them in a form they can read, generated from one source of truth, honest about what is authored and what is assistance.

You cannot manage what you cannot consistently name. Increasingly, the thing doing the naming is an agent. This is how we make sure it names things the same way we do.

Further reading