The ingestion pipeline is how raw text becomes geometry in AGISystem2. It is intentionally simple and linear so that every step can be inspected and replayed. A sentence arrives, is rewritten into the constrained grammar, is parsed into a small tree, and is then encoded into a high-dimensional vector. That vector shapes bounded diamonds in ConceptStore and is indexed for later retrieval. This chapter follows that conveyor from left to right and points to the relevant module specifications and tests.
In many applications, incoming text will be messy: natural language questions, documentation excerpts, logs, or transcripts. Before any of this can influence the conceptual space, TranslatorBridge must turn it into the small dialect described in the Syntax and Grammar chapters. Given an input paragraph, the bridge uses a pinned language model and prompt to produce one or more canonical sentences such as Dog IS_A Animal or ExportData PROHIBITED_BY GDPR. Its behaviour is deterministic for a given configuration, and it reports its model and prompt versions so that the same normalisation can be repeated later if needed.
If the bridge cannot find a safe rewrite into subject–relation–object form, it should decline the input rather than guess. This rule applies especially in safety-critical contexts, where a mis-normalised sentence could warp the conceptual space in hard-to-detect ways. Internal tests contain examples of both successful and intentionally failing normalisations, ensuring that surprising phrases are either handled explicitly or rejected.
Once normalised sentences are available, the Parser builds a shallow tree for each: a root node for the subject, a relation node, and one or more child nodes for objects or properties. The recursion horizon limits how deep this tree may become; tokens beyond that depth are either ignored or treated neutrally to prevent runaway complexity. The encoder then walks this tree. For every edge from parent to child, it applies a relation-specific permutation to the child’s vector and adds it to the parent using saturated int8 arithmetic implemented by MathEngine.
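The encoder walk can be sketched as follows. All names (`token_vec`, `relation_perm`, `encode`) and the CRC-based seeding are assumptions of this sketch; the real MathEngine draws its seeds from the profile, and the real space is far higher-dimensional than the 16 dimensions used here:

```python
import zlib
import numpy as np

DIM = 16  # illustrative only; the real space is high-dimensional

def token_vec(name: str) -> np.ndarray:
    """Deterministic int8 vector per token (CRC-seeded for this sketch)."""
    r = np.random.default_rng(zlib.crc32(name.encode()))
    return r.integers(-128, 128, DIM, dtype=np.int8)

def relation_perm(relation: str) -> np.ndarray:
    """Deterministic, relation-specific permutation of the dimensions."""
    r = np.random.default_rng(zlib.crc32(relation.encode()))
    return r.permutation(DIM)

def sat_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Saturated int8 addition, standing in for MathEngine's arithmetic."""
    return np.clip(a.astype(np.int16) + b.astype(np.int16), -128, 127).astype(np.int8)

def encode(subject: str, relation: str, obj: str) -> np.ndarray:
    child = token_vec(obj)[relation_perm(relation)]  # permute child by the edge's relation
    return sat_add(token_vec(subject), child)        # saturated add into the parent

# Same sentence, same configuration => identical vector.
assert (encode("Water", "BOILS_AT", "Celsius100")
        == encode("Water", "BOILS_AT", "Celsius100")).all()
```

Because every seed is fixed, the sketch reproduces the determinism property the chapter relies on: identical input yields a bitwise-identical vector.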
A normalised sentence becomes a small subject–relation–object tree that the encoder walks using relation-specific permutations and saturated addition. The diagram abstracts this into three nodes feeding a single encoded vector, which is the unit that later shapes diamonds and participates in retrieval.
This process yields a single high-dimensional vector per sentence that encodes both the identities of the tokens and the roles they played. Because the same relations always use the same permutations and all arithmetic is deterministic, two identical sentences ingested under the same configuration will always produce identical vectors. Other chapters describe the underlying parser and encoder behaviour conceptually, and internal tests verify that permutations and additions behave as expected.
At this stage, different parts of the vector space play different roles: ontology dimensions (0–255) carry factual structure (for example, temperature for facts like Water BOILS_AT Celsius100), axiology dimensions (256–383) are reserved for values and norms, and the empirical tail (384+) is available for learned or domain-specific features. The exact catalogue of axes is described in the Dimensions Overview, Ontology Dimensions, and Axiology Dimensions pages.
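The partition of the index space described above can be captured in a few constants. The names below are for illustration only; the actual axis catalogue lives in the Dimensions pages:

```python
# Illustrative partition of the vector index space, matching the ranges above.
ONTOLOGY = range(0, 256)     # factual structure (e.g. a Temperature axis)
AXIOLOGY = range(256, 384)   # reserved for values and norms
EMPIRICAL_START = 384        # learned / domain-specific tail begins here

def region(dim: int) -> str:
    """Name the role a given dimension index plays."""
    if dim in ONTOLOGY:
        return "ontology"
    if dim in AXIOLOGY:
        return "axiology"
    return "empirical"

print(region(42), region(300), region(512))
```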
Each encoded sentence is both a new fact and a clue about the shape of the concept it refers to. The ClusterManager examines the new vector relative to existing diamonds for the target concept. If the point falls comfortably inside one of them, the corresponding diamond may be tightened or slightly adjusted. If it lies outside all current diamonds but close enough to be considered a plausible extension, a diamond may be widened. If it sits far from existing regions, ClusterManager may decide to create a new diamond, representing a distinct sense or context.
This clustering logic turns a stream of discrete observations into evolving, continuous regions in conceptual space. It supports both conservative updates, where concepts change slowly as more evidence accumulates, and more dramatic restructurings, such as splitting a single overloaded concept into several more specific ones. Internal tests demonstrate how these decisions are made and how they preserve determinism.
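The three-way decision (tighten, widen, or create) can be sketched with diamonds modelled as L1 balls. The function name `decide`, the `(centre, radius)` representation, and the `slack` threshold are assumptions of this sketch; the real ClusterManager also tracks min/max bounds and relevance masks:

```python
import numpy as np

def decide(point, diamonds, slack=1.25):
    """Classify a new point against existing diamonds (L1 balls).

    Returns ("tighten", i), ("widen", i), or ("new", None).
    """
    best_i, best_ratio = None, float("inf")
    for i, (centre, radius) in enumerate(diamonds):
        ratio = np.abs(point - centre).sum() / radius  # L1 distance vs radius
        if ratio < best_ratio:
            best_i, best_ratio = i, ratio
    if best_ratio <= 1.0:
        return "tighten", best_i   # comfortably inside: adjust/tighten
    if best_ratio <= slack:
        return "widen", best_i     # plausible extension: widen
    return "new", None             # far from everything: new sense/context

d = [(np.zeros(4), 4.0)]
print(decide(np.array([1, 1, 1, 0.5]), d))   # inside the diamond
print(decide(np.array([2, 2, 0.8, 0.0]), d)) # just outside, within slack
print(decide(np.array([10, 10, 0, 0.0]), d)) # far away
```

Because the decision is a pure function of the point, the diamonds, and a fixed threshold, replaying the same ingestion stream replays the same sequence of adjustments.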
After clustering decides how to adjust concepts, ConceptStore persists the resulting diamonds using Storage modules that write compact binary representations to disk or keep them in memory, depending on the active profile. In test profiles this may mean writing to a temporary directory; in production it means persisting under a chosen root with optional custom backends. Either way, the goal is that reloading a store reproduces the same conceptual space, down to min/max bounds, centres, radii, and relevance masks.
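The round-trip property can be demonstrated with a toy binary record. The layout below (a length, a radius, then the centre bytes) is invented for this sketch; the actual Storage format additionally carries min/max bounds and relevance masks:

```python
import io
import struct
import numpy as np

def dump(buf, centre: np.ndarray, radius: float):
    """Write one diamond record: element count, radius, then int8 centre bytes."""
    buf.write(struct.pack("<If", len(centre), radius))
    buf.write(centre.astype("<i1").tobytes())

def load(buf):
    """Read one diamond record back; must reproduce what dump() wrote."""
    n, radius = struct.unpack("<If", buf.read(8))
    centre = np.frombuffer(buf.read(n), dtype="<i1")
    return centre, radius

buf = io.BytesIO()
c = np.array([3, -7, 12, 0], dtype=np.int8)
dump(buf, c, 4.5)
buf.seek(0)
c2, r2 = load(buf)
assert (c2 == c).all() and r2 == 4.5  # reload reproduces the same region
```

Fixed endianness (`<`) and explicit dtypes are what make such a format reproducible across machines, which is the property the chapter demands of Storage.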
At the same time, Retriever updates its index so that subsequent queries can find relevant concepts quickly. It feeds the new vectors into the locality-sensitive hashing scheme configured for the current profile, computing hash values, band keys, and bucket assignments. Because hashing is seed-based and index parameters are fixed per profile, repeated ingestion of the same data set in the same configuration yields identical index structures. Internal tests check that persisted and reloaded stores lead to the same retrieval behaviour.
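A seeded sign-based LSH with banding, as described above, can be sketched in a few lines. The constants `SEED`, `PLANES`, and `BANDS` stand in for whatever the profile fixes; the real Retriever's hashing scheme and parameters live in its own specification:

```python
import numpy as np

SEED, PLANES, BANDS, DIM = 42, 16, 4, 16  # all fixed per profile in this sketch

rng = np.random.default_rng(SEED)
hyperplanes = rng.standard_normal((PLANES, DIM))

def band_keys(vec):
    """Sign-based LSH: hash bits, grouped into band keys used as bucket ids."""
    bits = (hyperplanes @ vec > 0).astype(int)
    per_band = PLANES // BANDS
    return [tuple(bits[i * per_band:(i + 1) * per_band]) for i in range(BANDS)]

v = np.arange(DIM, dtype=float)
# Seeded hyperplanes => the same vector always lands in the same buckets.
assert band_keys(v) == band_keys(v.copy())
```

Because the hyperplanes are drawn once from a fixed seed, ingesting the same data set twice builds byte-for-byte identical bucket assignments, which is what makes the index replayable.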
Consider the canonical fact "Water BOILS_AT Celsius100", expressed in Sys2DSL as:
@f1 ASSERT Water BOILS_AT Celsius100
@f2 ASSERT Celsius100 IS_A temperature
TranslatorBridge (if present) either emits this form directly or normalises a more natural sentence into it. Parser and encoder then compute a vector where the value concept Celsius100 is a first-class concept with its own position in conceptual space, connected to Water via the BOILS_AT relation permutation. This pushes signal onto the Temperature ontology axis in the dimension catalogue. ClusterManager notices that this point strengthens the existing "Water" concept's association with a particular temperature range and adjusts the corresponding diamond accordingly. ConceptStore writes the new diamond to disk, and Retriever records the new vector in its index. A subsequent question such as @q ASK "Water BOILS_AT Celsius100?" will follow the query path described in the Reasoning and API chapters and, if configurations align, land squarely inside the updated region.
Throughout this process, provenance links every change back to the original text, the normalisation steps, the profile in use, and the seeds that shaped hashing and permutations. The Ingestion story therefore fits into the larger explainability and bias-control picture: data does not simply disappear into weights, but remains traceable as points and regions in a well-defined conceptual space.
For more background on how this geometric ingestion story fits into the theory, see the Conceptual Spaces, Ontology, Axiology, and Symbol Grounding wiki pages.