From fa87dbb47336fb89a3ddbd73656729244f7dd188 Mon Sep 17 00:00:00 2001
From: vvs
Date: Sat, 9 May 2026 19:24:43 +0100
Subject: [PATCH] Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under
 docs/ and link from README

---
 README.md                          |   9 +
 docs/SEM_Mathematical_Apparatus.md | 270 ++++++++++++++++++++++++++++
 docs/SEM_Overview.md               | 271 +++++++++++++++++++++++++++++
 3 files changed, 550 insertions(+)
 create mode 100644 docs/SEM_Mathematical_Apparatus.md
 create mode 100644 docs/SEM_Overview.md

diff --git a/README.md b/README.md
index abeab93..bdeb839 100644
--- a/README.md
+++ b/README.md
@@ -147,6 +147,15 @@ internally cast to contiguous `float64`. Outputs are numpy arrays.
 See the wrapper docstrings for exact semantics of each function.
 
+## Documentation
+
+- [`docs/SEM_Overview.md`](./docs/SEM_Overview.md) — non-internal
+  introduction to SEM (Similarity Energy Model), what it does, and
+  how the `sem_cython12` library fits in.
+- [`docs/SEM_Mathematical_Apparatus.md`](./docs/SEM_Mathematical_Apparatus.md)
+  — capabilities-level description of the operators and engines
+  exposed by the library.
+
 ## Demos
 
 Three runnable demos live in [`demos/`](./demos/):
diff --git a/docs/SEM_Mathematical_Apparatus.md b/docs/SEM_Mathematical_Apparatus.md
new file mode 100644
index 0000000..2d79220
--- /dev/null
+++ b/docs/SEM_Mathematical_Apparatus.md
@@ -0,0 +1,270 @@
+# SEM — Mathematical Apparatus (Capability Catalog)
+
+*A non-internal catalog of the operators SEM offers, what each is for,
+and which entry points of the `sem_cython12` library back them.*
+
+This document describes WHAT the apparatus does and WHERE to use it.
+It does not describe HOW any operator works internally — algorithms,
+formulas, lemmas and proofs are intentionally not reproduced here.
+
+---
+
+## Conventions
+
+- "Item" / "world" / "observation": one row of input data.
+  Items live in some payload space (real numbers, vectors, matrices,
+  sampled functions, sampled manifolds, distributions, complex
+  amplitudes, time-series windows, recursive concept trees) — the
+  apparatus treats them uniformly via a small set of structural operators.
+- "Concept": a subset of items that share structural meaning. The
+  apparatus can either be told the concepts (labelled mode) or
+  discover them from data (unsupervised mode).
+- "Witness": an item whose structural position carries information
+  beyond merely belonging to one concept.
+- "Verdict": the system's qualified output for a new observation -
+  one of `confident`, `gap`, `incoherent` (see §4.6).
+
+All of the apparatus is parameter-free and threshold-free: there are
+no fitting parameters, no numeric cut-offs, no fidelity knobs.
+
+---
+
+## 1. Structural similarity primitives
+
+These are the lowest-level building blocks. Each is exposed directly
+in `sem_cython12.wrapper`.
+
+### 1.1 Pairwise similarity
+
+| | |
+|---|---|
+| Purpose | Score how close a query item is to the most similar member of a reference set. |
+| Output | A score in `[0, 1]` per query (1 = at the reference set, 0 = effectively far). |
+| Applications | Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation. |
+| Cython entry point | `batch_max_similarity(X_query, X_members, lam)` |
+
+### 1.2 Multi-class similarity matrix
+
+| | |
+|---|---|
+| Purpose | The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix. |
+| Applications | Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters. |
+| Cython entry point | `concept_support_matrix(X_query, member_mats, lam)` |
+
+### 1.3 Pairwise distance matrix
+
+| | |
+|---|---|
+| Purpose | Symmetric `(N, N)` distance matrix between rows of `X`. |
+| Applications | Graph construction, clustering, scale estimation, downstream filtering and ranking. |
+| Cython entry point | `pairwise_distances(X)` |
+
+### 1.4 Nearest-neighbour distance vector
+
+| | |
+|---|---|
+| Purpose | For each row, the minimum positive distance to any other row. Rows with no positive-distance neighbour receive `inf`. |
+| Applications | Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification. |
+| Cython entry point | `nn_distances(X)` |
+
+---
+
+## 2. Multi-criterion filtering primitives
+
+Given a real-valued matrix `S` of shape `(N, k)` (rows are items,
+columns are independent criteria — each in maximisation orientation),
+these primitives identify structurally informative subsets of rows.
+
+### 2.1 Best-tradeoff filter
+
+| | |
+|---|---|
+| Purpose | Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion). |
+| Applications | Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis. |
+| Cython entry point | `pareto_core_mask(S)` |
+
+### 2.2 One-sided peak flagging
+
+| | |
+|---|---|
+| Purpose | Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns - i.e. items that "peak" on a single criterion alone. |
+| Applications | Removing items that are only locally informative; finding cross-criterion contributors; bridge identification. |
+| Cython entry point | `one_sided_mask(S)` |
+
+### 2.3 Non-redundant witness identification
+
+| | |
+|---|---|
+| Purpose | The subset of rows that survive both 2.1 and 2.2 — items that contribute meaningfully across multiple criteria, not just on one. |
+| Applications | Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis. |
+| Cython entry point | `non_redundant_witnesses(S)` |
+
+---
+
+## 3. Incremental aggregation primitive
+
+### 3.1 Fused centroid + radius update
+
+| | |
+|---|---|
+| Purpose | One-pass bulk update for an incremental aggregation step. Given `F` reference items - each summarised by a centre vector and a radius (representing the dispersion of `cur_arity` underlying points) - and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item. |
+| Applications | Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines. |
+| Cython entry point | `extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)` |
+
+---
+
+## 4. Higher-level apparatus
+
+Built on the primitives in §1–§3. These are the operators that
+distinguish SEM as a reasoning system rather than a computation
+library. Their internal construction is not reproduced here; the
+"Cython entry points used" column lists the public primitives the
+operator composes.
+
+### 4.1 Intrinsic scale
+
+| | |
+|---|---|
+| Purpose | Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required. |
+| Applications | Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability. |
+| Cython entry points used | `nn_distances`, `pairwise_distances` |
+
+### 4.2 Concept discovery
+
+| | |
+|---|---|
+| Purpose | Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds. Returns the concepts the data itself supports. |
+| Applications | Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators. |
+| Cython entry points used | `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
+
+### 4.3 Relational hypothesis generation
+
+| | |
+|---|---|
+| Purpose | Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support. |
+| Applications | Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery. |
+| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
+
+### 4.4 Semantic gap detection
+
+| | |
+|---|---|
+| Purpose | Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not. |
+| Applications | Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity. |
+| Cython entry points used | `concept_support_matrix`, `non_redundant_witnesses` |
+
+### 4.5 Prototype construction
+
+| | |
+|---|---|
+| Purpose | Predict the structural features of an item that should exist between known concepts but has not yet been observed. |
+| Applications | Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition. |
+| Cython entry points used | `batch_max_similarity`, `concept_support_matrix` |
+
+### 4.6 Verdict-qualified inference
+
+| | |
+|---|---|
+| Purpose | Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently). |
+| Applications | Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage. |
+| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
+
+### 4.7 Lifecycle / dominance verification
+
+| | |
+|---|---|
+| Purpose | When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype. Maintains the prototype's status across its lifetime. |
+| Applications | Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference. |
+| Cython entry points used | `pareto_core_mask` |
+
+### 4.8 Hierarchical recursion
+
+| | |
+|---|---|
+| Purpose | Apply every operator above to recursive concept trees — concepts whose members are themselves concepts. Operators bubble through the hierarchy and remain mathematically consistent at every level. |
+| Applications | Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.). |
+| Cython entry points used | the operators above, recursively |
+
+### 4.9 Streaming kNN graph maintenance
+
+| | |
+|---|---|
+| Purpose | Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update. |
+| Applications | Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection. |
+| Cython entry points used | `pairwise_distances`, `nn_distances` (on the contiguous buffer); `scipy.spatial.cKDTree` is used internally above 1000 items for exact O(log N) queries — no fidelity knob. |
+
+### 4.10 Time-series streaming model
+
+| | |
+|---|---|
+| Purpose | A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows. Optionally projects high-dimensional windows to lower dimensions when configured to do so. |
+| Applications | Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting. |
+| Cython entry points used | `nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9 |
+
+---
+
+## 5. Composition properties
+
+The operators in §1–§4 compose along several axes:
+
+- **Across payload types**: the same operator works for scalars,
+  vectors, matrices, tensors, functions, manifolds, complex states,
+  distributions, time-series windows. The caller supplies the
+  appropriate distance function or, equivalently, an embedding into
+  Euclidean space.
+- **Across hierarchy levels**: concepts can themselves be members of
+  parent concepts; operators recurse through the tree (§4.8).
+- **Under wrapping**: stochastic and temporal extensions can be
+  layered over any base payload type. Triple compositions like
+  "hierarchy of stochastic time-series" are admissible and produce
+  consistent results at every level.
+
+---
+
+## 6. What the apparatus does NOT offer
+
+Stated explicitly so users can plan around the limits:
+
+- No probability distributions over outcomes. Verdicts are
+  structural, not Bayesian.
+- No reward / objective optimisation. The apparatus does not learn
+  policies; it identifies structural relationships.
+- No tuning knobs that trade fidelity for speed. Where some
+  alternatives expose `epsilon`, `top_k`, `temperature`, etc., the
+  apparatus uses data-derived structural boundaries instead.
+- No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes).
+  Every kNN-related operator returns exact results.
+
+---
+
+## 7. Mapping summary
+
+| Apparatus operator | Cython entry point(s) |
+|---|---|
+| Pairwise similarity | `batch_max_similarity` |
+| Multi-class similarity | `concept_support_matrix` |
+| Pairwise distance | `pairwise_distances` |
+| Nearest-neighbour distance | `nn_distances` |
+| Best-tradeoff filter | `pareto_core_mask` |
+| One-sided peak flag | `one_sided_mask` |
+| Non-redundant witness | `non_redundant_witnesses` |
+| Fused centroid + radius update | `extend_frontier_kernel` |
+| Intrinsic scale | composed of `nn_distances`, `pairwise_distances` |
+| Concept discovery | composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
+| Relational hypothesis generation | composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
+| Semantic gap detection | composed of `concept_support_matrix`, `non_redundant_witnesses` |
+| Prototype construction | composed of `batch_max_similarity`, `concept_support_matrix` |
+| Verdict-qualified inference | composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
+| Lifecycle / dominance verification | composed of `pareto_core_mask` |
+| Hierarchical recursion | every operator above, recursively |
+| Streaming kNN graph | `pairwise_distances`, `nn_distances` |
+| Time-series streaming model | `nn_distances`, `concept_support_matrix`, streaming kNN |
+
+## 8. Library availability
+
+The Cython entry points in the right column of §7 are all in
+`sem_cython12.wrapper`, distributed at
+[https://git.sevana.biz/vvs/sem_cython12](https://git.sevana.biz/vvs/sem_cython12).
+Higher-level apparatus (composed operators in §4) is built on those
+primitives and ships in the SEM foundation package, separate from
+this library.
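
The output contracts stated in §1.3 and §1.4 of the catalog above (a symmetric `(N, N)` matrix; per-row minimum positive distance, with `inf` when no such neighbour exists) can be illustrated with a small pure-NumPy sketch. This is a reading aid, not the library's implementation: the names `pairwise_distances_ref` and `nn_distances_ref` are hypothetical, and the Euclidean metric is an assumption, since the catalog does not name the metric.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) matrix of Euclidean distances between rows of X.

    Hypothetical reference helper; the Euclidean metric is an assumption.
    """
    X = np.ascontiguousarray(X, dtype=np.float64)
    diff = X[:, None, :] - X[None, :, :]          # shape (N, N, D)
    return np.sqrt((diff * diff).sum(axis=-1))


def nn_distances_ref(X):
    """Per-row minimum positive distance; inf when no such neighbour exists."""
    D = pairwise_distances_ref(X)
    # Zero entries cover the diagonal and exact duplicates; both are skipped.
    D = np.where(D > 0.0, D, np.inf)
    return D.min(axis=1)


# Rows 0 and 2 coincide, so their mutual zero distance is ignored.
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
D = pairwise_distances_ref(X)
nn = nn_distances_ref(X)
lonely = nn_distances_ref(np.array([[1.0, 2.0]]))  # a single row has no neighbour
print(D)
print(nn)
print(lonely)
```

The dense `(N, N, D)` intermediate keeps the sketch short but costs memory quadratic in `N`; it is meant for checking small cases by hand, not for the workloads the compiled kernels target.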
diff --git a/docs/SEM_Overview.md b/docs/SEM_Overview.md
new file mode 100644
index 0000000..da6ce56
--- /dev/null
+++ b/docs/SEM_Overview.md
@@ -0,0 +1,271 @@
+# SEM — An Overview of Structural Reasoning
+
+*A non-internal introduction to the SEM (Similarity Energy Model)
+reasoning system, its applications, and the `sem_cython12` library.*
+
+---
+
+## 1. What SEM is
+
+SEM is a reasoning system for **discovering structure in observed
+data** and producing **decision-qualified predictions** about new
+observations. Unlike conventional machine learning, SEM is not a
+parameterised model fitted to training data: its outputs are derived
+directly from the geometry of the observed world set. Where ML asks
+"what is the most likely label?", SEM asks "what is the structural
+position of this observation relative to everything we have seen?"
+— and reports the answer as a verdict, not a probability.
+
+The system has been used as a discovery engine, an anomaly detector,
+a missing-mediator predictor, a regime-change identifier, and an
+explainable inference layer over neural-network embeddings. Each
+application reuses the same small set of structural operators.
+
+## 2. Properties that distinguish SEM
+
+- **Parameter-free.** No learning rates, no regularisation
+  coefficients, no tuning knobs in the reasoning pipeline. Every
+  scale or boundary the system consults is computed from the data
+  itself.
+- **Threshold-free.** No `if score > 0.85` decisions. Where
+  conventional pipelines impose a numeric cut-off, SEM uses
+  data-derived structural boundaries that adapt to the observed
+  geometry.
+- **Three-valued verdict.** A prediction returns one of:
+  - **confident** — a single best-fitting concept dominates;
+  - **gap** — multiple concepts are equally admissible, signalling
+    that the query lies in a region the current theory has not
+    resolved;
+  - **incoherent** — no concept admits the query consistently;
+    further data is required.
+  This refusal-to-guess is the system's most useful safety property:
+  it never collapses uncertainty into a forced label.
+- **Detects what is missing.** SEM identifies positions where
+  observed data should produce a structural witness but does not, and
+  predicts the features the missing entity should carry. Conventional
+  ML cannot signal that a hidden mediator or unobserved variable is
+  required.
+- **Explainable by construction.** Every prediction comes with a
+  decomposition of the supporting evidence, so a downstream system
+  (or human reviewer) can audit which structural relations argue for
+  a given verdict.
+- **Composable across data types.** The same reasoning apparatus
+  applies to scalars, vectors, matrices, sampled functions, sampled
+  manifolds, complex (quantum) state vectors, distributions,
+  time-series windows, and recursive concept hierarchies. The operators
+  see all of these through a common interface.
+
+## 3. Where SEM has been applied
+
+| Domain | Capability used |
+|---|---|
+| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
+| Scientific law discovery | Recovering analytic relationships from raw measurements |
+| Drug / molecule screening | Structural similarity beyond fingerprints |
+| Network monitoring | Silent-failure detection in encrypted traffic |
+| Causal inference | Discovering missing variables from observational data |
+| Image / signal analysis | Structural feature extraction with explainability |
+| LLM explainability | Interpreting embedding-space behaviour |
+| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
+| Trading & market structure | Regime-switch decisions with abstain semantics |
+
+In each case the value is the same: the system either gives a
+high-confidence answer or refuses to, and never delivers a confident
+wrong answer disguised as a probability.
+
+## 4. How SEM differs from machine learning
+
+| | Machine learning | SEM |
+|---|---|---|
+| Has training phase | yes | no |
+| Has hyper-parameters | yes | no |
+| Can detect missing entities | no | yes |
+| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
+| Output | numeric / probabilistic | structural with verdict |
+| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
+| Scale of usable data | requires many examples | works on small data, even single-digit examples |
+
+SEM and ML are not exclusive — SEM is sometimes layered on top of
+neural-network embeddings to provide an explainability and abstention
+layer, and ML can supply the embeddings SEM reasons over.
+
+## 5. The `sem_cython12` library
+
+`sem_cython12` is the high-performance numerical kernel layer that
+backs SEM's reasoning operators. It is delivered as a pre-compiled
+Linux shared object plus a thin Python wrapper; users do not compile
+anything at install time.
+
+The library exposes one module:
+
+- `sem_cython12.wrapper` — Python API over the compiled kernels.
+
+Inside the module, the public functions are grouped by purpose.
+
+### 5.1 Configuration
+
+| Function | Purpose |
+|---|---|
+| `available() -> bool` | Reports whether the compiled extension loaded |
+| `backend() -> str` | `'cython12'` or `'python-fallback'` |
+| `get_num_threads() -> int` | Active OpenMP worker count |
+| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |
+
+OpenMP thread count defaults to roughly 50 % of the host's logical
+cores, so other processes are not starved on shared machines. The
+caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
+environment variable.
+
+### 5.2 Distance and similarity
+
+| Function | What it does |
+|---|---|
+| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
+| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
+| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
+| `nn_distances(X)` | Per-row minimum positive distance to any other row. |
+
+These four cover the bulk of SEM's structural-similarity workload.
+
+### 5.3 Pareto / dominance reasoning
+
+| Function | What it computes |
+|---|---|
+| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
+| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
+| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |
+
+These let the caller reason about which observations *meaningfully*
+contribute to bridging multiple structural classes — versus those that
+are merely peaks of a single class.
+
+### 5.4 Vector reduction
+
+| Function | What it computes |
+|---|---|
+| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |
+
+Used by higher-level routines that need to enumerate candidate
+relational hypotheses bridging multiple regions of structural space.
+
+### 5.5 Performance
+
+Measured on commodity x86_64 hardware with 8 OpenMP threads against
+the equivalent pure-numpy reference implementations:
+
+| Operation | Speed-up |
+|---|---|
+| `batch_max_similarity` (N=2000, D=50) | ~14× |
+| `pareto_core_mask` (N=1000, k=8) | ~50× |
+| Streaming kNN ingest (sliding-window, len=600) | ~100× |
+| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |
+
+All routines release the GIL during their inner loops, so calling
+them concurrently from Python threads is safe.
+
+## 6. A worked Python example
+
+The following snippet uses only `sem_cython12.wrapper` and `numpy`.
+It shows how a downstream pipeline would identify the **structurally
+informative** members of a small synthetic dataset — those that
+mediate between two clusters rather than sitting at one cluster's
+peak.
+
+```python
+import numpy as np
+from sem_cython12 import wrapper as cy
+
+assert cy.available(), "compiled extension did not load"
+print("backend:", cy.backend(), " threads:", cy.get_num_threads())
+
+# Two well-separated clusters in 4-D, plus three "bridging" candidates
+# whose similarity profile spans both clusters.
+rng = np.random.default_rng(0)
+cluster_a = rng.standard_normal((20, 4)) + 3.0
+cluster_b = rng.standard_normal((20, 4)) - 3.0
+bridges = np.array([
+    [ 0.0, 0.0,  0.0,  0.0],
+    [ 0.5, 0.5, -0.2,  0.1],
+    [-0.3, 0.1,  0.4, -0.2],
+])
+members = np.vstack([cluster_a, cluster_b, bridges])
+
+# 1. Build a 2-class similarity matrix:
+#    columns = (sim to cluster_a, sim to cluster_b)
+sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
+sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
+S = np.column_stack([sim_a, sim_b])  # (N, 2)
+
+# 2. Find the Pareto frontier of (sim_a, sim_b).
+#    Members whose support vector is strictly dominated by another
+#    member are excluded.
+keep_mask = cy.pareto_core_mask(S)
+print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))
+
+# 3. Of those, which are NOT one-sided peaks?
+#    A one-sided member is a peak of exactly one cluster and gains
+#    nothing on the other. We want members that score on BOTH.
+non_redundant = cy.non_redundant_witnesses(S)
+print("Non-redundant witnesses:", non_redundant.tolist())
+
+# 4. Inspect the ones that survived: these are the data points that
+#    structurally connect the two clusters.
+for idx in non_redundant:
+    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
+```
+
+A typical run prints something like:
+
+```
+backend: cython12  threads: 4
+Pareto-frontier members: 8 / 43
+Non-redundant witnesses: [40, 41, 42]
+ row 40: sim_a=0.428 sim_b=0.428
+ row 41: sim_a=0.412 sim_b=0.401
+ row 42: sim_a=0.402 sim_b=0.395
+```
+
+The library has filtered out the 40 cluster members (which sit at
+their own cluster's peak and contribute nothing across cluster
+boundaries) and identified the three synthetic "bridges" as the
+structurally informative observations. This is the kind of
+elementary operation that higher-level SEM reasoning composes into
+concept discovery, gap detection and prototype prediction.
+
+## 7. When to consider SEM
+
+| Situation | Consider SEM |
+|---|---|
+| You have small data (10–10,000 examples) and need a defensible decision | Yes |
+| You need to know *what is missing* from your data | Yes |
+| You need a model that refuses to guess when the data is ambiguous | Yes |
+| You want explanations that are inherent to the inference, not bolted on | Yes |
+| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
+| You have a regression task with smooth dependencies | Stay with classical statistics |
+
+## 8. Library availability
+
+`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython
+3.12 shared object. Installation is:
+
+```bash
+git clone https://git.sevana.biz/vvs/sem_cython12.git
+cd sem_cython12
+pip install -r requirements.txt
+export PYTHONPATH=$PWD:$PYTHONPATH
+```
+
+The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
+and the compiled `.so`, plus `requirements.txt` and a README describing
+the public API.
+
+## 9. Summary
+
+SEM is a structural reasoning system whose promise is decision
+quality, not raw accuracy.
+Its key product is a verdict-qualified prediction: the system tells you
+whether it is confident, whether the data is genuinely ambiguous, or
+whether the observation lies outside the apparatus's coherent coverage.
+The `sem_cython12` library provides the high-performance numerical layer
+beneath this reasoning, exposing a small, well-defined Python API that
+downstream applications compose into domain-specific pipelines.
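
For readers who want to check small cases by hand, the "not strictly dominated in the maximisation order" filter described in §5.3 of the overview can be sketched in pure NumPy. This is an illustration under a stated assumption, not the library's code: the helper name `pareto_mask_ref` is hypothetical, and "strictly dominated" is read here as "some other row is strictly greater in every column". If the compiled kernel uses a different dominance rule, the two can disagree on rows that tie in one column.

```python
import numpy as np


def pareto_mask_ref(S):
    """True for rows that no other row beats on every column (maximisation).

    Hypothetical reference helper. Assumes row j dominates row i only
    when S[j] > S[i] holds in all columns.
    """
    S = np.asarray(S, dtype=np.float64)
    # beats[j, i] is True when row j is strictly greater than row i everywhere.
    beats = (S[:, None, :] > S[None, :, :]).all(axis=-1)
    return ~beats.any(axis=0)


S = np.array([
    [0.9, 0.1],   # peak on column 0
    [0.1, 0.9],   # peak on column 1
    [0.5, 0.5],   # balanced trade-off
    [0.4, 0.4],   # beaten on both columns by the row above
])
mask = pareto_mask_ref(S)
print(mask)
```

A row is never compared favourably against itself because `S[i] > S[i]` is false in every column, so no explicit diagonal handling is needed.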