Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README

vvs
2026-05-09 19:24:43 +01:00
parent 80f99d1d15
commit fa87dbb473
3 changed files with 550 additions and 0 deletions
@@ -147,6 +147,15 @@ internally cast to contiguous `float64`. Outputs are numpy arrays.
See the wrapper docstrings for exact semantics of each function.
## Documentation
- [`docs/SEM_Overview.md`](./docs/SEM_Overview.md) — non-internal
introduction to SEM (Similarity Energy Model), what it does, and
how the `sem_cython12` library fits in.
- [`docs/SEM_Mathematical_Apparatus.md`](./docs/SEM_Mathematical_Apparatus.md)
— capabilities-level description of the operators and engines
exposed by the library.
## Demos
Three runnable demos live in [`demos/`](./demos/):
@@ -0,0 +1,270 @@
# SEM — Mathematical Apparatus (Capability Catalog)
*A non-internal catalog of the operators SEM offers, what each is for,
and which entry points of the `sem_cython12` library back them.*
This document describes WHAT the apparatus does and WHERE to use it.
It does not describe HOW any operator works internally — algorithms,
formulas, lemmas and proofs are intentionally not reproduced here.
---
## Conventions
- "Item" / "world" / "observation": one row of input data. Items live
in some payload space (real numbers, vectors, matrices, sampled
functions, sampled manifolds, distributions, complex amplitudes,
time-series windows, recursive concept trees) — the apparatus
treats them uniformly via a small set of structural operators.
- "Concept": a subset of items that share structural meaning. The
apparatus can either be told the concepts (labelled mode) or
discover them from data (unsupervised mode).
- "Witness": an item whose structural position carries information
beyond merely belonging to one concept.
- "Verdict": the system's qualified output for a new observation,
one of `confident`, `gap`, `incoherent` (see §4.6).
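The apparatus as a whole is parameter-free and threshold-free: there
are no fitting parameters, no numeric cut-offs, no fidelity knobs. The
`lam` scale argument taken by the §1 primitives is derived from the
data by the intrinsic-scale operator (§4.1) rather than tuned by hand.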
---
## 1. Structural similarity primitives
These are the lowest-level building blocks. Each is exposed directly
in `sem_cython12.wrapper`.
### 1.1 Pairwise similarity
| | |
|---|---|
| Purpose | Score how close a query item is to the most similar member of a reference set. |
| Output | A score in `[0, 1]` per query (1 = the query coincides with a member of the reference set, 0 = the query is effectively far from every member). |
| Applications | Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation. |
| Cython entry point | `batch_max_similarity(X_query, X_members, lam)` |
### 1.2 Multi-class similarity matrix
| | |
|---|---|
| Purpose | The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix. |
| Applications | Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters. |
| Cython entry point | `concept_support_matrix(X_query, member_mats, lam)` |
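
To make the input and output shapes of §1.1 and §1.2 concrete, here is a pure-NumPy sketch. The kernel `exp(-d / lam)` on Euclidean distance is an assumption, chosen only because it satisfies the stated contract (scores in `[0, 1]`, a score of 1 at a reference member, decay controlled by `lam`); the library's actual kernel is not described in this document. The names `max_similarity_sketch` and `support_matrix_sketch` are hypothetical and are not part of `sem_cython12`.

```python
import numpy as np

def max_similarity_sketch(X_query, X_members, lam):
    """Stand-in for batch_max_similarity: (Q, D), (M, D) -> (Q,)."""
    if lam <= 0:
        raise ValueError("lam must be > 0")
    Xq = np.ascontiguousarray(X_query, dtype=np.float64)
    Xm = np.ascontiguousarray(X_members, dtype=np.float64)
    # Euclidean distance from every query row to every member row: (Q, M).
    d = np.linalg.norm(Xq[:, None, :] - Xm[None, :, :], axis=-1)
    # ASSUMED kernel; the closest member gives the highest similarity.
    return np.exp(-d.min(axis=1) / lam)

def support_matrix_sketch(X_query, member_mats, lam):
    """Stand-in for concept_support_matrix: one column per reference set."""
    cols = [max_similarity_sketch(X_query, M, lam) for M in member_mats]
    return np.column_stack(cols)  # shape (Q, K)

queries = np.array([[0.0, 0.0], [3.0, 4.0]])
set_a = np.array([[0.0, 0.0], [10.0, 10.0]])
set_b = np.array([[3.0, 4.0]])
S = support_matrix_sketch(queries, [set_a, set_b], lam=5.0)
print(S.shape)  # (2, 2)
```

Each query sits exactly on one reference set and at distance 5 from the other, so under the assumed kernel the diagonal of `S` is 1 and the off-diagonal entries are `exp(-1)`.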
### 1.3 Pairwise distance matrix
| | |
|---|---|
| Purpose | Symmetric `(N, N)` distance matrix between rows of `X`. |
| Applications | Graph construction, clustering, scale estimation, downstream filtering and ranking. |
| Cython entry point | `pairwise_distances(X)` |
### 1.4 Nearest-neighbour distance vector
| | |
|---|---|
| Purpose | For each row, the minimum positive distance to any other row. Rows with no positive-distance neighbour receive `inf`. |
| Applications | Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification. |
| Cython entry point | `nn_distances(X)` |
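
The contracts of §1.3 and §1.4 are specific enough to sketch in NumPy. Euclidean distance is assumed here (this document does not name the metric), and `pairwise_distances_sketch` and `nn_distances_sketch` are hypothetical names for illustration, not library functions.

```python
import numpy as np

def pairwise_distances_sketch(X):
    """Symmetric (N, N) Euclidean distance matrix between rows of X."""
    X = np.ascontiguousarray(X, dtype=np.float64)
    # Broadcasting builds an (N, N, D) difference tensor: simple, but O(N^2 D) memory.
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def nn_distances_sketch(X):
    """Per-row minimum positive distance; inf when no such neighbour exists."""
    D = pairwise_distances_sketch(X)
    # Zero distances (the row itself and exact duplicates) are not "positive".
    D = np.where(D > 0.0, D, np.inf)
    return D.min(axis=1)

X = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
print(nn_distances_sketch(X))                       # [5. 5. 5.]
print(nn_distances_sketch(np.array([[1.0, 2.0]])))  # [inf]
```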
---
## 2. Multi-criterion filtering primitives
Given a real-valued matrix `S` of shape `(N, k)` (rows are items,
columns are independent criteria — each in maximisation orientation),
these primitives identify structurally informative subsets of rows.
### 2.1 Best-tradeoff filter
| | |
|---|---|
| Purpose | Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion). |
| Applications | Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis. |
| Cython entry point | `pareto_core_mask(S)` |
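
As an illustration of the filter's contract, the sketch below reads the definition above literally: a row is dropped only when some other row is strictly better on every criterion. How the library treats ties is not stated in this document, so that reading is an assumption, and `pareto_mask_sketch` is a hypothetical name.

```python
import numpy as np

def pareto_mask_sketch(S):
    """Boolean mask of rows that are not strictly worse than another row
    on every column (maximisation orientation)."""
    S = np.asarray(S, dtype=np.float64)
    # worse[i, j] is True when row i is strictly below row j on every column.
    worse = (S[:, None, :] < S[None, :, :]).all(axis=-1)
    return ~worse.any(axis=1)

S = np.array([
    [1.0, 0.0],   # best on criterion 0
    [0.0, 1.0],   # best on criterion 1
    [0.6, 0.6],   # trade-off: no row beats it on both criteria
    [0.5, 0.5],   # strictly worse than the row above on both criteria
])
print(pareto_mask_sketch(S))  # [ True  True  True False]
```

The pairwise comparison is quadratic in `N`, which is acceptable for a sketch but is presumably where a compiled kernel earns its keep.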
### 2.2 One-sided peak flagging
| | |
|---|---|
| Purpose | Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns, i.e. items that "peak" on a single criterion alone. |
| Applications | Removing items that are only locally informative; finding cross-criterion contributors; bridge identification. |
| Cython entry point | `one_sided_mask(S)` |
### 2.3 Non-redundant witness identification
| | |
|---|---|
| Purpose | The subset of rows that survive both 2.1 and 2.2 — items that contribute meaningfully across multiple criteria, not just on one. |
| Applications | Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis. |
| Cython entry point | `non_redundant_witnesses(S)` |
---
## 3. Incremental aggregation primitive
### 3.1 Fused centroid + radius update
| | |
|---|---|
| Purpose | One-pass bulk update for an incremental aggregation step. Given `F` reference items (each summarised by a centre vector and a radius representing the dispersion of `cur_arity` underlying points) and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item. |
| Applications | Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines. |
| Cython entry point | `extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)` |
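
The update described above can be sketched in NumPy under three assumptions that this document does not confirm: the radius is the root-mean-square distance of the underlying points from their centroid, `cur_arity` is a single integer shared by all reference items, and the results are laid out as `(F, A, D)` and `(F, A)` arrays. The name `extend_frontier_sketch` is hypothetical.

```python
import numpy as np

def extend_frontier_sketch(cur_centers, cur_radii, new_emb, cur_arity):
    """Append each of A candidates to each of F (centre, radius) summaries."""
    C = np.asarray(cur_centers, dtype=np.float64)   # (F, D)
    r = np.asarray(cur_radii, dtype=np.float64)     # (F,)
    X = np.asarray(new_emb, dtype=np.float64)       # (A, D)
    n = int(cur_arity)
    delta = X[None, :, :] - C[:, None, :]           # (F, A, D)
    # Running-mean update: c' = c + (x - c) / (n + 1).
    new_centers = C[:, None, :] + delta / (n + 1)
    # ASSUMED RMS radius: the sum of squared deviations grows by
    # n / (n + 1) * ||x - c||^2 when one point is appended.
    sq = (delta ** 2).sum(axis=-1)                  # (F, A)
    new_radii = np.sqrt((n * r[:, None] ** 2 + n / (n + 1) * sq) / (n + 1))
    return new_centers, new_radii

rng = np.random.default_rng(0)
points = rng.standard_normal((5, 3))                # one reference item, 5 points
center = points.mean(axis=0)
radius = np.sqrt(((points - center) ** 2).sum(axis=1).mean())
candidates = rng.standard_normal((2, 3))
centers, radii = extend_frontier_sketch(
    center[None, :], np.array([radius]), candidates, cur_arity=5
)
print(centers.shape, radii.shape)  # (1, 2, 3) (1, 2)
```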
---
## 4. Higher-level apparatus
Built on the primitives in §1–§3. These are the operators that
distinguish SEM as a reasoning system rather than a computation
library. Their internal construction is not reproduced here; the
"Cython entry points used" row of each table lists the public
primitives the operator composes.
### 4.1 Intrinsic scale
| | |
|---|---|
| Purpose | Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required. |
| Applications | Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability. |
| Cython entry points used | `nn_distances`, `pairwise_distances` |
### 4.2 Concept discovery
| | |
|---|---|
| Purpose | Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds. Returns the concepts the data itself supports. |
| Applications | Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators. |
| Cython entry points used | `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
### 4.3 Relational hypothesis generation
| | |
|---|---|
| Purpose | Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support. |
| Applications | Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
### 4.4 Semantic gap detection
| | |
|---|---|
| Purpose | Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not. |
| Applications | Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity. |
| Cython entry points used | `concept_support_matrix`, `non_redundant_witnesses` |
### 4.5 Prototype construction
| | |
|---|---|
| Purpose | Predict the structural features of an item that should exist between known concepts but has not yet been observed. |
| Applications | Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition. |
| Cython entry points used | `batch_max_similarity`, `concept_support_matrix` |
### 4.6 Verdict-qualified inference
| | |
|---|---|
| Purpose | Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently). |
| Applications | Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
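
A downstream consumer mostly needs to know how to branch on the three outcomes. The toy below models only that interface: it takes a boolean vector saying which concepts admit the observation and maps it to a verdict. How SEM actually decides admissibility is not described in this document, so the boolean input, the `Verdict` enum, and both function names are hypothetical stand-ins.

```python
from enum import Enum
from typing import Sequence

class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"

def verdict_from_admissibility(admissible: Sequence[bool]) -> Verdict:
    """Toy mapping: 0 admitting concepts -> incoherent, 1 -> confident, >1 -> gap."""
    count = sum(bool(a) for a in admissible)
    if count == 0:
        return Verdict.INCOHERENT
    if count == 1:
        return Verdict.CONFIDENT
    return Verdict.GAP

def act_on(verdict: Verdict) -> str:
    """Example policy: only a confident verdict produces a label."""
    return "accept label" if verdict is Verdict.CONFIDENT else "abstain"

for flags in ([False, True, False], [True, True, False], [False, False, False]):
    v = verdict_from_admissibility(flags)
    print(v.value, "->", act_on(v))
```

The point of the sketch is the calling pattern: the caller handles `gap` and `incoherent` explicitly instead of receiving a forced label.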
### 4.7 Lifecycle / dominance verification
| | |
|---|---|
| Purpose | When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype. Maintains the prototype's status across its lifetime. |
| Applications | Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference. |
| Cython entry points used | `pareto_core_mask` |
### 4.8 Hierarchical recursion
| | |
|---|---|
| Purpose | Apply every operator above to recursive concept trees — concepts whose members are themselves concepts. Operators bubble through the hierarchy and remain mathematically consistent at every level. |
| Applications | Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.). |
| Cython entry points used | the operators above, recursively |
### 4.9 Streaming kNN graph maintenance
| | |
|---|---|
| Purpose | Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update. |
| Applications | Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection. |
| Cython entry points used | `pairwise_distances`, `nn_distances` (on the contiguous buffer); `scipy.spatial.cKDTree` is used internally above 1000 items for exact O(log N) queries — no fidelity knob. |
### 4.10 Time-series streaming model
| | |
|---|---|
| Purpose | A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows. Optionally projects high-dimensional windows to lower dimensions when configured to do so. |
| Applications | Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting. |
| Cython entry points used | `nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9 |
---
## 5. Composition properties
The operators in §1–§4 compose along several axes:
- **Across payload types**: the same operator works for scalars,
vectors, matrices, tensors, functions, manifolds, complex states,
distributions, time-series windows. The caller supplies the
appropriate distance function or, equivalently, an embedding into
Euclidean space.
- **Across hierarchy levels**: concepts can themselves be members of
parent concepts; operators recurse through the tree (§4.8).
- **Under wrapping**: stochastic and temporal extensions can be
layered over any base payload type. Triple compositions like
"hierarchy of stochastic time-series" are admissible and produce
consistent results at every level.
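
As one example of "an embedding into Euclidean space", complex amplitude vectors can be mapped to real vectors by concatenating real and imaginary parts, which preserves the Euclidean distance between any two items. The helper name `embed_complex` is hypothetical, and whether states should also be compared up to a global phase is a modelling choice this sketch does not address.

```python
import numpy as np

def embed_complex(Z):
    """Map (N, D) complex rows to (N, 2D) contiguous float64 rows."""
    Z = np.asarray(Z, dtype=np.complex128)
    X = np.concatenate([Z.real, Z.imag], axis=-1)
    return np.ascontiguousarray(X, dtype=np.float64)

Z = np.array([[1 + 2j, 0 - 1j],
              [0 + 0j, 3 + 4j]])
X = embed_complex(Z)

# The distance between two items is the same before and after embedding.
d_complex = np.linalg.norm(Z[0] - Z[1])
d_real = np.linalg.norm(X[0] - X[1])
print(X.shape, np.isclose(d_complex, d_real))  # (2, 4) True
```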
---
## 6. What the apparatus does NOT offer
Stated explicitly so users can plan around the limits:
- No probability distributions over outcomes. Verdicts are
structural, not Bayesian.
- No reward / objective optimisation. The apparatus does not learn
policies; it identifies structural relationships.
- No tuning knobs that trade fidelity for speed. Where some
alternatives expose `epsilon`, `top_k`, `temperature`, etc., the
apparatus uses data-derived structural boundaries instead.
- No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes).
Every kNN-related operator returns exact results.
---
## 7. Mapping summary
| Apparatus operator | Cython entry point(s) |
|---|---|
| Pairwise similarity | `batch_max_similarity` |
| Multi-class similarity | `concept_support_matrix` |
| Pairwise distance | `pairwise_distances` |
| Nearest-neighbour distance | `nn_distances` |
| Best-tradeoff filter | `pareto_core_mask` |
| One-sided peak flag | `one_sided_mask` |
| Non-redundant witness | `non_redundant_witnesses` |
| Fused centroid + radius update | `extend_frontier_kernel` |
| Intrinsic scale | composed of `nn_distances`, `pairwise_distances` |
| Concept discovery | composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
| Relational hypothesis generation | composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
| Semantic gap detection | composed of `concept_support_matrix`, `non_redundant_witnesses` |
| Prototype construction | composed of `batch_max_similarity`, `concept_support_matrix` |
| Verdict-qualified inference | composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
| Lifecycle / dominance verification | composed of `pareto_core_mask` |
| Hierarchical recursion | every operator above, recursively |
| Streaming kNN graph | `pairwise_distances`, `nn_distances` |
| Time-series streaming model | `nn_distances`, `concept_support_matrix`, streaming kNN |
## 8. Library availability
The Cython entry points in the right column of §7 are all in
`sem_cython12.wrapper`, distributed at
[https://git.sevana.biz/vvs/sem_cython12](https://git.sevana.biz/vvs/sem_cython12).
Higher-level apparatus (composed operators in §4) is built on those
primitives and ships in the SEM foundation package, separate from
this library.
@@ -0,0 +1,271 @@
# SEM — An Overview of Structural Reasoning
*A non-internal introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the `sem_cython12` library.*
---
## 1. What SEM is
SEM is a reasoning system for **discovering structure in observed
data** and producing **decision-qualified predictions** about new
observations. Unlike conventional machine learning, SEM is not a
parameterised model fitted to training data: its outputs are derived
directly from the geometry of the observed world set. Where ML asks
"what is the most likely label?", SEM asks "what is the structural
position of this observation relative to everything we have seen?"
— and reports the answer as a verdict, not a probability.
The system has been used as a discovery engine, an anomaly detector,
a missing-mediator predictor, a regime-change identifier, and an
explainable inference layer over neural-network embeddings. Each
application reuses the same small set of structural operators.
## 2. Properties that distinguish SEM
- **Parameter-free.** No learning rates, no regularisation
coefficients, no tuning knobs in the reasoning pipeline. Every
scale or boundary the system consults is computed from the data
itself.
- **Threshold-free.** No `if score > 0.85` decisions. Where
conventional pipelines impose a numeric cut-off, SEM uses
data-derived structural boundaries that adapt to the observed
geometry.
- **Three-valued verdict.** A prediction returns one of:
- **confident** — a single best-fitting concept dominates;
- **gap** — multiple concepts are equally admissible, signalling
that the query lies in a region the current theory has not
resolved;
- **incoherent** — no concept admits the query consistently;
further data is required.
This refusal-to-guess is the system's most useful safety property:
it never collapses uncertainty into a forced label.
- **Detects what is missing.** SEM identifies positions where
observed data should produce a structural witness but does not, and
predicts the features the missing entity should carry. Conventional
ML cannot signal that a hidden mediator or unobserved variable is
required.
- **Explainable by construction.** Every prediction comes with a
decomposition of the supporting evidence, so a downstream system
(or human reviewer) can audit which structural relations argue for
a given verdict.
- **Composable across data types.** The same reasoning apparatus
applies to scalars, vectors, matrices, sampled functions, sampled
manifolds, complex (quantum) state vectors, distributions,
time-series windows, and recursive concept hierarchies. The operators
see all of these through a common interface.
## 3. Where SEM has been applied
| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |
In each case the value is the same: the system either gives a
high-confidence answer or refuses to, and never delivers a confident
wrong answer disguised as a probability.
## 4. How SEM differs from machine learning
| | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |
SEM and ML are not exclusive — SEM is sometimes layered on top of
neural-network embeddings to provide an explainability and abstention
layer, and ML can supply the embeddings SEM reasons over.
## 5. The `sem_cython12` library
`sem_cython12` is the high-performance numerical kernel layer that
backs SEM's reasoning operators. It is delivered as a pre-compiled
Linux shared object plus a thin Python wrapper; users do not compile
anything at install time.
The library exposes one module:
- `sem_cython12.wrapper` — Python API over the compiled kernels.
Inside the module, the public functions are grouped by purpose.
### 5.1 Configuration
| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |
OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines. The
caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
environment variable.
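
The default described above can be approximated with the standard library alone. This is a sketch of the stated policy, not the library's code: the precedence (environment variable first, then half the logical cores), the handling of invalid values, and the name `default_thread_count_sketch` are all assumptions.

```python
import os

def default_thread_count_sketch(env=os.environ):
    """Pick a worker count: SEM_NUM_THREADS if valid, else ~50 % of logical cores."""
    raw = env.get("SEM_NUM_THREADS")
    if raw is not None:
        try:
            n = int(raw)
        except ValueError:
            n = 0  # ASSUMPTION: an unparsable value falls back to the default
        if n >= 1:
            return n
    logical = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical // 2)

print(default_thread_count_sketch({"SEM_NUM_THREADS": "3"}))  # 3
print(default_thread_count_sketch({}) >= 1)                   # True
```

With the library installed, a value chosen this way could then be applied through `set_num_threads(n)` from the table above.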
### 5.2 Distance and similarity
| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |
These four cover the bulk of SEM's structural-similarity workload.
### 5.3 Pareto / dominance reasoning
| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |
These let the caller reason about which observations *meaningfully*
contribute to bridging multiple structural classes — versus those that
are merely peaks of a single class.
### 5.4 Vector reduction
| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |
Used by higher-level routines that need to enumerate candidate
relational hypotheses bridging multiple regions of structural space.
### 5.5 Performance
Measured on commodity x86_64 hardware with 8 OpenMP threads against
the equivalent pure-numpy reference implementations:
| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |
All routines release the GIL during their inner loops, so calling
them concurrently from Python threads is safe.
## 6. A worked Python example
The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally
informative** members of a small synthetic dataset — those that
mediate between two clusters rather than sitting at one cluster's
peak.
```python
import numpy as np
from sem_cython12 import wrapper as cy
assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())
# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0, 0.0,  0.0,  0.0],
    [ 0.5, 0.5, -0.2,  0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])
# 1. Build a 2-class similarity matrix:
# columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b]) # (N, 2)
# 2. Find the Pareto frontier of (sim_a, sim_b).
# Members whose support vector is strictly dominated by another
# member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))
# 3. Of those, which are NOT one-sided peaks?
# A one-sided member is a peak of exactly one cluster and gains
# nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())
# 4. Inspect the ones that survived: these are the data points that
# structurally connect the two clusters.
for idx in non_redundant:
    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
```
A typical run prints something like:
```
backend: cython12 threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
row 40: sim_a=0.428 sim_b=0.428
row 41: sim_a=0.412 sim_b=0.401
row 42: sim_a=0.402 sim_b=0.395
```
The library has filtered out the 40 cluster members (which sit at
their own cluster's peak and contribute nothing across cluster
boundaries) and identified the three synthetic "bridges" as the
structurally informative observations. This is the kind of
elementary operation that higher-level SEM reasoning composes into
concept discovery, gap detection and prototype prediction.
## 7. When to consider SEM
| Situation | Consider SEM |
|---|---|
| You have small data (10 to 10,000 examples) and need a defensible decision | Yes |
| You need to know *what is missing* from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |
## 8. Library availability
`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython
3.12 shared object. Installation is:
```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
and the compiled `.so`, plus `requirements.txt` and a README describing
the public API.
## 9. Summary
SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy. Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage. The `sem_cython12`
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.