
# SEM — An Overview of Structural Reasoning

*A public introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the `sem_cython12` library.*

---

## 1.  What SEM is

SEM is a reasoning system for **discovering structure in observed
data** and producing **decision-qualified predictions** about new
observations.  Unlike conventional machine learning, SEM is not a
parameterised model fitted to training data: its outputs are derived
directly from the geometry of the observed world set.  Where ML asks
"what is the most likely label?", SEM asks "what is the structural
position of this observation relative to everything we have seen?"
— and reports the answer as a verdict, not a probability.

The system has been used as a discovery engine, an anomaly detector,
a missing-mediator predictor, a regime-change identifier, and an
explainable inference layer over neural-network embeddings.  Each
application reuses the same small set of structural operators.

## 2.  Properties that distinguish SEM

- **Parameter-free.**  No learning rates, no regularisation
  coefficients, no tuning knobs in the reasoning pipeline.  Every
  scale or boundary the system consults is computed from the data
  itself.
- **Threshold-free.**  No `if score > 0.85` decisions.  Where
  conventional pipelines impose a numeric cut-off, SEM uses
  data-derived structural boundaries that adapt to the observed
  geometry.
- **Three-valued verdict.**  A prediction returns one of:
  - **confident** — a single best-fitting concept dominates;
  - **gap** — multiple concepts are equally admissible, signalling
    that the query lies in a region the current theory has not
    resolved;
  - **incoherent** — no concept admits the query consistently;
    further data is required.

  This refusal-to-guess is the system's most useful safety property:
  it never collapses uncertainty into a forced label.
- **Detects what is missing.**  SEM identifies positions where
  observed data should produce a structural witness but does not, and
  predicts the features the missing entity should carry.  Conventional
  ML cannot signal that a hidden mediator or unobserved variable is
  required.
- **Explainable by construction.**  Every prediction comes with a
  decomposition of the supporting evidence, so a downstream system
  (or human reviewer) can audit which structural relations argue for
  a given verdict.
- **Composable across data types.**  The same reasoning apparatus
  applies to scalars, vectors, matrices, sampled functions, sampled
  manifolds, complex (quantum) state vectors, distributions,
  time-series windows, and recursive concept hierarchies.  The operators
  see all of these through a common interface.
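
The three-valued verdict described in the list above can be pictured with a small sketch.  It is a simplification, not SEM's actual decision logic: `Verdict` and `verdict_from_admissible` are hypothetical names, and the input (the set of concepts that admit a query) is assumed to have been computed elsewhere.

```python
from enum import Enum


class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"


def verdict_from_admissible(admissible):
    """Map the concepts that admit a query to a (verdict, concept) pair.

    ASSUMPTION: `admissible` is an iterable of concept names that
    consistently admit the query.  How SEM decides admissibility is not
    modelled here.
    """
    concepts = set(admissible)
    if not concepts:
        return Verdict.INCOHERENT, None       # no concept admits the query
    if len(concepts) == 1:
        return Verdict.CONFIDENT, next(iter(concepts))
    return Verdict.GAP, None                  # several candidates, no forced label


print(verdict_from_admissible({"regime_a"}))
print(verdict_from_admissible({"regime_a", "regime_b"}))
print(verdict_from_admissible(set()))
```

The point of the sketch is the return shape: only the confident case carries a concept, so a caller cannot mistake an abstention for a label.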

## 3.  Where SEM has been applied

| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |

In each case the value is the same: the system either gives a
high-confidence answer or declines to answer, rather than presenting
a forced guess as a probability.

## 4.  How SEM differs from machine learning

| Aspect | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |

SEM and ML are not exclusive — SEM is sometimes layered on top of
neural-network embeddings to provide an explainability and abstention
layer, and ML can supply the embeddings SEM reasons over.

## 5.  The `sem_cython12` library

`sem_cython12` is the high-performance numerical kernel layer that
backs SEM's reasoning operators.  It is delivered as a pre-compiled
Linux shared object plus a thin Python wrapper; users do not compile
anything at install time.

The library exposes one module:

- `sem_cython12.wrapper` — Python API over the compiled kernels.

Inside the module, the public functions are grouped by purpose.

### 5.1  Configuration

| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |

OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines.  The
caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
environment variable.
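
The default described above can be sketched in plain Python.  This is an illustration only: `resolve_num_threads` is a hypothetical helper, and the handling of invalid values (ignored here) is an assumption, since the library's own resolution rules are not documented in this overview.

```python
import os


def resolve_num_threads(environ=None, logical_cores=None):
    """Pick a worker count: SEM_NUM_THREADS if valid, else about half the cores."""
    environ = os.environ if environ is None else environ
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    raw = environ.get("SEM_NUM_THREADS", "").strip()
    if raw:
        try:
            requested = int(raw)
        except ValueError:
            requested = 0                 # ASSUMPTION: non-numeric values are ignored
        if requested >= 1:
            return requested
    return max(1, cores // 2)             # default: roughly 50 % of logical cores


print(resolve_num_threads({}, logical_cores=16))                          # prints 8
print(resolve_num_threads({"SEM_NUM_THREADS": "3"}, logical_cores=16))    # prints 3
```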

### 5.2  Distance and similarity

| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`.  `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |

These four cover the bulk of SEM's structural-similarity workload.
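
To make the table concrete, here is a pure-Python reference sketch of the first two operations.  The function names with the `_ref` suffix are hypothetical, and the decay law is an assumption: the sketch uses `exp(-distance / lam)` with Euclidean distance, which satisfies the documented properties (score in `[0, 1]`, decaying with separation at a rate set by `lam`) but may not be the formula the compiled kernel uses.

```python
import math


def batch_max_similarity_ref(queries, members, lam):
    """For each query row, similarity to its closest member row.

    ASSUMPTION: similarity = exp(-d / lam), d = Euclidean distance.
    """
    if lam <= 0:
        raise ValueError("lam must be > 0")
    scores = []
    for q in queries:
        d_min = min(math.dist(q, m) for m in members)   # nearest member
        scores.append(math.exp(-d_min / lam))           # 1.0 at d=0, decays to 0
    return scores


def concept_support_matrix_ref(queries, member_sets, lam):
    """(Q, K) scores: column k holds the similarity to reference set k."""
    columns = [batch_max_similarity_ref(queries, ms, lam) for ms in member_sets]
    return [list(row) for row in zip(*columns)]         # transpose K x Q -> Q x K


queries = [(0.0, 0.0), (1.0, 0.0)]
set_a = [(0.0, 0.0), (5.0, 5.0)]
set_b = [(1.0, 1.0)]
print(concept_support_matrix_ref(queries, [set_a, set_b], lam=1.0))
```

A reference like this is far slower than the compiled kernels, but it is handy for checking shapes and value ranges on small inputs.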

### 5.3  Pareto / dominance reasoning

| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |

These let the caller reason about which observations *meaningfully*
contribute to bridging multiple structural classes — versus those that
are merely peaks of a single class.
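
The Pareto filter can be illustrated with a short pure-Python reference.  It is a sketch under one stated assumption: a row is treated as dominated when another row is at least as large in every column and larger in at least one.  The compiled `pareto_core_mask` may define strict dominance differently, and `pareto_core_mask_ref` is a hypothetical name.

```python
def dominates(a, b):
    """True if row a is >= row b in every column and > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_core_mask_ref(S):
    """Boolean mask: True for rows that no other row dominates."""
    return [
        not any(dominates(other, row) for j, other in enumerate(S) if j != i)
        for i, row in enumerate(S)
    ]


S = [
    (1.0, 0.1),   # peak of class 0
    (0.1, 1.0),   # peak of class 1
    (0.6, 0.6),   # scores on both classes
    (0.5, 0.5),   # dominated by the row above
]
print(pareto_core_mask_ref(S))   # prints [True, True, True, False]
```

Note that both single-class peaks survive the Pareto filter here; separating them from the row that scores on both classes is the job of the one-sided filter, which this sketch does not attempt to reproduce.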

### 5.4  Vector reduction

| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |

Used by higher-level routines that need to enumerate candidate
relational hypotheses bridging multiple regions of structural space.

### 5.5  Performance

Measured on commodity x86_64 hardware with 8 OpenMP threads against
the equivalent pure-NumPy reference implementations:

| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |

All routines release the GIL during their inner loops, so calling
them concurrently from Python threads is safe.
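
One way to use this from Python threads is to split a large query set into chunks and hand each chunk to a worker.  The sketch below is generic: `map_in_chunks` is a hypothetical helper and `row_sums` is a stand-in kernel so that the code runs without the compiled extension.  In practice the `kernel` argument would wrap one of the library calls.

```python
from concurrent.futures import ThreadPoolExecutor


def map_in_chunks(kernel, rows, chunk_size, max_workers=4):
    """Run `kernel` on consecutive chunks of `rows` in worker threads and
    concatenate the per-chunk results in the original order."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(kernel, chunks))    # map preserves chunk order
    return [value for chunk_result in results for value in chunk_result]


# Stand-in kernel (not part of sem_cython12): one value per input row.
def row_sums(chunk):
    return [sum(row) for row in chunk]


rows = [(i, i + 1) for i in range(10)]
print(map_in_chunks(row_sums, rows, chunk_size=3))
# prints [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

Because the kernels also parallelise internally with OpenMP, stacking Python threads on top may oversubscribe the cores; lowering the per-call worker count with `set_num_threads()` is one way to balance the two levels.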

## 6.  A worked Python example

The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally
informative** members of a small synthetic dataset — those that
mediate between two clusters rather than sitting at one cluster's
peak.

```python
import numpy as np
from sem_cython12 import wrapper as cy

assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), "  threads:", cy.get_num_threads())

# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) +  3.0
cluster_b = rng.standard_normal((20, 4)) -  3.0
bridges   = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.5, 0.5, -0.2, 0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])

# 1. Build a 2-class similarity matrix:
#    columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b])               # (N, 2)

# 2. Find the Pareto frontier of (sim_a, sim_b).
#    Members whose support vector is strictly dominated by another
#    member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))

# 3. Of those, which are NOT one-sided peaks?
#    A one-sided member is a peak of exactly one cluster and gains
#    nothing on the other.  We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())

# 4. Inspect the ones that survived: these are the data points that
#    structurally connect the two clusters.
for idx in non_redundant:
    print(f"  row {idx}:  sim_a={S[idx, 0]:.3f}  sim_b={S[idx, 1]:.3f}")
```

A typical run prints something like the following; the reported
thread count depends on the host:

```
backend: cython12   threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
  row 40:  sim_a=0.428  sim_b=0.428
  row 41:  sim_a=0.412  sim_b=0.401
  row 42:  sim_a=0.402  sim_b=0.395
```

The library has filtered out the 40 cluster members (which sit at
their own cluster's peak and contribute nothing across cluster
boundaries) and identified the three synthetic "bridges" as the
structurally informative observations.  This is the kind of
elementary operation that higher-level SEM reasoning composes into
concept discovery, gap detection and prototype prediction.
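
The example above fixes `lam=1.0` for brevity.  A scale computed from the data itself would be closer to the parameter-free design described in section 2.  The sketch below shows one plausible choice, the median nearest-neighbour distance.  This is an assumption for illustration, not a documented SEM rule, and both function names are hypothetical.

```python
import math
import statistics


def nn_distances_ref(rows):
    """Per-row minimum positive distance to any other row (pure Python)."""
    out = []
    for i, a in enumerate(rows):
        distances = (math.dist(a, b) for j, b in enumerate(rows) if j != i)
        positive = [d for d in distances if d > 0.0]     # skip exact duplicates
        out.append(min(positive) if positive else math.inf)
    return out


def data_derived_lam(rows):
    """ASSUMPTION: use the median nearest-neighbour distance as the scale."""
    finite = [d for d in nn_distances_ref(rows) if math.isfinite(d)]
    if not finite:
        raise ValueError("need at least two distinct rows")
    return statistics.median(finite)


points = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0), (4.0, 3.0)]
print(data_derived_lam(points))   # prints 1.5
```

With the library loaded, the per-row distances could presumably come from `nn_distances(members)` instead of the pure-Python helper.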

## 7.  When to consider SEM

| Situation | Recommendation |
|---|---|
| You have small data (10–10,000 examples) and need a defensible decision | Consider SEM |
| You need to know *what is missing* from your data | Consider SEM |
| You need a model that refuses to guess when the data is ambiguous | Consider SEM |
| You want explanations that are inherent to the inference, not bolted on | Consider SEM |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |

## 8.  Library availability

`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython
3.12 shared object.  Installation is:

```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"
```

The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
and the compiled `.so`, plus `requirements.txt` and a README describing
the public API.
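
After installation, a quick check along the following lines can confirm that the package is on the path and which backend is active.  It is a minimal sketch: `describe_backend` is a hypothetical helper, and it relies only on the `available()`, `backend()` and `get_num_threads()` functions listed in section 5.1.

```python
import importlib


def describe_backend(module_name="sem_cython12.wrapper"):
    """Return a one-line status string without raising if the import fails."""
    try:
        cy = importlib.import_module(module_name)
    except ImportError as exc:
        return f"{module_name}: not importable ({exc})"
    if not cy.available():
        return f"{module_name}: imported, compiled extension not loaded"
    return f"{module_name}: backend={cy.backend()} threads={cy.get_num_threads()}"


print(describe_backend())
```

If the first branch is reported, the usual cause is that `PYTHONPATH` does not include the cloned repository root.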

## 9.  Summary

SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy.  Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage.  The `sem_cython12`
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.