Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README
@@ -0,0 +1,271 @@
# SEM: An Overview of Structural Reasoning

*A non-internal introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the `sem_cython12` library.*

---

## 1. What SEM is

SEM is a reasoning system for **discovering structure in observed
data** and producing **decision-qualified predictions** about new
observations. Unlike conventional machine learning, SEM is not a
parameterised model fitted to training data: its outputs are derived
directly from the geometry of the observed world set. Where ML asks
"what is the most likely label?", SEM asks "what is the structural
position of this observation relative to everything we have seen?"
and reports the answer as a verdict, not a probability.

The system has been used as a discovery engine, an anomaly detector,
a missing-mediator predictor, a regime-change identifier, and an
explainable inference layer over neural-network embeddings. Each
application reuses the same small set of structural operators.

## 2. Properties that distinguish SEM

- **Parameter-free.** No learning rates, no regularisation
  coefficients, no tuning knobs in the reasoning pipeline. Every
  scale or boundary the system consults is computed from the data
  itself.
- **Threshold-free.** No `if score > 0.85` decisions. Where
  conventional pipelines impose a numeric cut-off, SEM uses
  data-derived structural boundaries that adapt to the observed
  geometry.
- **Three-valued verdict.** A prediction returns one of:
  - **confident**: a single best-fitting concept dominates;
  - **gap**: multiple concepts are equally admissible, signalling
    that the query lies in a region the current theory has not
    resolved;
  - **incoherent**: no concept admits the query consistently;
    further data is required.

  This refusal-to-guess is the system's most useful safety property:
  it never collapses uncertainty into a forced label.
- **Detects what is missing.** SEM identifies positions where
  observed data should produce a structural witness but does not, and
  predicts the features the missing entity should carry. Conventional
  ML cannot signal that a hidden mediator or unobserved variable is
  required.
- **Explainable by construction.** Every prediction comes with a
  decomposition of the supporting evidence, so a downstream system
  (or human reviewer) can audit which structural relations argue for
  a given verdict.
- **Composable across data types.** The same reasoning apparatus
  applies to scalars, vectors, matrices, sampled functions, sampled
  manifolds, complex (quantum) state vectors, distributions,
  time-series windows, and recursive concept hierarchies. The
  operators see all of these through a common interface.

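The three-valued verdict can be pictured with a small sketch. It assumes a hypothetical boolean vector `admissible`, one entry per concept, saying whether that concept admits the query. How SEM actually derives admissibility is not described in this overview, and the names `Verdict` and `verdict_from_admissibility` are illustrative only, not part of any SEM API.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    """Illustrative three-valued outcome (hypothetical names)."""
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"


def verdict_from_admissibility(admissible: Sequence[bool]) -> Verdict:
    """Map per-concept admissibility flags to a verdict.

    Assumption: "exactly one admitting concept" stands in for
    "a single best-fitting concept dominates"; the real rule is
    not specified in this document.
    """
    n_admitting = sum(bool(a) for a in admissible)
    if n_admitting == 0:
        return Verdict.INCOHERENT   # no concept admits the query
    if n_admitting == 1:
        return Verdict.CONFIDENT    # one concept stands alone
    return Verdict.GAP              # several concepts tie: abstain


print(verdict_from_admissibility([False, True, False]).value)
```

The point of the sketch is the shape of the output: a caller branches on three named outcomes rather than comparing a score against a cut-off.
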
## 3. Where SEM has been applied

| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |

In each case the value is the same: the system either gives a
high-confidence answer or refuses to answer, and never delivers a
confident wrong answer disguised as a probability.

## 4. How SEM differs from machine learning

| Aspect | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |

SEM and ML are not mutually exclusive: SEM is sometimes layered on
top of neural-network embeddings to provide an explainability and
abstention layer, and ML can supply the embeddings SEM reasons over.

## 5. The `sem_cython12` library

`sem_cython12` is the high-performance numerical kernel layer that
backs SEM's reasoning operators. It is delivered as a pre-compiled
Linux shared object plus a thin Python wrapper; users do not compile
anything at install time.

The library exposes one module:

- `sem_cython12.wrapper`: Python API over the compiled kernels.

Inside the module, the public functions are grouped by purpose.

### 5.1 Configuration

| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |

OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines. The
caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
environment variable.

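The default described above can be approximated as follows. This is a sketch of the stated policy, not the library's own code: the helper name `default_thread_count` is hypothetical, and both the rounding and the assumption that `SEM_NUM_THREADS` takes precedence over the half-of-cores default are guesses.

```python
import os


def default_thread_count(env=None, logical_cores=None):
    """Hypothetical helper mirroring the documented default policy."""
    env = os.environ if env is None else env
    override = env.get("SEM_NUM_THREADS")
    if override is not None:
        # Assumption: an explicit override wins; clamp to the minimum of 1.
        return max(1, int(override))
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1  # cpu_count() may return None
    # "Roughly 50 %" of the logical cores, never fewer than one worker.
    return max(1, logical_cores // 2)


print(default_thread_count(env={}, logical_cores=8))                        # 4
print(default_thread_count(env={"SEM_NUM_THREADS": "2"}, logical_cores=8))  # 2
```

With the real wrapper, a value computed this way would be handed to `set_num_threads()`.
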
### 5.2 Distance and similarity

| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |

These four cover the bulk of SEM's structural-similarity workload.

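To make the table concrete, here is a plain-numpy sketch of what such functions could compute. It is not the library's implementation: the `_ref` names are hypothetical, and both the Euclidean metric and the kernel `exp(-distance / lam)` are assumptions, since this overview states only that the score lies in `[0, 1]` and decays with separation at a rate set by `lam`.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) Euclidean distance matrix between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nn_distances_ref(X):
    """Per-row minimum positive distance to any other row."""
    D = pairwise_distances_ref(X)
    D = np.where(D > 0.0, D, np.inf)   # skip self and exact duplicates
    return D.min(axis=1)


def batch_max_similarity_ref(X_query, X_members, lam):
    """Similarity in [0, 1] of each query row to its closest member row.

    Assumption: similarity = exp(-distance / lam) with Euclidean distance.
    """
    if lam <= 0:
        raise ValueError("lam must be > 0")
    diff = X_query[:, None, :] - X_members[None, :, :]
    d_min = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return np.exp(-d_min / lam)


def concept_support_matrix_ref(X_query, member_mats, lam):
    """(Q, K) matrix with one similarity column per reference set."""
    return np.column_stack(
        [batch_max_similarity_ref(X_query, M, lam) for M in member_mats]
    )


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_distances_ref(X)[0])                     # distances 0, 5, 10
print(nn_distances_ref(X))                              # 5 for every row
print(batch_max_similarity_ref(X[:1], X[1:], lam=5.0))  # exp(-1)
```

Under this assumed kernel a query that coincides with a member scores exactly 1, and the score falls towards 0 as the nearest member moves away.
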
### 5.3 Pareto / dominance reasoning

| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |

These let the caller reason about which observations *meaningfully*
contribute to bridging multiple structural classes, as opposed to
those that are merely peaks of a single class.

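The Pareto filter can be sketched in plain numpy. This is an illustrative reference, not the compiled kernel: the name `pareto_core_mask_ref` is hypothetical, and the dominance rule used (at least as large in every column and strictly larger in one) is an assumption about what "strictly dominated" means here.

```python
import numpy as np


def pareto_core_mask_ref(S):
    """True for rows of S that no other row dominates (maximisation).

    Assumption: row j dominates row i when S[j] >= S[i] in every
    column and S[j] > S[i] in at least one column.
    """
    S = np.asarray(S, dtype=float)
    # ge[j, i]: row j is at least as large as row i in every column.
    ge = (S[:, None, :] >= S[None, :, :]).all(axis=-1)
    # gt[j, i]: row j is strictly larger than row i in some column.
    gt = (S[:, None, :] > S[None, :, :]).any(axis=-1)
    dominated = (ge & gt).any(axis=0)   # does any row j dominate row i?
    return ~dominated


S = np.array([
    [1.0, 0.1],   # peak of class A
    [0.1, 1.0],   # peak of class B
    [0.5, 0.5],   # scores on both classes
    [0.4, 0.4],   # dominated by the row above
])
print(pareto_core_mask_ref(S))   # only the last row is dominated
```

Note that the two single-class peaks survive this filter; per the table above, separating them from genuinely bridging rows is the job of the one-sided filter.
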
### 5.4 Vector reduction

| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |

Used by higher-level routines that need to enumerate candidate
relational hypotheses bridging multiple regions of structural space.

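As a rough picture of a centroid-plus-radius reduction, the unfused numpy sketch below computes both quantities for one set of points. The helper `centroid_and_radius` is hypothetical and says nothing about the arguments of `extend_frontier_kernel`, which are not listed here. Taking the radius as the largest Euclidean distance from the centroid to any point is likewise an assumption.

```python
import numpy as np


def centroid_and_radius(points):
    """Return (centroid, radius) for the rows of `points`.

    Assumption: radius = largest Euclidean distance from the centroid
    to any row. The compiled kernel may define it differently.
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    radius = np.linalg.norm(points - centroid, axis=1).max()
    return centroid, float(radius)


square = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
centroid, radius = centroid_and_radius(square)
print(centroid, radius)   # centroid (1, 1), radius sqrt(2)
```

A fused kernel would presumably produce both values without materialising the intermediate difference array, which is where the speed-up over numpy would come from.
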
### 5.5 Performance

Measured on commodity x86_64 hardware with 8 OpenMP threads against
the equivalent pure-numpy reference implementations:

| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |

All routines release the GIL during their inner loops, so calling
them concurrently from Python threads is safe.

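Because the routines release the GIL, a caller could spread independent query chunks over a thread pool. The sketch below shows that pattern with a numpy stand-in, `score_chunk`, in place of a compiled call such as `batch_max_similarity`, so it runs without the extension installed. Both helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def score_chunk(chunk, members):
    """Stand-in for a GIL-releasing kernel call (hypothetical).

    Returns each query row's distance to its nearest member row.
    """
    diff = chunk[:, None, :] - members[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def score_in_threads(queries, members, n_workers=4):
    """Split the queries into chunks and score them on a thread pool."""
    chunks = np.array_split(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: score_chunk(c, members), chunks))
    return np.concatenate(parts)   # pool.map preserves chunk order


rng = np.random.default_rng(0)
queries = rng.standard_normal((1000, 8))
members = rng.standard_normal((50, 8))
threaded = score_in_threads(queries, members)
print(np.allclose(threaded, score_chunk(queries, members)))   # True
```

When each call already runs several OpenMP workers, it may be worth lowering the per-call count with `set_num_threads()` so that Python threads times OpenMP threads does not exceed the available cores.
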
## 6. A worked Python example

The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally
informative** members of a small synthetic dataset: those that
mediate between two clusters rather than sitting at one cluster's
peak.

```python
import numpy as np
from sem_cython12 import wrapper as cy

assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())

# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0,  0.0,  0.0,  0.0],
    [ 0.5,  0.5, -0.2,  0.1],
    [-0.3,  0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])

# 1. Build a 2-class similarity matrix:
#    columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b])  # (N, 2)

# 2. Find the Pareto frontier of (sim_a, sim_b).
#    Members whose support vector is strictly dominated by another
#    member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))

# 3. Of those, which are NOT one-sided peaks?
#    A one-sided member is a peak of exactly one cluster and gains
#    nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())

# 4. Inspect the ones that survived: these are the data points that
#    structurally connect the two clusters.
for idx in non_redundant:
    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
```

A typical run prints something like:

```
backend: cython12  threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
 row 40: sim_a=0.428 sim_b=0.428
 row 41: sim_a=0.412 sim_b=0.401
 row 42: sim_a=0.402 sim_b=0.395
```

The library has filtered out the 40 cluster members (which sit at
their own cluster's peak and contribute nothing across cluster
boundaries) and identified the three synthetic "bridges" as the
structurally informative observations. This is the kind of
elementary operation that higher-level SEM reasoning composes into
concept discovery, gap detection and prototype prediction.

## 7. When to consider SEM

| Situation | Consider SEM? |
|---|---|
| You have small data (10–10,000 examples) and need a defensible decision | Yes |
| You need to know *what is missing* from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |

## 8. Library availability

`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython
3.12 shared object. Installation is:

```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
# Prepend the checkout; add the separator only if PYTHONPATH is already set.
export PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"
```

The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
and the compiled `.so`, plus `requirements.txt` and a README describing
the public API.

## 9. Summary

SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy. Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage. The `sem_cython12`
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.