SEM — An Overview of Structural Reasoning
A non-internal introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the sem_cython12 library.
1. What SEM is
SEM is a reasoning system for discovering structure in observed data and producing decision-qualified predictions about new observations. Unlike conventional machine learning, SEM is not a parameterised model fitted to training data: its outputs are derived directly from the geometry of the observed world set. Where ML asks "what is the most likely label?", SEM asks "what is the structural position of this observation relative to everything we have seen?" — and reports the answer as a verdict, not a probability.
The system has been used as a discovery engine, an anomaly detector, a missing-mediator predictor, a regime-change identifier, and an explainable inference layer over neural-network embeddings. Each application reuses the same small set of structural operators.
2. Properties that distinguish SEM
- Parameter-free. No learning rates, no regularisation coefficients, no tuning knobs in the reasoning pipeline. Every scale or boundary the system consults is computed from the data itself.
- Threshold-free. No `if score > 0.85` decisions. Where conventional pipelines impose a numeric cut-off, SEM uses data-derived structural boundaries that adapt to the observed geometry.
- Three-valued verdict. A prediction returns one of:
  - confident: a single best-fitting concept dominates;
  - gap: multiple concepts are equally admissible, signalling that the query lies in a region the current theory has not resolved;
  - incoherent: no concept admits the query consistently; further data is required.

  This refusal to guess is the system's most useful safety property: it never collapses uncertainty into a forced label.
- Detects what is missing. SEM identifies positions where observed data should produce a structural witness but does not, and predicts the features the missing entity should carry. Conventional ML cannot signal that a hidden mediator or unobserved variable is required.
- Explainable by construction. Every prediction comes with a decomposition of the supporting evidence, so a downstream system (or human reviewer) can audit which structural relations argue for a given verdict.
- Composable across data types. The same reasoning apparatus applies to scalars, vectors, matrices, sampled functions, sampled manifolds, complex (quantum) state vectors, distributions, time-series windows, and recursive concept hierarchies. The operators see all of these through a common interface.
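The three-valued verdict described above can be illustrated with a minimal sketch. It assumes the caller already holds a per-concept admissibility mask; how SEM derives that mask from the data is not covered here, and the names `Verdict` and `verdict_from_admissible` are hypothetical, not part of any SEM API.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"


def verdict_from_admissible(admissible: Sequence[bool]) -> Verdict:
    """Map a per-concept admissibility mask to a three-valued verdict.

    admissible[k] is True when concept k admits the query. Producing this
    mask is outside the sketch (assumption): only the final mapping is shown.
    """
    n_admitting = sum(bool(a) for a in admissible)
    if n_admitting == 0:
        return Verdict.INCOHERENT  # no concept admits the query
    if n_admitting == 1:
        return Verdict.CONFIDENT   # a single concept dominates
    return Verdict.GAP             # several concepts are equally admissible


print(verdict_from_admissible([False, True, False]).value)  # confident
```

Note that the mapping contains no numeric cut-off: the verdict depends only on how many concepts admit the query, which is the property the list above emphasises.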
3. Where SEM has been applied
| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |
In each case the value is the same: the system either gives a high-confidence answer or declines to give one, and it does not present a forced guess as a probability.
4. How SEM differs from machine learning
| | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |
SEM and ML are not exclusive — SEM is sometimes layered on top of neural-network embeddings to provide an explainability and abstention layer, and ML can supply the embeddings SEM reasons over.
5. The sem_cython12 library
sem_cython12 is the high-performance numerical kernel layer that
backs SEM's reasoning operators. It is delivered as a pre-compiled
Linux shared object plus a thin Python wrapper; users do not compile
anything at install time.
The library exposes one module:
- `sem_cython12.wrapper`: Python API over the compiled kernels.
Inside the module, the public functions are grouped by purpose.
5.1 Configuration
| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |
OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines. The
caller can override via set_num_threads() or the SEM_NUM_THREADS
environment variable.
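The default described above can be approximated in plain Python. This is only a sketch of the stated policy: `resolve_thread_count` is a hypothetical helper, the exact rounding of "roughly 50 %" is an assumption, and how the library arbitrates between the environment variable and a later `set_num_threads()` call is not specified here.

```python
import os


def resolve_thread_count(env=None, logical_cores=None):
    """Hypothetical helper mirroring the documented default.

    Uses SEM_NUM_THREADS when it is set, otherwise about half of the
    logical cores (integer division is an assumption), never below 1.
    """
    env = os.environ if env is None else env
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    override = env.get("SEM_NUM_THREADS")
    if override is not None:
        n = int(override)
        if n < 1:
            raise ValueError("SEM_NUM_THREADS must be >= 1")
        return n
    return max(1, cores // 2)


print(resolve_thread_count(env={}, logical_cores=8))  # 4
```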
5.2 Distance and similarity
| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in [0, 1] summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across K independent reference sets, returning a (Q, K) score matrix. |
| `pairwise_distances(X)` | Symmetric (N, N) distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |
These four cover the bulk of SEM's structural-similarity workload.
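For readers who want to see the shape of these operations without the compiled library, here is a pure-numpy sketch. It is not the library's implementation: the Euclidean metric and the `exp(-d / lam)` decay are assumptions chosen only because they satisfy the stated contract (scores in [0, 1], decay controlled by `lam`), and the `_ref` function names are hypothetical.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) Euclidean distance matrix between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nn_distances_ref(X):
    """Per-row minimum positive distance to any other row."""
    D = pairwise_distances_ref(X)
    D = np.where(D > 0.0, D, np.inf)  # skip self-distance and exact duplicates
    return D.min(axis=1)


def batch_max_similarity_ref(X_query, X_members, lam):
    """Similarity to the closest member, ASSUMING an exp(-d / lam) kernel."""
    if lam <= 0:
        raise ValueError("lam must be > 0")
    diff = X_query[:, None, :] - X_members[None, :, :]
    d_min = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return np.exp(-d_min / lam)


def concept_support_matrix_ref(X_query, member_mats, lam):
    """(Q, K) matrix: one similarity column per reference set."""
    cols = [batch_max_similarity_ref(X_query, M, lam) for M in member_mats]
    return np.column_stack(cols)


X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
print(nn_distances_ref(X))
print(concept_support_matrix_ref(X, [X[:1], X[1:]], lam=1.0).shape)  # (3, 2)
```

Because the closest-member distance is non-negative, the assumed kernel always lands in (0, 1], and a query that coincides with a member scores exactly 1.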
5.3 Pareto / dominance reasoning
| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |
These let the caller reason about which observations meaningfully contribute to bridging multiple structural classes — versus those that are merely peaks of a single class.
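The Pareto filter can be sketched in a few lines of numpy. This is an illustrative reference, not the library's code: it assumes the standard definition of dominance (row j dominates row i when it is at least as large in every column and strictly larger in at least one), and `pareto_core_mask_ref` is a hypothetical name. The one-sided filter is not reproduced because its definition is not given here.

```python
import numpy as np


def pareto_core_mask_ref(S):
    """True for rows of S that no other row dominates (larger is better)."""
    S = np.asarray(S, dtype=float)
    # ge[j, i]: row j is >= row i in every column
    ge = (S[:, None, :] >= S[None, :, :]).all(axis=-1)
    # gt[j, i]: row j is > row i in at least one column
    gt = (S[:, None, :] > S[None, :, :]).any(axis=-1)
    dominated = (ge & gt).any(axis=0)  # some row j dominates row i
    return ~dominated


S = np.array([
    [0.90, 0.10],   # peak of class A
    [0.10, 0.90],   # peak of class B
    [0.50, 0.50],   # scores on both classes
    [0.40, 0.40],   # dominated by the row above
    [0.05, 0.80],   # dominated by the class-B peak
])
print(pareto_core_mask_ref(S).tolist())  # [True, True, True, False, False]
```

The broadcasted comparison costs O(N² k) memory and time, which is fine for illustration but is the kind of loop a compiled kernel would avoid materialising.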
5.4 Vector reduction
| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |
Used by higher-level routines that need to enumerate candidate relational hypotheses bridging multiple regions of structural space.
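As a rough picture of what a centroid-plus-radius reduction involves, the unfused numpy sketch below computes both quantities for one group of points. It does not reproduce the signature or behaviour of `extend_frontier_kernel`; the helper name `centroid_and_radius` is hypothetical, and defining the radius as the largest Euclidean distance from the centroid is an assumption.

```python
import numpy as np


def centroid_and_radius(points):
    """Centroid of a point set and the largest distance from it (assumed radius)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    radius = np.sqrt(((points - centroid) ** 2).sum(axis=1)).max()
    return centroid, float(radius)


centroid, radius = centroid_and_radius([[0.0, 0.0], [2.0, 0.0]])
print(centroid, radius)
```

A fused kernel would compute both results in a single pass over the data instead of the two passes used here.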
5.5 Performance
Measured on commodity x86_64 hardware with 8 OpenMP threads against the equivalent pure-numpy reference implementations:
| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |
All routines release the GIL during their inner loops, so calling them concurrently from Python threads is safe.
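One way to exploit this from Python is to split a large query batch into chunks and score them from a thread pool. The sketch below uses a numpy stand-in so that it runs without the compiled library; `kernel_stand_in` and `chunked_similarity` are hypothetical names, and the stand-in's `exp(-d / lam)` kernel is an assumption. In practice one would pass the wrapper's `batch_max_similarity` as `kernel`.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def kernel_stand_in(X_query, X_members, lam):
    """Placeholder with the call shape of batch_max_similarity (assumed kernel)."""
    diff = X_query[:, None, :] - X_members[None, :, :]
    return np.exp(-np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1) / lam)


def chunked_similarity(kernel, X_query, X_members, lam, n_chunks=4):
    """Score query chunks from separate Python threads and stitch the results."""
    chunks = np.array_split(X_query, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map preserves chunk order, so concatenation restores row order
        parts = list(pool.map(lambda c: kernel(c, X_members, lam), chunks))
    return np.concatenate(parts)


rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 3))
M = rng.standard_normal((5, 3))
scores = chunked_similarity(kernel_stand_in, Q, M, lam=1.0)
print(scores.shape)  # (10,)
```

Since the kernels also parallelise internally with OpenMP, combining Python threads with a high OpenMP worker count may oversubscribe the machine; lowering the count with `set_num_threads()` before fanning out is a reasonable precaution.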
6. A worked Python example
The following snippet uses only sem_cython12.wrapper and numpy.
It shows how a downstream pipeline would identify the structurally
informative members of a small synthetic dataset — those that
mediate between two clusters rather than sitting at one cluster's
peak.
import numpy as np
from sem_cython12 import wrapper as cy
assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())
# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0, 0.0,  0.0,  0.0],
    [ 0.5, 0.5, -0.2,  0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])
# 1. Build a 2-class similarity matrix:
# columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b]) # (N, 2)
# 2. Find the Pareto frontier of (sim_a, sim_b).
# Members whose support vector is strictly dominated by another
# member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))
# 3. Of those, which are NOT one-sided peaks?
# A one-sided member is a peak of exactly one cluster and gains
# nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())
# 4. Inspect the ones that survived: these are the data points that
# structurally connect the two clusters.
for idx in non_redundant:
    print(f"  row {idx}: sim_a={S[idx, 0]:.3f}  sim_b={S[idx, 1]:.3f}")
A typical run prints something like:
backend: cython12 threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
row 40: sim_a=0.428 sim_b=0.428
row 41: sim_a=0.412 sim_b=0.401
row 42: sim_a=0.402 sim_b=0.395
The library has filtered out the 40 cluster members (which sit at their own cluster's peak and contribute nothing across cluster boundaries) and identified the three synthetic "bridges" as the structurally informative observations. This is the kind of elementary operation that higher-level SEM reasoning composes into concept discovery, gap detection and prototype prediction.
7. When to consider SEM
| Situation | Consider SEM |
|---|---|
| You have small data (10–10,000 examples) and need a defensible decision | Yes |
| You need to know what is missing from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |
8. Library availability
sem_cython12 is distributed as a pre-compiled Linux x86_64 / CPython
3.12 shared object. Installation is:
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
The package contains sem_cython12/__init__.py, sem_cython12/wrapper.py,
and the compiled .so, plus requirements.txt and a README describing
the public API.
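After installation, a short check can confirm that the interpreter matches the documented build target and that the package is importable. This is a sketch: `check_sem_install` is a hypothetical helper, and it relies only on the `available()` and `backend()` functions listed in section 5.1.

```python
import importlib.util
import platform
import sys


def check_sem_install():
    """Report whether this interpreter matches the documented build target."""
    report = {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "linux_x86_64": platform.system() == "Linux"
        and platform.machine() == "x86_64",
        "package_found": importlib.util.find_spec("sem_cython12") is not None,
        "backend": None,
    }
    if report["package_found"]:
        try:
            from sem_cython12 import wrapper as cy
            report["available"] = cy.available()
            report["backend"] = cy.backend()
        except ImportError as exc:  # e.g. the .so does not match this interpreter
            report["backend"] = f"import failed: {exc}"
    return report


print(check_sem_install())
```

If `package_found` is false, the most likely cause is that `PYTHONPATH` does not include the cloned repository directory.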
9. Summary
SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy. Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage. The sem_cython12
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.