vvs fa87dbb473 Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README

2026-05-09 19:24:57 +01:00

SEM — An Overview of Structural Reasoning

An introduction, written for readers outside the project, to the SEM (Similarity Energy Model) reasoning system, its applications, and the sem_cython12 library.

1. What SEM is

SEM is a reasoning system for discovering structure in observed data and producing decision-qualified predictions about new observations. Unlike conventional machine learning, SEM is not a parameterised model fitted to training data: its outputs are derived directly from the geometry of the observed world set. Where ML asks "what is the most likely label?", SEM asks "what is the structural position of this observation relative to everything we have seen?" — and reports the answer as a verdict, not a probability.

The system has been used as a discovery engine, an anomaly detector, a missing-mediator predictor, a regime-change identifier, and an explainable inference layer over neural-network embeddings. Each application reuses the same small set of structural operators.

2. Properties that distinguish SEM

Parameter-free. No learning rates, no regularisation coefficients, no tuning knobs in the reasoning pipeline. Every scale or boundary the system consults is computed from the data itself.
Threshold-free. No `if score > 0.85` decisions. Where conventional pipelines impose a numeric cut-off, SEM uses data-derived structural boundaries that adapt to the observed geometry.
Three-valued verdict. A prediction returns one of:
- confident: a single best-fitting concept dominates;
- gap: multiple concepts are equally admissible, signalling that the query lies in a region the current theory has not resolved;
- incoherent: no concept admits the query consistently; further data is required.
This refusal to guess is the system's most useful safety property: it never collapses uncertainty into a forced label.
Detects what is missing. SEM identifies positions where observed data should produce a structural witness but does not, and predicts the features the missing entity should carry. Conventional ML cannot signal that a hidden mediator or unobserved variable is required.
Explainable by construction. Every prediction comes with a decomposition of the supporting evidence, so a downstream system (or human reviewer) can audit which structural relations argue for a given verdict.
Composable across data types. The same reasoning apparatus applies to scalars, vectors, matrices, sampled functions, sampled manifolds, complex (quantum) state vectors, distributions, time-series windows, and recursive concept hierarchies. The operators see all of these through a common interface.
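
The three-valued verdict listed above can be pictured as a small Python type. This is an illustration only: `Verdict` and `verdict_from_admissible` are hypothetical names, not part of sem_cython12, and the sketch assumes an upstream structural test has already decided, for each concept, whether it admits the query. Treating "exactly one admitting concept" as the confident case is a simplification of "a single best-fitting concept dominates".

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"


def verdict_from_admissible(admissible: Sequence[bool]) -> Verdict:
    """Map per-concept admissibility flags to a verdict.

    Assumption: admissible[k] is True when concept k admits the query.
    How SEM derives these flags is not shown here.
    """
    n_admitting = sum(bool(a) for a in admissible)
    if n_admitting == 0:
        return Verdict.INCOHERENT   # no concept admits the query
    if n_admitting == 1:
        return Verdict.CONFIDENT    # a single concept admits the query
    return Verdict.GAP              # several concepts are equally admissible


print(verdict_from_admissible([False, True, False]).value)   # confident
print(verdict_from_admissible([True, True, False]).value)    # gap
print(verdict_from_admissible([False, False, False]).value)  # incoherent
```

Note that the mapping contains no numeric cut-off: the caller receives a label or an explicit abstention, never a score to be thresholded downstream.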

3. Where SEM has been applied

Domain	Capability used
Multivariate time series	Regime detection, forecast verdicts, anomaly identification
Scientific law discovery	Recovering analytic relationships from raw measurements
Drug / molecule screening	Structural similarity beyond fingerprints
Network monitoring	Silent-failure detection in encrypted traffic
Causal inference	Discovering missing variables from observational data
Image / signal analysis	Structural feature extraction with explainability
LLM explainability	Interpreting embedding-space behaviour
Geopolitical forecasting	Producing confident / abstain forecasts on event data
Trading & market structure	Regime-switch decisions with abstain semantics

In each case the value is the same: the system either gives a high-confidence answer or declines to answer, and never delivers a confident wrong answer disguised as a probability.

4. How SEM differs from machine learning

	Machine learning	SEM
Has training phase	yes	no
Has hyper-parameters	yes	no
Can detect missing entities	no	yes
Refuses to predict	no (returns argmax)	yes (gap / incoherent verdict)
Output	numeric / probabilistic	structural with verdict
Explanation	post-hoc (SHAP, LIME, attention)	inherent in the inference
Scale of usable data	requires many examples	works on small data, even single-digit examples

SEM and ML are not exclusive — SEM is sometimes layered on top of neural-network embeddings to provide an explainability and abstention layer, and ML can supply the embeddings SEM reasons over.

5. The `sem_cython12` library

sem_cython12 is the high-performance numerical kernel layer that backs SEM's reasoning operators. It is delivered as a pre-compiled Linux shared object plus a thin Python wrapper; users do not compile anything at install time.

The library exposes one module:

sem_cython12.wrapper — Python API over the compiled kernels.

Inside the module, the public functions are grouped by purpose.

5.1 Configuration

Function	Purpose
`available() -> bool`	Reports whether the compiled extension loaded
`backend() -> str`	`'cython12'` or `'python-fallback'`
`get_num_threads() -> int`	Active OpenMP worker count
`set_num_threads(n: int)`	Set OpenMP worker count (≥ 1)

OpenMP thread count defaults to roughly 50 % of the host's logical cores, so other processes are not starved on shared machines. The caller can override via set_num_threads() or the SEM_NUM_THREADS environment variable.
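
As a rough picture of the default described above, the following sketch computes a thread count the same way. It is not the library's code: `default_thread_count` is a hypothetical helper, and the rounding (floor of half the logical cores, minimum 1) and the precedence of `SEM_NUM_THREADS` over the computed default are assumptions.

```python
import os


def default_thread_count(env=None, logical_cores=None) -> int:
    """Hypothetical helper: pick a worker count of about half the cores.

    Assumptions: SEM_NUM_THREADS, when set, overrides the default, and
    the result is never below 1. A non-integer value raises ValueError.
    """
    env = os.environ if env is None else env
    override = env.get("SEM_NUM_THREADS")
    if override is not None and override.strip():
        return max(1, int(override))
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    return max(1, cores // 2)


print(default_thread_count(env={}, logical_cores=8))                          # 4
print(default_thread_count(env={"SEM_NUM_THREADS": "3"}, logical_cores=8))    # 3
```

With the real library, the equivalent explicit override would be a call to `set_num_threads(n)` from the table above.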

5.2 Distance and similarity

Function	What it does
`batch_max_similarity(X_query, X_members, lam)`	For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation.
`concept_support_matrix(X_query, member_mats, lam)`	The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix.
`pairwise_distances(X)`	Symmetric `(N, N)` distance matrix between rows of `X`.
`nn_distances(X)`	Per-row minimum positive distance to any other row.

These four cover the bulk of SEM's structural-similarity workload.
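
To make the shapes and roles of these four functions concrete, here is a NumPy-only reference sketch. The `_ref` functions are hypothetical stand-ins, not the library's implementation. Two things are assumed because this overview does not specify them: distances are Euclidean, and similarity decays as `exp(-distance / lam)`. The compiled kernels may use a different metric or decay.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) Euclidean distance matrix between rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nn_distances_ref(X):
    """Per-row minimum positive distance to any other row.

    Zero distances (the row itself and exact duplicates) are ignored;
    a row with no positive distance yields inf.
    """
    D = pairwise_distances_ref(X)
    D = np.where(D > 0.0, D, np.inf)
    return D.min(axis=1)


def batch_max_similarity_ref(X_query, X_members, lam):
    """Similarity in [0, 1] of each query row to its closest member row.

    Assumption: similarity = exp(-euclidean_distance / lam).
    """
    if lam <= 0:
        raise ValueError("lam must be > 0")
    Q = np.asarray(X_query, dtype=float)
    M = np.asarray(X_members, dtype=float)
    diff = Q[:, None, :] - M[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))      # (Q, M) distances
    return np.exp(-d.min(axis=1) / lam)        # closest member per query


def concept_support_matrix_ref(X_query, member_mats, lam):
    """(Q, K) matrix: one similarity column per reference set."""
    cols = [batch_max_similarity_ref(X_query, m, lam) for m in member_mats]
    return np.column_stack(cols)


X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
print(pairwise_distances_ref(X)[0, 1])                          # 5.0
print(concept_support_matrix_ref(X, [X[:1], X[1:]], 1.0).shape)  # (3, 2)
```

A reference like this is also a convenient baseline when checking a compiled kernel's output on small inputs, provided the metric and decay assumptions are first confirmed against the library's README.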

5.3 Pareto / dominance reasoning

Function	What it computes
`pareto_core_mask(S)`	Boolean mask of rows not strictly dominated in the maximisation order
`one_sided_mask(S)`	Per-row, per-column mask used for non-redundant-witness selection
`non_redundant_witnesses(S)`	Indices of rows that survive both the Pareto and one-sided filters

These let the caller reason about which observations meaningfully contribute to bridging multiple structural classes — versus those that are merely peaks of a single class.
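
The Pareto filter can be illustrated with a short NumPy sketch. `pareto_core_mask_ref` is a hypothetical reference function, and it assumes the usual definition of dominance under maximisation: row j dominates row i when j is at least as large in every column and strictly larger in at least one. The library may define "strictly dominated" differently (for example, strictly larger in every column). The one-sided filter is not sketched because its rule is not described here.

```python
import numpy as np


def pareto_core_mask_ref(S):
    """Boolean mask of rows of S that no other row dominates."""
    S = np.asarray(S, dtype=float)
    # ge[i, j]: row j >= row i in every column
    ge = (S[None, :, :] >= S[:, None, :]).all(axis=-1)
    # gt[i, j]: row j > row i in at least one column
    gt = (S[None, :, :] > S[:, None, :]).any(axis=-1)
    dominated = (ge & gt).any(axis=1)
    return ~dominated


S = np.array([
    [1.0, 0.1],   # peak of class 0
    [0.1, 1.0],   # peak of class 1
    [0.5, 0.5],   # balanced support for both classes
    [0.4, 0.4],   # dominated by the row above
])
print(pareto_core_mask_ref(S).tolist())   # [True, True, True, False]
```

This version builds an (N, N, k) comparison array, so it is only suitable for small inputs; it is meant to show the rule, not to match the compiled kernel's performance.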

5.4 Vector reduction

Function	What it computes
`extend_frontier_kernel(...)`	Fused centroid + radius computation for incremental hypothesis generation

Used by higher-level routines that need to enumerate candidate relational hypotheses bridging multiple regions of structural space.
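
The arguments of `extend_frontier_kernel` are not documented in this overview, so the sketch below does not attempt to reproduce its signature. It only illustrates the two quantities the table names, a centroid and a radius, for one group of points. `centroid_and_radius` is a hypothetical helper, and taking the radius as the largest Euclidean distance from the centroid is an assumption.

```python
import numpy as np


def centroid_and_radius(points):
    """Return (centroid, radius) for a group of row vectors.

    Assumption: radius = max Euclidean distance from the centroid.
    """
    P = np.asarray(points, dtype=float)
    centroid = P.mean(axis=0)
    radius = float(np.linalg.norm(P - centroid, axis=1).max())
    return centroid, radius


c, r = centroid_and_radius([[0.0, 0.0], [2.0, 0.0]])
print(c.tolist(), r)   # [1.0, 0.0] 1.0
```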

5.5 Performance

Measured on commodity x86_64 hardware with 8 OpenMP threads, relative to the equivalent pure-NumPy reference implementations:

Operation	Speed-up
`batch_max_similarity` (N=2000, D=50)	~14×
`pareto_core_mask` (N=1000, k=8)	~50×
Streaming kNN ingest (sliding-window, len=600)	~100×
Higher-arity hypothesis frontier (k=4, m=20)	brute force is intractable; pruned form runs sub-second

All routines release the GIL during their inner loops, so calling them concurrently from Python threads is safe.
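
Because the kernels release the GIL, a caller could split a large query batch across Python threads. The sketch below shows one way to do that. `map_in_threads` is a hypothetical helper, and `standin_kernel` is a NumPy placeholder so that the example runs without the library. In real use the kernel would be something like `lambda q: cy.batch_max_similarity(q, X_members, lam)`.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def map_in_threads(kernel, X_query, n_chunks=4, max_workers=4):
    """Apply a row-wise kernel to chunks of X_query in parallel threads.

    The kernel must return one value per input row; results are
    concatenated in the original row order.
    """
    chunks = np.array_split(np.asarray(X_query), n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(kernel, chunks))   # map preserves chunk order
    return np.concatenate(parts)


def standin_kernel(chunk):
    """Placeholder for a compiled kernel: one score per row."""
    return np.linalg.norm(chunk, axis=1)


X = np.arange(20.0).reshape(10, 2)
scores = map_in_threads(standin_kernel, X)
print(scores.shape)   # (10,)
```

Since each kernel call may also start its own OpenMP workers, combining Python threads with a high OpenMP count may oversubscribe the machine; lowering the count with `set_num_threads()` is one way to keep the total in check.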

6. A worked Python example

The following snippet uses only sem_cython12.wrapper and numpy. It shows how a downstream pipeline would identify the structurally informative members of a small synthetic dataset — those that mediate between two clusters rather than sitting at one cluster's peak.

import numpy as np
from sem_cython12 import wrapper as cy

assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), "  threads:", cy.get_num_threads())

# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) +  3.0
cluster_b = rng.standard_normal((20, 4)) -  3.0
bridges   = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.5, 0.5, -0.2, 0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])

# 1. Build a 2-class similarity matrix:
#    columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b])               # (N, 2)

# 2. Find the Pareto frontier of (sim_a, sim_b).
#    Members whose support vector is strictly dominated by another
#    member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))

# 3. Of those, which are NOT one-sided peaks?
#    A one-sided member is a peak of exactly one cluster and gains
#    nothing on the other.  We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())

# 4. Inspect the ones that survived: these are the data points that
#    structurally connect the two clusters.
for idx in non_redundant:
    print(f"  row {idx}:  sim_a={S[idx, 0]:.3f}  sim_b={S[idx, 1]:.3f}")

A typical run prints something like:

backend: cython12   threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
  row 40:  sim_a=0.428  sim_b=0.428
  row 41:  sim_a=0.412  sim_b=0.401
  row 42:  sim_a=0.402  sim_b=0.395

The library has filtered out the 40 cluster members (which sit at their own cluster's peak and contribute nothing across cluster boundaries) and identified the three synthetic "bridges" as the structurally informative observations. This is the kind of elementary operation that higher-level SEM reasoning composes into concept discovery, gap detection and prototype prediction.

7. When to consider SEM

Situation	Consider SEM
You have small data (10–10,000 examples) and need a defensible decision	Yes
You need to know what is missing from your data	Yes
You need a model that refuses to guess when the data is ambiguous	Yes
You want explanations that are inherent to the inference, not bolted on	Yes
You have millions of labelled examples and need raw classification accuracy	Stay with ML
You have a regression task with smooth dependencies	Stay with classical statistics

8. Library availability

sem_cython12 is distributed as a pre-compiled Linux x86_64 / CPython 3.12 shared object. Installation is:

git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"

The package contains sem_cython12/__init__.py, sem_cython12/wrapper.py, and the compiled .so, plus requirements.txt and a README describing the public API.
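
After installation, a short check along the following lines can confirm that the package is importable and report which backend is active. `sem_backend_status` is a hypothetical helper. It relies only on `backend()` from the configuration table above and returns a placeholder string when the package is not on the import path.

```python
import importlib
import importlib.util


def sem_backend_status() -> str:
    """Return the active backend name, or 'not-installed'."""
    if importlib.util.find_spec("sem_cython12") is None:
        return "not-installed"          # package not on PYTHONPATH
    cy = importlib.import_module("sem_cython12.wrapper")
    return cy.backend()                 # 'cython12' or 'python-fallback'


print("sem_cython12 backend:", sem_backend_status())
```

If this prints `python-fallback`, the wrapper imported but the compiled extension did not load, which would be expected on a platform other than Linux x86_64 with CPython 3.12.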

9. Summary

SEM is a structural reasoning system whose promise is decision quality, not raw accuracy. Its key product is a verdict-qualified prediction: the system tells you whether it is confident, whether the data is genuinely ambiguous, or whether the observation lies outside the apparatus's coherent coverage. The sem_cython12 library provides the high-performance numerical layer beneath this reasoning, exposing a small, well-defined Python API that downstream applications compose into domain-specific pipelines.
