# SEM: An Overview of Structural Reasoning

*A non-internal introduction to the SEM (Similarity Energy Model) reasoning system, its applications, and the `sem_cython12` library.*

---

## 1. What SEM is

SEM is a reasoning system for **discovering structure in observed data** and producing **decision-qualified predictions** about new observations. Unlike conventional machine learning, SEM is not a parameterised model fitted to training data: its outputs are derived directly from the geometry of the observed world set. Where ML asks "what is the most likely label?", SEM asks "what is the structural position of this observation relative to everything we have seen?" and reports the answer as a verdict, not a probability.

The system has been used as a discovery engine, an anomaly detector, a missing-mediator predictor, a regime-change identifier, and an explainable inference layer over neural-network embeddings. Each application reuses the same small set of structural operators.

## 2. Properties that distinguish SEM

- **Parameter-free.** No learning rates, no regularisation coefficients, no tuning knobs in the reasoning pipeline. Every scale or boundary the system consults is computed from the data itself.
- **Threshold-free.** No `if score > 0.85` decisions. Where conventional pipelines impose a numeric cut-off, SEM uses data-derived structural boundaries that adapt to the observed geometry.
- **Three-valued verdict.** A prediction returns one of:
  - **confident**: a single best-fitting concept dominates;
  - **gap**: multiple concepts are equally admissible, signalling that the query lies in a region the current theory has not resolved;
  - **incoherent**: no concept admits the query consistently; further data is required.

  This refusal-to-guess is the system's most useful safety property: it never collapses uncertainty into a forced label.
- **Detects what is missing.** SEM identifies positions where observed data should produce a structural witness but does not, and predicts the features the missing entity should carry. Conventional ML cannot signal that a hidden mediator or unobserved variable is required.
- **Explainable by construction.** Every prediction comes with a decomposition of the supporting evidence, so a downstream system (or human reviewer) can audit which structural relations argue for a given verdict.
- **Composable across data types.** The same reasoning apparatus applies to scalars, vectors, matrices, sampled functions, sampled manifolds, complex (quantum) state vectors, distributions, time-series windows, and recursive concept hierarchies. The operators see all of these through a common interface.

## 3. Where SEM has been applied

| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |

In each case the value is the same: the system either gives a high-confidence answer or refuses to, and never delivers a confident wrong answer disguised as a probability.

## 4. How SEM differs from machine learning

| | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |

SEM and ML are not exclusive: SEM is sometimes layered on top of neural-network embeddings to provide an explainability and abstention layer, and ML can supply the embeddings SEM reasons over.

## 5. The `sem_cython12` library

`sem_cython12` is the high-performance numerical kernel layer that backs SEM's reasoning operators. It is delivered as a pre-compiled Linux shared object plus a thin Python wrapper; users do not compile anything at install time.

The library exposes one module:

- `sem_cython12.wrapper`: Python API over the compiled kernels.

Inside the module, the public functions are grouped by purpose.

### 5.1 Configuration

| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |

OpenMP thread count defaults to roughly 50% of the host's logical cores, so other processes are not starved on shared machines. The caller can override via `set_num_threads()` or the `SEM_NUM_THREADS` environment variable.

### 5.2 Distance and similarity

| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |

These four cover the bulk of SEM's structural-similarity workload.

### 5.3 Pareto / dominance reasoning

| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |

These let the caller reason about which observations *meaningfully* contribute to bridging multiple structural classes, as opposed to those that are merely peaks of a single class.

### 5.4 Vector reduction

| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |

Used by higher-level routines that need to enumerate candidate relational hypotheses bridging multiple regions of structural space.

### 5.5 Performance

Measured on commodity x86_64 hardware with 8 OpenMP threads against the equivalent pure-numpy reference implementations:

| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |

All routines release the GIL during their inner loops, so calling them concurrently from Python threads is safe.

## 6. A worked Python example

The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally informative** members of a small synthetic dataset: those that mediate between two clusters rather than sitting at one cluster's peak.

```python
import numpy as np
from sem_cython12 import wrapper as cy

assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())

# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0, 0.0,  0.0,  0.0],
    [ 0.5, 0.5, -0.2,  0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])

# 1. Build a 2-class similarity matrix:
#    columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b])  # (N, 2)

# 2. Find the Pareto frontier of (sim_a, sim_b).
#    Members whose support vector is strictly dominated by another
#    member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))

# 3. Of those, which are NOT one-sided peaks?
#    A one-sided member is a peak of exactly one cluster and gains
#    nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())

# 4. Inspect the ones that survived: these are the data points that
#    structurally connect the two clusters.
for idx in non_redundant:
    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
```

A typical run prints something like:

```
backend: cython12  threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
 row 40: sim_a=0.428 sim_b=0.428
 row 41: sim_a=0.412 sim_b=0.401
 row 42: sim_a=0.402 sim_b=0.395
```

The library has filtered out the 40 cluster members (which sit at their own cluster's peak and contribute nothing across cluster boundaries) and identified the three synthetic "bridges" as the structurally informative observations. This is the kind of elementary operation that higher-level SEM reasoning composes into concept discovery, gap detection and prototype prediction.

## 7. When to consider SEM

| Situation | Consider SEM |
|---|---|
| You have small data (10–10,000 examples) and need a defensible decision | Yes |
| You need to know *what is missing* from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |

## 8. Library availability

`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython 3.12 shared object. Installation is:

```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```

The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`, and the compiled `.so`, plus `requirements.txt` and a README describing the public API.

## 9. Summary

SEM is a structural reasoning system whose promise is decision quality, not raw accuracy.
Its key product is a verdict-qualified prediction: the system tells you whether it is confident, whether the data is genuinely ambiguous, or whether the observation lies outside the apparatus's coherent coverage. The `sem_cython12` library provides the high-performance numerical layer beneath this reasoning, exposing a small, well-defined Python API that downstream applications compose into domain-specific pipelines.
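
For readers who want to see what the Pareto step of the worked example computes without the compiled library, the following is a minimal pure-Python sketch. It assumes the standard definition of strict dominance in the maximisation order: row `a` dominates row `b` when `a` is at least as large in every column and strictly larger in at least one. The function name `pareto_core_mask_reference` is hypothetical, and this is not the library's implementation; `pareto_core_mask` may handle ties or numerical tolerance differently.

```python
def pareto_core_mask_reference(rows):
    """Return a list of booleans: True where a row is not strictly dominated.

    Assumed dominance rule (not confirmed against sem_cython12):
    `a` dominates `b` if a[k] >= b[k] for every column k and
    a[k] > b[k] for at least one column k.
    """
    def dominates(a, b):
        at_least_as_good = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return at_least_as_good and strictly_better

    # O(N^2) scan: a row survives if no other row dominates it.
    # A row never dominates itself, because the strict test fails.
    return [not any(dominates(other, row) for other in rows) for row in rows]


# Toy support matrix with columns (sim to class A, sim to class B).
S = [
    (0.90, 0.10),  # peak of class A
    (0.10, 0.90),  # peak of class B
    (0.40, 0.40),  # bridging candidate
    (0.30, 0.30),  # dominated by the bridging candidate
    (0.05, 0.05),  # dominated by every other row
]
mask = pareto_core_mask_reference(S)
print(mask)
```

Note that the two single-class peaks stay on the frontier alongside the bridging row; in the library it is the separate one-sided filter behind `non_redundant_witnesses`, not the Pareto mask, that sets such peaks aside.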