4 Commits

Author SHA1 Message Date
vvs ed5ca0cafc v1.1.0: extend binary matrix to CPython 3.10/3.11/3.13 on Linux and Windows
- Linux x86_64: add cp310, cp311, cp313 (.so), built in conda-forge envs.
- Windows AMD64: add cp310, cp311, cp313 (.pyd), built with MSVC v14.50.
- All eight binaries verified to produce identical numerical output.
- README compatibility table + build provenance updated.
- macOS still deferred.
2026-05-10 11:15:11 +01:00
vvs fa87dbb473 Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README 2026-05-09 19:24:57 +01:00
dmytro.bogovych 80f99d1d15 - add 'what is this' section to README.md 2026-05-09 20:46:56 +03:00
vvs c886ded981 Vendor demos under demos/ and link from README for landing-page visibility 2026-05-09 15:25:52 +01:00
14 changed files with 1091 additions and 15 deletions
+26
@@ -9,6 +9,32 @@ release `MAJOR.MINOR.PATCH` increments
- `MINOR` on backwards-compatible feature additions,
- `PATCH` on backwards-compatible bug fixes.
## [1.1.0] - 2026-05-10
Binary matrix expanded to four CPython versions on both supported
platforms.
### Added
- Pre-compiled Linux x86_64 binaries for **CPython 3.10, 3.11, 3.13**
(`sem_core12.cpython-3{10,11,13}-x86_64-linux-gnu.so`). Built in
isolated conda-forge environments with conda-forge gcc, same
OpenMP and optimisation flags as the cp312 binary.
- Pre-compiled Windows AMD64 binaries for **CPython 3.10, 3.11, 3.13**
(`sem_core12.cp3{10,11,13}-win_amd64.pyd`). Built with MSVC v14.50
against the matching CPython installed via `winget`.
### Verified
- All eight binaries (4 Linux + 4 Windows) produce identical numerical
output, to within float64 round-off, for the same fixed-seed input on
`batch_max_similarity`.
### Compatibility notes
- macOS is still not provided in this release. Contact
`sales@sevana.biz` if you need a macOS build.
- numpy requirement unchanged: `numpy >= 1.23`.
## [1.0.0] - 2026-05-09
First public release.
+87 -13
@@ -4,34 +4,82 @@ OpenMP-parallel numerical kernel library for Python. Pre-built
Linux and Windows binaries included; no compilation required at
install time.
## What is this for?
`sem_cython12` is a small, focused toolbox of fast C-level routines
exposed through a thin numpy wrapper. It is not a general-purpose
numerical library; it accelerates three specific jobs that are
awkward or slow to do in pure numpy once `N` reaches the thousands:
1. **Similarity / distance over batches of vectors.** Full
pairwise distance matrices, nearest-neighbour distances, and
kernel-based `[0, 1]` similarity scores of a query set against
one or many reference sets. Useful for nearest-neighbour
search, kernel-density-style scoring, and "how close is each
query to this concept?" lookups.
2. **Multi-objective ("best-tradeoff") filtering of score matrices.**
Given a matrix of `N` candidates × `k` criteria, select the
rows on the Pareto frontier, isolate rows that only spike on a
single criterion, and recover the rows that contribute
meaningfully across several criteria - candidates a naive
sum-of-scores ranker would miss.
3. **An incremental aggregation primitive** for streaming
clustering / frontier-expansion algorithms: a fused bulk update
that, given `F` running summaries (centre + radius) and `A`
new contributions, produces all `F·A` updated summaries in one
parallel pass.
The kernels release the GIL, scale near-linearly to ~8 OpenMP
threads on commodity x86, and operate on shared-memory numpy
arrays with no inter-process serialisation. The Python wrapper
handles contiguous-float64 casting and fails loudly (via
`available()` / `backend()` plus `RuntimeError`) when the compiled
extension cannot load on the host - there is no slow pure-Python
fallback path.
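
The fail-loudly behaviour described above can be pictured with a minimal sketch. It is not the wrapper's actual code: the module name `hypothetical_compiled_core` and the function `some_kernel` are placeholders invented for illustration, and only the names `available()` and `backend()` come from the text.

```python
import importlib

import numpy as np

# Hypothetical extension module name, used only for this illustration.
_EXT_NAME = "hypothetical_compiled_core"

try:
    _core = importlib.import_module(_EXT_NAME)
    _import_error = None
except ImportError as exc:  # binary missing, or built for another ABI
    _core = None
    _import_error = exc


def available() -> bool:
    """Return True when the compiled extension was imported."""
    return _core is not None


def backend() -> str:
    """Name the active backend; there is deliberately no Python fallback."""
    return "compiled" if available() else "unavailable"


def as_kernel_input(x) -> np.ndarray:
    """Cast any array-like to the C-contiguous float64 layout a C kernel expects."""
    return np.ascontiguousarray(x, dtype=np.float64)


def some_kernel(x) -> np.ndarray:
    """Placeholder public function: raise if the extension is absent, else dispatch."""
    if _core is None:
        raise RuntimeError(
            f"compiled extension {_EXT_NAME!r} failed to load: {_import_error}"
        )
    return _core.some_kernel(as_kernel_input(x))
```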
The [`demos/`](./demos/) directory contains three runnable
end-to-end examples (Iris boundary discovery, parameter-free
anomaly detection, multi-criteria candidate selection) that
exercise these three jobs against well-known baselines.
## Contents
- `sem_cython12/sem_core12.cpython-3{10,11,12,13}-x86_64-linux-gnu.so` -
compiled extensions (Linux, x86_64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
- `sem_cython12/sem_core12.cp3{10,11,12,13}-win_amd64.pyd` -
compiled extensions (Windows, AMD64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
- `sem_cython12/wrapper.py` - Python API.
- `sem_cython12/__init__.py` - package entry.
Python's import system selects the correct binary for the running
interpreter automatically — install the whole package and the right
`.so` / `.pyd` is picked up by ABI tag.
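
To see which file name the running interpreter will look for, the standard library exposes the suffixes directly. The sketch below uses only `sysconfig` and `importlib.machinery`; the helper `expected_binary_name` is a hypothetical convenience, not part of the package.

```python
import importlib.machinery
import sysconfig

# Suffix this interpreter uses for its own extension modules, for example
# ".cpython-312-x86_64-linux-gnu.so" or ".cp312-win_amd64.pyd".
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# Every suffix the import system will try when searching for an extension.
accepted = list(importlib.machinery.EXTENSION_SUFFIXES)


def expected_binary_name(module: str = "sem_core12") -> str:
    """File name the import system would match first for this interpreter."""
    return module + ext_suffix


print("interpreter-specific binary:", expected_binary_name())
print("all accepted suffixes      :", accepted)
```

If the printed file name is not among the binaries listed above, the extension cannot load on that interpreter.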
## Compatibility
| Platform | Architecture | Python | Runtime requirements |
|---------------|--------------|-----------------------------------|-----------------------------------------|
| Linux | x86_64 | CPython 3.10 / 3.11 / 3.12 / 3.13 | glibc >= 2.31, libgomp |
| Windows 10/11 | AMD64 | CPython 3.10 / 3.11 / 3.12 / 3.13 | vcomp (ships with Windows) |
| macOS | - | - | not provided (contact sales@sevana.biz) |
Single Python dependency: `numpy >= 1.23` (see `requirements.txt`).
## How the binaries were built
- **Linux (`*.so`), cp312**: system gcc 13.3 on Ubuntu, OpenMP via
`libgomp`, flags `-O3 -ffast-math -march=native -fopenmp`.
- **Linux (`*.so`), cp310 / cp311 / cp313**: conda-forge gcc inside
isolated `python=3.10/3.11/3.13` envs (clean, system-Python-free
build), same OpenMP and optimisation flags.
- **Windows (`*.pyd`), all four versions**: MSVC v14.50 (Visual Studio
Build Tools 2026), OpenMP via `vcomp`, flags `/O2 /openmp`. Each
built against the matching CPython interpreter installed via
`winget`.
All eight binaries pass the same numerical smoke test
(`batch_max_similarity` over fixed-seed data) and produce identical
output to within float64 round-off.
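
A cross-build check of this kind could be scripted roughly as follows. This is a sketch under assumptions: the tolerances, the helper names, and the simulated per-build outputs are all hypothetical. In a real run each entry would hold the `batch_max_similarity` result saved by one interpreter/platform combination.

```python
import numpy as np


def fixed_seed_input(seed=0, n_query=64, n_ref=256, dim=8):
    """Deterministic query/reference arrays so every build sees the same data."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_query, dim)), rng.standard_normal((n_ref, dim))


def max_abs_deviation(outputs):
    """Largest element-wise gap between any build's output and the first one."""
    names = sorted(outputs)
    base = outputs[names[0]]
    return max(float(np.max(np.abs(outputs[name] - base))) for name in names)


def agree_within_roundoff(outputs, rtol=1e-12, atol=1e-12):
    """True when every build matches the first one within the given tolerances."""
    names = sorted(outputs)
    base = outputs[names[0]]
    return all(
        np.allclose(outputs[name], base, rtol=rtol, atol=atol) for name in names
    )


# Stand-in for real per-build results (hypothetical labels and values).
queries, reference = fixed_seed_input()
baseline = np.exp(-np.abs(queries).sum(axis=1))
simulated = {
    "cp312-linux": baseline,
    "cp313-win": baseline * (1.0 + 1e-15),  # round-off sized perturbation
}
print(agree_within_roundoff(simulated), max_abs_deviation(simulated))
```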
## Install
@@ -109,6 +157,32 @@ internally cast to contiguous `float64`. Outputs are numpy arrays.
See the wrapper docstrings for exact semantics of each function.
## Documentation
- [`docs/SEM_Overview.md`](./docs/SEM_Overview.md) — non-internal
introduction to SEM (Similarity Energy Model), what it does, and
how the `sem_cython12` library fits in.
- [`docs/SEM_Mathematical_Apparatus.md`](./docs/SEM_Mathematical_Apparatus.md)
— capabilities-level description of the operators and engines
exposed by the library.
## Demos
Three runnable demos live in [`demos/`](./demos/):
1. [`01_iris_boundary.py`](./demos/01_iris_boundary.py) — rediscovers
the famous Iris versicolor/virginica boundary specimens with no
training, using only `concept_support_matrix` and `pairwise_distances`.
2. [`02_anomaly_detection.py`](./demos/02_anomaly_detection.py) —
parameter-free anomaly detection that matches IsolationForest's
AUC=1.0 on a synthetic benchmark, using only `batch_max_similarity`.
3. [`03_multicriteria_selection.py`](./demos/03_multicriteria_selection.py) -
recovers all 5 hidden balanced candidates, 2 of which a naive
sum-of-scores top-10 misses, using `pareto_core_mask` and
`non_redundant_witnesses`.
A standalone copy of the demos repository is also published at
https://git.sevana.biz/vvs/sem_cython12-demos.
## Performance notes
Threads are configured globally per process; calling
+99
@@ -0,0 +1,99 @@
"""Demo 1 - Iris boundary rediscovery (no training).

The Iris dataset (Fisher 1936) contains 150 specimens, 50 from each of
three species: setosa, versicolor, virginica. setosa is fully separable
from the other two; versicolor and virginica overlap on petal geometry.
Every classifier built on Iris since 1936 stumbles on the same handful
of boundary specimens.

We find them WITHOUT training a classifier:
  1. Group specimens by species.
  2. Auto-derive a kernel scale from the data's own geometry.
  3. Compute the (150, 3) similarity matrix.
  4. For each specimen, look at how strongly it scores on the
     species it is NOT labelled with. Highest cross-species score
     ranks the most ambiguous specimens.

Run:
    python 01_iris_boundary.py
"""
from __future__ import annotations

import numpy as np
from sklearn.datasets import load_iris

from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    iris = load_iris()
    X = iris.data    # (150, 4)
    y = iris.target  # (150,)
    species_names = iris.target_names

    # Auto-derived kernel scale (median pairwise distance over the
    # whole dataset; no human picks this number).
    pd = cy.pairwise_distances(X)
    iu = np.triu_indices(pd.shape[0], k=1)
    lam = float(np.median(pd[iu]))
    print(f"Auto-derived kernel scale lam = {lam:.4f}\n")

    # Per-species reference sets
    member_sets = [X[y == k] for k in range(3)]

    # (150, 3) similarity matrix
    S = cy.concept_support_matrix(X, member_sets, lam=lam)

    # For each specimen, compute the highest similarity to a species
    # OTHER than its own. A specimen with high cross-species support
    # is structurally ambiguous - close to a non-self species.
    cross_score = np.empty(150)
    for i in range(150):
        own = y[i]
        cross_score[i] = max(S[i, j] for j in range(3) if j != own)

    # Rank specimens by cross-species score. Top entries = the famous
    # boundary cases.
    order = np.argsort(cross_score)[::-1]
    print("Top 10 most ambiguous specimens (highest cross-species score):\n")
    print(f" {'rank':>4} {'idx':>4} {'species':>11} "
          f"{'sim->setosa':>12} {'sim->versic':>12} {'sim->virgin':>12} cross")
    for rank, idx in enumerate(order[:10], 1):
        sims = S[idx]
        own = species_names[y[idx]]
        print(f" {rank:>4} {idx:>4} {own:>11} "
              f"{sims[0]:>12.4f} {sims[1]:>12.4f} {sims[2]:>12.4f} {cross_score[idx]:.4f}")

    # Distribution of those top 10 by species
    top10_species = [int(y[i]) for i in order[:10]]
    counts = {0: 0, 1: 0, 2: 0}
    for s in top10_species:
        counts[s] += 1
    print()
    print("Top 10 distribution by species:")
    for k, name in enumerate(species_names):
        print(f" {name:12s}: {counts[k]} of 10")

    print()
    print("Observation:")
    print(" setosa is fully separable from the other two (Fisher 1936),")
    print(" so we expect zero or near-zero setosa specimens in the top 10.")
    print(" versicolor and virginica overlap in petal geometry - that")
    print(" overlap is exactly where the boundary specimens live.")
    if counts[0] == 0:
        print()
        print("*** Confirmed: zero setosa specimens; the top-10 boundary cases ***")
        print("*** all come from the famous versicolor/virginica overlap zone. ***")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
+102
@@ -0,0 +1,102 @@
"""Demo 2 - Parameter-free anomaly detection.

Split a dataset into 'reference' (known-normal) and 'query' (a mix of
normal and anomalous), and score each query by its similarity to the
reference set. No labels touched on the query side, no thresholds
set by hand, no training step.

We compare against sklearn's IsolationForest (with default settings)
on the same data.

Run:
    python 02_anomaly_detection.py
"""
from __future__ import annotations

import numpy as np

from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    rng = np.random.default_rng(0)
    N_NORMAL = 500
    N_ANOMALY = 10
    D = 5

    # Generate data
    normal = rng.standard_normal((N_NORMAL, D))
    anomalies = rng.standard_normal((N_ANOMALY, D)) + 8.0

    # Split: 80% of normals are 'reference' (known good), 20% are
    # query. Queries also include all 10 anomalies.
    perm = rng.permutation(N_NORMAL)
    n_ref = int(0.8 * N_NORMAL)
    ref_idx = perm[:n_ref]
    query_normal_idx = perm[n_ref:]
    reference = normal[ref_idx]
    query_normal = normal[query_normal_idx]
    queries = np.vstack([query_normal, anomalies])
    y_query = np.concatenate([
        np.zeros(len(query_normal_idx), dtype=int),
        np.ones(N_ANOMALY, dtype=int),
    ])

    # Auto-derive scale from the reference set's geometry
    nn = cy.nn_distances(reference)
    lam = float(np.median(nn[np.isfinite(nn)]))

    # Score each query by similarity to the reference.
    # Lower similarity = farther from anything known = anomaly.
    sim = cy.batch_max_similarity(queries, reference, lam=lam)
    scores_sem = -sim  # higher score = more anomalous
    top_k_sem = np.argsort(scores_sem)[::-1][:N_ANOMALY]
    correct_sem = int(np.sum(y_query[top_k_sem] == 1))

    print("=" * 60)
    print("SEM (sem_cython12 - one batch_max_similarity call)")
    print("=" * 60)
    print(f" Top-{N_ANOMALY} retrieved as anomalous: precision = {correct_sem}/{N_ANOMALY}")

    try:
        from sklearn.metrics import roc_auc_score
        auc_sem = roc_auc_score(y_query, scores_sem)
        print(f" ROC AUC = {auc_sem:.4f}")

        from sklearn.ensemble import IsolationForest
        iso = IsolationForest(random_state=0, contamination='auto')
        iso.fit(reference)
        scores_iso = -iso.score_samples(queries)
        top_k_iso = np.argsort(scores_iso)[::-1][:N_ANOMALY]
        correct_iso = int(np.sum(y_query[top_k_iso] == 1))
        auc_iso = roc_auc_score(y_query, scores_iso)

        print()
        print("=" * 60)
        print("Baseline: sklearn IsolationForest (default settings)")
        print("=" * 60)
        print(f" Top-{N_ANOMALY} retrieved as anomalous: precision = {correct_iso}/{N_ANOMALY}")
        print(f" ROC AUC = {auc_iso:.4f}")
        print()
        print("=" * 60)
        if auc_sem >= auc_iso - 0.01:
            margin = auc_sem - auc_iso
            # The "+" format flag already prints the sign; adding a manual
            # prefix would produce "++0.0000".
            print(f"SEM matches IsolationForest within noise"
                  f" ({margin:+.4f} AUC),")
            print("with one function call and zero tuning.")
        else:
            print(f"IsolationForest leads by {auc_iso - auc_sem:.4f} AUC; "
                  f"SEM is competitive without parameters.")
    except ImportError:
        print("\n(Install scikit-learn to see the IsolationForest comparison.)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
+106
@@ -0,0 +1,106 @@
"""Demo 3 - Multi-criteria candidate selection.

You have 100 candidates evaluated on 4 independent criteria
(quality, cost-efficiency, robustness, compatibility - or whatever
your domain calls them). You want to pick the ones worth a deeper
look.

Naive ranking by total score finds the high-mean candidates - which
are often single-criterion peaks that compensate with weakness on
the rest.

SEM's two-stage filter
    1) best-tradeoff filter ('Pareto core')
    2) cross-criterion filter ('non-redundant witnesses')
finds the genuine all-rounders: candidates that are not strictly
worse than another on every axis AND that contribute meaningfully on
multiple axes (not just one).

Run:
    python 03_multicriteria_selection.py
"""
from __future__ import annotations

import numpy as np

from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    rng = np.random.default_rng(7)
    N, K = 100, 4
    criteria_names = ["Quality", "Cost-efficiency", "Robustness", "Compatibility"]

    # Most candidates: noisy uniform draws across the criteria
    S = rng.uniform(0.30, 0.95, size=(N, K))

    # Inject 5 hidden 'all-rounders' that score moderately well on EVERY
    # criterion - none top any single axis, but they're well-balanced.
    S[0:5] = rng.uniform(0.65, 0.85, size=(5, K))

    # ---- Naive ranking by sum of scores ---------------------------------
    naive_order = np.argsort(S.sum(axis=1))[::-1]
    naive_top10 = naive_order[:10]

    # ---- SEM ranking ----------------------------------------------------
    pareto_mask = cy.pareto_core_mask(S)
    pareto_idx = np.where(pareto_mask == 1)[0]
    nrw = cy.non_redundant_witnesses(S)

    # ---- Reporting ------------------------------------------------------
    print(f"Candidates : {N}")
    print(f"Criteria : {K} ({', '.join(criteria_names)})")
    print()
    print(f"Best-tradeoff frontier size : {len(pareto_idx)}")
    print(f"Cross-criterion winners (NRW) : {len(nrw)}")
    print("Hidden all-rounders we injected : 5 (indices 0-4)")
    print()

    overlap_with_hidden = set(nrw.tolist()) & set(range(5))
    naive_overlap_with_hidden = set(naive_top10.tolist()) & set(range(5))
    print(f"NRW recovered hidden all-rounders : "
          f"{len(overlap_with_hidden)}/5 {sorted(overlap_with_hidden)}")
    print(f"Naive top-10 found hidden all-rounders: "
          f"{len(naive_overlap_with_hidden)}/5 {sorted(naive_overlap_with_hidden)}")
    print()

    # Profile of NRW candidates
    print("Cross-criterion winners (NRW) - score profiles:")
    print(f" {'idx':>4} " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f" {'min':>5} {'mean':>5}")
    for i in nrw:
        scores = S[i]
        print(f" {int(i):>4} " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f" {scores.min():5.2f} {scores.mean():5.2f}")
    print()

    print("Naive top-3 (by total score) - score profiles for comparison:")
    print(f" {'idx':>4} " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f" {'min':>5} {'mean':>5}")
    for i in naive_top10[:3]:
        scores = S[i]
        print(f" {int(i):>4} " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f" {scores.min():5.2f} {scores.mean():5.2f}")
    print()

    # Wow line - honest comparison
    n_nrw_hits = len(overlap_with_hidden)
    n_naive_hits = len(naive_overlap_with_hidden)
    print(f"*** SEM's NRW filter recovered {n_nrw_hits}/5 hidden all-rounders. ***")
    print(f"*** Naive sum-of-scores top-10 found only {n_naive_hits}/5. ***")
    if n_nrw_hits > n_naive_hits:
        print(f"*** SEM surfaces {n_nrw_hits - n_naive_hits} candidates the naive ranking misses ***")
        print("*** because they don't peak on any single criterion. ***")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
+128
@@ -0,0 +1,128 @@
# sem_cython12 - sample projects
Three short, runnable Python projects that demonstrate the `sem_cython12`
library on small but realistic problems. Each demo is a single file,
self-contained, and produces a clear printable result.
The demos use **only** `sem_cython12.wrapper`, `numpy`, and (for the
Iris and anomaly demos) `scikit-learn`.
## What each demo shows
| File | Domain | "Wow" |
|---|---|---|
| [`01_iris_boundary.py`](./01_iris_boundary.py) | The 1936 Iris dataset | Rediscovers the famous versicolor/virginica boundary specimens **without training a classifier** and without setting any threshold. |
| [`02_anomaly_detection.py`](./02_anomaly_detection.py) | Synthetic 5-D anomalies | Detects 10/10 injected anomalies with **a single function call** and matches sklearn's IsolationForest on ROC AUC (1.0000 for both). |
| [`03_multicriteria_selection.py`](./03_multicriteria_selection.py) | Multi-criteria candidate ranking | Recovers all 5 **hidden all-rounders**, 2 of which a naive sum-of-scores top-10 misses. |
## Install
```bash
# Get the library (private repo)
git clone https://git.sevana.biz/vvs/sem_cython12.git ../sem_cython12
export PYTHONPATH="$(pwd)/../sem_cython12:$PYTHONPATH"
# Demo dependencies
pip install -r requirements.txt
```
Pre-built binaries ship with the library (as of v1.1.0: Linux x86_64
and Windows AMD64, CPython 3.10 to 3.13); no compilation step is
required.
## Run
```bash
python 01_iris_boundary.py
python 02_anomaly_detection.py
python 03_multicriteria_selection.py
```
Each demo finishes in well under a second on a laptop.
## What you'll see
### 01_iris_boundary.py
```
Auto-derived kernel scale lam = 3.4762
Top 10 most ambiguous specimens (highest cross-species score):
rank idx species sim->setosa sim->versic sim->virgin cross
1 138 virginica 0.2330 0.9096 1.0000 0.9096
2 70 versicolor 0.2396 1.0000 0.9096 0.9096
3 127 virginica 0.2222 0.8806 1.0000 0.8806
4 83 versicolor 0.2084 1.0000 0.8689 0.8689
5 133 virginica 0.2062 0.8689 1.0000 0.8689
...
Top 10 distribution by species:
setosa : 0 of 10
versicolor : 3 of 10
virginica : 7 of 10
*** Confirmed: zero setosa specimens; the top-10 boundary cases ***
*** all come from the famous versicolor/virginica overlap zone. ***
```
### 02_anomaly_detection.py
```
SEM (sem_cython12 - one batch_max_similarity call)
Top-10 retrieved as anomalous: precision = 10/10
ROC AUC = 1.0000
Baseline: sklearn IsolationForest (default settings)
Top-10 retrieved as anomalous: precision = 10/10
ROC AUC = 1.0000
SEM matches IsolationForest within noise (+0.0000 AUC),
with one function call and zero tuning.
```
### 03_multicriteria_selection.py
```
Best-tradeoff frontier size : 35
Cross-criterion winners (NRW) : 31
Hidden all-rounders we injected : 5 (indices 0-4)
NRW recovered hidden all-rounders : 5/5 [0, 1, 2, 3, 4]
Naive top-10 found hidden all-rounders: 3/5 [1, 2, 3]
*** SEM's NRW filter recovered 5/5 hidden all-rounders. ***
*** Naive sum-of-scores top-10 found only 3/5. ***
*** SEM surfaces 2 candidates the naive ranking misses ***
*** because they don't peak on any single criterion. ***
```
## What to try next
- Replace the synthetic data in `02_*` with your own observations and
see what gets flagged.
- Replace the synthetic candidate matrix in `03_*` with your
real-world multi-criteria evaluation (job applicants, vendor
proposals, product features, drug screens).
- Extend `01_*` to your own classification problems: any time you
have multiple classes with overlapping members, the cross-class
scores from `concept_support_matrix` surface the structurally
informative boundary cases.
The library has more capabilities than these three demos exercise.
See the `sem_cython12.wrapper` API for the full operator set
(pairwise distances, multi-class similarity matrix, incremental
aggregation, etc.).
## Licence
The demos and the underlying `sem_cython12` library are licensed
under the terms in the [LICENSE](./LICENSE) file:
- Research and non-commercial use: free under the conditions
stated in the licence.
- Commercial use: requires a separate written commercial licence.
Contact `sales@sevana.biz`.
- The Software is provided strictly "AS IS", without warranty of
any kind.
Please read the LICENSE file in full before using the demos or the
underlying library.
+270
@@ -0,0 +1,270 @@
# SEM — Mathematical Apparatus (Capability Catalog)
*A non-internal catalog of the operators SEM offers, what each is for,
and which entry points of the `sem_cython12` library back them.*
This document describes WHAT the apparatus does and WHERE to use it.
It does not describe HOW any operator works internally — algorithms,
formulas, lemmas and proofs are intentionally not reproduced here.
---
## Conventions
- "Item" / "world" / "observation": one row of input data. Items live
in some payload space (real numbers, vectors, matrices, sampled
functions, sampled manifolds, distributions, complex amplitudes,
time-series windows, recursive concept trees) — the apparatus
treats them uniformly via a small set of structural operators.
- "Concept": a subset of items that share structural meaning. The
apparatus can either be told the concepts (labelled mode) or
discover them from data (unsupervised mode).
- "Witness": an item whose structural position carries information
beyond merely belonging to one concept.
- "Verdict": the system's qualified output for a new observation -
one of `confident`, `gap`, `incoherent` (see §4.6).
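All of the apparatus is designed to be parameter-free and
threshold-free: there are no fitting parameters, no numeric cut-offs,
no fidelity knobs. The kernel scale `lam` accepted by the primitives
in §1 is derived from the data itself (§4.1) rather than tuned by hand.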
---
## 1. Structural similarity primitives
These are the lowest-level building blocks. Each is exposed directly
in `sem_cython12.wrapper`.
### 1.1 Pairwise similarity
| | |
|---|---|
| Purpose | Score how close a query item is to the most similar member of a reference set. |
| Output | A score in `[0, 1]` per query (1 = at the reference set, 0 = effectively far). |
| Applications | Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation. |
| Cython entry point | `batch_max_similarity(X_query, X_members, lam)` |
### 1.2 Multi-class similarity matrix
| | |
|---|---|
| Purpose | The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix. |
| Applications | Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters. |
| Cython entry point | `concept_support_matrix(X_query, member_mats, lam)` |
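
For readers who want a mental model of the shapes involved in §1.1 and §1.2, here is a NumPy-only sketch. It is not the library's implementation: the kernel is not documented here, so a Gaussian kernel `exp(-(d / lam) ** 2)` over Euclidean distance is assumed purely for illustration, and the `*_reference` function names are hypothetical.

```python
import numpy as np


def max_similarity_reference(X_query, X_members, lam):
    """Per-query similarity to the closest member, in [0, 1].

    ASSUMPTION: Gaussian kernel on Euclidean distance. Because the kernel
    decreases with distance, the maximum similarity over the members is
    the kernel evaluated at the minimum distance.
    """
    Xq = np.asarray(X_query, dtype=np.float64)
    Xm = np.asarray(X_members, dtype=np.float64)
    diff = Xq[:, None, :] - Xm[None, :, :]          # (Q, M, D)
    d_min = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return np.exp(-(d_min / lam) ** 2)              # (Q,)


def support_matrix_reference(X_query, member_mats, lam):
    """Stack the single-set scores for K reference sets into a (Q, K) matrix."""
    columns = [max_similarity_reference(X_query, M, lam) for M in member_mats]
    return np.column_stack(columns)


queries = np.array([[0.0, 0.0], [10.0, 10.0]])
set_a = np.array([[0.0, 0.0], [1.0, 0.0]])
set_b = np.array([[10.0, 10.0]])
print(support_matrix_reference(queries, [set_a, set_b], lam=2.0))
```

A query that coincides with a member scores exactly 1 on that set under this assumed kernel, which matches the `[0, 1]` convention in §1.1.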
### 1.3 Pairwise distance matrix
| | |
|---|---|
| Purpose | Symmetric `(N, N)` distance matrix between rows of `X`. |
| Applications | Graph construction, clustering, scale estimation, downstream filtering and ranking. |
| Cython entry point | `pairwise_distances(X)` |
### 1.4 Nearest-neighbour distance vector
| | |
|---|---|
| Purpose | For each row, the minimum positive distance to any other row. Rows with no positive-distance neighbour receive `inf`. |
| Applications | Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification. |
| Cython entry point | `nn_distances(X)` |
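
The semantics of §1.3 and §1.4, including the `inf` convention for rows without a positive-distance neighbour, can be reproduced in plain NumPy. This is a reference sketch only: Euclidean distance is an assumption (the metric is not stated above), the `*_reference` names are hypothetical, and the O(N²·D) broadcasting approach is for clarity, not speed.

```python
import numpy as np


def pairwise_distances_reference(X):
    """Symmetric (N, N) matrix of Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - X[None, :, :]   # (N, N, D)
    return np.sqrt((diff ** 2).sum(axis=2))


def nn_distances_reference(X):
    """Minimum positive distance from each row to any other row.

    Zero distances (the row itself and exact duplicates) are masked out,
    so a row whose only neighbours are duplicates receives inf.
    """
    D = pairwise_distances_reference(X)
    masked = np.where(D > 0.0, D, np.inf)
    return masked.min(axis=1)


points = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
print(pairwise_distances_reference(points))
print(nn_distances_reference(points))
```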
---
## 2. Multi-criterion filtering primitives
Given a real-valued matrix `S` of shape `(N, k)` (rows are items,
columns are independent criteria — each in maximisation orientation),
these primitives identify structurally informative subsets of rows.
### 2.1 Best-tradeoff filter
| | |
|---|---|
| Purpose | Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion). |
| Applications | Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis. |
| Cython entry point | `pareto_core_mask(S)` |
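
The parenthetical definition above can be written down directly in NumPy. The sketch below implements the literal reading of that sentence (a row is dropped only when some other row is strictly larger on every criterion); it is an assumption that the library uses exactly this rule, since textbook Pareto dominance also drops rows that are tied on some criteria and worse on others. The function name is hypothetical, and it returns a boolean mask, whereas the demo compares the library's mask with `== 1`.

```python
import numpy as np


def not_strictly_dominated_mask(S):
    """Boolean mask of rows that no other row beats on every column.

    ASSUMPTION: literal reading of "not strictly worse than another item
    on every criterion", with all columns in maximisation orientation.
    """
    S = np.asarray(S, dtype=np.float64)
    # beaten[i, j] is True when row j is strictly larger than row i everywhere.
    beaten = (S[None, :, :] > S[:, None, :]).all(axis=2)
    return ~beaten.any(axis=1)


scores = np.array([
    [1.0, 1.0],   # beaten on both criteria by the next row
    [2.0, 2.0],
    [3.0, 0.0],   # best on the first criterion
    [0.0, 3.0],   # best on the second criterion
])
print(not_strictly_dominated_mask(scores))
```

The comparison builds an `(N, N, k)` boolean array, so this form is only practical for small `N`.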
### 2.2 One-sided peak flagging
| | |
|---|---|
| Purpose | Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns - i.e. items that "peak" on a single criterion alone. |
| Applications | Removing items that are only locally informative; finding cross-criterion contributors; bridge identification. |
| Cython entry point | `one_sided_mask(S)` |
### 2.3 Non-redundant witness identification
| | |
|---|---|
| Purpose | The subset of rows that survive both 2.1 and 2.2 — items that contribute meaningfully across multiple criteria, not just on one. |
| Applications | Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis. |
| Cython entry point | `non_redundant_witnesses(S)` |
---
## 3. Incremental aggregation primitive
### 3.1 Fused centroid + radius update
| | |
|---|---|
| Purpose | One-pass bulk update for an incremental aggregation step. Given `F` reference items - each summarised by a centre vector and a radius (representing the dispersion of `cur_arity` underlying points) - and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item. |
| Applications | Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines. |
| Cython entry point | `extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)` |
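
The shape of this fused update can be illustrated with a NumPy sketch under explicit assumptions: the centre is taken to be the mean of the underlying points, the radius is taken to be their root-mean-square distance to that mean, and `cur_arity` is treated as one scalar shared by all `F` items. None of these definitions is confirmed above, and `extend_frontier_reference` is a hypothetical name. Under those assumptions, with $n$ = `cur_arity`, appending a point $x$ to a summary $(c, r)$ gives $c' = c + (x - c)/(n+1)$ and $r'^2 = \frac{n}{n+1}\left(r^2 + \frac{\lVert x - c\rVert^2}{n+1}\right)$.

```python
import numpy as np


def extend_frontier_reference(cur_centers, cur_radii, new_emb, cur_arity):
    """Return all F*A updated (centre, radius) pairs in one vectorised pass.

    ASSUMPTIONS: centre = mean of the points, radius = RMS distance to the
    mean, and cur_arity is a single count shared by every reference item.
    """
    C = np.asarray(cur_centers, dtype=np.float64)   # (F, D)
    r = np.asarray(cur_radii, dtype=np.float64)     # (F,)
    E = np.asarray(new_emb, dtype=np.float64)       # (A, D)
    n = float(cur_arity)

    delta = E[None, :, :] - C[:, None, :]           # (F, A, D)
    new_centers = C[:, None, :] + delta / (n + 1.0)
    sq_dist = (delta ** 2).sum(axis=2)              # (F, A)
    new_radii = np.sqrt(
        (n / (n + 1.0)) * (r[:, None] ** 2 + sq_dist / (n + 1.0))
    )
    return new_centers, new_radii


rng = np.random.default_rng(0)
groups = rng.standard_normal((2, 3, 4))             # F=2 items, 3 points each
candidates = rng.standard_normal((5, 4))            # A=5 new contributions
centers = groups.mean(axis=1)
radii = np.sqrt(((groups - centers[:, None, :]) ** 2).sum(axis=2).mean(axis=1))
upd_centers, upd_radii = extend_frontier_reference(centers, radii, candidates, 3)
print(upd_centers.shape, upd_radii.shape)
```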
---
## 4. Higher-level apparatus
Built on the primitives in §1–§3. These are the operators that
distinguish SEM as a reasoning system rather than a computation
library. Their internal construction is not reproduced here; the
"Cython entry points used" column lists the public primitives the
operator composes.
### 4.1 Intrinsic scale
| | |
|---|---|
| Purpose | Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required. |
| Applications | Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability. |
| Cython entry points used | `nn_distances`, `pairwise_distances` |
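
The two demos derive `lam` with simple median rules: demo 1 takes the median of all pairwise distances, and demo 2 takes the median of the finite nearest-neighbour distances. The NumPy sketch below mirrors those two rules without the compiled library. Whether the full intrinsic-scale operator uses the same rules is not stated here, the Euclidean metric is an assumption, and the function names are hypothetical.

```python
import numpy as np


def _euclidean_matrix(X):
    """Dense (N, N) Euclidean distance matrix (reference implementation)."""
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def median_pairwise_scale(X):
    """Median over the strict upper triangle, as in the Iris demo."""
    D = _euclidean_matrix(X)
    iu = np.triu_indices(D.shape[0], k=1)
    return float(np.median(D[iu]))


def median_nn_scale(X):
    """Median of finite nearest-neighbour distances, as in the anomaly demo."""
    D = _euclidean_matrix(X)
    nn = np.where(D > 0.0, D, np.inf).min(axis=1)
    return float(np.median(nn[np.isfinite(nn)]))


line = np.array([[0.0], [1.0], [3.0]])
print(median_pairwise_scale(line), median_nn_scale(line))
```

The pairwise rule reflects the global spread of the data, while the nearest-neighbour rule reflects local density, so the two can differ substantially on the same input.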
### 4.2 Concept discovery
| | |
|---|---|
| Purpose | Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds. Returns the concepts the data itself supports. |
| Applications | Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators. |
| Cython entry points used | `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
### 4.3 Relational hypothesis generation
| | |
|---|---|
| Purpose | Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support. |
| Applications | Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
### 4.4 Semantic gap detection
| | |
|---|---|
| Purpose | Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not. |
| Applications | Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity. |
| Cython entry points used | `concept_support_matrix`, `non_redundant_witnesses` |
### 4.5 Prototype construction
| | |
|---|---|
| Purpose | Predict the structural features of an item that should exist between known concepts but has not yet been observed. |
| Applications | Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition. |
| Cython entry points used | `batch_max_similarity`, `concept_support_matrix` |
### 4.6 Verdict-qualified inference
| | |
|---|---|
| Purpose | Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently). |
| Applications | Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
### 4.7 Lifecycle / dominance verification
| | |
|---|---|
| Purpose | When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype. Maintains the prototype's status across its lifetime. |
| Applications | Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference. |
| Cython entry points used | `pareto_core_mask` |
### 4.8 Hierarchical recursion
| | |
|---|---|
| Purpose | Apply every operator above to recursive concept trees — concepts whose members are themselves concepts. Operators bubble through the hierarchy and remain mathematically consistent at every level. |
| Applications | Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.). |
| Cython entry points used | the operators above, recursively |
### 4.9 Streaming kNN graph maintenance
| | |
|---|---|
| Purpose | Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update. |
| Applications | Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection. |
| Cython entry points used | `pairwise_distances`, `nn_distances` (on the contiguous buffer); `scipy.spatial.cKDTree` is used internally above 1000 items for exact O(log N) queries — no fidelity knob. |
### 4.10 Time-series streaming model
| | |
|---|---|
| Purpose | A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows. Optionally projects high-dimensional windows to lower dimensions when configured to do so. |
| Applications | Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting. |
| Cython entry points used | `nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9 |
---
## 5. Composition properties
The operators in §1–§4 compose along several axes:
- **Across payload types**: the same operator works for scalars,
vectors, matrices, tensors, functions, manifolds, complex states,
distributions, time-series windows. The caller supplies the
appropriate distance function or, equivalently, an embedding into
Euclidean space.
- **Across hierarchy levels**: concepts can themselves be members of
parent concepts; operators recurse through the tree (§4.8).
- **Under wrapping**: stochastic and temporal extensions can be
layered over any base payload type. Triple compositions like
"hierarchy of stochastic time-series" are admissible and produce
consistent results at every level.
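
As an illustration of the "embedding into Euclidean space" route, a complex state vector can be mapped to a real vector by concatenating its real and imaginary parts, which preserves Euclidean distance exactly. The helper name `complex_to_real` is hypothetical and not part of the library.

```python
import numpy as np


def complex_to_real(Z):
    """Map (N, D) complex rows to (N, 2D) real rows, preserving L2 distance."""
    Z = np.asarray(Z)
    return np.hstack([Z.real, Z.imag])


rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
X = complex_to_real(Z)

# The distance between two rows is the same before and after embedding.
d_complex = np.linalg.norm(Z[0] - Z[1])
d_real = np.linalg.norm(X[0] - X[1])
```

Whether plain Euclidean distance is the right notion for a given payload (for example, quantum states that differ only by a global phase) remains a modelling choice for the caller.
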
---
## 6. What the apparatus does NOT offer
Stated explicitly so users can plan around the limits:
- No probability distributions over outcomes. Verdicts are
structural, not Bayesian.
- No reward / objective optimisation. The apparatus does not learn
policies; it identifies structural relationships.
- No tuning knobs that trade fidelity for speed. Where some
alternatives expose `epsilon`, `top_k`, `temperature`, etc., the
apparatus uses data-derived structural boundaries instead.
- No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes).
Every kNN-related operator returns exact results.
---
## 7. Mapping summary
| Apparatus operator | Cython entry point(s) |
|---|---|
| Pairwise similarity | `batch_max_similarity` |
| Multi-class similarity | `concept_support_matrix` |
| Pairwise distance | `pairwise_distances` |
| Nearest-neighbour distance | `nn_distances` |
| Best-tradeoff filter | `pareto_core_mask` |
| One-sided peak flag | `one_sided_mask` |
| Non-redundant witness | `non_redundant_witnesses` |
| Fused centroid + radius update | `extend_frontier_kernel` |
| Intrinsic scale | composed of `nn_distances`, `pairwise_distances` |
| Concept discovery | composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
| Relational hypothesis generation | composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
| Semantic gap detection | composed of `concept_support_matrix`, `non_redundant_witnesses` |
| Prototype construction | composed of `batch_max_similarity`, `concept_support_matrix` |
| Verdict-qualified inference | composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
| Lifecycle / dominance verification | composed of `pareto_core_mask` |
| Hierarchical recursion | every operator above, recursively |
| Streaming kNN graph | `pairwise_distances`, `nn_distances` |
| Time-series streaming model | `nn_distances`, `concept_support_matrix`, streaming kNN |
## 8. Library availability
The Cython entry points in the right column of §7 are all in
`sem_cython12.wrapper`, distributed at
[https://git.sevana.biz/vvs/sem_cython12](https://git.sevana.biz/vvs/sem_cython12).
Higher-level apparatus (composed operators in §4) is built on those
primitives and ships in the SEM foundation package, separate from
this library.
+271
@@ -0,0 +1,271 @@
# SEM — An Overview of Structural Reasoning
*A non-internal introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the `sem_cython12` library.*
---
## 1. What SEM is
SEM is a reasoning system for **discovering structure in observed
data** and producing **decision-qualified predictions** about new
observations. Unlike conventional machine learning, SEM is not a
parameterised model fitted to training data: its outputs are derived
directly from the geometry of the observed world set. Where ML asks
"what is the most likely label?", SEM asks "what is the structural
position of this observation relative to everything we have seen?"
— and reports the answer as a verdict, not a probability.
The system has been used as a discovery engine, an anomaly detector,
a missing-mediator predictor, a regime-change identifier, and an
explainable inference layer over neural-network embeddings. Each
application reuses the same small set of structural operators.
## 2. Properties that distinguish SEM
- **Parameter-free.** No learning rates, no regularisation
coefficients, no tuning knobs in the reasoning pipeline. Every
scale or boundary the system consults is computed from the data
itself.
- **Threshold-free.** No `if score > 0.85` decisions. Where
conventional pipelines impose a numeric cut-off, SEM uses
data-derived structural boundaries that adapt to the observed
geometry.
- **Three-valued verdict.** A prediction returns one of:
  - **confident**: a single best-fitting concept dominates;
  - **gap**: multiple concepts are equally admissible, signalling
    that the query lies in a region the current theory has not
    resolved;
  - **incoherent**: no concept admits the query consistently;
    further data is required.

  This refusal-to-guess is the system's most useful safety property:
  it never collapses uncertainty into a forced label.
- **Detects what is missing.** SEM identifies positions where
observed data should produce a structural witness but does not, and
predicts the features the missing entity should carry. Conventional
ML cannot signal that a hidden mediator or unobserved variable is
required.
- **Explainable by construction.** Every prediction comes with a
decomposition of the supporting evidence, so a downstream system
(or human reviewer) can audit which structural relations argue for
a given verdict.
- **Composable across data types.** The same reasoning apparatus
applies to scalars, vectors, matrices, sampled functions, sampled
manifolds, complex (quantum) state vectors, distributions,
time-series windows, and recursive concept hierarchies. The operators
see all of these through a common interface.
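
To make the three-valued verdict concrete for downstream code, the sketch below maps a per-concept admissibility vector to one of the three outcomes. It is a simplification under stated assumptions: how SEM decides admissibility is not described here, so the flags are taken as given input, and the names `Verdict` and `verdict_from_admissible` are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"


def verdict_from_admissible(admissible):
    """Map per-concept admissibility flags to a three-valued verdict.

    `admissible` holds one boolean per concept, produced by an upstream
    structural test that this sketch does not model.
    """
    count = sum(bool(a) for a in admissible)
    if count == 0:
        return Verdict.INCOHERENT   # no concept admits the query
    if count == 1:
        return Verdict.CONFIDENT    # exactly one concept admits it
    return Verdict.GAP              # several concepts admit it


results = [
    verdict_from_admissible([False, True, False]),
    verdict_from_admissible([True, True, False]),
    verdict_from_admissible([False, False, False]),
]
```

A caller would typically act only on `CONFIDENT` and route `GAP` and `INCOHERENT` to review or further data collection.
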
## 3. Where SEM has been applied
| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |
In each case the value is the same: the system either gives a
high-confidence answer or refuses to, and never delivers a confident
wrong answer disguised as a probability.
## 4. How SEM differs from machine learning
| | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |
SEM and ML are not exclusive — SEM is sometimes layered on top of
neural-network embeddings to provide an explainability and abstention
layer, and ML can supply the embeddings SEM reasons over.
## 5. The `sem_cython12` library
`sem_cython12` is the high-performance numerical kernel layer that
backs SEM's reasoning operators. It is delivered as pre-compiled
binaries (Linux x86_64 `.so` and, as of v1.1.0, Windows AMD64 `.pyd`)
plus a thin Python wrapper; users do not compile anything at install
time.
The library exposes one module:
- `sem_cython12.wrapper` — Python API over the compiled kernels.
Inside the module, the public functions are grouped by purpose.
### 5.1 Configuration
| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |
OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines. The
caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
environment variable.
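
A caller that wants to mirror this policy in its own code might resolve a worker count as sketched below. The helper `resolve_thread_count` is hypothetical; how the library itself parses `SEM_NUM_THREADS` or rounds the 50% default is not documented here, so the precedence and rounding shown are assumptions.

```python
import os


def resolve_thread_count(env=os.environ):
    """Pick a worker count: SEM_NUM_THREADS if set, else about half the cores."""
    raw = env.get("SEM_NUM_THREADS")
    if raw is not None:
        return max(1, int(raw))          # set_num_threads expects n >= 1
    return max(1, (os.cpu_count() or 1) // 2)


n_threads = resolve_thread_count()
# With the library installed, the value could then be applied with:
#   from sem_cython12 import wrapper as cy
#   cy.set_num_threads(n_threads)
```
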
### 5.2 Distance and similarity
| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |
These four cover the bulk of SEM's structural-similarity workload.
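
For orientation, the two distance functions can be approximated by the pure-NumPy reference below. It assumes Euclidean distance, which this document does not state explicitly, and the `_ref` names are hypothetical stand-ins rather than library functions. The similarity kernels are not sketched because their exact decay form is not given here.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) Euclidean distance matrix between rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nn_distances_ref(X):
    """Per-row minimum positive distance to any other row."""
    D = pairwise_distances_ref(X)
    D[D <= 0.0] = np.inf   # ignore self-distances and exact duplicates
    return D.min(axis=1)


X = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]])
D = pairwise_distances_ref(X)
nn = nn_distances_ref(X)
```

The broadcasted difference uses O(N² · D) memory, so this form is only suitable for small inputs and for cross-checking results.
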
### 5.3 Pareto / dominance reasoning
| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |
These let the caller reason about which observations *meaningfully*
contribute to bridging multiple structural classes — versus those that
are merely peaks of a single class.
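
A pure-NumPy reference for the Pareto filter might look as follows. It assumes the standard dominance definition (at least as large in every column, strictly larger in at least one); the compiled kernel's exact tie handling is not specified here, and `pareto_core_mask_ref` is a hypothetical name.

```python
import numpy as np


def pareto_core_mask_ref(S):
    """True for rows of S that no other row dominates (maximisation).

    Row j dominates row i when S[j] >= S[i] in every column and
    S[j] > S[i] in at least one column.
    """
    S = np.asarray(S, dtype=float)
    ge = (S[:, None, :] >= S[None, :, :]).all(axis=-1)  # ge[j, i]: j >= i everywhere
    gt = (S[:, None, :] > S[None, :, :]).any(axis=-1)   # gt[j, i]: j > i somewhere
    dominated = (ge & gt).any(axis=0)                   # some j dominates i
    return ~dominated


S = np.array([
    [0.9, 0.1],   # peak of class 0
    [0.1, 0.9],   # peak of class 1
    [0.5, 0.5],   # balanced, not dominated
    [0.4, 0.4],   # dominated by the row above
])
mask = pareto_core_mask_ref(S)
```
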
### 5.4 Vector reduction
| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |
Used by higher-level routines that need to enumerate candidate
relational hypotheses bridging multiple regions of structural space.
### 5.5 Performance
Measured on commodity x86_64 hardware with 8 OpenMP threads against
the equivalent pure-numpy reference implementations:
| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |
All routines release the GIL during their inner loops, so calling
them concurrently from Python threads is safe.
## 6. A worked Python example
The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally
informative** members of a small synthetic dataset — those that
mediate between two clusters rather than sitting at one cluster's
peak.
```python
import numpy as np
from sem_cython12 import wrapper as cy
assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())
# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0,  0.0,  0.0,  0.0],
    [ 0.5,  0.5, -0.2,  0.1],
    [-0.3,  0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])
# 1. Build a 2-class similarity matrix:
# columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b]) # (N, 2)
# 2. Find the Pareto frontier of (sim_a, sim_b).
# Members whose support vector is strictly dominated by another
# member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))
# 3. Of those, which are NOT one-sided peaks?
# A one-sided member is a peak of exactly one cluster and gains
# nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())
# 4. Inspect the ones that survived: these are the data points that
# structurally connect the two clusters.
for idx in non_redundant:
    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
```
A typical run prints something like:
```
backend: cython12 threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
row 40: sim_a=0.428 sim_b=0.428
row 41: sim_a=0.412 sim_b=0.401
row 42: sim_a=0.402 sim_b=0.395
```
The library has filtered out the 40 cluster members (which sit at
their own cluster's peak and contribute nothing across cluster
boundaries) and identified the three synthetic "bridges" as the
structurally informative observations. This is the kind of
elementary operation that higher-level SEM reasoning composes into
concept discovery, gap detection and prototype prediction.
## 7. When to consider SEM
| Situation | Consider SEM |
|---|---|
| You have small data (10 to 10,000 examples) and need a defensible decision | Yes |
| You need to know *what is missing* from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |
## 8. Library availability
`sem_cython12` is distributed as pre-compiled binaries for CPython
3.10, 3.11, 3.12 and 3.13 on Linux x86_64 (`.so`) and Windows AMD64
(`.pyd`); macOS builds are not provided. Installation on Linux is:
```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```
The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
and the compiled `.so` / `.pyd` binaries, plus `requirements.txt` and a
README describing the public API.
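
After installation, a quick check that the package is importable and which backend loaded could look like this. It uses only `backend()` from §5.1; the helper name and the `'not-installed'` label are hypothetical additions for illustration.

```python
import importlib.util


def sem_backend_status():
    """Return 'not-installed', or whatever wrapper.backend() reports."""
    if importlib.util.find_spec("sem_cython12") is None:
        return "not-installed"
    from sem_cython12 import wrapper as cy
    return cy.backend()   # 'cython12' or 'python-fallback'


status = sem_backend_status()
print("sem_cython12 status:", status)
```

If the status is `'not-installed'`, the most likely cause is that the repository root is missing from `PYTHONPATH`.
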
## 9. Summary
SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy. Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage. The `sem_cython12`
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.
Binary file not shown.
Binary file not shown.
Binary file not shown.