Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | ed5ca0cafc | |
| | fa87dbb473 | |
| | 80f99d1d15 | |
| | c886ded981 | |
@@ -9,6 +9,32 @@ release `MAJOR.MINOR.PATCH` increments
 - `MINOR` on backwards-compatible feature additions,
 - `PATCH` on backwards-compatible bug fixes.
 
+## [1.1.0] - 2026-05-10
+
+Binary matrix expanded to four CPython versions on both supported
+platforms.
+
+### Added
+
+- Pre-compiled Linux x86_64 binaries for **CPython 3.10, 3.11, 3.13**
+  (`sem_core12.cpython-3{10,11,13}-x86_64-linux-gnu.so`). Built in
+  isolated conda-forge environments with conda-forge gcc, same
+  OpenMP and optimisation flags as the cp312 binary.
+- Pre-compiled Windows AMD64 binaries for **CPython 3.10, 3.11, 3.13**
+  (`sem_core12.cp3{10,11,13}-win_amd64.pyd`). Built with MSVC v14.50
+  against the matching CPython installed via `winget`.
+
+### Verified
+
+- All eight binaries (4 Linux + 4 Windows) produce identical numerical
+  output for the same fixed-seed input on `batch_max_similarity`.
+
+### Compatibility notes
+
+- macOS is still not provided in this release. Contact
+  `sales@sevana.biz` if you need a macOS build.
+- numpy requirement unchanged: `numpy >= 1.23`.
+
 ## [1.0.0] - 2026-05-09
 
 First public release.
@@ -4,34 +4,82 @@ OpenMP-parallel numerical kernel library for Python. Pre-built
 Linux and Windows binaries included; no compilation required at
 install time.
 
+## What is this for?
+
+`sem_cython12` is a small, focused toolbox of fast C-level routines
+exposed through a thin numpy wrapper. It is not a general-purpose
+numerical library; it accelerates three specific jobs that are
+awkward or slow to do in pure numpy once `N` reaches the thousands:
+
+1. **Similarity / distance over batches of vectors.** Full
+   pairwise distance matrices, nearest-neighbour distances, and
+   kernel-based `[0, 1]` similarity scores of a query set against
+   one or many reference sets. Useful for nearest-neighbour
+   search, kernel-density-style scoring, and "how close is each
+   query to this concept?" lookups.
+2. **Multi-objective ("best-tradeoff") filtering of score matrices.**
+   Given a matrix of `N` candidates × `k` criteria, select the
+   rows on the Pareto frontier, isolate rows that only spike on a
+   single criterion, and recover the rows that contribute
+   meaningfully across several criteria - candidates a naive
+   sum-of-scores ranker would miss.
+3. **An incremental aggregation primitive** for streaming
+   clustering / frontier-expansion algorithms: a fused bulk update
+   that, given `F` running summaries (centre + radius) and `A`
+   new contributions, produces all `F·A` updated summaries in one
+   parallel pass.
+
+The kernels release the GIL, scale near-linearly to ~8 OpenMP
+threads on commodity x86, and operate on shared-memory numpy
+arrays with no inter-process serialisation. The Python wrapper
+handles contiguous-float64 casting and degrades loudly (via
+`available()` / `backend()` plus `RuntimeError`) when the compiled
+extension cannot load on the host - there is no slow pure-Python
+fallback path.
+
+The [`demos/`](./demos/) directory contains three runnable
+end-to-end examples (Iris boundary discovery, parameter-free
+anomaly detection, multi-criteria candidate selection) that
+exercise these three jobs against well-known baselines.
+
 ## Contents
 
-- `sem_cython12/sem_core12.cpython-312-x86_64-linux-gnu.so` -
-  compiled extension (Linux, CPython 3.12, x86_64).
-- `sem_cython12/sem_core12.cp312-win_amd64.pyd` -
-  compiled extension (Windows, CPython 3.12, AMD64).
+- `sem_cython12/sem_core12.cpython-3{10,11,12,13}-x86_64-linux-gnu.so` -
+  compiled extensions (Linux, x86_64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
+- `sem_cython12/sem_core12.cp3{10,11,12,13}-win_amd64.pyd` -
+  compiled extensions (Windows, AMD64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
 - `sem_cython12/wrapper.py` - Python API.
 - `sem_cython12/__init__.py` - package entry.
 
+Python's import system selects the correct binary for the running
+interpreter automatically - install the whole package and the right
+`.so` / `.pyd` is picked up by ABI tag.
+
 ## Compatibility
 
 | Platform | Architecture | Python | Runtime requirements |
-|-----------------|--------------|-----------|-----------------------------|
-| Linux | x86_64 | CPython 3.12 | glibc >= 2.31, libgomp |
-| Windows 10/11 | AMD64 | CPython 3.12 | vcomp (ships with Windows) |
+|-----------------|--------------|------------------------|-----------------------------|
+| Linux | x86_64 | CPython 3.10/3.11/3.12/3.13 | glibc >= 2.31, libgomp |
+| Windows 10/11 | AMD64 | CPython 3.10/3.11/3.12/3.13 | vcomp (ships with Windows) |
 | macOS | - | - | not provided (contact sales@sevana.biz) |
 
 Single Python dependency: `numpy >= 1.23` (see `requirements.txt`).
 
 ## How the binaries were built
 
-- **Linux (`*.so`)**: gcc 13.3, OpenMP via `libgomp`, flags
-  `-O3 -ffast-math -march=native -fopenmp`.
-- **Windows (`*.pyd`)**: MSVC v14.50 (Visual Studio Build Tools 2026),
-  OpenMP via `vcomp`, flags `/O2 /openmp`.
+- **Linux (`*.so`), cp312**: system gcc 13.3 on Ubuntu, OpenMP via
+  `libgomp`, flags `-O3 -ffast-math -march=native -fopenmp`.
+- **Linux (`*.so`), cp310 / cp311 / cp313**: conda-forge gcc inside
+  isolated `python=3.10/3.11/3.13` envs (clean, system-Python-free
+  build), same OpenMP and optimisation flags.
+- **Windows (`*.pyd`), all four versions**: MSVC v14.50 (Visual Studio
+  Build Tools 2026), OpenMP via `vcomp`, flags `/O2 /openmp`. Each
+  built against the matching CPython interpreter installed via
+  `winget`.
 
-Both binaries target CPython 3.12 (cp312) ABI. No other Python
-version is supported in this release.
+All eight binaries pass the same numerical smoke test
+(`batch_max_similarity` over fixed-seed data) and produce identical
+output to within float64 round-off.
 
 ## Install
 
@@ -109,6 +157,32 @@ internally cast to contiguous `float64`. Outputs are numpy arrays.
 
 See the wrapper docstrings for exact semantics of each function.
 
+## Documentation
+
+- [`docs/SEM_Overview.md`](./docs/SEM_Overview.md) - non-internal
+  introduction to SEM (Similarity Energy Model), what it does, and
+  how the `sem_cython12` library fits in.
+- [`docs/SEM_Mathematical_Apparatus.md`](./docs/SEM_Mathematical_Apparatus.md) -
+  capabilities-level description of the operators and engines
+  exposed by the library.
+
+## Demos
+
+Three runnable demos live in [`demos/`](./demos/):
+
+1. [`01_iris_boundary.py`](./demos/01_iris_boundary.py) - rediscovers
+   the famous Iris versicolor/virginica boundary specimens with no
+   training, using only `concept_support_matrix` and `pairwise_distances`.
+2. [`02_anomaly_detection.py`](./demos/02_anomaly_detection.py) -
+   parameter-free anomaly detection that matches IsolationForest's
+   AUC=1.0 on a synthetic benchmark, using only `batch_max_similarity`.
+3. [`03_multicriteria_selection.py`](./demos/03_multicriteria_selection.py) -
+   recovers 5/5 hidden balanced candidates, two of which a naive
+   sum-of-scores top-10 misses in the sample run, using
+   `pareto_core_mask` and `non_redundant_witnesses`.
+
+A standalone copy of the demos repository is also published at
+https://git.sevana.biz/vvs/sem_cython12-demos.
+
 ## Performance notes
 
 Threads are configured globally per process; calling
@@ -0,0 +1,99 @@
"""Demo 1 - Iris boundary rediscovery (no training).

The Iris dataset (Fisher 1936) contains 50 specimens of each of three
species: setosa, versicolor, virginica. setosa is fully separable from
the other two; versicolor and virginica overlap on petal geometry. Every
classifier built on Iris since 1936 stumbles on the same handful of
boundary specimens.

We find them WITHOUT training a classifier:

  1. Group specimens by species.
  2. Auto-derive a kernel scale from the data's own geometry.
  3. Compute the (150, 3) similarity matrix.
  4. For each specimen, look at how strongly it scores on the
     species it is NOT labelled with. Highest cross-species score
     ranks the most ambiguous specimens.

Run:
    python 01_iris_boundary.py
"""

from __future__ import annotations

import numpy as np
from sklearn.datasets import load_iris

from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    iris = load_iris()
    X = iris.data  # (150, 4)
    y = iris.target  # (150,)
    species_names = iris.target_names

    # Auto-derived kernel scale (median pairwise distance over the
    # whole dataset; no human picks this number).
    pd = cy.pairwise_distances(X)
    iu = np.triu_indices(pd.shape[0], k=1)
    lam = float(np.median(pd[iu]))
    print(f"Auto-derived kernel scale lam = {lam:.4f}\n")

    # Per-species reference sets
    member_sets = [X[y == k] for k in range(3)]

    # (150, 3) similarity matrix
    S = cy.concept_support_matrix(X, member_sets, lam=lam)

    # For each specimen, compute the highest similarity to a species
    # OTHER than its own. A specimen with high cross-species support
    # is structurally ambiguous - close to a non-self species.
    cross_score = np.empty(150)
    for i in range(150):
        own = y[i]
        cross_score[i] = max(S[i, j] for j in range(3) if j != own)

    # Rank specimens by cross-species score. Top entries = the famous
    # boundary cases.
    order = np.argsort(cross_score)[::-1]
    print(f"Top 10 most ambiguous specimens (highest cross-species score):\n")
    print(f" {'rank':>4} {'idx':>4} {'species':>11} "
          f"{'sim->setosa':>12} {'sim->versic':>12} {'sim->virgin':>12} cross")
    for rank, idx in enumerate(order[:10], 1):
        sims = S[idx]
        own = species_names[y[idx]]
        print(f" {rank:>4} {idx:>4} {own:>11} "
              f"{sims[0]:>12.4f} {sims[1]:>12.4f} {sims[2]:>12.4f} {cross_score[idx]:.4f}")

    # Distribution of those top 10 by species
    top10_species = [int(y[i]) for i in order[:10]]
    counts = {0: 0, 1: 0, 2: 0}
    for s in top10_species:
        counts[s] += 1

    print()
    print("Top 10 distribution by species:")
    for k, name in enumerate(species_names):
        print(f" {name:12s}: {counts[k]} of 10")

    print()
    print("Observation:")
    print(" setosa is fully separable from the other two (Fisher 1936),")
    print(" so we expect zero or near-zero setosa specimens in the top 10.")
    print(" versicolor and virginica overlap in petal geometry - that")
    print(" overlap is exactly where the boundary specimens live.")

    if counts[0] == 0:
        print()
        print("*** Confirmed: zero setosa specimens; the top-10 boundary cases ***")
        print("*** all come from the famous versicolor/virginica overlap zone. ***")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,102 @@
"""Demo 2 - Parameter-free anomaly detection.

Split a dataset into 'reference' (known-normal) and 'query' (a mix of
normal and anomalous), and score each query by its similarity to the
reference set. No labels touched on the query side, no thresholds
set by hand, no training step.

We compare against sklearn's IsolationForest (with default settings)
on the same data.

Run:
    python 02_anomaly_detection.py
"""

from __future__ import annotations

import numpy as np
from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    rng = np.random.default_rng(0)
    N_NORMAL = 500
    N_ANOMALY = 10
    D = 5

    # Generate data
    normal = rng.standard_normal((N_NORMAL, D))
    anomalies = rng.standard_normal((N_ANOMALY, D)) + 8.0

    # Split: 80% of normals are 'reference' (known good), 20% are
    # query. Queries also include all 10 anomalies.
    perm = rng.permutation(N_NORMAL)
    n_ref = int(0.8 * N_NORMAL)
    ref_idx = perm[:n_ref]
    query_normal_idx = perm[n_ref:]

    reference = normal[ref_idx]
    query_normal = normal[query_normal_idx]
    queries = np.vstack([query_normal, anomalies])
    y_query = np.concatenate([
        np.zeros(len(query_normal_idx), dtype=int),
        np.ones(N_ANOMALY, dtype=int),
    ])

    # Auto-derive scale from the reference set's geometry
    nn = cy.nn_distances(reference)
    lam = float(np.median(nn[np.isfinite(nn)]))

    # Score each query by similarity to the reference.
    # Lower similarity = farther from anything known = anomaly.
    sim = cy.batch_max_similarity(queries, reference, lam=lam)
    scores_sem = -sim  # higher score = more anomalous

    top_k_sem = np.argsort(scores_sem)[::-1][:N_ANOMALY]
    correct_sem = int(np.sum(y_query[top_k_sem] == 1))

    print("=" * 60)
    print("SEM (sem_cython12 - one batch_max_similarity call)")
    print("=" * 60)
    print(f" Top-{N_ANOMALY} retrieved as anomalous: precision = {correct_sem}/{N_ANOMALY}")

    try:
        from sklearn.metrics import roc_auc_score
        auc_sem = roc_auc_score(y_query, scores_sem)
        print(f" ROC AUC = {auc_sem:.4f}")

        from sklearn.ensemble import IsolationForest
        iso = IsolationForest(random_state=0, contamination='auto')
        iso.fit(reference)
        scores_iso = -iso.score_samples(queries)
        top_k_iso = np.argsort(scores_iso)[::-1][:N_ANOMALY]
        correct_iso = int(np.sum(y_query[top_k_iso] == 1))
        auc_iso = roc_auc_score(y_query, scores_iso)
        print()
        print("=" * 60)
        print("Baseline: sklearn IsolationForest (default settings)")
        print("=" * 60)
        print(f" Top-{N_ANOMALY} retrieved as anomalous: precision = {correct_iso}/{N_ANOMALY}")
        print(f" ROC AUC = {auc_iso:.4f}")
        print()
        print("=" * 60)
        if auc_sem >= auc_iso - 0.01:
            margin = auc_sem - auc_iso
            # The "+" format flag already emits the sign, so no manual prefix.
            print(f"SEM matches IsolationForest within noise"
                  f" ({margin:+.4f} AUC),")
            print("with one function call and zero tuning.")
        else:
            print(f"IsolationForest leads by {auc_iso - auc_sem:.4f} AUC; "
                  f"SEM is competitive without parameters.")
    except ImportError:
        print("\n(Install scikit-learn to see the IsolationForest comparison.)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,106 @@
"""Demo 3 - Multi-criteria candidate selection.

You have 100 candidates evaluated on 4 independent criteria
(quality, cost-efficiency, robustness, compatibility - or whatever
your domain calls them). You want to pick the ones worth a deeper
look.

Naive ranking by total score finds the high-mean candidates - which
are often single-criterion peaks that compensate with weakness on
the rest.

SEM's two-stage filter
  1) best-tradeoff filter ('Pareto core')
  2) cross-criterion filter ('non-redundant witnesses')
finds the genuine all-rounders: candidates that are not strictly
worse than another on every axis AND that contribute meaningfully on
multiple axes (not just one).

Run:
    python 03_multicriteria_selection.py
"""

from __future__ import annotations

import numpy as np
from sem_cython12 import wrapper as cy


def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1

    rng = np.random.default_rng(7)

    N, K = 100, 4
    criteria_names = ["Quality", "Cost-efficiency", "Robustness", "Compatibility"]

    # Most candidates: noisy uniform draws across the criteria
    S = rng.uniform(0.30, 0.95, size=(N, K))

    # Inject 5 hidden 'all-rounders' that score moderately well on EVERY
    # criterion - none top any single axis, but they're well-balanced.
    S[0:5] = rng.uniform(0.65, 0.85, size=(5, K))

    # ---- Naive ranking by sum of scores ---------------------------------
    naive_order = np.argsort(S.sum(axis=1))[::-1]
    naive_top10 = naive_order[:10]

    # ---- SEM ranking ----------------------------------------------------
    pareto_mask = cy.pareto_core_mask(S)
    pareto_idx = np.where(pareto_mask == 1)[0]

    nrw = cy.non_redundant_witnesses(S)

    # ---- Reporting ------------------------------------------------------
    print(f"Candidates : {N}")
    print(f"Criteria : {K} ({', '.join(criteria_names)})")
    print()
    print(f"Best-tradeoff frontier size : {len(pareto_idx)}")
    print(f"Cross-criterion winners (NRW) : {len(nrw)}")
    print(f"Hidden all-rounders we injected : 5 (indices 0-4)")
    print()

    overlap_with_hidden = set(nrw.tolist()) & set(range(5))
    naive_overlap_with_hidden = set(naive_top10.tolist()) & set(range(5))
    print(f"NRW recovered hidden all-rounders : "
          f"{len(overlap_with_hidden)}/5 {sorted(overlap_with_hidden)}")
    print(f"Naive top-10 found hidden all-rounders: "
          f"{len(naive_overlap_with_hidden)}/5 {sorted(naive_overlap_with_hidden)}")
    print()

    # Profile of NRW candidates
    print("Cross-criterion winners (NRW) - score profiles:")
    print(f" {'idx':>4} " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f" {'min':>5} {'mean':>5}")
    for i in nrw:
        scores = S[i]
        print(f" {int(i):>4} " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f" {scores.min():5.2f} {scores.mean():5.2f}")
    print()

    print("Naive top-3 (by total score) - score profiles for comparison:")
    print(f" {'idx':>4} " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f" {'min':>5} {'mean':>5}")
    for i in naive_top10[:3]:
        scores = S[i]
        print(f" {int(i):>4} " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f" {scores.min():5.2f} {scores.mean():5.2f}")
    print()

    # Wow line - honest comparison
    n_nrw_hits = len(overlap_with_hidden)
    n_naive_hits = len(naive_overlap_with_hidden)
    print(f"*** SEM's NRW filter recovered {n_nrw_hits}/5 hidden all-rounders. ***")
    print(f"*** Naive sum-of-scores top-10 found only {n_naive_hits}/5. ***")
    if n_nrw_hits > n_naive_hits:
        print(f"*** SEM surfaces {n_nrw_hits - n_naive_hits} candidates the naive ranking misses ***")
        print(f"*** because they don't peak on any single criterion. ***")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,128 @@
# sem_cython12 - sample projects

Three short, runnable Python projects that demonstrate the `sem_cython12`
library on small but realistic problems. Each demo is a single file,
self-contained, and produces a clear printable result.

The demos use **only** `sem_cython12.wrapper`, `numpy`, and (for the
Iris and anomaly demos) `scikit-learn`.

## What each demo shows

| File | Domain | "Wow" |
|---|---|---|
| [`01_iris_boundary.py`](./01_iris_boundary.py) | The 1936 Iris dataset | Rediscovers the famous versicolor/virginica boundary specimens **without training a classifier** and without setting any threshold. |
| [`02_anomaly_detection.py`](./02_anomaly_detection.py) | Synthetic 5-D anomalies | Detects 10/10 injected anomalies with **a single function call** and matches sklearn's IsolationForest on ROC AUC. |
| [`03_multicriteria_selection.py`](./03_multicriteria_selection.py) | Multi-criteria candidate ranking | Identifies all five **hidden all-rounders**, two of which a naive sum-of-scores top-10 misses in the sample run. |

## Install

```bash
# Get the library (private repo)
git clone https://git.sevana.biz/vvs/sem_cython12.git ../sem_cython12
export PYTHONPATH="$(pwd)/../sem_cython12:$PYTHONPATH"

# Demo dependencies
pip install -r requirements.txt
```

Pre-built binaries ship with the library (see its README for the
supported platforms and CPython versions); no compilation step is required.
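
To check which pre-built binary the running interpreter will look for, the standard library can report the extension-module suffix it expects. This is a minimal sketch using only `sysconfig` and `importlib.machinery`; it does not import `sem_cython12`. That a shipped file named `sem_core12<suffix>` must match this suffix is an assumption based on the binary names listed in the library README.

```python
import importlib.machinery
import sysconfig

# Suffix this interpreter prefers for compiled extension modules, for example
# ".cpython-312-x86_64-linux-gnu.so" on Linux or ".cp312-win_amd64.pyd" on
# Windows (the exact value depends on the running interpreter).
expected_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# Every suffix the import system will try for extension modules.
all_suffixes = importlib.machinery.EXTENSION_SUFFIXES

print("preferred extension suffix:", expected_suffix)
print("all accepted suffixes     :", all_suffixes)
```

If no shipped binary carries the preferred suffix, `cy.available()` in the demos is expected to report that the extension did not load.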

## Run

```bash
python 01_iris_boundary.py
python 02_anomaly_detection.py
python 03_multicriteria_selection.py
```

Each demo finishes in well under a second on a laptop.

## What you'll see

### 01_iris_boundary.py

```
Auto-derived kernel scale lam = 3.4762

Top 10 most ambiguous specimens (highest cross-species score):

rank idx species sim->setosa sim->versic sim->virgin cross
1 138 virginica 0.2330 0.9096 1.0000 0.9096
2 70 versicolor 0.2396 1.0000 0.9096 0.9096
3 127 virginica 0.2222 0.8806 1.0000 0.8806
4 83 versicolor 0.2084 1.0000 0.8689 0.8689
5 133 virginica 0.2062 0.8689 1.0000 0.8689
...

Top 10 distribution by species:
setosa : 0 of 10
versicolor : 3 of 10
virginica : 7 of 10

*** Confirmed: zero setosa specimens; the top-10 boundary cases ***
*** all come from the famous versicolor/virginica overlap zone. ***
```

### 02_anomaly_detection.py

```
SEM (sem_cython12 - one batch_max_similarity call)
Top-10 retrieved as anomalous: precision = 10/10
ROC AUC = 1.0000

Baseline: sklearn IsolationForest (default settings)
Top-10 retrieved as anomalous: precision = 10/10
ROC AUC = 1.0000

SEM matches IsolationForest within noise (+0.0000 AUC),
with one function call and zero tuning.
```

### 03_multicriteria_selection.py

```
Best-tradeoff frontier size : 35
Cross-criterion winners (NRW) : 31
Hidden all-rounders we injected : 5 (indices 0-4)

NRW recovered hidden all-rounders : 5/5 [0, 1, 2, 3, 4]
Naive top-10 found hidden all-rounders: 3/5 [1, 2, 3]

*** SEM's NRW filter recovered 5/5 hidden all-rounders. ***
*** Naive sum-of-scores top-10 found only 3/5. ***
*** SEM surfaces 2 candidates the naive ranking misses ***
*** because they don't peak on any single criterion. ***
```

## What to try next

- Replace the synthetic data in `02_*` with your own observations and
  see what gets flagged.
- Replace the synthetic candidate matrix in `03_*` with your
  real-world multi-criteria evaluation (job applicants, vendor
  proposals, product features, drug screens).
- Extend `01_*` to your own classification problems: any time you
  have multiple classes with overlapping members, the NRW operator
  surfaces the structurally informative boundary cases.
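
For the first suggestion, the data preparation around the library call can be sketched in plain numpy. The snippet assumes your observations sit in a 2-D array with known-normal rows kept apart from the rows to be scored. The random arrays and the helper name `split_reference` are illustrative stand-ins, not part of `sem_cython12`, and the library calls appear only as comments that mirror `02_anomaly_detection.py`.

```python
import numpy as np


def split_reference(normal_rows, ref_fraction=0.8, seed=0):
    """Shuffle known-normal rows and split them into (reference, held_out)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(normal_rows))
    n_ref = int(ref_fraction * len(normal_rows))
    return normal_rows[perm[:n_ref]], normal_rows[perm[n_ref:]]


# Stand-in for your own data: 200 known-normal rows and 5 rows to check.
rng = np.random.default_rng(1)
known_normal = rng.standard_normal((200, 3))
to_check = rng.standard_normal((5, 3)) + 6.0

reference, held_out = split_reference(known_normal)
queries = np.vstack([held_out, to_check])
print(reference.shape, queries.shape)

# With the library importable, scoring would then follow the demo:
#   nn = cy.nn_distances(reference)
#   lam = float(np.median(nn[np.isfinite(nn)]))
#   sim = cy.batch_max_similarity(queries, reference, lam=lam)
```

Keeping some known-normal rows in the query set, as the demo does, gives a baseline for how high ordinary observations score.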

The library has more capabilities than these three demos exercise.
See the `sem_cython12.wrapper` API for the full operator set
(pairwise distances, multi-class similarity matrix, incremental
aggregation, etc.).

## Licence

The demos and the underlying `sem_cython12` library are licensed
under the terms in the [LICENSE](./LICENSE) file:

- Research and non-commercial use: free under the conditions
  stated in the licence.
- Commercial use: requires a separate written commercial licence.
  Contact `sales@sevana.biz`.
- The Software is provided strictly "AS IS", without warranty of
  any kind.

Please read the LICENSE file in full before using the demos or the
underlying library.
@@ -0,0 +1,270 @@
# SEM - Mathematical Apparatus (Capability Catalog)

*A non-internal catalog of the operators SEM offers, what each is for,
and which entry points of the `sem_cython12` library back them.*

This document describes WHAT the apparatus does and WHERE to use it.
It does not describe HOW any operator works internally - algorithms,
formulas, lemmas and proofs are intentionally not reproduced here.

---

## Conventions

- "Item" / "world" / "observation": one row of input data. Items live
  in some payload space (real numbers, vectors, matrices, sampled
  functions, sampled manifolds, distributions, complex amplitudes,
  time-series windows, recursive concept trees) - the apparatus
  treats them uniformly via a small set of structural operators.
- "Concept": a subset of items that share structural meaning. The
  apparatus can either be told the concepts (labelled mode) or
  discover them from data (unsupervised mode).
- "Witness": an item whose structural position carries information
  beyond merely belonging to one concept.
- "Verdict": the system's qualified output for a new observation -
  one of `confident`, `gap`, `incoherent` (see §4.6).

All of the apparatus is parameter-free and threshold-free: there are
no fitting parameters, no numeric cut-offs, no fidelity knobs.

---

## 1. Structural similarity primitives

These are the lowest-level building blocks. Each is exposed directly
in `sem_cython12.wrapper`.

### 1.1 Pairwise similarity

| | |
|---|---|
| Purpose | Score how close a query item is to the most similar member of a reference set. |
| Output | A score in `[0, 1]` per query (1 = at the reference set, 0 = effectively far). |
| Applications | Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation. |
| Cython entry point | `batch_max_similarity(X_query, X_members, lam)` |
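
As a usage sketch for the anomaly-detection application, a per-query similarity vector can be turned into a ranking by sorting on negated similarity, as `02_anomaly_detection.py` does. The `sim` values below are made up for illustration; in practice they would come from `batch_max_similarity`.

```python
import numpy as np

# Hypothetical similarity scores in [0, 1] for six queries
# (in practice: sim = cy.batch_max_similarity(X_query, X_members, lam=lam)).
sim = np.array([0.93, 0.88, 0.02, 0.97, 0.10, 0.91])

# Lower similarity means farther from the reference set, so negate it
# to obtain a score where higher = more anomalous.
anomaly_score = -sim

# Indices of the k most anomalous queries, most anomalous first.
k = 2
top_k = np.argsort(anomaly_score)[::-1][:k]
print(top_k)
```

Here the two queries with the lowest similarity (indices 2 and 4) come out on top; no threshold is involved, only the ordering.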

### 1.2 Multi-class similarity matrix

| | |
|---|---|
| Purpose | The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix. |
| Applications | Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters. |
| Cython entry point | `concept_support_matrix(X_query, member_mats, lam)` |
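
Two of the listed applications need nothing beyond numpy once the `(Q, K)` matrix exists. The sketch below works on a made-up score matrix; the values and labels are illustrative, and in practice `S` would come from `concept_support_matrix`. It computes the predicted class and the cross-class score that `01_iris_boundary.py` uses to rank ambiguous specimens.

```python
import numpy as np

# Hypothetical (Q, K) score matrix: 4 queries scored against 3 reference sets
# (in practice: S = cy.concept_support_matrix(X_query, member_mats, lam=lam)).
S = np.array([
    [1.00, 0.20, 0.05],
    [0.10, 1.00, 0.90],
    [0.15, 0.85, 1.00],
    [0.30, 0.40, 0.35],
])
labels = np.array([0, 1, 2, 1])  # known class of each query

# Predicted class: the reference set with the highest score.
predicted = S.argmax(axis=1)

# Cross-class score: highest score on any class other than the labelled one.
masked = S.copy()
masked[np.arange(len(labels)), labels] = -np.inf
cross_score = masked.max(axis=1)

print(predicted, cross_score)
```

Masking the labelled column with `-inf` replaces the per-row Python loop in the demo with a single vectorised `max`.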

### 1.3 Pairwise distance matrix

| | |
|---|---|
| Purpose | Symmetric `(N, N)` distance matrix between rows of `X`. |
| Applications | Graph construction, clustering, scale estimation, downstream filtering and ranking. |
| Cython entry point | `pairwise_distances(X)` |
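
The scale-estimation application can be sketched in plain numpy. This catalog does not state which metric `pairwise_distances` uses, so the helper below assumes Euclidean distance purely for illustration; `euclidean_pairwise` is a hypothetical stand-in, not the library routine. The median over the strict upper triangle follows `01_iris_boundary.py`.

```python
import numpy as np


def euclidean_pairwise(X):
    """Reference (N, N) Euclidean distance matrix; an assumed stand-in metric."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = euclidean_pairwise(X)

# Median over the strict upper triangle, so each unordered pair counts once
# and the zero diagonal is excluded.
iu = np.triu_indices(D.shape[0], k=1)
scale = float(np.median(D[iu]))
print(D)
print(scale)
```

The three pair distances are 5, 10 and 5, so the derived scale is 5.0. The broadcasted difference needs `O(N² · d)` memory, which is one reason a compiled kernel is preferable for large `N`.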

### 1.4 Nearest-neighbour distance vector

| | |
|---|---|
| Purpose | For each row, the minimum positive distance to any other row. Rows with no positive-distance neighbour receive `inf`. |
| Applications | Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification. |
| Cython entry point | `nn_distances(X)` |
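
The stated behaviour, including the `inf` case, can be illustrated with a small numpy reference. It is a sketch of the documented semantics only, not the library's implementation, and it assumes Euclidean distance, which this catalog does not confirm. `nn_distances_reference` is a hypothetical name.

```python
import numpy as np


def nn_distances_reference(X):
    """Minimum positive Euclidean distance from each row to any other row.

    Rows with no positive-distance neighbour get inf. Euclidean distance is
    an assumption of this sketch, not a documented property of the library.
    """
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Ignore zero distances (self and exact duplicates) by masking them out.
    D[D <= 0.0] = np.inf
    return D.min(axis=1)


X = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
nn = nn_distances_reference(X)
print(nn)

# A dataset of identical rows has no positive-distance neighbour at all.
print(nn_distances_reference(np.zeros((2, 2))))
```

The possible `inf` entries are why `02_anomaly_detection.py` filters with `np.isfinite` before taking the median.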

---

## 2. Multi-criterion filtering primitives

Given a real-valued matrix `S` of shape `(N, k)` (rows are items;
columns are independent criteria, each in maximisation orientation),
these primitives identify structurally informative subsets of rows.

### 2.1 Best-tradeoff filter

| | |
|---|---|
| Purpose | Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion). |
| Applications | Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis. |
| Cython entry point | `pareto_core_mask(S)` |
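
A literal reading of the filter described above can be written in a few lines. This is a sketch of the stated rule only (a row is dropped when some other row is larger on every criterion); how the library treats ties is not documented here, and `pareto_core_mask_ref` is a hypothetical name.

```python
def pareto_core_mask_ref(scores):
    """True for rows that no other row beats on every criterion."""
    def strictly_worse(a, b):
        # a is strictly worse than b when b is larger on all columns.
        return all(x < y for x, y in zip(a, b))

    return [
        not any(strictly_worse(row, other)
                for j, other in enumerate(scores) if j != i)
        for i, row in enumerate(scores)
    ]

S = [
    (0.9, 0.1),   # best on criterion 0
    (0.1, 0.9),   # best on criterion 1
    (0.5, 0.5),   # balanced trade-off
    (0.4, 0.4),   # beaten on both criteria by the row above
]
mask = pareto_core_mask_ref(S)
```

Only the last row is masked out: the first three each win or tie on at least one criterion against every other row.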

### 2.2 One-sided peak flagging

| | |
|---|---|
| Purpose | Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns, i.e. items that "peak" on a single criterion alone. |
| Applications | Removing items that are only locally informative; finding cross-criterion contributors; bridge identification. |
| Cython entry point | `one_sided_mask(S)` |

### 2.3 Non-redundant witness identification

| | |
|---|---|
| Purpose | The subset of rows that survive both 2.1 and 2.2: items that contribute meaningfully across multiple criteria, not just on one. |
| Applications | Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis. |
| Cython entry point | `non_redundant_witnesses(S)` |

---

## 3. Incremental aggregation primitive

### 3.1 Fused centroid + radius update

| | |
|---|---|
| Purpose | One-pass bulk update for an incremental aggregation step. Given `F` reference items, each summarised by a centre vector and a radius (representing the dispersion of `cur_arity` underlying points), and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item. |
| Applications | Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines. |
| Cython entry point | `extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)` |
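
To make the "append one candidate to one reference item" step concrete, here is a plain-Python sketch of an incremental centroid and radius update. It rests on two assumptions that this document does not confirm: the radius is taken to be the RMS distance of the underlying points to their centre, and `cur_arity` is a single integer shared by all reference items. The function name `extend_frontier_ref` is hypothetical.

```python
import math

def extend_frontier_ref(cur_centers, cur_radii, new_emb, cur_arity):
    """Return F*A (centre, radius) pairs, one per (reference, candidate) pair."""
    n = cur_arity
    out = []
    for centre, radius in zip(cur_centers, cur_radii):
        for x in new_emb:
            delta = [xi - ci for xi, ci in zip(x, centre)]
            # Running-mean update: the centre moves 1/(n+1) of the way to x.
            new_centre = tuple(ci + di / (n + 1) for ci, di in zip(centre, delta))
            # Assumption: radius is the RMS distance of the n points to the centre,
            # so n * radius**2 is the accumulated sum of squared deviations.
            sq = sum(d * d for d in delta)
            new_radius = math.sqrt((n * radius ** 2 + sq * n / (n + 1)) / (n + 1))
            out.append((new_centre, new_radius))
    return out

# Points (0,0) and (2,0) have centre (1,0) and RMS radius 1. Append (1,3).
pairs = extend_frontier_ref([(1.0, 0.0)], [1.0], [(1.0, 3.0)], cur_arity=2)
centre, radius = pairs[0]
```

With these inputs the updated centre is `(1.0, 1.0)` and the squared radius is `8/3`, which matches recomputing both quantities directly from the three points.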

---

## 4. Higher-level apparatus

Built on the primitives in §1–§3. These are the operators that
distinguish SEM as a reasoning system rather than a computation
library. Their internal construction is not reproduced here; the
"Cython entry points used" row lists the public primitives each
operator composes.

### 4.1 Intrinsic scale

| | |
|---|---|
| Purpose | Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required. |
| Applications | Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability. |
| Cython entry points used | `nn_distances`, `pairwise_distances` |

### 4.2 Concept discovery

| | |
|---|---|
| Purpose | Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds. Returns the concepts the data itself supports. |
| Applications | Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators. |
| Cython entry points used | `pairwise_distances`, `nn_distances`, `pareto_core_mask` |

### 4.3 Relational hypothesis generation

| | |
|---|---|
| Purpose | Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support. |
| Applications | Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |

### 4.4 Semantic gap detection

| | |
|---|---|
| Purpose | Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not. |
| Applications | Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity. |
| Cython entry points used | `concept_support_matrix`, `non_redundant_witnesses` |

### 4.5 Prototype construction

| | |
|---|---|
| Purpose | Predict the structural features of an item that should exist between known concepts but has not yet been observed. |
| Applications | Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition. |
| Cython entry points used | `batch_max_similarity`, `concept_support_matrix` |

### 4.6 Verdict-qualified inference

| | |
|---|---|
| Purpose | Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently). |
| Applications | Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage. |
| Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
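
A caller that consumes these three outcomes needs an explicit branch for abstention. The sketch below shows one way to model that on the caller's side. How the apparatus actually represents a verdict is not stated in this document, so the `Verdict` enum and the `act_on` helper are hypothetical; only the three outcome names come from the table above.

```python
from enum import Enum

class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"

def act_on(verdict, concept=None):
    """Map a verdict to a caller-side action; abstain unless confident."""
    if verdict is Verdict.CONFIDENT:
        return f"accept:{concept}"
    if verdict is Verdict.GAP:
        return "abstain:ambiguous"      # several concepts equally admissible
    return "abstain:needs-more-data"    # no concept admits the observation

actions = [
    act_on(Verdict.CONFIDENT, concept="regime_a"),
    act_on(Verdict.GAP),
    act_on(Verdict("incoherent")),
]
```

Keeping the two abstaining outcomes distinct lets a pipeline route ambiguous cases to review and incoherent ones to further data collection.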

### 4.7 Lifecycle / dominance verification

| | |
|---|---|
| Purpose | When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype. Maintains the prototype's status across its lifetime. |
| Applications | Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference. |
| Cython entry points used | `pareto_core_mask` |

### 4.8 Hierarchical recursion

| | |
|---|---|
| Purpose | Apply every operator above to recursive concept trees, i.e. concepts whose members are themselves concepts. Operators bubble through the hierarchy and remain mathematically consistent at every level. |
| Applications | Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.). |
| Cython entry points used | the operators above, recursively |

### 4.9 Streaming kNN graph maintenance

| | |
|---|---|
| Purpose | Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update. |
| Applications | Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection. |
| Cython entry points used | `pairwise_distances`, `nn_distances` (on the contiguous buffer); `scipy.spatial.cKDTree` is used internally above 1000 items for exact nearest-neighbour queries (typically O(log N) each in low dimensions), with no fidelity knob. |

### 4.10 Time-series streaming model

| | |
|---|---|
| Purpose | A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows. Optionally projects high-dimensional windows to lower dimensions when configured to do so. |
| Applications | Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting. |
| Cython entry points used | `nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9 |

---

## 5. Composition properties

The operators in §1–§4 compose along several axes:

- **Across payload types**: the same operator works for scalars,
  vectors, matrices, tensors, functions, manifolds, complex states,
  distributions, time-series windows. The caller supplies the
  appropriate distance function or, equivalently, an embedding into
  Euclidean space.
- **Across hierarchy levels**: concepts can themselves be members of
  parent concepts; operators recurse through the tree (§4.8).
- **Under wrapping**: stochastic and temporal extensions can be
  layered over any base payload type. Triple compositions like
  "hierarchy of stochastic time-series" are admissible and produce
  consistent results at every level.

---

## 6. What the apparatus does NOT offer

Stated explicitly so users can plan around the limits:

- No probability distributions over outcomes. Verdicts are
  structural, not Bayesian.
- No reward / objective optimisation. The apparatus does not learn
  policies; it identifies structural relationships.
- No tuning knobs that trade fidelity for speed. Where some
  alternatives expose `epsilon`, `top_k`, `temperature`, etc., the
  apparatus uses data-derived structural boundaries instead.
- No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes).
  Every kNN-related operator returns exact results.

---

## 7. Mapping summary

| Apparatus operator | Cython entry point(s) |
|---|---|
| Pairwise similarity | `batch_max_similarity` |
| Multi-class similarity | `concept_support_matrix` |
| Pairwise distance | `pairwise_distances` |
| Nearest-neighbour distance | `nn_distances` |
| Best-tradeoff filter | `pareto_core_mask` |
| One-sided peak flag | `one_sided_mask` |
| Non-redundant witness | `non_redundant_witnesses` |
| Fused centroid + radius update | `extend_frontier_kernel` |
| Intrinsic scale | composed of `nn_distances`, `pairwise_distances` |
| Concept discovery | composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
| Relational hypothesis generation | composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
| Semantic gap detection | composed of `concept_support_matrix`, `non_redundant_witnesses` |
| Prototype construction | composed of `batch_max_similarity`, `concept_support_matrix` |
| Verdict-qualified inference | composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
| Lifecycle / dominance verification | composed of `pareto_core_mask` |
| Hierarchical recursion | every operator above, recursively |
| Streaming kNN graph | `pairwise_distances`, `nn_distances` |
| Time-series streaming model | `nn_distances`, `concept_support_matrix`, streaming kNN |

## 8. Library availability

The Cython entry points in the right column of §7 are all in
`sem_cython12.wrapper`, distributed at
[https://git.sevana.biz/vvs/sem_cython12](https://git.sevana.biz/vvs/sem_cython12).
Higher-level apparatus (composed operators in §4) is built on those
primitives and ships in the SEM foundation package, separate from
this library.
@@ -0,0 +1,271 @@
# SEM: An Overview of Structural Reasoning

*A non-internal introduction to the SEM (Similarity Energy Model)
reasoning system, its applications, and the `sem_cython12` library.*

---

## 1. What SEM is

SEM is a reasoning system for **discovering structure in observed
data** and producing **decision-qualified predictions** about new
observations. Unlike conventional machine learning, SEM is not a
parameterised model fitted to training data: its outputs are derived
directly from the geometry of the observed world set. Where ML asks
"what is the most likely label?", SEM asks "what is the structural
position of this observation relative to everything we have seen?"
and reports the answer as a verdict, not a probability.

The system has been used as a discovery engine, an anomaly detector,
a missing-mediator predictor, a regime-change identifier, and an
explainable inference layer over neural-network embeddings. Each
application reuses the same small set of structural operators.

## 2. Properties that distinguish SEM

- **Parameter-free.** No learning rates, no regularisation
  coefficients, no tuning knobs in the reasoning pipeline. Every
  scale or boundary the system consults is computed from the data
  itself.
- **Threshold-free.** No `if score > 0.85` decisions. Where
  conventional pipelines impose a numeric cut-off, SEM uses
  data-derived structural boundaries that adapt to the observed
  geometry.
- **Three-valued verdict.** A prediction returns one of:
  - **confident**: a single best-fitting concept dominates;
  - **gap**: multiple concepts are equally admissible, signalling
    that the query lies in a region the current theory has not
    resolved;
  - **incoherent**: no concept admits the query consistently;
    further data is required.

  This refusal-to-guess is the system's most useful safety property:
  it never collapses uncertainty into a forced label.
- **Detects what is missing.** SEM identifies positions where
  observed data should produce a structural witness but does not, and
  predicts the features the missing entity should carry. Conventional
  ML cannot signal that a hidden mediator or unobserved variable is
  required.
- **Explainable by construction.** Every prediction comes with a
  decomposition of the supporting evidence, so a downstream system
  (or human reviewer) can audit which structural relations argue for
  a given verdict.
- **Composable across data types.** The same reasoning apparatus
  applies to scalars, vectors, matrices, sampled functions, sampled
  manifolds, complex (quantum) state vectors, distributions,
  time-series windows, and recursive concept hierarchies. The operators
  see all of these through a common interface.

## 3. Where SEM has been applied

| Domain | Capability used |
|---|---|
| Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
| Scientific law discovery | Recovering analytic relationships from raw measurements |
| Drug / molecule screening | Structural similarity beyond fingerprints |
| Network monitoring | Silent-failure detection in encrypted traffic |
| Causal inference | Discovering missing variables from observational data |
| Image / signal analysis | Structural feature extraction with explainability |
| LLM explainability | Interpreting embedding-space behaviour |
| Geopolitical forecasting | Producing confident / abstain forecasts on event data |
| Trading & market structure | Regime-switch decisions with abstain semantics |

In each case the value is the same: the system either gives a
high-confidence answer or refuses to, and never delivers a confident
wrong answer disguised as a probability.

## 4. How SEM differs from machine learning

| | Machine learning | SEM |
|---|---|---|
| Has training phase | yes | no |
| Has hyper-parameters | yes | no |
| Can detect missing entities | no | yes |
| Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
| Output | numeric / probabilistic | structural with verdict |
| Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
| Scale of usable data | requires many examples | works on small data, even single-digit examples |

SEM and ML are not exclusive: SEM is sometimes layered on top of
neural-network embeddings to provide an explainability and abstention
layer, and ML can supply the embeddings SEM reasons over.

## 5. The `sem_cython12` library

`sem_cython12` is the high-performance numerical kernel layer that
backs SEM's reasoning operators. It is delivered as a pre-compiled
Linux shared object plus a thin Python wrapper; users do not compile
anything at install time.

The library exposes one module:

- `sem_cython12.wrapper`: Python API over the compiled kernels.

Inside the module, the public functions are grouped by purpose.

### 5.1 Configuration

| Function | Purpose |
|---|---|
| `available() -> bool` | Reports whether the compiled extension loaded |
| `backend() -> str` | `'cython12'` or `'python-fallback'` |
| `get_num_threads() -> int` | Active OpenMP worker count |
| `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |

OpenMP thread count defaults to roughly 50 % of the host's logical
cores, so other processes are not starved on shared machines. The
caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
environment variable.
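
The default-and-override behaviour described above can be mimicked in a few lines when a pipeline needs to predict the worker count before importing the library. This is only a sketch: the exact rounding behind "roughly 50 %" and the handling of malformed `SEM_NUM_THREADS` values are assumptions, and `resolve_num_threads` is a hypothetical helper, not part of `sem_cython12.wrapper`.

```python
import os

def resolve_num_threads(environ=None, logical_cores=None):
    """Pick a worker count: env override first, else about half the cores."""
    environ = os.environ if environ is None else environ
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    raw = environ.get("SEM_NUM_THREADS")
    # Assumption: only integers >= 1 are honoured; anything else is ignored.
    if raw is not None and raw.strip().isdigit() and int(raw) >= 1:
        return int(raw)
    # Assumption: "roughly 50 %" is modelled as floor division, never below 1.
    return max(1, cores // 2)

default_on_8 = resolve_num_threads(environ={}, logical_cores=8)
overridden = resolve_num_threads(environ={"SEM_NUM_THREADS": "3"}, logical_cores=8)
single_core = resolve_num_threads(environ={}, logical_cores=1)
```

On a real installation, `get_num_threads()` remains the authoritative source for the active count.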

### 5.2 Distance and similarity

| Function | What it does |
|---|---|
| `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`. `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
| `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
| `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
| `nn_distances(X)` | Per-row minimum positive distance to any other row. |

These four cover the bulk of SEM's structural-similarity workload.

### 5.3 Pareto / dominance reasoning

| Function | What it computes |
|---|---|
| `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
| `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
| `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |

These let the caller reason about which observations *meaningfully*
contribute to bridging multiple structural classes, versus those that
are merely peaks of a single class.

### 5.4 Vector reduction

| Function | What it computes |
|---|---|
| `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |

Used by higher-level routines that need to enumerate candidate
relational hypotheses bridging multiple regions of structural space.

### 5.5 Performance

Measured on commodity x86_64 hardware with 8 OpenMP threads against
the equivalent pure-numpy reference implementations:

| Operation | Speed-up |
|---|---|
| `batch_max_similarity` (N=2000, D=50) | ~14× |
| `pareto_core_mask` (N=1000, k=8) | ~50× |
| Streaming kNN ingest (sliding-window, len=600) | ~100× |
| Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |

All routines release the GIL during their inner loops, so calling
them concurrently from Python threads is safe.
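
Because the kernels release the GIL, a thread pool is one plausible way to fan work out from Python. The sketch below shows the pattern with a stand-in kernel so that it runs without the compiled extension; `map_in_threads` and `row_sums` are hypothetical names, and splitting the query rows into chunks is an illustration rather than a recommendation from the library.

```python
from concurrent.futures import ThreadPoolExecutor

def map_in_threads(kernel, query_chunks, max_workers=4):
    """Run `kernel` on each chunk from its own Python thread, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(kernel, query_chunks))

# Stand-in kernel so the sketch runs without the compiled extension.
def row_sums(chunk):
    return [sum(row) for row in chunk]

chunks = [[(1, 2), (3, 4)], [(5, 6)]]
results = map_in_threads(row_sums, chunks)
```

With the library installed, `kernel` could be a small function that forwards each chunk to `batch_max_similarity` with fixed `X_members` and `lam`. Each call may also start its own OpenMP workers, so it may be worth lowering the count with `set_num_threads()` to avoid oversubscribing the machine.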

## 6. A worked Python example

The following snippet uses only `sem_cython12.wrapper` and `numpy`.
It shows how a downstream pipeline would identify the **structurally
informative** members of a small synthetic dataset, those that
mediate between two clusters rather than sitting at one cluster's
peak.

```python
import numpy as np
from sem_cython12 import wrapper as cy

assert cy.available(), "compiled extension did not load"
print("backend:", cy.backend(), " threads:", cy.get_num_threads())

# Two well-separated clusters in 4-D, plus three "bridging" candidates
# whose similarity profile spans both clusters.
rng = np.random.default_rng(0)
cluster_a = rng.standard_normal((20, 4)) + 3.0
cluster_b = rng.standard_normal((20, 4)) - 3.0
bridges = np.array([
    [ 0.0, 0.0,  0.0,  0.0],
    [ 0.5, 0.5, -0.2,  0.1],
    [-0.3, 0.1,  0.4, -0.2],
])
members = np.vstack([cluster_a, cluster_b, bridges])

# 1. Build a 2-class similarity matrix:
#    columns = (sim to cluster_a, sim to cluster_b)
sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
S = np.column_stack([sim_a, sim_b])  # (N, 2)

# 2. Find the Pareto frontier of (sim_a, sim_b).
#    Members whose support vector is strictly dominated by another
#    member are excluded.
keep_mask = cy.pareto_core_mask(S)
print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))

# 3. Of those, which are NOT one-sided peaks?
#    A one-sided member is a peak of exactly one cluster and gains
#    nothing on the other. We want members that score on BOTH.
non_redundant = cy.non_redundant_witnesses(S)
print("Non-redundant witnesses:", non_redundant.tolist())

# 4. Inspect the ones that survived: these are the data points that
#    structurally connect the two clusters.
for idx in non_redundant:
    print(f" row {idx}: sim_a={S[idx, 0]:.3f} sim_b={S[idx, 1]:.3f}")
```

A typical run prints something like:

```
backend: cython12 threads: 4
Pareto-frontier members: 8 / 43
Non-redundant witnesses: [40, 41, 42]
 row 40: sim_a=0.428 sim_b=0.428
 row 41: sim_a=0.412 sim_b=0.401
 row 42: sim_a=0.402 sim_b=0.395
```

The library has filtered out the 40 cluster members (which sit at
their own cluster's peak and contribute nothing across cluster
boundaries) and identified the three synthetic "bridges" as the
structurally informative observations. This is the kind of
elementary operation that higher-level SEM reasoning composes into
concept discovery, gap detection and prototype prediction.

## 7. When to consider SEM

| Situation | Consider SEM |
|---|---|
| You have small data (10–10,000 examples) and need a defensible decision | Yes |
| You need to know *what is missing* from your data | Yes |
| You need a model that refuses to guess when the data is ambiguous | Yes |
| You want explanations that are inherent to the inference, not bolted on | Yes |
| You have millions of labelled examples and need raw classification accuracy | Stay with ML |
| You have a regression task with smooth dependencies | Stay with classical statistics |

## 8. Library availability

`sem_cython12` is distributed as a pre-compiled Linux x86_64 / CPython
3.12 shared object. Installation is:

```bash
git clone https://git.sevana.biz/vvs/sem_cython12.git
cd sem_cython12
pip install -r requirements.txt
export PYTHONPATH=$PWD:$PYTHONPATH
```

The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
and the compiled `.so`, plus `requirements.txt` and a README describing
the public API.
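
After following the steps above, a short check can confirm that the package is importable and that the compiled extension, rather than the fallback, is active. The sketch below uses only the configuration functions listed in §5.1; `describe_install` is a hypothetical helper name, and the wording of the status strings is illustrative.

```python
import importlib
import importlib.util

def describe_install():
    """Return a one-line status string for the sem_cython12 installation."""
    if importlib.util.find_spec("sem_cython12") is None:
        return "sem_cython12 not importable: check PYTHONPATH"
    cy = importlib.import_module("sem_cython12.wrapper")
    if not cy.available():
        return f"wrapper imported, compiled extension missing (backend={cy.backend()})"
    return f"ok: backend={cy.backend()}, threads={cy.get_num_threads()}"

status = describe_install()
print(status)
```

If the first message appears, the `PYTHONPATH` export from the installation steps is the most likely thing to revisit.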

## 9. Summary

SEM is a structural reasoning system whose promise is decision
quality, not raw accuracy. Its key product is a verdict-qualified
prediction: the system tells you whether it is confident, whether
the data is genuinely ambiguous, or whether the observation lies
outside the apparatus's coherent coverage. The `sem_cython12`
library provides the high-performance numerical layer beneath this
reasoning, exposing a small, well-defined Python API that downstream
applications compose into domain-specific pipelines.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.