v1.1.0: extend binary matrix to CPython 3.10/3.11/3.13 on Linux and Windows

- Linux x86_64: add cp310, cp311, cp313 (.so), built in conda-forge envs.
- Windows AMD64: add cp310, cp311, cp313 (.pyd), built with MSVC v14.50.
- All eight binaries verified to produce identical numerical output.
- README compatibility table + build provenance updated.
- macOS still deferred.
Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README
2026-05-10 11:15:11 +01:00 · 2026-05-09 19:24:57 +01:00 · 2026-05-09 20:46:56 +03:00 · 2026-05-09 15:25:52 +01:00
14 changed files with 1091 additions and 15 deletions
@@ -9,6 +9,32 @@ release `MAJOR.MINOR.PATCH` increments
 - `MINOR` on backwards-compatible feature additions,
 - `PATCH` on backwards-compatible bug fixes.
 ## [1.1.0] - 2026-05-10
 Binary matrix expanded to four CPython versions on both supported
 platforms.
 ### Added
 - Pre-compiled Linux x86_64 binaries for **CPython 3.10, 3.11, 3.13**
  (`sem_core12.cpython-3{10,11,13}-x86_64-linux-gnu.so`).  Built in
  isolated conda-forge environments with conda-forge gcc, same
  OpenMP and optimisation flags as the cp312 binary.
 - Pre-compiled Windows AMD64 binaries for **CPython 3.10, 3.11, 3.13**
  (`sem_core12.cp3{10,11,13}-win_amd64.pyd`).  Built with MSVC v14.50
  against the matching CPython installed via `winget`.
 ### Verified
 - All eight binaries (4 Linux + 4 Windows) produce identical numerical
  output for the same fixed-seed input on `batch_max_similarity`.
 ### Compatibility notes
 - macOS is still not provided in this release.  Contact
  `sales@sevana.biz` if you need a macOS build.
 - numpy requirement unchanged: `numpy >= 1.23`.
 ## [1.0.0] - 2026-05-09
 First public release.
@@ -4,34 +4,82 @@ OpenMP-parallel numerical kernel library for Python.  Pre-built
 Linux and Windows binaries included; no compilation required at
 install time.
 ## What is this for?
 `sem_cython12` is a small, focused toolbox of fast C-level routines
 exposed through a thin numpy wrapper.  It is not a general-purpose
 numerical library; it accelerates three specific jobs that are
 awkward or slow to do in pure numpy once `N` reaches the thousands:
 1. **Similarity / distance over batches of vectors.**  Full
   pairwise distance matrices, nearest-neighbour distances, and
   kernel-based `[0, 1]` similarity scores of a query set against
   one or many reference sets.  Useful for nearest-neighbour
   search, kernel-density-style scoring, and "how close is each
   query to this concept?" lookups.
 2. **Multi-objective ("best-tradeoff") filtering of score matrices.**
   Given a matrix of `N` candidates × `k` criteria, select the
   rows on the Pareto frontier, isolate rows that only spike on a
   single criterion, and recover the rows that contribute
   meaningfully across several criteria - candidates a naive
   sum-of-scores ranker would miss.
 3. **An incremental aggregation primitive** for streaming
   clustering / frontier-expansion algorithms: a fused bulk update
   that, given `F` running summaries (centre + radius) and `A`
   new contributions, produces all `F·A` updated summaries in one
   parallel pass.
 The kernels release the GIL, scale near-linearly to ~8 OpenMP
 threads on commodity x86, and operate on shared-memory numpy
 arrays with no inter-process serialisation.  The Python wrapper
 handles contiguous-float64 casting and degrades loudly (via
 `available()` / `backend()` plus `RuntimeError`) when the compiled
 extension cannot load on the host - there is no slow pure-Python
 fallback path.
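
Because there is no fallback path, it can help to check the backend once at program start instead of at the first kernel call. The following is a minimal sketch built only on the `available()` call that the demos use; `require_extension` and `fake_wrapper` are hypothetical names introduced for illustration, and the stand-in object exists only so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def require_extension(wrapper):
    """Return `wrapper` if its compiled backend loaded, otherwise raise."""
    if not wrapper.available():
        raise RuntimeError(
            "sem_cython12 compiled extension did not load on this host; "
            "there is no pure-Python fallback"
        )
    return wrapper


# Hypothetical stand-in for `sem_cython12.wrapper` (assumption: only the
# `available()` call is modelled here).
fake_wrapper = SimpleNamespace(available=lambda: False)

try:
    require_extension(fake_wrapper)
except RuntimeError as exc:
    message = str(exc)
    print("refusing to continue:", message)
```

In real code the argument would be the imported `sem_cython12.wrapper` module instead of the stand-in.
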
 The [`demos/`](./demos/) directory contains three runnable
 end-to-end examples (Iris boundary discovery, parameter-free
 anomaly detection, multi-criteria candidate selection) that
 exercise these three jobs against well-known baselines.
 ## Contents
-- `sem_cython12/sem_core12.cpython-312-x86_64-linux-gnu.so` -
-  compiled extension (Linux, CPython 3.12, x86_64).
-- `sem_cython12/sem_core12.cp312-win_amd64.pyd` -
-  compiled extension (Windows, CPython 3.12, AMD64).
+- `sem_cython12/sem_core12.cpython-3{10,11,12,13}-x86_64-linux-gnu.so` -
+  compiled extensions (Linux, x86_64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
+- `sem_cython12/sem_core12.cp3{10,11,12,13}-win_amd64.pyd` -
+  compiled extensions (Windows, AMD64) for CPython 3.10 / 3.11 / 3.12 / 3.13.
 - `sem_cython12/wrapper.py` - Python API.
 - `sem_cython12/__init__.py` - package entry.
 Python's import system selects the correct binary for the running
 interpreter automatically — install the whole package and the right
 `.so` / `.pyd` is picked up by ABI tag.
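
To see which of the shipped files a given interpreter will load, the standard library exposes the filename suffixes it accepts. This is a small diagnostic sketch using only `importlib.machinery` and `sysconfig`; the `sem_core12` stem comes from the file list above, and nothing here touches the library itself.

```python
import importlib.machinery
import sysconfig

# Filename suffixes the running interpreter accepts for extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES

# The suffix this interpreter would give to an extension built for it,
# i.e. the one carrying its own ABI tag.
native_suffix = sysconfig.get_config_var("EXT_SUFFIX")

print("accepted suffixes:", suffixes)
print("expected binary  :", f"sem_core12{native_suffix}")
```

If the printed filename is not among the files in `sem_cython12/`, the running interpreter is outside the supported matrix.
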
 ## Compatibility
-| Platform        | Architecture | Python    | Runtime requirements        |
-|-----------------|--------------|-----------|-----------------------------|
-| Linux           | x86_64       | CPython 3.12 | glibc >= 2.31, libgomp   |
-| Windows 10/11   | AMD64        | CPython 3.12 | vcomp (ships with Windows) |
-| macOS           | -            | -         | not provided (contact sales@sevana.biz) |
+| Platform        | Architecture | Python                 | Runtime requirements        |
+|-----------------|--------------|------------------------|-----------------------------|
+| Linux           | x86_64       | CPython 3.10/3.11/3.12/3.13 | glibc >= 2.31, libgomp   |
+| Windows 10/11   | AMD64        | CPython 3.10/3.11/3.12/3.13 | vcomp (ships with Windows) |
+| macOS           | -            | -                      | not provided (contact sales@sevana.biz) |
 Single Python dependency: `numpy >= 1.23` (see `requirements.txt`).
 ## How the binaries were built
-- **Linux (`*.so`)**: gcc 13.3, OpenMP via `libgomp`, flags
-  `-O3 -ffast-math -march=native -fopenmp`.
-- **Windows (`*.pyd`)**: MSVC v14.50 (Visual Studio Build Tools 2026),
-  OpenMP via `vcomp`, flags `/O2 /openmp`.
+- **Linux (`*.so`), cp312**: system gcc 13.3 on Ubuntu, OpenMP via
+  `libgomp`, flags `-O3 -ffast-math -march=native -fopenmp`.
+- **Linux (`*.so`), cp310 / cp311 / cp313**: conda-forge gcc inside
+  isolated `python=3.10/3.11/3.13` envs (clean, system-Python-free
  build), same OpenMP and optimisation flags.
 - **Windows (`*.pyd`), all four versions**: MSVC v14.50 (Visual Studio
  Build Tools 2026), OpenMP via `vcomp`, flags `/O2 /openmp`. Each
  built against the matching CPython interpreter installed via
  `winget`.
-Both binaries target CPython 3.12 (cp312) ABI.  No other Python
-version is supported in this release.
+All eight binaries pass the same numerical smoke test
+(`batch_max_similarity` over fixed-seed data) and produce identical
 output to within float64 round-off.
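
A cross-build check of this kind can be sketched as follows. This is not the project's actual test script: the helper names, array sizes and tolerances are assumptions chosen for illustration. The idea is to generate the same seeded inputs on every build, save each build's kernel output, and compare the saved arrays on one machine.

```python
import numpy as np


def fixed_seed_inputs(seed=0, n_query=64, n_members=256, dim=8):
    """Deterministic inputs to feed to every build under test.

    The sizes are illustrative assumptions, not the values used upstream.
    """
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_query, dim))
    members = rng.standard_normal((n_members, dim))
    return queries, members


def same_within_roundoff(reference, candidate, rtol=1e-12, atol=1e-15):
    """True when two outputs have equal shape and agree to tight tolerances.

    The tolerances are assumptions; tighten or loosen them to taste.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    return reference.shape == candidate.shape and bool(
        np.allclose(reference, candidate, rtol=rtol, atol=atol)
    )


queries, members = fixed_seed_inputs()
print(queries.shape, members.shape)
```

Each build would pass `queries` and `members` to `batch_max_similarity`, store the result with `np.save`, and the stored arrays would then be compared pairwise with `same_within_roundoff`.
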
 ## Install
@@ -109,6 +157,32 @@ internally cast to contiguous `float64`.  Outputs are numpy arrays.
 See the wrapper docstrings for exact semantics of each function.
 ## Documentation
 - [`docs/SEM_Overview.md`](./docs/SEM_Overview.md) — non-internal
  introduction to SEM (Similarity Energy Model), what it does, and
  how the `sem_cython12` library fits in.
 - [`docs/SEM_Mathematical_Apparatus.md`](./docs/SEM_Mathematical_Apparatus.md)
  — capabilities-level description of the operators and engines
  exposed by the library.
 ## Demos
 Three runnable demos live in [`demos/`](./demos/):
 1. [`01_iris_boundary.py`](./demos/01_iris_boundary.py) — rediscovers
   the famous Iris versicolor/virginica boundary specimens with no
   training, using only `concept_support_matrix` and `pairwise_distances`.
 2. [`02_anomaly_detection.py`](./demos/02_anomaly_detection.py) —
   parameter-free anomaly detection that matches IsolationForest's
   AUC=1.0 on a synthetic benchmark, using only `batch_max_similarity`.
 3. [`03_multicriteria_selection.py`](./demos/03_multicriteria_selection.py)
   — recovers 5/5 hidden balanced candidates that naive sum-of-scores
   ranking misses, using `pareto_core_mask` and `non_redundant_witnesses`.
 A standalone copy of the demos repository is also published at
 https://git.sevana.biz/vvs/sem_cython12-demos.
 ## Performance notes
 Threads are configured globally per process; calling
@@ -0,0 +1,99 @@
 """Demo 1 - Iris boundary rediscovery (no training).
 The Iris dataset (Fisher 1936) contains 50 specimens each of three species:
 setosa, versicolor, virginica.  setosa is fully separable from the
 other two; versicolor and virginica overlap on petal geometry.  Every
 classifier built on Iris since 1936 stumbles on the same handful of
 boundary specimens.
 We find them WITHOUT training a classifier:
  1. Group specimens by species.
  2. Auto-derive a kernel scale from the data's own geometry.
  3. Compute the (150, 3) similarity matrix.
  4. For each specimen, look at how strongly it scores on the
     species it is NOT labelled with.  Highest cross-species score
     ranks the most ambiguous specimens.
 Run:
    python 01_iris_boundary.py
 """
 from __future__ import annotations
 import numpy as np
 from sklearn.datasets import load_iris
 from sem_cython12 import wrapper as cy
 def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1
    iris = load_iris()
    X = iris.data                           # (150, 4)
    y = iris.target                         # (150,)
    species_names = iris.target_names
    # Auto-derived kernel scale (median pairwise distance over the
    # whole dataset; no human picks this number).
    pd = cy.pairwise_distances(X)
    iu = np.triu_indices(pd.shape[0], k=1)
    lam = float(np.median(pd[iu]))
    print(f"Auto-derived kernel scale lam = {lam:.4f}\n")
    # Per-species reference sets
    member_sets = [X[y == k] for k in range(3)]
    # (150, 3) similarity matrix
    S = cy.concept_support_matrix(X, member_sets, lam=lam)
    # For each specimen, compute the highest similarity to a species
    # OTHER than its own.  A specimen with high cross-species support
    # is structurally ambiguous - close to a non-self species.
    cross_score = np.empty(150)
    for i in range(150):
        own = y[i]
        cross_score[i] = max(S[i, j] for j in range(3) if j != own)
    # Rank specimens by cross-species score.  Top entries = the famous
    # boundary cases.
    order = np.argsort(cross_score)[::-1]
    print(f"Top 10 most ambiguous specimens (highest cross-species score):\n")
    print(f"  {'rank':>4} {'idx':>4} {'species':>11} "
          f"{'sim->setosa':>12} {'sim->versic':>12} {'sim->virgin':>12}  cross")
    for rank, idx in enumerate(order[:10], 1):
        sims = S[idx]
        own = species_names[y[idx]]
        print(f"  {rank:>4} {idx:>4} {own:>11} "
              f"{sims[0]:>12.4f} {sims[1]:>12.4f} {sims[2]:>12.4f}  {cross_score[idx]:.4f}")
    # Distribution of those top 10 by species
    top10_species = [int(y[i]) for i in order[:10]]
    counts = {0: 0, 1: 0, 2: 0}
    for s in top10_species:
        counts[s] += 1
    print()
    print("Top 10 distribution by species:")
    for k, name in enumerate(species_names):
        print(f"  {name:12s}: {counts[k]} of 10")
    print()
    print("Observation:")
    print("  setosa is fully separable from the other two (Fisher 1936),")
    print("  so we expect zero or near-zero setosa specimens in the top 10.")
    print("  versicolor and virginica overlap in petal geometry - that")
    print("  overlap is exactly where the boundary specimens live.")
    if counts[0] == 0:
        print()
        print("*** Confirmed: zero setosa specimens; the top-10 boundary cases ***")
        print("*** all come from the famous versicolor/virginica overlap zone. ***")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,102 @@
 """Demo 2 - Parameter-free anomaly detection.
 Split a dataset into 'reference' (known-normal) and 'query' (a mix of
 normal and anomalous), and score each query by its similarity to the
 reference set.  No labels touched on the query side, no thresholds
 set by hand, no training step.
 We compare against sklearn's IsolationForest (with default settings)
 on the same data.
 Run:
    python 02_anomaly_detection.py
 """
 from __future__ import annotations
 import numpy as np
 from sem_cython12 import wrapper as cy
 def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1
    rng = np.random.default_rng(0)
    N_NORMAL = 500
    N_ANOMALY = 10
    D = 5
    # Generate data
    normal = rng.standard_normal((N_NORMAL, D))
    anomalies = rng.standard_normal((N_ANOMALY, D)) + 8.0
    # Split: 80% of normals are 'reference' (known good), 20% are
    # query.  Queries also include all 10 anomalies.
    perm = rng.permutation(N_NORMAL)
    n_ref = int(0.8 * N_NORMAL)
    ref_idx = perm[:n_ref]
    query_normal_idx = perm[n_ref:]
    reference = normal[ref_idx]
    query_normal = normal[query_normal_idx]
    queries = np.vstack([query_normal, anomalies])
    y_query = np.concatenate([
        np.zeros(len(query_normal_idx), dtype=int),
        np.ones(N_ANOMALY, dtype=int),
    ])
    # Auto-derive scale from the reference set's geometry
    nn = cy.nn_distances(reference)
    lam = float(np.median(nn[np.isfinite(nn)]))
    # Score each query by similarity to the reference.
    # Lower similarity = farther from anything known = anomaly.
    sim = cy.batch_max_similarity(queries, reference, lam=lam)
    scores_sem = -sim                     # higher score = more anomalous
    top_k_sem = np.argsort(scores_sem)[::-1][:N_ANOMALY]
    correct_sem = int(np.sum(y_query[top_k_sem] == 1))
    print("=" * 60)
    print("SEM  (sem_cython12 - one batch_max_similarity call)")
    print("=" * 60)
    print(f"  Top-{N_ANOMALY} retrieved as anomalous:  precision = {correct_sem}/{N_ANOMALY}")
    try:
        from sklearn.metrics import roc_auc_score
        auc_sem = roc_auc_score(y_query, scores_sem)
        print(f"  ROC AUC                          = {auc_sem:.4f}")
        from sklearn.ensemble import IsolationForest
        iso = IsolationForest(random_state=0, contamination='auto')
        iso.fit(reference)
        scores_iso = -iso.score_samples(queries)
        top_k_iso = np.argsort(scores_iso)[::-1][:N_ANOMALY]
        correct_iso = int(np.sum(y_query[top_k_iso] == 1))
        auc_iso = roc_auc_score(y_query, scores_iso)
        print()
        print("=" * 60)
        print("Baseline: sklearn IsolationForest (default settings)")
        print("=" * 60)
        print(f"  Top-{N_ANOMALY} retrieved as anomalous:  precision = {correct_iso}/{N_ANOMALY}")
        print(f"  ROC AUC                          = {auc_iso:.4f}")
        print()
        print("=" * 60)
        if auc_sem >= auc_iso - 0.01:
            margin = auc_sem - auc_iso
            # The "+" format flag already prints the sign; no manual prefix needed.
            print(f"SEM matches IsolationForest within noise"
                  f" ({margin:+.4f} AUC),")
            print("with one function call and zero tuning.")
        else:
            print(f"IsolationForest leads by {auc_iso - auc_sem:.4f} AUC; "
                  f"SEM is competitive without parameters.")
    except ImportError:
        print("\n(Install scikit-learn to see the IsolationForest comparison.)")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,106 @@
 """Demo 3 - Multi-criteria candidate selection.
 You have 100 candidates evaluated on 4 independent criteria
 (quality, cost-efficiency, robustness, compatibility - or whatever
 your domain calls them).  You want to pick the ones worth a deeper
 look.
 Naive ranking by total score finds the high-mean candidates - which
 are often single-criterion peaks that compensate with weakness on
 the rest.
 SEM's two-stage filter
  1) best-tradeoff filter ('Pareto core')
  2) cross-criterion filter ('non-redundant witnesses')
 finds the genuine all-rounders: candidates that are not strictly
 worse than another on every axis AND that contribute meaningfully on
 multiple axes (not just one).
 Run:
    python 03_multicriteria_selection.py
 """
 from __future__ import annotations
 import numpy as np
 from sem_cython12 import wrapper as cy
 def main() -> int:
    if not cy.available():
        print("ERROR: sem_cython12 compiled extension did not load.")
        return 1
    rng = np.random.default_rng(7)
    N, K = 100, 4
    criteria_names = ["Quality", "Cost-efficiency", "Robustness", "Compatibility"]
    # Most candidates: noisy uniform draws across the criteria
    S = rng.uniform(0.30, 0.95, size=(N, K))
    # Inject 5 hidden 'all-rounders' that score moderately well on EVERY
    # criterion - none top any single axis, but they're well-balanced.
    S[0:5] = rng.uniform(0.65, 0.85, size=(5, K))
    # ---- Naive ranking by sum of scores ---------------------------------
    naive_order = np.argsort(S.sum(axis=1))[::-1]
    naive_top10 = naive_order[:10]
    # ---- SEM ranking ----------------------------------------------------
    pareto_mask = cy.pareto_core_mask(S)
    pareto_idx = np.where(pareto_mask == 1)[0]
    nrw = cy.non_redundant_witnesses(S)
    # ---- Reporting ------------------------------------------------------
    print(f"Candidates                       : {N}")
    print(f"Criteria                         : {K} ({', '.join(criteria_names)})")
    print()
    print(f"Best-tradeoff frontier size      : {len(pareto_idx)}")
    print(f"Cross-criterion winners (NRW)    : {len(nrw)}")
    print(f"Hidden all-rounders we injected  : 5 (indices 0-4)")
    print()
    overlap_with_hidden = set(nrw.tolist()) & set(range(5))
    naive_overlap_with_hidden = set(naive_top10.tolist()) & set(range(5))
    print(f"NRW recovered hidden all-rounders     : "
          f"{len(overlap_with_hidden)}/5  {sorted(overlap_with_hidden)}")
    print(f"Naive top-10 found hidden all-rounders: "
          f"{len(naive_overlap_with_hidden)}/5  {sorted(naive_overlap_with_hidden)}")
    print()
    # Profile of NRW candidates
    print("Cross-criterion winners (NRW) - score profiles:")
    print(f"  {'idx':>4}  " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f"   {'min':>5}  {'mean':>5}")
    for i in nrw:
        scores = S[i]
        print(f"  {int(i):>4}  " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f"   {scores.min():5.2f}  {scores.mean():5.2f}")
    print()
    print("Naive top-3 (by total score) - score profiles for comparison:")
    print(f"  {'idx':>4}  " + " ".join(f"{n[:8]:>9}" for n in criteria_names) +
          f"   {'min':>5}  {'mean':>5}")
    for i in naive_top10[:3]:
        scores = S[i]
        print(f"  {int(i):>4}  " +
              " ".join(f"{v:9.3f}" for v in scores) +
              f"   {scores.min():5.2f}  {scores.mean():5.2f}")
    print()
    # Wow line - honest comparison
    n_nrw_hits = len(overlap_with_hidden)
    n_naive_hits = len(naive_overlap_with_hidden)
    print(f"*** SEM's NRW filter recovered {n_nrw_hits}/5 hidden all-rounders. ***")
    print(f"*** Naive sum-of-scores top-10 found only {n_naive_hits}/5.            ***")
    if n_nrw_hits > n_naive_hits:
        print(f"*** SEM surfaces {n_nrw_hits - n_naive_hits} candidates the naive ranking misses     ***")
        print(f"*** because they don't peak on any single criterion.        ***")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,128 @@
 # sem_cython12 - sample projects
 Three short, runnable Python projects that demonstrate the `sem_cython12`
 library on small but realistic problems.  Each demo is a single file,
 self-contained, and produces a clear printable result.
 The demos use **only** `sem_cython12.wrapper`, `numpy`, and (for the
 Iris and anomaly demos) `scikit-learn`.
 ## What each demo shows
 | File | Domain | "Wow" |
 |---|---|---|
 | [`01_iris_boundary.py`](./01_iris_boundary.py) | The 1936 Iris dataset | Rediscovers the famous versicolor/virginica boundary specimens **without training a classifier** and without setting any threshold. |
 | [`02_anomaly_detection.py`](./02_anomaly_detection.py) | Synthetic 5-D anomalies | Detects 10/10 injected anomalies with **a single function call** and matches sklearn's IsolationForest on ROC AUC (both reach 1.0 on this benchmark). |
 | [`03_multicriteria_selection.py`](./03_multicriteria_selection.py) | Multi-criteria candidate ranking | Identifies the **hidden all-rounders** that naive sum-of-scores ranking misses entirely. |
 ## Install
 ```bash
 # Get the library (private repo)
 git clone https://git.sevana.biz/vvs/sem_cython12.git ../sem_cython12
 export PYTHONPATH="$(pwd)/../sem_cython12:$PYTHONPATH"
 # Demo dependencies
 pip install -r requirements.txt
 ```
 Pre-built Linux x86_64 and Windows AMD64 binaries for CPython 3.10
 to 3.13 ship with the library; no compilation step is required.
 ## Run
 ```bash
 python 01_iris_boundary.py
 python 02_anomaly_detection.py
 python 03_multicriteria_selection.py
 ```
 Each demo finishes in well under a second on a laptop.
 ## What you'll see
 ### 01_iris_boundary.py
 ```
 Auto-derived kernel scale lam = 3.4762
 Top 10 most ambiguous specimens (highest cross-species score):
  rank  idx     species  sim->setosa  sim->versic  sim->virgin  cross
     1  138   virginica       0.2330       0.9096       1.0000  0.9096
     2   70  versicolor       0.2396       1.0000       0.9096  0.9096
     3  127   virginica       0.2222       0.8806       1.0000  0.8806
     4   83  versicolor       0.2084       1.0000       0.8689  0.8689
     5  133   virginica       0.2062       0.8689       1.0000  0.8689
     ...
 Top 10 distribution by species:
  setosa      : 0 of 10
  versicolor  : 3 of 10
  virginica   : 7 of 10
 *** Confirmed: zero setosa specimens; the top-10 boundary cases ***
 *** all come from the famous versicolor/virginica overlap zone. ***
 ```
 ### 02_anomaly_detection.py
 ```
 SEM  (sem_cython12 - one batch_max_similarity call)
  Top-10 retrieved as anomalous:  precision = 10/10
  ROC AUC                          = 1.0000
 Baseline: sklearn IsolationForest (default settings)
  Top-10 retrieved as anomalous:  precision = 10/10
  ROC AUC                          = 1.0000
 SEM matches IsolationForest within noise (+0.0000 AUC),
 with one function call and zero tuning.
 ```
 ### 03_multicriteria_selection.py
 ```
 Best-tradeoff frontier size      : 35
 Cross-criterion winners (NRW)    : 31
 Hidden all-rounders we injected  : 5 (indices 0-4)
 NRW recovered hidden all-rounders     : 5/5  [0, 1, 2, 3, 4]
 Naive top-10 found hidden all-rounders: 3/5  [1, 2, 3]
 *** SEM's NRW filter recovered 5/5 hidden all-rounders. ***
 *** Naive sum-of-scores top-10 found only 3/5.          ***
 *** SEM surfaces 2 candidates the naive ranking misses  ***
 *** because they don't peak on any single criterion.    ***
 ```
 ## What to try next
 - Replace the synthetic data in `02_*` with your own observations and
  see what gets flagged.
 - Replace the synthetic candidate matrix in `03_*` with your
  real-world multi-criteria evaluation (job applicants, vendor
  proposals, product features, drug screens).
 - Extend `01_*` to your own classification problems: any time you
  have multiple classes with overlapping members, the NRW operator
  surfaces the structurally informative boundary cases.
 The library has more capabilities than these three demos exercise.
 See the `sem_cython12.wrapper` API for the full operator set
 (pairwise distances, multi-class similarity matrix, incremental
 aggregation, etc.).
 ## Licence
 The demos and the underlying `sem_cython12` library are licensed
 under the terms in the [LICENSE](./LICENSE) file:
 - Research and non-commercial use: free under the conditions
  stated in the licence.
 - Commercial use: requires a separate written commercial licence.
  Contact `sales@sevana.biz`.
 - The Software is provided strictly "AS IS", without warranty of
  any kind.
 Please read the LICENSE file in full before using the demos or the
 underlying library.
@@ -0,0 +1,270 @@
 # SEM — Mathematical Apparatus (Capability Catalog)
 *A non-internal catalog of the operators SEM offers, what each is for,
 and which entry points of the `sem_cython12` library back them.*
 This document describes WHAT the apparatus does and WHERE to use it.
 It does not describe HOW any operator works internally — algorithms,
 formulas, lemmas and proofs are intentionally not reproduced here.
 ---
 ## Conventions
 - "Item" / "world" / "observation": one row of input data.  Items live
  in some payload space (real numbers, vectors, matrices, sampled
  functions, sampled manifolds, distributions, complex amplitudes,
  time-series windows, recursive concept trees) — the apparatus
  treats them uniformly via a small set of structural operators.
 - "Concept": a subset of items that share structural meaning.  The
  apparatus can either be told the concepts (labelled mode) or
  discover them from data (unsupervised mode).
 - "Witness": an item whose structural position carries information
  beyond merely belonging to one concept.
 - "Verdict": the system's qualified output for a new observation -
  one of `confident`, `gap`, `incoherent` (see §4.6).
 All of the apparatus is parameter-free and threshold-free: there are
 no fitting parameters, no numeric cut-offs, no fidelity knobs.
 ---
 ## 1.  Structural similarity primitives
 These are the lowest-level building blocks.  Each is exposed directly
 in `sem_cython12.wrapper`.
 ### 1.1  Pairwise similarity
 | | |
 |---|---|
 | Purpose | Score how close a query item is to the most similar member of a reference set. |
 | Output | A score in `[0, 1]` per query (1 = at the reference set, 0 = effectively far). |
 | Applications | Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation. |
 | Cython entry point | `batch_max_similarity(X_query, X_members, lam)` |
 ### 1.2  Multi-class similarity matrix
 | | |
 |---|---|
 | Purpose | The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix. |
 | Applications | Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters. |
 | Cython entry point | `concept_support_matrix(X_query, member_mats, lam)` |
 ### 1.3  Pairwise distance matrix
 | | |
 |---|---|
 | Purpose | Symmetric `(N, N)` distance matrix between rows of `X`. |
 | Applications | Graph construction, clustering, scale estimation, downstream filtering and ranking. |
 | Cython entry point | `pairwise_distances(X)` |
 ### 1.4  Nearest-neighbour distance vector
 | | |
 |---|---|
 | Purpose | For each row, the minimum positive distance to any other row.  Rows with no positive-distance neighbour receive `inf`. |
 | Applications | Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification. |
 | Cython entry point | `nn_distances(X)` |
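
For small inputs, the outputs described in 1.3 and 1.4 can be cross-checked against a plain numpy reference. The sketch below assumes the Euclidean metric, which this catalog does not state, and the `*_ref` function names are hypothetical. It follows the stated rule that a row with no positive-distance neighbour receives `inf`.

```python
import numpy as np


def pairwise_distances_ref(X):
    """Symmetric (N, N) Euclidean distance matrix (assumed metric)."""
    X = np.asarray(X, dtype=np.float64)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def nn_distances_ref(X):
    """Minimum strictly positive distance from each row to any other row."""
    D = pairwise_distances_ref(X)
    # Zero distances (self and exact duplicates) are masked out, so a row
    # with no positive-distance neighbour ends up with inf.
    masked = np.where(D > 0.0, D, np.inf)
    return masked.min(axis=1)


X = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
D = pairwise_distances_ref(X)
nn = nn_distances_ref(X)
print(D)
print(nn)
```

This reference allocates an `(N, N, d)` intermediate, so it is only suitable for validating results on small `N`, not as a replacement for the compiled kernels.
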
 ---
 ## 2.  Multi-criterion filtering primitives
 Given a real-valued matrix `S` of shape `(N, k)` (rows are items,
 columns are independent criteria — each in maximisation orientation),
 these primitives identify structurally informative subsets of rows.
 ### 2.1  Best-tradeoff filter
 | | |
 |---|---|
 | Purpose | Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion). |
 | Applications | Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis. |
 | Cython entry point | `pareto_core_mask(S)` |
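
The criterion in the Purpose row can be written directly in numpy for small matrices. This is an illustrative reading of that one sentence, not the library's algorithm: `strict_dominance_mask` is a hypothetical name, and `pareto_core_mask` may treat ties or its output dtype differently.

```python
import numpy as np


def strict_dominance_mask(S):
    """Keep rows that no other row beats on every criterion.

    Assumes maximisation orientation on all columns, as stated above.
    """
    S = np.asarray(S, dtype=np.float64)
    # beaten[i, j] is True when row j exceeds row i on every column.
    beaten = (S[None, :, :] > S[:, None, :]).all(axis=2)
    return ~beaten.any(axis=1)


S = np.array([
    [1.0, 1.0],   # beaten on both criteria by the next row
    [2.0, 2.0],
    [3.0, 0.0],
    [0.0, 3.0],
])
mask = strict_dominance_mask(S)
print(mask)
```

The comparison is quadratic in `N` and allocates an `(N, N, k)` boolean array, so it is meant for checking intuition on small examples only.
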
 ### 2.2  One-sided peak flagging
 | | |
 |---|---|
 | Purpose | Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns - i.e. items that "peak" on a single criterion alone. |
 | Applications | Removing items that are only locally informative; finding cross-criterion contributors; bridge identification. |
 | Cython entry point | `one_sided_mask(S)` |
 ### 2.3  Non-redundant witness identification
 | | |
 |---|---|
 | Purpose | The subset of rows that survive both 2.1 and 2.2 — items that contribute meaningfully across multiple criteria, not just on one. |
 | Applications | Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis. |
 | Cython entry point | `non_redundant_witnesses(S)` |
 ---
 ## 3.  Incremental aggregation primitive
 ### 3.1  Fused centroid + radius update
 | | |
 |---|---|
 | Purpose | One-pass bulk update for an incremental aggregation step.  Given `F` reference items - each summarised by a centre vector and a radius (representing the dispersion of `cur_arity` underlying points) - and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item. |
 | Applications | Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines. |
 | Cython entry point | `extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)` |
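
The shape of the `F * A` fan-out can be illustrated for the centre part alone. Everything beyond the shapes is an assumption here: the sketch treats each centre as the arithmetic mean of `cur_arity` points, treats `cur_arity` as one integer shared by all `F` items, and omits the radius update because this catalog does not define it. `extend_centres` is a hypothetical name, and the real kernel's output layout may differ.

```python
import numpy as np


def extend_centres(cur_centers, new_emb, cur_arity):
    """Running-mean centre for every (reference item, candidate) pair.

    Assumption: each centre is the mean of `cur_arity` points, so adding
    one point x gives (cur_arity * centre + x) / (cur_arity + 1).
    """
    cur_centers = np.asarray(cur_centers, dtype=np.float64)  # (F, d)
    new_emb = np.asarray(new_emb, dtype=np.float64)          # (A, d)
    total = cur_arity * cur_centers[:, None, :] + new_emb[None, :, :]
    return total / (cur_arity + 1)                           # (F, A, d)


centres = np.array([[0.0, 0.0], [2.0, 2.0]])   # F = 2
candidates = np.array([[2.0, 0.0]])            # A = 1
updated = extend_centres(centres, candidates, cur_arity=1)
print(updated.shape)
print(updated)
```

Broadcasting produces all pairs in one expression, which mirrors the one-pass bulk update the table describes without claiming anything about how the compiled kernel computes it.
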
 ---
 ## 4.  Higher-level apparatus
 Built on the primitives in §1–§3.  These are the operators that
 distinguish SEM as a reasoning system rather than a computation
 library.  Their internal construction is not reproduced here; the
 "Cython entry points used" column lists the public primitives the
 operator composes.
 ### 4.1  Intrinsic scale
 | | |
 |---|---|
 | Purpose | Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required. |
 | Applications | Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability. |
 | Cython entry points used | `nn_distances`, `pairwise_distances` |
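
The bundled demos show two simple data-derived scales: the median of all pairwise distances (Iris demo) and the median of the finite nearest-neighbour distances (anomaly demo). The sketch below restates those two recipes as standalone helpers that take already-computed distance arrays; the function names are hypothetical, and the operator described in this section may do more than either recipe.

```python
import numpy as np


def scale_from_pairwise(D):
    """Median over the strict upper triangle of an (N, N) distance matrix."""
    D = np.asarray(D, dtype=np.float64)
    upper = np.triu_indices(D.shape[0], k=1)
    return float(np.median(D[upper]))


def scale_from_nn(nn):
    """Median of the finite entries of a nearest-neighbour distance vector."""
    nn = np.asarray(nn, dtype=np.float64)
    return float(np.median(nn[np.isfinite(nn)]))


D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
nn = np.array([1.0, np.inf, 3.0])
print(scale_from_pairwise(D), scale_from_nn(nn))
```

In the demos the inputs come from `pairwise_distances(X)` and `nn_distances(reference)` respectively, and the result is passed on as `lam`.
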
 ### 4.2  Concept discovery
 | | |
 |---|---|
 | Purpose | Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds.  Returns the concepts the data itself supports. |
 | Applications | Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators. |
 | Cython entry points used | `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
 ### 4.3  Relational hypothesis generation
 | | |
 |---|---|
 | Purpose | Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support. |
 | Applications | Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery. |
 | Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
 ### 4.4  Semantic gap detection
 | | |
 |---|---|
 | Purpose | Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not. |
 | Applications | Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity. |
 | Cython entry points used | `concept_support_matrix`, `non_redundant_witnesses` |
 ### 4.5  Prototype construction
 | | |
 |---|---|
 | Purpose | Predict the structural features of an item that should exist between known concepts but has not yet been observed. |
 | Applications | Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition. |
 | Cython entry points used | `batch_max_similarity`, `concept_support_matrix` |
 ### 4.6  Verdict-qualified inference
 | | |
 |---|---|
 | Purpose | Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently). |
 | Applications | Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage. |
 | Cython entry points used | `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
 ### 4.7  Lifecycle / dominance verification
 | | |
 |---|---|
 | Purpose | When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype.  Maintains the prototype's status across its lifetime. |
 | Applications | Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference. |
 | Cython entry points used | `pareto_core_mask` |
 ### 4.8  Hierarchical recursion
 | | |
 |---|---|
 | Purpose | Apply every operator above to recursive concept trees — concepts whose members are themselves concepts.  Operators bubble through the hierarchy and remain mathematically consistent at every level. |
 | Applications | Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.). |
 | Cython entry points used | the operators above, recursively |
 ### 4.9  Streaming kNN graph maintenance
 | | |
 |---|---|
 | Purpose | Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update. |
 | Applications | Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection. |
 | Cython entry points used | `pairwise_distances`, `nn_distances` (on the contiguous buffer); `scipy.spatial.cKDTree` is used internally above 1000 items for exact O(log N) queries — no fidelity knob. |
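
Incremental maintenance can be sketched in a few lines. The class below is a hypothetical illustration (`StreamingKNN` is not a name from the library): it uses brute-force Euclidean distances and no KD-tree, and only shows why an add or a remove does not require rebuilding every neighbour list.

```python
import math


class StreamingKNN:
    """Exact k-NN graph kept up to date under single-item add/remove.

    Hypothetical illustration class: brute-force distances, integer ids.
    """

    def __init__(self, k):
        self.k = k
        self.points = {}       # item id -> coordinate tuple
        self.neighbours = {}   # item id -> sorted list of (distance, other id)

    def add(self, item_id, point):
        point = tuple(point)
        dists = [(math.dist(point, p), other) for other, p in self.points.items()]
        # The new item may displace the k-th neighbour of an existing item.
        for d, other in dists:
            nbrs = self.neighbours[other]
            nbrs.append((d, item_id))
            nbrs.sort()
            del nbrs[self.k:]
        self.points[item_id] = point
        self.neighbours[item_id] = sorted(dists)[: self.k]

    def remove(self, item_id):
        del self.points[item_id]
        del self.neighbours[item_id]
        # Only items that listed the removed one as a neighbour need repair.
        for other, nbrs in self.neighbours.items():
            if any(n == item_id for _, n in nbrs):
                self.neighbours[other] = self._scan(other)

    def _scan(self, item_id):
        p = self.points[item_id]
        dists = [(math.dist(p, q), o) for o, q in self.points.items() if o != item_id]
        return sorted(dists)[: self.k]


g = StreamingKNN(k=2)
for i, x in enumerate([0.0, 1.0, 3.0, 7.0]):
    g.add(i, (x,))
print([n for _, n in g.neighbours[0]])   # [1, 2]
g.remove(1)
print([n for _, n in g.neighbours[0]])   # [2, 3]
```

An add costs one distance scan over the current items; a remove rescans only the items whose neighbour list contained the removed id. Results stay exact in both cases.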
 ### 4.10  Time-series streaming model
 | | |
 |---|---|
 | Purpose | A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows.  Optionally projects high-dimensional windows to lower dimensions when configured to do so. |
 | Applications | Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting. |
 | Cython entry points used | `nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9 |
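
The windowing step that feeds such a model can be sketched as follows. How SEM extracts state from a window is not described here, so this hypothetical `sliding_windows` helper only shows how a stream is cut into fixed-length windows, each of which can then be treated as one vector by the distance operators.

```python
from collections import deque


def sliding_windows(stream, length):
    """Yield each full window of `length` consecutive samples as a tuple."""
    window = deque(maxlen=length)   # oldest sample drops out automatically
    for sample in stream:
        window.append(sample)
        if len(window) == length:
            yield tuple(window)


windows = list(sliding_windows([1, 2, 3, 4, 5], length=3))
print(windows)   # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

Because the helper is a generator, it works on unbounded streams and holds only one window in memory at a time.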
 ---
 ## 5.  Composition properties
 The operators in §1–§4 compose along several axes:
 - **Across payload types**: the same operator works for scalars,
  vectors, matrices, tensors, functions, manifolds, complex states,
  distributions, time-series windows.  The caller supplies the
  appropriate distance function or, equivalently, an embedding into
  Euclidean space.
 - **Across hierarchy levels**: concepts can themselves be members of
  parent concepts; operators recurse through the tree (§4.8).
 - **Under wrapping**: stochastic and temporal extensions can be
  layered over any base payload type.  Triple compositions like
  "hierarchy of stochastic time-series" are admissible and produce
  consistent results at every level.
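
The "caller supplies the distance function" idea can be sketched with a single operator applied to three payload types. The helper `nn_distances_generic` and the total-variation distance below are illustrative assumptions, not library functions; the compiled `nn_distances` works on numeric arrays.

```python
import math


def nn_distances_generic(items, dist):
    """For each item, the smallest positive distance to any other item."""
    out = []
    for i, a in enumerate(items):
        ds = (dist(a, b) for j, b in enumerate(items) if j != i)
        positive = [d for d in ds if d > 0]
        out.append(min(positive) if positive else math.inf)
    return out


def total_variation(p, q):
    """Distance between two discrete distributions on the same support."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


scalars = [0.0, 1.0, 4.0]
vectors = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
distributions = [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]

print(nn_distances_generic(scalars, lambda a, b: abs(a - b)))   # [1.0, 1.0, 3.0]
print(nn_distances_generic(vectors, math.dist))                 # [3.0, 4.0, 3.0]
print(nn_distances_generic(distributions, total_variation))     # [0.5, 0.5, 0.5]
```

The operator body never inspects the payload; only `dist` does. That is the sense in which the same operator is reused across payload types.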
 ---
 ## 6.  What the apparatus does NOT offer
 Stated explicitly so users can plan around the limits:
 - No probability distributions over outcomes.  Verdicts are
  structural, not Bayesian.
 - No reward / objective optimisation.  The apparatus does not learn
  policies; it identifies structural relationships.
 - No tuning knobs that trade fidelity for speed.  Where some
  alternatives expose `epsilon`, `top_k`, `temperature`, etc., the
  apparatus uses data-derived structural boundaries instead.
 - No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes).
  Every kNN-related operator returns exact results.
 ---
 ## 7.  Mapping summary
 | Apparatus operator | Cython entry point(s) |
 |---|---|
 | Pairwise similarity | `batch_max_similarity` |
 | Multi-class similarity | `concept_support_matrix` |
 | Pairwise distance | `pairwise_distances` |
 | Nearest-neighbour distance | `nn_distances` |
 | Best-tradeoff filter | `pareto_core_mask` |
 | One-sided peak flag | `one_sided_mask` |
 | Non-redundant witness | `non_redundant_witnesses` |
 | Fused centroid + radius update | `extend_frontier_kernel` |
 | Intrinsic scale | composed of `nn_distances`, `pairwise_distances` |
 | Concept discovery | composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask` |
 | Relational hypothesis generation | composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel` |
 | Semantic gap detection | composed of `concept_support_matrix`, `non_redundant_witnesses` |
 | Prototype construction | composed of `batch_max_similarity`, `concept_support_matrix` |
 | Verdict-qualified inference | composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity` |
 | Lifecycle / dominance verification | composed of `pareto_core_mask` |
 | Hierarchical recursion | every operator above, recursively |
 | Streaming kNN graph | `pairwise_distances`, `nn_distances` |
 | Time-series streaming model | `nn_distances`, `concept_support_matrix`, streaming kNN |
 ## 8.  Library availability
 The Cython entry points in the right column of §7 are all in
 `sem_cython12.wrapper`, distributed at
 [https://git.sevana.biz/vvs/sem_cython12](https://git.sevana.biz/vvs/sem_cython12).
 Higher-level apparatus (composed operators in §4) is built on those
 primitives and ships in the SEM foundation package, separate from
 this library.
@@ -0,0 +1,271 @@
 # SEM — An Overview of Structural Reasoning
 *A public introduction to the SEM (Similarity Energy Model)
 reasoning system, its applications, and the `sem_cython12` library.*
 ---
 ## 1.  What SEM is
 SEM is a reasoning system for **discovering structure in observed
 data** and producing **decision-qualified predictions** about new
 observations.  Unlike conventional machine learning, SEM is not a
 parameterised model fitted to training data: its outputs are derived
 directly from the geometry of the observed world set.  Where ML asks
 "what is the most likely label?", SEM asks "what is the structural
 position of this observation relative to everything we have seen?"
 — and reports the answer as a verdict, not a probability.
 The system has been used as a discovery engine, an anomaly detector,
 a missing-mediator predictor, a regime-change identifier, and an
 explainable inference layer over neural-network embeddings.  Each
 application reuses the same small set of structural operators.
 ## 2.  Properties that distinguish SEM
 - **Parameter-free.**  No learning rates, no regularisation
  coefficients, no tuning knobs in the reasoning pipeline.  Every
  scale or boundary the system consults is computed from the data
  itself.
 - **Threshold-free.**  No `if score > 0.85` decisions.  Where
  conventional pipelines impose a numeric cut-off, SEM uses
  data-derived structural boundaries that adapt to the observed
  geometry.
 - **Three-valued verdict.**  A prediction returns one of:
  - **confident** — a single best-fitting concept dominates;
  - **gap** — multiple concepts are equally admissible, signalling
    that the query lies in a region the current theory has not
    resolved;
  - **incoherent** — no concept admits the query consistently;
    further data is required.
  This refusal-to-guess is the system's most useful safety property:
  it never collapses uncertainty into a forced label.
 - **Detects what is missing.**  SEM identifies positions where
  observed data should produce a structural witness but does not, and
  predicts the features the missing entity should carry.  Conventional
  ML cannot signal that a hidden mediator or unobserved variable is
  required.
 - **Explainable by construction.**  Every prediction comes with a
  decomposition of the supporting evidence, so a downstream system
  (or human reviewer) can audit which structural relations argue for
  a given verdict.
 - **Composable across data types.**  The same reasoning apparatus
  applies to scalars, vectors, matrices, sampled functions, sampled
  manifolds, complex (quantum) state vectors, distributions,
  time-series windows, and recursive concept hierarchies.  The
  operators see all of these through a common interface.
 ## 3.  Where SEM has been applied
 | Domain | Capability used |
 |---|---|
 | Multivariate time series | Regime detection, forecast verdicts, anomaly identification |
 | Scientific law discovery | Recovering analytic relationships from raw measurements |
 | Drug / molecule screening | Structural similarity beyond fingerprints |
 | Network monitoring | Silent-failure detection in encrypted traffic |
 | Causal inference | Discovering missing variables from observational data |
 | Image / signal analysis | Structural feature extraction with explainability |
 | LLM explainability | Interpreting embedding-space behaviour |
 | Geopolitical forecasting | Producing confident / abstain forecasts on event data |
 | Trading & market structure | Regime-switch decisions with abstain semantics |
 In each case the value is the same: the system either gives a
 high-confidence answer or refuses to, and never delivers a confident
 wrong answer disguised as a probability.
 ## 4.  How SEM differs from machine learning
 |  | Machine learning | SEM |
 |---|---|---|
 | Has training phase | yes | no |
 | Has hyper-parameters | yes | no |
 | Can detect missing entities | no | yes |
 | Refuses to predict | no (returns argmax) | yes (gap / incoherent verdict) |
 | Output | numeric / probabilistic | structural with verdict |
 | Explanation | post-hoc (SHAP, LIME, attention) | inherent in the inference |
 | Scale of usable data | requires many examples | works on small data, even single-digit examples |
 SEM and ML are not exclusive — SEM is sometimes layered on top of
 neural-network embeddings to provide an explainability and abstention
 layer, and ML can supply the embeddings SEM reasons over.
 ## 5.  The `sem_cython12` library
 `sem_cython12` is the high-performance numerical kernel layer that
 backs SEM's reasoning operators.  It is delivered as pre-compiled
 Linux and Windows binaries plus a thin Python wrapper; users do not
 compile anything at install time.
 The library exposes one module:
 - `sem_cython12.wrapper` — Python API over the compiled kernels.
 Inside the module, the public functions are grouped by purpose.
 ### 5.1  Configuration
 | Function | Purpose |
 |---|---|
 | `available() -> bool` | Reports whether the compiled extension loaded |
 | `backend() -> str` | `'cython12'` or `'python-fallback'` |
 | `get_num_threads() -> int` | Active OpenMP worker count |
 | `set_num_threads(n: int)` | Set OpenMP worker count (≥ 1) |
 OpenMP thread count defaults to roughly 50 % of the host's logical
 cores, so other processes are not starved on shared machines.  The
 caller can override via `set_num_threads()` or the `SEM_NUM_THREADS`
 environment variable.
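
The default policy described above can be sketched as a small function. `resolve_thread_count` is a hypothetical helper written for illustration; the rounding rule (integer division by two, floor of 1) and the precedence of the environment variable are assumptions, since the text only says "roughly 50 %".

```python
import os


def resolve_thread_count(environ=os.environ, logical_cores=None):
    """Hypothetical helper mirroring the documented default policy."""
    override = environ.get("SEM_NUM_THREADS")
    if override is not None:
        # ASSUMPTION: the variable holds a plain integer; clamp to >= 1.
        return max(1, int(override))
    cores = logical_cores if logical_cores is not None else (os.cpu_count() or 1)
    # ASSUMPTION: "roughly 50 %" is taken as floor(cores / 2), at least 1.
    return max(1, cores // 2)


print(resolve_thread_count({}, logical_cores=8))                        # 4
print(resolve_thread_count({"SEM_NUM_THREADS": "3"}, logical_cores=8))  # 3
```

In real use the effective value is whatever `get_num_threads()` reports after the extension has loaded.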
 ### 5.2  Distance and similarity
 | Function | What it does |
 |---|---|
 | `batch_max_similarity(X_query, X_members, lam)` | For each row of `X_query`, returns a similarity score in `[0, 1]` summarising its closeness to the most similar row of `X_members`.  `lam` (> 0) is the scale that determines how quickly similarity decays with separation. |
 | `concept_support_matrix(X_query, member_mats, lam)` | The same operation applied across `K` independent reference sets, returning a `(Q, K)` score matrix. |
 | `pairwise_distances(X)` | Symmetric `(N, N)` distance matrix between rows of `X`. |
 | `nn_distances(X)` | Per-row minimum positive distance to any other row. |
 These four cover the bulk of SEM's structural-similarity workload.
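
To make the shapes concrete, here is a pure-Python reference sketch of the first two functions. It is not the compiled kernel. Two things are assumptions: that distance is Euclidean, and that similarity decays as `exp(-distance / lam)`; the text above only states that the score lies in `[0, 1]` and decays with separation at a rate set by `lam`.

```python
import math


def batch_max_similarity_ref(X_query, X_members, lam):
    """Similarity in [0, 1] of each query row to its closest member row.

    ASSUMPTION: Euclidean distance and an exp(-d / lam) decay.
    """
    if lam <= 0:
        raise ValueError("lam must be > 0")
    return [
        math.exp(-min(math.dist(q, m) for m in X_members) / lam)
        for q in X_query
    ]


def concept_support_matrix_ref(X_query, member_mats, lam):
    """(Q, K) scores: column k is the similarity to reference set k."""
    columns = [batch_max_similarity_ref(X_query, M, lam) for M in member_mats]
    return [list(row) for row in zip(*columns)]


query = [(0.0, 0.0), (2.0, 0.0)]
set_a = [(0.0, 0.0), (0.0, 1.0)]
set_b = [(2.0, 0.0)]

S = concept_support_matrix_ref(query, [set_a, set_b], lam=1.0)
for row in S:
    print([round(v, 4) for v in row])
```

Each query row scores 1.0 against the set that contains it and a smaller value against the other set, which is the `(Q, K)` layout the Pareto routines in §5.3 consume.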
 ### 5.3  Pareto / dominance reasoning
 | Function | What it computes |
 |---|---|
 | `pareto_core_mask(S)` | Boolean mask of rows not strictly dominated in the maximisation order |
 | `one_sided_mask(S)` | Per-row, per-column mask used for non-redundant-witness selection |
 | `non_redundant_witnesses(S)` | Indices of rows that survive both the Pareto and one-sided filters |
 These let the caller reason about which observations *meaningfully*
 contribute to bridging multiple structural classes — versus those that
 are merely peaks of a single class.
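
A pure-Python reference for the first filter is sketched below. It assumes the usual definition of dominance (at least as large in every column and strictly larger in at least one); the compiled `pareto_core_mask` may treat ties differently, so this should be read as an illustration of the idea rather than a specification.

```python
def pareto_core_mask_ref(S):
    """True for rows that no other row dominates (larger is better).

    ASSUMPTION: row a dominates row b when a >= b in every column and
    a > b in at least one column.
    """
    def dominates(a, b):
        pairs = list(zip(a, b))
        return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

    return [not any(dominates(other, row) for other in S) for row in S]


S = [
    [1.0, 0.10],   # peak of class 0
    [0.9, 0.05],   # dominated by row 0
    [0.1, 1.00],   # peak of class 1
    [0.4, 0.40],   # balanced trade-off, not dominated
    [0.3, 0.30],   # dominated by row 3
]
print(pareto_core_mask_ref(S))   # [True, False, True, True, False]
```

The reference is O(N² · k) and is only meant for checking small cases by hand.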
 ### 5.4  Vector reduction
 | Function | What it computes |
 |---|---|
 | `extend_frontier_kernel(...)` | Fused centroid + radius computation for incremental hypothesis generation |
 Used by higher-level routines that need to enumerate candidate
 relational hypotheses bridging multiple regions of structural space.
 ### 5.5  Performance
 Measured on commodity x86_64 hardware with 8 OpenMP threads against
 the equivalent pure-numpy reference implementations:
 | Operation | Speed-up |
 |---|---|
 | `batch_max_similarity` (N=2000, D=50) | ~14× |
 | `pareto_core_mask` (N=1000, k=8) | ~50× |
 | Streaming kNN ingest (sliding-window, len=600) | ~100× |
 | Higher-arity hypothesis frontier (k=4, m=20) | brute force is intractable; pruned form runs sub-second |
 All routines release the GIL during their inner loops, so calling
 them concurrently from Python threads is safe.
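
A pattern for calling kernels from several Python threads is sketched below. `run_concurrently` is a hypothetical helper, and `sum` stands in for a real kernel so that the snippet runs without the compiled extension; in practice `kernel` could be something like `cy.pairwise_distances`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(kernel, batches, max_workers=4):
    """Apply `kernel` to each batch on a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(kernel, batches))


# Stand-in kernel: the built-in `sum` instead of a compiled routine.
print(run_concurrently(sum, [[1, 2], [3, 4], [5, 6]]))   # [3, 7, 11]
```

Since each kernel call may itself start OpenMP workers, it is probably worth lowering the per-call count with `set_num_threads()` when many Python threads are active, to avoid oversubscribing the machine.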
 ## 6.  A worked Python example
 The following snippet uses only `sem_cython12.wrapper` and `numpy`.
 It shows how a downstream pipeline would identify the **structurally
 informative** members of a small synthetic dataset — those that
 mediate between two clusters rather than sitting at one cluster's
 peak.
 ```python
 import numpy as np
 from sem_cython12 import wrapper as cy
 assert cy.available(), "compiled extension did not load"
 print("backend:", cy.backend(), "  threads:", cy.get_num_threads())
 # Two well-separated clusters in 4-D, plus three "bridging" candidates
 # whose similarity profile spans both clusters.
 rng = np.random.default_rng(0)
 cluster_a = rng.standard_normal((20, 4)) +  3.0
 cluster_b = rng.standard_normal((20, 4)) -  3.0
 bridges   = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.5, 0.5, -0.2, 0.1],
    [-0.3, 0.1,  0.4, -0.2],
 ])
 members = np.vstack([cluster_a, cluster_b, bridges])
 # 1. Build a 2-class similarity matrix:
 #    columns = (sim to cluster_a, sim to cluster_b)
 sim_a = cy.batch_max_similarity(members, cluster_a, lam=1.0)
 sim_b = cy.batch_max_similarity(members, cluster_b, lam=1.0)
 S = np.column_stack([sim_a, sim_b])               # (N, 2)
 # 2. Find the Pareto frontier of (sim_a, sim_b).
 #    Members whose support vector is strictly dominated by another
 #    member are excluded.
 keep_mask = cy.pareto_core_mask(S)
 print("Pareto-frontier members:", int(keep_mask.sum()), "/", len(members))
 # 3. Of those, which are NOT one-sided peaks?
 #    A one-sided member is a peak of exactly one cluster and gains
 #    nothing on the other.  We want members that score on BOTH.
 non_redundant = cy.non_redundant_witnesses(S)
 print("Non-redundant witnesses:", non_redundant.tolist())
 # 4. Inspect the ones that survived: these are the data points that
 #    structurally connect the two clusters.
 for idx in non_redundant:
    print(f"  row {idx}:  sim_a={S[idx, 0]:.3f}  sim_b={S[idx, 1]:.3f}")
 ```
 A typical run prints something like:
 ```
 backend: cython12   threads: 4
 Pareto-frontier members: 8 / 43
 Non-redundant witnesses: [40, 41, 42]
  row 40:  sim_a=0.428  sim_b=0.428
  row 41:  sim_a=0.412  sim_b=0.401
  row 42:  sim_a=0.402  sim_b=0.395
 ```
 The library has filtered out the 40 cluster members (which sit at
 their own cluster's peak and contribute nothing across cluster
 boundaries) and identified the three synthetic "bridges" as the
 structurally informative observations.  This is the kind of
 elementary operation that higher-level SEM reasoning composes into
 concept discovery, gap detection and prototype prediction.
 ## 7.  When to consider SEM
 | Situation | Consider SEM |
 |---|---|
 | You have small data (10–10,000 examples) and need a defensible decision | Yes |
 | You need to know *what is missing* from your data | Yes |
 | You need a model that refuses to guess when the data is ambiguous | Yes |
 | You want explanations that are inherent to the inference, not bolted on | Yes |
 | You have millions of labelled examples and need raw classification accuracy | Stay with ML |
 | You have a regression task with smooth dependencies | Stay with classical statistics |
 ## 8.  Library availability
 `sem_cython12` is distributed as pre-compiled binaries for Linux
 x86_64 and Windows AMD64, covering CPython 3.10 through 3.13; macOS
 builds are not provided.  From a POSIX shell, installation is:
 ```bash
 git clone https://git.sevana.biz/vvs/sem_cython12.git
 cd sem_cython12
 pip install -r requirements.txt
 export PYTHONPATH=$PWD:$PYTHONPATH
 ```
 The package contains `sem_cython12/__init__.py`, `sem_cython12/wrapper.py`,
 and the compiled binaries (`.so` on Linux, `.pyd` on Windows), plus
 `requirements.txt` and a README describing the public API.
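
Because compiled extensions are tied to a specific interpreter build, it can help to check which binary file name the running Python expects. The sketch below uses only the standard library; the `sem_core12` stem is taken from the binary names listed in the v1.1.0 changelog, and `expected_binary_name` is a hypothetical helper.

```python
import sysconfig


def expected_binary_name(stem="sem_core12"):
    """File name this interpreter would import for a compiled module `stem`."""
    # EXT_SUFFIX looks like ".cpython-312-x86_64-linux-gnu.so" on Linux.
    suffix = sysconfig.get_config_var("EXT_SUFFIX") or ""
    return stem + suffix


print(expected_binary_name())
```

If the printed name does not match any binary shipped in the package, `available()` is likely to return `False` for that interpreter.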
 ## 9.  Summary
 SEM is a structural reasoning system whose promise is decision
 quality, not raw accuracy.  Its key product is a verdict-qualified
 prediction: the system tells you whether it is confident, whether
 the data is genuinely ambiguous, or whether the observation lies
 outside the apparatus's coherent coverage.  The `sem_cython12`
 library provides the high-performance numerical layer beneath this
 reasoning, exposing a small, well-defined Python API that downstream
 applications compose into domain-specific pipelines.
Author	SHA1	Message	Date
vvs	ed5ca0cafc	v1.1.0: extend binary matrix to CPython 3.10/3.11/3.13 on Linux and Windows - Linux x86_64: add cp310, cp311, cp313 (.so), built in conda-forge envs. - Windows AMD64: add cp310, cp311, cp313 (.pyd), built with MSVC v14.50. - All eight binaries verified to produce identical numerical output. - README compatibility table + build provenance updated. - macOS still deferred.	2026-05-10 11:15:11 +01:00
vvs	fa87dbb473	Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README	2026-05-09 19:24:57 +01:00
dmytro.bogovych	80f99d1d15	- add 'what is this' section to README.md	2026-05-09 20:46:56 +03:00
vvs	c886ded981	Vendor demos under demos/ and link from README for landing-page visibility	2026-05-09 15:25:52 +01:00