vvs fa87dbb473 Add SEM_Overview.md and SEM_Mathematical_Apparatus.md under docs/ and link from README

2026-05-09 19:24:57 +01:00

13 KiB


SEM — Mathematical Apparatus (Capability Catalog)

A public catalog of the operators SEM offers, what each is for, and which entry points of the sem_cython12 library back them.

This document describes WHAT the apparatus does and WHERE to use it. It does not describe HOW any operator works internally — algorithms, formulas, lemmas and proofs are intentionally not reproduced here.

Conventions

"Item" / "world" / "observation": one row of input data. Items live in some payload space (real numbers, vectors, matrices, sampled functions, sampled manifolds, distributions, complex amplitudes, time-series windows, recursive concept trees) — the apparatus treats them uniformly via a small set of structural operators.
"Concept": a subset of items that share structural meaning. The apparatus can either be told the concepts (labelled mode) or discover them from data (unsupervised mode).
"Witness": an item whose structural position carries information beyond merely belonging to one concept.
"Verdict": the system's qualified output for a new observation: one of `confident`, `gap`, `incoherent` (see §4.6).

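All of the apparatus is parameter-free and threshold-free: there are no fitting parameters, no numeric cut-offs, no fidelity knobs. The `lam` argument accepted by the similarity primitives in §1 is a kernel scale that the apparatus derives from the data itself (§4.1), so it is never tuned by hand.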

1. Structural similarity primitives

These are the lowest-level building blocks. Each is exposed directly in sem_cython12.wrapper.

1.1 Pairwise similarity


Purpose	Score how close a query item is to the most similar member of a reference set.
Output	A score in `[0, 1]` per query (1 = the query coincides with a member of the reference set, 0 = the query is effectively far from every member).
Applications	Membership tests, retrieval, anomaly detection, k-nearest-neighbour pre-filtering, similarity-weighted aggregation.
Cython entry point	`batch_max_similarity(X_query, X_members, lam)`

1.2 Multi-class similarity matrix


Purpose	The same operation applied across `K` independent reference sets in one call, returning a `(Q, K)` score matrix.
Applications	Multi-class classification scoring, multi-criterion membership, class-confusion matrices, support-vector inputs to higher-level filters.
Cython entry point	`concept_support_matrix(X_query, member_mats, lam)`
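The input/output shapes of 1.1 and 1.2 can be pictured with a plain NumPy stand-in. This is only a sketch of the contract, not the library's implementation: the kernel `exp(-distance / lam)`, the Euclidean distance, and the function names `max_similarity_sketch` and `support_matrix_sketch` are all assumptions made for illustration, since the kernel used by sem_cython12 is intentionally not documented here.

```python
import numpy as np

def max_similarity_sketch(X_query, X_members, lam):
    """Best similarity of each query row to any member row.

    ASSUMPTION: similarity = exp(-distance / lam) with Euclidean distance.
    The real kernel behind batch_max_similarity is not documented here.
    """
    diff = X_query[:, None, :] - X_members[None, :, :]
    dist = np.linalg.norm(diff, axis=2)      # (Q, M) query-to-member distances
    # The kernel is decreasing, so the closest member gives the highest score.
    return np.exp(-dist.min(axis=1) / lam)   # (Q,)

def support_matrix_sketch(X_query, member_mats, lam):
    """One column of scores per reference set, giving a (Q, K) matrix."""
    cols = [max_similarity_sketch(X_query, M, lam) for M in member_mats]
    return np.column_stack(cols)

X_query = np.array([[0.0, 0.0], [5.0, 5.0]])
member_mats = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),  # reference set 0
    np.array([[5.0, 4.0]]),              # reference set 1
]
S = support_matrix_sketch(X_query, member_mats, lam=1.0)
print(S.shape)  # (2, 2)
```

The first query coincides with a member of reference set 0, so `S[0, 0]` is 1 under the assumed kernel; every other entry lies strictly between 0 and 1.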

1.3 Pairwise distance matrix


Purpose	Symmetric `(N, N)` distance matrix between rows of `X`.
Applications	Graph construction, clustering, scale estimation, downstream filtering and ranking.
Cython entry point	`pairwise_distances(X)`
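As a point of reference for the output shape, a symmetric `(N, N)` distance matrix can be built in NumPy as below. The sketch assumes Euclidean distance, which this catalog does not confirm for `pairwise_distances`; the name `pairwise_distances_sketch` is hypothetical.

```python
import numpy as np

def pairwise_distances_sketch(X):
    """Symmetric (N, N) matrix of distances between the rows of X.

    ASSUMPTION: Euclidean distance; the metric used by the library
    is not stated in this catalog.
    """
    diff = X[:, None, :] - X[None, :, :]   # (N, N, D) row differences
    return np.linalg.norm(diff, axis=2)

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
D = pairwise_distances_sketch(X)
print(D[0, 1])  # 5.0
```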

1.4 Nearest-neighbour distance vector


Purpose	For each row, the minimum positive distance to any other row. Rows with no positive-distance neighbour receive `inf`.
Applications	Local-density estimation, intrinsic-scale derivation, duplicate detection, outlier identification.
Cython entry point	`nn_distances(X)`
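The "minimum positive distance, else `inf`" rule can be illustrated with a small NumPy stand-in. It assumes Euclidean distance and is not the library's implementation; `nn_distances_sketch` is a hypothetical name.

```python
import numpy as np

def nn_distances_sketch(X):
    """Per row, the smallest strictly positive distance to any other row.

    ASSUMPTION: Euclidean distance. Self-distances and exact duplicates
    have distance 0 and are therefore ignored; a row with no
    positive-distance neighbour gets inf.
    """
    diff = X[:, None, :] - X[None, :, :]
    D = np.linalg.norm(diff, axis=2)
    D[D <= 0.0] = np.inf   # mask the diagonal and any duplicate rows
    return D.min(axis=1)

X = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
print(nn_distances_sketch(X))                     # [5. 5. 5.]
print(nn_distances_sketch(np.array([[1.0, 2.0]])))  # [inf]
```

The two duplicate rows do not count as each other's neighbour, which is why both report the distance to the third row.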

2. Multi-criterion filtering primitives

Given a real-valued matrix S of shape (N, k) (rows are items, columns are independent criteria — each in maximisation orientation), these primitives identify structurally informative subsets of rows.

2.1 Best-tradeoff filter


Purpose	Mask the rows that survive a multi-objective best-tradeoff filter (i.e. items that are not strictly worse than another item on every criterion).
Applications	Multi-objective optimisation frontier, concept-membership trade-off, candidate winnowing before further analysis.
Cython entry point	`pareto_core_mask(S)`
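The filter as worded above (drop a row only if some other row is strictly better on every criterion) can be sketched in NumPy. This follows the sentence in the Purpose row literally; whether `pareto_core_mask` treats ties the same way is not stated here, and `pareto_mask_sketch` is a hypothetical name.

```python
import numpy as np

def pareto_mask_sketch(S):
    """Boolean (N,) mask of rows not strictly beaten on every column.

    ASSUMPTION: 'strictly worse on every criterion' is taken literally,
    with all columns in maximisation orientation.
    """
    # beaten[i, j] is True when row j exceeds row i in every column.
    beaten = (S[None, :, :] > S[:, None, :]).all(axis=2)
    return ~beaten.any(axis=1)

S = np.array([
    [1.0, 5.0],
    [4.0, 4.0],
    [5.0, 1.0],
    [2.0, 2.0],   # beaten by [4, 4]
    [1.0, 1.0],   # beaten by [4, 4]
])
print(pareto_mask_sketch(S))  # [ True  True  True False False]
```

Under the more common dominance convention (at least as good everywhere, strictly better somewhere) rows that tie on one column could be treated differently, so the sketch should not be used to predict the library's behaviour on ties.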

2.2 One-sided peak flagging


Purpose	Flag row/column pairs where the row is the column-wise winner but contributes nothing on the remaining columns - i.e. items that "peak" on a single criterion alone.
Applications	Removing items that are only locally informative; finding cross-criterion contributors; bridge identification.
Cython entry point	`one_sided_mask(S)`

2.3 Non-redundant witness identification


Purpose	The subset of rows that survive both 2.1 and 2.2 — items that contribute meaningfully across multiple criteria, not just on one.
Applications	Bridge-witness selection between concept regions, structurally informative subset extraction, downstream gap analysis.
Cython entry point	`non_redundant_witnesses(S)`
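How 2.1 and 2.2 combine into 2.3 can be sketched as a mask intersection. Two readings here are assumptions, not documented behaviour: "contributes nothing" is taken to mean an exact zero score, and a row is excluded if it is one-sided on any column. All function names are hypothetical.

```python
import numpy as np

def pareto_mask_sketch(S):
    """Rows not strictly beaten on every column (literal reading of 2.1)."""
    beaten = (S[None, :, :] > S[:, None, :]).all(axis=2)
    return ~beaten.any(axis=1)

def one_sided_mask_sketch(S):
    """(N, k) mask: True at (i, c) when row i tops column c and is zero elsewhere.

    ASSUMPTION: 'contributes nothing' means an exact zero on the other columns.
    """
    wins = S == S.max(axis=0, keepdims=True)          # column-wise winners
    nonzero = (S != 0).astype(int)
    # Count non-zero entries in the columns other than c.
    others_nonzero = nonzero.sum(axis=1, keepdims=True) - nonzero
    return wins & (others_nonzero == 0)

def witnesses_sketch(S):
    """Rows that pass the trade-off filter and are not one-sided peaks."""
    return pareto_mask_sketch(S) & ~one_sided_mask_sketch(S).any(axis=1)

S = np.array([
    [1.0, 0.0],   # peaks on column 0 only
    [0.6, 0.7],   # contributes on both columns
    [0.0, 0.9],   # peaks on column 1 only
    [0.2, 0.1],   # beaten by row 1 on both columns
])
print(witnesses_sketch(S))  # [False  True False False]
```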

3. Incremental aggregation primitive

3.1 Fused centroid + radius update


Purpose	One-pass bulk update for an incremental aggregation step. Given `F` reference items - each summarised by a centre vector and a radius (representing the dispersion of `cur_arity` underlying points) - and `A` candidate new contributions, produce all `F * A` updated (centre, radius) pairs that result from appending one candidate to one reference item.
Applications	Streaming centroid / radius maintenance, candidate-frontier expansion in multi-stage selection, online aggregation pipelines.
Cython entry point	`extend_frontier_kernel(cur_centers, cur_radii, new_emb, cur_arity)`
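The shape of the fused update can be illustrated with a running-mean and running-dispersion formula. The sketch assumes that the centre is the arithmetic mean, that the radius is the root-mean-square distance of the underlying points to that mean, and that all `F` references share one `cur_arity`; none of these definitions is confirmed by this catalog, and `extend_frontier_sketch` is a hypothetical name.

```python
import numpy as np

def extend_frontier_sketch(cur_centers, cur_radii, new_emb, cur_arity):
    """All F * A (centre, radius) pairs after appending one candidate.

    ASSUMPTIONS: centre = mean of the points, radius = RMS distance to the
    centre, and every reference item summarises cur_arity points.
    Returns centres of shape (F, A, D) and radii of shape (F, A).
    """
    n = cur_arity
    delta = new_emb[None, :, :] - cur_centers[:, None, :]   # (F, A, D)
    centers = cur_centers[:, None, :] + delta / (n + 1)     # running mean
    sq = (delta ** 2).sum(axis=2)                           # (F, A)
    # The sum of squared deviations grows by n / (n + 1) * |x - c|^2.
    total = n * cur_radii[:, None] ** 2 + sq * n / (n + 1)
    return centers, np.sqrt(total / (n + 1))

# Check against a direct recomputation on explicit points.
rng = np.random.default_rng(0)
groups = rng.normal(size=(2, 4, 3))      # F = 2 groups of n = 4 points in 3-D
cands = rng.normal(size=(5, 3))          # A = 5 candidates
c0 = groups.mean(axis=1)
r0 = np.sqrt(((groups - c0[:, None, :]) ** 2).sum(axis=2).mean(axis=1))
centers, radii = extend_frontier_sketch(c0, r0, cands, cur_arity=4)
print(centers.shape, radii.shape)  # (2, 5, 3) (2, 5)
```

The point of fusing the two updates is that neither the underlying points nor a second pass over them is needed; the summary `(centre, radius, arity)` is enough under the stated assumptions.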

4. Higher-level apparatus

Built on the primitives in §1–§3. These are the operators that distinguish SEM as a reasoning system rather than a computation library. Their internal construction is not reproduced here; the "Cython entry points used" row of each table lists the public primitives the operator composes.

4.1 Intrinsic scale


Purpose	Derive the kernel scale from the data's own structural geometry, so that no manual `lam` value is ever required.
Applications	Any pipeline that wants the scale property to be a function of the data, not a tuning knob; cross-application portability.
Cython entry points used	`nn_distances`, `pairwise_distances`

4.2 Concept discovery


Purpose	Group observations into structurally coherent regions without using labels, ML training, or numeric thresholds. Returns the concepts the data itself supports.
Applications	Unsupervised classification, regime identification, exploratory analysis, foundation for downstream operators.
Cython entry points used	`pairwise_distances`, `nn_distances`, `pareto_core_mask`

4.3 Relational hypothesis generation


Purpose	Enumerate candidate structural relationships between concepts (pair-wise and higher-arity) and rank them by support.
Applications	Discovering laws / regularities between groups, cross-concept analysis, scientific structure recovery.
Cython entry points used	`concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel`

4.4 Semantic gap detection


Purpose	Identify positions in structural space where the data should produce a witness bridging two or more concepts but does not.
Applications	Detecting missing variables, hidden mediators, unobserved confounders; identifying where additional measurement would resolve ambiguity.
Cython entry points used	`concept_support_matrix`, `non_redundant_witnesses`

4.5 Prototype construction


Purpose	Predict the structural features of an item that should exist between known concepts but has not yet been observed.
Applications	Drug-candidate suggestion, missing-mediator prediction, "what if" scenario generation, hypothesis-driven data acquisition.
Cython entry points used	`batch_max_similarity`, `concept_support_matrix`

4.6 Verdict-qualified inference


Purpose	Decide which concept best explains a new observation, returning one of three outcomes: `confident` (a single concept dominates), `gap` (multiple concepts are equally admissible), `incoherent` (no concept admits the observation consistently).
Applications	Decision-support systems that must abstain when ambiguous, safety-critical classification, regime change detection, automated triage.
Cython entry points used	`concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity`
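For a caller, the three outcomes behave like a small closed type. The sketch below shows one way to represent them and to act on them; it assumes the set of concepts that admit the observation has already been determined by some other means, which this catalog does not describe. `Verdict` and `verdict_from_admissible` are hypothetical names, not part of SEM.

```python
from enum import Enum

class Verdict(Enum):
    CONFIDENT = "confident"
    GAP = "gap"
    INCOHERENT = "incoherent"

def verdict_from_admissible(admissible):
    """Map the concepts that admit an observation to one of the three outcomes.

    ASSUMPTION: `admissible` is a collection of concept labels produced
    upstream; how admissibility is decided is not shown here.
    """
    if len(admissible) == 0:
        return Verdict.INCOHERENT   # no concept admits the observation
    if len(admissible) == 1:
        return Verdict.CONFIDENT    # a single concept dominates
    return Verdict.GAP              # several concepts are equally admissible

print(verdict_from_admissible(["regime_a"]).value)              # confident
print(verdict_from_admissible(["regime_a", "regime_b"]).value)  # gap
print(verdict_from_admissible([]).value)                        # incoherent
```

A decision-support caller would typically act only on `CONFIDENT` and route `GAP` and `INCOHERENT` to abstention or review.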

4.7 Lifecycle / dominance verification


Purpose	When a real observation arrives, decide whether it confirms, displaces, or co-exists with a previously predicted prototype. Maintains the prototype's status across its lifetime.
Applications	Continuous-learning pipelines, theory revision under new evidence, audit-trail-preserving inference.
Cython entry points used	`pareto_core_mask`

4.8 Hierarchical recursion


Purpose	Apply every operator above to recursive concept trees — concepts whose members are themselves concepts. Operators bubble through the hierarchy and remain mathematically consistent at every level.
Applications	Taxonomies, organisational hierarchies, multi-scale analysis (chemical → biological → organism, file → folder → project, etc.).
Cython entry points used	the operators above, recursively

4.9 Streaming kNN graph maintenance


Purpose	Maintain an exact k-nearest-neighbour graph as items are added or removed one at a time, without rebuilding from scratch on each update.
Applications	Online time-series ingest, sliding-window analytics, sensor-stream monitoring, real-time anomaly detection.
Cython entry points used	`pairwise_distances`, `nn_distances` (on the contiguous buffer); above 1000 items `scipy.spatial.cKDTree` is used internally for exact queries, typically O(log N) each in low dimensions. There is no fidelity knob.
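What "exact, without rebuilding" means for insertion can be shown with a minimal pure-NumPy sketch: a new item costs one pass over the buffer, and each existing neighbour list changes only if the newcomer beats its current k-th neighbour. This is an illustration under the assumption of Euclidean distance, not SEM's implementation; it omits removal and the cKDTree path, and `StreamingKNN` is a hypothetical class.

```python
import numpy as np

class StreamingKNN:
    """Exact k-nearest-neighbour lists under one-at-a-time insertion."""

    def __init__(self, k, dim):
        self.k = k
        self.points = np.empty((0, dim))
        self.nbr_idx = []    # per point: neighbour indices, nearest first
        self.nbr_dist = []   # per point: matching distances, ascending

    def add(self, x):
        x = np.asarray(x, dtype=float)
        n = len(self.points)                          # index of the newcomer
        d = np.linalg.norm(self.points - x, axis=1)   # distances to the buffer
        # Existing points: splice the newcomer in only where it improves the list.
        for i in range(n):
            dist, idx = self.nbr_dist[i], self.nbr_idx[i]
            if len(dist) < self.k or d[i] < dist[-1]:
                pos = int(np.searchsorted(dist, d[i], side="right"))
                dist.insert(pos, float(d[i]))
                idx.insert(pos, n)
                del dist[self.k:], idx[self.k:]       # keep at most k entries
        # Newcomer: its own k nearest among the points already stored.
        order = np.argsort(d, kind="stable")[: self.k]
        self.nbr_idx.append([int(j) for j in order])
        self.nbr_dist.append([float(d[j]) for j in order])
        self.points = np.vstack([self.points, x])

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
graph = StreamingKNN(k=3, dim=2)
for p in pts:
    graph.add(p)
print(len(graph.nbr_idx), len(graph.nbr_idx[0]))  # 30 3
```

Because every update compares true distances, the lists match a from-scratch brute-force rebuild; there is no approximation to trade against speed.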

4.10 Time-series streaming model


Purpose	A complete reasoning model over sliding windows of a stream: state extraction, transition modelling, intrinsic-scale maintenance, and verdict-qualified prediction on novel windows. Optionally projects high-dimensional windows to lower dimensions when configured to do so.
Applications	Multivariate time-series classification, regime detection, online anomaly identification, signal-quality forecasting.
Cython entry points used	`nn_distances` (intrinsic scale), `concept_support_matrix` (verdict), the streaming-kNN apparatus from 4.9
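Feeding a stream to row-oriented primitives requires turning each sliding window into one row. The sketch below shows one way to do that with NumPy's `sliding_window_view`; the flattening order and the name `windows_as_rows` are assumptions for illustration, and the model's own state extraction is not shown.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windows_as_rows(series, width):
    """Turn a (T, C) series into (T - width + 1, width * C) rows.

    ASSUMPTION: each window is flattened time-major, i.e. all channels
    of the first time step, then all channels of the second, and so on.
    """
    view = sliding_window_view(series, width, axis=0)   # (T - width + 1, C, width)
    return view.transpose(0, 2, 1).reshape(view.shape[0], -1)

series = np.arange(10.0).reshape(5, 2)   # T = 5 time steps, C = 2 channels
rows = windows_as_rows(series, width=3)
print(rows.shape)  # (3, 6)
print(rows[0])     # [0. 1. 2. 3. 4. 5.]
```

Each row can then be treated as one item in the sense of the Conventions section.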

5. Composition properties

The operators in §1–§4 compose along several axes:

Across payload types: the same operator works for scalars, vectors, matrices, tensors, functions, manifolds, complex states, distributions, time-series windows. The caller supplies the appropriate distance function or, alternatively, an embedding into Euclidean space.
Across hierarchy levels: concepts can themselves be members of parent concepts; operators recurse through the tree (§4.8).
Under wrapping: stochastic and temporal extensions can be layered over any base payload type. Triple compositions like "hierarchy of stochastic time-series" are admissible and produce consistent results at every level.
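The "embedding into Euclidean space" option can be made concrete for one payload type. For complex amplitudes, stacking real and imaginary parts gives a real vector whose Euclidean distances equal the complex ones, so real-valued primitives can be applied unchanged. The helper name `embed_complex` is hypothetical; other payload types would need their own embedding.

```python
import numpy as np

def embed_complex(Z):
    """Map (N, D) complex rows to (N, 2 * D) real rows: real parts, then imaginary."""
    return np.concatenate([Z.real, Z.imag], axis=1)

Z = np.array([[1 + 2j, 0j], [4 - 2j, 3j]])
E = embed_complex(Z)

d_complex = np.linalg.norm(Z[0] - Z[1])   # distance in the complex payload space
d_real = np.linalg.norm(E[0] - E[1])      # distance after embedding
print(np.isclose(d_complex, d_real))      # True
```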

6. What the apparatus does NOT offer

Stated explicitly so users can plan around the limits:

No probability distributions over outcomes. Verdicts are structural, not Bayesian.
No reward / objective optimisation. The apparatus does not learn policies; it identifies structural relationships.
No tuning knobs that trade fidelity for speed. Where some alternatives expose epsilon, top_k, temperature, etc., the apparatus uses data-derived structural boundaries instead.
No approximate-mode kNN (HNSW / IVF / LSH / FAISS lossy modes). Every kNN-related operator returns exact results.

7. Mapping summary

Apparatus operator	Cython entry point(s)
Pairwise similarity	`batch_max_similarity`
Multi-class similarity	`concept_support_matrix`
Pairwise distance	`pairwise_distances`
Nearest-neighbour distance	`nn_distances`
Best-tradeoff filter	`pareto_core_mask`
One-sided peak flag	`one_sided_mask`
Non-redundant witness	`non_redundant_witnesses`
Fused centroid + radius update	`extend_frontier_kernel`
Intrinsic scale	composed of `nn_distances`, `pairwise_distances`
Concept discovery	composed of `pairwise_distances`, `nn_distances`, `pareto_core_mask`
Relational hypothesis generation	composed of `concept_support_matrix`, `pareto_core_mask`, `extend_frontier_kernel`
Semantic gap detection	composed of `concept_support_matrix`, `non_redundant_witnesses`
Prototype construction	composed of `batch_max_similarity`, `concept_support_matrix`
Verdict-qualified inference	composed of `concept_support_matrix`, `pareto_core_mask`, `batch_max_similarity`
Lifecycle / dominance verification	composed of `pareto_core_mask`
Hierarchical recursion	every operator above, recursively
Streaming kNN graph	`pairwise_distances`, `nn_distances`
Time-series streaming model	`nn_distances`, `concept_support_matrix`, streaming kNN

8. Library availability

The Cython entry points in the right column of §7 are all in sem_cython12.wrapper, distributed at https://git.sevana.biz/vvs/sem_cython12. Higher-level apparatus (composed operators in §4) is built on those primitives and ships in the SEM foundation package, separate from this library.
