Evidently
Repo: evidently (open-source ML/LLM evaluation & monitoring framework)
Scope of this report: Python package under src/evidently/. The React frontend (/ui/) is mentioned only where it interfaces with the Python service.
Method: Static reading of source. All citations use path:line against the working tree at HEAD = a4aa4c2b.
1. High-Level Project Layout
    src/evidently/
    ├── __init__.py             # Public API: Report, Run, Dataset, DataDefinition, ...
    ├── _registry.py            # Imports all sub-registries (plugin discovery)
    ├── _pydantic_compat.py     # Pydantic v1/v2 shim (uses pydantic.v1 if v2 installed)
    ├── pydantic_utils.py       # Foundation: PolymorphicModel, FrozenBaseModel, EvidentlyBaseModel
    │
    ├── core/                   # NEW v2 API: what `from evidently import Report` exposes
    │   ├── datasets.py         # Dataset, DataDefinition, classification/regression task descriptors
    │   ├── report.py           # Report, Snapshot (aliased as Run), Context
    │   ├── metric_types.py     # Metric / MetricCalculation / MetricResult / MetricTest hierarchies
    │   ├── tests.py            # Test framework on top of metric_types
    │   ├── container.py        # MetricContainer (lazy metric generators)
    │   ├── preset_types.py     # Preset abstract class
    │   ├── serialization.py    # JSON / YAML round-trip helpers
    │   ├── compare.py          # `compare(report_a, report_b)`
    │   └── registries/         # Type-alias tables for polymorphic Pydantic deserialization
    │
    ├── metrics/                # v2 metric implementations (column_statistics, classification, ...)
    ├── tests/                  # v2 test builders (eq, gt, lt, between, ...)
    ├── presets/                # v2 preset bundles (DataDriftPreset, TextEvals, ...)
    ├── descriptors/            # v2 row-wise feature generators (Sentiment, TextLength, LLM judges)
    ├── generators/             # v2 column generators (parameterised metric expansion)
    │
    ├── future/                 # Stable re-export alias for `core/`: same API, different import path
    │
    ├── llm/                    # LLM provider abstraction (litellm-first), prompts, RAG, optimisation
    ├── guardrails/             # Decorator-based input/output validators (Negativity, PII, Toxicity, ...)
    │
    ├── ui/                     # Litestar monitoring service + storage abstractions + Workspace
    ├── sdk/                    # Artifacts/Prompts/Configs/Datasets managers (local + remote-friendly)
    │
    ├── cli/                    # Typer CLI: `evidently ui|report|demo_project|migrate|legacy_ui`
    │
    └── legacy/                 # v1 API kept alive for back-compat (strangler-fig migration)
        ├── core.py             # ColumnType, BaseResult
        ├── base_metric.py      # v1 Metric / MetricResult / ColumnMetric
        ├── metrics/            # v1 metrics (still backing many UI snapshots)
        ├── tests/, test_suite/, test_preset/
        ├── report/, suite/
        ├── renderers/          # Compute/presentation split (Strategy)
        ├── calculation_engine/ # Engine/PythonEngine (Strategy for compute backend)
        ├── features/           # GeneratedFeature: derived columns (≠ Metric)
        ├── calculations/       # Stat tests, drift, data-quality maths
        ├── options/            # Cascading Options pattern
        └── ui/                 # v1 monitoring UI (legacy_ui CLI command)

The two coexisting namespaces, evidently/core/ (new) and evidently/legacy/ (old), are the most important fact about this repository. The new API is a thin, more ergonomic layer; the old one still owns most of the mathematical heavy lifting.
2. The Pydantic Polymorphism Foundation
Every serialisable type in Evidently (Metric, MetricResult, Test, Preset, MetricContainer, Feature, GuardrailBase) descends from a small foundation defined in src/evidently/pydantic_utils.py.
2.1 Class chain
classDiagram
class BaseModel {
<<pydantic>>
}
class FrozenBaseModel {
<<pydantic_utils.py:81>>
+__hash__()
+Frozen + auto-register
}
class PolymorphicModel {
<<pydantic_utils.py:197>>
+type : str «discriminator»
+__init_subclass__()
}
class AutoAliasMixin {
<<pydantic_utils.py:623>>
+__alias_type__ : ClassVar[str]
+builds short alias path
}
class EvidentlyBaseModel {
<<pydantic_utils.py:331>>
+dump(path)
+load(path)
+update(**kw)
+fingerprint()
}
BaseModel <|-- FrozenBaseModel
BaseModel <|-- PolymorphicModel
FrozenBaseModel <|-- EvidentlyBaseModel
PolymorphicModel <|-- EvidentlyBaseModel
AutoAliasMixin <.. EvidentlyBaseModel : mixed-in by subclasses
2.2 What each piece does
- `FrozenBaseModel` (`pydantic_utils.py:81`): immutable, hashable, supports auto-registration of subclasses via `__init_subclass__`. Hashability matters because metric fingerprints (MD5 of class-path + field values) are used as identity keys in the `Context` cache.
- `PolymorphicModel` (`pydantic_utils.py:197`): adds a `type` discriminator field. Two module-level dicts (`pydantic_utils.py:132–133`):

      TYPE_ALIASES: Dict[(BaseClass, alias_str), classpath]
      LOADED_TYPE_ALIASES: Dict[(BaseClass, alias_str), Type]

  These map short aliases (e.g. `evidently:metric_v2:Accuracy`) to fully-qualified Python classpaths. On load, `PolymorphicModel.validate()` reads `type`, looks up the right subclass, and deserialises into it. This is the standard registered-subclass trick to round-trip Pydantic unions through JSON / YAML.
- `AutoAliasMixin` (`pydantic_utils.py:623`): gives each Evidently subsystem its own short prefix (`metric_v2`, `metric_container`, `descriptor`, `test`, …) so aliases are stable and human-readable.
- `EvidentlyBaseModel` (`pydantic_utils.py:331`): top of the inheritance chain for “real” domain types. Adds:
  - `dump()` / `load()` for YAML/JSON,
  - `update(**fields)` for safe field replacement on a frozen model,
  - `fingerprint()`: deterministic MD5 hash used as the metric identity in the `Context` registry.

`_pydantic_compat.py` detects whether the installed Pydantic is v1 or v2 and selects either the native module or `pydantic.v1`. All Evidently base types still target Pydantic v1 semantics (`BaseModel.Config`, `validator`), which keeps the codebase working on both.
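The registry-plus-discriminator idea described above can be illustrated without Pydantic. The following is a minimal sketch, not Evidently's code: `PolymorphicSketch`, `_ALIASES`, `MinValueSketch`, and the `sketch:` alias prefix are hypothetical stand-ins for `PolymorphicModel`, `TYPE_ALIASES`, and a real metric, and frozen dataclasses stand in for frozen Pydantic models.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import ClassVar, Dict, Type

# Hypothetical stand-in for a type-alias table: alias string -> class.
_ALIASES: Dict[str, Type["PolymorphicSketch"]] = {}


@dataclass(frozen=True)
class PolymorphicSketch:
    """Frozen, hashable base whose subclasses self-register under an alias."""

    alias_prefix: ClassVar[str] = "sketch"

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        _ALIASES[f"{cls.alias_prefix}:{cls.__name__}"] = cls

    def to_dict(self) -> dict:
        # The "type" key plays the role of the discriminator field.
        return {"type": f"{self.alias_prefix}:{type(self).__name__}", **asdict(self)}

    @staticmethod
    def from_dict(payload: dict) -> "PolymorphicSketch":
        data = dict(payload)
        cls = _ALIASES[data.pop("type")]  # resolve the concrete subclass
        return cls(**data)

    def fingerprint(self) -> str:
        # Deterministic MD5 over class path + field values.
        class_path = f"{type(self).__module__}.{type(self).__qualname__}"
        raw = json.dumps({"cls": class_path, **asdict(self)}, sort_keys=True)
        return hashlib.md5(raw.encode()).hexdigest()


@dataclass(frozen=True)
class MinValueSketch(PolymorphicSketch):
    column: str


# Round-trip through JSON text and back into the right subclass.
wire = json.dumps(MinValueSketch("age").to_dict())
restored = PolymorphicSketch.from_dict(json.loads(wire))
```

Because the instances are frozen and compare by value, two configs with the same fields produce the same fingerprint and collapse to one entry when used as dictionary keys or set members.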
2.3 Why this matters
This single pattern (frozen, hashable, polymorphic, fingerprintable Pydantic models) is the reason Evidently can:
- Send a `Report` over JSON (UI ↔ service ↔ storage).
- Compare two snapshots structurally (`core/compare.py`).
- Cache metric results by identity in `Context`.
- Allow user-defined custom metrics to round-trip through the registry without service code changes.
3. The Two Co-Existing APIs
3.1 v2 (core/): Config / Calculation split
The new API splits each metric into two cooperating classes:
| Concern | Type | What it carries |
|---|---|---|
| What to compute | Metric subclass | column name, thresholds, tests, output shape — pure config |
| How to compute it | MetricCalculation subclass | the actual calculate(context, current, reference) body |
A class-level binding wires the two together via __init_subclass__, set on MetricCalculation at core/metric_types.py:1229. So you write:
    class MinValue(StatisticsMetric): ...                          # config

    class MinValueCalculation(StatisticsCalculation[MinValue]):    # calculation
        def calculate_value(self, column):
            return column.data.min()

…and Evidently auto-records `MinValue.__calculation_type__ = MinValueCalculation`. `Metric.to_calculation()` later instantiates the calculation class with the metric instance attached.
Why this split? It cleanly separates the serialisable description of a metric from its executable behaviour. A Report can be saved, sent over the wire, or stored in the UI as a tree of Metric configs; the corresponding MetricCalculation is only resolved at run time. This is analogous to Spark's separation of logical and physical plans.
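The automatic binding can be sketched with plain `typing` generics. This is an illustration under assumptions, not the code at `core/metric_types.py:1229`: `MetricSketch`, `CalculationSketch`, and the dict-based input are hypothetical, and the lookup through `__orig_bases__` is one possible way to recover the metric class from the generic parameter.

```python
from typing import ClassVar, Generic, Optional, Type, TypeVar, get_args

TMetric = TypeVar("TMetric", bound="MetricSketch")


class MetricSketch:
    """Config side: plain data, no compute logic."""

    __calculation_type__: ClassVar[Optional[Type["CalculationSketch"]]] = None

    def __init__(self, column: str):
        self.column = column

    def to_calculation(self) -> "CalculationSketch":
        if self.__calculation_type__ is None:
            raise TypeError(f"no calculation bound to {type(self).__name__}")
        return self.__calculation_type__(self)


class CalculationSketch(Generic[TMetric]):
    """Behaviour side: holds the metric config and computes a value."""

    def __init__(self, metric: TMetric):
        self.metric = metric

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Find the concrete metric class in e.g. CalculationSketch[MinValue]
        # and record this calculation class on it.
        for base in getattr(cls, "__orig_bases__", ()):
            for arg in get_args(base):
                if isinstance(arg, type) and issubclass(arg, MetricSketch):
                    arg.__calculation_type__ = cls

    def calculate(self, data: dict) -> float:
        raise NotImplementedError


class MinValue(MetricSketch):
    pass


class MinValueCalculation(CalculationSketch[MinValue]):
    def calculate(self, data: dict) -> float:
        return min(data[self.metric.column])


calc = MinValue("age").to_calculation()
result = calc.calculate({"age": [31, 18, 47]})
```

Only `MinValue("age")` would need to be serialised; the calculation object is created on demand from the class-level binding.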
3.2 v1 (legacy/) — single-class Metric
The legacy API keeps both concerns in one class: Metric[TResult] (legacy/base_metric.py:241) directly defines calculate(data: InputData) -> TResult. There is no separate calculation object; subclasses are concrete implementations.
3.3 Where each is used
| API | Public entry points | Used by |
|---|---|---|
| v2 (core/) | from evidently import Report, Run, Dataset | New code, examples, presets |
| v1 (legacy/) | from evidently.legacy.report import Report | Legacy reports, the legacy UI service, snapshots already on disk |
| Bridge | ui/backport.py | Converts v1 results to v2 shapes for unified rendering |
4. v2 Metric Class Hierarchy (core/metric_types.py)
4.1 Three parallel hierarchies
classDiagram
direction LR
%% --- Config side ---
class Metric {
<<abstract, line 1114>>
+tests
+__calculation_type__ : ClassVar[Type]
+get_bound_tests(ctx)*
+to_calculation()
}
class SingleValueMetric { <<line 1289>> }
class ByLabelMetric { <<line 1363>> }
class CountMetric { <<line 1580>> }
class MeanStdMetric { <<line 1760>> }
class DataframeMetric { <<line 1789>> }
class ColumnMetric {
<<abstract, line 1891>>
+column : str
}
Metric <|-- SingleValueMetric
Metric <|-- ByLabelMetric
Metric <|-- CountMetric
Metric <|-- MeanStdMetric
Metric <|-- DataframeMetric
Metric <|-- ColumnMetric
%% --- Calculation side ---
class MetricCalculationBase {
<<abstract, line 924>>
+calculate(ctx, cur, ref)*
+display_name()*
+call(ctx)
}
class MetricCalculation {
<<abstract, line 1229>>
+metric : TMetric
}
class SingleValueCalculation
class ByLabelCalculation
class CountCalculation
class MeanStdCalculation
class DataframeCalculation
MetricCalculationBase <|-- MetricCalculation
MetricCalculation <|-- SingleValueCalculation
MetricCalculation <|-- ByLabelCalculation
MetricCalculation <|-- CountCalculation
MetricCalculation <|-- MeanStdCalculation
MetricCalculation <|-- DataframeCalculation
%% --- Result side ---
class MetricResult {
<<abstract, line 165>>
+set_metric_location(metric)*
+to_simple_dict()*
+itervalues()
}
class SingleValue
class ByLabelValue
class ByLabelCountValue
class CountValue
class MeanStdValue
class DataframeValue
MetricResult <|-- SingleValue
MetricResult <|-- ByLabelValue
MetricResult <|-- ByLabelCountValue
MetricResult <|-- CountValue
MetricResult <|-- MeanStdValue
MetricResult <|-- DataframeValue
%% --- Cross-relations ---
Metric ..> MetricCalculation : __calculation_type__
MetricCalculation ..> MetricResult : produces
The shape of a metric’s output (single number, dict-by-label, mean/std pair, full DataFrame, …) determines which *Metric and *Result subclasses it inherits — the result type is the central design dimension, not the algorithm.
4.2 A concrete example (column statistics)
src/evidently/metrics/column_statistics.py is the canonical pattern:

    class StatisticsMetric(ColumnMetric, SingleValueMetric):  # dual-inherits
        def _default_tests_with_reference(self, ctx): ...

    class StatisticsCalculation(SingleValueCalculation[TStatisticsMetric]):
        @abstractmethod
        def calculate_value(self, column: DatasetColumn) -> float: ...

        def calculate(self, context, current, reference):
            cur = self.result(self.calculate_value(current.column(self.column)))
            ref = self.result(self.calculate_value(reference.column(self.column))) if reference else None
            return cur, ref

    class MinValue(StatisticsMetric): ...

    class MinValueCalculation(StatisticsCalculation[MinValue]):
        def calculate_value(self, column):
            return column.data.min()

Adding a new column statistic means subclassing StatisticsMetric and StatisticsCalculation and overriding calculate_value. Everything else (test binding, fingerprint, registry, rendering) is inherited.
4.3 Test binding (MetricTest → BoundTest)
Tests in v2 are not run inline. They are bound late to results:
classDiagram
class MetricTest {
<<abstract, line 1008>>
+is_critical : bool
+alias : Optional[str]
+to_test()*
+bind_single(fp)
+bind_by_label(fp, label)
+bind_dataframe(fp, col, filters)
}
class BoundTest {
<<abstract, line 1080>>
+test : MetricTest
+metric_fingerprint : Fingerprint
+run_test(ctx, calc, result)*
}
class SingleValueBoundTest
class ByLabelBoundTest
class CountBoundTest
class MeanStdBoundTest
class DataframeBoundTest
BoundTest <|-- SingleValueBoundTest
BoundTest <|-- ByLabelBoundTest
BoundTest <|-- CountBoundTest
BoundTest <|-- MeanStdBoundTest
BoundTest <|-- DataframeBoundTest
MetricTest ..> BoundTest : bind_*()
The user writes eq(0.0) or gt(0.95) against a metric; under the hood Metric.get_bound_tests(context) produces BoundTest objects pinned to the metric’s fingerprint. After the metric calculates, BoundTest.run_test(...) extracts the relevant value (a scalar, a label-keyed dict entry, a dataframe cell) and runs the comparison.
The user-friendly builders live in src/evidently/tests/numerical_tests.py and categorical_tests.py (e.g. eq, gt, lt, between, gte, lte).
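The bind-then-run flow can be sketched in a few lines. Everything below is a simplified illustration: `MetricTestSketch`, `BoundTestSketch`, and the local `gt` / `eq` builders are hypothetical, fingerprints are plain strings, and results are a dict rather than a `Context`.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class MetricTestSketch:
    """Unbound condition, e.g. 'greater than 0.95'."""

    name: str
    check: Callable[[float], bool]

    def bind_single(self, fingerprint: str) -> "BoundTestSketch":
        return BoundTestSketch(test=self, metric_fingerprint=fingerprint)


@dataclass(frozen=True)
class BoundTestSketch:
    """Condition pinned to one metric, evaluated after results exist."""

    test: MetricTestSketch
    metric_fingerprint: str

    def run_test(self, results: Dict[str, float]) -> bool:
        return self.test.check(results[self.metric_fingerprint])


def gt(threshold: float) -> MetricTestSketch:
    return MetricTestSketch(f"gt({threshold})", lambda value: value > threshold)


def eq(expected: float) -> MetricTestSketch:
    return MetricTestSketch(f"eq({expected})", lambda value: value == expected)


# Tests are declared up front, bound to a fingerprint, and run only
# once the metric values have been computed and stored.
bound = [gt(0.95).bind_single("accuracy-fp"), eq(0.0).bind_single("missing-fp")]
results = {"accuracy-fp": 0.97, "missing-fp": 0.0}
outcomes = [b.run_test(results) for b in bound]
```

Keeping the condition (`MetricTestSketch`) separate from its target (`BoundTestSketch`) is what lets the same `gt(0.95)` be reused against a scalar, one label of a by-label result, or one dataframe cell.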
4.4 Containers and Presets (lazy metric generation)
classDiagram
class MetricContainer {
<<abstract, container.py:25>>
+include_tests : bool
+generate_metrics(ctx)*
+metrics(ctx)
+list_metrics(ctx)
}
class ColumnMetricContainer {
<<abstract, container.py:142>>
+column : str
}
MetricContainer <|-- ColumnMetricContainer
class Preset {
<<preset_types.py>>
}
Preset --|> MetricContainer
A MetricContainer returns a sequence of metrics, evaluated against a Context so it can adapt to the actual columns/labels in the dataset. Presets (DataDriftPreset, TextEvals, ClassificationPreset, …) are containers: “generate the right metrics for this dataset”. Containers nest — list_metrics() walks the tree and yields leaf Metrics, which keeps user-facing API tiny while supporting arbitrary expansion.
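A rough sketch of the nesting and flattening behaviour, assuming a much simpler context (just a list of column names). `ContainerSketch`, `ColumnStatsSketch`, and `DatasetPresetSketch` are hypothetical names, not Evidently classes.

```python
from typing import Iterator, List, Union


class MetricSketch:
    def __init__(self, name: str):
        self.name = name


class ContainerSketch:
    """Generates metrics lazily from a context (here: a list of column names)."""

    def generate_metrics(self, columns: List[str]) -> List[Union[MetricSketch, "ContainerSketch"]]:
        raise NotImplementedError

    def list_metrics(self, columns: List[str]) -> Iterator[MetricSketch]:
        # Depth-first walk: nested containers are expanded, leaves are yielded.
        for item in self.generate_metrics(columns):
            if isinstance(item, ContainerSketch):
                yield from item.list_metrics(columns)
            else:
                yield item


class ColumnStatsSketch(ContainerSketch):
    def __init__(self, column: str):
        self.column = column

    def generate_metrics(self, columns):
        return [MetricSketch(f"min({self.column})"), MetricSketch(f"max({self.column})")]


class DatasetPresetSketch(ContainerSketch):
    def generate_metrics(self, columns):
        # Adapts to whatever columns the dataset actually has.
        return [MetricSketch("row_count")] + [ColumnStatsSketch(c) for c in columns]


names = [m.name for m in DatasetPresetSketch().list_metrics(["age", "income"])]
```

The caller only ever names `DatasetPresetSketch()`; the set of leaf metrics is decided when the columns are known.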
4.5 Execution: Report.run → Snapshot (alias Run)
sequenceDiagram
actor User
participant R as Report
participant Ctx as Context
participant MC as MetricContainer
participant Calc as MetricCalculation
participant BT as BoundTest
User->>R: run(eval, reference)
R->>Ctx: Context(eval, reference)
R->>MC: list_metrics(ctx)
loop each metric
R->>Calc: metric.to_calculation()
Calc->>Ctx: calculate(ctx, current, reference)
Calc-->>Ctx: (cur_result, ref_result) cached by fingerprint
end
R->>BT: get_bound_tests(ctx) for each metric
BT->>Ctx: run_test(ctx, calc, result)
R-->>User: Snapshot (TypeAlias Run)
Notes:
- `Context` lives at `core/report.py:123` and owns `_metrics`, `_reference_metrics`, `_input_data`, plus per-column caches.
- `Snapshot` is at `core/report.py:487`; `Run` is a type alias for it (`core/report.py:818`: `Run: TypeAlias = Snapshot`). `from evidently import Run` therefore returns the `Snapshot` class.
- The `Report` class is at `core/report.py:821`.
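The "cached by fingerprint" step in the sequence above can be illustrated as follows. This is a sketch under assumptions: `ContextSketch`, `get_metric_result`, and `run_sketch` are hypothetical, fingerprints are plain strings, and each metric is reduced to a function over a dict of columns.

```python
from typing import Callable, Dict, List, Tuple

Data = Dict[str, List[float]]


class ContextSketch:
    """Caches each metric's result under its fingerprint."""

    def __init__(self, data: Data):
        self.data = data
        self._metrics: Dict[str, float] = {}
        self.calls = 0  # counts real computations, for illustration only

    def get_metric_result(self, fingerprint: str, compute: Callable[[Data], float]) -> float:
        if fingerprint not in self._metrics:
            self.calls += 1
            self._metrics[fingerprint] = compute(self.data)
        return self._metrics[fingerprint]


def run_sketch(metrics: List[Tuple[str, Callable[[Data], float]]], data: Data) -> ContextSketch:
    ctx = ContextSketch(data)
    for fingerprint, compute in metrics:
        ctx.get_metric_result(fingerprint, compute)
    return ctx


min_age = ("min:age", lambda d: min(d["age"]))
max_age = ("max:age", lambda d: max(d["age"]))

# The same metric requested twice (e.g. by two presets) is computed once.
ctx = run_sketch([min_age, max_age, min_age], {"age": [31.0, 18.0, 47.0]})
```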
5. v1 (Legacy) Hierarchy
The legacy API is still load-bearing: most stat tests, drift methods, and renderers live here.
classDiagram
direction TB
class Metric_v1 {
<<legacy/base_metric.py:241>>
+calculate(data : InputData)* TResult
}
class MetricResult_v1 {
<<legacy/base_metric.py:52>>
}
class ColumnMetric_v1 {
<<legacy/base_metric.py:353>>
+column_name : ColumnName
}
Metric_v1 <|-- ColumnMetric_v1
Metric_v1 ..> MetricResult_v1 : produces
class Test_v1 {
<<legacy/tests/base_test.py>>
}
class TestResult_v1
Test_v1 ..> TestResult_v1
class MetricRenderer {
<<legacy/renderers/base_renderer.py:44>>
+render_html()*
+render_json()*
+render_pandas()*
}
class TestRenderer {
<<legacy/renderers/base_renderer.py:90>>
}
Metric_v1 ..> MetricRenderer : @default_renderer
Test_v1 ..> TestRenderer : @default_renderer
class Engine {
<<legacy/calculation_engine/engine.py:48>>
+execute_metrics()
}
class PythonEngine {
<<python_engine.py:27>>
}
Engine <|-- PythonEngine
class GeneratedFeatures {
<<features/generated_features.py:30>>
}
class GeneratedFeature {
<<features/generated_features.py:120>>
+generate(data) -> Series
}
class ApplyColumnGeneratedFeature {
<<features/generated_features.py:156>>
}
GeneratedFeatures <|-- GeneratedFeature
GeneratedFeature <|-- ApplyColumnGeneratedFeature
class MetricPreset {
<<legacy/metric_preset/metric_preset.py:16>>
}
5.1 Three legacy patterns worth naming
- Renderer pattern (Strategy on output format). `MetricRenderer[TMetric]` and `TestRenderer[TTest]` separate computation from presentation. Concrete metrics register a renderer via the `@default_renderer` decorator. The same metric renders to HTML widgets, JSON, or pandas depending on which renderer method the caller invokes.
- Engine pattern (Strategy on compute backend). `Engine[TMetricImplementation, TInputData, TEngineDataType]` (`engine.py:48`) is generic over three type parameters: the engine-specific metric implementation, the input data type, and the in-engine data type. Only `PythonEngine` (pandas) ships in OSS, but the seam is built so a Spark engine can plug in alongside; see also `tests/spark/`. This is exactly where you would extend Evidently to support a different compute substrate.
- Feature ≠ Metric. `GeneratedFeature` (`features/generated_features.py:120`) and its row-wise variant `ApplyColumnGeneratedFeature` (line 156) produce new columns, not metrics. Sentiment, OOV-words, text length, semantic similarity are all features: they consume a column, emit a column. Metrics then operate on those columns. This separation lets you cache, rerun, and compose features independently of metrics.
5.2 Options pattern
Options (legacy/options/base.py) is a composable, hierarchical config object: color, render, plus a typed custom: Dict[Type[Option], Option] bag, with .override(other) returning a merged copy. Options cascade from global → display → metric-level without subclasses needing to know what they are.
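A minimal sketch of the cascade, assuming that "override" means values set on the more specific object win. `OptionsSketch` is hypothetical, and for brevity its `custom` bag is keyed by strings rather than by `Option` types.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class OptionsSketch:
    color: Optional[str] = None
    render: Optional[str] = None
    custom: Dict[str, object] = field(default_factory=dict)

    def override(self, other: "OptionsSketch") -> "OptionsSketch":
        """Return a merged copy; values set on `other` take precedence."""
        return OptionsSketch(
            color=other.color if other.color is not None else self.color,
            render=other.render if other.render is not None else self.render,
            custom={**self.custom, **other.custom},
        )


global_opts = OptionsSketch(color="blue", render="html", custom={"bins": 10})
metric_opts = OptionsSketch(color="red", custom={"alpha": 0.05})

# Global settings are the base; metric-level settings refine them.
effective = global_opts.override(metric_opts)
```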
6. Algorithm Implementation Strategy — From-Scratch vs. Library
This is the question that most informs how to trust and contribute to Evidently.
Bottom line: Evidently is overwhelmingly an orchestration layer over scipy / scikit-learn / nltk / sentence-transformers / litellm, with hand-rolled implementations only where (a) the test isn’t in scipy, (b) a custom permutation/bootstrap is needed, or (c) the maths is easier than depending on another package.
6.1 Statistical drift tests (legacy/calculations/stattests/)
Library-backed (thin scipy wrappers):
| Test | File | Backend |
|---|---|---|
| Kolmogorov–Smirnov | ks_stattest.py:29 | scipy.stats.ks_2samp |
| Chi-square | chisquare_stattest.py:29 | scipy.stats.chisquare |
| Mann–Whitney U | mann_whitney_urank_stattest.py | scipy.stats.mannwhitneyu |
| Fisher’s exact | fisher_exact_stattest.py | scipy.stats.fisher_exact |
| Anderson–Darling k-sample | anderson_darling_stattest.py | scipy.stats.anderson_ksamp |
| Energy distance | energy_distance.py | scipy.stats.energy_distance |
| Welch’s t | t_test.py | scipy.stats.ttest_ind |
| Jensen–Shannon | jensenshannon.py:30 | scipy.spatial.distance.jensenshannon |
| Wasserstein | wasserstein_distance_norm.py | scipy.stats.wasserstein_distance |
| Epps–Singleton | epps_singleton_stattest.py | scipy.stats.epps_singleton_2samp |
| Cramér–von Mises (asymptotic) | cramer_von_mises_stattest.py | uses scipy.special.kv |
Hand-rolled (the statistic itself is computed in numpy/pandas; where scipy appears, it supplies only a helper such as a CDF):
| Test | File | Why hand-rolled |
|---|---|---|
| Population Stability Index (PSI) | psi.py | Standard formula, no scipy equivalent |
| Hellinger distance | hellinger_distance.py:39 | Simple binned-histogram form |
| Maximum Mean Discrepancy (MMD) | mmd_stattest.py:12 | Custom RBF kernel + permutation p-value (100 shuffles) |
| Cramér–von Mises (exact) | cramer_von_mises_stattest.py | Recurrence-relation algorithm from the paper |
| Z-test on proportions | z_stattest.py | Wraps scipy.stats.norm.cdf but computes the z statistic by hand |
| Total Variation Distance | tvd_stattest.py | One-line numpy formula |
| KL divergence | kl_div.py | One-line numpy formula |
| G-test | g_stattest.py | Custom log-likelihood ratio |
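You can verify the divide quickly: most hand-rolled tests import only numpy, pandas, and the stat-test registry (e.g. psi.py:26-33, mmd_stattest.py:1-9, hellinger_distance.py:26-36). The exceptions are the ones already noted in the tables: z_stattest.py calls scipy.stats.norm.cdf, and cramer_von_mises_stattest.py uses scipy.special.kv for its asymptotic branch.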
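To show how small such a hand-rolled test can be, here is a standard-library sketch of the textbook PSI formula, PSI = Σ (cur_i − ref_i) · ln(cur_i / ref_i) over bins. It is not the code in psi.py: the equal-width binning, the default of 10 bins, and the `eps` floor for empty bins are assumptions made for this example.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-4) -> float:
    """Population Stability Index over shared equal-width bins."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    # Floor empty bins at eps so the logarithm stays finite.
    ref = [max(p, eps) for p in bin_fractions(reference, edges)]
    cur = [max(p, eps) for p in bin_fractions(current, edges)]
    # Each term is non-negative; the sum is zero when the histograms match.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


same = psi([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
shifted = psi([float(x) for x in range(10)], [float(x) for x in range(5, 15)])
```

Identical samples give a PSI of exactly zero, and any difference between the two histograms makes it positive, which is why a single threshold on the score is enough to flag drift.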
6.2 Drift orchestration
legacy/calculations/data_drift.py is pure routing: it picks a stat test by column type and signature, calls it, and packages the result. No maths of its own. Selecting which test runs by default (chi-square for categorical, K-S for numerical, …) lives here, not in the tests.
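The routing idea can be sketched as a lookup table plus a thin dispatcher. All names here (`DEFAULTS`, `column_drift`, the two toy scoring functions, and the thresholds) are hypothetical; the toy scores stand in for the real chi-square and K-S tests so that the example needs no third-party packages.

```python
from typing import Callable, Dict, Sequence, Tuple

StatTest = Callable[[Sequence, Sequence], float]


def _mean_shift(ref: Sequence[float], cur: Sequence[float]) -> float:
    """Toy numerical score: absolute difference of means."""
    return abs(sum(cur) / len(cur) - sum(ref) / len(ref))


def _tvd(ref: Sequence[str], cur: Sequence[str]) -> float:
    """Toy categorical score: total variation distance between frequencies."""
    labels = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.count(l) / len(ref) - cur.count(l) / len(cur)) for l in labels)


# Hypothetical default table: column type -> (test name, function, threshold).
DEFAULTS: Dict[str, Tuple[str, StatTest, float]] = {
    "num": ("mean_shift", _mean_shift, 1.0),
    "cat": ("tvd", _tvd, 0.1),
}


def column_drift(column_type: str, ref: Sequence, cur: Sequence) -> dict:
    """Pick a test by column type, run it, and package the result."""
    name, test, threshold = DEFAULTS[column_type]
    score = test(ref, cur)
    return {"stattest": name, "score": score, "drift_detected": score > threshold}


report = column_drift("cat", ["a", "a", "b", "b"], ["a", "a", "a", "a"])
```

The dispatcher contains no statistics of its own; swapping the default for a column type is a one-line change to the table. For p-value tests the comparison would flip (drift when the p-value falls below the threshold), which this distance-based sketch does not model.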
6.3 Data-quality calculations
legacy/calculations/data_quality.py is mostly pandas: value_counts, nunique, quantile, plus a single scipy.stats.chi2_contingency call for cross-feature association.
6.4 ML metrics
legacy/metrics/classification_performance/* and regression_performance/* are sklearn wrappers. Examples:
- `roc_auc_score`, `log_loss`, `precision_recall_curve`, `confusion_matrix`: all `from sklearn import metrics`.
- `legacy/metrics/regression_performance/*` use `sklearn.metrics.mean_squared_error`, `mean_absolute_percentage_error`, etc.
6.5 Embedding-drift methods
legacy/metrics/data_drift/embedding_drift_methods/ (called from EmbeddingsDriftMetric) is a hybrid:
- Library: `sklearn.decomposition.PCA`, `sklearn.linear_model.SGDClassifier`, `sklearn.metrics.roc_auc_score`, `sklearn.metrics.pairwise_distances`, `pairwise_kernels`.
- Custom: the bootstrap loops and the “ratio” / “model” detection pipelines that combine these primitives.
6.6 Text descriptors
| Descriptor | Backend |
|---|---|
| Sentiment | nltk.sentiment.vader.SentimentIntensityAnalyzer |
| OOV words, words match | nltk.corpus.words, WordNetLemmatizer |
| Text length, word count, char count | hand-rolled (.apply(len), regex) |
| Semantic similarity | sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2") + custom normalised cosine |
| Regex / contains / starts-with | hand-rolled |
| LLM judges | evidently.llm.utils.wrapper.LLMWrapper — see below |
6.7 LLM integration (llm/utils/wrapper.py)
The LLM layer is the most architecturally interesting piece outside core/:

    # llm/utils/wrapper.py:438
    if find_spec("litellm") is not None:
        litellm_wrapper = get_litellm_wrapper(provider, model, options)
        if litellm_wrapper is not None:
            return litellm_wrapper
    raise ValueError(f"... Try installing litellm")

- litellm-first. Any provider supported by litellm is routed through `litellm.acompletion` (`wrapper.py:595`).
- OpenAI direct fallback. A non-litellm path (`wrapper.py:518–541`) imports the `openai` SDK directly for environments that don’t want litellm.
- Provider registry. `litellm_providers` (line 712) and `@llm_provider("litellm", None)` (line 574) expand into per-provider wrappers automatically.
- No direct vendor SDK lock-in elsewhere. Descriptors-as-LLM-judges (`descriptors/llm_judges.py`) and prompt optimisation (`llm/optimization/`) all go through `LLMWrapper.run_batch_sync`, never a vendor SDK directly.
So Evidently’s LLM strategy is: use litellm as a unified abstraction; treat any vendor SDK as a private detail of a wrapper.
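The soft-dependency technique itself relies only on `importlib.util.find_spec`, which reports whether a module is installed without importing it. The sketch below is hypothetical: `get_wrapper`, `_echo_wrapper`, and the `direct` registry are invented for illustration, and the lookup order (registered wrappers first, then the optional module) is an assumption rather than a description of wrapper.py.

```python
from importlib.util import find_spec
from typing import Callable, Dict, Optional

Wrapper = Callable[[str], str]


def _echo_wrapper(prompt: str) -> str:
    """Placeholder for a call routed through the optional unified library."""
    return f"echo: {prompt}"


def get_wrapper(provider: str, optional_module: str,
                direct: Optional[Dict[str, Wrapper]] = None) -> Wrapper:
    """Prefer an explicitly registered wrapper, then the optional module."""
    direct = direct or {}
    if provider in direct:
        return direct[provider]
    # find_spec returns None when the module is absent; nothing is imported yet.
    if find_spec(optional_module) is not None:
        return _echo_wrapper
    raise ValueError(f"No wrapper for {provider!r}. Try installing {optional_module}")


# "json" is always available, so it stands in for an installed optional package.
via_optional = get_wrapper("any-provider", "json")("hi")
via_direct = get_wrapper("openai", "no_such_module_xyz", {"openai": lambda p: "direct"})("hi")
```

Because the check happens at call time, the package can be imported and used for non-LLM work on machines where the optional dependency was never installed.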
7. Surrounding Layers
7.1 Layered service view

    ┌──────────────────────────────────────────────────────────────┐
    │ CLI (typer)                                                  │
    │ evidently ui | report | demo_project | migrate | legacy_ui   │
    └──────────────────────┬───────────────────────────────────────┘
                           │
    ┌──────────────────────▼───────────────────────────────────────┐
    │ Workspace (ui/workspace.py)                                  │
    │ exposes: artifacts / prompts / configs / datasets            │
    └──────────────────────┬───────────────────────────────────────┘
                           │
    ┌──────────────────────▼───────────────────────────────────────┐
    │ SDK managers (sdk/local.py, sdk/adapters.py)                 │
    │ LocalArtifactAPI / LocalPromptAPI / LocalConfigAPI           │
    │ + adapters that bridge interfaces                            │
    └──────────────────────┬───────────────────────────────────────┘
                           │
    ┌──────────────────────▼───────────────────────────────────────┐
    │ UI service (litestar app, ui/service/)                       │
    │ ComponentContext + AppBuilder DI                             │
    │ Storage abstractions: DataStorage / BlobStorage /            │
    │                       ProjectMetadataStorage                 │
    │ LocalStorage backed by JSON files via FSLocation             │
    │ Static React assets served from ui/service/assets/           │
    └──────────────────────┬───────────────────────────────────────┘
                           │
    ┌──────────────────────▼───────────────────────────────────────┐
    │ Compute layer                                                │
    │ evidently.core.*        (v2 metrics/tests/presets)           │
    │ evidently.legacy.*      (v1 metrics/tests, stat tests,       │
    │                          features, renderers, engines)       │
    │ evidently.llm.*         (litellm wrapper, prompts, RAG)      │
    │ evidently.guardrails.*  (decorator-based validators)         │
    │ ui/backport.py          (v1 → v2 result adapters)            │
    └──────────────────────────────────────────────────────────────┘

7.2 UI service (ui/service/)
- Built on Litestar (`ui/service/app.py:13–23`), API mounted under `/api`.
- Component-based DI: `AppBuilder` + `ComponentContext` (`ui/service/components/base.py`). Components declare `get_dependencies()` and `get_route_handlers()`; the AppBuilder composes them.
- Security: `TokenSecurityComponent` (bearer token from `EVIDENTLY_SECRET_ENV`).
- Storage abstractions: `DataStorage`, `BlobStorage`, `ProjectMetadataStorage` (in `ui/service/base.py`). Only the local filesystem implementation ships in OSS (`ui/storage/local/`); cloud backends live in the closed Evidently Cloud product.

The static React frontend is built with pnpm under the top-level `/ui/` and served from `ui/service/assets/`.
7.3 SDK (sdk/)
A small, programmatic façade for managing artifacts (saved reports), prompts, configs, datasets, and panels, independent of whether storage is local or remote. sdk/adapters.py uses bridge-style adapters so a Workspace user always sees the same API regardless of backend.
7.4 LLM (llm/)
| Submodule | Purpose |
|---|---|
| llm/utils/wrapper.py | Provider abstraction (litellm, OpenAI direct) |
| llm/prompts/ | Prompt templates, few-shot examples, RAG hooks |
| llm/templates.py | Jinja-style variable substitution |
| llm/datagen/ | Synthetic dataset generation via LLM |
| llm/optimization/ | Prompt optimisation loops |
| llm/rag/ | Document splitting, embeddings, vector index helpers |
| llm/models.py | Vendor-neutral LLMMessage / LLMResponse types |
| llm/options.py | Temperature, max-tokens, retry config |
7.5 Guardrails (guardrails/)
A small decorator framework: @guard(...) (guardrails/decorators.py) wraps a function, validates arguments through one or more GuardrailBase subclasses, and raises GuardException on failure. Concrete guards in guardrails/guards/ include Negativity, Toxicity, PII, WordPresence, PythonFunction. Optional integration with tracely for tracing. Operates on string inputs/outputs; orthogonal to metrics and storage.
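The decorator mechanics can be sketched as below. This is not the API in guardrails/decorators.py: `guard_sketch`, `BaseGuard`, `GuardError`, `BannedWords`, and the `input_arg` parameter are hypothetical names chosen to avoid implying real signatures.

```python
import functools
from typing import Callable, Sequence


class GuardError(Exception):
    """Raised when a guard rejects the input."""


class BaseGuard:
    def validate(self, text: str) -> None:
        raise NotImplementedError


class BannedWords(BaseGuard):
    """Hypothetical guard: rejects text containing any banned word."""

    def __init__(self, banned: Sequence[str]):
        self.banned = [w.lower() for w in banned]

    def validate(self, text: str) -> None:
        words = text.lower().split()
        hits = [w for w in self.banned if w in words]
        if hits:
            raise GuardError(f"banned words present: {hits}")


def guard_sketch(*guards: BaseGuard, input_arg: str = "text") -> Callable:
    """Validate the named keyword argument before calling the function."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            value = kwargs[input_arg]
            for g in guards:
                g.validate(value)  # raises GuardError on failure
            return func(*args, **kwargs)

        return wrapper

    return decorator


@guard_sketch(BannedWords(["password"]), input_arg="text")
def summarize(*, text: str) -> str:
    return text[:20]


ok = summarize(text="hello world")
```

The wrapped function stays unaware of validation; guards can be added or removed at the decoration site without touching its body.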
7.6 future/ — alias, not experiment
Despite the name, future/ is a stable re-export of core/, e.g. future/report.py:1 is literally:

    from evidently.core.report import *  # noqa: F403

Subdirectories future/metrics/, future/presets/, etc. are empty `__init__.py` stubs. The clear intent is to give users an import path that won’t break if core/ is internally reorganised.
7.7 CLI (cli/, typer)
| Command | Entry | Purpose |
|---|---|---|
| evidently ui | cli/ui.py | Start Litestar UI service (optionally seed demo data) |
| evidently report | cli/report.py | Render a report from YAML config |
| evidently demo_project | cli/demo_project.py | Generate demo workspace |
| evidently migrate | cli/migrate.py | Run alembic migrations (SQL backend) |
| evidently legacy_ui | cli/legacy_ui.py | Start the v1 monitoring UI |
8. Summary of OOP Patterns Used
| Pattern | Where it shows up | Notes |
|---|---|---|
| Polymorphic discriminator (registry-backed) | pydantic_utils.PolymorphicModel, core/registries/, _registry.py | Round-tripping Pydantic unions through JSON/YAML |
| Frozen value object + fingerprint | FrozenBaseModel, EvidentlyBaseModel.fingerprint() | MD5(classpath + fields) is the metric identity |
| Config / Calculation split | v2 Metric ↔ MetricCalculation | Serialisable description vs. executable behaviour |
| Generic base + `__init_subclass__` wiring | MetricCalculation[TResult, TMetric] | Auto-binds calculation class to its metric class |
| Strategy (output format) | v1 MetricRenderer / TestRenderer + @default_renderer | HTML / JSON / pandas |
| Strategy (compute backend) | v1 Engine / PythonEngine (Spark seam) | Engine generic over (impl, input, engine_data) |
| Composite / Lazy generators | MetricContainer, Preset | list_metrics() flattens nested containers |
| Bind / late binding | MetricTest.bind_*() → BoundTest.run_test() | Tests pinned to fingerprints, evaluated post-result |
| Bridge / Adapter | sdk/adapters.py, ui/backport.py | Same SDK API across backends; v1 → v2 result conversion |
| Decorator | @guard(...), @default_renderer, @llm_provider | Zero-boilerplate registration / interception |
| Pluggable provider via feature-flag import | llm/utils/wrapper.py:438 (find_spec("litellm")) | Soft dependency on litellm with an OpenAI-direct fallback |
| Strangler fig (legacy/ vs core/) | top-level package layout | New API written alongside the old, not on top of it |
9. Quick Map — “If I want to add X, where does it go?”
| Goal | Files to touch |
|---|---|
| New column statistic (single number) | metrics/column_statistics.py — subclass StatisticsMetric + StatisticsCalculation |
| New classification metric | metrics/classification.py |
| New stat test | legacy/calculations/stattests/<name>.py + register in registry.py |
| New text descriptor | descriptors/<name>.py (or legacy/features/<name>_feature.py for column-generating ones) |
| New LLM judge | descriptors/llm_judges.py + a prompt in llm/prompts/ |
| New preset | presets/<area>.py — subclass Preset (a MetricContainer) |
| New guard | guardrails/guards/<name>.py — subclass GuardrailBase |
| New compute backend | legacy/calculation_engine/<engine>_engine.py — subclass Engine |
| New storage backend | ui/service/components/storage.py + ui/storage/<backend>/ |
10. Key Citations Index
| Concept | File:line |
|---|---|
| FrozenBaseModel | src/evidently/pydantic_utils.py:81 |
| PolymorphicModel | src/evidently/pydantic_utils.py:197 |
| EvidentlyBaseModel | src/evidently/pydantic_utils.py:331 |
| AutoAliasMixin | src/evidently/pydantic_utils.py:623 |
| TYPE_ALIASES registry | src/evidently/pydantic_utils.py:132 |
| v2 Metric | src/evidently/core/metric_types.py:1114 |
| v2 MetricCalculationBase | src/evidently/core/metric_types.py:924 |
| v2 MetricCalculation | src/evidently/core/metric_types.py:1229 |
| v2 MetricResult | src/evidently/core/metric_types.py:165 |
| v2 ColumnMetric | src/evidently/core/metric_types.py:1891 |
| v2 MetricTest | src/evidently/core/metric_types.py:1008 |
| v2 BoundTest | src/evidently/core/metric_types.py:1080 |
| v2 MetricContainer | src/evidently/core/container.py:25 |
| v2 Context | src/evidently/core/report.py:123 |
| v2 Snapshot | src/evidently/core/report.py:487 |
| Run = Snapshot (TypeAlias) | src/evidently/core/report.py:818 |
| v2 Report | src/evidently/core/report.py:821 |
| v1 Metric | src/evidently/legacy/base_metric.py:241 |
| v1 MetricResult | src/evidently/legacy/base_metric.py:52 |
| v1 ColumnMetric | src/evidently/legacy/base_metric.py:353 |
| v1 MetricRenderer | src/evidently/legacy/renderers/base_renderer.py:44 |
| v1 TestRenderer | src/evidently/legacy/renderers/base_renderer.py:90 |
| v1 Engine | src/evidently/legacy/calculation_engine/engine.py:48 |
| v1 PythonEngine | src/evidently/legacy/calculation_engine/python_engine.py:27 |
| v1 GeneratedFeature | src/evidently/legacy/features/generated_features.py:120 |
| v1 MetricPreset | src/evidently/legacy/metric_preset/metric_preset.py:16 |
| KS test (scipy) | src/evidently/legacy/calculations/stattests/ks_stattest.py:29 |
| Chi-square test (scipy) | src/evidently/legacy/calculations/stattests/chisquare_stattest.py:29 |
| Jensen–Shannon (scipy) | src/evidently/legacy/calculations/stattests/jensenshannon.py:30 |
| PSI (from-scratch) | src/evidently/legacy/calculations/stattests/psi.py |
| MMD (from-scratch) | src/evidently/legacy/calculations/stattests/mmd_stattest.py:12 |
| Hellinger (from-scratch) | src/evidently/legacy/calculations/stattests/hellinger_distance.py:39 |
| litellm provider switch | src/evidently/llm/utils/wrapper.py:438 |