make diff of time series to compare test productions+AliasDataFrame by miranov25 · Pull Request #2014 · AliceO2Group/O2DPG

miranov25 · 2025-05-29T17:43:22Z

Adding AliasDataFrame, a small utility that extends pandas.DataFrame functionality by enabling:

Lazy evaluation of derived columns via named aliases
Automatic dependency resolution across aliases
Persistence via Parquet + JSON or ROOT TTree (via uproot + PyROOT)
ROOT-compatible TTree export/import including alias metadata
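
As a rough illustration of lazy alias evaluation, the sketch below stores an expression string and evaluates it only on request. `TinyAliasFrame` and its `materialize` method are hypothetical stand-ins written for this example; they are not the PR's `AliasDataFrame` implementation, which may differ.

```python
import numpy as np
import pandas as pd


class TinyAliasFrame:
    """Hypothetical stand-in for AliasDataFrame; not the PR's implementation."""

    def __init__(self, df):
        self.df = df
        self.aliases = {}  # alias name -> expression string

    def add_alias(self, name, expr):
        # Only the expression is stored; nothing is computed yet.
        self.aliases[name] = expr

    def materialize(self, name):
        # Evaluate the expression against the existing columns on request.
        namespace = {col: self.df[col] for col in self.df.columns}
        namespace["sqrt"] = np.sqrt
        self.df[name] = eval(self.aliases[name], {"__builtins__": {}}, namespace)
        return self.df[name]


adf = TinyAliasFrame(pd.DataFrame({"x": [3.0, 6.0], "y": [4.0, 8.0]}))
adf.add_alias("r", "sqrt(x**2 + y**2)")
lazy_before = "r" not in adf.df.columns  # True: no column exists yet
r = adf.materialize("r")                 # computed on demand
```

Keeping the expression as a string is what allows it to be stored alongside the data and re-evaluated later.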

github-actions · 2025-05-29T17:43:31Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5


miranov25 · 2025-06-01T07:00:35Z

✨ Add `AliasDataFrame` Utilities for On-Demand Evaluation

This PR adds support for alias-based derived column computation, as used for example in TPC distortion error parameterization. It includes:

✅ Key Features

Function Validation: Supports expressions using standard math, numpy, and previously defined aliases. A warning is issued at definition time when an alias is invalid.
Alias Dependency Resolution: Automatic topological sort of aliases with dependency tracking and detection of circular references.
Output Type Specification: Each alias can specify its desired output dtype (e.g. np.float16, np.uint8). This can also be overridden during materialization.
- Dtypes are preserved in .parquet exports.
- TTree support can be extended to encode dtype metadata in a structured way.
Alias Dependency Graph: Visualization of alias relationships using networkx and matplotlib.
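
The dependency resolution described above can be sketched with the standard library alone. This is a minimal sketch, assuming aliases are plain Python expression strings; `alias_order` is a hypothetical helper, and the PR itself mentions networkx rather than `graphlib`.

```python
import ast
from graphlib import CycleError, TopologicalSorter


def alias_order(aliases):
    """Return alias names in an order where dependencies come first."""
    graph = {}
    for name, expr in aliases.items():
        # Collect every identifier in the expression that is itself an alias.
        names = {n.id for n in ast.walk(ast.parse(expr, mode="eval"))
                 if isinstance(n, ast.Name)}
        graph[name] = names & aliases.keys()
    return list(TopologicalSorter(graph).static_order())


aliases = {
    "cutT": "(cutB6 + cutA6) > 0",
    "cutB6": "abs(dz0_b0 / errz0b0) > 6",
    "cutA6": "abs(dy0_a0 / errz0a0) > 6",
    "errz0b0": "0.006 * sqrt(1 + mP4**2 / Beta2)",
    "errz0a0": "0.35 * sqrt(1 + mP4**2 / Beta2)",
}
order = alias_order(aliases)

# A circular definition is reported instead of looping forever.
try:
    alias_order({"a": "b + 1", "b": "a * 2"})
    cycle_found = False
except CycleError:
    cycle_found = True
```

Names that are not aliases (plain columns, `sqrt`, `abs`) are ignored when building the graph, so only alias-to-alias edges affect the ordering.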

🧪 Example Usage

The function below demonstrates how derived error estimates and quality flags can be defined in terms of other DataFrame columns and aliases:

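import numpy as np

def makeErrParamAlias(adf):
    # Beta2 is written directly as a DataFrame column; the quantities below are registered as aliases.
    adf.df["Beta2"] = np.minimum(50 / adf.df["dEdxTPC"], 1.0).astype(np.float16)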
    adf.add_alias("errz0a0", "0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b0", "0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b1", "0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("erry0c1", "0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5/150**2", dtype=np.float16)
    adf.add_alias("cutB6", "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8)", dtype=np.uint8)
    adf.add_alias("cutC6", "((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2)", dtype=np.uint8)
    adf.add_alias("cutA6", "((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2)", dtype=np.uint8)
    adf.add_alias("cutT", "((cutB6 + cutC6 + cutA6) > 0)", dtype=np.uint8)
    return adf
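
The cut aliases above pack several boolean tests into one small integer. The following sketch shows the idea in plain NumPy; the input arrays are made-up values for illustration, not data from the PR.

```python
import numpy as np

pull_b0 = np.array([1.0, 7.0, 2.0, 9.0])   # hypothetical |dz0_b0/errz0b0| values
pull_b1 = np.array([0.5, 3.0, 8.0, 6.5])   # hypothetical |dz0_b1/errz0b1| values

# Each comparison yields a boolean array; multiplying by 1, 2, 4, ... gives
# each test its own bit, so the sum records which cuts fired.
mask = ((pull_b0 > 6) * 1 + (pull_b1 > 6) * 2).astype(np.uint8)

# Analogous to cutT: flag rows where any bit is set.
failed_any = (mask > 0).astype(np.uint8)
```

Because at most eight bits are used, `np.uint8` is sufficient for the result, which matches the dtype requested in the example.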

📊 Alias Dependency Graph

Visual representation of dependencies:

- Allow optional dtype per alias via `add_alias(..., dtype=...)`
- Enable global override dtype in `materialize_alias` and `materialize_all`
- Add `plot_alias_dependencies()` for visualizing alias dependencies
- Improve alias validation with support for numpy/math functions


miranov25 · 2025-06-01T08:14:20Z

🧾 Output Storage for `AliasDataFrame`: TTree or Parquet + JSON/Metadata

This update improves how aliases and derived columns are saved and loaded across different formats.

✅ Key Features

Selective Column Export:
- save(..., dropAliasColumns=True) stores only non-alias columns in .parquet (default), matching export_tree() behavior.
Alias Dtype Persistence:
- Output dtypes (e.g. np.float16, np.uint8) are now stored as type names (e.g. "float16").
- Correctly reloaded using getattr(np, ...) to ensure .astype(...) works.
Dual Metadata Storage:
- Aliases and dtypes are stored both:
  - in .parquet file metadata (pyarrow)
  - and in a .aliases.json file for inspection and fallback
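
A minimal sketch of the dtype round trip described above, assuming the dtypes are stored as a name-to-string mapping in JSON. The dictionary layout is an assumption for illustration.

```python
import json

import numpy as np

dtypes = {"errz0a0": np.float16, "cutT": np.uint8}

# Store only the type name, which is JSON-serializable.
payload = json.dumps({"dtypes": {k: np.dtype(v).name for k, v in dtypes.items()}})

# On load, resolve the name back to a NumPy type so that .astype(...) accepts it.
restored = {k: getattr(np, v) for k, v in json.loads(payload)["dtypes"].items()}
values = np.array([0.0, 1.0, 1.0]).astype(restored["cutT"])
```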

🔍 Example Outputs (for the example above, #2014 (comment))

ROOT TTree alias list:

TNamed	cutB6	((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ...
TNamed	cutT	((cutB6 + cutC6 + cutA6) > 0)

Parquet + JSON metadata:

{
  "aliases": {
    "cutB6": "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ...",
    "cutT": "((cutB6 + cutC6 + cutA6) > 0)"
  },
  "dtypes": {
    "cutB6": "uint8",
    "cutT": "uint8"
  }
}

The .parquet file contains embedded metadata.
The .aliases.json provides a transparent sidecar view.
If metadata is missing or outdated, the loader will fall back to the JSON.
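
The fallback rule can be sketched as follows. `load_alias_metadata` is a hypothetical helper, and it only covers the "missing" case; how the actual loader decides that embedded metadata is outdated is not stated here.

```python
import json
import os
import tempfile


def load_alias_metadata(embedded, sidecar_path):
    """Prefer embedded metadata; fall back to the JSON sidecar when it is missing."""
    if embedded and "aliases" in embedded:
        return embedded
    with open(sidecar_path) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, "example.aliases.json")
    with open(sidecar, "w") as handle:
        json.dump({"aliases": {"cutT": "((cutB6 + cutC6 + cutA6) > 0)"},
                   "dtypes": {"cutT": "uint8"}}, handle)
    # No embedded metadata: the sidecar is used.
    from_sidecar = load_alias_metadata(None, sidecar)
    # Embedded metadata present: the sidecar is not consulted.
    from_embedded = load_alias_metadata({"aliases": {}, "dtypes": {}}, sidecar)
```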

Example usage of the tree with aliases in a TTree query (RDataFrame later):

root [20] tree->GetListOfAliases()->ls()
OBJ: TList	TList	Doubly linked list : 0
 OBJ: TNamed	errz0a0	0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5 : 0 at: 0x445d4e0
 OBJ: TNamed	errz0b0	0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5 : 0 at: 0x4465bb0
 OBJ: TNamed	errz0b1	(0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5) : 0 at: 0x4465cb0
 OBJ: TNamed	erry0c1	(0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5)/(150**2) : 0 at: 0x4465db0
 OBJ: TNamed	cutB6	((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8) : 0 at: 0x4465eb0
 OBJ: TNamed	cutC6	((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2) : 0 at: 0x4465f60
 OBJ: TNamed	cutA6	((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2) : 0 at: 0x4466070
 OBJ: TNamed	cutT	((cutB6 + cutC6 + cutA6) > 0) : 0 at: 0x4466180

root [17] tree->Draw("(dz0_a0/errz0a0):mP4>>his(200,-5,5,100,-5,5)","cutT==0","colz")
(long long) 3682828
root [18] tree->Draw("abs(dz0_a0/errz0a0):mP4>>rofdz(200,-5,5,100,0,20)","cutT==0&&abs(dy0_a0/errz0a0)<20","profsame")

miranov25 · 2025-06-02T15:32:46Z

🔄 Update Changes Summary

✅ Constants Support

New parameter is_constant=True in add_alias marks aliases as constants (e.g. countsFV0_median = 2096.0).
Constants are evaluated once and injected directly during alias materialization.
Constants are not materialized as DataFrame columns unless explicitly requested.

🧠 Smart Dependency Handling

During materialize_alias or materialize_all, constants are evaluated and injected before dependency resolution.
Dependency graphs and topological sorting skip constants to avoid recomputation.
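
One way to picture the constant handling: constants are evaluated once to scalars and placed in the evaluation namespace, so they never appear as columns. This is a sketch under that assumption; `countsFV0` and `countsFV0_norm` are hypothetical names, and only `countsFV0_median = 2096.0` comes from the text above.

```python
import numpy as np

columns = {"countsFV0": np.array([1048.0, 2096.0, 4192.0])}
constants = {"countsFV0_median": "2096.0"}           # alias marked as constant
aliases = {"countsFV0_norm": "countsFV0 / countsFV0_median"}

# Constants are evaluated once to plain scalars...
const_values = {k: eval(v, {"__builtins__": {}}) for k, v in constants.items()}

# ...and injected into the namespace, so they never become array columns.
namespace = {**columns, **const_values}
result = eval(aliases["countsFV0_norm"], {"__builtins__": {}}, namespace)
```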

💾 Parquet and ROOT I/O Support

Metadata (aliases, dtypes, constants) is stored inside .parquet file as Arrow schema metadata.
Export to ROOT TTree uses SetAlias, with a workaround for a ROOT interpreter bug: numeric constants are written as val + 0.

🧪 Unit Tests

Added robust pytest unit tests with:
- Standard aliases
- Aliases with custom dtype
- Constants and mixed expressions
- Dependency resolution
- Reloading and validating from saved Parquet files


miranov25 · 2025-06-03T07:52:05Z

Add ROOT SetAlias export and Python-to-ROOT AST translation for aliases

Extended commit description:

Introduced convert_expr_to_root() static method using ast to translate Python expressions into ROOT-compatible syntax, including function mapping (mod → fmod, arctan2 → atan2, etc.).
Patched export_tree() to:
- Apply ROOT-compatible expression conversion.
- Handle ROOT’s TTree::SetAlias limitations (e.g. constants) using (<value> + 0) workaround.
- Save full Python alias metadata (aliases, dtypes, constants) as JSON in TTree::GetUserInfo().
Patched read_tree() to:
- Restore alias expressions and metadata from UserInfo JSON.
- Maintain full alias context including constants and types.
Preserved full compatibility with the existing parquet export/load code.
Ensured Python remains the canonical representation; conversion is only needed for ROOT alias usage.
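
The function-name mapping can be illustrated with a small `ast` transformer. This is a sketch only: `to_root_expr` is a hypothetical helper, not the PR's `convert_expr_to_root()`, and it handles just the two renames named above for bare function names.

```python
import ast

# Subset of the mapping named in the text (mod -> fmod, arctan2 -> atan2).
ROOT_NAMES = {"mod": "fmod", "arctan2": "atan2"}


class _RenameCalls(ast.NodeTransformer):
    def visit_Call(self, node):
        self.generic_visit(node)  # convert nested calls first
        if isinstance(node.func, ast.Name) and node.func.id in ROOT_NAMES:
            node.func = ast.Name(id=ROOT_NAMES[node.func.id], ctx=ast.Load())
        return node


def to_root_expr(expr):
    """Hypothetical helper: rename Python/NumPy function calls to ROOT names."""
    tree = _RenameCalls().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(tree)  # requires Python 3.9+


converted = to_root_expr("arctan2(py, px) + mod(phi, 2)")
```

Working on the syntax tree rather than on the raw string avoids renaming substrings of unrelated identifiers, such as a column that happens to be called `model`.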

…verbosity
- Introduced `materialize_aliases(targets, cleanTemporary=True, verbose=False)` method:
  - Builds a dependency graph among defined aliases using NetworkX.
  - Topologically sorts dependencies to ensure correct materialization order.
  - Materializes only the requested aliases and their dependencies.
  - Optionally cleans up intermediate (temporary) columns not in the target list.
  - Includes verbose logging to trace evaluation and cleanup steps.
- Improves memory efficiency and control when working with layered alias chains.
- Ensures robust handling of mixed alias and non-alias columns.
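
Selecting only the requested aliases plus their dependencies, and identifying the temporaries to clean up, might look like the sketch below. It uses plain Python instead of NetworkX, and `required_aliases` and the dependency map are hypothetical.

```python
def required_aliases(targets, deps):
    """Return the targets plus every alias they depend on, directly or not."""
    needed, stack = set(), list(targets)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps.get(name, ()))
    return needed


# Hypothetical dependency map: alias -> aliases it references.
deps = {
    "cutT": {"cutB6", "cutA6"},
    "cutB6": {"errz0b0"},
    "cutA6": {"errz0a0"},
    "cutC6": {"erry0c1"},
}
targets = ["cutB6"]
needed = required_aliases(targets, deps)
temporaries = needed - set(targets)  # candidates for cleanup afterwards
```

Aliases outside `needed` (here `cutC6` and `erry0c1`) are never evaluated, which is where the memory saving comes from.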

…ror handling
- Added tests for:
  * Circular dependency detection
  * Undefined alias symbols
  * Invalid expression syntax
  * Partial materialization logic
  * Subframe behavior with unregistered references
  * Improved save/load integrity checks with alias mean delta validation
  * Direct alias dictionary comparison after load

Known test failures to be addressed:
- Circular dependency not detected (ValueError not raised)
- Syntax error not caught (SyntaxError not raised)
- Undefined symbol not caught (Exception not raised)
- Partial materialization does not preserve dependency logic
- Subframe alias on unregistered frame does not raise NameError

- Introduces per-channel, detector-agnostic model: X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0), centered on Δq
- Defines inputs/outputs, fit steps, and monotonicity policy (b > b_min)
- Details nuisance-axis interpolation (linear/PCHIP) and uncertainty (σ_Q, σ_Q_irr)
- Provides API sketch (fit_quantile_linear_nd, QuantileEvaluator) and persistence (Parquet/Arrow/ROOT)
- Outlines unit tests, diagnostics, and performance expectations

Refs: calibration, multiplicity/flow estimator framework


…ench
- Introduces dfextensions/quantile_fit_nd:
  - quantile_fit_nd.py: per-channel ND fit, separable interpolation, evaluator, I/O
  - test_quantile_fit_nd.py: synthetic unit tests (uniform/poisson/gaussian, z nuisance)
  - bench_quantile_fit_nd.py: simple timing benchmark over N and distributions
- Uses Δq-centered model: X = a(q0,n) + b(q0,n)·(Q − q0)
- Enforces monotonicity with configurable b_min (auto/fixed)
- Outputs DataFrame (Parquet/Arrow/ROOT) with diagnostics and metadata

…ust edge expectations
- Define evaluator.invert_rank() with self-consistent candidate + fixed-point refinement
- Compute b(z) expectation by averaging b_true over sample per z-bin
- Relax sigma_Q tolerance to 0.25 (finite-window OLS)
- Update edge-case test to assert edge coverage instead of unrealistic 90% overall

…ngle-groupby warning
- Evaluator was treating 'q_center' as a nuisance axis (detected by *_center), causing axis misalignment and AxisError in moveaxis. Exclude it explicitly.
- When grouping by a single nuisance bin column, use scalar grouper to avoid pandas FutureWarning.

…b_min + stable inversion
- QuantileEvaluator: exclude 'q_center' from nuisance axes (fix AxisError in moveaxis)
- Groupby: use scalar grouper for single nuisance bin column (silence FutureWarning)
- Fit: compute b_min per |Q−q0|≤dq window (avoid over-clipping b in low-b regions)
- Inversion: implement self-consistent candidate + 2-step fixed-point refine (invert_rank)
- Keep API/metadata unchanged; prepare for ND nuisances and time


…d record reason
- Apply b_min only when a valid fit yields b<=0 (monotonicity enforcement)
- For low-Q-spread / low-N windows, keep NaN (no floor), add reason in fit_stats
- Greatly reduces bias in Poisson case; z-bin averages use informative windows only
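
The Δq-centered model X = a + b·(Q − q0) and the b_min policy from these commit messages can be sketched as a single-window fit. This is only an illustration: `fit_window`, the `min_n` threshold and the reason strings are assumptions, not the API of `quantile_fit_nd.py`.

```python
import numpy as np


def fit_window(Q, X, q0, dq, b_min, min_n=5):
    """OLS fit of X = a + b*(Q - q0) using points with |Q - q0| <= dq."""
    sel = np.abs(Q - q0) <= dq
    if sel.sum() < min_n:
        return np.nan, np.nan, "low_N"           # no floor applied, keep NaN
    b, a = np.polyfit(Q[sel] - q0, X[sel], 1)    # slope first, then intercept
    if b <= 0:
        return a, b_min, "b_floored"             # enforce monotonicity
    return a, b, "ok"


rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 1.0, 2000)
X = 10.0 + 4.0 * Q + rng.normal(0.0, 0.1, Q.size)   # synthetic data, true slope 4
a, b, reason = fit_window(Q, X, q0=0.5, dq=0.1, b_min=1e-3)
empty = fit_window(Q, X, q0=5.0, dq=0.1, b_min=1e-3)  # window with no points
```

Centering on q0 makes the intercept `a` the model value at Q = q0. Here that is close to 12 for the synthetic data.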



- Round-trip RMS is dominated by per-event noise → expect α_rt≈0 (flat), not −0.5
- Keep rms_b scaling check near −0.5 (loosen tol to ±0.2 across 5 N points)
- Clarify summary prints and expectations; leave constancy check only for rms_b·√N

PWGPP-643


- bench_groupby_regression.py: self-contained scenarios (clean/outliers, serial/parallel)
- Emits TXT and JSON (CSV optional) for easy doc inclusion and CI checks
- Uses y ~ x1 + x2 per-group via GroupByRegressor.make_parallel_fit
- Workaround for single-col group key (duplicate column for tuple keys)

Sample results show:
- ~1.75 s / 1k groups (serial clean, 50k rows, 10k groups)
- ~0.41 s / 1k groups with n_jobs=10 (≈4.3× speedup)
- Current y-shift outliers do not slow down OLS path (no refits triggered)

…x Markdown tables
- Added new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file

- Default benchmark: 5 rows/group, 5k groups (faster, still representative)
- Added 30% outlier scenario to examples; clarified that response-only outliers don’t trigger slow robust re-fits
- Updated example tables for Mac and Linux with new per-1k-group timings
- (optional) bench CLI default --groups=5000

…erage-outlier plan
- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default

…iag_prefix)
- process_group_robust: record n_refits, frac_rejected, hat_max, cond_xtx, time_ms, n_rows (only when diag=True)
- make_parallel_fit: new args diag / diag_prefix (default off; no behavior change)
- add summarize_diagnostics(dfGB) helper for quick triage
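
Two of the recorded diagnostics can be illustrated for a single group. This sketch assumes that `hat_max` means the largest leverage (diagonal of the hat matrix) and `cond_xtx` the condition number of XᵀX; that reading is an assumption. `ols_diagnostics` is a hypothetical helper, not part of `GroupByRegressor`.

```python
import numpy as np


def ols_diagnostics(X, y):
    """Fit y ~ X by least squares; return coefficients, max leverage, cond(X^T X)."""
    A = np.column_stack([np.ones(len(X)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    hat = A @ np.linalg.pinv(A)                    # hat matrix H = A (A^T A)^-1 A^T
    return coef, float(np.max(np.diag(hat))), float(np.linalg.cond(A.T @ A))


rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))                       # x1, x2 for one group
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]            # noise-free synthetic response
coef, hat_max, cond_xtx = ols_diagnostics(X, y)
```

A leverage close to 1 marks a point that largely determines its own fitted value, and a large condition number signals nearly collinear regressors.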

… report
- Append scenario-wise diagnostics summary to benchmark_report.txt
- Save top-10 violators per scenario (time/refits) as CSVs
- Supports suffix-aware summarize_diagnostics() from GroupByRegressor
- Verified clean pytest and benchmark runs on real and synthetic data


make diff of time series

e6940e9

miranov25 requested review from chiarazampolli, davidrohr, sawenzel and shahor02 as code owners May 29, 2025 17:43

miranov25 changed the title ~~make diff of time series to compare test productions~~ make diff of time series to compare test productions+AliasDataFrame May 29, 2025

miranov25 added 2 commits May 31, 2025 12:09

adding perfmonitor

8ddfbf7

adding PerfromanceLogger extracted from calibration code

350f786

miranov25 marked this pull request as draft May 31, 2025 17:35

suppressing linter warning

1ba0686

miranov25 added 3 commits June 1, 2025 09:07

Add support for dtype persistence and alias filtering in save/load

54de3fd

- Extend `save()` with dropAliasColumns to skip derived columns (previously done only for TTree)
- Store alias output dtypes in JSON metadata
- Restore dtypes on load using numpy type resolution

Save aliases directly to pyarrow metadata

b8e241e

miranov25 added 5 commits June 2, 2025 08:24

add FormulaLinearModel.py used for the dEdx and distortion calibration

fcb9bb9

add FormulaLinearModel.py used for the dEdx and distortion calibration

cfe72d4

special treatment for constants - should never be materialized, only used

9087f54

special treatment for constants

60e26cb

special treatment for constants

b188456

miranov25 added 5 commits June 4, 2025 13:58

Extended unit test for the sub_frames

679141b

fixed - Circular dependency detection

3aae8ee

fixing all unit tests - except the automatic materialization

6759c26

miranov25 added 30 commits June 25, 2025 13:37

adding conversions to the function list

4d44bb2

adding chunksize and compression as argument

cb4b5d1

adding chunksize and compression as argument

87fa521

adding df drawing interface similar to the tree::Draw

4ef6973

Commit latest working version of AliasDataFrame

257d2ea

Commit latest working version of perfoemance_logger.py

fc54430

Commit latest working version of groupby_regression.py

161f0f0

feat(DataFrameUtils): Enhance docstrings and error handling for scatt…

53db0b8

…er plots - Added NumPy-style docstrings to df_draw_scatter and drawExample

tests(quantile_fit_nd): snapshot pre-fix state with rich diagnostics …

a578c17

…(exclude IDE files) - remove .idea/ from repo and add .gitignore

tests(quantile_fit_nd): handle Poisson via randomized PIT pre-processing

30b7ee7

- Use Q = F(k-1) + U*(F(k)-F(k-1)) for Poisson synthetic data
- Ensures continuous ranks and informative Δq windows
- Keeps fitter unchanged; diagnostics remain valid
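
The randomized PIT formula above can be written out with the standard library. `poisson_cdf` and `randomized_pit` are hypothetical helpers written for this sketch; they are not the utilities shipped with the PR.

```python
import math
import random


def poisson_cdf(k, mu):
    """F(k) = P(K <= k) for a Poisson distribution; F(-1) = 0."""
    if k < 0:
        return 0.0
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


def randomized_pit(k, mu, rng):
    """Map a discrete count k to a continuous rank Q = F(k-1) + U*(F(k) - F(k-1))."""
    lo, hi = poisson_cdf(k - 1, mu), poisson_cdf(k, mu)
    return lo + rng.random() * (hi - lo)


rng = random.Random(1)
q = randomized_pit(3, mu=2.5, rng=rng)
```

Each count k is spread uniformly over the probability mass it carries, so the resulting ranks have no ties even though the input is discrete.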

docs(quantile_fit_nd): add Discrete Inputs policy and utilities

12d5fe4

- Explain continuous-Q assumption and discrete preprocessing (PIT/mid-ranks)
- Add utils: discrete_to_uniform_rank_poisson / _empirical for reuse

docs(quantile_fit_nd): add contextLLM.md (cold-start guide + policies)

8625857

- One-page snapshot of goals, assumptions, API, commands
- Documents discrete-input policy (PIT/mid-rank) and monotonicity
- Links code, tests, and benchmark usage with scaling expectations

PWGPP-643

docs(quantile_fit_nd): add contextLLM.md (cold-start guide + policies)

2b27e47

- One-page snapshot of goals, assumptions, API, commands
- Documents discrete-input policy (PIT/mid-rank) and monotonicity
- Links code, tests, and benchmark usage with scaling expectations

PWGPP-643

Forgotten commit of reference test and bench log

ec9f424

docs(restartContext): record diagnostics integration and real-data va…

a71cc4d

…lidation Added suffix-aware summarize_diagnostics + benchmark report integration Confirmed robust re-fit loop in real datasets Prepared next-phase plan for real-use-case profiling and fast-path study

docs(restartContext): record diagnostics integration and real-data va…

cc1ecb4

…lidation Added suffix-aware summarize_diagnostics + benchmark report integration Confirmed robust re-fit loop in real datasets Prepared next-phase plan for real-use-case profiling and fast-path study

use faster compression by default

5cf7431

miranov25 wants to merge 69 commits into AliceO2Group:master from miranov25:master


1 participant
