make diff of time series to compare test productions+AliasDataFrame #2014
miranov25 wants to merge 69 commits into AliceO2Group:master from
…unctionality by enabling:

* **Lazy evaluation of derived columns via named aliases**
* **Automatic dependency resolution across aliases**
* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)**
* **ROOT-compatible TTree export/import including alias metadata**
✨ Add `AliasDataFrame` Utilities for On-Demand Evaluation

This PR adds support for alias-based derived column computation, as used for example in TPC distortion error parameterization. It includes:

✅ Key Features
🧪 Example Usage

The function below demonstrates how derived error estimates and quality flags can be defined in terms of other DataFrame columns and aliases:

    import numpy as np

    def makeErrParamAlias(adf):
        # Beta2 is stored as a real column; the aliases below refer to it by name.
        adf.df["Beta2"] = np.minimum(50 / adf.df["dEdxTPC"], 1.0).astype(np.float16)
        # Error parameterizations, evaluated only when requested.
        adf.add_alias("errz0a0", "0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
        adf.add_alias("errz0b0", "0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
        adf.add_alias("errz0b1", "0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
        adf.add_alias("erry0c1", "0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5/150**2", dtype=np.float16)
        # Bit-mask quality flags: each |residual/error| above 6 sets one bit.
        adf.add_alias("cutB6", "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8)", dtype=np.uint8)
        adf.add_alias("cutC6", "((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2)", dtype=np.uint8)
        adf.add_alias("cutA6", "((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2)", dtype=np.uint8)
        # Combined flag: 1 if any of the bit masks above is non-zero.
        adf.add_alias("cutT", "((cutB6 + cutC6 + cutA6) > 0)", dtype=np.uint8)
        return adf

📊 Alias Dependency Graph
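The dependency graph for aliases like the ones above can be derived directly from the expression strings. The snippet below is a minimal sketch of that idea using only the standard-library `ast` module; the helper `split_dependencies` and the `FUNCTIONS` whitelist are hypothetical names for illustration and are not taken from `AliasDataFrame`.

```python
import ast

# Three of the aliases from the example above, copied verbatim.
aliases = {
    "errz0b0": "0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5",
    "errz0b1": "0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5",
    "cutB6": "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2)"
             " + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8)",
}

# Assumed whitelist of function names that are neither columns nor aliases.
FUNCTIONS = {"sqrt", "abs"}


def split_dependencies(expr, alias_names):
    """Return (alias dependencies, plain column dependencies) of an expression."""
    tree = ast.parse(expr, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    names -= FUNCTIONS
    alias_deps = sorted(names & set(alias_names))
    column_deps = sorted(names - set(alias_names))
    return alias_deps, column_deps


# Edges of the dependency graph: alias -> aliases it needs first.
edges = {name: split_dependencies(expr, aliases)[0] for name, expr in aliases.items()}
```

Here `edges["cutB6"]` lists the two error aliases that must be evaluated before the cut, while the error aliases themselves depend only on stored columns.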
- Allow optional dtype per alias via `add_alias(..., dtype=...)`
- Enable global override dtype in `materialize_alias` and `materialize_all`
- Add `plot_alias_dependencies()` for visualizing alias dependencies
- Improve alias validation with support for numpy/math functions

- Extend `save()` with `dropAliasColumns` to skip derived columns (previously done only for TTree)
- Store alias output dtypes in JSON metadata
- Restore dtypes on load using numpy type resolution
🧾 Output Storage for
🔄 Update Changes Summary

✅ Constants Support
🧠 Smart Dependency Handling
💾 Parquet and ROOT I/O Support
🧪 Unit Tests
…ses`

**Extended commit description:**

* Introduced `convert_expr_to_root()` static method using `ast` to translate Python expressions into ROOT-compatible syntax, including function mapping (`mod → fmod`, `arctan2 → atan2`, etc.).
* Patched `export_tree()` to:
  * Apply ROOT-compatible expression conversion.
  * Handle ROOT’s `TTree::SetAlias` limitations (e.g. constants) using the `(<value> + 0)` workaround.
  * Save full Python alias metadata (`aliases`, `dtypes`, `constants`) as JSON in `TTree::GetUserInfo()`.
* Patched `read_tree()` to:
  * Restore alias expressions and metadata from `UserInfo` JSON.
  * Maintain full alias context including constants and types.
* Preserved full compatibility with the existing parquet export/load code.
* Ensured Python remains the canonical representation; conversion is only needed for ROOT alias usage.
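As an illustration of the `ast`-based translation described above, here is a minimal sketch, not the PR's `convert_expr_to_root()` itself. Only the `mod → fmod` and `arctan2 → atan2` entries come from the commit message; the other mapping entries, the module-prefix stripping, and the names `to_root_expr` and `root_constant_alias` are assumptions. Operators such as `**` are left untouched here.

```python
import ast

# mod and arctan2 are named in the commit message; the remaining entries are assumed.
FUNC_MAP = {
    "mod": "fmod",
    "arctan2": "atan2",
    "arcsin": "asin",
    "arccos": "acos",
    "arctan": "atan",
}


class _RenameCalls(ast.NodeTransformer):
    """Rename called functions and drop a module prefix such as np.sqrt -> sqrt."""

    def visit_Call(self, node):
        self.generic_visit(node)  # handle nested calls first
        func = node.func
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            func = ast.Name(id=func.attr, ctx=ast.Load())
        if isinstance(func, ast.Name):
            func = ast.Name(id=FUNC_MAP.get(func.id, func.id), ctx=ast.Load())
        node.func = func
        return node


def to_root_expr(expr):
    """Translate a Python alias expression into a ROOT-style string (sketch)."""
    tree = ast.parse(expr, mode="eval")
    tree = ast.fix_missing_locations(_RenameCalls().visit(tree))
    return ast.unparse(tree.body)  # requires Python 3.9+


def root_constant_alias(value):
    """Wrap a constant as '(<value> + 0)', the workaround quoted in the commit."""
    return f"({value!r} + 0)"


converted = to_root_expr("np.arctan2(y, x) + mod(a, 2)")
```

Keeping the Python string as the stored form and converting only at export time matches the stated intent that Python remains the canonical representation.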
…verbosity

- Introduced `materialize_aliases(targets, cleanTemporary=True, verbose=False)` method:
  - Builds a dependency graph among defined aliases using NetworkX.
  - Topologically sorts dependencies to ensure correct materialization order.
  - Materializes only the requested aliases and their dependencies.
  - Optionally cleans up intermediate (temporary) columns not in the target list.
  - Includes verbose logging to trace evaluation and cleanup steps.
- Improves memory efficiency and control when working with layered alias chains.
- Ensures robust handling of mixed alias and non-alias columns.
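The ordering step can be sketched without any third-party dependency. The example below uses the standard-library `graphlib.TopologicalSorter` in place of NetworkX, which the PR actually uses; the alias names and the helper `materialization_order` are hypothetical.

```python
import ast
from graphlib import TopologicalSorter


def alias_deps(expr, aliases):
    """Names used in expr that are themselves aliases."""
    tree = ast.parse(expr, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return names & set(aliases)


def materialization_order(aliases, targets):
    """Requested aliases plus their alias dependencies, dependencies first."""
    needed, stack = set(), list(targets)
    while stack:  # collect only what the targets actually require
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(alias_deps(aliases[name], aliases))
    graph = {name: alias_deps(aliases[name], aliases) for name in needed}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical alias chain: cut -> pull -> errz; "unused" is never requested.
aliases = {
    "errz": "0.35 * sqrt(1 + mP4**2 / Beta2)",
    "pull": "dz / errz",
    "cut": "abs(pull) > 6",
    "unused": "dz * 2",
}
targets = ["cut"]
order = materialization_order(aliases, targets)
# Columns that a cleanTemporary-style option could drop afterwards.
temporaries = [name for name in order if name not in targets]
```

Only `errz`, `pull`, and `cut` are evaluated, and the first two are the intermediate columns that can be dropped once `cut` exists.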
…ror handling

- Added tests for:
  * Circular dependency detection
  * Undefined alias symbols
  * Invalid expression syntax
  * Partial materialization logic
  * Subframe behavior with unregistered references
  * Improved save/load integrity checks with alias mean delta validation
  * Direct alias dictionary comparison after load

Known test failures to be addressed:

- Circular dependency not detected (ValueError not raised)
- Syntax error not caught (SyntaxError not raised)
- Undefined symbol not caught (Exception not raised)
- Partial materialization does not preserve dependency logic
- Subframe alias on unregistered frame does not raise NameError
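The first three failure modes listed above could be caught by a validation pass along the following lines. This is a sketch of the behaviour the tests expect, not the PR's code: `validate_aliases` and `KNOWN_FUNCS` are hypothetical names, and `graphlib` stands in for the NetworkX graph used in the PR.

```python
import ast
from graphlib import CycleError, TopologicalSorter

# Assumed whitelist of function names allowed inside alias expressions.
KNOWN_FUNCS = {"sqrt", "abs", "log", "exp"}


def validate_aliases(aliases, columns):
    """Raise SyntaxError, NameError, or ValueError for broken alias definitions."""
    graph = {}
    for name, expr in aliases.items():
        try:
            tree = ast.parse(expr, mode="eval")
        except SyntaxError as err:
            raise SyntaxError(f"alias {name!r}: invalid expression {expr!r}") from err
        used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
        unknown = used - set(columns) - set(aliases) - KNOWN_FUNCS
        if unknown:
            raise NameError(f"alias {name!r}: undefined symbols {sorted(unknown)}")
        graph[name] = used & set(aliases)
    try:
        # static_order() raises CycleError when aliases depend on each other.
        list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"circular alias dependency: {err.args[1]}") from err
```

Running the check once at `add_alias` time or before materialization would surface these errors early rather than during evaluation.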
- Introduces per-channel, detector-agnostic model: X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0), centered on Δq
- Defines inputs/outputs, fit steps, and monotonicity policy (b > b_min)
- Details nuisance-axis interpolation (linear/PCHIP) and uncertainty (σ_Q, σ_Q_irr)
- Provides API sketch (fit_quantile_linear_nd, QuantileEvaluator) and persistence (Parquet/Arrow/ROOT)
- Outlines unit tests, diagnostics, and performance expectations

Refs: calibration, multiplicity/flow estimator framework
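To make the local model concrete, the sketch below fits X = a + b·(Q − q0) by ordinary least squares over the window |Q − q0| ≤ dq for a single channel and a single nuisance bin. The function name `fit_window` and the `min_points` threshold are assumptions; the real `fit_quantile_linear_nd` is not shown in this PR page.

```python
def fit_window(Q, X, q0, dq, min_points=3):
    """OLS fit of X = a + b*(Q - q0) using only points with |Q - q0| <= dq.

    Returns (a, b), or None when the window is too small or has no Q spread.
    """
    pts = [(q - q0, x) for q, x in zip(Q, X) if abs(q - q0) <= dq]
    n = len(pts)
    if n < min_points:
        return None
    mean_d = sum(d for d, _ in pts) / n
    mean_x = sum(x for _, x in pts) / n
    sdd = sum((d - mean_d) ** 2 for d, _ in pts)
    if sdd == 0.0:
        return None
    b = sum((d - mean_d) * (x - mean_x) for d, x in pts) / sdd
    a = mean_x - b * mean_d  # value of the fitted line at Q = q0
    return a, b


# Synthetic, noise-free data that follows the model exactly around q0 = 0.5.
Q = [i / 100 for i in range(101)]
X = [2.0 + 3.0 * (q - 0.5) for q in Q]
a, b = fit_window(Q, X, q0=0.5, dq=0.1)
```

Because the data are exactly linear, the fit recovers a = 2 and b = 3 up to rounding.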
…er plots

- Added NumPy-style docstrings to `df_draw_scatter` and `drawExample`
…ench

- Introduces dfextensions/quantile_fit_nd:
  - quantile_fit_nd.py: per-channel ND fit, separable interpolation, evaluator, I/O
  - test_quantile_fit_nd.py: synthetic unit tests (uniform/poisson/gaussian, z nuisance)
  - bench_quantile_fit_nd.py: simple timing benchmark over N and distributions
- Uses Δq-centered model: X = a(q0,n) + b(q0,n)·(Q − q0)
- Enforces monotonicity with configurable b_min (auto/fixed)
- Outputs DataFrame (Parquet/Arrow/ROOT) with diagnostics and metadata
…ust edge expectations

- Define evaluator.invert_rank() with self-consistent candidate + fixed-point refinement
- Compute b(z) expectation by averaging b_true over sample per z-bin
- Relax sigma_Q tolerance to 0.25 (finite-window OLS)
- Update edge-case test to assert edge coverage instead of unrealistic 90% overall
…ngle-groupby warning

- Evaluator was treating 'q_center' as a nuisance axis (detected by *_center), causing axis misalignment and AxisError in moveaxis. Exclude it explicitly.
- When grouping by a single nuisance bin column, use scalar grouper to avoid pandas FutureWarning.
…b_min + stable inversion

- QuantileEvaluator: exclude 'q_center' from nuisance axes (fix AxisError in moveaxis)
- Groupby: use scalar grouper for single nuisance bin column (silence FutureWarning)
- Fit: compute b_min per |Q−q0|≤dq window (avoid over-clipping b in low-b regions)
- Inversion: implement self-consistent candidate + 2-step fixed-point refine (invert_rank)
- Keep API/metadata unchanged; prepare for ND nuisances and time
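The body of `invert_rank` is not shown on this page, so the following is only one plausible reading of "self-consistent candidate + 2-step fixed-point refine". It assumes fitted values `a` and `b` on a grid of quantile centers, picks the candidate whose inverted Q lies closest to its own center, and then applies two refinement steps with linearly interpolated `a` and `b`. All names except `invert_rank` are hypothetical.

```python
import bisect


def interp(q, xs, ys):
    """Linear interpolation on a sorted grid, clamped at the edges."""
    if q <= xs[0]:
        return ys[0]
    if q >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, q)  # xs[i-1] <= q < xs[i]
    t = (q - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def invert_rank(x, centers, a, b, n_refine=2):
    """Estimate Q for an observed X, assuming b > 0 at every center."""
    # Each center proposes Q = q0 + (x - a) / b from its local linear model.
    candidates = [q0 + (x - ai) / bi for q0, ai, bi in zip(centers, a, b)]
    # Self-consistent candidate: the proposal closest to its own center.
    q = min(zip(candidates, centers), key=lambda p: abs(p[0] - p[1]))[0]
    for _ in range(n_refine):
        a_q = interp(q, centers, a)
        b_q = interp(q, centers, b)
        q = q + (x - a_q) / b_q  # fixed-point update toward a(q) = x
        q = min(max(q, 0.0), 1.0)  # ranks are confined to [0, 1]
    return q


# Toy calibration where X = 10 * Q exactly, so a(q0) = 10 * q0 and b = 10.
centers = [i / 10 for i in range(1, 10)]
a = [10.0 * c for c in centers]
b = [10.0] * len(centers)
q_est = invert_rank(4.2, centers, a, b)
```

For the toy calibration the estimate is 0.42, the rank at which the model reproduces the observed value.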
…(exclude IDE files)

- remove .idea/ from repo and add .gitignore
…d record reason

- Apply b_min only when a valid fit yields b<=0 (monotonicity enforcement)
- For low-Q-spread / low-N windows, keep NaN (no floor), add reason in fit_stats
- Greatly reduces bias in Poisson case; z-bin averages use informative windows only
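The policy above can be written as a small decision function. This is a sketch under assumed thresholds: `min_points`, `min_spread`, the reason strings, and the function name `apply_slope_policy` are all hypothetical and chosen only to illustrate the order of the checks.

```python
import math


def apply_slope_policy(b, n_points, q_spread, b_min, min_points=10, min_spread=1e-3):
    """Return (slope, reason) following the floor-only-valid-fits policy."""
    if n_points < min_points:
        return math.nan, "low_N"  # uninformative window: keep NaN, no floor
    if q_spread < min_spread:
        return math.nan, "low_Q_spread"  # no lever arm in Q: keep NaN, no floor
    if b <= 0:
        return b_min, "floored"  # valid fit, but monotonicity must be enforced
    return b, "ok"


slope_ok = apply_slope_policy(0.8, n_points=50, q_spread=0.05, b_min=0.01)
slope_floored = apply_slope_policy(-0.2, n_points=50, q_spread=0.05, b_min=0.01)
slope_skipped = apply_slope_policy(-0.2, n_points=4, q_spread=0.05, b_min=0.01)
```

Checking the window quality before the sign of `b` is what keeps uninformative windows from being filled with the floor value.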
- Use Q = F(k-1) + U*(F(k)-F(k-1)) for Poisson synthetic data
- Ensures continuous ranks and informative Δq windows
- Keeps fitter unchanged; diagnostics remain valid
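The randomized rank Q = F(k-1) + U·(F(k) − F(k-1)) can be sketched with the standard library alone. The function names below are hypothetical and do not correspond to the `discrete_to_uniform_rank_poisson` utility mentioned elsewhere in the PR.

```python
import math
import random


def poisson_cdf(k, lam):
    """F(k) = P(K <= k) for a Poisson variable with mean lam; F(-1) = 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P(K = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(K = i) from P(K = i - 1)
        total += term
    return min(total, 1.0)


def randomized_rank(k, lam, rng):
    """Spread the discrete count k uniformly over its CDF step [F(k-1), F(k)]."""
    lo = poisson_cdf(k - 1, lam)
    hi = poisson_cdf(k, lam)
    return lo + rng.random() * (hi - lo)


rng = random.Random(0)  # fixed seed so the example is reproducible
ranks = [randomized_rank(k, lam=2.5, rng=rng) for k in (0, 1, 3, 7)]
```

Each count is mapped to a value inside its own CDF step, so repeated counts no longer collapse onto a single rank.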
- Explain continuous-Q assumption and discrete preprocessing (PIT/mid-ranks)
- Add utils: discrete_to_uniform_rank_poisson / _empirical for reuse
- Round-trip RMS is dominated by per-event noise → expect α_rt≈0 (flat), not −0.5
- Keep rms_b scaling check near −0.5 (loosen tol to ±0.2 across 5 N points)
- Clarify summary prints and expectations; leave constancy check only for rms_b·√N

PWGPP-643
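A scaling exponent such as α can be estimated as the least-squares slope of log(RMS) against log(N). The sketch below uses made-up inputs that follow the two expected behaviours exactly; the numbers are illustrative and are not benchmark results from this PR.

```python
import math


def scaling_exponent(ns, values):
    """Least-squares slope of log(values) versus log(ns)."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(v) for v in values]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


# Five sample sizes; the values below are synthetic, not measured.
ns = [1_000, 3_000, 10_000, 30_000, 100_000]
rms_b = [0.8 / math.sqrt(n) for n in ns]  # statistical error: falls like N**-0.5
rms_rt = [0.05 for _ in ns]  # noise-dominated round trip: flat in N

alpha_b = scaling_exponent(ns, rms_b)
alpha_rt = scaling_exponent(ns, rms_rt)
within_tolerance = abs(alpha_b + 0.5) <= 0.2  # the ±0.2 tolerance from the commit
```

With measured values the slope carries statistical scatter, which is why the check above uses a tolerance rather than an exact comparison.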
- One-page snapshot of goals, assumptions, API, commands
- Documents discrete-input policy (PIT/mid-rank) and monotonicity
- Links code, tests, and benchmark usage with scaling expectations

PWGPP-643
- bench_groupby_regression.py: self-contained scenarios (clean/outliers, serial/parallel)
- Emits TXT and JSON (CSV optional) for easy doc inclusion and CI checks
- Uses y ~ x1 + x2 per-group via GroupByRegressor.make_parallel_fit
- Workaround for single-col group key (duplicate column for tuple keys)

Sample results show:

- ~1.75 s / 1k groups (serial clean, 50k rows, 10k groups)
- ~0.41 s / 1k groups with n_jobs=10 (≈4.3× speedup)
- Current y-shift outliers do not slow down OLS path (no refits triggered)
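The quoted speedup follows directly from the two per-1k-group timings. The short calculation below reproduces it and also derives the implied parallel efficiency and total run times for the 10k-group scenario; the variable names are illustrative only.

```python
# Timings quoted in the sample results (seconds per 1k groups).
serial_s_per_1k = 1.75
parallel_s_per_1k = 0.41
n_jobs = 10
n_groups = 10_000

speedup = serial_s_per_1k / parallel_s_per_1k  # about 4.3
efficiency = speedup / n_jobs  # fraction of ideal linear scaling
serial_total_s = serial_s_per_1k * n_groups / 1000
parallel_total_s = parallel_s_per_1k * n_groups / 1000

print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
print(f"serial {serial_total_s:.1f} s, parallel {parallel_total_s:.1f} s")
```

An efficiency well below 100% with ten workers suggests that per-group overhead, rather than the fit itself, takes a sizeable share of the run time.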
…x Markdown tables

- Added new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file
- Default benchmark: 5 rows/group, 5k groups (faster, still representative)
- Added 30% outlier scenario to examples; clarified that response-only outliers don’t trigger slow robust re-fits
- Updated example tables for Mac and Linux with new per-1k-group timings
- (optional) bench CLI default --groups=5000
…erage-outlier plan

- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
…iag_prefix)

- process_group_robust: record n_refits, frac_rejected, hat_max, cond_xtx, time_ms, n_rows (only when diag=True)
- make_parallel_fit: new args diag / diag_prefix (default off; no behavior change)
- add summarize_diagnostics(dfGB) helper for quick triage
… report

- Append scenario-wise diagnostics summary to benchmark_report.txt
- Save top-10 violators per scenario (time/refits) as CSVs
- Supports suffix-aware summarize_diagnostics() from GroupByRegressor
- Verified clean pytest and benchmark runs on real and synthetic data
…lidation

Added suffix-aware summarize_diagnostics + benchmark report integration
Confirmed robust re-fit loop in real datasets
Prepared next-phase plan for real-use-case profiling and fast-path study


Adding -
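`AliasDataFrame` is a small utility that extends `pandas.DataFrame` functionality by enabling:

* **Lazy evaluation of derived columns via named aliases**
* **Automatic dependency resolution across aliases**
* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)**
* **ROOT-compatible TTree export/import including alias metadata**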