@abon-mostly
Contributor

No description provided.

@mplatzer changed the title from "Fix predict_proba with context data" to "fix: fix predict_proba with context data" on Dec 12, 2025

Copilot AI left a comment


Pull request overview

This PR fixes a bug in predict_proba when using context data where the input X contains only the join key. The fix ensures correct sample counts are maintained throughout the probability computation pipeline.

Key Changes

  • Added explicit n_samples parameter to _generate_marginal_probs function to avoid deriving sample count from potentially concatenated/batched data
  • Refactored DataFrame construction in multi-target probability computation to use explicit indexing and dict-based approach
  • Added comprehensive test coverage for single and multi-target probability prediction with context-only input
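The first change can be illustrated with a small sketch. The helper below is hypothetical and is not the code in mostlyai/engine/_tabular/probability.py; plain lists of dicts stand in for DataFrames, and the "batched" data is an assumed intermediate. It only shows why taking the row count from data that may have been concatenated can give the wrong shape, while an explicit n_samples argument stays tied to the caller's input.

```python
def uniform_probs(categories, n_samples):
    """Hypothetical helper: one uniform probability row per requested sample."""
    p = 1.0 / len(categories)
    return [{c: p for c in categories} for _ in range(n_samples)]


# X holds only the join key for 10 samples (a list of dicts stands in for a DataFrame).
X = [{"id": i} for i in range(10)]
# Assumed intermediate data: the same rows concatenated once more during batching.
batched = X + X

categories = ["b1", "b2", "b3"]
# Row count inferred from the batched data: twice as many rows as the caller passed in.
inferred = uniform_probs(categories, n_samples=len(batched))
# Row count passed explicitly by the caller: matches the input.
explicit = uniform_probs(categories, n_samples=len(X))
```

In this sketch the explicit variant returns one row per input sample regardless of how the intermediate data was assembled, which is the property the new test asserts through proba_multi.shape[0] == 10.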

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
uv.lock | Version bump from 2.3.1 to 2.3.3 reflecting the bug fix
mostlyai/engine/_tabular/probability.py | Core fix: added n_samples parameter and refactored DataFrame construction to handle context-only scenarios correctly
tests/end_to_end/test_tabular_interface.py | Added test case validating predict_proba behavior when X contains only join keys with context data


X=test_df[["id"]], ctx_data=test_df[["id", "ctx_a"]], target=["target_b", "target_c"]
)
assert proba_multi.shape[0] == 10
assert proba_multi.shape[1] == 16 # 4 × 4 = 16 combinations

Copilot AI Dec 12, 2025


The comment states "4 × 4 = 16 combinations" but the test data defines both target_b and target_c with only 3 unique values each (["b1", "b2", "b3"] and ["c1", "c2", "c3"]). The expected shape should be 3 × 3 = 9 combinations, not 16. However, the actual model may include additional tokens (like special tokens for unknown values, padding, etc.), so the assertion value of 16 may be correct depending on the encoding. The comment should either be corrected to reflect the actual number of combinations being tested, or use a >= comparison like the single target test does.

Suggested change
- assert proba_multi.shape[1] == 16 # 4 × 4 = 16 combinations
+ assert proba_multi.shape[1] >= 9 # 3 × 3 = 9 combinations (may include extra tokens)
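The arithmetic behind the two candidate comments can be checked with a short sketch. It assumes, purely for illustration, that the encoder adds exactly one extra token per target column; the name `<extra>` is a made-up placeholder, and the actual encoding used by the engine is not shown in this thread.

```python
from itertools import product

# Values defined for the two targets in the test data, as quoted in the review comment.
target_b = ["b1", "b2", "b3"]
target_c = ["c1", "c2", "c3"]

# Combinations of the values that actually appear in the test data.
observed = list(product(target_b, target_c))

# Assumption: one additional token per column, e.g. for unknown values.
EXTRA = "<extra>"
encoded = list(product(target_b + [EXTRA], target_c + [EXTRA]))

n_observed = len(observed)  # 3 * 3
n_encoded = len(encoded)    # 4 * 4 under the one-extra-token assumption
```

Under that assumption the exact check against 16 and the comment "4 × 4" would be consistent with each other, while the lower bound of 9 holds whether or not extra tokens are present.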

@mplatzer merged commit f1a884f into main on Dec 12, 2025
7 checks passed
@mplatzer deleted the fix-predict-proba-with-ctx branch on December 12, 2025 17:30
3 participants