Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas-friendly column names #443

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

remrama
Copy link
Contributor

@remrama remrama commented Oct 7, 2024

This PR addresses #208 (part of the v0.6.0 Roadmap layed out in #279). It removes characters that restrict column access to the bracket format (df["p_val"]) rather than dot(?) method (df.p_val). It includes the following column-name changes:

  1. Replaced dashes with underscores in column names (e.g., p-val --> p_val)
  2. Replaced parentheses with underscores in column names (e.g., mean(A) --> mean_A)
  3. Removed percent signs and brackets in column names (e.g., CI95% --> CI95, CI[97.5%] --> CI97.5)

Here is a specific list of most (if not all) of the changes:

  • p-adjust --> p_adjust
  • p-approx --> p_approx
  • p-corr --> p_corr
  • p-exact --> p_exact
  • p-mid --> p_mid
  • p-spher --> p_sphere
  • p-tukey --> p_tukey
  • p-unc --> p_unc
  • p-val --> p_val
  • U-val --> U_val
  • W-spher --> W_spher
  • W-val --> W_val
  • p-GG-corr --> p_GG_corr
  • cohen-d --> cohen_d
  • eta-square --> eta_square
  • odds-ratio --> odds_ratio
  • CI95% --> CI95
  • CI90% --> CI90
  • CI[2.5%] --> CI2.5
  • CI[97.5%] --> CI97.5
  • mean(A) --> mean_A
  • mean(B) --> mean_B
  • std(A) --> std_A
  • std(B) --> std_B

Future column-name considerations

In doing this, I noted some other column-name considerations for future PRs that feel less immediate:

  1. Standardizing the use of "val". Modules differentially name p-value columns as p, pval, or p_val. This should be standardized across modules.
  2. Related to (1), the norm for p-values might correspond with other "values", like the current column names T, F, W, r, etc. I think optimally this would either be pval and Wval/wval or p and W/w. Perhaps an argument for keeping "val" is that .T blocks the pandas transpose method.
  3. Standardize some effect size namings across modules. Cohen's d is differentially referred to as cohen or cohen_d, and eta-squared is differentially referred to as eta_square, eta-squared, or n2.
  4. Standardize the degrees of freedom column names. These are differentially named DF, dof, ddof.
  5. Standardize abbreviation capitalizations. Bayes factors are BF, confidence intervals are CI, Wilcoxon stats are W, then there's dof and r.
  6. Standardize Sentence case vs camelCase vs snake_case. Column capitalization, like Parametric vs alternative are inconsistent. I think snake_case is the pandas norm, but consistency seems most important.
  7. I think including the numerical values in confidence interval column names is not necessary. Eg, CI95 could just be CI, and, more importantly, CI97.5 and CI2.5 could just be CI_upper and CI_lower, respectively.

Notes

  • The jupyter notebooks should run fine with the new code, but I did not re-run them, so the output of those tutorials still have the old column names. Let me know if you want me to rerun them before merging.
  • I updated the docstrings as needed, but did not build them locally to test. Let me know if you want me to build and inspect before merging.

@raphaelvallat raphaelvallat self-requested a review October 8, 2024 06:13
@raphaelvallat
Copy link
Owner

raphaelvallat commented Oct 8, 2024

@remrama this is absolutely fantastic. Thank you for doing such a thorough investigation. My thoughts on what you described in the "Future column-name considerations":

  1. Overall, I think we should merge the current PR and then implement most, if not all of the changes that you describe, and then release a new 0.6.0 version. The other items in the roadmap feel less important and could be included as part of the 6.x minor version updates.
  2. Agreed for:
    • using *val throughout
    • use cohen_d and eta_squared (and replace np2 with partial_eta_squared?)
    • use ddof to be consistent with pandas and scipy
    • use snake_case throughout
    • use ci, ci_upper, ci_lower throughout
    • perhaps even worth adding a new documentation page with Pingouin's standard naming conventions...?

Laslty, yes to both your questions in the "Notes" section. :) I'll do a final review once you have re-generated the notebook. I had a brief look and I think it's good to go. I'm assuming you just did a search and replace of the relevant names in VSCode or similar editor, right?

@raphaelvallat raphaelvallat mentioned this pull request Oct 8, 2024
11 tasks
@remrama
Copy link
Contributor Author

remrama commented Oct 8, 2024

Great, thanks @raphaelvallat. I think the current PR is now ready for a quick review. I updated the notebooks and tested the docs locally (passed build and visual inspection). I'm a bit confused as to why the Python tests aren't running after a PR (Lint is, but not others). I ran tests locally and they passed, but I can't speak for alternate systems and Python versions.

And yes, correct, I used search/replace within VSCode. It was mostly iterations of regex to find all the things that needed replacing, and then more regex to simplify replacements. There were only a few snags, like the CI column namings and catching all the docstrings.

Happy to push for another PR that adds to the column-name conventions after this one :)

@raphaelvallat
Copy link
Owner

raphaelvallat commented Oct 8, 2024

Fix the Github Action CI:

Replace

name: Python tests

on:
  push:
    branches: [master, develop]
  pull_request:
    branches: [master, develop]

with:

name: Python tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

(Yes, I should have caught this before... 😬 )

@raphaelvallat
Copy link
Owner

I'll wait for the unit tests and then do a final review and approval. Hopefully we don't get failing unit tests from previous PRs 🤦 Feel free to re-enable the CI in a separate PR, which we'll merge before this one

* GH Action on main branch instead of master and develop

* bump actions/upload-artifact@v2 to v4

* install doc requirements from .toml

* bump actions/checkout@v2 --> v4

* bump actions/setup-python@v1 --> v5

* move pip-install-docs back to only run during docs build

* typo 3.8 --> 3.9

* bump codecov/codecov-action@v1 --> v4

* remove platform specification from docs-artifact
Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (main@c93ec4b). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/pingouin/correlation.py 80.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #443   +/-   ##
=======================================
  Coverage        ?   98.54%           
=======================================
  Files           ?       19           
  Lines           ?     3360           
  Branches        ?      492           
=======================================
  Hits            ?     3311           
  Misses          ?       26           
  Partials        ?       23           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@remrama
Copy link
Contributor Author

remrama commented Oct 9, 2024

@raphaelvallat all checks passed here, you should be good to review/merge. We should be good to go for the immediate column-name fixes, then we can set up a separate PR for more convention-specific namings.

I had a last-minute thought about this PR: It really doesn't impact at all how the user interfaces with Pingouin (only the output). But there is one exception with the compute_effsize function, where a few parameters won't work anymore. Passing in eta-square and odds-ratio now will not work, user needs to pass eta_square or odds_ratio. Maybe just leave it, but I was thinking we could add a simple "-".replace("_") and a FutureWarning. The only reason these were changed is because they end up as column names in some functions, so the other option would be to just name the columns something like effsize and not carry the actual stat name over.

Copy link
Owner

@raphaelvallat raphaelvallat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @remrama , approved!

I had a last-minute thought about this PR: It really doesn't impact at all how the user interfaces with Pingouin (only the output). But there is one exception with the compute_effsize function, where a few parameters won't work anymore. Passing in eta-square and odds-ratio now will not work, user needs to pass eta_square or odds_ratio. Maybe just leave it, but I was thinking we could add a simple "-".replace("_") and a FutureWarning. The only reason these were changed is because they end up as column names in some functions, so the other option would be to just name the columns something like effsize and not carry the actual stat name over.

I think let's not worry about the FutureWarning and replace. We should just leverage this major update to do a clean slate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants