Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use of subtyping instead of accessors #490

Merged
merged 10 commits into from
Aug 15, 2023

Conversation

bifbof
Copy link
Collaborator

@bifbof bifbof commented Jul 26, 2023

With this PR I want to discuss what I find the most confusing thing in trackintel and how I think we could improve it.
I used this a draft PR to nicely show potential changes.

tldr: I want to use classes instead of accessors while keeping accessors for backward compatibility.

Currently trackintel uses pandas accessors to enable all the features we provide. The accessors are the recommended way to extend pandas and quite easy to implement. Accessors are added as attributes to pandas DataFrame (e.g. pfs.as_positionfixes)

However, this approach has limitations that make its use unintuitive.

  • Often import trackintel as ti is imported and then ti isn't used anywhere in the code. Instead, as_xyz attributes are used on DataFrame objects. This is often confusing, as imported code is not expected to change anything elsewhere.
  • for as_xyz to work, the object must already follow the xyz model. To find a description of this model, one either uses a trial-and-error method or search the documentation, being aware that the description hides behind the xyzAccessor class.
  • Using 5 accessors (.as_positionfixes, .as_staypoints, .as_triplegs, .as_trips, .as_tours) is probably not the way how accessors should be used. Rather one (for example .geo) for the whole library. Additionally, this leads to bloat, since all methods have to be reimplemented for every accessor class.
  • Writing pfs.as_positionfixes.to_csv(*args, **kwargs) is like adding a type every time we want to do something with the pfs object. If pfs would of a type right away, we could easily write pfs.to_csv(*args, **kwargs) with less overhead.

If we want to solve these problems, I would propose the following way: let's implement the trackintel model as classes that subtype GeoDataFrame and make them backwards compatible by keeping the accessor. This will create the following class hierarchy.

classes-1690283849056-2

The orange arrows show the class returned by calling the pfs.as_positionfixes accessor. A call on Positionfixes for example returns self and a call on DataFrame returns a new Positionfixes object. TrackintelBase is a class that implements all common methods.

This hierarchy works quite well and the changes are relatively minimal, as you can see in the changes of the PR, which implements the subtyping for the Positionfixes class. All test passes while using the class instead of the pandas object. This allows to use the trackintel classes like GeoDataFrame just with some more methods. It also solves the aforementioned problems:

  • using ti.Positionfixes connects the import to the classes.
  • searching help on the ti.Positionfixes constructor is quite clear
  • having 5 classes is connected with how we explain the model. TrackintelBase avoids bloat as it implements shared code.
  • Using classes directly adds type information to the object.

Still, there are some disadvantages that need to be considered:

  • Caching: Accessors are cached. Thus a trackintel class accessed via an accessor is initiated only once. This can lead to a problem with the following code:

    pfs.as_positionfixes["user_id"] == 2  # initialised with .as_positionfixes
    pfs["user_id"] = 3                    # overwrite column "user_id"
    pfs.as_positionfixes["user_id"] == 3  # fails 

    This code fails because the accessor is initialized with a view of the old data, that does not update column overwrites and other changes. → if we remove caching the accessor call returns a new object each time, mirroring all changes. (This also solves obj.as_xyz does not necessarily validate data #476). But of course, this adds overhead due to multiple initializations.

  • subtyped methods: Some methods like to_csv clash now with the inherited DataFrame/GeoDataFrame method. For proper subtyping these methods need to take the same arguments like the super class method.

  • Subtyping GeoDataFrames: pandas nicely allows subtyping DataFrame, geopandas with GeoDataFrames not so much. In detail, in some methods, they override the __class__ attribute fix with GeoDataFrame thus deleting any subclassed type. An example where this happens would be this line. The fix is pretty simple (see line 130 in util.py), but now we depend on some implementation details of geopandas, which is not the most stable library in my opinion.

Further, I want to discuss a feature here namely the geopandas fallback idea, that is maybe a drawback. For the past year or so, GeoDataFrame upcast to DataFrame when it loses the geometry. I would like to use the same idea of this fallback for our classes. So a Positionfixes would fallback to GeoDataFrame if it isn't valid anymore, and further would fallback to DataFrame if it has no geometry.

But there are further complications with the Trips and Tours classes. There the geometry is optional, so we would have to create fallbacks to the DataFrame version of these classes. In total we would then have 3 groups of classes.

  1. classes with fallbacks to GeoDataFrame and DataFrame
  2. classes with fallbacks to GeoDataFrames, their DataFrame version, and a normal DataFrame
  3. classes with fallback to Dataframe

classes_groups

While these fallbacks might be not that hard to implement, users might find it confusing if suddenly the type of their object changes. Alternatively, we could block all fallbacks, raise an error, and require users to explicitly perform these casts. But keeping the fallbacks would ensure the object is always of the right type (that for example solves #451)

With these advantages and disadvantages what is your opinion on this possible change?
Is it useful? Is something unclear?What about the fallback stuff?

@bifbof bifbof marked this pull request as draft July 26, 2023 15:51
@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.19% 🎉

Comparison is base (68ac2b3) 92.51% compared to head (fe96f72) 92.71%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   92.51%   92.71%   +0.19%     
==========================================
  Files          40       40              
  Lines        2032     2086      +54     
  Branches      358      374      +16     
==========================================
+ Hits         1880     1934      +54     
  Misses        133      133              
  Partials       19       19              
Files Changed Coverage Δ
trackintel/__init__.py 100.00% <100.00%> (ø)
trackintel/io/postgis.py 96.07% <100.00%> (ø)
trackintel/model/__init__.py 100.00% <100.00%> (ø)
trackintel/model/positionfixes.py 100.00% <100.00%> (ø)
trackintel/model/util.py 100.00% <100.00%> (ø)
trackintel/visualization/positionfixes.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hongyeehh
Copy link
Member

In general, I vote for replacing the accessor way of using functions. I find it unintuitive and seldom use it in my code. The caching issue is fine, as we never depend on this behavior in our implementation. I did not fully understand the description for subtyped methods and Subtyping GeoDataFrames, and their consequences for our code. So I do not fully understand your changes to the code :). I guess the issue is related to the backward compatibility making the code complicated.

The fallback idea is fine IMO; it is intuitive that if pfs or locs do not have a required column, they shall fallback to geodataframe or dataframe, and users need to reconstruct the trackintel class (just like gpd.GeoDataFrame() initialization) if they recover the required columns. The trips and tours are fine too - users shall know what their data look like and which functions they shall use. Let me know if my understanding is incorrect.

At the moment, I think further discussions are needed with @henrymartin1 and @NinaWie (and maybe @abcnishant007 (?)) if the major changes make sense. And afterward, it would be great if @bifbof could provide more explanations on the changes, such that maintenance in the future will not be an issue.

@bifbof bifbof marked this pull request as ready for review August 11, 2023 11:27
@bifbof
Copy link
Collaborator Author

bifbof commented Aug 11, 2023

This PR is ready for review!
Additionally changes:

  • added _check method, that does the same as _validate but returns bool instead of raising an error.
  • replaced validate kwarg with validate_geometry kwarg.
  • added a lot of tests for all new methods

Copy link
Member

@henrymartin1 henrymartin1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

trackintel/visualization/positionfixes.py Show resolved Hide resolved
trackintel/io/postgis.py Show resolved Hide resolved
@hongyeehh hongyeehh merged commit e2bc973 into mie-lab:master Aug 15, 2023
7 checks passed
@bifbof bifbof deleted the pfs_to_class branch August 15, 2023 20:59
bifbof added a commit to bifbof/trackintel that referenced this pull request Aug 16, 2023
Here comes the next class :)
For more info see mie-lab#490 .
bifbof added a commit to bifbof/trackintel that referenced this pull request Aug 16, 2023
Here comes the next class :)
For more info see mie-lab#490 .
hongyeehh added a commit that referenced this pull request Aug 17, 2023
* ENH: enable subclassing for Triplegs

Here comes the next class :)
For more info see #490 .

* CLN: correct accessor property name

* TST: add test for property as_triplegs

---------

Co-authored-by: Ye <[email protected]>
hongyeehh added a commit that referenced this pull request Aug 17, 2023
* ENH: enable subclassing for Locations

Here comes the next class :)
For more info see #490 .

* TST: add test for as_locations property

---------

Co-authored-by: Ye <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants