ENH: Use of subtyping instead of accessors #490

bifbof · 2023-07-26T15:51:50Z

With this PR I want to discuss what I find the most confusing thing in trackintel and how I think we could improve it.
I used this a draft PR to nicely show potential changes.

tldr: I want to use classes instead of accessors while keeping accessors for backward compatibility.

Currently trackintel uses pandas accessors to enable all the features we provide. The accessors are the recommended way to extend pandas and quite easy to implement. Accessors are added as attributes to pandas DataFrame (e.g. pfs.as_positionfixes)

However, this approach has limitations that make its use unintuitive.

Often import trackintel as ti is imported and then ti isn't used anywhere in the code. Instead, as_xyz attributes are used on DataFrame objects. This is often confusing, as imported code is not expected to change anything elsewhere.
for as_xyz to work, the object must already follow the xyz model. To find a description of this model, one either uses a trial-and-error method or search the documentation, being aware that the description hides behind the xyzAccessor class.
Using 5 accessors (.as_positionfixes, .as_staypoints, .as_triplegs, .as_trips, .as_tours) is probably not the way how accessors should be used. Rather one (for example .geo) for the whole library. Additionally, this leads to bloat, since all methods have to be reimplemented for every accessor class.
Writing pfs.as_positionfixes.to_csv(*args, **kwargs) is like adding a type every time we want to do something with the pfs object. If pfs would of a type right away, we could easily write pfs.to_csv(*args, **kwargs) with less overhead.

If we want to solve these problems, I would propose the following way: let's implement the trackintel model as classes that subtype GeoDataFrame and make them backwards compatible by keeping the accessor. This will create the following class hierarchy.

The orange arrows show the class returned by calling the pfs.as_positionfixes accessor. A call on Positionfixes for example returns self and a call on DataFrame returns a new Positionfixes object. TrackintelBase is a class that implements all common methods.

This hierarchy works quite well and the changes are relatively minimal, as you can see in the changes of the PR, which implements the subtyping for the Positionfixes class. All test passes while using the class instead of the pandas object. This allows to use the trackintel classes like GeoDataFrame just with some more methods. It also solves the aforementioned problems:

using ti.Positionfixes connects the import to the classes.
searching help on the ti.Positionfixes constructor is quite clear
having 5 classes is connected with how we explain the model. TrackintelBase avoids bloat as it implements shared code.
Using classes directly adds type information to the object.

Still, there are some disadvantages that need to be considered:

Caching: Accessors are cached. Thus a trackintel class accessed via an accessor is initiated only once. This can lead to a problem with the following code:
```
pfs.as_positionfixes["user_id"] == 2  # initialised with .as_positionfixes
pfs["user_id"] = 3                    # overwrite column "user_id"
pfs.as_positionfixes["user_id"] == 3  # fails 
```
This code fails because the accessor is initialized with a view of the old data, that does not update column overwrites and other changes. → if we remove caching the accessor call returns a new object each time, mirroring all changes. (This also solves obj.as_xyz does not necessarily validate data #476). But of course, this adds overhead due to multiple initializations.
subtyped methods: Some methods like to_csv clash now with the inherited DataFrame/GeoDataFrame method. For proper subtyping these methods need to take the same arguments like the super class method.
Subtyping GeoDataFrames: pandas nicely allows subtyping DataFrame, geopandas with GeoDataFrames not so much. In detail, in some methods, they override the __class__ attribute fix with GeoDataFrame thus deleting any subclassed type. An example where this happens would be this line. The fix is pretty simple (see line 130 in util.py), but now we depend on some implementation details of geopandas, which is not the most stable library in my opinion.

Further, I want to discuss a feature here namely the geopandas fallback idea, that is maybe a drawback. For the past year or so, GeoDataFrame upcast to DataFrame when it loses the geometry. I would like to use the same idea of this fallback for our classes. So a Positionfixes would fallback to GeoDataFrame if it isn't valid anymore, and further would fallback to DataFrame if it has no geometry.

But there are further complications with the Trips and Tours classes. There the geometry is optional, so we would have to create fallbacks to the DataFrame version of these classes. In total we would then have 3 groups of classes.

classes with fallbacks to GeoDataFrame and DataFrame
classes with fallbacks to GeoDataFrames, their DataFrame version, and a normal DataFrame
classes with fallback to Dataframe

While these fallbacks might be not that hard to implement, users might find it confusing if suddenly the type of their object changes. Alternatively, we could block all fallbacks, raise an error, and require users to explicitly perform these casts. But keeping the fallbacks would ensure the object is always of the right type (that for example solves #451)

With these advantages and disadvantages what is your opinion on this possible change?
Is it useful? Is something unclear?What about the fallback stuff?

codecov · 2023-07-26T15:59:21Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.19% 🎉

Comparison is base (68ac2b3) 92.51% compared to head (fe96f72) 92.71%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   92.51%   92.71%   +0.19%     
==========================================
  Files          40       40              
  Lines        2032     2086      +54     
  Branches      358      374      +16     
==========================================
+ Hits         1880     1934      +54     
  Misses        133      133              
  Partials       19       19

Files Changed	Coverage Δ
trackintel/__init__.py	`100.00% <100.00%> (ø)`
trackintel/io/postgis.py	`96.07% <100.00%> (ø)`
trackintel/model/__init__.py	`100.00% <100.00%> (ø)`
trackintel/model/positionfixes.py	`100.00% <100.00%> (ø)`
trackintel/model/util.py	`100.00% <100.00%> (ø)`
trackintel/visualization/positionfixes.py	`100.00% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hongyeehh · 2023-07-31T14:28:16Z

In general, I vote for replacing the accessor way of using functions. I find it unintuitive and seldom use it in my code. The caching issue is fine, as we never depend on this behavior in our implementation. I did not fully understand the description for subtyped methods and Subtyping GeoDataFrames, and their consequences for our code. So I do not fully understand your changes to the code :). I guess the issue is related to the backward compatibility making the code complicated.

The fallback idea is fine IMO; it is intuitive that if pfs or locs do not have a required column, they shall fallback to geodataframe or dataframe, and users need to reconstruct the trackintel class (just like gpd.GeoDataFrame() initialization) if they recover the required columns. The trips and tours are fine too - users shall know what their data look like and which functions they shall use. Let me know if my understanding is incorrect.

At the moment, I think further discussions are needed with @henrymartin1 and @NinaWie (and maybe @abcnishant007 (?)) if the major changes make sense. And afterward, it would be great if @bifbof could provide more explanations on the changes, such that maintenance in the future will not be an issue.

bifbof · 2023-08-11T11:32:23Z

This PR is ready for review!
Additionally changes:

added _check method, that does the same as _validate but returns bool instead of raising an error.
replaced validate kwarg with validate_geometry kwarg.
added a lot of tests for all new methods

henrymartin1

Nice work!

trackintel/visualization/positionfixes.py

trackintel/io/postgis.py

Here comes the next class :) For more info see mie-lab#490 .

* ENH: enable subclassing for Triplegs Here comes the next class :) For more info see #490 . * CLN: correct accessor property name * TST: add test for property as_triplegs --------- Co-authored-by: Ye <[email protected]>

* ENH: enable subclassing for Locations Here comes the next class :) For more info see #490 . * TST: add test for as_locations property --------- Co-authored-by: Ye <[email protected]>

ENH: positionfixes to class

c3b7486

bifbof marked this pull request as draft July 26, 2023 15:51

CLN: clean up some comments

049b450

bifbof mentioned this pull request Aug 10, 2023

ENH: remove caching in DataFrame accessors #505

Merged

bifbof added 6 commits August 10, 2023 14:18

Merge branch 'master' into pfs_to_class

1a215b1

ENH: call correct methods in to_csv and plot directly

fd2d917

ENH: rename PositionfixesAccessor to Positionfixes

ba1779f

ENH: simplify validating of Positionfixes

6cebc44

TST: add tests for Positionfixes and TrackintelGeoDataFrame

cd3b350

CLN: call merge method and not merge function

91375a0

bifbof marked this pull request as ready for review August 11, 2023 11:27

hongyeehh requested review from henrymartin1, NinaWie and hongyeehh August 11, 2023 12:26

bifbof mentioned this pull request Aug 12, 2023

Move as_positionfixes.plot method into separate function #506

Closed

henrymartin1 approved these changes Aug 15, 2023

View reviewed changes

trackintel/visualization/positionfixes.py Show resolved Hide resolved

trackintel/io/postgis.py Show resolved Hide resolved

Merge branch 'master' into pfs_to_class

e544788

NinaWie approved these changes Aug 15, 2023

View reviewed changes

Merge branch 'master' into pfs_to_class

fe96f72

hongyeehh approved these changes Aug 15, 2023

View reviewed changes

hongyeehh merged commit e2bc973 into mie-lab:master Aug 15, 2023
7 checks passed

bifbof deleted the pfs_to_class branch August 15, 2023 20:59

bifbof mentioned this pull request Aug 16, 2023

ENH: enable subclassing for Staypoints #509

Merged

bifbof added a commit to bifbof/trackintel that referenced this pull request Aug 16, 2023

ENH: enable subclassing for Triplegs

99bfc1f

Here comes the next class :) For more info see mie-lab#490 .

bifbof mentioned this pull request Aug 16, 2023

ENH: enable subclassing for Triplegs #510

Merged

bifbof added a commit to bifbof/trackintel that referenced this pull request Aug 16, 2023

ENH: enable subclassing for Locations

2277f9a

Here comes the next class :) For more info see mie-lab#490 .

bifbof mentioned this pull request Aug 16, 2023

ENH: enable subclassing for Locations #511

Merged

hongyeehh mentioned this pull request Aug 26, 2023

trackintel write csv function allows writing non-trackintel format csv files #451

Open

bifbof mentioned this pull request Sep 18, 2023

ENH: enable subclassing for tours #530

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Use of subtyping instead of accessors #490

ENH: Use of subtyping instead of accessors #490

bifbof commented Jul 26, 2023 •

edited

Loading

codecov bot commented Jul 26, 2023 •

edited

Loading

hongyeehh commented Jul 31, 2023

bifbof commented Aug 11, 2023

henrymartin1 left a comment

ENH: Use of subtyping instead of accessors #490

ENH: Use of subtyping instead of accessors #490

Conversation

bifbof commented Jul 26, 2023 • edited Loading

codecov bot commented Jul 26, 2023 • edited Loading

Codecov Report

hongyeehh commented Jul 31, 2023

bifbof commented Aug 11, 2023

henrymartin1 left a comment

Choose a reason for hiding this comment

bifbof commented Jul 26, 2023 •

edited

Loading

codecov bot commented Jul 26, 2023 •

edited

Loading