ENH: Use of subtyping instead of accessors #490
Codecov Report
Coverage diff, master vs. #490:

Metric     master   #490     +/-
Coverage   92.51%   92.71%   +0.19%
Files      40       40
Lines      2032     2086     +54
Branches   358      374      +16
Hits       1880     1934     +54
Misses     133      133
Partials   19       19
In general, I vote for replacing the accessor way of using functions. I find it unintuitive and seldom use it in my code. The [...] The fallback idea is fine IMO; it is intuitive that if [...]
At the moment, I think further discussion is needed with @henrymartin1 and @NinaWie (and maybe @abcnishant007 (?)) on whether the major changes make sense. Afterward, it would be great if @bifbof could explain the changes in more detail, so that future maintenance will not be an issue.
This PR is ready for review!
Nice work!
Here comes the next class :) For more info see mie-lab#490 .
* ENH: enable subclassing for Triplegs
  Here comes the next class :) For more info see #490.
* CLN: correct accessor property name
* TST: add test for property as_triplegs
---------
Co-authored-by: Ye <[email protected]>

* ENH: enable subclassing for Locations
  Here comes the next class :) For more info see #490.
* TST: add test for as_locations property
---------
Co-authored-by: Ye <[email protected]>
With this PR I want to discuss what I find the most confusing thing in trackintel and how I think we could improve it.
I used a draft PR to show the potential changes clearly.
tldr: I want to use classes instead of accessors while keeping accessors for backward compatibility.
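As a rough illustration of the tldr, here is a minimal sketch and not the code of this PR. It assumes a plain pandas `DataFrame` as the parent class (the proposal itself targets `GeoDataFrame`), a made-up required column set `REQUIRED_COLUMNS`, and a made-up shared helper `n_users`; only `register_dataframe_accessor` and the `_constructor` hook are actual pandas features.

```python
import pandas as pd

# Hypothetical minimal column set; the real trackintel model requires more.
REQUIRED_COLUMNS = {"user_id", "tracked_at"}


class TrackintelBase:
    """Mixin holding code shared by all model classes (illustrative only)."""

    def n_users(self):
        # Hypothetical shared helper, not part of trackintel.
        return self["user_id"].nunique()


class Positionfixes(TrackintelBase, pd.DataFrame):
    """DataFrame subclass standing in for the proposed GeoDataFrame subclass."""

    @property
    def _constructor(self):
        # pandas hook: operations that build a new frame keep the subclass.
        return Positionfixes

    @property
    def as_positionfixes(self):
        # Backward compatibility: on the class itself the "accessor" is a no-op.
        return self


@pd.api.extensions.register_dataframe_accessor("as_positionfixes")
class PositionfixesAccessor:
    """Accessor on plain DataFrames that validates and returns the subclass."""

    def __new__(cls, obj):
        missing = REQUIRED_COLUMNS - set(obj.columns)
        if missing:
            raise AttributeError(f"Missing columns for positionfixes: {sorted(missing)}")
        return Positionfixes(obj)


df = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "tracked_at": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    }
)
pfs = df.as_positionfixes          # old accessor style still works
subset = pfs[pfs["user_id"] == 1]  # stays a Positionfixes thanks to _constructor
```

With this layout, existing code that calls `.as_positionfixes` keeps working, while new code can construct `Positionfixes(...)` directly and call shared methods on it without going through the accessor.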
Currently, `trackintel` uses pandas accessors to enable all the features we provide. Accessors are the recommended way to extend pandas and are quite easy to implement. They are added as attributes to a pandas `DataFrame` (e.g. `pfs.as_positionfixes`). However, this approach has limitations that make its use unintuitive.
* `import trackintel as ti` is imported and then `ti` isn't used anywhere in the code. Instead, `as_xyz` attributes are used on `DataFrame` objects. This is often confusing, as an import is not expected to change objects defined elsewhere.
* For `as_xyz` to work, the object must already follow the `xyz` model. To find a description of this model, one either resorts to trial and error or searches the documentation, knowing that the description hides behind the `xyzAccessor` class.
* Having one accessor per model (`.as_positionfixes`, `.as_staypoints`, `.as_triplegs`, `.as_trips`, `.as_tours`) is probably not how accessors are meant to be used. Rather, there should be a single one (for example `.geo`) for the whole library. Additionally, this leads to bloat, since all methods have to be reimplemented for every accessor class.
* `pfs.as_positionfixes.to_csv(*args, **kwargs)` is like adding a type every time we want to do something with the `pfs` object. If `pfs` had that type right away, we could simply write `pfs.to_csv(*args, **kwargs)` with less overhead.

If we want to solve these problems, I would propose the following: let's implement the trackintel model as classes that subtype `GeoDataFrame` and make them backwards compatible by keeping the accessor. This creates the following class hierarchy.

The orange arrows show the class returned by calling the `pfs.as_positionfixes` accessor. A call on `Positionfixes`, for example, returns `self`, and a call on `DataFrame` returns a new `Positionfixes` object. `TrackintelBase` is a class that implements all common methods.

This hierarchy works quite well and the changes are relatively minimal, as you can see in the changes of the PR, which implements the subtyping for the `Positionfixes` class. All tests pass when using the class instead of the pandas object. This allows the trackintel classes to be used like a `GeoDataFrame`, just with some more methods. It also solves the aforementioned problems:

* `ti.Positionfixes` connects the import to the classes.
* The `ti.Positionfixes` constructor is quite clear.
* `TrackintelBase` avoids bloat as it implements shared code.

Still, there are some disadvantages that need to be considered:
* Caching: Accessors are cached. Thus a trackintel class accessed via an accessor is instantiated only once. This can lead to a problem with the following code:

  This code fails because the accessor is initialized with a view of the old data, which does not reflect column overwrites and other changes. → If we remove caching, the accessor call returns a new object each time, mirroring all changes. (This also solves "`obj.as_xyz` does not necessarily validate data", #476.) But of course, this adds overhead due to multiple initializations.
* Subtyped methods: Some methods like `to_csv` now clash with the inherited `DataFrame`/`GeoDataFrame` method. For proper subtyping, these methods need to take the same arguments as the superclass method.
* Subtyping `GeoDataFrame`: pandas nicely allows subtyping `DataFrame`; geopandas with `GeoDataFrame`, not so much. Specifically, some geopandas methods hard-code the `__class__` attribute to `GeoDataFrame`, thus discarding any subclass type. An example where this happens would be this line. The fix is pretty simple (see line 130 in `util.py`), but now we depend on some implementation details of geopandas, which is not the most stable library in my opinion.

Further, I want to discuss a feature here that is maybe a drawback, namely the geopandas fallback idea. For the past year or so, a `GeoDataFrame` has been upcast to a `DataFrame` when it loses its geometry. I would like to use the same fallback idea for our classes. So a `Positionfixes` would fall back to `GeoDataFrame` if it isn't valid anymore, and would further fall back to `DataFrame` if it has no geometry.

But there are further complications with the `Trips` and `Tours` classes. There the geometry is optional, so we would have to create fallbacks to the `DataFrame` version of these classes. In total, we would then have 3 groups of classes:

* `GeoDataFrame` and `DataFrame`
* `GeoDataFrame`s, their `DataFrame` version, and a normal `DataFrame`
* `DataFrame`

While these fallbacks might not be that hard to implement, users might find it confusing if the type of their object suddenly changes. Alternatively, we could block all fallbacks, raise an error, and require users to perform these casts explicitly. But keeping the fallbacks would ensure the object is always of the right type (which, for example, solves #451).
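To make the fallback idea more concrete, here is a minimal sketch under stated assumptions, not the implementation in this PR. It uses a plain pandas `DataFrame` as the parent (so only one fallback level instead of `GeoDataFrame` and then `DataFrame`), a made-up required column set `REQUIRED_COLUMNS`, and a hypothetical helper `_positionfixes_or_dataframe`. It relies on pandas calling `_constructor` whenever an operation builds a new frame, which is the same hook geopandas uses for its own fallback.

```python
import pandas as pd

# Hypothetical minimal column set; the real trackintel model requires more.
REQUIRED_COLUMNS = {"user_id", "tracked_at"}


def _positionfixes_or_dataframe(*args, **kwargs):
    """Hypothetical constructor: keep the subclass only while the data is valid."""
    df = pd.DataFrame(*args, **kwargs)
    if REQUIRED_COLUMNS.issubset(df.columns):
        return Positionfixes(df)
    # Fallback: the result no longer follows the model, so return a plain frame.
    return df


class Positionfixes(pd.DataFrame):
    """DataFrame subclass standing in for the proposed GeoDataFrame subclass."""

    @property
    def _constructor(self):
        # pandas calls this for results of operations such as drop() or slicing.
        return _positionfixes_or_dataframe


pfs = Positionfixes(
    {
        "user_id": [1, 2],
        "tracked_at": pd.to_datetime(["2023-01-01", "2023-01-02"]),
        "accuracy": [5.0, 7.5],
    }
)
kept = pfs.drop(columns=["accuracy"])    # still follows the model
lost = pfs.drop(columns=["tracked_at"])  # required column gone, falls back
print(type(kept).__name__, type(lost).__name__)
```

The alternative mentioned above, blocking the fallback, would amount to raising an error in the helper instead of returning the plain frame.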
With these advantages and disadvantages what is your opinion on this possible change?
Is it useful? Is something unclear? What about the fallback stuff?