Dataset size #145
Comments
I think this question comes a little early for RedAmber. However, it is important for users to know what scale of data it can handle and what features it has compared to other data frames. Thanks!
1. Data size
RedAmber is an in-memory, single-threaded, non-streaming, eager-execution data frame written in Ruby (a dynamic language), so I expect its capacity to be limited. Still, I am trying to find out how large a dataset it can handle using https://github.com/h2oai/db-benchmark. Please let me know if you have a better dataset for checking scalability. (The benchmark is written in R and is not convenient to use.)
2. Possible operations
The references you gave me are helpful. I would like to make a comparison chart.
By the way, I think the data frame library that RedAmber should be compared with most closely is Polars. What do you think?
Polars seems to use threads. A comparison chart would be helpful. Perhaps it could indicate the features you wish to add, and possibly compare with other data frame implementations. Arrow has Flight (https://github.com/apache/arrow/tree/master/ruby/red-arrow-flight), and UCX can run on distributed memory, so larger datasets might be possible.
RedAmber could be added to db-benchmark (h2oai/db-benchmark#250); after that we can look for larger datasets.
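Because an in-memory, eager data frame has to hold the whole table (plus intermediate results) in RAM, a back-of-the-envelope estimate can help decide which dataset sizes are worth trying first. The following is only a rough sketch in plain Ruby: the schema, the byte widths, the 16 GiB machine, and the headroom factor of 3 are all illustrative assumptions, and `estimated_bytes` and `fits_in_memory?` are hypothetical helpers, not part of RedAmber or db-benchmark.

```ruby
# Rough estimate of the RAM needed to hold a table fully in memory.
# All numbers here are illustrative assumptions, not measurements.

# Bytes per value for a few fixed-width column types.
BYTES_PER_VALUE = { int32: 4, int64: 8, float64: 8 }.freeze

# Estimated size in bytes of the raw column buffers for n_rows rows.
def estimated_bytes(n_rows, column_types)
  column_types.sum { |type| n_rows * BYTES_PER_VALUE.fetch(type) }
end

# Eager operations materialize intermediate results, so leave headroom.
# The default factor of 3 is a guess, not a measured value.
def fits_in_memory?(n_rows, column_types, ram_bytes, headroom: 3)
  estimated_bytes(n_rows, column_types) * headroom <= ram_bytes
end

columns = [:int32, :int32, :int32, :int64, :float64] # hypothetical schema
ram = 16 * 1024**3                                   # assumed 16 GiB machine

[10**7, 10**8, 10**9].each do |n_rows|
  gib = estimated_bytes(n_rows, columns) / (1024.0**3)
  verdict = fits_in_memory?(n_rows, columns, ram) ? "fits" : "too large"
  puts format("%.0e rows: %.2f GiB (%s)", n_rows, gib, verdict)
end
```

String columns and per-object overhead are ignored here, so real memory use will be higher; an actual benchmark run is still needed to find the practical limit.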
Comparing features between RedAmber, dplyr/tidyr and pandas
This is a comparison of basic features between RedAmber and other major DataFrame libraries. It compares only the method 'verbs', ignoring parameters and options.
Remarks:
Comments or suggestions are welcome!
Select columns (variables)
Select rows (records, observations)
Update columns / create new columns
Reshape dataframe
Grouping
Combine dataframes or tables
This is helpful, thanks. You may also want to compare with Julia, where such a comparison is part of the documentation.
I can create a pull request with this if it is of interest.
Yes. It would be nice if this were part of the documentation in the source tree. I can accept requests for modifications.
Discussed on the Arrow mailing list: https://github.com/ava6969/panda-arrow.git
It may be helpful to indicate the size of datasets that can be used with RedAmber and what operations will be supported.
For a comparison with other dataframes, see Table 3 in Towards Scalable Dataframe Systems and
https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray