This repository aims to reproduce, in Scala, the Python examples found in data-science-from-scratch by Joel Grus (see the book itself: 2nd edition from 2019).
Not all chapters have been ported as I'm going through them in the order I need them most. It all started with me having to cluster some data for a project.
- Introduction
- A Crash Course in Python
- Visualizing Data
- Linear Algebra
- Statistics
- Probability
- Hypothesis and Inference
- Gradient Descent
- Getting Data
- Working With Data
- Machine Learning
- k-Nearest Neighbors
- Naive Bayes
- Simple Linear Regression
- Multiple Regression
- Logistic Regression
- Decision Trees
- Neural Networks
- Deep Learning
- Clustering
- Natural Language Processing
- Network Analysis
- Recommender Systems
- Databases and SQL
- MapReduce
- Data Ethics
- Go Forth And Do Data Science
You can run the driver to exercise all classes with the following:
export SBT_OPTS="-Xmx8G -Xms8G"; sbt "runMain scratchscala.Driver"
Note that Python's matplotlib must be available system-wide (see installation).
The code uses the amazing ScalaPy library developed by Shadaj Laddad, which offers excellent interoperability between Python and Scala.
It's notably used to call matplotlib in the same way the book does, since reproducing a plotting library is not the point of the book (and finding exact equivalents is probably not a trivial task).
The code uses Scala 2.13, despite the release of the outstanding Scala 3.0, in part because ScalaPy does not support it yet, but also because it represents a major change (for the better, but a disruptive change nonetheless).
The code makes use of some utilities and extension methods for improved readability. Many of these can also be found in my own utilities library Aptus (not included here), as they correspond to features often missed in the standard library.
The scope of the book does not allow comparing the two languages in general, and neither does the code in this repository. My personal take on "static vs dynamic" is best articulated in this article by Robert Harper. The Scala vs Python debate specifically is much more involved since their respective ecosystems matter tremendously.
A few things do stand out from the current repo:
- Python offers some nice abstractions, such as collections.Counter, that Scala would benefit from porting.
- Python oddly misses a built-in group-by mechanism, at least when looking for something like this method
- Scala has a great group-by mechanism but misses group-by-key, count-by and ListMap counterparts (ListMap preserves insertion order)
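To illustrate the gaps listed above, here is a minimal sketch of count-by and group-by-key helpers built on the Scala 2.13 standard library. The object and method names (GroupingSketch, count_by, group_by_key, group_by_key_ordered) are hypothetical: they are not taken from this repository or from Aptus, and only show one possible shape for such helpers.

```scala
import scala.collection.immutable.ListMap

// Hypothetical helpers, for illustration only.
object GroupingSketch {

  // Counts elements per key, similar in spirit to Python's collections.Counter.
  def count_by[A, K](values: Seq[A])(key: A => K): Map[K, Int] =
    values.groupMapReduce(key)(_ => 1)(_ + _)

  // Groups the values of (key, value) pairs under their key ("group-by-key").
  def group_by_key[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupMap(_._1)(_._2)

  // Same grouping, but keys are kept in order of first appearance via ListMap.
  def group_by_key_ordered[K, V](pairs: Seq[(K, V)]): ListMap[K, Seq[V]] = {
    val grouped = pairs.groupMap(_._1)(_._2)
    val keysInOrder = pairs.map(_._1).distinct // distinct keeps first occurrences
    ListMap.from(keysInOrder.map(k => k -> grouped(k)))
  }
}
```

For example, GroupingSketch.count_by(Seq("a", "b", "a"))(identity) counts each distinct string, much like Counter(["a", "b", "a"]) would in Python.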
I opted to preserve the snake case used in the book, which is not idiomatic in Scala. For instance, I use vector_means instead of vectorMeans. I did not follow this rule for class and file names. I just couldn't.
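As a small sketch of what this convention looks like in practice, the snippet below keeps a snake-case method name inside a camel-case object. The object name SnakeCaseSketch and the implementation are assumptions made for illustration, not the code from this repository.

```scala
// Hypothetical example: camel case for the object, snake case for the method.
object SnakeCaseSketch {

  // Element-wise mean of a non-empty collection of equally sized vectors.
  def vector_means(vectors: Seq[Seq[Double]]): Seq[Double] = {
    require(vectors.nonEmpty, "no vectors provided")
    val n = vectors.size
    // transpose turns rows into columns, so each column can be averaged
    vectors.transpose.map(column => column.sum / n)
  }
}
```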
I tried as much as possible to maintain the original style, and not necessarily shorten/expand code blocks, so long as the resulting Scala code looked sufficiently idiomatic.
- I am also in the process of rewriting some examples in more idiomatic Scala, or at least in a format that I find easier to comprehend. I will port those soon as well.
- Some of the code examples need a memory boost to run to completion, e.g. KNearestNeighbor needs -Xmx8g -Xms8g
- I have found at least two similar efforts using alternative technologies: