GitHub - anthony-cros/data-science-from-scratch-scala: Reproduces in Scala the Python examples found in data-science-from-scratch

Introduction

This repository aims to reproduce in Scala the Python examples found in data-science-from-scratch by Joel Grus (see the book itself: 2nd edition from 2019).

Table of Contents

Introduction
A Crash Course in Python
Visualizing Data
Linear Algebra
Statistics
Probability
Hypothesis and Inference
Gradient Descent
Getting Data
Working With Data
Machine Learning
k-Nearest Neighbors
Naive Bayes
Simple Linear Regression
Multiple Regression
Logistic Regression
Decision Trees
Neural Networks
Deep Learning
Clustering
Natural Language Processing
Network Analysis
Recommender Systems
Databases and SQL
MapReduce
Data Ethics
Go Forth And Do Data Science

Running

You can run the driver to exercise all classes with the following:

export SBT_OPTS="-Xmx8G -Xms8G"; sbt "runMain scratchscala.Driver"

Note that Python's matplotlib must be available system-wide (see installation).

Python integration

The code uses the amazing ScalaPy library developed by Shadaj Laddad, which offers excellent interoperability between Python and Scala.

It's notably used to call matplotlib in the same way the book does, since reproducing a plotting library is not the point of the book (and finding exact equivalents is probably not a trivial task).
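As a rough illustration of what such a call can look like, here is a minimal sketch. It assumes ScalaPy 0.5.x (the me.shadaj.scalapy package) is on the classpath and that matplotlib is importable by the Python interpreter ScalaPy links against. The object name PlotSketch, the data, and the output file name are made up for this example and do not come from the repository.

```scala
import me.shadaj.scalapy.py
import me.shadaj.scalapy.py.SeqConverters

// Hypothetical example object, not part of the repository
object PlotSketch {
  def main(args: Array[String]): Unit = {
    // Load matplotlib.pyplot through the embedded Python interpreter
    val plt = py.module("matplotlib.pyplot")

    // Illustrative data only (not taken from the book)
    val xs = Seq(1, 2, 3, 4)
    val ys = xs.map(x => x * x)

    // toPythonCopy converts each Scala Seq into a Python list
    plt.plot(xs.toPythonCopy, ys.toPythonCopy)
    plt.title("squares")

    // Write to a file rather than opening a window (assumed file name)
    plt.savefig("squares.png")
  }
}
```

Calls on plt are resolved dynamically, so a misspelled function name only fails at runtime, much as it would in Python.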

Scala version

The code uses Scala 2.13, despite the release of the outstanding Scala 3.0, in part because ScalaPy does not support it yet, but also because it represents a major change (for the better, but a disruptive change nonetheless).

Utilities

The code makes use of some utilities and extension methods for improved readability. Many of these can also be found in my own utilities library Aptus (not included here), as they correspond to features often missed in the standard library.
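For readers unfamiliar with the pattern, extension methods in Scala 2.13 are typically written as implicit classes. The sketch below only illustrates that pattern; the names UtilitiesSketch, count_by, and mean are hypothetical and are not claimed to match the repository's utilities or Aptus.

```scala
// Hypothetical utilities, shown only to illustrate the extension-method pattern
object UtilitiesSketch {

  // Adds count_by to any Seq: counts elements per key computed by f
  implicit class SeqOps[A](private val xs: Seq[A]) extends AnyVal {
    def count_by[K](f: A => K): Map[K, Int] =
      xs.groupMapReduce(f)(_ => 1)(_ + _)
  }

  // Adds mean to sequences of doubles (fails on an empty sequence)
  implicit class DoubleSeqOps(private val xs: Seq[Double]) extends AnyVal {
    def mean: Double = {
      require(xs.nonEmpty, "mean of an empty sequence is undefined")
      xs.sum / xs.size
    }
  }
}
```

After `import UtilitiesSketch._`, a call such as `Seq("apple", "avocado", "banana").count_by(_.head)` reads much like the corresponding Python one-liner.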

Python vs Scala

The scope of the book does not allow comparing the two languages in general, and neither does the code in this repository. My personal take on "static vs dynamic" is best articulated in this article by Robert Harper. The Scala vs Python debate specifically is much more involved since their respective ecosystems matter tremendously.

A few things do stand out from the current repo:

Python offers some nice abstractions such as collections.Counter that Scala would benefit from porting.
Python oddly lacks a built-in group-by mechanism, at least when looking for something like this method.
Scala has a great group-by mechanism but lacks group-by-key, count-by, and ListMap counterparts (ListMap preserves insertion order).
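To make the points above concrete, here is a minimal sketch of how these helpers can be approximated with the Scala 2.13 standard library alone. All names (GroupingSketch, counter, most_common, group_by_key, group_by_key_ordered) are hypothetical and chosen for illustration; they are not the repository's actual helpers.

```scala
import scala.collection.immutable.ListMap
import scala.collection.mutable

// Hypothetical helpers, standard library only
object GroupingSketch {

  // Rough analogue of Python's collections.Counter: occurrences per element
  def counter[A](xs: Iterable[A]): Map[A, Int] =
    xs.groupMapReduce(identity)(_ => 1)(_ + _)

  // Rough analogue of Counter.most_common(n): the n highest counts, descending
  def most_common[A](counts: Map[A, Int], n: Int): Seq[(A, Int)] =
    counts.toSeq.sortBy { case (_, count) => -count }.take(n)

  // Group-by-key: collect the values of (key, value) pairs under each key
  def group_by_key[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupMap(_._1)(_._2)

  // Same, but keys come back in order of first appearance.
  // A LinkedHashMap accumulates the groups, then the result is copied to a ListMap.
  def group_by_key_ordered[K, V](pairs: Seq[(K, V)]): ListMap[K, Vector[V]] = {
    val acc = mutable.LinkedHashMap.empty[K, Vector[V]]
    pairs.foreach { case (k, v) =>
      acc.update(k, acc.getOrElse(k, Vector.empty[V]) :+ v)
    }
    ListMap.from(acc)
  }
}
```

The plain Map returned by groupBy and groupMap makes no ordering promise, which is why the ordered variant builds its result separately.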

Style

Case

I opted to preserve the snake case used in the book, which is not idiomatic in Scala. For instance I use vector_means instead of vectorMeans. I did not follow this rule for class and file names. I just couldn't.
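A short sketch of what this convention looks like in practice. The body of vector_means below is an assumption (an element-wise mean over same-length vectors) written for illustration, and the enclosing object name LinearAlgebraSketch is hypothetical; neither is copied from the repository.

```scala
// Object name in the usual Scala PascalCase; method name in the book's snake case
object LinearAlgebraSketch {
  type Vector = Seq[Double]

  // Element-wise mean of a non-empty collection of same-length vectors
  def vector_means(vectors: Seq[Vector]): Vector = {
    require(vectors.nonEmpty, "no vectors provided")
    val size = vectors.head.size
    require(vectors.forall(_.size == size), "vectors must have the same size")
    // transpose groups the i-th components together, then each group is averaged
    vectors.transpose.map(_.sum / vectors.size)
  }
}
```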

Code size

I tried as much as possible to maintain the original style, and not necessarily shorten/expand code blocks, so long as the resulting Scala code looked sufficiently idiomatic.

Miscellaneous

I am also in the process of rewriting some examples in more idiomatic Scala, or at least in a format that I find easier to comprehend. I will port those soon as well.
Some of the code examples need a memory boost to run to completion, e.g. KNearestNeighbor needs -Xmx8g -Xms8g
I have found at least two similar efforts using alternative technologies:
- in Swift
- in PyRx (quite bare)
