Reproduces in Scala the Python examples found in data-science-from-scratch

anthony-cros/data-science-from-scratch-scala

Introduction

This repository aims to reproduce in Scala the Python examples found in data-science-from-scratch by Joel Grus (see the book itself: 2nd edition, 2019).

Table of Contents

Not all chapters have been ported, as I am working through them in the order I need them most. It all started when I had to cluster some data for a project.

  1. Introduction
  2. A Crash Course in Python
  3. Visualizing Data
  4. Linear Algebra
  5. Statistics
  6. Probability
  7. Hypothesis and Inference
  8. Gradient Descent
  9. Getting Data
  10. Working With Data
  11. Machine Learning
  12. k-Nearest Neighbors
  13. Naive Bayes
  14. Simple Linear Regression
  15. Multiple Regression
  16. Logistic Regression
  17. Decision Trees
  18. Neural Networks
  19. Deep Learning
  20. Clustering
  21. Natural Language Processing
  22. Network Analysis
  23. Recommender Systems
  24. Databases and SQL
  25. MapReduce
  26. Data Ethics
  27. Go Forth And Do Data Science

Running

You can run the driver to exercise all classes with the following:

export SBT_OPTS="-Xmx8G -Xms8G"; sbt "runMain scratchscala.Driver"

Note that Python's matplotlib must be available system-wide (see installation).

Python integration

The code uses the amazing ScalaPy library developed by Shadaj Laddad, which offers excellent interoperability between Python and Scala.

It is notably used to call matplotlib in the same way the book does, since reproducing a plotting library is not the point of the book (and finding exact equivalents is probably not a trivial task).

Scala version

The code uses Scala 2.13, despite the release of the outstanding Scala 3.0. This is partly because ScalaPy does not support Scala 3 yet, and partly because Scala 3 represents a major change (for the better, but a disruptive change nonetheless).

Utilities

The code makes use of some utilities and extension methods for improved readability. Many of these can also be found in my own utilities library Aptus (not included here), as they correspond to features often missed in the standard library.

Python vs Scala

The scope of the book does not allow comparing the two languages in general, and neither does the code in this repository. My personal take on "static vs dynamic" is best articulated in this article by Robert Harper. The Scala vs Python debate specifically is much more involved since their respective ecosystems matter tremendously.

A few things do stand out from the current repo:

  • Python offers some nice abstractions, such as collections.Counter, that Scala would benefit from porting.
  • Python oddly lacks a built-in group-by mechanism, at least when looking for something like this method.
  • Scala has a great group-by mechanism but lacks group-by-key, count-by, and ListMap counterparts (ListMap preserves insertion order).
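
The gaps listed above can be approximated with the Scala 2.13 standard library alone. The following is a minimal sketch, not code from this repository: the names CollectionSketch, countBy, groupByKey, and orderedCountBy are hypothetical helpers introduced only for illustration.

```scala
import scala.collection.immutable.ListMap

// Hypothetical helpers (not part of this repository or of the standard library).
object CollectionSketch {

  // Counter-like helper: number of occurrences per key.
  def countBy[A, K](xs: Iterable[A])(key: A => K): Map[K, Int] =
    xs.groupMapReduce(key)(_ => 1)(_ + _)

  // Group-by-key: collects the values of (key, value) pairs under their key,
  // keeping the values in their original order.
  def groupByKey[K, V](pairs: Iterable[(K, V)]): Map[K, Seq[V]] =
    pairs.groupMap(_._1)(_._2).view.mapValues(_.toSeq).toMap

  // Same counts as countBy, but keys are listed in order of first appearance.
  def orderedCountBy[A, K](xs: Iterable[A])(key: A => K): ListMap[K, Int] = {
    val counts = countBy(xs)(key)
    ListMap.from(xs.iterator.map(key).distinct.map(k => k -> counts(k)))
  }
}
```

With these definitions, CollectionSketch.countBy(words)(identity) plays roughly the role of Python's collections.Counter(words), minus Counter's extra methods such as most_common.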

Style

Case

I opted to preserve the snake case used in the book, which is not idiomatic in Scala. For instance, I use vector_means instead of vectorMeans. I did not follow this rule for class and file names. I just couldn't.
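
To illustrate the convention, here is a hypothetical sketch of a snake-case method inside a conventionally named object. The object name, signature, and body are assumptions made for this example, not code taken from this repository.

```scala
// The object name follows the usual Scala convention (hypothetical name).
object LinearAlgebraSketch {
  type Vector = Seq[Double]

  // The method name keeps the book's snake case rather than `vectorMeans`.
  // Computes the component-wise mean of equally sized vectors.
  def vector_means(vectors: Seq[Vector]): Vector = {
    require(vectors.nonEmpty, "no vectors provided")
    require(vectors.forall(_.size == vectors.head.size), "vectors must have the same size")
    vectors.transpose.map(component => component.sum / vectors.size)
  }
}
```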

Code size

I tried as much as possible to maintain the original style, without necessarily shortening or expanding code blocks, so long as the resulting Scala code looked sufficiently idiomatic.

Miscellaneous

  • I am also in the process of rewriting some examples in more idiomatic Scala, or at least in a format that I find easier to comprehend. I will port those soon as well.
  • Some of the code examples need a memory boost to run to completion, e.g. KNearestNeighbor needs -Xmx8g -Xms8g.
  • I have found at least two similar efforts using alternative technologies:
