dropbox-demo

An application accompanying my talk about structuring functional applications in Scala.

Run application

Backend

Prerequisites: sbt, Elasticsearch running on localhost:9200, and a runnable tesseract binary available on the PATH.

Elasticsearch is available in the attached docker-compose setup. tesseract is available if you enter the attached nix-shell. The shell will also load environment variables from the env.sh file, if you have one (it's ignored in git).

Once you have these, run sbt run. At the time of writing, the application starts on 0.0.0.0:4000; this can be configured with the HTTP_HOST/HTTP_PORT environment variables.
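
For illustration, here's a minimal sketch of how that environment-based configuration can be expressed with ciris (the HttpConfig name, fields and defaults are assumptions for this README, not necessarily the project's actual code):

import cats.effect.IO
import cats.syntax.all._
import ciris._

// Assumed config shape - the real application may model this differently.
final case class HttpConfig(host: String, port: Int)

val httpConfig: ConfigValue[Effect, HttpConfig] =
  (
    env("HTTP_HOST").default("0.0.0.0"),
    env("HTTP_PORT").as[Int].default(4000)
  ).parMapN(HttpConfig(_, _))

// Loaded once at startup:
val loaded: IO[HttpConfig] = httpConfig.load[IO]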

Frontend

Prerequisites: Node 14.x, npm (provided in a nix-shell in the frontend directory).

cd frontend
npm start

Project goals

Search images from some storage by the text on them.

I have tens of thousands of images to search, and the OCR (optical character recognition) process takes ~0.5 seconds per image for relatively small images, so live decoding is a no-no. Instead, we will allow the user to index a path from the store, and later search the database populated in that process.

The indexing will happen in the background, without the user having to wait for it to complete before getting a response. We'll run the whole process in constant memory (although this is partly up to Tesseract - I haven't investigated its memory usage on large images yet), so both downloading the list of images to index and downloading their actual bytes are done in a streaming fashion, thanks to fs2.
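
As a rough sketch of that streaming shape (listFiles, download, ocr and index below are hypothetical placeholders, not the project's actual APIs), the pipeline can look like this:

import cats.effect.IO
import fs2.Stream

final case class FileMetadata(path: String)
final case class Document(path: String, content: String)

// Hypothetical pipeline: list files, OCR each one with bounded parallelism, push results to the index.
// Nothing is accumulated in memory - every stage works on one element/chunk at a time.
def indexAll(
    listFiles: String => Stream[IO, FileMetadata], // directory listing, streamed
    download: FileMetadata => Stream[IO, Byte],    // file contents, streamed
    ocr: Stream[IO, Byte] => IO[String],           // OCR over a stream of bytes
    index: Document => IO[Unit]                    // write a single document to the search index
)(path: String): IO[Unit] =
  listFiles(path)
    .parEvalMap(4) { meta => // run a handful of OCRs concurrently, but no more
      ocr(download(meta)).map(Document(meta.path, _))
    }
    .evalMap(index)
    .compile
    .drain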

Infrastructure

At the time of writing, this only runs on a local machine: Tesseract is provided via the nix shell, Open Distro runs in Docker, and the application can be started using bloop.

Tech stack

The backend is built in Scala (obviously - that was the point of the talk), using the following libraries:

  • Cats Effect 3, for several things, such as monadic composition of asynchronous tasks (e.g. Elasticsearch client) and interop with other libraries from the ecosystem
  • fs2 - for streaming data, so that we can run the indexing process in constant memory, as long as the OCR implementation can do so
  • http4s - for the HTTP server, as well as a custom client for Dropbox
  • ciris - for compositionally loading configuration
  • circe - for decoding/encoding JSON
  • log4cats - for logging
  • Elasticsearch high-level Java client - for talking to Elasticsearch. Normally you could use something like elastic4s, but I only needed a subset of its functionality and wanted to show how the client can be wrapped in cats.effect.IO (see the sketch after this list)
  • weaver - for testing
  • chimney - for transforming similar datatypes

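For example, here's roughly how an ActionListener-based call of the Java client can be lifted into IO with Cats Effect's async support - a sketch of the technique, not necessarily the project's exact code:

import cats.effect.IO
import org.elasticsearch.action.ActionListener
import org.elasticsearch.action.search.{SearchRequest, SearchResponse}
import org.elasticsearch.client.{RequestOptions, RestHighLevelClient}

// Lift any ActionListener-based method of the Java client into IO.
def fromListener[A](run: ActionListener[A] => Unit): IO[A] =
  IO.async_ { cb =>
    run(new ActionListener[A] {
      def onResponse(result: A): Unit = cb(Right(result))
      def onFailure(e: Exception): Unit = cb(Left(e))
    })
  }

// Example usage with the high-level client's async search.
def search(client: RestHighLevelClient)(request: SearchRequest): IO[SearchResponse] =
  fromListener { listener =>
    client.searchAsync(request, RequestOptions.DEFAULT, listener)
    () // searchAsync returns a Cancellable, which we ignore here
  }
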
The frontend is built with React + TypeScript. You'll need Node 14.x and npm; both are provided by the attached nix shell in the frontend directory.

Architecture

First of all, here's the data flow in the processes that can be triggered by the user:

(diagram: data flows)

There are three main processes:

Indexing

  • Wait for the user to provide a directory to index (a path to the directory in Dropbox)
  • Download metadata of all images within that directory (recursively)
  • For each entry, get a stream of bytes of its content, push it to Tesseract
  • Pass the metadata and the OCR decoding result to the indexer (Elasticsearch)

API:

# paths must start with /
http :4000/index path="/images"
POST /index HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 27
Content-Type: application/json
Host: localhost:4000
User-Agent: HTTPie/2.4.0

{
    "path": "/images"
}


HTTP/1.1 202 Accepted
Content-Length: 0
Date: Sun, 02 May 2021 17:57:19 GMT

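A minimal sketch of how such an "accept now, work in the background" endpoint can be written with http4s and Cats Effect (IndexRequest, runIndexing and the use of Supervisor are assumptions for illustration, not necessarily how the project wires it up):

import cats.effect.IO
import cats.effect.std.Supervisor
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import org.http4s.HttpRoutes
import org.http4s.circe.CirceEntityDecoder._
import org.http4s.dsl.io._

// Hypothetical request body matching the JSON above.
final case class IndexRequest(path: String)
object IndexRequest {
  implicit val decoder: Decoder[IndexRequest] = deriveDecoder
}

// Decode the request, start indexing on a supervised fiber, and reply 202 Accepted
// without waiting for the indexing to finish.
def indexRoutes(supervisor: Supervisor[IO], runIndexing: String => IO[Unit]): HttpRoutes[IO] =
  HttpRoutes.of[IO] { case req @ POST -> Root / "index" =>
    for {
      body <- req.as[IndexRequest]
      _    <- supervisor.supervise(runIndexing(body.path)) // fire-and-forget, but supervised
      resp <- Accepted()
    } yield resp
  }
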
Search

  • Ask the user for a query
  • Pass the query to Elasticsearch with some level of fuzziness
  • Pass results back to the user (metadata of matching files)

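With the high-level Java client, a fuzzy full-text query can be built roughly like this (the index name decoded-images and the field name content are assumptions); the request can then be executed through the listener-to-IO wrapper sketched in the tech stack section:

import org.elasticsearch.action.search.SearchRequest
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

// Hypothetical query: full-text match on the decoded text, with automatic fuzziness.
def searchRequest(query: String): SearchRequest =
  new SearchRequest("decoded-images").source(
    new SearchSourceBuilder().query(
      QueryBuilders.matchQuery("content", query).fuzziness("AUTO")
    )
  )
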
API:

http :4000/search\?query=test
GET /search?query=test HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0



HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 02 May 2021 17:58:41 GMT
Transfer-Encoding: chunked

[
    {
        "content": "Some decoded text",
        "imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg",
        "thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg"
    },
    {
        "content": "Another file with test text",
        "imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg",
        "thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg"
    }
]

Download

  • Take a concrete file's path from the user (from file metadata)
  • Return a stream of bytes for that file

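This can also be streamed end to end - a sketch with http4s (the download function and the hardcoded image/jpeg content type are assumptions for illustration):

import cats.effect.IO
import fs2.Stream
import org.http4s.{HttpRoutes, MediaType}
import org.http4s.dsl.io._
import org.http4s.headers.`Content-Type`

// Hypothetical route: pipe the bytes from the storage backend straight into the HTTP response,
// so the image is never fully buffered in memory.
def viewRoutes(download: String => Stream[IO, Byte]): HttpRoutes[IO] =
  HttpRoutes.of[IO] { case GET -> Root / "view" / path =>
    Ok(download(path)).map(_.withContentType(`Content-Type`(MediaType.image.jpeg)))
  }
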
API:

http :4000/view/%2Fimages%2Ffile2.jpg
GET /view/%2Fimages%2Ffile2.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0



HTTP/1.1 200 OK
Content-Type: image/jpeg
Date: Sun, 02 May 2021 18:01:30 GMT
Transfer-Encoding: chunked



+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+

Module graph

(diagram: modules)

The blue boxes correspond to the processes outlined above (core logic), and the green boxes are high-level adapters for the underlying vendor-specific implementations - this is similar to hexagonal architecture / the Ports & Adapters pattern.

These correspond almost directly to the interfaces (Tagless Final algebras) in the project.
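
As an illustration (the names are assumptions, not the project's exact interfaces), such algebras can look like this, reusing the shapes from the indexing sketch earlier:

import fs2.Stream

final case class FileMetadata(path: String)
final case class Document(path: String, content: String)

// Hypothetical algebras mirroring the green adapter boxes - each one is a small interface
// parameterized by the effect type, with a vendor-specific implementation living behind it.
trait ImageSource[F[_]] {
  def listFiles(path: String): Stream[F, FileMetadata]
  def download(path: String): Stream[F, Byte]
}

trait OCR[F[_]] {
  def decodeText(image: Stream[F, Byte]): F[String]
}

trait Indexer[F[_]] {
  def index(doc: Document): F[Unit]
  def search(query: String): Stream[F, Document]
}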
