An application accompanying my talk about structuring functional applications in Scala.
Prerequisites: sbt, Elasticsearch running on localhost:9200, and the tesseract binary available on the PATH and runnable.
Elasticsearch is available in the attached docker-compose setup. tesseract is available if you enter the attached nix-shell. The shell will also load environment variables from the env.sh file, if you have one (it's ignored in git).
Once you have these: sbt run - at the time of writing the application starts on 0.0.0.0:4000; this can be configured with the HTTP_HOST/HTTP_PORT environment variables.
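For reference, here's a minimal sketch of how such configuration could be loaded with ciris - the HttpConfig case class and the default values are illustrative assumptions, not necessarily what the project uses:

import cats.effect.IO
import cats.syntax.all._
import ciris._

// Hypothetical config shape - the names here are assumptions for this example.
final case class HttpConfig(host: String, port: Int)

val httpConfig =
  (
    env("HTTP_HOST").default("0.0.0.0"),
    env("HTTP_PORT").as[Int].default(4000)
  ).parMapN(HttpConfig.apply)

// Load the configuration inside IO at application startup.
val loaded: IO[HttpConfig] = httpConfig.load[IO]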
Prerequisites: Node 14.x, npm (provided in a nix-shell in the frontend directory).
cd frontend
npm start
Search images from some storage by the text on them.
I have tens of thousands of images to search, and the OCR (optical character recognition) process takes ~0.5 seconds per image for relatively small images, so live decoding is a no-no. Instead, we will allow the user to index a path from the store, and later search the database populated in that process.
The indexing will happen in the background, without the user having to wait for it to complete before getting a response. We'll run the whole process in constant memory (although that part is ultimately up to Tesseract - I haven't investigated its memory usage on large images yet), so both downloading the list of images to index and downloading their actual bytes are done in a streaming fashion, thanks to fs2.
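To illustrate the "respond immediately, index in the background" part, here's a minimal sketch using Cats Effect fibers - the names are made up for the example and not taken from the codebase:

import cats.effect.IO

// indexPath is assumed to be the streaming pipeline described later: given a
// directory path, it returns an IO that completes once everything was indexed.
def scheduleIndexing(indexPath: String => IO[Unit])(path: String): IO[Unit] =
  indexPath(path).start.void // run on a background fiber, return to the caller immediately

A cats.effect.std.Supervisor would be a more structured way to manage such background fibers, but a plain start is enough to show the idea.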
At the time of writing:
- OCR is performed by Tesseract
- Image source is Dropbox, we'll be using the User API
- Indexing and full text search are possible thanks to Open Distro for Elasticsearch.
Right now, this only runs on a local machine. Tesseract is provided via the nix shell, Open Distro runs in docker. The application can be started using bloop.
The backend is built in Scala (obviously - that was the point of the talk), using the following libraries:
- Cats Effect 3, for several things, such as monadic composition of asynchronous tasks (e.g. Elasticsearch client) and interop with other libraries from the ecosystem
- fs2 - for streaming data, so that we can run the indexing process in constant memory, as long as the OCR implementation can do so
- http4s - for the HTTP server, as well as a custom client for Dropbox
- ciris - for compositionally loading configuration
- circe - for decoding/encoding JSON
- log4cats - for logging
- Elasticsearch high-level Java client - for talking to Elasticsearch. Normally you could use something like elastic4s, but I only needed a subset of its functionality and wanted to show how this can be wrapped in cats.effect.IO (see the sketch after this list)
- weaver - for testing
- chimney - for transforming similar datatypes
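As an example of the wrapping mentioned for the Elasticsearch client, a callback-based method of the Java high-level client can be lifted into cats.effect.IO roughly like this - a sketch, not the project's actual code:

import cats.effect.IO
import org.elasticsearch.action.ActionListener
import org.elasticsearch.action.index.{IndexRequest, IndexResponse}
import org.elasticsearch.client.{RequestOptions, RestHighLevelClient}

// Suspend the asynchronous Java call and complete the IO from the listener callbacks.
def indexDocument(client: RestHighLevelClient, request: IndexRequest): IO[IndexResponse] =
  IO.async_ { callback =>
    client.indexAsync(
      request,
      RequestOptions.DEFAULT,
      new ActionListener[IndexResponse] {
        def onResponse(response: IndexResponse): Unit = callback(Right(response))
        def onFailure(e: Exception): Unit = callback(Left(e))
      }
    )
    () // discard the returned Cancellable for simplicity
  }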
The frontend is built with React + TypeScript. You'll need Node 14.x and npm; both are provided with the attached nix shell in the frontend directory.
First of all, here's the data flow of the processes that can be triggered by the user:
There are three main processes:
- Wait for the user to provide a directory to index (a path to the directory in Dropbox)
- Download metadata of all images within that directory (recursively)
- For each entry, get a stream of bytes of its content and push it to Tesseract
- Pass the metadata and the OCR decoding result to the indexer (Elasticsearch) - see the pipeline sketch below
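A minimal sketch of that pipeline with fs2 - the interfaces and signatures below are invented for illustration (the real interfaces are Tagless Final algebras abstracted over an effect type, as mentioned at the end of this document), but the shape of the stream is the point:

import cats.effect.IO
import fs2.Stream

final case class ImageMetadata(path: String)

// Hypothetical interfaces, roughly matching the steps above.
trait ImageSource {
  def listImages(path: String): Stream[IO, ImageMetadata] // recursive metadata listing
  def download(meta: ImageMetadata): Stream[IO, Byte]     // file contents
}

trait OCR {
  def decodeText(image: Stream[IO, Byte]): IO[String]
}

trait Indexer {
  def index(meta: ImageMetadata, content: String): IO[Unit]
}

def indexPath(source: ImageSource, ocr: OCR, indexer: Indexer)(path: String): IO[Unit] =
  source
    .listImages(path)
    .evalMap { meta =>
      ocr.decodeText(source.download(meta)).flatMap(indexer.index(meta, _))
    }
    .compile
    .drain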
API:
# paths must start with /
http :4000/index path="/images"
POST /index HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 27
Content-Type: application/json
Host: localhost:4000
User-Agent: HTTPie/2.4.0
{
"path": "/images"
}
HTTP/1.1 202 Accepted
Content-Length: 0
Date: Sun, 02 May 2021 17:57:19 GMT
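A hedged sketch of what the route behind this endpoint could look like in http4s - the payload type and the scheduleIndexing function are assumptions made for the example:

import cats.effect.IO
import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder
import org.http4s.HttpRoutes
import org.http4s.circe.CirceEntityDecoder._
import org.http4s.dsl.io._

final case class IndexRequestBody(path: String)
object IndexRequestBody {
  implicit val decoder: Decoder[IndexRequestBody] = deriveDecoder
}

// scheduleIndexing is assumed to kick off the background pipeline for a path
// and return immediately, which is why the route can answer with 202 Accepted.
def indexRoutes(scheduleIndexing: String => IO[Unit]): HttpRoutes[IO] =
  HttpRoutes.of[IO] { case req @ POST -> Root / "index" =>
    req.as[IndexRequestBody].flatMap(body => scheduleIndexing(body.path)) *> Accepted()
  }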
- Ask the user for a query
- Pass the query to Elasticsearch with some level of fuzziness - see the query sketch below
- Pass results back to the user (metadata of matching files)
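For illustration, a fuzzy match query could be built with the Java client along these lines - the index name ("images") and field name ("content") are assumptions, not necessarily what the project uses:

import org.elasticsearch.action.search.SearchRequest
import org.elasticsearch.common.unit.Fuzziness
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.search.builder.SearchSourceBuilder

// Build a match query with automatic fuzziness on the OCR-decoded text field.
def searchRequest(query: String): SearchRequest =
  new SearchRequest("images").source(
    new SearchSourceBuilder().query(
      QueryBuilders.matchQuery("content", query).fuzziness(Fuzziness.AUTO)
    )
  )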
API:
http :4000/search\?query=test
GET /search?query=test HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0
HTTP/1.1 200 OK
Content-Type: application/json
Date: Sun, 02 May 2021 17:58:41 GMT
Transfer-Encoding: chunked
[
{
"content": "Some decoded text",
"imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg",
"thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile1.jpg"
},
{
"content": "Another file with test text",
"imageUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg",
"thumbnailUrl": "http://127.0.0.1:4000/view/%2Fimages%2Ffile2.jpg"
}
]
- Take a concrete file's path from the user (from file metadata)
- Return a stream of bytes for that file
API:
http :4000/view/%2Fimages%2Ffile2.jpg
GET /view/%2Fimages%2Ffile2.jpg HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: localhost:4000
User-Agent: HTTPie/2.4.0
HTTP/1.1 200 OK
Content-Type: image/jpeg
Date: Sun, 02 May 2021 18:01:30 GMT
Transfer-Encoding: chunked
+-----------------------------------------+
| NOTE: binary data not shown in terminal |
+-----------------------------------------+
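A sketch of how such a streaming response could be produced with http4s - the download function is an assumption standing in for the Dropbox-backed image source:

import cats.effect.IO
import fs2.Stream
import org.http4s.headers.`Content-Type`
import org.http4s.{HttpRoutes, MediaType}
import org.http4s.dsl.io._

// download is assumed to return the raw bytes of the file at the given path.
// The content type is hardcoded here for brevity.
def viewRoutes(download: String => Stream[IO, Byte]): HttpRoutes[IO] =
  HttpRoutes.of[IO] { case GET -> Root / "view" / path =>
    Ok(download(path)).map(_.withContentType(`Content-Type`(MediaType.image.jpeg)))
  }

The body is never loaded into memory as a whole - http4s streams the fs2 bytes straight to the response, which is where the chunked transfer encoding in the example above comes from.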
The blue boxes correspond to the processes outlined above (core logic), while the green boxes are high-level adapters for the underlying vendor-specific implementations - this is similar to the hexagonal architecture / Ports & Adapters pattern.
These correspond almost directly to the interfaces (Tagless Final algebras) in the project.
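For example, the OCR port could be expressed as a Tagless Final algebra along these lines - a sketch; the actual trait names and signatures in the repository may differ:

import fs2.Stream

// A port for the OCR capability, abstracted over the effect type F.
trait OCR[F[_]] {
  def decodeText(image: Stream[F, Byte]): F[String]
}

A vendor-specific adapter (e.g. one backed by Tesseract) implements the algebra for a concrete effect such as cats.effect.IO, while the core logic only depends on the abstract trait.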