The components of the big data platform are various software building blocks, each implemented in a Docker container. The goal of the BDP is to provide a modular ecosystem that can be easily extended and adapted to a certain use case. Therefore, messaging is used to publish and subscribe to relevant topics, instead of direct communication of the modules.
The big data platform on Docker is used in two use cases:
- Social Media Emotion Detection
  The social media use case collects tweets from Twitter and analyses the text. This first iteration proves the architectural foundations and shows the containerization and communication of the independent modules.
- VPS Popcorn Production
  The VPS use case optimizes the production of popcorn through the simultaneous execution and evaluation of several machine learning algorithms. This second iteration implements the cognition module and several APIs for information storage and retrieval.
Please install Docker and docker-compose to run the containers.
You can find instructions for the Docker installation on the Docker website.
To test the Docker installation, open a terminal and execute `docker run hello-world`.
On some operating systems the Docker installation does not include docker-compose. If you get a message that docker-compose is missing, follow the docker-compose installation instructions in the Docker documentation.
The subfolders contain the different building blocks of the Big Data Platform.
- Cognition
  The Cognition evaluates the results of machine learning algorithms on production process information and selects which algorithm provides the parameters for the next production cycle.
- DB
  - Postgres
    The relational database stores structured data.
- HMI
- Kafka
  All modules rely on the messaging ability for indirect communication with other modules. Thus the Kafka building block is the base class for most other modules, and more specialized blocks like a connector to PostgresDB inherit the additional Kafka functionality.
- Knowledge
  The knowledge module is implemented as an API to store, modify and retrieve the knowledge at any time.
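The inheritance pattern described above could look roughly like the following Python sketch. All class names, config keys and topic names here are illustrative assumptions, not the project's actual API; a real implementation would wrap an actual Kafka client library.

```python
# Sketch of the described inheritance pattern; names are hypothetical.
import json


class KafkaBase:
    """Base class: reads the module's topic configuration and offers
    messaging helpers that specialized building blocks inherit."""

    def __init__(self, config: dict):
        # Each module declares its input and output topics in config.yml.
        self.in_topics = config.get("IN_TOPIC", [])
        self.out_topic = config.get("OUT_TOPIC")

    def encode(self, payload: dict) -> bytes:
        # A real implementation would serialize with the Avro schema;
        # JSON stands in here to keep the sketch self-contained.
        return json.dumps(payload).encode("utf-8")


class PostgresConnector(KafkaBase):
    """Specialized block: inherits the messaging helpers from the
    Kafka base class and adds database-specific behaviour."""

    def __init__(self, config: dict, dsn: str):
        super().__init__(config)
        self.dsn = dsn  # connection string for the Postgres container
```

A specialized module then gets the messaging configuration for free, e.g. `PostgresConnector({"IN_TOPIC": ["raw_data"], "OUT_TOPIC": "db_ack"}, "postgresql://db:5432/bdp")`.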
The different building blocks that compose a use case are implemented as Docker containers. Several containers are managed together via docker-compose files; a docker-compose file contains the information to build and run the required containers.
The following folder structure shows all the possible different parts for a building block in this project:
```
Building Block
|- docs
|- src
|-- classes
|-- configurations
|-- schema
| something.py
| Dockerfile_something
| docker-compose.yml
| readme.md
```
The docs folder contains images, diagrams and other supporting material for the documentation.
The source folder stores the code files as well as the Dockerfiles. A Dockerfile consists of the instructions to build a container image. Each Dockerfile is named after the module it containerizes.
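A `Dockerfile_something` for such a module might look like the following sketch; the base image, paths and file names are illustrative assumptions, not the project's actual files.

```dockerfile
# Hypothetical Dockerfile_something; names and paths are illustrative.
FROM python:3.8-slim

WORKDIR /app

# Install the module's dependencies first so Docker can cache this layer.
COPY src/configurations/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the module code and shared classes into the container.
COPY src/ .

CMD ["python", "something.py"]
```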
The subfolder contains the different classes, e.g., the Kafka class, for other modules to inherit.
The configurations folder stores several files:
- the `config.yml` contains the configuration for all modules in the project. Furthermore, the config file includes general use case specific configuration, e.g., regarding the objective function or the initial design. The description of each service in the docker-compose file specifies which sections of the config file will be used in the container.
- the `requirements.txt` contains all the packages that need to be installed in a container. The Dockerfile copies the file into the container and installs the packages during the build process.
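As an illustration, a `config.yml` might be structured as in the fragment below; all section names, keys, topics and schema file names are assumptions, not the project's actual configuration schema.

```yaml
# Illustrative config.yml fragment; names are assumptions.
GENERAL:
  USE_CASE: vps_popcorn_production

KAFKA:
  BOOTSTRAP_SERVERS: kafka:9092

COGNITION:
  IN_TOPIC:
    model_results: model_result.avsc       # topic -> Avro schema file
  OUT_TOPIC:
    apply_parameters: apply_parameters.avsc
```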
The modules use indirect communication with a messaging approach.
All messages are sent via Kafka and verified against the related Avro schema, which is stored in an `.avsc` file.
Each module specifies its input and output topics and the associated schemas in the `config.yml`.
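To illustrate the schema verification idea, the sketch below embeds a hypothetical `.avsc` record schema and a deliberately simplified structural check; the field names are assumptions, and a real module would use a full Avro library such as fastavro for serialization and validation rather than this hand-rolled check.

```python
# Illustrative Avro record schema plus a minimal structural check.
# Field names are assumptions; real validation would use an Avro library.
import json

# Contents of a hypothetical new_x.avsc file:
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "NewX",
  "fields": [
    {"name": "phase", "type": "string"},
    {"name": "id_x", "type": "int"},
    {"name": "x", "type": "double"}
  ]
}
""")

# Mapping of Avro primitive types to Python types (simplified).
AVRO_TO_PY = {"string": str, "int": int, "double": float}


def matches_schema(message: dict, schema: dict) -> bool:
    """Check that every schema field is present in the message with a
    compatible Python type (primitive types only)."""
    for field in schema["fields"]:
        expected = AVRO_TO_PY.get(field["type"])
        if expected is None or not isinstance(message.get(field["name"]), expected):
            return False
    return True
```

A message like `{"phase": "init", "id_x": 1, "x": 4.1}` passes the check, while a message with a missing or mistyped field is rejected.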
Several services are combined into a docker-compose file, which allows all services to be managed together. Each service entry consists of:
- the container and host name.
- build information such as the base image or the path to the Dockerfile.
- file or volume mappings from the host system to a specific path in the container file system.
- environment variables, e.g., the config path and the config sections relevant to the module.
- port forwarding from the host system to the container.
Furthermore, it is also possible to define a common network for all containers and to specify how Docker volumes should be used.
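A service entry with these parts might look like the following docker-compose sketch; all service names, paths, ports and the network name are illustrative assumptions, not the project's actual configuration.

```yaml
# Illustrative docker-compose.yml fragment; all names are assumptions.
version: "3"

services:
  cognition:
    container_name: cognition            # container and host name
    hostname: cognition
    build:                               # build information
      context: .
      dockerfile: src/Dockerfile_cognition
    volumes:                             # host file mapped into the container
      - ./src/configurations/config.yml:/app/config.yml
    environment:                         # config path and relevant sections
      CONFIG_PATH: /app/config.yml
      CONFIG_SECTIONS: GENERAL,KAFKA,COGNITION
    ports:                               # host-to-container port forwarding
      - "8080:8080"

networks:                                # optional common network
  default:
    name: bdp_network
```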
The `readme.md` explains the module/use case and gives usage instructions.