Rates Task
Learned Python and PostgreSQL in 2 days.
I followed this tutorial to get started:
https://scotch.io/tutorials/build-a-restful-api-with-flask-the-tdd-way
As of Sunday, 4th of March, I had done the following:
- Completed GET API
- Started POST API
- Wrote 12 test cases that run successfully
Today, Tuesday, 6th of March, I completed the Development Task.
Spent half a day on the following:
- Refactored the code in accordance with OOP (only the exception handling part caused some trouble; I couldn't cleanly hide that logic inside a class)
- Used batch insert for adding the records into the table (COPY could be used instead by buffering the rows in memory first)
- 16 unit tests ran successfully
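The COPY alternative mentioned in the list above could look roughly like the sketch below. It is not the code from this task: it assumes a psycopg2 cursor (whose copy_expert method streams a file-like object to the server), and the table name prices and the columns orig_code, dest_code, day, price are hypothetical names chosen for illustration.

```python
import csv
import io


def rows_to_buffer(rows):
    """Serialize rows into an in-memory CSV buffer that COPY can read."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)  # rewind so the reader starts at the first row
    return buffer


def copy_rows(cursor, rows):
    """Load all rows with a single COPY instead of many INSERTs.

    `cursor` is assumed to be a psycopg2 cursor. The table and column
    names are hypothetical.
    """
    sql = (
        "COPY prices (orig_code, dest_code, day, price) "
        "FROM STDIN WITH (FORMAT csv)"
    )
    cursor.copy_expert(sql, rows_to_buffer(rows))
```

The caller would still need to commit the transaction on the connection afterwards, as with the batch insert.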
Instructions to setup environment:
- Install Python3 (I installed Anaconda for macOS)
- Install Flask
- Run this: export FLASK_APP="run.py" (I used Virtualenv from the tutorial)
- Install PostgreSQL
- Install Psycopg2
To run the tests, use the following command:
python test_rates.py
To run the application:
flask run
To test GET request run:
In my opinion, a table that receives the inserts should be kept as small as possible, since we require high performance for our INSERT statements.
I would keep only the data for the last couple of weeks, or even days, depending on what the business requirements state.
The historical data should be kept in another database with a main focus on reads, i.e. SELECT statements. The purpose of this second database is to provide analytical information for our end users.
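A minimal sketch of the retention idea above, using Python's built-in sqlite3 module as a stand-in for both databases so that it runs anywhere. The table names prices and prices_history, the day column, and the 14-day window are assumptions made for illustration. In the design described here the history table would live in a separate, read-oriented database, so the copy and the delete would not share one transaction as they do in this sketch.

```python
import sqlite3
from datetime import date, timedelta


def archive_old_rows(conn: sqlite3.Connection, today: date,
                     retention_days: int = 14) -> int:
    """Move rows older than the retention window into the history table.

    Dates are assumed to be stored as ISO strings (YYYY-MM-DD), so string
    comparison matches chronological order. Returns the number of rows moved.
    """
    cutoff = (today - timedelta(days=retention_days)).isoformat()
    with conn:  # copy and delete inside one transaction
        conn.execute(
            "INSERT INTO prices_history SELECT * FROM prices WHERE day < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM prices WHERE day < ?", (cutoff,)
        ).rowcount
    return deleted
```

Such a job could run on a schedule (for example nightly) so that the insert table never grows beyond the retention window.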
Having two databases with different purposes would help us to separate the logic and allow us to tailor each database to its needs. The first database is an OLTP database; it can be a relational SQL database or a NoSQL database.
I would use a NoSQL database as the OLTP database for the insert, update and delete queries, since it is a better fit when you regularly have to insert and delete a lot of records at once.
A NoSQL database would also be easier to scale horizontally by sharding. A relational SQL database can be partitioned as well, but that is more of a burden and not as straightforward. Relational databases usually scale vertically, which is not cost effective.
I would use a queue such as RabbitMQ to handle all the incoming data batches: each batch id is stored in the queue, and batches are processed by fetching their ids from the queue. The queue gives us a degree of fault tolerance in case of a sudden peak load, since nothing is lost or dropped when all the batch processing instances are busy.
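The queueing idea can be sketched as follows. This is only an illustration: Python's standard queue.Queue stands in for a real broker such as RabbitMQ, and the function names enqueue_batch and process_next are hypothetical. Putting the id back when processing fails plays the role that message acknowledgements play in a real broker.

```python
import queue


def enqueue_batch(batch_queue, batch_id):
    """Producer side: record that a batch is waiting to be processed."""
    batch_queue.put(batch_id)


def process_next(batch_queue, handler):
    """Worker side: take one batch id from the queue and process it.

    Returns the batch id, or None if the queue is empty. If the handler
    raises, the id is put back on the queue so the batch is not lost.
    """
    try:
        batch_id = batch_queue.get_nowait()
    except queue.Empty:
        return None
    try:
        handler(batch_id)
    except Exception:
        batch_queue.put(batch_id)  # requeue for a later retry
        raise
    finally:
        batch_queue.task_done()
    return batch_id
```

When every worker is busy, new batch ids simply accumulate in the queue until a worker calls process_next again.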
To handle the application logic I would use a fleet of instances that can scale horizontally. I would tie the scaling logic to the queue length metric: whenever the number of batch ids in the queue is higher than a threshold, I would launch a new instance to process the batches.
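The scaling rule described above could be expressed as a small decision function like the one below. The threshold of 100 and the instance limits are made-up defaults, and removing an instance when the queue is empty is an added assumption that the text above does not state.

```python
def next_instance_count(queue_length, running, threshold=100,
                        min_instances=1, max_instances=20):
    """Decide how many batch processing instances should run next.

    Adds one instance while the queue is longer than the threshold, and
    (as an assumption) removes one when the queue is empty.
    """
    if queue_length > threshold and running < max_instances:
        return running + 1
    if queue_length == 0 and running > min_instances:
        return running - 1
    return running
```

For example, with the defaults, a queue of 250 batch ids and 3 running instances gives 4, while an empty queue with 3 running instances gives 2.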
If the data in the OLTP database is also read, I would create read replicas when a relational database is used. I would also use an in-memory cache such as Redis for frequent SELECT queries.
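The caching of frequent SELECT queries can be sketched with the cache-aside pattern. To keep the example runnable without a server, a small in-process class stands in for Redis; TTLCache and cached_query are hypothetical names, and the 60-second expiry in the usage note is an arbitrary choice.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for Redis: entries expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # entry is stale, drop it
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def cached_query(cache, key, run_query):
    """Cache-aside: return the cached result, else run the query and store it.

    A result of None is treated as a miss and is never cached.
    """
    result = cache.get(key)
    if result is None:
        result = run_query()
        cache.set(key, result)
    return result
```

A caller would create one cache, for example TTLCache(60), and wrap each frequent SELECT in cached_query with a key built from the query parameters, so repeated requests within the TTL never reach the database.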
As load grows, I expect the batch processing instances, the database itself, and the network throughput to become bottlenecks. I would monitor CPU, memory, disk I/O and throughput, and network performance.
As I mentioned above, these issues can be addressed by using a horizontally scalable design for the batch processing instances and by splitting the database according to historical and operational needs.
Caching, queues and keeping the insert table small would also help to address these bottlenecks.
- The batch updates have started to become very large, but the requirements for their processing time are strict.
As a cloud architect, I would build the environment in AWS. The architecture described above can handle very large batch updates: I could change the instance type, i.e. scale vertically, to handle a larger batch.
- Code updates need to be pushed out frequently. This needs to be done without the risk of stopping a data update already being processed, nor a data response being lost.
The bottleneck is the batch processing logic, as the instances may be unable to handle new batches properly while they are being updated. They might drop batches without processing them. As I mentioned above, queues can be used for this purpose, so that a batch can wait in the queue or be processed in parallel by another instance.
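One way a worker could cooperate with frequent deployments is sketched below. It assumes that the deployment tooling can ask a worker to stop (for example by setting a flag from a SIGTERM handler) and waits for it to exit; run_worker and its parameters are hypothetical names, and queue.Queue again stands in for the real queue.

```python
import queue
import threading


def run_worker(batch_queue, handler, stop_event):
    """Process batches until asked to stop.

    The stop flag is checked only between batches, so a batch that has
    already started always runs to completion before the worker exits.
    Unprocessed batch ids stay in the queue for the next worker version.
    """
    processed = []
    while not stop_event.is_set():
        try:
            batch_id = batch_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing to do yet, check the stop flag again
        handler(batch_id)
        processed.append(batch_id)
    return processed
```

If the stop request arrives while a batch is being handled, that batch still finishes, and the remaining ids wait in the queue until the updated instances start.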
- For development and staging purposes, you need to start up a number of scaled-down versions of the system.
Because my architecture uses horizontally scalable batch processing instances, it can be scaled down very easily. I could change the instance type to a cheaper one, and even a single instance would be enough for the DEV and QA environments.