---
title: Building a Centralized Annotated Database
author: Punya Modi
description: What is the need to create a centralized annotated database? What advantages can it offer?
tags: Machine Learning
---

# **Value of creating a standardized annotated database**

When I was first being introduced to the world of Machine Learning, one of the more memorable statements I heard was that “Machine Learning is, after all, data-driven AI, and your model will only be as good or as bad as the data you have”. ML depends heavily on data; without it, it is impossible for an “AI” to learn. Data is the most crucial ingredient that makes training an algorithm possible: no matter how great the AI team is or how large the dataset, if the data is not good enough, the entire project will fail. Gathering a sizable amount of data is only the first step of the process; every project also requires a sizable amount of time and effort to restructure, classify, and label that data. This is especially true for datasets used to present a realistic vision of our world. With the need for data so clear, we also need a mechanism to store all of this data in a fashion that doesn't compromise the work already done to tailor the data to the needs of the project.


# FoodDX

FoodDX is a product from Holmusk that is meant to transform the way consumers think and feel about food. It gives real-time feedback on your meals: users upload photos of what they eat and receive real-time nutrition scores and food tips. This is enabled by AI food scoring that leverages the company's real-world dataset of over 100,000 images.

This was made possible by one of Holmusk's previous products, Glycoleap, which aimed to help diabetic patients track their diet and manage their health: users could get an expert nutritionist's advice simply by taking a photo of their meal and uploading it to the app, where nutritionists then graded it. This data gave FoodDX a great head start, as it provided access to trusted, high-quality data stored in a database. We used this initial data to train the model, and we also brought in a completely separate set of data for cross-verification. That data, however, was mostly managed via Dropbox: coaches received images through Dropbox and shared their scores on the same platform. Once FoodDX was up and running, we organized the system better: the model ratings and all relevant information were stored in a database, and an annotation platform was built for the coaches in Retool to accurately score each image and its food type, with the coach data saved in a separate data table.

Currently, we have a plethora of sources, as summarized in the flow diagram below:

![](/images/blogposts/ML_datapipeline_sources.PNG)


Considering the different sources we have, we decided to classify them into two types:

1. **Static sources** - These are sources that are no longer being updated. Since there will be no further changes to these sources, we decided the data from them needed a one-time transfer to the new database.

2. **Active sources** - These are sources that are continuously updated. Since they will keep producing new data, we decided there needed to be an established mechanism for transferring data from these sources periodically.


# How to design a database that meets the requirements

A good database design process is governed by specific rules. The first rule in creating a database design is to avoid data redundancy. It wastes space and increases the probability of faults and discrepancies within the database. The second rule is that the accuracy and comprehensiveness of information are imperative. A database containing erroneous information will lead to inaccurate analysis. Consequently, it can mislead decision-makers and adversely affect performance.

While creating the database, it is necessary to keep in mind the needs of the team using it: rather than trying to include every bit of data, it is more important to include only the data that is actually relevant.
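
To make the redundancy rule concrete, here is a minimal sketch of what a normalized layout could look like. The table and column names are hypothetical (the real FoodDX schema is not shown in this post), and the snippet assumes a reachable Postgres instance whose connection URL is in a `CENTRAL_DB_URL` environment variable. The idea is simply that coach details are stored once and referenced by ID, rather than copied onto every rating row.

```python
import os

import psycopg2

# Hypothetical normalized layout: coach details live in one table and are
# referenced by ID, so nothing is duplicated across rating rows.
SCHEMA = """
CREATE TABLE IF NOT EXISTS coaches (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    email text UNIQUE NOT NULL
);

CREATE TABLE IF NOT EXISTS food_entries (
    id        bigserial PRIMARY KEY,
    image_url text UNIQUE NOT NULL,
    source    text NOT NULL  -- e.g. 'glycoleap' or 'fooddx_app'
);

CREATE TABLE IF NOT EXISTS coach_ratings (
    entry_id bigint  REFERENCES food_entries (id),
    coach_id integer REFERENCES coaches (id),
    score    integer CHECK (score BETWEEN 1 AND 10),  -- assumed scale
    PRIMARY KEY (entry_id, coach_id)
);
"""

with psycopg2.connect(os.environ["CENTRAL_DB_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute(SCHEMA)
```

Because a rating only references a coach, a coach's details can change in exactly one place, which is precisely the kind of discrepancy the first rule guards against.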

# How to set up the database from static and active sources
Considering that we have two very different types of data sources, it is important to have a different approach to syncing data from each:

1. **Static sources** - As mentioned above, a static source has no new data flowing in, which allows us more freedom in porting its data. The most important factor, however, is that static sources are often in a format different from the one required by the new database. Hence, with static sources, the priority is a methodology that ports the data while also reformatting it to match the new database's format. For FoodDX, the main static source of data was Glycoleap; to port the Glycoleap data, we decided to create a Lambda that reads the data, reformats it, and then adds it to the new database (see the first sketch after this list).

2. **Active sources** - As mentioned above, an active source continuously has new data coming in, so the main concern is to ensure a periodic, seamless transfer of that data. On top of this, with most active sources it is necessary to identify the data that arrived between two subsequent syncing cycles, as transferring all the data from an active source every period would waste a lot of resources. For FoodDX, the active source was the FoodDX app, which continuously gives us new images; these images and their relevant information are stored in one table, while the coach information, created using the Retool dashboard, is kept in another. To combine the two, we set up a Postgres trigger on the coach-rating table that runs a join, giving us a new table containing all the food entries that have a coach rating (second sketch below). With this in place, we plan to use Amazon Database Migration Service to transfer that table to its required location in the new database (third sketch below).
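
Below is a minimal sketch of what the porting Lambda for a static source could look like. It is illustrative only: the Glycoleap column names and the flattened `annotated_images` target table are hypothetical, and it assumes a psycopg2 layer is attached to the Lambda with both connection URLs available in the environment.

```python
import os

import psycopg2  # assumes a psycopg2 Lambda layer is attached


def handler(event, context):
    """One-time port: read Glycoleap rows, reformat, write to the new DB."""
    src = psycopg2.connect(os.environ["GLYCOLEAP_DB_URL"])
    dst = psycopg2.connect(os.environ["CENTRAL_DB_URL"])
    try:
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            # Hypothetical legacy schema: one row per graded meal photo.
            read_cur.execute(
                "SELECT image_url, nutritionist_score, graded_at FROM meal_entries"
            )
            for image_url, score, graded_at in read_cur:
                # Reformat to the new database's layout on the way in.
                write_cur.execute(
                    """
                    INSERT INTO annotated_images (image_url, rating, rated_at, source)
                    VALUES (%s, %s, %s, 'glycoleap')
                    ON CONFLICT (image_url) DO NOTHING
                    """,
                    (image_url, score, graded_at),
                )
        dst.commit()
    finally:
        src.close()
        dst.close()
```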
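
The Postgres trigger from point 2 could look roughly like the following. Again, the table names (`food_entries`, `coach_ratings`, `rated_food_entries`) are placeholders, and the `EXECUTE FUNCTION` syntax assumes Postgres 11 or later.

```python
import os

import psycopg2

# Placeholder DDL: whenever a coach rating is inserted, join it with the
# matching food entry and upsert the combined row into rated_food_entries.
DDL = """
CREATE TABLE IF NOT EXISTS rated_food_entries (
    entry_id    bigint PRIMARY KEY,
    image_url   text,
    coach_score integer,
    rated_at    timestamptz
);

CREATE OR REPLACE FUNCTION sync_rated_entry() RETURNS trigger AS $$
BEGIN
    INSERT INTO rated_food_entries (entry_id, image_url, coach_score, rated_at)
    SELECT f.id, f.image_url, NEW.score, now()
    FROM food_entries AS f
    WHERE f.id = NEW.entry_id
    ON CONFLICT (entry_id) DO UPDATE
        SET coach_score = EXCLUDED.coach_score,
            rated_at    = EXCLUDED.rated_at;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS coach_rating_sync ON coach_ratings;
CREATE TRIGGER coach_rating_sync
AFTER INSERT ON coach_ratings
FOR EACH ROW EXECUTE FUNCTION sync_rated_entry();
"""

with psycopg2.connect(os.environ["CENTRAL_DB_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```

Because the trigger fires on every insert, the joined table stays current without any batch job having to rescan old ratings.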
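
Finally, a hedged sketch of kicking off the transfer with Amazon Database Migration Service via boto3. Every ARN below is a placeholder, and `full-load-and-cdc` is just one reasonable choice of migration type (an initial copy followed by ongoing change capture):

```python
import json

import boto3

dms = boto3.client("dms")

# Only replicate the joined table produced by the trigger above.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "rated-entries-only",
            "object-locator": {
                "schema-name": "public",
                "table-name": "rated_food_entries",  # hypothetical table
            },
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="rated-entries-sync",
    SourceEndpointArn="arn:aws:dms:REGION:ACCOUNT:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:REGION:ACCOUNT:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:REGION:ACCOUNT:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```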


# Conclusion
The value of creating a centralized annotated database for any project, especially a Machine Learning project, cannot be overstated. It helps streamline work while also allowing the project team to provision for the future. And while creating a centralized pipeline is important, we also need to transfer all of the existing data into that new pipeline.