Living Costs and Food Survey (LCF) Project repository
For a short summary of the Living Costs and Food Survey (LCF) project, please see the following document.
The ONS Big Data team was approached by the ONS Social Survey Division (SSD) about the possibility of using commercial data and/or data science methods to help improve the processing of the Living Costs and Food Survey (LCF).
To facilitate the LCF diary process, the Big Data team developed two prototypes in consultation with the Social Survey Division, Surveys and Life Events Processing, and the end user, DEFRA.
The proposed solutions harness information from clean historic LCF diary data to help complete missing product quantity information (i.e. amount, volume or weight purchased) at the point of data entry.
https://github.com/ONSBigData/LCF-project/tree/master/LCF-analysis
Entering LCF data from diaries into the database takes a significant amount of time. Currently this is done in a system called Blaise, and the most resource-intensive part is retrieving the amount (weight) information, as it is often missing from the diary or the receipt.
Although the customer (DEFRA) only requires amounts to be completed for half of the survey respondents, the additional time taken to find the correct amounts (usually via internet searches outside of Blaise) is a major contributor to diary processing delays.
A solution that could integrate easily into the current system and coders’ workflow was piloted using flat look-up functions already available in Blaise. The goal was to give the coder the option to choose an amount from a list of matching or very similar items previously entered, all within the Blaise environment (eliminating the need for an internet search on a different machine or in a browser).
The picture above summarises the data processing pipeline for the flat-file prototype. The prepared lists are exported to a CSV file and handed over to the Blaise team, who convert them into a (proprietary) format suitable for loading from within the questionnaire.
Each look-up file still contains many items, so their ordering is important: when the look-up file opens in Blaise, the cursor needs to be positioned so that the next few products are the most similar to the one the coder is looking for.
This ordering has been achieved with a modified k-nearest-neighbour classification algorithm.
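A minimal sketch of the ordering idea is shown below, using scikit-learn's NearestNeighbors over character n-gram TF-IDF vectors; the sample data, vectoriser settings and names are illustrative assumptions, not the production implementation:

```python
# Hedged sketch: order look-up candidates so the most similar historical
# products sit next to the cursor position. Example data is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

historical_items = ["SEMI SKIMMED MILK", "WHOLE MILK", "WHITE SLICED LOAF", "CHEDDAR CHEESE"]

# Character n-grams make the match robust to typos in diary entries.
vectoriser = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectoriser.fit_transform(historical_items)

knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)

# Position the "cursor" at the nearest neighbours of the coder's entry.
_, indices = knn.kneighbors(vectoriser.transform(["SKIMED MILK"]))
print([historical_items[i] for i in indices[0]])
```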
https://github.com/ONSBigData/LCF-project/tree/master/LCF-shiny
As mentioned above, entering LCF data from diaries into Blaise takes a significant amount of time, and the most time-consuming part is retrieving the amount (weight) information, as it is often missing from the diary or the receipt.
Another solution proposed by the Big Data team was a system that uses a SOLR-based server to help with automatic COICOP classification and to propose the most probable weight for an item based on its cost.
SOLR is an open-source, Lucene-based search engine library providing scalable enterprise indexing and search technology. Records created from clean historical LCF data are first indexed so that they can be retrieved quickly against the requested criteria. By default, SOLR uses a modified TF-IDF method to calculate a similarity score between the query and every indexed historical LCF record.
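As an illustration of the indexing step, the sketch below posts one record to SOLR's JSON update endpoint using Python's requests library; the core name lcf, the localhost URL and the sample record are assumptions (the field names match the schema shown further down):

```python
# Hedged sketch: index one cleaned historical diary record into SOLR via its
# JSON update API. Core name "lcf" and the sample values are assumptions.
import requests

SOLR_UPDATE = "http://localhost:8983/solr/lcf/update?commit=true"

docs = [{
    "line": "1",
    "coicop": 11101,                 # illustrative COICOP code
    "EXPDESC": "WHITE SLICED LOAF",  # product description from the diary
    "Paid1": 1.10,
    "Shop": "TESCO",
    "MAFFQuan": 800.0,
    "MAFFUnit": "g",
}]

requests.post(SOLR_UPDATE, json=docs).raise_for_status()
```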
A Shiny app was created to mimic the Blaise system in appearance and functionality, in order to show how this could work from within Blaise.
Figure 4. Shiny app simulating the Blaise interface, using the SOLR backend to predict COICOP codes and propose weights
Setup instructions for installing and configuring SOLR on Ubuntu
SOLR schema currently used for this project:
```xml
<fields>
  <field name="line" type="string" indexed="true" stored="true" required="true"/>
  <field name="coicop" type="integer" indexed="true" stored="true"/>
  <field name="EXPDESC" type="text" indexed="true" stored="true"/>
  <field name="Paid1" type="float" indexed="true" stored="true"/>
  <field name="Shop" type="text" indexed="true" stored="true"/>
  <field name="MAFFQuan" type="float" indexed="true" stored="true"/>
  <field name="MAFFUnit" type="text" indexed="true" stored="true"/>
</fields>
```
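Given that schema, a query for a diary description might look like the sketch below: SOLR ranks the indexed records against the free-text description, and the top hits carry the quantities used to propose a weight (the core name and URL are again assumptions):

```python
# Hedged sketch: retrieve the most similar historical items for a description
# and read off their quantities. Core name "lcf" and URL are assumptions.
import requests

SOLR_SELECT = "http://localhost:8983/solr/lcf/select"
params = {
    "q": "EXPDESC:(semi skimmed milk)",  # ranked by SOLR's similarity scoring
    "fl": "EXPDESC,coicop,MAFFQuan,MAFFUnit,score",
    "rows": 5,
    "wt": "json",
}
for doc in requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]:
    print(doc["EXPDESC"], doc.get("coicop"), doc.get("MAFFQuan"), doc.get("MAFFUnit"))
```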
https://github.com/ONSBigData/LCF-project/tree/master/LCFshinyReceiptOCR
A Shiny app that can OCR a default receipt picture (or any other uploaded picture) using the Tesseract OCR library was created as a starting point for getting information from receipts into a textual format so it can be processed, matched, parsed, etc.
To install the Tesseract library, setup instructions covering all Ubuntu requirements for this app are provided, together with a link to hints/tips/suggestions on how to improve the quality of OCR with Tesseract.
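For the OCR step itself, a minimal Python equivalent of what the app does is sketched below using pytesseract (the Shiny app calls Tesseract from R; receipt.png is a placeholder file name):

```python
# Hedged sketch: extract raw text from a receipt image with Tesseract.
# "receipt.png" is a placeholder; the Shiny app performs the same step from R.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("receipt.png"))
print(text)  # raw receipt text, ready to be processed, matched and parsed
```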
https://github.com/ONSBigData/LCF-project/tree/master/LCF-COICOPclassification
A Jupyter notebook containing three types of scikit-learn classifiers (machine learning algorithms) trained to automatically assign a COICOP code based on a product description (a minimal sketch follows the list below):
- Naive Bayes
- Support Vector Machines
- Random Forests
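A hedged sketch of the Naive Bayes variant, assuming TF-IDF features over product descriptions (the training data and COICOP labels below are placeholders, not LCF data):

```python
# Hedged sketch: a Naive Bayes classifier over TF-IDF features of product
# descriptions. Training data and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = ["white sliced loaf", "semi skimmed milk", "cheddar cheese"]
coicop_codes = ["1.1.1.1", "1.1.4.1", "1.1.4.5"]  # illustrative labels only

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(descriptions, coicop_codes)
print(model.predict(["skimmed milk 2 pints"]))  # -> predicted COICOP code
```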
Additionally, there is a Jupyter notebook containing a Python implementation of the BM25 ranking algorithm used in products such as Apache Lucene and SOLR.
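For reference, a compact self-contained BM25 scorer is sketched below; the parameter values k1 = 1.5 and b = 0.75 are common defaults, not necessarily those used in the notebook:

```python
# Hedged sketch: score tokenised documents against a query with BM25.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document in `docs` (lists of tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["semi", "skimmed", "milk"], ["whole", "milk"], ["white", "sliced", "loaf"]]
print(bm25_scores(["skimmed", "milk"], docs))
```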
Created by contributors working for the Office for National Statistics Big Data project.
Released under the MIT License.