Repository for the Big Data Team work on the LCF Project


LCF-Project



Living Costs and Food Survey (LCF) Project repository



Executive Summary

For a short summary of the Living Costs and Food Survey (LCF) Project, please see the following document.

Overview



The ONS Big Data team was approached by ONS Social Survey Division (SSD) about the possibility of using commercial data and/or data science methods to help improve the processing of the Living Costs and Food Survey (LCF).


Figure 1. LCF Diary process

To facilitate the LCF diary process, two prototypes were developed by the Big Data Team in consultation with the Social Survey Division, Surveys and Life Events Processing, and the end user, DEFRA (the Department for Environment, Food and Rural Affairs).

The proposed solutions harness information from clean historic LCF diary data to help complete missing product quantity information (i.e. amount, volume or weight purchased) at the point of data entry.



strand A: Using historical data to create a lookup



https://github.com/ONSBigData/LCF-project/tree/master/LCF-analysis

Entering LCF data from diaries into the database takes a significant amount of time. Entry is currently done in a system called Blaise, and the most resource-intensive step is retrieving the amount (weight) information, as it is often missing from the diary or the receipt.

Although the customer (DEFRA) only requires amounts to be completed for half of the survey respondents, the additional time taken to find the correct amounts (usually via internet searches outside of Blaise) is a large contributing factor in diary processing delays.

A solution that could integrate easily into the current system and coders' workflow was piloted using flat look-up functions already available in Blaise. The goal was to give the coder the option of choosing an amount from a list of matching or very similar previously entered items, all within the Blaise environment, eliminating the need for an internet search in a separate browser or on a different machine.


Figure 2. LCF flat file solution process

The picture above shows a summary of the data processing pipeline for the flat-file prototype. The prepared lists get exported into a CSV file and handed over to the Blaise team, who convert them into a (proprietary) format suitable for loading from within the questionnaire.

Each look-up file still contains many items, so the ordering of items matters. When the look-up file opens in Blaise, the cursor needs to be positioned so that the next few products are those most similar to what the coder is looking for.

This ordering has been achieved with a modified k-Nearest Neighbour (k-NN) classification algorithm.
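As a rough illustration of that ordering step, the sketch below ranks previously entered (description, amount) pairs by token-overlap similarity to the coder's entry. This is a much-simplified stand-in for the modified k-NN used in the prototype, and the product data is invented:

```python
# Simplified sketch of ordering a look-up list by similarity to the
# coder's entry. The production prototype uses a modified k-NN inside
# Blaise; the products and amounts below are invented for illustration.

def tokens(text):
    """Lowercase word tokens of a product description."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard similarity between two descriptions."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def order_lookup(query, historical):
    """Sort (description, amount) pairs so the items most similar to
    the query appear first, i.e. nearest the cursor."""
    return sorted(historical,
                  key=lambda rec: similarity(query, rec[0]),
                  reverse=True)

history = [
    ("semi skimmed milk 2 pints", "1.136 l"),
    ("white sliced loaf", "800 g"),
    ("skimmed milk 1 pint", "0.568 l"),
]
print(order_lookup("skimmed milk", history)[0])
```

A real implementation would also weight rarer tokens more heavily and restrict candidates to the relevant COICOP group, but the ranking idea is the same.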



strand B: Using a SOLR-based indexing solution

https://github.com/ONSBigData/LCF-project/tree/master/LCF-shiny



As mentioned above, entering LCF data from diaries into Blaise takes a significant amount of time, and the most time-consuming part is retrieving the amount (weight) information, as it is often missing from the diary or the receipt.


Figure 3. Screenshot of Blaise system

Another solution proposed by the Big Data team was a system using a SOLR-based server to help with automatic COICOP classification and to suggest the most probable weight for an item based on its cost.

SOLR is an open-source, Lucene-based search engine library providing scalable enterprise indexing and search technology. Records created from historical LCF data are first indexed so that they can be retrieved quickly against the requested criteria. By default, SOLR uses a modified TF-IDF method to calculate a similarity score between the query and all available historical LCF data.
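The TF-IDF idea behind that scoring can be shown in a few lines of pure Python. This is illustrative only (Lucene's similarity adds document-length normalisation and other factors), and the documents below are invented:

```python
import math
from collections import Counter

# Minimal TF-IDF scoring sketch: rarer terms get a higher IDF weight, so
# documents matching them score higher. Invented product descriptions.
docs = [
    "semi skimmed milk two pints",
    "whole milk one pint",
    "white sliced bread loaf",
]

def idf(term):
    """Inverse document frequency of a term across the indexed docs."""
    df = sum(term in d.split() for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc):
    """TF-IDF score of a document against a query."""
    tf = Counter(doc.split())
    return sum(tf[t] * idf(t) for t in query.split())

best = max(docs, key=lambda d: score("skimmed milk", d))
print(best)
```

Because "skimmed" occurs in only one historical record, that record wins even though "milk" matches two of them.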

A Shiny app was created to mimic the Blaise system in appearance and functionality, in order to show how this could work from within Blaise.


Figure 4. Shiny app simulating the Blaise interface, using a SOLR backend to predict COICOP codes and propose weights



Setup instructions for installing and configuring SOLR on Ubuntu



SOLR schema currently used for this project:

      <fields>

      <field name="line" type="string" indexed="true" stored="true" required="true"/>
      <field name="coicop" type="integer" indexed="true" stored="true"/>
      <field name="EXPDESC" type="text" indexed="true" stored="true"/>
      <field name="Paid1" type="float" indexed="true" stored="true"/>
      <field name="Shop" type="text" indexed="true" stored="true"/>
      <field name="MAFFQuan" type="float" indexed="true" stored="true"/>
      <field name="MAFFUnit" type="text" indexed="true" stored="true"/>

      </fields>
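With fields indexed under this schema, a coder's query can be issued against Solr's standard `/select` endpoint. The sketch below only builds the request URL; the host and core name `lcf` are assumptions, while the field names come from the schema above:

```python
from urllib.parse import urlencode

# Build a query against Solr's standard /select request handler.
# Host and core name ("lcf") are assumed; the field names (EXPDESC,
# Paid1, coicop, MAFFQuan, MAFFUnit) come from the project schema.
base = "http://localhost:8983/solr/lcf/select"
params = {
    "q": "EXPDESC:(skimmed milk)",              # search the description field
    "fq": "Paid1:[0.5 TO 1.5]",                 # filter by price paid
    "fl": "EXPDESC,coicop,MAFFQuan,MAFFUnit",   # fields to return
    "rows": 10,
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

The `fq` price filter is what lets the prototype narrow the weight suggestion down by item cost.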

addendum I: LCF receipt-scanning Optical Character Recognition (OCR) Shiny app prototype



https://github.com/ONSBigData/LCF-project/tree/master/LCFshinyReceiptOCR

A Shiny app that can run OCR, using the Tesseract OCR library, on a default receipt picture or on any other uploaded picture was created as a starting point for extracting information from receipts into textual form, so that it can be processed, matched, parsed, etc.


Figure 5. LCF Receipt Scanning minimal Shiny Application



Setup instructions for installing the Tesseract library and all other Ubuntu requirements for this app are provided, together with a link to hints, tips and suggestions on how to improve the quality of OCR performed with Tesseract.
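One of the standard tips for improving Tesseract accuracy is binarising the scan before OCR. The sketch below shows a global threshold on a grayscale image represented as a list of pixel rows; it is a simplified illustration only (a real pipeline would use Pillow or OpenCV and an adaptive method such as Otsu's):

```python
# Minimal binarisation sketch: a common preprocessing step before
# running Tesseract on a receipt scan. The image is represented as a
# list of rows of grayscale values (0-255); real pipelines would use
# Pillow/OpenCV and an adaptive threshold.

def binarise(image, threshold=128):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in image]

receipt = [
    [250, 240, 30],   # mostly light background, one dark stroke
    [20, 245, 235],
]
print(binarise(receipt))
```

High-contrast black-on-white input like this generally gives Tesseract far fewer segmentation errors than a noisy grayscale photograph of a receipt.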



addendum II: prototype COICOP classification using a scikit-learn Jupyter notebook



https://github.com/ONSBigData/LCF-project/tree/master/LCF-COICOPclassification

A Jupyter notebook contains three types of scikit-learn classifiers (machine learning algorithms) trained to automatically assign a COICOP code based on a product description:

  • Naive Bayes
  • Support Vector Machines
  • Random Forests
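The first of these approaches can be sketched as a short scikit-learn pipeline: bag-of-words features from product descriptions feeding a multinomial Naive Bayes model. The training examples and COICOP codes below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sketch of the Naive Bayes approach: word-count features from product
# descriptions mapped to COICOP codes. Descriptions and codes invented.
train_desc = [
    "semi skimmed milk 2 pints",
    "whole milk 1 litre",
    "white sliced bread",
    "wholemeal loaf",
]
train_coicop = ["1.1.4", "1.1.4", "1.1.1", "1.1.1"]  # milk vs bread (illustrative)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_desc, train_coicop)
print(model.predict(["skimmed milk"]))
```

Swapping `MultinomialNB` for `LinearSVC` or `RandomForestClassifier` in the same pipeline gives the other two variants explored in the notebook.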

Additionally, a Jupyter notebook contains a Python implementation of the BM25 ranking algorithm used in products such as Apache Lucene and SOLR.
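The core of BM25 can be sketched in a few lines. This is a minimal Okapi BM25 with the usual `k1`/`b` defaults and a Lucene-style IDF, not the notebook's exact code, and the documents are invented:

```python
import math
from collections import Counter

# Minimal Okapi BM25 sketch (Lucene-style IDF, usual k1/b defaults).
# Invented tokenised product descriptions.
docs = [
    "semi skimmed milk two pints".split(),
    "whole milk one pint".split(),
    "white sliced bread loaf".split(),
]
avgdl = sum(len(d) for d in docs) / len(docs)  # average document length

def bm25(query, doc, k1=1.5, b=0.75):
    """BM25 score of one document against a whitespace-split query."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        df = sum(term in d for d in docs)
        if df == 0:
            continue
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
        num = tf[term] * (k1 + 1)
        den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * num / den
    return score

best = max(docs, key=lambda d: bm25("skimmed milk", d))
```

Unlike plain TF-IDF, the `den` term saturates repeated matches and penalises unusually long documents, which is why BM25 is the default ranking function in modern Lucene and SOLR.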



Contributors

Iva Spakulova

Theodore Manassis

Alessandra Sozzi

working for the Office for National Statistics Big Data project

LICENSE

Released under the MIT License.
