ICM Philippines Mapping Project

Setup:

  • Environment Setup:
    • Clone the master branch of the repo
    • Create the conda environment from bash with conda env create -f environment.yml
    • Activate the conda environment from bash with conda activate philmappingenv
    • You're good to go
    • If you add or update packages, then (with the environment activated) add them to environment.yml, save the file, and from bash run conda env update --file environment.yml
  • Contributions:
    • To contribute to this repo, please create a development branch (e.g. myname_dev) and open a PR against master for any contributions:
      • Best practices for PR creation and management:
        • Ensure you're working on the branch you created with (from bash) git checkout myname_dev
        • At this stage your working branch should be exactly in sync with master
        • Make your code changes and, once you're ready to open the PR, execute the following (from bash):
          • git add .
          • git commit -m "<your commit message here>"
          • git push
        • At this point, open GitHub in a browser and go to the "branches" tab of the repo
        • Click "New Pull Request"
        • Fill out details, add a reviewer unless your changes are known to the repo managers, and click through to open the PR
        • After the PR has been merged, execute the following (from bash) to ensure your branch is synced with master:
          • git pull
          • git merge origin/master
          • git push
        • Rinse and repeat; you're good to go

Workstream 1 -- Data Cleaning

Summary:

The goal of this workstream is to use a single source of truth (SSOT) file of province/city/barangay pairings to clean up another file -- also containing these three levels of geographic granularity -- matching as many occurrences in the second (unclean) file as possible to the SSOT file.

Work Drivers:

  • Repo Setup and Environment Management:
    • Create environment management files for the repo
    • Create branching setup for the repo:
      • Create a paul_dev branch to ensure we're working with best SDLC practices
  • Build Geo Label Matching Logic:
    • Import and manage file that screens out non-ICM regions -- raw_data/non_icm_loc.csv:
      • Set Batangas and Bulacan to NOT be removed (done in Excel)
      • Add First, Second, Third, and Fourth (districts of Manila) to the file and set them to be removed (done in Excel)
      • Import the file and drop all NaNs (so that all provinces still listed can be used as a negative screen; see the first code sketch after this list)
    • Create SSOT file:
      • Import raw_data/new_locations.csv file (raw data taken from this source here) and remove all rows that are in a province that is contained in the raw_data/non_icm_loc.csv file accompanied by the value: True (meaning it should be removed as it is not a part of the regions ICM serves)
      • Save out the newly created SSOT file -- processed_data/ssot_df.csv -- for reference and future use
    • Clean up the unclean file -- raw_data/original_locations.csv -- and add new correct geo-mapping fields:
      • As done with the SSOT file, remove all rows from the unclean file that are in a province contained in the raw_data/non_icm_loc.csv file accompanied by the value: True (meaning it should be removed as it is not a part of the regions ICM serves)
      • Create under_construction_df -- a new go-forward DF which will be a copy of the unclean file with the new cleaned columns added
      • Create a new column -- province_cleaned -- to be appended to the under_construction_df, with the correct name for the province associated with each row:
        • Create province_mapping_df, which will eventually serve as a mapping dictionary of unclean-to-clean names, but will start by simply storing all unique province names in the unclean file (see the second code sketch after this list)
        • Iterate over each unique province name in the unclean file, and check if the value in the province column matches a value contained in the province column of the SSOT file (accounting for capitalization differences)
        • After having done the automated matches possible, perform the manual matching necessary based on additional research:
          • Manually match "City of Isabela (Capital)", "Cotabato", and "Davao Occidental"
        • Use the province_mapping_df (and any other custom logic needed) to create the new province_cleaned column
      • Write out processed_data/under_construction_df.csv to log the work done so far
      • City and Barangay are trickier, so we need to solve them together with more detailed analysis:
        • Prior to jumping in, we'll prep by cleaning up some object formatting issues.
        • First we go for all the low-hanging fruit -- cities that we can match to the ssot_df because we can find an exact pairing of province, city, and barangay between the just_geo_names_df and the ssot_df. We'll perform this matching via a left join of the ssot_df onto the just_geo_names_df, and then flag all the rows that were matched successfully with this simple method (see the third code sketch after this list)
        • Create a df with just the geo names we couldn't match to the ssot_df across all 3 geos so we can count the records still left to match. We'll do this multiple times from here on out until we arrive at 0 records we can't match
        • (Round 1 of ad hoc research) Now let's make any fixes we noticed through ad hoc exploration and see how that affects our match rate (the algorithmic fixes from these rounds are gathered in the last code sketch after this list)
        • (Round 1 of ad hoc research) It appears we've spotted one trend that can be corrected algorithmically -- we should look for instances of the city names that use the formulation "CITY OF xxxxxxx" and replace them with the formulation "xxxxxxx CITY"
        • (Round 1 of ad hoc research) Looks like the formulation change from "CITY OF xxxxxxx" to "xxxxxxx CITY" fixed 952 -- (8451-7499) -- records!
        • (Round 2 of ad hoc research) It appears we've spotted another trend that can be corrected algorithmically -- we should look for instances where the barangay name doesn't match the SSOT because of the addition of the word "poblacion" (town proper) or its abbreviation, "(POB.)"
        • (Round 2 of ad hoc research) Looks like the change to strip all barangays of the (POB.) string fixed 1,375 -- (7499-6124) -- records.
        • (Round 3 of ad hoc research) It appears we've spotted another trend that can be corrected algorithmically -- we should look for instances where the barangay name doesn't match the SSOT because the barangay name is "EMPTY", and delete those rows.
        • (Round 3 of ad hoc research) Looks like the change to delete all rows where the "barangay" value was "EMPTY" fixed 203 -- (6124-5921) -- records.
        • (Round 4 of ad hoc research) It appears that my previous logic for cleaning up province names failed slightly, as it didn't account for instances of duplication (i.e. it tagged the province name "LEYTE\n LEYTE" as correct because it does CONTAIN "LEYTE"). It should be a quick fix to manually remove these instances.
        • (Round 4 of ad hoc research) Looks like the change to remedy the duplicated "LEYTE" cleaned province names fixed 1,499 -- (5921-4422) -- records.
        • (Round 5 of ad hoc research) ... whatever I discover next in digging into the problematic_geo_names df to identify trends in mismatches between the under_construction_df and the ssot_df will go here.
      • Write out processed_data/under_construction_df.csv to log the work done so far
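
Code Sketches (illustrative):

The sketches below are rough, hedged illustrations of the steps described in the list above, not the repo's confirmed implementation. The first shows the negative screen and SSOT creation; the column name "province" and the dropna-then-screen flow are assumptions.

```python
import pandas as pd

# Screen file: after dropping NaNs, every province that remains is flagged for
# removal (the raw file marks non-ICM provinces with the value True).
non_icm_df = pd.read_csv("raw_data/non_icm_loc.csv").dropna()
provinces_to_drop = set(non_icm_df["province"].str.upper())

# SSOT: keep only rows of new_locations.csv whose province is not in the negative screen.
new_locations_df = pd.read_csv("raw_data/new_locations.csv")
ssot_df = new_locations_df[
    ~new_locations_df["province"].str.upper().isin(provinces_to_drop)
].copy()
ssot_df.to_csv("processed_data/ssot_df.csv", index=False)
```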
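
The second sketch covers the province-name matching: building province_mapping_df from the unique unclean names, an automated case-insensitive match against the SSOT (a simplified exact match is shown here; the notes above suggest the original logic used a containment check), and the manual overrides. The right-hand values in manual_fixes are placeholders to be filled in from the research noted above.

```python
import pandas as pd

ssot_df = pd.read_csv("processed_data/ssot_df.csv")
under_construction_df = pd.read_csv("raw_data/original_locations.csv")  # post-screen copy

# Start the mapping frame with every unique (unclean) province name.
province_mapping_df = pd.DataFrame(
    {"unclean_name": under_construction_df["province"].unique()}
)

# Automated, case-insensitive match against the SSOT province names.
ssot_provinces = {str(p).upper(): p for p in ssot_df["province"].unique()}
province_mapping_df["clean_name"] = province_mapping_df["unclean_name"].map(
    lambda name: ssot_provinces.get(str(name).strip().upper())
)

# Manual matches found through additional research; the right-hand values are
# placeholders -- fill them in with the corresponding SSOT names.
manual_fixes = {
    "CITY OF ISABELA (CAPITAL)": "<clean SSOT name>",
    "COTABATO": "<clean SSOT name>",
    "DAVAO OCCIDENTAL": "<clean SSOT name>",
}
unmatched = province_mapping_df["clean_name"].isna()
province_mapping_df.loc[unmatched, "clean_name"] = (
    province_mapping_df.loc[unmatched, "unclean_name"].str.upper().map(manual_fixes)
)

# Append the cleaned column and checkpoint the working frame.
under_construction_df["province_cleaned"] = under_construction_df["province"].map(
    dict(zip(province_mapping_df["unclean_name"], province_mapping_df["clean_name"]))
)
under_construction_df.to_csv("processed_data/under_construction_df.csv", index=False)
```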
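
The third sketch shows the "low-hanging fruit" exact match across all three geo levels via a left join of the ssot_df onto just_geo_names_df, plus the count of records still left to match. The column names and the matched flag are assumptions.

```python
import pandas as pd

ssot_df = pd.read_csv("processed_data/ssot_df.csv")
under_construction_df = pd.read_csv("processed_data/under_construction_df.csv")

# Working frame with just the three geo name columns (column names are assumptions).
just_geo_names_df = under_construction_df[
    ["province_cleaned", "city", "barangay"]
].rename(columns={"province_cleaned": "province"})

KEYS = ["province", "city", "barangay"]

def with_keys(df):
    # Upper-cased, stripped join keys absorb capitalization and whitespace differences.
    out = df.copy()
    for col in KEYS:
        out[col + "_key"] = out[col].astype(str).str.strip().str.upper()
    return out

just_geo_names_df = with_keys(just_geo_names_df)
ssot_keys = with_keys(ssot_df)[[c + "_key" for c in KEYS]].drop_duplicates()
ssot_keys["matched"] = True

# Left join of the SSOT onto the working frame; rows with no exact 3-way match stay unmatched.
just_geo_names_df = just_geo_names_df.merge(
    ssot_keys, how="left", on=[c + "_key" for c in KEYS]
)
just_geo_names_df["matched"] = just_geo_names_df["matched"].eq(True)

# Records still left to match -- recount this after every round of ad hoc fixes.
problematic_geo_names_df = just_geo_names_df.loc[~just_geo_names_df["matched"]]
print(f"{len(problematic_geo_names_df)} records still unmatched")
```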
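
The last sketch gathers the four algorithmic fixes from the ad hoc research rounds. The regexes, the "EMPTY" sentinel, and the column names are assumptions inferred from the notes above.

```python
import pandas as pd

df = pd.read_csv("processed_data/under_construction_df.csv")

# Round 1: rewrite "CITY OF XXXXXXX" as "XXXXXXX CITY" so city names line up with the SSOT.
df["city"] = df["city"].str.replace(r"^CITY OF (.+)$", r"\1 CITY", regex=True)

# Round 2: strip the "(POB.)" poblacion marker from barangay names.
df["barangay"] = df["barangay"].str.replace(r"\s*\(POB\.\)", "", regex=True).str.strip()

# Round 3: drop rows whose barangay value is the "EMPTY" placeholder.
df = df[df["barangay"].str.upper() != "EMPTY"]

# Round 4: collapse duplicated cleaned province names such as "LEYTE\n LEYTE" down to "LEYTE".
def dedupe_province(name):
    parts = [p.strip() for p in str(name).splitlines() if p.strip()]
    if parts and all(p == parts[0] for p in parts):
        return parts[0]
    return name

df["province_cleaned"] = df["province_cleaned"].map(dedupe_province)

# Checkpoint the work done so far.
df.to_csv("processed_data/under_construction_df.csv", index=False)
```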

Additional Notes / Exogenous Comments:

  • The file Region-Province-Names.pdf (downloaded from this link here) is the complete official list of the current geography names as of 12/31/2019 per the PSA (Philippine Statistics Authority)
    • We need to follow these names as the official region/province names. Neither original_locations.csv nor new_locations.csv may follow this naming schema, but it is the official Philippine naming convention (SSOT)
  • LUZON, VISAYAS, and MINDANAO are not official regions but rather the 3 main island groups of the Philippines (northern, central, and southern, in that order)
  • We should focus on the Province-City-Barangay match; but Regions can help us subsection the data