Merge branch 'data-wrangling'
basaks committed Apr 23, 2024
2 parents 020c524 + edb5dff commit 97b98bd
Showing 9 changed files with 259 additions and 51 deletions.
23 changes: 23 additions & 0 deletions Dockerfile
@@ -0,0 +1,23 @@
FROM python:3.10-slim
MAINTAINER Sudipta Basak <[email protected]>

WORKDIR /usr/src/uncover-ml

RUN apt update && apt upgrade -y
RUN apt-get install -y --no-install-recommends \
make \
gcc \
libc6-dev \
libopenblas-dev \
libgdal-dev \
libhdf5-dev


RUN apt install git openmpi-bin libopenmpi-dev -y \
&& rm -rf /var/lib/apt/lists/* \
&& alias pip=pip3

RUN pip install -U pip

# RUN ./cubist/makecubist .
# RUN pip install -e .[dev]
24 changes: 24 additions & 0 deletions Dockerfile3p11
@@ -0,0 +1,24 @@
FROM python:3.11-slim
MAINTAINER Sudipta Basak <[email protected]>

WORKDIR /usr/src/uncover-ml

RUN apt update && apt upgrade -y
RUN apt-get install -y --no-install-recommends \
make \
gcc \
libc6-dev \
libopenblas-dev \
libgdal-dev \
libhdf5-dev


RUN apt install git g++ openmpi-bin libopenmpi-dev -y \
&& rm -rf /var/lib/apt/lists/* \
&& alias pip=pip3

RUN pip install -U pip

# RUN ./cubist/makecubist .
# RUN pip install -e .[dev]
164 changes: 164 additions & 0 deletions pbs/READMEPython3p10.md
@@ -0,0 +1,164 @@
# Uncover-ml on the NCI

This README is a quick guide to getting the uncover-ml library up and running
in a PBS batch environment that has MPI support. This setup is common in
HPC systems such as the NCI (raijin).

The instructions below should apply to both single- and multi-node runs
on the NCI. Just set ncpus in the PBS directives in the job submission
script accordingly (e.g. ncpus=32 for 2 nodes).
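For example, the resource request might differ only in its `ncpus` value (a sketch; the memory and walltime values are placeholders):

```bash
#PBS -l ncpus=16,mem=64GB,walltime=01:00:00   # single node
#PBS -l ncpus=32,mem=128GB,walltime=01:00:00  # two nodes
```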

The instructions assume you are using bash shell.

## Pre-installation

These instructions currently only work with gcc and not the Intel compiler.
Note that on NCI it appears python is compiled against gcc anyway.
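If Intel compiler modules are loaded by default in your session, unload them first so that gcc is used (these module names also appear in the example PBS script later in this README):

```bash
$ module unload intel-cc intel-fc
$ module list   # confirm no Intel compiler modules remain
```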

1. Load the modules required for installation and running:
```bash
$ module load python3/3.10.4 gdal/3.5.0 openmpi/4.0.2
```
(Alternatively, you may wish to add the above lines to your ~/.profile)

2. Now add the following lines to the end of your ~/.profile:
```bash
export PATH=$HOME/.local/bin:$PATH
export PYTHONPATH=$HOME/.local/lib/python3.10/site-packages:$PYTHONPATH
export VIRTUALENVWRAPPER_PYTHON=/apps/python3/3.10.4/bin/python3
export LC_ALL=en_AU.UTF-8
export LANG=en_AU.UTF-8
source $HOME/.local/bin/virtualenvwrapper.sh
```

3. Install virtualenv and virtualenvwrapper by running the following command
in the terminal:
```bash
$ pip3 install --user virtualenv virtualenvwrapper
```

4. Refresh your environment by reloading your profile:
```bash
$ source ~/.profile
```

## Installation

1. Create a new virtualenv for uncoverml:
```bash
$ mkvirtualenv --system-site-packages uncoverml
```

2. Make sure the virtualenv is activated:
```bash
$ workon uncoverml
```

3. Clone the uncoverml repo into your home directory:
```bash
$ cd ~
$ git clone git@github.com:GeoscienceAustralia/uncover-ml.git
```

4. Install mpi4py
```bash
$ pip install --no-cache-dir mpi4py==3.1.3 --no-binary=mpi4py
```

5. Install uncoverml:
```bash
$ cd uncover-ml
$ python setup.py install
```

6. Once the installation has completed, you can run the tests to verify that
everything installed correctly:
```bash
$ pip install pytest
$ py.test ~/uncover-ml/tests/
```

## Updating the Code
To update the code, first make sure you are in the `uncoverml` virtual environment:
```bash
$ workon uncoverml
```
Next, pull the latest commit from the master branch, and install:
```bash
$ cd ~/uncover-ml
$ git pull origin
$ python setup.py install
```
If the pull and the installation complete successfully, the code is ready to run!


## Running Batch Jobs

In the `pbs` subfolder of uncover-ml there are some example scripts and a
helper function to assist with launching batch jobs over multiple nodes with PBS.

### Batch testing

To check everything is working, submit the tests as a batch job:
```bash
$ cd ~/uncover-ml/pbs
$ qsub submit_tests.sh
```

### MPIRun

`uncoverml` uses MPI internally for parallelization. To run a script or demo,
simply run:

```bash
$ mpirun -n <num_procs> <command>
```

whilst a PBS job submission might look like this:

```bash
#!/bin/bash
#PBS -P ge3
#PBS -q normal
#PBS -l walltime=01:00:00,mem=128GB,ncpus=32,jobfs=20GB
#PBS -l wd

# setup environment
module unload intel-cc
module unload intel-fc
module load python3/3.10.4 gdal/3.5.0 openmpi/4.0.2
source $HOME/.profile

# start the virtualenv
workon uncoverml

# run command
mpirun --mca mpi_warn_on_fork 0 uncoverml learn national_gamma_no_zeros.yaml -p 10
mpirun --mca mpi_warn_on_fork 0 uncoverml predict national_gamma_no_zeros.model -p 40
```

where in this case mpirun is able to determine the number of available
cores via PBS. This job submits the `learn` and `predict` jobs one
after the other. The `-p 10` and `-p 40` options partition the input
covariates into 10 and 40 partitions during the learning and prediction jobs,
respectively.

### PBS job configuration
[Refer to the NCI user guide](https://opus.nci.org.au/display/Help/Raijin+User+Guide)
for different cpu and memory configuration options.

Specifically, [this section](https://opus.nci.org.au/display/Help/Raijin+User+Guide#RaijinUserGuide-QueueLimits)
details the various combinations available. We recommend using the `normal`,
`express`, `normalbw`, or the `expressbw` option with the required memory.

### Running the demos
In the pbs folder there are two scripts called `submit_demo_predicion.sh`
and `submit_demo_learning.sh`, each of which submits a batch job to PBS that
uses mpirun and ipympi to run the demos. Feel free to modify the PBS directives
as needed, or copy these scripts to a more convenient location.
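Assuming the script names above, submitting the demos might look like:

```bash
$ cd ~/uncover-ml/pbs
$ qsub submit_demo_learning.sh
$ qsub submit_demo_predicion.sh
```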






27 changes: 0 additions & 27 deletions scripts/ceno_stuff.py → scripts/blend_rasters_smooth_weighted.py
@@ -130,30 +130,3 @@

with rio.open(f'MRVBF_S_8_5_blend_dist_meso_weighted_average_quadratic_bigtiff.tif', 'w', ** profile) as dst:
dst.write(output)


# [Yesterday 17:31] John Wilford
#
# Im preparing another blend like the example below near_rough = "DEM_fill_smooth_mag_2_cog_nan.tif"
#
# far_smooth = "/g/data/jl14/new_ceno_inputs/zero_blend.tif"
#
# scale = "/g/data/jl14/80m_covarites/proximity/P_Dist_Meso.tif"
#
# [Yesterday 17:32] John Wilford
#
# near ==/g/data/jl14/80m_covarites/terrain/T_MRVBF_S.tif
#
# [Yesterday 17:52] John Wilford
#
# far_smooth == /g/data/jl14/new_ceno_inputs/five_blend_float.tif
#
# [Yesterday 17:52] John Wilford
#
# scale = "/g/data/jl14/80m_covarites/proximity/P_Dist_Meso.tif"
#
# [Yesterday 17:53] John Wilford
#
# also please try it again with far_smooth == /g/data/jl14/new_ceno_inputs/six_blend_float.tif
#
# like 1
52 changes: 36 additions & 16 deletions scripts/dedupe_shape.py
@@ -2,21 +2,31 @@
from joblib import Parallel, delayed
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
import rasterio
import geopandas as gpd


def dedupe_raster(shp: Path, tif: Path, deduped_shp: Path):
geom_cols = ['POINT_X', 'POINT_Y']
rows_and_cols = ["rows", "cols"]
fixed_cols = geom_cols + rows_and_cols
fixed_cols_set = set(fixed_cols)


def dedupe_shape(shp: Path, tif: Path, deduped_shp: Path):
"""
:param shp: input shapefile with dense points
:param tif: sample tif to read resolution details
:param deduped_shp: output shapefile with one point per down-sampled raster resolution
:return:
"""
print("====================================\n", f"deduping {shp.as_posix()}")
geom_cols = ['POINT_X', 'POINT_Y']
pts = gpd.read_file(shp)
for g in geom_cols:
non_number_cols = [c for c, t in zip(pts.dtypes.index.to_list(), pts.dtypes.values) if not is_numeric_dtype(t)]  # includes the geometry column
number_cols = [c for c, t in zip(pts.dtypes.index.to_list(), pts.dtypes.values) if is_numeric_dtype(t)]
cols = non_number_cols + number_cols
pts = pts[cols]
for g in fixed_cols:
if g in pts.columns:
pts = pts.drop(g, axis=1)
coords = np.array([(p.x, p.y) for p in pts.geometry])
@@ -40,10 +50,10 @@ def dedupe_raster(shp: Path, tif: Path, deduped_shp: Path):
)
pts["rows"], pts["cols"] = rasterio.transform.rowcol(transform, coords[:, 0], coords[:, 1])

pts_count = pts.groupby(by=['rows', 'cols'], as_index=False).agg(pixel_count=('rows', 'count'))
pts_mean = pts.groupby(by=['rows', 'cols'], as_index=False).mean()
pts_deduped = pts_mean.merge(pts_count, how='inner', on=['rows', 'cols'])

pts_count = pts.groupby(by=rows_and_cols, as_index=False).agg(pixel_count=('rows', 'count'))
pts_mean = pts.groupby(by=['rows', 'cols'], as_index=False).apply(custom_func, numbers_cols=number_cols,
non_number_cols=non_number_cols)
pts_deduped = pts_mean.merge(pts_count, how='inner', on=rows_and_cols)
pts_deduped = gpd.GeoDataFrame(pts_deduped,
geometry=gpd.points_from_xy(pts_deduped['POINT_X'], pts_deduped['POINT_Y']),
crs="EPSG:3577" # Australian Albers
@@ -52,19 +62,29 @@ def dedupe_raster(shp: Path, tif: Path, deduped_shp: Path):
return pts_deduped


def custom_func(group, numbers_cols, non_number_cols):
output = {**group.iloc[0, :][non_number_cols + fixed_cols]}
for col in numbers_cols:
arr = group.loc[:, col]
output[col] = np.mean(arr[arr != -9999])
gdf = pd.DataFrame.from_dict({k: [v] for k, v in output.items()})
# gdf.index = group.index # this line sometimes did not work on large shapefiles from John
return gdf


if __name__ == '__main__':
shapefiles = Path("configs/data/")
downscale_factor = 6 # keep 1 point in a 6x6 cell
downscale_factor = 1  # keep 1 point per downscale_factor x downscale_factor block of cells

dem = Path('/home/my_dem.tif')
dem = Path('configs/data/LATITUDE_GRID1.tif')
output_dir = Path('1in6')
output_dir.mkdir(exist_ok=True, parents=True)

# for s in shapefiles.glob("*.shp"):
# deduped_shp = output_dir.joinpath(s.name)
# dedupe_raster(shp=s, tif=dem, deduped_shp=deduped_shp)
for s in shapefiles.glob("geochem_sites.shp"):
deduped_shp = output_dir.joinpath(s.stem + "_mean_removed_9999.shp")
dedupe_shape(shp=s, tif=dem, deduped_shp=deduped_shp)

Parallel(
n_jobs=-1,
verbose=100,
)(delayed(dedupe_raster)(s, dem, output_dir.joinpath(s.name)) for s in shapefiles.glob("geochem_sites.shp"))
# Parallel(
# n_jobs=-1,
# verbose=100,
# )(delayed(dedupe_raster)(s, dem, output_dir.joinpath(s.name)) for s in shapefiles.glob("geochem_sites.shp"))
5 changes: 4 additions & 1 deletion setup.py
@@ -30,6 +30,8 @@ def build_cubist():
print(out)
except:
out = subprocess.run(['./cubist/makecubist', '.'])

subprocess.check_output("git config --global --add safe.directory /usr/src/uncover-ml", shell=True)
git_hash = subprocess.check_output(['git', 'rev-parse',
'HEAD']).decode().strip()
with open('uncoverml/git_hash.py', 'w') as f:
@@ -85,7 +87,7 @@ def run(self):
'pycontracts == 1.7.9',
'tables >= 3.2.2',
'rasterio == 1.3.7',
'catboost == 1.0.3',
'catboost >= 1.2.1',
'affine >= 2.2.1',
'pyshp == 2.1.0',
'click >= 6.6',
@@ -111,6 +113,7 @@ def run(self):
"imageio==2.9.0",
"optuna==3.2.0",
"seaborn==0.13.0",
"lightgbm==3.3.5",
],
extras_require={
'kmz': [
6 changes: 4 additions & 2 deletions tests/test_models.py
@@ -7,7 +7,7 @@
from uncoverml.models import (apply_masked,
apply_multiple_masked,
modelmaps)
from uncoverml.optimise.models import transformed_modelmaps
from uncoverml.optimise.models import transformed_modelmaps, no_test_support_classifiers

models = {**transformed_modelmaps, **modelmaps}

@@ -101,7 +101,9 @@ def test_trasnsformed_model_attr(get_transformed_model):
'multicubist',
'decisiontree',
'extratree',
'catboost'
'catboost',
'svrmulti',
* list(no_test_support_classifiers.keys())
]])
def models_supported(request):
return request.param