In collaboration with the North Carolina Department of Transportation (NCDOT), the UNC Highway Safety Research Center (HSRC), and the US DOT Volpe National Transportation Systems Center, we at the UNC Renaissance Computing Institute (RENCI) developed roadside feature models using convolutional neural networks (CNNs) in an iterative active learning (AL) pipeline, integrated into an AI tool that detects safety features such as guardrails and utility poles along North Carolina's dispersed rural roads. We used transfer learning to extract a common feature backbone, which was then used in an iterative AL process supported by a web-based annotation tool. The annotation tool not only allows us to collect annotations through an iterative AL process for multiple safety features, but also enables visual analysis and assessment of model prediction performance in a geospatial context. AL techniques were used to direct human annotators to label the images that would most effectively improve the model, minimizing the number of required training labels while maximizing the model's performance. The iterative AL process with a common feature extraction backbone allowed fast model inference on millions of images in the AL sampling space, enabling rapid transitions between AL rounds. Model weights were then fine-tuned in the last AL round to obtain the best accuracy for the final model. Our AI tool can be used to detect other roadside safety features and can be extended to also locate them for assessing roadside hazard rating.
To use our final models obtained through the iterative AL process directly or as pretrained models for transfer learning, you can download them here:
The models are in HDF5 format and can be loaded directly into TensorFlow. To load the models into other machine learning frameworks such as PyTorch, you will need to first load the Xception CNN architecture with pretrained weights into a PyTorch model, then load our model weights onto the PyTorch model. Refer to this post for code examples and details.
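For example, a minimal sketch of loading one of the downloaded models into TensorFlow 2 (the file name `final_model.h5` is a placeholder for whichever model you downloaded):

```python
import tensorflow as tf

# Load the downloaded HDF5 model; the file name is a placeholder.
model = tf.keras.models.load_model("final_model.h5")
model.summary()  # inspect the Xception-based architecture

# Use the model directly for prediction, or as a pretrained starting
# point for further transfer learning / fine-tuning.
```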
The NCDOT videolog image data captured for primary and secondary roads across 14 divisions in 2018 and 2019 were used to train, validate, and test our models. The original videolog images have varying resolutions within and across divisions; for example, two images from the same division have resolutions of 2356x1200 and 2748x2198, respectively. We resized the original images to a uniform resolution of 299x299 and normalized image intensities to the range [0, 1] before feeding the images to model training, validation, testing, and prediction. For more details on the image data, including how the images are organized, refer to this readme.
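As a rough illustration of this preprocessing (a sketch assuming Pillow and NumPy; the file name is a placeholder, and the repo's own data pipeline may differ in details):

```python
import numpy as np
from PIL import Image

def preprocess(path):
    """Resize an image to 299x299 and scale intensities to [0, 1]."""
    img = Image.open(path).convert("RGB").resize((299, 299))
    return np.asarray(img, dtype=np.float32) / 255.0

x = preprocess("videolog_image.jpg")  # placeholder file name
print(x.shape, x.min(), x.max())      # (299, 299, 3), values within [0, 1]
```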
The web-based annotation tool is implemented using the Docker platform, the Django web framework on the server backend, and the React JavaScript library on the client frontend. iRODS is also used as an optional middleware component to manage and transfer images for local development on any personal computer. If the tool is deployed on a server with the data volume mounted directly, iRODS can be easily turned off by setting the `USE_IRODS` environment variable to `False`.
A development database SQL file is included in this repo and is used to ingest test data into the PostgreSQL database backend for the annotation tool. Data ingestion is handled by the automated server deployment script, specifically this entrypoint.sh file, which runs automatically when the up.sh script builds and brings up the annotation tool in a local development environment. The database schema can also be derived from this development database SQL file. In a production environment, you can use the development database SQL file to stage the backend database, delete the test data, and ingest production data by running the management command scripts in this directory; each management command script has help documentation that includes how to run it. See the loading metadata into the database script as an example. After the new data is ingested into the production database, you can run dump_db.sh to back up the database into a SQL file, named pg.production.sql for the production database and pg.develop.sql for the development database. This backup SQL file is then used to ingest data into the server database when rebuilding the server for a new deployment by running up.sh for a development environment or up_prod.sh for a production environment.
The following sections are aimed at developers interested in working on the code. They provide guidance for setting up the Docker-based Django server and React-based client development and deployment environments for the annotation tool, as well as for running the offline data analysis and machine learning code that forms the active learning pipeline tightly integrated with the annotation tool.
This section provides guidance for setting up the Docker-based Django server and React-based client development and deployment environments for the annotation tool.
Docker and Docker Compose need to be installed. On Windows 10 and above, native Docker may be installed and used. Otherwise, a Linux VM is needed. In addition, Node.js needs to be installed to use npm and webpack for client code development and deployment.
- `git clone` the source code from this repo recursively to include the client code submodule:

  ```
  git clone --recursive https://github.com/RENCI/ncdot-road-safety.git
  cd ncdot-road-safety
  ```
- If the iRODS server (an optional middleware component) is installed to manage and transfer images for local development on a personal computer that cannot access the image data directly, you will need to create a file named `local_settings.py` in the `server/road_safety` directory and put the sensitive iRODS server access credentials in it. An example `local_settings.py` that can be used as a template is provided below.

  ```python
  # iRODS server configuration info
  IRODS_ROOT = '/tmp'
  IRODS_ICOMMANDS_PATH = '/usr/bin'
  IRODS_HOST = 'host_name.research.org'
  IRODS_PORT = 1247
  IRODS_USER = 'irods_proxy_user_name'
  IRODS_PWD = 'irods_proxy_user_password'
  IRODS_ZONE = 'irods_zone_name'
  IRODS_RESC = 'irods_resource_name'
  ```
  In addition, change `UID` (default is `1000`) and `GID` (default is `1000`) in `server/docker-compose.yml` as needed to correspond to the uid and gid of the user on the host who runs the Docker containers for the tool. The default values should cover most if not all cases in a local development environment.
- If the image data can be accessed directly in a directory from the web server in a deployment environment, iRODS can be turned off by setting the `USE_IRODS` environment variable to `False`. You can create an environment file named `.env.prod`, which is set as `env_file` in `docker-compose-prod.yml`. An example `.env.prod` that can be used as a template is provided below.

  ```
  DEBUG=False
  SECRET_KEY=secret_key_signature_created_to_be_used_by_django
  PGDATABASE=postgres_database_name
  PGUSER=postgres_db_user_name
  PGPASSWORD=postgres_db_user_password
  USER_ID=user_id_on_host
  GROUP_ID=group_id_on_host
  SSL_CERT_DIR=ssl_cert_directory
  ACCOUNTS_APPROVAL_REQUIRED=True
  EMAIL_HOST_USER=email_user_name_for_approving_user_account
  EMAIL_HOST_PWD=email_user_password_for_approving_user_account
  EMAIL_HOST=email_host_name
  EMAIL_PORT=587
  USE_IRODS=False
  IMAGE_ROOT=image_data_root_directory_on_the_host_to_be_mounted_on_container
  DEFAULT_FROM_EMAIL=email_user_used_for_default_from_email
  EMAIL_ADMIN_LIST=admin_email_address_1---admin_email_address_2
  IPAM_CONFIG_SUBNET=xxx.xxx.0.0/28
  ```
- Run the following commands:

  ```
  cd ncdot-road-safety-client
  npm install
  ```
- For local development with debugging turned on, run `npm run dev`; for production deployment, run `npm run production`.
- To collect the built client bundle file `index_bundle.js` as static files to be served from the server, `cd` to the `server` directory first, then run the `./collect.sh` script.
- From the `server` directory of the source tree, run `./up.sh` to build all containers for the local development environment, or run `./up_prod.sh` to build all containers for a production or test server environment, where `.env.prod` needs to be set up and loaded by docker-compose.
- At this point you should be able to open your browser to the tool home page: http://localhost:8000 for a local development environment, or http://192.168.56.101:8000/ from the host if a host-only adapter is set up in VirtualBox for the Linux VM running on a Windows box. For a production or test server environment where SSL is enabled, go to https://host.server.address in your browser to access the tool.
- From the `server` directory of the source tree, run `./down.sh` to bring down and clean up all containers. Alternatively, you can run `docker-compose stop` followed by `docker-compose up` if you want to keep the state of all containers for continuous development.
- `docker-compose up` --- bring up all containers
- `docker-compose stop` --- stop all containers
- `docker-compose ps` --- check the status of all containers
- `docker rm -fv $(docker ps -a -q)` --- remove all containers
- `docker rmi -f <image_id>` --- remove an image, where `<image_id>` is the id of the image you want to remove, as output by the `docker images` command
We run data processing/analysis and machine learning training/inference offline, tightly integrated with the annotation tool, in an iterative active learning process. To run the data processing/analysis and machine learning/inference scripts, we recommend setting up a conda environment with all dependency libraries installed. We ran all scripts in a conda environment with Python 3.8; refer to the conda environment setup instructions for details. Within the conda environment, TensorFlow 2 with GPU support needs to be installed. For example, it may be installed by running the commands below:
```
conda install cudatoolkit
pip install tensorflow-gpu
conda install -c conda-forge cudnn
```
In addition, numpy, scipy, matplotlib, pandas, scikit-learn, pillow, dask, and fastparquet are used by some data processing and analysis scripts and may need to be installed.
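Once the environment is set up, a quick sanity check (a generic sketch, nothing project-specific) that TensorFlow can see the GPU:

```python
import tensorflow as tf

# Print the TensorFlow version and any GPUs visible to it; an empty list
# means GPU support is not correctly configured.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))
```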
The following steps summarize how to run the active learning pipeline.
- Use the annotation tool to collect user annotations for images sampled via an active learning sampling strategy.
- Output user annotations from the server by running `docker exec dot-server python manage.py output_image_info_for_al <feature_name> metadata/user_annots.csv`, where `<feature_name>` is the feature that has been annotated, such as guardrail or pole, and `metadata/user_annots.csv` is the CSV file the command will output the collected annotations to.
- Output the remaining image base names, to be used for creating image uncertainty measures for the next round, by running `docker exec -ti dot-server python manage.py output_image_base_names metadata/remain_image_base_names.csv`.
- Prepare images for active learning by running `python prepare_images_for_active_learning.py --input_file <user_annots.csv> --prior_input_file <all_user_annots_from_prior_al_round> --all_annot_file <all_user_annots.csv> --cur_round <cur_al_round_number> --feature_name <guardrail_or_pole_or_others> --root_dir <root_dir_to_create_al_data> --exist_train_yes_file <existing_positive_train_data_to_add_to_al> --exist_train_no_file <existing_negative_train_data_to_add_to_al>` from the `machine-learning` subdirectory. Note that some other parameters can be set from the command line with values different from the defaults; refer to the help comment for each supported parameter in the code for details. For example, `--is_unbalanced` can be passed on the command line to make the script prepare the data for active learning without taking actions to balance the data, e.g., undersampling the majority class to balance it with the minority class instances.
- Optionally run `python create_class_weights.py` for the data directory created by the step above if the `is_unbalanced` flag is on. This outputs class weights for the unbalanced training data, to be used when computing the loss in model training, giving minority class instances more weight than majority class instances (see the class-weights sketch after this list).
- Run active learning to refine the model from the `machine-learning` directory by running `nohup python active_learning.py --train_dir <train_data_dir> --val_dir <validation_data_dir> --test_dir <holdout_test_data_dir> --model_file <input_base_model_to_further_train> --output_model_file <output_model_file> --class_weights {0: 0.59, 1: 3.12} &`, where the `class_weights` dictionary is passed to the `fit()` function for unbalanced training data. For balanced data, `class_weights` can be passed as `{0: 1, 1: 1}` to give both classes the same weight. Several additional parameters, such as `num_of_epoch`, `batch_size`, `make_inference_only`, and `fine_tune_all_weights`, can be overridden from the command line as well; refer to the help comment for other supported parameters in the code for details. Note that an early stopping callback is used in active learning model training and only the best model with the smallest loss is saved at the end of each epoch, so use the saved best model if early stopping takes effect. Also run `python compute_model_classification_report.py` to create a classification report of the best saved model on the balanced holdout test set, as needed, if early stopping takes effect.
- Split the model by running `python split_model_into_feature_and_head.py`, overriding default parameters as needed, for fast inference on the whole active learning sample pool with the common feature extraction backbone fixed.
- Run fast model prediction from feature vectors with `nohup python model_predict_features.py`, overriding default parameters as needed.
- Evaluate model performance and create image uncertainty scores using active learning sampling strategies. For example, run `data_processing/create_uncertainty_scores.py`, overriding default parameters as needed, to create uncertainty scores based on uncertainty sampling (see the uncertainty sketch after this list). For unbalanced data sampling, such as active learning sampling for the guardrail feature, a similarity-based sampling strategy in the feature embedding space can be used by running `data_processing/compute_centroid_of_features.py` to create an updated centroid of the training data with the current round's annotation train data included, then running `data_processing/create_similiarity_scores.py` to create similarity scores, then running `data_processing/analyze_similarity_and_prediction.py` to analyze relationships between similarity scores and model predictions and create uncertainty measures and groups (see the similarity sketch after this list).
- Load the uncertainty scores into the annotation tool database by running `docker exec dot-server python manage.py load_uncertainty_measures metadata/image_uncertainty_scores_round1.csv guardrail`, followed by `docker exec dot-server python manage.py create_uncertainty_groups guardrail 500` to create uncertainty groups for speeding up uncertainty-measure-based image queries.
- Load the latest model predictions into the annotation tool for diagnostic visualization and analysis by running `docker exec dot-server python manage.py update_ml_predict <prediction_csv_file>`.
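The class-weights sketch referenced above is a hedged illustration of the inverse-frequency weighting idea, not the repo's `create_class_weights.py` itself; the counts below are placeholders.

```python
# Hypothetical counts for an unbalanced binary dataset (placeholders).
n_negative, n_positive = 84000, 16000
total = n_negative + n_positive

# weight_c = total / (num_classes * count_c) keeps the average weight near 1
# while upweighting the minority class in the training loss.
class_weights = {
    0: total / (2 * n_negative),  # ~0.60 for the majority class
    1: total / (2 * n_positive),  # ~3.13 for the minority class
}
print(class_weights)  # e.g., pass as class_weight to tf.keras Model.fit()
```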
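The uncertainty sketch referenced above is a minimal take on uncertainty sampling for a binary classifier (an illustration of the general strategy, not the exact logic of `create_uncertainty_scores.py`): predictions nearest 0.5 are the most uncertain and the most valuable to annotate next.

```python
import numpy as np

# Placeholder model predictions (positive-class probabilities).
probs = np.array([0.02, 0.48, 0.51, 0.97])

# 1.0 means maximally uncertain (p == 0.5); 0.0 means fully confident.
uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)
ranked = np.argsort(-uncertainty)  # most uncertain images first
print(uncertainty, ranked)
```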
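The similarity sketch referenced above shows one way to score pool images by cosine similarity to the centroid of positive training features in the embedding space (file names and array shapes are assumptions, not the exact logic of the `data_processing` scripts).

```python
import numpy as np

# Placeholder feature files: (n, d) training features, (m, d) pool features.
train_feats = np.load("train_positive_features.npy")
pool_feats = np.load("pool_features.npy")

# Cosine similarity between each pool feature vector and the unit-norm
# centroid of the positive training features.
centroid = train_feats.mean(axis=0)
centroid /= np.linalg.norm(centroid)
pool_unit = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
similarity = pool_unit @ centroid  # higher = more like known positives
```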
Running the steps above gets the annotation tool ready for another round of active learning.