The confidence-based geotagging model has achieved the following results:
- Median Average Error (Haversine Distance) of 4.97km
- Mean Average Error (Haversine Distance) of 541.92km
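For reference, here is a minimal sketch of how Haversine-based error metrics like the ones above can be computed. It is not taken from this repository: the function names, the Earth radius constant, and the [longitude, latitude] argument order are assumptions made for illustration.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not a value from this repo


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def error_summary(predicted, actual):
    """Return (median, mean) Haversine error for lists of [lon, lat] pairs."""
    errors = [haversine_km(p[0], p[1], t[0], t[1])
              for p, t in zip(predicted, actual)]
    return median(errors), mean(errors)
```

A mean far above the median, as in the results above, generally indicates that a small share of predictions with very large errors dominates the average.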
To train your model, follow the instructions in training.ipynb or run it with train.py in the terminal:
python train.py --data_dir <data_path> --save_prefix <model_path> --arch char_lstm --split_uids \
--batch_size 128 --loss l1 --optimizer adamw --scheduler constant --lr 5e-4 --num_epoch 10 \
--conf_estim --confidence_validation_criterion
The full list of arguments is in aux_files/args_parser.py.
pytorch == 1.7.1
numpy == 1.19.2
scikit-learn == 0.22.2
tqdm == 4.62.3
pandas == 1.0.3
To use a custom dataset, you will need to implement a Dataloader in data_loading.py. This Dataloader must return a list of texts and a list of coordinates [longitude, latitude]. Then, add the result to the get_dataset method in aux_files/args_parser.py, and you'll be able to select it with the dataset_name argument.
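As a rough illustration of the return format described above, a loader could look like the sketch below. The function name `load_custom_dataset` and the CSV column names `text`, `lon`, and `lat` are hypothetical. The actual interface expected by data_loading.py and get_dataset may differ, so check those files before adapting it.

```python
import csv


def load_custom_dataset(csv_path):
    """Hypothetical loader: read a CSV and return (texts, coordinates).

    Assumes columns named "text", "lon" and "lat"; these names are
    illustrative and not defined by this repository.
    """
    texts, coordinates = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row["text"])
            # Coordinates are ordered [longitude, latitude], as the README requires.
            coordinates.append([float(row["lon"]), float(row["lat"])])
    return texts, coordinates
```

Note the [longitude, latitude] order: many data sources store latitude first, so the columns may need to be swapped on load.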
For relevant datasets, please check the Existing Challenges section.
To use confidence estimation, set the conf_estim and confidence_validation_criterion arguments to True. You can set the array to model_save_band to show the top predictions by confidence_bands (as a percentage from 0 to 100). Use model_save_band to save the model by the best metric value for the selected band.
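To make the idea of confidence bands concrete, the sketch below reports the median error over the top p% most confident predictions for each band p. It is an illustration only: the function `median_error_by_band` and its inputs are hypothetical, and the repository may compute its bands differently.

```python
import math
from statistics import median


def median_error_by_band(errors_km, confidences, bands=(10, 50, 100)):
    """Median error over the top p% most confident predictions, per band p."""
    if not errors_km or len(errors_km) != len(confidences):
        raise ValueError("errors_km and confidences must be non-empty and equal in length")
    # Rank errors from the most to the least confident prediction.
    ranked = [err for _, err in sorted(zip(confidences, errors_km),
                                       key=lambda pair: pair[0],
                                       reverse=True)]
    results = {}
    for p in bands:
        # Keep at least one prediction even for very narrow bands.
        k = max(1, math.ceil(len(ranked) * p / 100))
        results[p] = median(ranked[:k])
    return results
```

If the confidence estimate is informative, narrower bands should show a lower median error than the full set (band 100).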
An example of using the trained model is in prediction.ipynb.
The first suggested approach to training the model (Challenge 1) is to look into the dataset of the most populated regions around the world.
The provided dataset is here, which:
- is an annotated corpus of 500k texts, as well as the respective geocoordinates
- covers 123 regions
- includes 5000 tweets per location
Challenge 2 aims to identify the correlation between the time and date of a post, its content, and its location.
Time zone differences, as well as the seasonality of events, should be analyzed and used to predict the location. For example, snow is more likely to appear in the Northern Hemisphere, especially in December. Rock concerts are more likely to happen in the evening and in bigger cities, so the time of a post about a concert can be used to identify the author's time zone and narrow down the list of potential locations.
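The concert example can be sketched as a simple filter over UTC offsets. This is a toy illustration under stated assumptions, not part of the repository: the function `candidate_utc_offsets`, the whole-hour offsets, and the default 19:00 to 23:00 "evening" window are all hypothetical, and the window is assumed not to wrap past midnight.

```python
def candidate_utc_offsets(post_hour_utc, local_window=(19, 23)):
    """Whole-hour UTC offsets whose local hour falls inside local_window.

    post_hour_utc: hour of the post in UTC (0-23).
    local_window: inclusive (start, end) local hours, with start <= end.
    """
    start, end = local_window
    offsets = []
    for offset in range(-12, 15):  # UTC-12 through UTC+14
        local_hour = (post_hour_utc + offset) % 24
        if start <= local_hour <= end:
            offsets.append(offset)
    return offsets
```

For a concert post made at 02:00 UTC, `candidate_utc_offsets(2)` keeps only offsets UTC-7 through UTC-3, which already rules out most of the world's time zones.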
The provided dataset is here, which:
- is a .json file of more than 600,000 texts
- collected over the span of 12 months
- covers 15 different time zones
- focuses on 6 countries (Cuba, Iran, Russia, North Korea, Syria, Venezuela)
If you have any questions, concerns, or feedback, you can email us at [email protected].
You can also check out our site, yachay.ai, or any of our socials below.