
# Breviloquia Italica: data pipeline


This repository contains the full source code for the data pipeline of the Breviloquia Italica project.

## Description

The pipeline is organized into a series of numbered scripts subdivided into the stages of preparation, transformation, selection, annotation and export. Their dependencies are encoded in the `Makefile`, which can also be executed to build specific outputs of the pipeline. Here is a dependency graph depicting inputs, scripts and outputs:

```mermaid
flowchart TD;

subgraph P1 [PREPARATION]

00[00_unpack-data.sh]:::code;
10[10_extract-places.sh]:::code;
11[11_extract-tweets.sh]:::code;
12[12_flatten-tweets.sh]:::code;
20[20_cleanup-places.py]:::code;
21[21_cleanup-tweets.py]:::code;

2022-MM-DD.jsonl[/data/2022-MM-DD.jsonl/]:::data;
places.jsonl[/places.jsonl/]:::data;
places.parquet[/places.parquet/]:::data;
tweets.jsonl[/tweets.jsonl/]:::data;
tweets.parquet[/tweets.parquet/]:::data;
tweets.csv[/tweets.csv/]:::data;

data.zip[/data.zip/]:::extdata --- 00;
00 --> 2022-MM-DD.jsonl;

2022-MM-DD.jsonl --- 10 --> places.jsonl;
places.jsonl --- 20 --> places.parquet;

2022-MM-DD.jsonl --- 11 --> tweets.jsonl;
tweets.jsonl --- 21 --> tweets.parquet;

2022-MM-DD.jsonl --- 12 ----> tweets.csv;

end

subgraph P2 [TRANSFORMATION]

30[30_tokenize-tweets.py]:::code;
31[31_locate-tweets.py]:::code;

tweets-tok.parquet[/tweets-tok.parquet/]:::data;
tweets-geo.parquet[/tweets-geo.parquet/]:::data;

_02[/italy-regions.geojson/]:::extdata --- 31;
places.parquet --- 31;
tweets.parquet --- 31;
31 --> tweets-geo.parquet;

tweets.parquet --- 30;
30 --> tweets-tok.parquet;

end

subgraph P3 [SELECTION]

40[40_compute-wforms-occ.py]:::code;
41[41_compute-wforms-usr.py]:::code;
42[42_compute-wforms-bat.py]:::code;

wforms-occ.parquet[/wforms-occ.parquet/]:::data;
wforms-usr.parquet[/wforms-usr.parquet/]:::data;
wforms-bat.parquet[/wforms-bat.parquet/]:::data;

tweets-tok.parquet --- 40 --> wforms-occ.parquet;

tweets-tok.parquet --- 41;
tweets.parquet --- 41;
41 --> wforms-usr.parquet;

wforms-occ.parquet --- 42;
_03[/attested-forms.csv/]:::extdata --- 42;
wforms-usr.parquet --- 42;
42 --> wforms-bat.parquet;

end

subgraph P4 [ANNOTATION]

50[50_export-ann-batches.py]:::code;
51[[51_process-ann-batches.md]]:::code;
52[52_import-ann-batches.py]:::code;

wforms-ann-batch-N.csv[/"wforms-ann-batch-{1,2}.csv"/]:::data
wforms-ann-batch-N.gsheet.csv[/"wforms-ann-{batch-1,batch-2,patches}.gsheet.csv"/]:::extdata;
wforms-ann.parquet[/wforms-ann.parquet/]:::data;
wforms-ann.csv[/wforms-ann.csv/]:::data;

wforms-bat.parquet --- 50;
50 -.- wforms-ann.parquet;
50 --> wforms-ann-batch-N.csv;

wforms-ann-batch-N.csv --- 51;
tweets.csv --- 51;
51 --> wforms-ann-batch-N.gsheet.csv;

wforms-ann-batch-N.gsheet.csv --- 52;
52 --> wforms-ann.parquet;
52 --> wforms-ann.csv;

end

subgraph P5 [EXPORT]

60[60_export-tweets-ids.sh]:::code;
61[61_export-occurrences.py]:::code;

tweets-ids.csv[/tweets-ids.csv/]:::data;
occurrences.csv[/occurrences.csv/]:::data;

2022-MM-DD.jsonl --- 60 ----> tweets-ids.csv;

tweets.jsonl --> 61;
tweets-geo.parquet --> 61;
wforms-occ.parquet --> 61;
wforms-ann.parquet --> 61;
61 --> occurrences.csv;

end

P1 ~~~~~~~ P2;
P2 ~~~~ P3;
P3 ~~~~~~ P4;
P4 ~~~~~~~~ P5;

classDef code stroke:red;
classDef data stroke:green;
classDef extdata stroke:blue;
```
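Because the `Makefile` encodes these dependencies, any intermediate output can be rebuilt on its own. A minimal sketch, assuming the targets are named after the output files shown above (the actual target names may differ):

```shell
# Rebuild one output together with any out-of-date prerequisites
# (target name assumed to match the output file):
make tweets.parquet

# Preview the commands that would run, without executing them:
make --dry-run tweets.parquet
```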

Data visualizations and statistics are produced by a few Python scripts and Jupyter notebooks. The `Makefile` encodes their dependencies too, as depicted in this graph:

```mermaid
flowchart TB;

subgraph P5 [ANALYSIS]
90[90_basic-stats.ipynb]:::code;
91[91_choro-stats.ipynb]:::code;
92[92_annos-stats.ipynb]:::code;
98[98_parts-chart.py]:::code;
99[99_choro-chart.py]:::code;

places.parquet[/places.parquet/]:::data;
tweets.parquet[/tweets.parquet/]:::data;
tweets-tok.parquet[/tweets-tok.parquet/]:::data;
wforms-bat.parquet[/wforms-bat.parquet/]:::data;
world-nations.geojson[/world-nations.geojson/]:::extdata;
italy-regions.geojson[/italy-regions.geojson/]:::extdata;

world-nations.geojson ---- 90;
italy-regions.geojson ---- 90;
tweets.parquet --- 90;
places.parquet --- 90;
wforms-bat.parquet --- 90;
tweets-tok.parquet --- 90;
90 -.-> 90;

italy-regions.geojson ---- 91;
D9[/"wforms-{bat,ann}.parquet"/]:::dataref --- 91;
D8[/"tweets-{tok,geo}.parquet"/]:::dataref --- 91;
91 -.-> 91;

D1[/"wforms-{ann,bat,occ,usr}.parquet"/]:::dataref --- 92;
%% for spacing only:
italy-regions.geojson ~~~ D1;
92 -.-> 92;

D2[/"wforms-{occ,usr}.parquet"/]:::dataref --- 98;
98 --> subsets.pdf;
98 -.-> 98;

italy-regions.geojson ---- 99;
D3[/"wforms-{bat,ann}.parquet"/]:::dataref --- 99;
D4[/"tweets-{tok,geo}.parquet"/]:::dataref --- 99;
99 --> choros-*.pdf["choros-{sample,more-1,more-2}.pdf"];
99 -.-> 99;

end

classDef code stroke:red;
classDef data stroke:green;
classDef extdata stroke:blue;
classDef dataref stroke:green,stroke-width:2px,stroke-dasharray: 10 10,font-style:italic;
```
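The notebooks can also be executed non-interactively; a hedged sketch using `jupyter nbconvert` (the `Makefile`'s actual recipes may differ):

```shell
# Execute a notebook in place, writing the computed outputs back into it:
jupyter nbconvert --to notebook --execute --inplace 90_basic-stats.ipynb
```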

Notebooks named `XX_*.ipynb` are dead ends or work in progress, so they are not documented in the graphs above.

`jupyterlab.sh` and `Makefile.hpc` are development tools used to prepare and run the pipeline in our HPC environment, and are therefore probably not of general interest.

`requirements.txt` lists all Python dependencies, as is customary.
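To reproduce the environment, the usual sequence applies:

```shell
python -m venv .venv             # create an isolated environment
source .venv/bin/activate        # activate it (POSIX shells)
pip install -r requirements.txt  # install the listed dependencies
```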

## Authors

Paolo Brasolin.

## License

This work is openly licensed via CC BY 4.0.