-Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one XML metadata file.
+Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one `XML` metadata file.
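As a rough illustration of what "every Nth issue of each newspaper" could mean, the sketch below groups issues by newspaper and keeps every Nth one. The function name `downsample_issues`, the `(newspaper, issue_date)` pairs, the sample data, and the choice to start counting from the earliest issue are all assumptions made for this example; none of them is taken from the `alto2txt` source.

```python
from collections import defaultdict


def downsample_issues(issues, n=1):
    """Keep every n-th issue of each newspaper (hypothetical helper).

    `issues` is an iterable of (newspaper, issue_date) pairs. ISO dates
    sort chronologically as plain strings.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    by_paper = defaultdict(list)
    for paper, issue_date in issues:
        by_paper[paper].append(issue_date)

    # Assumption: counting starts at the earliest issue of each newspaper.
    return {paper: sorted(dates)[::n] for paper, dates in by_paper.items()}


# Made-up sample data, for illustration only.
issues = [
    ("paper-a", "1824-02-17"),
    ("paper-a", "1824-02-24"),
    ("paper-a", "1824-03-02"),
    ("paper-b", "1824-02-17"),
]
kept = downsample_issues(issues, n=2)
```

With `n=1` every issue is kept, which matches the default of 1 listed for `--downsample` in the usage text.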
docs/Demo.md: +31 -20
@@ -1,10 +1,10 @@
 # Demo

-A working example of alto2txt.
+A working example of `alto2txt`.

-Input xml files from digitised newspapers create an object for every section, paragraph, sentence, and individual word, making it difficult to read articles. Each newspaper page has an associated alto (.xml) file with content, and the pages share a mets (.xml) file with meta data about what articles/other content contain and where.
+Input `XML` files from digitised newspapers create an object for every section, paragraph, sentence, and individual word, making it difficult to read articles. Each newspaper page has an associated alto (`.xml`) file with content, and the pages share a mets (`.xml`) file with meta data about what articles/other content contain and where.

-The resulting .txt files are one per article, which may span multiple newspaper pages.
+The resulting `.txt` files are one per article, which may span multiple newspaper pages.

 ## Quick Demo
@@ -17,10 +17,21 @@ Navigate to an empty directory in the terminal and run the following commands:
 > cd alto2txt
 > conda create -n py37alto python=3.7
 > conda activate py37alto
-> pip install -r requirements.txt
-> ./extract_publications_text.py -p single demo-files demo-output
 ```
-The resulting plain text files of the articles are in `alto2txt/demo-output/`.
+
+To install that checkout you can
+```
+> pip install pyproject.toml
+```
+or you can simply install the latest release (but this may not be up to date with local changes)
+```
+> pip install alto2txt
+```
+regardless this should make the following command run
+```
+> alto2txt -p single demo-files demo-output
+```
+and the resulting plain text files of the articles will be in `alto2txt/demo-output/`.

 Read on for a more in-depth explanation.
@@ -32,7 +43,7 @@ It is recommended to use [Anaconda](https://docs.anaconda.com/anaconda/install/i
 #### Download the code directory

-If you are familiar with git, use the following command in a blank directory from your terminal:
+If you are familiar with `git`, use the following command in a blank directory from your terminal:
 Install the required packages which are outlined in `requirements.txt`:

 ```
-pip install -r requirements.txt
+pip install pyproject.toml
 ```
-Follow the instructions to download and install the packages. You should now have all the required Python packages within your conda environment to run Alto2txt.
+Follow the instructions to download and install the packages. You should now have all the required Python packages within your conda environment to run `alto2txt`.



-## Run Alto2Txt
+## Run `alto2txt`

 Make sure you have navigated to the `alto2txt` directory in your terminal or Anaconda prompt. For this demo, we are using a single edition for a single publication. The output files will be created in `/demo-output` which you can check is currently empty.

 ```
-./extract_publications_text.py -p single demo-files demo-output
+alto2txt -p single demo-files demo-output
 ```

 Here we use the positional argument `-p` to determine which process type, in this case `single`. The script can be run on many publications and years by default, but in this case we only have one publication. [Click here](/#process-types) to read more about different process types.

-The next argument `demo-files` provides the input directory, and then `demo-output` provides the output directory (which should be empty). Once alto2txt has run, the output directory structure will mirror the input directory.
+The next argument `demo-files` provides the input directory, and then `demo-output` provides the output directory (which should be empty). Once `alto2txt` has run, the output directory structure will mirror the input directory.

 We will now look in more detail at the ALTO/METS input files and output plain text files.


 ## Input ALTO/METS files

-We ran alto2txt on the ALTO/METS files within a subdirectory called `demo-files`. These come from a newspaper published on the 17th of February, 1824. The directory tree structure is important, and will be mirrored in the output.
+We ran `alto2txt` on the ALTO/METS files within a subdirectory called `demo-files`. These come from a newspaper published on the 17th of February, 1824. The directory tree structure is important, and will be mirrored in the output.

 ```
 alto2txt/
@@ -119,7 +130,7 @@ There are four files with the file name ending in `_000x.xml`. These alto files
 <String ID = "word000001" ... CONTENT = "hello" ... />
 ```

-Alto2txt will extract all these individual words and create a text file for each article.
+`alto2txt` will extract all these individual words and create a text file for each article.

 #### METS File Contents
@@ -136,7 +147,7 @@ Here is a short example, which defines **Article 01** as the first paragraph on
 </mets:smLinkGrp>
 </mets:structLink>
 ```
-Alto2txt will produce a `.txt` file for every Article (and other content, for example Advert) defined in this mets file.
+`alto2txt` will produce a `.txt` file for every Article (and other content, for example Advert) defined in this mets file.


 ## Output Files
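To make the word-extraction step described in the ALTO hunk more concrete, here is a minimal sketch that collects the `CONTENT` attribute of every `String` element and joins the words into plain text. It is an illustration only and is not claimed to be how `alto2txt` does it: the function name `words_from_alto` and the sample fragment are hypothetical, and the fragment omits the XML namespace and the extra attributes that real ALTO files carry.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free fragment modelled on the <String ... /> line in the
# ALTO hunk; real ALTO files declare a namespace and many more attributes.
ALTO_SNIPPET = """
<TextBlock>
  <TextLine>
    <String ID="word000001" CONTENT="hello"/>
    <String ID="word000002" CONTENT="world"/>
  </TextLine>
</TextBlock>
"""


def words_from_alto(xml_text):
    """Return the CONTENT of every String element, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Compare on the local tag name so namespaced documents also match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "CONTENT" in element.attrib:
            words.append(element.attrib["CONTENT"])
    return words


text = " ".join(words_from_alto(ALTO_SNIPPET))
```

Grouping the words into one file per article would additionally need the article boundaries that the METS file defines, which this sketch does not attempt.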
@@ -163,31 +174,31 @@ A total of 26 articles are extracted from the alto files, and one advert. Each p
 ## Further Examples

-Running these steps for your own files works in the same way. Your source and/or output directory does not need to be within `/alto2txt/`as long as you put the full path name into the command arguments.
+Running these steps for your own files works in the same way. Your source and/or output directory as long as you put the path name into the command arguments.


 #### Run on a single publication, multiple years, multiple editions

 ```
-./extract_publications_text.py -p single input-directory output-directory
+alto2txt -p single input-directory output-directory
 ```


 #### Run on multiple publications, multiple years, multiple editions
 > *Note*: the formatting below is altered for readability
 ```
-extract_publications_text.py [-h [HELP]]
-[-d [DOWNSAMPLE]]
-[-p [PROCESS_TYPE]]
-[-l [LOG_FILE]]
-[-n [NUM_CORES]]
-xml_in_dir txt_out_dir
-
+usage: alto2txt [-h]
+[-p [PROCESS_TYPE]]
+[-l [LOG_FILE]]
+[-d [DOWNSAMPLE]]
+[-n [NUM_CORES]]
+xml_in_dir txt_out_dir
+
 Converts XML publications to plaintext articles

 positional arguments:
 xml_in_dir Input directory with XML publications
 txt_out_dir Output directory for plaintext articles

 optional arguments:
--h, --help Show this help message and exit
--d, --downsample Downsample, process every [integer] nth edition. Default 1
--l, --log-file Log file. Default out.log
--p, --process-type Process type.
-One of: single,serial,multi,spark
-Default: multi
--n, --num-cores Number of cores (Spark only). Default 1
+-h, --help show this help message and exit
+-p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
+Process type. One of: single,serial,multi,spark Default: multi
+-l [LOG_FILE], --log-file [LOG_FILE]
+Log file. Default out.log
+-d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
+Downsample. Default 1
+-n [NUM_CORES], --num-cores [NUM_CORES]
+Number of cores (Spark only). Default 1")
 ```
 To read about downsampling, logs, and using spark see [Advanced Information](advanced.md).


 ## Quick Install

-If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install Alto2txt by navigating to an empty directory in the terminal and run the following commands:
+If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install `alto2txt` by navigating to an empty directory in the terminal and run the following commands:
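The new usage text in the hunk above has the shape that Python's `argparse` prints for options declared with `nargs="?"`. The sketch below is a hypothetical parser that mirrors the listed options and defaults. Whether `alto2txt` builds its parser this way is an assumption, as are the `build_parser` name and the `type=int` conversions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="alto2txt",
        description="Converts XML publications to plaintext articles",
    )
    parser.add_argument("xml_in_dir", help="Input directory with XML publications")
    parser.add_argument("txt_out_dir", help="Output directory for plaintext articles")
    # nargs="?" is what makes argparse print the "[-p [PROCESS_TYPE]]" form.
    parser.add_argument("-p", "--process-type", nargs="?", default="multi",
                        help="Process type. One of: single,serial,multi,spark")
    parser.add_argument("-l", "--log-file", nargs="?", default="out.log",
                        help="Log file")
    # Assumption: the numeric options are parsed as integers.
    parser.add_argument("-d", "--downsample", nargs="?", type=int, default=1,
                        help="Downsample")
    parser.add_argument("-n", "--num-cores", nargs="?", type=int, default=1,
                        help="Number of cores (Spark only)")
    return parser


# Same arguments as the demo command: alto2txt -p single demo-files demo-output
args = build_parser().parse_args(["-p", "single", "demo-files", "demo-output"])
```

Since `-p` is declared with a leading dash, `argparse` treats it as an optional argument; only `xml_in_dir` and `txt_out_dir` are positional.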