Commit b23b593

Update documentation to reflect install.
1 parent c96a2ef commit b23b593

6 files changed

+304
-217
lines changed

README.md

+17-20
@@ -56,17 +56,16 @@ pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
 
 ## Usage
 
-Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one XML metadata file.
+Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one `XML` metadata file.
 
 
 
 ```
-extract_publications_text.py [-h] [-d [DOWNSAMPLE]]
-[-p [PROCESS_TYPE]]
-[-l [LOG_FILE]]
-[-n [NUM_CORES]]
-xml_in_dir txt_out_dir
-
+usage: alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
+xml_in_dir txt_out_dir
+alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
+xml_in_dir txt_out_dir
+
 Converts XML publications to plaintext articles
 
 positional arguments:
@@ -75,19 +74,17 @@ positional arguments:
 
 optional arguments:
 -h, --help show this help message and exit
--d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
-Downsample. Default 1
+-p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
+Process type. One of: single,serial,multi,spark Default: multi
 -l [LOG_FILE], --log-file [LOG_FILE]
 Log file. Default out.log
--p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
-Process type.
-One of: single,serial,multi,spark
-Default: multi
+-d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
+Downsample. Default 1
 -n [NUM_CORES], --num-cores [NUM_CORES]
 Number of cores (Spark only). Default 1")
 ```
 
-`xml_in_dir` is expected to hold XML for multiple publications, in the following structure:
+`xml_in_dir` is expected to hold `XML` for multiple publications, in the following structure:
 
 ```
 xml_in_dir
@@ -129,32 +126,32 @@ The following `XSLT` files need to be in an `extract_text.xslts` module:
 
 ## Process publications
 
-Assume `~/BNA` exists and matches the structure above.
+Assume folder `BNA` exists and matches the structure above.
 
 Extract text from every publication:
 
 ```bash
-./extract_publications_text.py ~/BNA txt
+alto2txt BNA txt
 ```
 
 Extract text from every 100th issue of every publication:
 
 ```bash
-./extract_publications_text.py ~/BNA txt -d 100
+alto2txt BNA txt -d 100
 ```
 
 ## Process a single publication
 
 Extract text from every issue of a single publication:
 
 ```bash
-./extract_publications_text.py -p single ~/BNA/0000151 txt
+alto2txt -p single BNA/0000151 txt
 ```
 
 Extract text from every 100th issue of a single publication:
 
 ```bash
-./extract_publications_text.py -p single ~/BNA/0000151 txt -d 100
+alto2txt -p single BNA/0000151 txt -d 100
 ```
 
 ## Configure logging
@@ -164,7 +161,7 @@ By default, logs are put in `out.log`.
 To specify an alternative location for logs, use the `-l` flag e.g.
 
 ```bash
-./extract_publications_text.py -l mylog.txt ~/BNA txt -d 100 2> err.log
+alto2txt -l mylog.txt BNA txt -d 100 2> err.log
 ```
 
 ## Process publications via Spark
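
The usage text in the README.md diff above has the shape that Python's `argparse` prints for options declared with `nargs="?"`. The following is a minimal sketch of a parser that accepts the documented flags and defaults. It is an illustration only: the function name `build_parser` is hypothetical, and the real `alto2txt` entry point may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented alto2txt usage text."""
    parser = argparse.ArgumentParser(
        prog="alto2txt",
        description="Converts XML publications to plaintext articles",
    )
    # Positional arguments, as listed in the documented help output.
    parser.add_argument("xml_in_dir", help="Input directory with XML publications")
    parser.add_argument("txt_out_dir", help="Output directory for plaintext articles")
    # nargs="?" is what makes argparse render "[-p [PROCESS_TYPE]]" in the usage line.
    parser.add_argument(
        "-p", "--process-type", nargs="?", default="multi",
        help="Process type. One of: single,serial,multi,spark Default: multi",
    )
    parser.add_argument(
        "-l", "--log-file", nargs="?", default="out.log",
        help="Log file. Default out.log",
    )
    parser.add_argument(
        "-d", "--downsample", nargs="?", type=int, default=1,
        help="Downsample. Default 1",
    )
    parser.add_argument(
        "-n", "--num-cores", nargs="?", type=int, default=1,
        help="Number of cores (Spark only). Default 1",
    )
    return parser


# Same argument order as the "single publication, every 100th issue" example.
args = build_parser().parse_args(["-p", "single", "BNA/0000151", "txt", "-d", "100"])
print(args.process_type, args.xml_in_dir, args.txt_out_dir, args.downsample)
```

Options may appear before or after the positional arguments, which is why both `alto2txt -p single BNA/0000151 txt` and `alto2txt BNA txt -d 100` parse with this sketch.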

docs/Demo.md

+31-20
@@ -1,10 +1,10 @@
 # Demo
 
-A working example of alto2txt.
+A working example of `alto2txt`.
 
-Input xml files from digitised newspapers create an object for every section, paragraph, sentence, and individual word, making it difficult to read articles. Each newspaper page has an associated alto (.xml) file with content, and the pages share a mets (.xml) file with meta data about what articles/other content contain and where.
+Input `XML` files from digitised newspapers create an object for every section, paragraph, sentence, and individual word, making it difficult to read articles. Each newspaper page has an associated alto (`.xml`) file with content, and the pages share a mets (`.xml`) file with meta data about what articles/other content contain and where.
 
-The resulting .txt files are one per article, which may span multiple newspaper pages.
+The resulting `.txt` files are one per article, which may span multiple newspaper pages.
 
 ## Quick Demo
 
@@ -17,10 +17,21 @@ Navigate to an empty directory in the terminal and run the following commands:
 > cd alto2txt
 > conda create -n py37alto python=3.7
 > conda activate py37alto
-> pip install -r requirements.txt
-> ./extract_publications_text.py -p single demo-files demo-output
 ```
-The resulting plain text files of the articles are in `alto2txt/demo-output/`.
+
+To install that checkout you can
+```
+> pip install pyproject.toml
+```
+or you can simply install the latest release (but this may not be up to date with local changes)
+```
+> pip install alto2txt
+```
+regardless this should make the following command run
+```
+> alto2txt -p single demo-files demo-output
+```
+and the resulting plain text files of the articles will be in `alto2txt/demo-output/`.
 
 Read on for a more in-depth explanation.
 
@@ -32,7 +43,7 @@ It is recommended to use [Anaconda](https://docs.anaconda.com/anaconda/install/i
 
 #### Download the code directory
 
-If you are familiar with git, use the following command in a blank directory from your terminal:
+If you are familiar with `git`, use the following command in a blank directory from your terminal:
 
 ```
 git clone https://github.com/Living-with-machines/alto2txt.git
@@ -63,30 +74,30 @@ conda activate py37alto
 Install the required packages which are outlined in `requirements.txt`:
 
 ```
-pip install -r requirements.txt
+pip install pyproject.toml
 ```
-Follow the instructions to download and install the packages. You should now have all the required Python packages within your conda environment to run Alto2txt.
+Follow the instructions to download and install the packages. You should now have all the required Python packages within your conda environment to run `alto2txt`.
 
 
 
-## Run Alto2Txt
+## Run `alto2txt`
 
 Make sure you have navigated to the `alto2txt` directory in your terminal or Anaconda prompt. For this demo, we are using a single edition for a single publication. The output files will be created in `/demo-output` which you can check is currently empty.
 
 ```
-./extract_publications_text.py -p single demo-files demo-output
+alto2txt -p single demo-files demo-output
 ```
 
 Here we use the positional argument `-p` to determine which process type, in this case `single`. The script can be run on many publications and years by default, but in this case we only have one publication. [Click here](/#process-types) to read more about different process types.
 
-The next argument `demo-files` provides the input directory, and then `demo-output` provides the output directory (which should be empty). Once alto2txt has run, the output directory structure will mirror the input directory.
+The next argument `demo-files` provides the input directory, and then `demo-output` provides the output directory (which should be empty). Once `alto2txt` has run, the output directory structure will mirror the input directory.
 
 We will now look in more detail at the ALTO/METS input files and output plain text files.
 
 
 ## Input ALTO/METS files
 
-We ran alto2txt on the ALTO/METS files within a subdirectory called `demo-files`. These come from a newspaper published on the 17th of February, 1824. The directory tree structure is important, and will be mirrored in the output.
+We ran `alto2txt` on the ALTO/METS files within a subdirectory called `demo-files`. These come from a newspaper published on the 17th of February, 1824. The directory tree structure is important, and will be mirrored in the output.
 
 ```
 alto2txt/
@@ -119,7 +130,7 @@ There are four files with the file name ending in `_000x.xml`. These alto files
 <String ID = "word000001" ... CONTENT = "hello" ... />
 ```
 
-Alto2txt will extract all these individual words and create a text file for each article.
+`alto2txt` will extract all these individual words and create a text file for each article.
 
 #### METS File Contents
 
@@ -136,7 +147,7 @@ Here is a short example, which defines **Article 01** as the first paragraph on
 </mets:smLinkGrp>
 </mets:structLink>
 ```
-Alto2txt will produce a `.txt` file for every Article (and other content, for example Advert) defined in this mets file.
+`alto2txt` will produce a `.txt` file for every Article (and other content, for example Advert) defined in this mets file.
 
 
 ## Output Files
@@ -163,31 +174,31 @@ A total of 26 articles are extracted from the alto files, and one advert. Each p
 
 ## Further Examples
 
-Running these steps for your own files works in the same way. Your source and/or output directory does not need to be within `/alto2txt/` as long as you put the full path name into the command arguments.
+Running these steps for your own files works in the same way. Your source and/or output directory as long as you put the path name into the command arguments.
 
 
 #### Run on a single publication, multiple years, multiple editions
 
 ```
-./extract_publications_text.py -p single input-directory output-directory
+alto2txt -p single input-directory output-directory
 ```
 
 
 #### Run on multiple publications, multiple years, multiple editions
 
 ```
-./extract_publications_text.py input-directory output-directory
+alto2txt input-directory output-directory
 ```
 
 #### Extract every 100th edition from every publication
 
 ```
-./extract_publications_text.py input-directory output-directory -d 100
+alto2txt input-directory output-directory -d 100
 ```
 Where `-d` determines the downsample value.
 
 #### Extract every 100th edition from one publication
 
 ```
-./extract_publications_text.py -p single input-directory output-directory -d 100
+alto2txt -p single input-directory output-directory -d 100
 ```
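
The Demo.md text in this diff says each word sits in its own ALTO `String` element with a `CONTENT` attribute. As a rough illustration of what "extracting the individual words" means, here is a small sketch using Python's standard `xml.etree.ElementTree`. This is a swapped-in technique for illustration: the README describes `alto2txt` as relying on `XSLT` files, and the sample fragment and the helper `words_from_alto` below are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO-like fragment (real files carry
# namespaces, coordinates and many more attributes).
SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine>
  <String ID="word000001" CONTENT="hello"/>
  <SP/>
  <String ID="word000002" CONTENT="world"/>
</TextLine>
<TextLine>
  <String ID="word000003" CONTENT="again"/>
</TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def words_from_alto(xml_text: str) -> list:
    """Return the CONTENT of every String element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        el.attrib["CONTENT"]
        for el in root.iter()
        if local_name(el.tag) == "String" and "CONTENT" in el.attrib
    ]


print(" ".join(words_from_alto(SAMPLE)))  # prints "hello world again"
```

Grouping the words into per-article files additionally needs the METS file, which this sketch does not attempt.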

docs/README.md

+31-30
@@ -1,52 +1,54 @@
-# Alto2txt: Extract plain text from digitised newspapers
+# `alto2txt`: Extract plain text from digitised newspapers
 
 *Version extract_text 0.3.0*
 
-Alto2txt converts XML publications to plaintext articles with minimal metadata.
+`alto2txt` converts `XML` publications to plaintext articles with minimal metadata.
 ALTO and METS is the current industry standard for newspaper digitization used by hundreds of modern, large-scale newspaper digitization projects.
-One text file is output per article, each complemented by one XML metadata file.
+One text file is output per article, each complemented by one `XML` metadata file.
 
-**XML compatibility: METS 1.8/ALTO 1.4, METS 1.3/ALTO 1.4, BLN, or UKP format**
+**`XML` compatibility: METS 1.8/ALTO 1.4, METS 1.3/ALTO 1.4, BLN, or UKP format**
 
 ## Usage
 
-
+> *Note*: the formatting below is altered for readability
 ```
-extract_publications_text.py [-h [HELP]]
-[-d [DOWNSAMPLE]]
-[-p [PROCESS_TYPE]]
-[-l [LOG_FILE]]
-[-n [NUM_CORES]]
-xml_in_dir txt_out_dir
-
+usage: alto2txt [-h]
+[-p [PROCESS_TYPE]]
+[-l [LOG_FILE]]
+[-d [DOWNSAMPLE]]
+[-n [NUM_CORES]]
+xml_in_dir txt_out_dir
+
 Converts XML publications to plaintext articles
 
 positional arguments:
 xml_in_dir Input directory with XML publications
 txt_out_dir Output directory for plaintext articles
 
 optional arguments:
--h, --help Show this help message and exit
--d, --downsample Downsample, process every [integer] nth edition. Default 1
--l, --log-file Log file. Default out.log
--p, --process-type Process type.
-One of: single,serial,multi,spark
-Default: multi
--n, --num-cores Number of cores (Spark only). Default 1
+-h, --help show this help message and exit
+-p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
+Process type. One of: single,serial,multi,spark Default: multi
+-l [LOG_FILE], --log-file [LOG_FILE]
+Log file. Default out.log
+-d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
+Downsample. Default 1
+-n [NUM_CORES], --num-cores [NUM_CORES]
+Number of cores (Spark only). Default 1")
 ```
 To read about downsampling, logs, and using spark see [Advanced Information](advanced.md).
 
 
 ## Quick Install
 
-If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install Alto2txt by navigating to an empty directory in the terminal and run the following commands:
+If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install `alto2txt` by navigating to an empty directory in the terminal and run the following commands:
 
 ```
 > git clone https://github.com/Living-with-machines/alto2txt.git
 > cd alto2txt
 > conda create -n py37alto python=3.7
 > conda activate py37alto
-> pip install -r requirements.txt
+> pip install pyproject.toml
 ```
 
 [Click here](/Demo.md) for more in-depth installation instructions using demo files.
@@ -78,21 +80,21 @@ xml_in_dir/
 Assuming `xml_in_dir` follows this structure, run alto2txt with the following in the terminal:
 
 ```bash
-./extract_publications_text.py ~/xml_in_dir ~/txt_out_dir
+alto2txt xml_in_dir txt_out_dir
 ```
 
 To downsample and only process every 100th edition:
 
 ```bash
-./extract_publications_text.py ~/xml_in_dir ~/txt_out_dir -d 100
+alto2txt xml_in_dir txt_out_dir -d 100
 ```
 
 
 ## Process Single Publication
 
 [A demo for processing a single publication is available here.](Demo.md)
 
-If `-p|--process-type single` is provided then `xml_in_dir` is expected to hold XML for a single publication, in the following structure:
+If `-p|--process-type single` is provided then `xml_in_dir` is expected to hold `XML` for a single publication, in the following structure:
 
 ```
 xml_in_dir/
@@ -102,16 +104,16 @@ xml_in_dir/
 └── year
 ```
 
-Assuming `xml_in_dir` follows this structure, run alto2txt with the following in the terminal:
+Assuming `xml_in_dir` follows this structure, run `alto2txt` with the following in the terminal in the folder `xml_in_dir` is stored in:
 
 ```bash
-./extract_publications_text.py -p single ~/xml_in_dir ~/txt_out_dir
+alto2txt -p single xml_in_dir txt_out_dir
 ```
 
 To downsample and only process every 100th edition from the one publication:
 
 ```bash
-./extract_publications_text.py -p single ~/xml_in_dir ~/txt_out_dir -d 100
+alto2txt -p single xml_in_dir txt_out_dir -d 100
 ```
 
 ## Plain Text Files Output
@@ -125,7 +127,7 @@ Quality assurance is performed to check for:
 
 * Unexpected directories.
 * Unexpected files.
-* Malformed XML.
+* Malformed `XML`.
 * Empty files.
 * Files that otherwise do not expose content.
 
@@ -135,5 +137,4 @@ Quality assurance is performed to check for:
 * Check and ensure that articles that span multiple pages are pulled into a single article file.
 * Smarter handling of articles spanning multiple pages.
 
-
-> Last updated 2022-06-30
+> Last updated 2022-11-10
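
The documentation in this commit describes `-d` as processing only every Nth edition. A minimal sketch of that idea in Python follows. It assumes the issues are already sorted and that counting starts at the first issue; neither detail is stated in the docs, and the helper name `every_nth` is hypothetical rather than part of `alto2txt`.

```python
from typing import List, Sequence


def every_nth(issues: Sequence[str], n: int = 1) -> List[str]:
    """Keep every nth issue, starting with the first (assumed behaviour)."""
    if n < 1:
        raise ValueError("downsample value must be a positive integer")
    # Slicing with a step keeps items at positions 0, n, 2n, ...
    return list(issues[::n])


# Hypothetical issue identifiers, already in sorted order.
issues = [f"issue-{i:04d}" for i in range(250)]
print(every_nth(issues, 100))  # ['issue-0000', 'issue-0100', 'issue-0200']
```

With the default `n = 1` every issue is kept, which matches the documented default of `Downsample. Default 1`.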
