- exportMetadata.py
- iniToJson.py
- addOlderReleases.py
- addDependencies.py
- generateKeywords.py
- addRelatedEntries.py
- exportJsonMetadata.py
- addStatistics.py
- getTheories.py
Module exportMetadata
Runs the rest of the scripts in the correct order and provides feedback in the form of a progress bar.
Please run pip install -r requirements.txt before running this script.
exportMetadata()
: Main method which calls each submodule in turn. No options can be passed.
updateProgressBar(desc, t)
:
Module iniToJson
Converts the metadata stored in an INI file into individual JSON files. The shortname, title, date and abstract are preserved as is, but the other attributes are transformed into more appropriate formats like arrays and objects.
Author emails are extracted from the entry data and are collated into an authors.json file.
deduplicate(name)
: Many authors have spelt their names in more than one way. This is a manual
de-duplication of those variants.
iniToJson()
: Iterates over each entry in the metadata/metadata file and extracts
the information before outputting the data as a JSON file.
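
As a rough illustration of the conversion described for this module, the sketch below reads one INI section with the standard library and emits a JSON-ready dictionary. The sample section, the option names (`author`, `topic`) and the helper name `entry_to_dict` are assumptions made for the example; the real metadata file may use different keys and separators.

```python
import configparser
import json

# Hypothetical sample; the real metadata/metadata file may look different.
SAMPLE_INI = """
[Example_Entry]
title = An Example Entry
author = Ada Lovelace, Alan Turing
topic = Logic/General, Mathematics/Algebra
date = 2020-01-01
abstract = A short abstract.
"""


def entry_to_dict(shortname, section):
    """Keep scalar fields as they are and split list-like fields into arrays."""
    return {
        "shortname": shortname,
        "title": section["title"],
        "date": section["date"],
        "abstract": section["abstract"],
        # Assumed: authors and topics are comma-separated in the INI file.
        "authors": [a.strip() for a in section["author"].split(",")],
        "topics": [t.strip() for t in section["topic"].split(",")],
    }


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_INI)
entries = {name: entry_to_dict(name, parser[name]) for name in parser.sections()}
print(json.dumps(entries["Example_Entry"], indent=2))
```

In the actual script each entry is written to its own JSON file; the sketch only prints the result.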
processName(val, authorsDictionary)
: Splits a string of authors into a Python list. The website/email
address of each author is also extracted here.
Note: Only one website/email address is kept per author; if there
are multiple values, earlier ones are overwritten.
standardiseInitials(name)
: Standardises the format of a name with initials.
Format is:
* Initial always followed by a period
* Initials are always separated by a space
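
The two formatting rules above could be applied with a pair of regular expressions, as in the following sketch. The function name `standardise_initials_sketch` is hypothetical and the patterns are an assumption, not the script's own; they only handle ASCII capital initials.

```python
import re


def standardise_initials_sketch(name):
    """Apply the two rules: a period after each initial, a space between initials."""
    # Insert a space after any period that is directly followed by a letter,
    # so "J.R.R." becomes "J. R. R.".
    name = re.sub(r"\.(?=\w)", ". ", name)
    # Add the missing period after a single capital letter that stands alone
    # as a word, so "J R Tolkien" becomes "J. R. Tolkien".
    name = re.sub(r"\b([A-Z])(?=\s|$)", r"\1.", name)
    # Collapse any repeated whitespace.
    return " ".join(name.split())


for raw in ["J.R.R. Tolkien", "J R Tolkien", "Ada Lovelace"]:
    print(standardise_initials_sketch(raw))
```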
Module addOlderReleases
This script traverses the metadata/release-dates and metadata/releases files and adds all the releases (except the most recent) of each entry to its JSON file.
addOlderReleases()
: First builds a list of release dates, then traverses each release and
adds all but the most recent to its entry file.
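
The "all but the most recent" step can be sketched as follows. The input shape (a list of entry and ISO date pairs) and the name `older_releases` are assumptions for illustration; the layout of the real metadata files is not shown here.

```python
from collections import defaultdict

# Hypothetical (entry, release date) pairs; ISO dates sort chronologically as strings.
RELEASES = [
    ("Example_Entry", "2019-06-11"),
    ("Example_Entry", "2021-02-23"),
    ("Example_Entry", "2020-04-20"),
    ("Other_Entry", "2021-02-23"),
]


def older_releases(releases):
    """Group release dates by entry and drop the most recent date of each."""
    by_entry = defaultdict(list)
    for entry, date in releases:
        by_entry[entry].append(date)
    # Sort each entry's dates and slice off the last (most recent) one.
    return {entry: sorted(dates)[:-1] for entry, dates in by_entry.items()}


print(older_releases(RELEASES))
```

An entry with a single release ends up with an empty list, since its only release is also its most recent one.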
Module addDependencies
The dependencies of an AFP entry are listed in its ROOT file. Because that file has a regular structure, this script uses a regular expression to extract the dependencies and adds them to the JSON file of the entry.
addDependencies()
: For each entry in the thys/ directory, extract the dependencies and add
them to the JSON file.
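
A minimal sketch of regex-based extraction is shown below. It assumes that dependencies appear in a `sessions` block that ends at the `theories` keyword; the sample ROOT text, the pattern and the name `extract_dependencies` are all assumptions, and the script's own expression may differ.

```python
import re

# Hypothetical ROOT content used only for this example.
SAMPLE_ROOT = '''
session Example = "HOL-Library" +
  options [timeout = 600]
  sessions
    "Some_Entry"
    Another_Entry
  theories
    Example
'''

# Capture everything between a "sessions" line and the next "theories" line.
SESSIONS_BLOCK = re.compile(
    r"^\s*sessions\s+(.*?)^\s*theories\b", re.MULTILINE | re.DOTALL
)


def extract_dependencies(root_text):
    """Return the session names listed in the assumed `sessions` block."""
    match = SESSIONS_BLOCK.search(root_text)
    if match is None:
        return []
    # Names may or may not be quoted, so strip surrounding double quotes.
    return [name.strip('"') for name in match.group(1).split()]


print(extract_dependencies(SAMPLE_ROOT))
```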
Module generateKeywords
Generates a list of keywords for the search autocomplete. Each entry’s abstract is sanitised and then the keywords are extracted with the RAKE algorithm.
generateKeywords()
: RAKE is used to extract the keywords from every abstract.
The top 8 keywords are added to a list of all keywords and the keywords
that appear in more than two abstracts are preserved. Finally, plurals
are removed.
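
The filtering that follows keyword extraction can be sketched without RAKE itself. The example below assumes each abstract has already been turned into a ranked keyword list; the function name `filter_keywords` and the naive plural rule (drop "xs" when "x" is also kept) are assumptions, not the script's actual logic.

```python
from collections import Counter


def filter_keywords(ranked_per_abstract, top_n=8):
    """Keep keywords found in the top `top_n` of more than two abstracts."""
    counts = Counter()
    for ranked in ranked_per_abstract:
        # Count each keyword at most once per abstract.
        counts.update(set(ranked[:top_n]))
    kept = {keyword for keyword, n in counts.items() if n > 2}
    # Naive plural removal: drop a keyword ending in "s" if its singular is kept.
    return sorted(
        keyword
        for keyword in kept
        if not (keyword.endswith("s") and keyword[:-1] in kept)
    )


# Hypothetical ranked keyword lists, one per abstract.
ranked = [
    ["lattice", "lattices", "proof"],
    ["lattice", "lattices"],
    ["lattice", "lattices", "graph"],
]
print(filter_keywords(ranked))
```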
Module addRelatedEntries
This script generates related entries, using three metrics:
* Sharing dependencies
* Sharing keywords
* Sharing topics
These metrics are weighted and combined to find entries which are likely similar.
The related entries are then added to each entry to improve site navigation.
addRelatedEntries()
: First three dictionaries are created as follows:
dependencies = {"dependency": [list-of-entries, ...], ...}
keywords = {"keyword": [list-of-entries, ...], ...}
topics = {"topic": [list-of-entries, ...], ...}
Keywords that feature in more than 10 entries are dropped. Then
a dictionary is created with the relatedness scores between each
entry. Finally, the top three related entries are chosen for each
entry.
populateRelated(dataSet, relatedEntries, modifier=1)
: This is a heavily nested loop to create the relatedEntries dictionary.
For each of the categories, the list of entries associated with
each key is iterated over twice and, if the entries are not the
same, the modifier of that category is added to the relatedness
score between the two entries in the dictionary. As the loop
iterates twice over the value set, the resulting dictionary is
symmetric, i.e., the value for A->B will be equal to B->A.
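
The nested loop described above might look like the following sketch. The snake_case name, the sample data and the modifier values are assumptions; only the overall shape (iterate twice over each value list, skip identical entries, add the modifier) comes from the description.

```python
def populate_related_sketch(data_set, related_entries, modifier=1):
    """Add `modifier` to the score of every ordered pair sharing a key."""
    for entries in data_set.values():
        for first in entries:
            for second in entries:
                if first != second:
                    scores = related_entries.setdefault(first, {})
                    scores[second] = scores.get(second, 0) + modifier
    return related_entries


# Hypothetical categories and weights.
dependencies = {"HOL-Library": ["A", "B"], "Some_Entry": ["A", "B", "C"]}
keywords = {"lattice": ["A", "C"]}

related = {}
populate_related_sketch(dependencies, related, modifier=2)
populate_related_sketch(keywords, related, modifier=1)
print(related)
```

Because every pair is visited in both orders, the score stored for A->B always matches the score for B->A.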
topThree(dictionary)
: Returns the three dictionary keys with the highest values.
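
One compact way to pick the three highest-scoring keys is `heapq.nlargest`, shown here as a sketch; the script itself may well use a different approach, and the name `top_three_sketch` is hypothetical.

```python
import heapq


def top_three_sketch(scores):
    """Return up to three keys of `scores`, ordered by descending value."""
    return heapq.nlargest(3, scores, key=scores.get)


print(top_three_sketch({"A": 4, "B": 1, "C": 3, "D": 2}))
```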
Module exportJsonMetadata
This script creates a JSON release of the AFP's metadata
exportJsonMetadata()
: Iterates over each entry and builds the output list
processEntry(entryPath, entry)
: Removes the emails and related entries and returns the dictionary
Module addStatistics
Most of the statistics for the site are generated by Hugo. This script generates other statistics, such as the number of lines in the AFP, using the scripts from the current AFP.
For this script to work, a return data statement needs to be added at line 212 in templates.py.
addStatistics()
: Creates the necessary objects to generate the statistics,
then outputs them to the data directory
Module getTheories
This script downloads and transforms the HTML documents for theory browsing.
By default, this script only gets theories which do not yet have a theory file, i.e., new theories. The --all flag can be passed to get all theories, but it should be used sparingly as it is intensive on the upstream server. A full run takes around 80 minutes.
defineArgParser()
: Creates parser for command line arguments
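
A parser with the documented --all flag could be built with `argparse` as sketched below. The optional positional `entry` argument is an assumption inferred from the `getTheories(all=False, entry='')` signature; the real script may expose the entry name differently.

```python
import argparse


def define_arg_parser_sketch():
    """Build a parser with the documented --all flag and an assumed entry argument."""
    parser = argparse.ArgumentParser(
        description="Download theory HTML for AFP entries."
    )
    parser.add_argument(
        "--all",
        action="store_true",
        help="fetch every theory instead of only the missing ones",
    )
    # Assumed: a single entry can be named as an optional positional argument.
    parser.add_argument("entry", nargs="?", default="", help="name of one entry")
    return parser


args = define_arg_parser_sketch().parse_args(["--all"])
print(args.all, repr(args.entry))
```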
dependancyLink(link)
: Fixes dependency links to be internal
getTheories(all=False, entry='')
: Entry point, either downloads one entry or all of them
based on the passed flags
getTheory(url, name)
: The theories are then downloaded,
transformed, and concatenated together. The first transformation
is to keep the and change it to be a
processURL(entry, theoriesHtmlDir, theoriesJsonDir, entriesJsonDir)
: Gets the theories for an entry and writes it to the requisite files
theoryLinks(entry)
: Download the “Browse theories” page for an entry to get a
list of theories.
updateProgressBar(desc, t)
: