Extracting binomial species
The majority of species are mentioned in full once (e.g. in abstract or introduction) and then abbreviated. Example:
From ../CEVOpen/searches/oil186/PMC5132230/fulltext.pdf
Toxicity on Artemia salina essential Aeollanthus suaveolens oil The toxicity test on A. salina L. is widely used in bioassay due to be fast, reliable, and low cost. Furthermore, the A. salina toxicity, shows good correlation with antitumor activities [27], pesticide ... Table 3 shows the mean mortality readings held in the 24 hour period the cytotoxic activity of essential oil from A. suaveolens sheet.
The abbreviations A. salina and A. suaveolens have to be correctly expanded every time they are mentioned. The "A." is ambiguous, so we have to replace it with the correct genus in each case. In this paper I think there are only 2 confusable species.
Another example:
Background: The plants belonging to the Ocimum genus of the Lamiaceae family are considered to be a rich source of essential oils which have expressed biological activity and use in different area of human activity. There is a great variety of chemotypes within the same basil species. Essential oils from three different cultivars of basil, O. basilicum var. purpureum, O. basilicum var. thyrsiflora, and O. citriodorum Vis. were the subjects of our investigations.
Here we have to translate O. basilicum and O. citriodorum to their full names, neither of which is given. The genus can be identified as Ocimum.
or
The yeasts (Candida albicans WT-174 isolated from infected vaginal microbiota of hospitalized patients (clinical strain) and Debariomyces hansenii ...
with later:
MIC of O. x citriodorum against D. hansenii and C. guillermondii were 1.56 and 3.125 μL/mL, respectively (see Fig. 1).
- extract all italic phrases, retaining the order.
- filter those which may be species or abbreviations.
- create heuristics to apply previous genus names to later abbreviations.
- use pyami to extract all italic phrases. Should be possible using:
  - filter with XPath (needs writing)
  - output to a concatenated list (preserve order)
- for each abbreviation, note the species and genera mentioned earlier
- make mapping of abbreviations to possible genera
- if mappings are 1:1 resolve ambiguities
- If not, look up possible binomials in wikidata. If absent, abort.
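The resolution steps above can be sketched in Python. This is only an illustration under stated assumptions: the regular expressions, the `resolve` function and the `_remember` helper are hypothetical names and not part of pyami, and the input is assumed to be the ordered list of italic phrases. A capitalised single word is treated as a genus, which will also catch family names, so a real filter would need refining. An abbreviation is expanded when an earlier full binomial has the same initial and epithet, or when exactly one genus with that initial has been seen; otherwise it is left as `None`, which marks the cases that would go on to the wikidata lookup.

```python
import re

# Full binomial: capitalised genus, optional hybrid marker, lowercase epithet.
FULL = re.compile(r"^([A-Z][a-z]+)\s+(?:[x×]\s+)?([a-z][a-z-]+)$")
# Abbreviation: one capital letter, a full stop, optional hybrid marker, epithet.
ABBREV = re.compile(r"^([A-Z])\.\s*(?:[x×]\s+)?([a-z][a-z-]+)$")
# Bare genus (assumption: any single capitalised word; also matches family names).
GENUS = re.compile(r"^[A-Z][a-z]+$")


def _remember(genera, genus):
    """Record a genus under its initial, keeping first-seen order."""
    seen = genera.setdefault(genus[0], [])
    if genus not in seen:
        seen.append(genus)


def resolve(phrases):
    """Expand abbreviated binomials using genera seen earlier in the list.

    Returns (abbreviation, expansion) pairs in order; expansion is None
    when the abbreviation cannot be resolved unambiguously.
    """
    genera = {}     # initial -> genera seen so far
    binomials = {}  # (initial, epithet) -> genus from an earlier full mention
    results = []
    for phrase in phrases:
        phrase = " ".join(phrase.split())  # normalise whitespace
        full = FULL.match(phrase)
        abbrev = ABBREV.match(phrase)
        if full:
            genus, epithet = full.groups()
            _remember(genera, genus)
            binomials.setdefault((genus[0], epithet), genus)
        elif GENUS.match(phrase):
            _remember(genera, phrase)
        elif abbrev:
            initial, epithet = abbrev.groups()
            genus = binomials.get((initial, epithet))
            candidates = genera.get(initial, [])
            if genus is None and len(candidates) == 1:
                genus = candidates[0]  # 1:1 mapping, safe to expand
            # The hybrid marker "x" is tolerated on input but not preserved.
            results.append((phrase, f"{genus} {epithet}" if genus else None))
        # Anything else (e.g. "in vitro") is ignored.
    return results


print(resolve(["Artemia salina", "Aeollanthus suaveolens",
               "A. salina", "A. suaveolens"]))
```

With the phrases from the first example, both "A." abbreviations resolve because each epithet was seen earlier in a full binomial, even though two genera share the initial.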
- write XPath filter in pyami
- create namespace in Ctree for extracted species
- create CProject aggregate of common species abbreviations (e.g. E. coli)
- extract italic phrases (schematic only; the details will be refined)
pyami -p ${oil26} \
--glob "*/sections/**/*_p.xml" \
--filter "xpath('.//italic')" \
--combine list \
--output ${home}/italic/list.xml
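
Until the pyami filter is written, the same extraction can be tried on a single paragraph file with the Python standard library. This is a minimal sketch, not pyami's implementation: `italic_phrases` and the `SAMPLE` paragraph are hypothetical, and the input is assumed to be JATS-like XML in which italics are marked with `<italic>` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like paragraph; real section files may be structured differently.
SAMPLE = """<p>The toxicity test on <italic>A. salina</italic> L. is widely used.
Oil from <italic>Aeollanthus suaveolens</italic> was tested <italic>in vitro</italic>.</p>"""


def italic_phrases(xml_text):
    """Return the text of every <italic> element, in document order."""
    root = ET.fromstring(xml_text)
    phrases = []
    for elem in root.iter("italic"):  # iter() walks the tree in document order
        # Join nested text and collapse runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            phrases.append(text)
    return phrases


print(italic_phrases(SAMPLE))
```

The list keeps non-species italics such as "in vitro", so it still needs the filtering step; nested `<italic>` elements would be reported twice.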