Parses a taxonomic scientific name and breaks it into semantic elements.
Important: Biodiversity parser >= 4.0.0 uses a binding to https://github.com/gnames/gnparser and is not backward compatible with older versions. However, it is much faster and better than previous versions.
This gem no longer has a remote server or a command line executable. For such features use https://github.com/gnames/gnparser.
sudo gem install biodiversity
The gem should work on Linux, Mac and Windows (64-bit) machines.
The fastest way to go through a massive number of names is to use the Biodiversity::Parser.parse_ary([big array], simple: true) function.
For example, to parse a large file with one name per line:
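#!/usr/bin/env ruby
require 'biodiversity'

P = Biodiversity::Parser
count = 0
# Read the file line by line and parse it in batches of 50,000 names.
File.foreach('all_names.txt').each_slice(50_000) do |sl|
  count += 1
  res = P.parse_ary(sl, simple: true)
  puts count * 50_000 # upper bound on the number of names processed so far
  puts res[0]         # first parsed result of the current batch
end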
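Lines read from a file keep their line endings, and real-world name lists often contain blank lines. The following is a minimal sketch of a batching helper that cleans the input before it reaches the parser. The helper `each_name_batch` is hypothetical and not part of the biodiversity gem; the parser call is left as a comment so the sketch runs without the gem installed.

```ruby
require 'stringio'

# Hypothetical helper (not part of the biodiversity gem): reads lines from
# any IO-like object, strips whitespace, skips blank lines, and yields
# arrays of at most `size` name strings.
def each_name_batch(io, size)
  io.each_line(chomp: true)
    .lazy                      # avoid loading the whole input into memory
    .map(&:strip)
    .reject(&:empty?)
    .each_slice(size) { |batch| yield batch }
end

# A StringIO stands in for File.open('all_names.txt') in this sketch.
sample = StringIO.new("Plantago major\n\n  Homo sapiens  \nPardosa moesta\n")

batches = []
each_name_batch(sample, 2) do |batch|
  # With the gem installed, each batch could be passed to the parser, e.g.
  # Biodiversity::Parser.parse_ary(batch, simple: true)
  batches << batch
end

p batches
```

With a real file, the same helper could be called inside a `File.open('all_names.txt') { |f| ... }` block so that the file handle is closed when parsing finishes.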
Here are comparative results of running the parsers against a file with 24 million names on a 4-CPU hyperthreaded laptop:
Program | Version | Full/Simple | Names/min |
---|---|---|---|
gnparser | 0.12.0 | Simple | 3,000,000 |
biodiversity | 4.0.1 | Simple | 2,000,000 |
biodiversity | 4.0.1 | Full JSON | 800,000 |
biodiversity | 3.5.1 | n/a | 40,000 |
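To put the table in perspective, the short calculation below estimates how long the 24-million-name file would take at each listed rate. It is plain arithmetic on the figures above, not a new measurement, and the constant and method names are made up for this illustration.

```ruby
# Throughput figures (names per minute) copied from the table above.
RATES = {
  'gnparser 0.12.0 (simple)' => 3_000_000,
  'biodiversity 4.0.1 (simple)' => 2_000_000,
  'biodiversity 4.0.1 (full JSON)' => 800_000,
  'biodiversity 3.5.1' => 40_000
}.freeze

TOTAL_NAMES = 24_000_000

# Estimated wall-clock minutes needed to parse `total` names.
def minutes_to_parse(total, names_per_minute)
  total.fdiv(names_per_minute)
end

RATES.each do |label, rate|
  puts format('%-32s %6.1f min', label, minutes_to_parse(TOTAL_NAMES, rate))
end

# Ratio between the 4.0.1 simple rate and the 3.5.1 rate.
speedup = RATES['biodiversity 4.0.1 (simple)'] / RATES['biodiversity 3.5.1']
puts "4.0.1 (simple) vs 3.5.1: #{speedup}x"
```

At these rates, version 4.0.1 in simple mode would finish the file in about 12 minutes, while version 3.5.1 would need about 10 hours.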
You can use it as a library in Ruby:
require 'biodiversity'
# to find the gem version number
Biodiversity.version
# Note that the version in parsed output will correspond to the version of
# gnparser.
# to parse a scientific name into a simple Ruby hash
Biodiversity::Parser.parse("Plantago major", simple: true)
# to parse many scientific names using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ], simple: true)
# to parse a scientific name into a very detailed Ruby hash
Biodiversity::Parser.parse("Plantago major")
# to parse many scientific names with all details using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ])
# to get JSON representation
Biodiversity::Parser.parse("Plantago").to_json
# to clean name up
Biodiversity::Parser.parse(" Plantago major ")[:normalized]
# to get canonical form with or without infraspecies ranks, as well as
# stemmed version.
parsed = Biodiversity::Parser.parse("Seddera latifolia H. & S. var. latifolia")
parsed[:canonical][:full]
parsed[:canonical][:simple]
parsed[:canonical][:stem]
# to get detailed information about elements of the name
Biodiversity::Parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
Braun & Crous 2003")[:details]
# to parse a botanical cultivar
Biodiversity::Parser.parse("Sarracenia flava 'Maxima'", with_cultivars: true)
'Surrogate' is a broad group that includes 'Barcode of Life' names and various undetermined names containing cf., sp., spp. or nr.:
Biodiversity::Parser.parse("Coleoptera BOLD:1234567")[:surrogate]
The ID field contains a UUID v5 hexadecimal string. The ID is generated from the bytes of the name string itself, so an identical ID can be generated using any popular programming language. You can read more about UUID version 5 in a blog post.
For example, "Homo sapiens" should generate the UUID "16f235a0-e4a3-529c-9b83-bd15fe722110".
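As an illustration of generating such an ID without the parser, here is a minimal UUID v5 sketch that uses only Ruby's standard library. This README does not say which namespace is used; the sketch assumes the namespace is itself the UUID v5 of "globalnames.org" within the standard DNS namespace. Treat that as an assumption to verify against the gnparser documentation. The names `uuid_v5`, `GN_NAMESPACE` and `name_id` are hypothetical.

```ruby
require 'digest/sha1'

# Standard DNS namespace UUID defined by RFC 4122.
DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'

# Name-based UUID (version 5): SHA-1 of the namespace bytes followed by the
# name bytes, truncated to 16 bytes, with version and variant bits set.
def uuid_v5(namespace_uuid, name)
  ns_bytes = [namespace_uuid.delete('-')].pack('H*')
  bytes = Digest::SHA1.digest(ns_bytes + name.b).bytes[0, 16]
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set RFC 4122 variant
  hex = bytes.pack('C*').unpack1('H*')
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
end

# ASSUMPTION: namespace derived from the "globalnames.org" domain name.
GN_NAMESPACE = uuid_v5(DNS_NAMESPACE, 'globalnames.org')

# Hypothetical helper: ID for a verbatim name string.
def name_id(name_string)
  uuid_v5(GN_NAMESPACE, name_string)
end

puts name_id('Homo sapiens')
```

If the namespace assumption holds, the printed value should match the "Homo sapiens" UUID quoted above. The ID is computed from the exact bytes of the string, so any difference in whitespace or capitalization produces a different ID.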
Authors: Dmitry Mozzherin, Hernán Lucas Pereira
Contributors: Patrick Leary
Copyright (c) 2008-2024 Dmitry Mozzherin. See LICENSE for further details.