All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Make `crowsetta.SimpleSeq.from_file` work with an "empty" csv file, one that has no annotated segments (e.g. because no audio was above threshold for segmenting) #280. Fixes #264.
- Add `default_label` argument to `crowsetta.SimpleSeq.from_file` that will add labels to segments in a csv file if there are none #280. Fixes #271.
- Add example csv file from Jourjine et al. 2023 dataset #280. Fixes #274.
- Add how-to showing how to work with unannotated segmentation in a csv file, using the csv from the Jourjine et al. 2023 dataset #280. Fixes #275.
- Rename `crowsetta.data` to `crowsetta.examples`, simplify how `crowsetta.example` works (to be more like `vocalpy.example`) and give all the example annotation files more specific names so that we can have multiple examples per annotation format #280. Fixes #278.
- Import format classes at package level, so we can just type e.g. `crowsetta.SimpleSeq` instead of `crowsetta.formats.seq.SimpleSeq` ("flat is better than nested") #280. Fixes #273.
- Make `crowsetta.SimpleSeq.from_file` use `columns_map` arg to rename only columns whose names are keys in the supplied dict, and ignore other columns in the csv file #280. Fixes #272.
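
The `columns_map` behavior described in the entry above (rename only the columns that appear as keys, ignore the rest) can be sketched with the standard library. This is an illustrative stand-in, not crowsetta's implementation: the helper name `rename_mapped_columns`, the sample column names, and the use of `csv.DictReader` are all assumptions made for the example.

```python
import csv
import io


def rename_mapped_columns(csv_text, columns_map):
    """Hypothetical helper: keep only the columns that are keys in
    ``columns_map`` and rename them to the mapped names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # columns that are not keys in columns_map are simply not copied over
        rows.append({new: row[old] for old, new in columns_map.items()})
    return rows


# made-up csv with one extra column ("source_file") that is not in the map
csv_text = (
    "start_seconds,stop_seconds,name,source_file\n"
    "0.1,0.4,a,bird1.wav\n"
    "0.5,0.9,b,bird1.wav\n"
)
columns_map = {
    "start_seconds": "onset_s",
    "stop_seconds": "offset_s",
    "name": "label",
}
rows = rename_mapped_columns(csv_text, columns_map)
```

Here `rows` holds one dict per segment with only the renamed columns; values stay as strings because `csv.DictReader` does no type conversion.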
- Replace deprecated `pandera.SchemaModel` with `DataFrameModel`, fixes `AttributeError` on import after new install #266. Fixes #265.
- Change range of Python version supported from 3.9-3.11 to 3.10-3.12 #268. Fixes #267.
- Vendor code from evfuncs and birdsong-recognition-dataset packages, to reduce the number of dependencies and make sure the code is maintained #263. Fixes #262.
- Fix bug in "generic-seq" format; use validated dataframe returned by pandera schema, so that "label" column is coerced to strings #258. Fixes #257.
- Add information on contributing and setting up a development environment #212. Fixes #30.
- Add method to convert generic sequence format to a pandas DataFrame #216.
- Add additional vignettes to docs: on removing "silent" labels from TextGrid annotations, on converting to the simple sequence and generic sequence formats #216. Fixes #152 and #197.
- Add format class for Audacity extended label track format #226. Fixes #222 and #213.
- Add the ability for a crowsetta.Annotation to have multiple sequences #243. Fixes #42.
- Rewrite TextGrid class to better handle file formats: parse both "short" and default format in either UTF-8 or UTF-16 encoding; remove empty intervals from interval tiers by default; can convert multiple interval tiers to a single crowsetta.Annotation with multiple crowsetta.Sequences #243. Fixes #241
- Revise landing page of docs, and some vignettes. Make other changes to clean up the docs build process #216.
- Coerce path-like attributes of `GenericSeq` dataframe schema to be strings. This helps ensure these columns are always native Pandas types #237.
- Fix how the `crowsetta.Segment` class converts onset sample and offset sample to int; correctly handle multiple numpy integer subtypes #238.
- c6ba100 Fix description and uri in pyproject.toml and crowsetta/about.py
- f70828f Make README images link to raw GitHub files so they render on PyPI
- add Raven format #164. Fixes #84.
- add example data #180. Fixes #90.
- add examples to docstrings, using example data #180. Fixes #158.
- import `register_format` at top level of package, to be able to just write `@crowsetta.register_format` #181. Fixes #177.
- add `'aud-txt'` format, for Audacity standard LabelTracks exported to .txt files #183. Fixes #96.
- add ability to extract example data to local file system; avoids need to use context manager returned by `importlib.resources` to access the example data files. #185. Fixes #184.
- add logo #198. Fixes #17.
- change `Annotation` class to represent both sequence-like annotation formats and bounding box-like annotation formats #164. Resolves #149 and #150.
- re-design API, and rewrite annotation formats as classes #161.
  - Re-writing as classes fixes #99.
  - API re-design fixes #120.
  - Adds an `interface` sub-package that specifies an interface for two types of annotations: sequence-like and bounding-box like. Fixes #105
  - All existing annotation formats were sequence-like, and they now adhere to that interface; the classes are registered as sub-classes.
  - Formats themselves are now in a `formats` sub-package, fixes #109
  - Add better functions to list the formats in this sub-package (fixes #92); can call `crowsetta.formats.as_list` to get a list of shorthand string names, and `crowsetta.formats.by_name` with the shorthand string name to get back the corresponding class.
  - `Transcriber.from_file` now returns an instance of an annotation format class. Methods like `to_annot` can be called on this instance. This refactor greatly simplifies the `Transcriber` class while maintaining mostly the same API (now need to chain calls like `Transcriber.from_file().to_annot()`, or capture the returned annotation instance in a variable and use it instead). Fixes #144.
- convert docs to markdown and use `myst-parser` #153. Fixes #151.
- require Python >= 3.8 to adhere to NEP-29 #168. Fixes #166.
- rename `Annotation.audio_path` attribute to `notated_path` to be more general, e.g., because annotations can also annotate a spectrogram #169. Fixes #148.
- rename `onset_ind` and `offset_ind` to `onset_sample` and `offset_sample` for clarity #174. Fixes #156.
- rename first parameter of `from_file` method for all format classes to `annot_path` for consistency. #182. Fixes #178.
- Revise documentation #191. Fixes #152 as well as #21, #35, #138, and #157.
- have `formats.as_list` return list `sorted` (i.e., alphabetically) #194. Fixes #187.
- fix `crowsetta.formats.register_format` function added in #161 and rewrite example custom annotation formats to use it #176. Fixes #119.
- remove `Stack` class -- was not being used #172. Fixes #170.
- remove deprecated `'csv'` format that was replaced by `'generic-seq'` #173. Fixes #171.
- remove `Meta` class -- no longer used #193. Fixes #190.
- change format names 'simple-csv' and 'csv' to 'simple-seq' and 'generic-seq', with the goal of eventually having 'simple-seq' work on other file formats, e.g. .txt, and for 'csv' to be the "generic" sequence format that allows for converting between others. #140. Fixes #133.
- deprecate the name 'csv' for the 'generic-seq' format; a FutureWarning is raised when creating a `Transcriber` with `format='csv'`. #143. Fixes #141.
- switch to using `nox` for development, instead of `make` #137. Fixes #132.
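
The deprecation of the 'csv' format name noted above follows a common pattern: accept the old name, emit a `FutureWarning`, and continue with the new name. A minimal sketch of that pattern, assuming a hypothetical helper `resolve_format_name` (this is not the code in `Transcriber`, and the warning text is invented for the example):

```python
import warnings


def resolve_format_name(name):
    """Hypothetical helper: map the deprecated name 'csv' to 'generic-seq'."""
    if name == "csv":
        warnings.warn(
            "The format name 'csv' is deprecated; use 'generic-seq' instead.",
            FutureWarning,
            stacklevel=2,
        )
        return "generic-seq"
    return name


# record warnings so we can inspect what was raised
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_format_name("csv")
```

Unlike `DeprecationWarning`, `FutureWarning` is shown by default to end users, which fits a rename that affects people calling the library rather than only developers.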
- change dependency / format name `koumura` to `birdsong-recognition-dataset` because package was renamed #126. Fixes #124.
- switch to using `flit` to build / publish. Remove `poetry`. #127. Fixes #125.
- move `textgrid` package into sub-package `_vendor`, since `flit` only works with a single top-level package. #127. This is the approach `pip` takes, as discussed on pypa/flit#497.
- rename attributes / variables `onsets_Hz` and `offsets_Hz` to `onset_inds` and `offset_inds` #128. Fixes #87.
- rename function `crowsetta.validation._parse_file` to `validate_ext` #129. Fixes #123.
- add a CITATION.cff file #103.
- add `'yarden'` format, that parses the `.mat` files saved by `SongAnnotationGUI`, and is used with the canary song dataset that accompanies the `tweetynet` paper. #122. Fixes #121.
- rewrite tests to use `pytest` #106. Fixes #89.
- change compatible Python versions to >3.6 and <3.10 #111.
- switch from using Make to using nox for development tasks #137. As suggested by Scikit-HEP. Fixes #132.
- Fix .TextGrid and .phn docstrings that referred to ".not.mat files" #118.
- add missing `packages` to pyproject.toml so that `textgrid` is included in build 857ba09
- add metadata to pyproject.toml so that README is used as "long description" and appears on PyPI e8b8209
- switch to using `poetry` for development #79
- raise minimum version of `evfuncs` to 0.3.1 #79
- raise minimum version of `koumura` to 0.2.0 #79
- change to using GitHub Actions for continuous integration #83
- fix dependencies and Python so they are not pinned to major version #83
- fix `phn2annot` function so it works with `.PHN` and `.WAV` files found in some versions of TIMIT dataset #75
  - needed to make extension checking case-insensitive, see issue #68
  - and also switch to `soundfile` library to be able to parse the specific NIST format of .WAV files
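
The case-insensitive extension check mentioned above can be sketched with `pathlib`. The helper name `has_extension` is hypothetical and is not the function used in crowsetta; it only illustrates the idea of lowercasing both sides before comparing.

```python
from pathlib import Path


def has_extension(path, extension):
    """Hypothetical helper: compare a file's extension, ignoring case."""
    return Path(path).suffix.lower() == extension.lower()


# upper-case extensions, as found in some versions of the TIMIT dataset
upper_phn = has_extension("SA1.PHN", ".phn")
upper_wav = has_extension("SA1.WAV", ".wav")
# a file with a different extension is still rejected
other = has_extension("SA1.TXT", ".phn")
```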
- add missing comma in `ENTRY_POINTS` in `setup.py` so that built-in formats are properly installed 599149f
- change name of `Transcriber` parameter `annot_format` to just `format` #64
- change name of `Annotation` attributes `annot_file` and `audio_file` to `annot_path` and `audio_path`, for clarity and to match what's used in the `vak` library #65
- add `phn` module that parses `.phn` files from TIMIT dataset #59
- change types of `Annotation` attributes `annot_file` and `audio_path` from `str` (string) to `pathlib.Path`, to fix errors raised when passing in `Path` objects (because the attribute validator requires a string), and because it's preferable to work with `Path` objects over strings #52
- change default value for `koumura2annot` parameter `wavpath` so that the function will work regardless of current working directory for user, instead of requiring them to be in the parent directory of the `.wav` files that `wavpath` refers to #53
- fixed error that `koumura2annot` function threw when `annot_file` was a `pathlib.Path` and not a string #53
- modify functions for `.not.mat` annotation files (created by evsonganaly GUI) so they do not require other files such as `.rec` files (created by evTAF data acquisition program)
  - `notmat.notmat2annot` no longer looks for `.rec` files, which it used to get the sampling rate and convert onsets and offsets from seconds to Hz
  - the `make_notmat` for creating `.not.mat` files from `Annotation`s also now expects onsets and offsets in seconds, not Hz.
    - the idea being that one can go from `.not.mat` to `Annotation` and back without doing any extra conversion. If user needs conversion to Hz for some other reason they can do this using the `Annotation`
- add `Annotation` class
  - which has 'audio_file' and 'annot_file' attributes, along with 'seq' attribute
- rewrite everything centered around `Annotation` class
  - meaning `Sequence` and `Segment` lose their redundant 'file' attributes and all format modules convert to and from `Annotations` and so does the csv module
- single-source version
  - now found in an `__about__.py` file in `src/crowsetta` that is used by `setup.py`.
- `segments` property of a `Sequence` is a tuple, not a list, so that class is immutable + hashable
- `__hash__` implementation for `Sequence` class
  - convert attributes that are `numpy.ndarray`s into tuples before hashing
- tests for `Sequence`
  - no longer assert that calling `__hash__` raises `NotImplementedError`
  - test that `segments` attribute is a `tuple` not a `list`
- implement hashing and equality for `Sequence` class
  - this makes it possible to use with concurrency, e.g. with the Dask library
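
The hashing and equality entries above rely on a simple idea: array-like attributes are converted to tuples, which are hashable, and both `__eq__` and `__hash__` are derived from the same tuple of fields. A minimal sketch under stated assumptions: the class `HashableSequence` is a hypothetical stand-in, not crowsetta's `Sequence`, and plain lists stand in for `numpy.ndarray` attributes so the example has no dependencies.

```python
class HashableSequence:
    """Hypothetical stand-in for a sequence of annotated segments."""

    def __init__(self, labels, onsets_s, offsets_s):
        # store array-like inputs as tuples, which are hashable
        self._labels = tuple(labels)
        self._onsets_s = tuple(onsets_s)
        self._offsets_s = tuple(offsets_s)

    def _key(self):
        # single source of truth for both equality and hashing
        return (self._labels, self._onsets_s, self._offsets_s)

    def __eq__(self, other):
        if not isinstance(other, HashableSequence):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


seq_a = HashableSequence(["a", "b"], [0.1, 0.5], [0.4, 0.9])
seq_b = HashableSequence(["a", "b"], [0.1, 0.5], [0.4, 0.9])
# equal instances collapse to one element in a set
unique = {seq_a, seq_b}
```

Keeping `__eq__` and `__hash__` tied to the same key guarantees that equal objects hash equally, which is what sets, dict keys, and task schedulers that deduplicate inputs depend on.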
- entry point group `crowsetta.format` to make it possible to 'install' formats
  - removes special casing for built-in formats, they just get added via entry point
  - instead of parsing a config.json file built into the package
- module for working with Praat Textgrid format
- `Meta` class which represents metadata about a format
  - such as file extension associated with it
  - and the module / functions that a `Transcriber` instance should use to work with this format
- Each instance of `Transcriber` has only one vocal annotation format that it handles
  - because it's annoying to type `file_format` every time you call a method like `to_seq`
  - instead you just make an instance of `Transcriber` for each format you want
  - This also works better with `crowsetta.format` entry points and `Meta` class; when you instantiate a `Transcriber` for a given `voc_format`, the `__init__` uses the `Meta` for that format to figure out which function to use for `to_seq`, `to_csv`, etc.
  - For this reason bumping to 1.0.0, new `Transcriber` not backwards compatible
    - although this will be inconvenient for millions of people
- Sequence instances have attributes: labels, onsets_s, offsets_s, onset_inds, offset_inds, and file.
- Explanation of default `to_csv` function for user formats in `howto-user-config`.
- Sequence class totally re-written
- no longer attrs-based
- because of somewhat complicated logic for validating arguments that was necessary in init (to prevent user from creating a 'bad' instance.)
- Sequences are immutable. Idea is they are just connectors between annotation and whatever user needs to do with it so you shouldn't need to change any attribute values after loading annotation
- Segment also immutable (by setting frozen=True in call to attr.s decorator)
- Transcriber.init uses config.json instead of config.ini to read defaults
- this makes init logic more readable since we don't have to convert user_config dict to strings and then back again; default config just loads as a dict from the .json file and we add the user_config dicts to it
- `data` module that downloads small example datasets for each annotation format
  - includes `formats` function that is imported at package level and prints formats built in to `crowsetta`
- `to_seq_func_to_csv` that takes a `yourformat2seq` function and returns a function that will convert the same format to csv files (just a wrapper around your function and `seq2csv`)
- for docs, Makefile that generates `./notebooks` folder from `./doc/notebooks`
- major revamp of docs
- `config_dict`s for `user_config` arg of Transcriber.init only require `module` and `to_seq` keys; `to_csv` and `to_format` are optional, and can be specified as Python `None` or a string `'None'`
- Transcriber raises `NotImplemented` error when `to_csv` or `to_format` are None for a specified format (instead of crashing mysteriously)
- `seq2csv` and `csv2seq` can deal with `None` values for one pair of onsets and offsets
- fix failing tests
- `Segment` class, attrs-based
  - has `asdict` method (wrapper around `attrs` function)
  - has class variable `_FIELDS` which is used in any place where we need to know how to go from `Segment` attributes to rows of a csv file, e.g. in src/crowsetta/csv.py and in tests
- `Sequence` class is now attrs-based, has factory functions, is itself just a list of `Segment`s
  - now has `to_dict` method
- `Crowsetta` class is now called `Transcriber`
- add Crowsetta class with simple interface for converting any annotation to
- add ability to work with user-defined functions
  - user passes an `extra_config` dict when instantiating Crowsetta
- add docs
- change package name to Crowsetta
- change function names so they are all 'format2seq' or 'format2csv' or 'toformat' for consistency
- Initial version after excising from hvc (https://github.com/NickleDave/hybrid-vocal-classifier/blob/master/hvc/utils/annotation.py)
- Convert tests to Python unittest format (instead of using PyTest library)
- Write README.md with usage