
Welcome to the nupic.audio wiki!

This repository contains projects that investigate online learning from streaming audio data, taking advantage of NuPIC's hierarchical temporal memory (HTM). At the time of writing, two main areas are being investigated: auditory scene analysis tasks, and encoding and temporal dependencies in sequence learning.
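As a rough illustration, here is a minimal sketch (not the project's actual pipeline) of streaming encoded audio frames through NuPIC's spatial pooler and temporal memory. The module paths assume NuPIC 1.x (earlier releases exposed the same classes under `nupic.research.*`), and the encoder that produces the binary SDR is left abstract.

```python
# Hedged sketch: online HTM learning on an audio stream with NuPIC.
# Module paths assume NuPIC 1.x; parameters are illustrative defaults.
import numpy as np
from nupic.algorithms.spatial_pooler import SpatialPooler
from nupic.algorithms.temporal_memory import TemporalMemory

N_INPUT = 1024    # length of the binary SDR produced by some audio encoder
N_COLUMNS = 2048  # HTM column count

sp = SpatialPooler(inputDimensions=(N_INPUT,),
                   columnDimensions=(N_COLUMNS,),
                   globalInhibition=True)
tm = TemporalMemory(columnDimensions=(N_COLUMNS,))

def process_frame(encoded_frame):
    """Feed one encoded audio frame (0/1 numpy vector) through SP -> TM."""
    active = np.zeros(N_COLUMNS, dtype="uint32")
    sp.compute(encoded_frame, True, active)                # spatial pooling, learning on
    tm.compute(sorted(np.nonzero(active)[0]), learn=True)  # online sequence learning
    return tm.getPredictiveCells()                         # cells predicted for the next frame
```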

Scholar links - cortical learning algorithms (CLA), hierarchical temporal memory (HTM)

Repositories of interest

Note: These repositories are all currently work-in-progress.

Potential areas of investigation

  • Genre and style classification
  • Musical prediction and composition
  • Acoustic correlation using canonical correlation analysis (CCA)
  • Transient analysis (harmonic tracking)
  • Motion derivative encoding (similar to optical flow; see the sketch after this list)
  • Echolocation and spatial positioning (e.g. the anteroventral cochlear nucleus)
  • Stream segmentation and separation (includes selective attention)
  • Cortical pathways and projections, 'What' and 'Where' pathways (belts?)
  • Auditory nerve spike firing (e.g. inner hair cell (IHC) to cochlear nucleus (CN) globular bushy cell (GBC) integration)
  • Dendritic micro-circuits and synaptic placement (temporal smoothing)
  • Spike-timing dependent plasticity
  • Acetylcholine inhibition enhancing discharge frequency but decreasing synaptic adaptation
  • Acoustic related cell membrane and dendrite properties (cascading conductances, shunting)
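To make the motion-derivative item above concrete, here is a hedged numpy sketch: first-order time differences of a magnitude spectrogram, split into rising (onset) and falling (offset) energy per frequency bin, loosely analogous to optical flow along the time axis. The function name and frame layout are illustrative assumptions, not code from this repository.

```python
import numpy as np

def motion_derivative(spectrogram):
    """spectrogram: (n_bins, n_frames) magnitude array; returns signed deltas."""
    delta = np.diff(spectrogram, axis=1)   # frame-to-frame change per bin
    onsets = np.maximum(delta, 0.0)        # rising energy (onset-like motion)
    offsets = np.maximum(-delta, 0.0)      # falling energy (offset-like motion)
    return onsets, offsets
```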

An alternative approach to encoding audio signals is to model the spike firing of auditory-nerve fibers. A collection of models can be found at EarLab @ Boston University (http://earlab.bu.edu/, see Modelling -> Downloadable Models).

Another option is the Python cochlea package (https://github.com/mrkrd/cochlea), a collection of inner-ear models. All models are accessible as Python functions: they take a sound signal as input and return spike trains of the auditory-nerve fibers. The package contains state-of-the-art biophysical models that give a realistic approximation of auditory-nerve activity.
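As an example, a hedged sketch of driving the cochlea package with a pure tone might look like the following. The call follows the package's README (the Zilany 2014 model expects input at 100 kHz, with sound pressure in pascals), but argument names may differ between versions.

```python
# Hedged sketch: auditory-nerve spike trains via https://github.com/mrkrd/cochlea
import numpy as np
import cochlea

fs = 100e3                                  # the Zilany (2014) model expects 100 kHz
t = np.arange(0, 0.1, 1 / fs)
sound = 0.02 * np.sin(2 * np.pi * 1e3 * t)  # 1 kHz tone, amplitude in pascals

anf_trains = cochlea.run_zilany2014(
    sound, fs,
    anf_num=(100, 75, 25),   # high/medium/low spontaneous-rate fibre counts
    cf=(125, 20000, 100),    # 100 characteristic frequencies, 125 Hz to 20 kHz
    species='human',
    seed=0,
)
# anf_trains is a pandas DataFrame of spike trains (one row per fibre); binning
# the spike times would yield SDR-like binary vectors suitable for an HTM.
```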

Datasets

Mocha-TIMIT

Free for research - http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html

A set of 460 sentences designed to include the main connected speech processes in English (e.g. assimilations, weak forms, ...). Two speakers, one male and one female, are currently available, with another 38 planned for completion by May 2001. The subjects have a variety of English accents. All recordings were made in the same sound-damped studio at the Edinburgh Speech Production Facility; all data were recorded directly to computer and carefully synchronized.

CSTR US KED Timit

http://www.festvox.org/dbs/dbs_kdt.html

This contains 453 utterances spoken by a US male speaker. The database was collected at the University of Edinburgh's Centre for Speech Technology Research, but is distributed here as it serves as a good example of a general database for simple prosody modelling and unit selection. The speaker is the same as in the Festival ked_diphone voice. The database is free for any use (see the licence for details). It was hand-labelled and carefully corrected, includes EGG recordings, and Festival utterance structures are also included.

Online books and references

Background history

Section 6.3, "The Time-gap problem", of Wörgötter and Porr's (2005) review "Temporal Sequence Learning, Prediction, and Control - A Review of different models and their relation to biological mechanisms" [1] outlines the difficulty of 'bridging the gap between the time-scales of correlation based synaptic plasticity and those of behavioral learning'. Further investigation is required to determine whether aspects of spike-timing dependent plasticity (STDP) are required in an HTM network. Recent work from Engel, Chaisangmongkon, Freedman, and Wang (2015) [2] extends this to the learning of neuronal category representations.
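For reference, a minimal pair-based STDP rule with exponential windows is sketched below; the amplitudes and time constants are illustrative, and the sketch makes the time gap concrete: the plasticity window spans tens of milliseconds, while behavioral learning unfolds over seconds or longer.

```python
import numpy as np

# Illustrative constants, not fitted values.
A_PLUS, A_MINUS = 0.010, 0.012    # potentiation / depression amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0  # plasticity windows in milliseconds

def stdp_dw(dt_ms):
    """Weight change for one pre/post spike pair; dt_ms = t_post - t_pre."""
    if dt_ms > 0:   # pre fires before post: potentiate
        return A_PLUS * np.exp(-dt_ms / TAU_PLUS)
    else:           # post fires before (or with) pre: depress
        return -A_MINUS * np.exp(dt_ms / TAU_MINUS)
```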

Sparse coding is gaining traction in a growing number of fields, and alternative space-time transforms have evolved alongside it. Together they have led to successful applications such as perceptual audio coding (MPEG-1 Layer III, better known as MP3, and the MPEG-2/4 Advanced Audio Codec) and wavelet image compression. Numerous papers outline sparse encoding strategies for audio and music, for example [3][4][5][6].
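As a small illustration of sparse encoding of audio, here is a hedged scikit-learn sketch using dictionary learning with orthogonal matching pursuit. The frame length, dictionary size, and sparsity level are arbitrary assumptions, and random data stands in for windowed audio frames.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

frames = np.random.randn(500, 256)  # stand-in for 500 windowed audio frames

dico = DictionaryLearning(n_components=64,              # learned dictionary atoms
                          transform_algorithm='omp',    # orthogonal matching pursuit
                          transform_n_nonzero_coefs=8,  # at most 8 active atoms/frame
                          max_iter=10)
codes = dico.fit(frames).transform(frames)  # sparse codes: mostly zeros per row
```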

1 Temporal Sequence Learning, Prediction, and Control - A Review of different models and their relation to biological mechanisms
Wörgötter, F. and Porr, B. (2005)
http://www.berndporr.me.uk/tsl_woe_porr_nc2004/tsl_woe_porr_nc2004.pdf

2 Choice-correlated activity fluctuations underlie learning of neuronal category representation
Engel TA, Chaisangmongkon W, Freedman DJ, Wang X-J (2015)
Nature Communications 6: 6454. doi:10.1038/ncomms7454
http://www.cns.nyu.edu/wanglab/publications/

3 Sparse representations in audio and music: from coding to source separation
M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies (2010)
https://scholar.google.co.uk/citations?view_op=view_citation&hl=en&user=28TCymYAAAAJ&citation_for_view=28TCymYAAAAJ:Zph67rFs4hoC

4 Sparse time-frequency representations
T.J. Gardner, and M.O. Magnasco (2006)
http://www.pnas.org/content/103/16/6094.short

5 Sparse and Shift-Invariant Representations of Music & Sparse Representations of Polyphonic Music
T. Blumensath and M. E. Davies (2010)
http://users.fmrib.ox.ac.uk/~tblumens/publications.html

6 On the use of sparse time-relative auditory codes for music
PA Manzagol, T Bertin-Mahieux, D Eck (2008)
https://scholar.google.co.uk/scholar?q=On+the+use+of+sparse+time-relative+auditory+codes+for+music&hl=en&as_sdt=0&as_vis=1&oi=scholart&sa=X&ei=RG0iVYSfCJH5aryZgrAB&ved=0CB4QgQMwAA

7 Reorganization of the cochleotopic map in the bat's auditory system by inhibition
Zhongju Xiao and Nobuo Suga (2002)
doi:10.1073/pnas.242606699
http://www.pnas.org/content/99/24/15743.full

8 Figure 4: STRFs at different levels of the auditory system. (from 'Neural processing of natural sounds')
Frédéric E. Theunissen and Julie E. Elie (2014)
doi:10.1038/nrn3731
http://www.nature.com/nrn/journal/v15/n6/fig_tab/nrn3731_F4.html