Codecs
Codec families are ordered by how far they stray from the originals (the closer to the top, the more standard they are).
Several codecs don't have official names per se and are internally given generic names (ex. PS1 ADPCM is just "ADPCM"), so the names used here may be just conventions.
Regarding audio quality, an important detail is that the encoder that creates the audio matters in the final sound quality. This is especially true for MDCT/psychoacoustic codecs, but also for ADPCM (for example, Unreal Engine 4 uses MS-ADPCM, but its home-baked encoder is much worse than Microsoft's).
Codecs are roughly defined by:
- layout or frame format (how data is organized, frame size, how channels are interleaved, etc)
- decoding function (how 4-bit becomes 16-bit)
- headers to seek/reset state, or none
Decoding roughly transforms ("expands") a 4-bit ADPCM code (nibble) into some value that depends on previous 16-bit history samples to produce the final 16-bit sample.
Most codecs are mono, with a few having stereo (or more) modes; the rest just interleave channel data at configurable sizes.
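The expand step above can be sketched as follows, for a hypothetical 4-bit codec (the names, scaling, and sign convention here are illustrative, not any specific format):

```python
def expand_nibble(nibble, hist1, step):
    """Hypothetical ADPCM expand: one 4-bit code + previous sample -> new 16-bit sample."""
    delta = step * (nibble & 0x7)       # low 3 bits scale the step
    if nibble & 0x8:                    # top bit is the sign
        delta = -delta
    sample = hist1 + delta
    return max(-32768, min(32767, sample))  # clamp to 16-bit range, as real decoders do
```

A decoder just loops over nibbles, feeding each output sample back as the next hist1; real codecs also adapt the step (or scale/filter) as they go, which is where the families below differ.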
IMA ADPCM
Defined by Intel/DVI and recommended by the IMA (Interactive Multimedia Association).
Main decoding depends on:
- step_table and step_index_table config
- nibble expand with a series of ops, then modifying previous sample
The codec is fairly simple and tends to be a bit noisy (precision is not high). This is (theoretically) improved in variations with "headered frames".
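The core expand step, per the standard IMA ADPCM algorithm (tables as published by the IMA; the variations listed below tweak exactly these details):

```python
# standard IMA ADPCM tables
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
    724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024,
    3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8,
               -1, -1, -1, -1, 2, 4, 6, 8]

def ima_expand(nibble, sample, index):
    """Expand one 4-bit code using the previous sample and step index as state."""
    step = STEP_TABLE[index]
    diff = step >> 3                # shifts/adds approximate (code + 0.5) * step / 4
    if nibble & 4: diff += step
    if nibble & 2: diff += step >> 1
    if nibble & 1: diff += step >> 2
    sample = sample - diff if nibble & 8 else sample + diff
    sample = max(-32768, min(32767, sample))    # clamp to 16-bit
    index = max(0, min(88, index + INDEX_TABLE[nibble & 0xF]))
    return sample, index
```

Note the state is just (sample, index); this is why headerless IMA can't seek safely, while "headered frames" variants store a fresh sample/index per frame to reset from.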
Main codecs
- IMA: base raw codec, no frames
- DVI IMA: same with different nibble order
- 3DS, SNDS, OTNS, WV6, ALP, FFTA2, etc: change main decoding slightly
- UBI IMA: custom file header and decoding
- MS-IMA: headered frames, odd samples per frame
- Reflections IMA: custom data layout
- NDS IMA: even samples per frame
- RAD IMA, DAT4 IMA
- Apple/Quicktime IMA: smaller header
- XBOX-IMA: fixed-size frames, even samples per frame
- FSB-IMA: different header format
- Wwise IMA: machine endian
- H4M IMA: variable frame formats controlled by blocks
- OKI (aka Dialogic ADPCM/VOX): smaller step table, 12-bit output
- PC-FX: modified decoding, buggy
- 'OKI 16': 16-bit output
- YAMAHA: custom tables
- AICA: minor 'filtering' differences
- 'framed' YAMAHA
- NXAP: variation
BRR ADPCM
Defined by Sony's researchers (mid-1980s) and patented (expired?); unsure of the actual name. BRR = Bit Rate Reduction.
Main decoding depends on:
- coefs/filter and shift/scale config
- scaled nibble, adjusted with hist1 sample * coef1 + hist2 sample * coef2
The codec is a bit more complex and allows better 'fine tuning' vs IMA.
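A sketch of this expand step, using the (f0, f1) filter pairs commonly documented for Sony's PS-ADPCM/VAG (other codecs in this family swap in their own tables and framing, and real decoders differ in rounding details):

```python
# 2-tap prediction filter pairs (f0, f1), as commonly documented for PS-ADPCM/VAG
FILTERS = [(0, 0), (60, 0), (115, -52), (98, -55), (122, -60)]

def ps_expand(nibble, shift, filter_idx, hist1, hist2):
    """Expand one 4-bit code; shift and filter_idx come from the frame header byte."""
    n = nibble - 16 if nibble & 8 else nibble   # sign-extend the 4-bit code
    scaled = (n << 12) >> shift                 # scale by the frame's shift value
    f0, f1 = FILTERS[filter_idx]
    prediction = (hist1 * f0 + hist2 * f1) >> 6  # hist1*coef1 + hist2*coef2, /64 fixed point
    sample = scaled + prediction
    return max(-32768, min(32767, sample))       # becomes hist1; old hist1 becomes hist2
```

The per-frame shift plus the choice of filter is the 'fine tuning' the text mentions: the encoder picks whichever predicts the frame's samples best.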
Codecs
- BRR / SNES ADPCM: simple layout, 4-coef table, shifts
- XA (PS1/CD-i): complex layout for CD format quirks
- ADP/DTK: simple layout, int hist clamping
- PS-ADPCM (aka VAG): simple layout, 5-coef table, SPU flags
- PS-ADPCM with bad flags: crafty devs reuse flags for other purposes
- PS-ADPCM with configurable frame size, no flags
- AFC/XMD/ASF/LSF/L5-555/PROCYON: quirky but close to XA
- FADPCM: complex layout
- EA-XA: slightly modified decoding
- MAXIS XA: modified layout
- EA-XA v2: has PCM blocks
- EA-XAS v0: has header frames
- EA-XAS v1: complex layout
- MS-ADPCM: configurable frames, scales, complex fixed table (theoretically configurable)
- Cricket Audio MSADPCM: minor variation
- ADX: scales, 2-coef table per file
- GC-ADPCM (aka DSP): scales, 16-coef table per file
- DSP with subinterleave
Other ADPCM
Custom, or unique enough.
- MTA: YAMAHA-like with multi tables
- MTA2: EA-XAS v1-like with shift tables
- HEVAG: PS-ADPCM-like with multi tables and 4 hist samples
- MC3: 3-bit ADPCM
- Westwood: VBR, multi-mode
- ACM: multi-mode, unknown
- ESS: Eugen Systems multi-pass ADPCM
- 8-bit XA
- Circus/NWA/other Japanese makers' A/DPCM: often weird and non-useful variations
Speech codecs
Speech codecs are different in that they use the characteristics of human speech (such as more mids) to compress. Since the human voice is more predictable and simpler than music, you can do things that don't make sense in other codecs. Techniques to achieve this can be quite different.
- EA-MT
- Speex
- ITU-T G.722.1 annex C (Polycom Siren14): MLT/IMLT based (somewhat MDCT-like)
- ITU-T G.719 annex B (Polycom Siren22): improved Siren14, almost a general audio codec
- SILK
MDCT codecs
(bear in mind my understanding of these codecs is limited and there can be inaccuracies and errors)
Very roughly, DCT/MDCT is a math function that simplifies data (for audio or any kind of file), allowing further and better compression techniques to be applied. But rather than encoding all data 1:1 (=lossless), parts that can be removed while still sounding ok enough to human ears (analyzed with "psychoacoustics") are discarded (=lossy), improving compression.
When encoding, audio is (more or less):
- divided into frames of a number of samples
- simplified through DCT/MDCT or similar math, converting samples ("time domain") into statistics ("frequency domain")
- classified into "bands" (rough grouping of signals)
- trimmed with psychoacoustics
- stats simplified/compressed/codified using as few bits as possible (codebooks)
- put into a custom bitstream (where data is read in variable amounts, like 2 bits, then 6 bits, then 12 bits, etc)
Decoding reverses those steps.
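The MDCT core of those steps can be shown with a naive direct transform (a sketch only; real codecs use fast FFT-based versions, and window shapes vary per codec). With a sine window applied at both analysis and synthesis, 50%-overlapped blocks reconstruct the input exactly, the "TDAC" property these codecs rely on:

```python
import math

def mdct(x, N):
    # forward MDCT: 2N windowed samples -> N frequency coefficients
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, N):
    # inverse MDCT: N coefficients -> 2N aliased samples; the alias cancels on overlap-add
    return [(2.0 / N) * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]

N = 8
# sine window, satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 == 1
window = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
signal = [math.sin(0.3 * t) for t in range(4 * N)]

out = [0.0] * (4 * N)
for start in (0, N, 2 * N):  # 50%-overlapped blocks of 2N samples
    coeffs = mdct([signal[start + n] * window[n] for n in range(2 * N)], N)
    y = imdct(coeffs, N)
    for n in range(2 * N):
        out[start + n] += y[n] * window[n]

# samples covered by two overlapping blocks come back exactly (TDAC)
err = max(abs(out[t] - signal[t]) for t in range(N, 3 * N))
```

The lossy part is not the transform itself (as above, it's invertible): it's quantizing the coefficients to few bits, guided by psychoacoustics, before bitstream packing.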
Codecs under this family do similar steps, but each uses its own collection of tricks and they can be quite different. For example, since audio from the L and R channels is often very similar, it can be partially grouped to improve compression (joint stereo), though this is not always used. Or audio volume could be scaled down first to get better compression, then scaled back up when decompressing.
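Joint stereo in its simplest mid/side form (a minimal sketch; real codecs decide per band or per frame whether to use it):

```python
def ms_encode(left, right):
    # mid = channel average, side = half the difference;
    # similar L/R means tiny side values, which compress much better
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    # exact inverse: L = M + S, R = M - S
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```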
- MPEG: CBR/VBR
  - MPEG Audio Layer I (MP1): the original
  - MPEG Audio Layer II (MP2): more complex, more samples per frame
  - AHX: fake 'deflated' frames
  - MPEG Audio Layer III (MP3): hacky MP2 extension, even more samples per frame
  - EA-MP3: PCM blocks
  - EALayer3: PCM blocks, simplified bitstream, can output 576 samples
- RELIC: somewhat MPEG-like, mono, simplistic
- Musepack (MPC): MPEG-like
- AC3
- AAC: robust, simplified vs MPEG
- HCA: CBR, cleaner and simpler decoding bitstream vs others
- Ogg Vorbis: VBR, weird Ogg layout, per-song codebooks (allows fine-tuning compression)
  - many simple encrypted/obfuscated variations just to make them harder to play outside the game
  - FSB5 Vorbis: simplified layout, common codebooks (many)
  - Wwise Vorbis: simplified layout, trimmed bitstream, common codebooks (few)
  - OGL Vorbis: simplified layout
- CELT: VBR, weird (never finalized, so there are many Xiph variations)
  - FSB CELT: simplified layout
  - CELT (for audio) along with SILK (for speech) were absorbed into Opus
- Ogg Opus: CBR/VBR, CELT+SILK variable modes, complex
  - Switch (NX) Opus: simplified layout
  - EA Opus: simplified layout
  - UE4 Opus: simplified layout
  - Exient Opus: simplified layout
- WMA: VBR, complex
- WMA Pro: VBR, more complex, multichannel support
  - XMA1/2: same as WMA Pro with fixed config/frames, stereo-pairs multichannel
  - EA-XMA: 'deflated' frames
- ATRAC3: CBR, simple/weird
- ATRAC3Plus: CBR, a bit less weird
- ATRAC9: CBR, multi-pass, complex
- Bink Audio: VBR
MP3 VS OGG?? Note that these codecs "trim with psychoacoustics", but what exactly is trimmed is not specified by the format. This means one MP3 encoder may decide to trim some things and another MP3 encoder other things, plus they may use the MP3 format in slightly different ways (there is room to use variations of tricks). Add to that bitrate settings (aka how much room the encoder has to play around). The same thing happens with OGG. So basically, comparing MP3 (the format) to OGG (the format) is not very useful; it's better to compare "MP3 encoded with X at Y bitrate" vs "OGG encoded with X at Y bitrate", since a crap encoder will sound like crap no matter the format.
A problem common to all these codecs is that decoding depends on previous frames. This causes a "delay" (silence) of at least 1 frame before getting audible sound data. Since samples per frame can be somewhat high (like ~1000 samples), this is not great for small, immediate SFX or gapless tracks. To solve it, the encoder usually specifies how long this silence is, so the player skips it when decoding. If your player doesn't understand this, though, you don't get proper gapless audio (vgmstream tries its best to handle it, since it's very important for looped audio, while for example FFmpeg gets this wrong in several codecs).
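A sketch of that skip logic (hypothetical names; the actual fields and where they live depend on the format, e.g. LAME tags in MP3 or pre-skip in Ogg Opus):

```python
def trim_decoded(frames, encoder_delay, total_samples):
    """Hypothetical post-decode trim for gapless playback.

    frames: list of decoded per-frame sample lists
    encoder_delay: priming samples the encoder prepended (to be skipped)
    total_samples: real stream length stated by the header (cuts end padding)
    """
    pcm = [s for frame in frames for s in frame]
    return pcm[encoder_delay:encoder_delay + total_samples]
```

A player that ignores encoder_delay/total_samples outputs the priming silence and the final frame's padding, which is what breaks seamless loops.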