Codecs
Codec families are ordered by how far they stray from the originals (the closer to the top, the more standard they are).
Several codecs don't have official names per se and are internally given generic names (ex. PS1 ADPCM is just "ADPCM"), so the names used here may be just conventions.
Regarding audio quality, an important detail is that the encoder that creates the audio matters in the final sound quality. This is especially true for MDCT/psychoacoustic codecs, but also for ADPCM (for example, Unreal Engine 4 uses MS-ADPCM, but its home-baked encoder is much worse than Microsoft's).
Codecs are roughly defined by:
- layout or frame format (how data is organized, frame size, how channels are interleaved, etc)
- decoding function (how 4-bit becomes 16-bit)
- headers to seek/reset state, or none
Decoding roughly transforms ("expands") a 4-bit ADPCM code (nibble) into some value that depends on previous 16-bit history samples to produce the final 16-bit sample.
Most codecs are mono, with a few having stereo (or more) modes; the rest just interleave channel data at configurable sizes.
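The expand step above can be sketched as follows, for a hypothetical 4-bit codec (the names, scaling, and sign convention here are illustrative, not any specific format):

```python
def expand_nibble(nibble, hist1, step):
    """Hypothetical ADPCM expand: one 4-bit code + previous sample -> new 16-bit sample."""
    delta = step * (nibble & 0x7)       # low 3 bits scale the step
    if nibble & 0x8:                    # top bit is the sign
        delta = -delta
    sample = hist1 + delta
    return max(-32768, min(32767, sample))  # clamp to 16-bit range, as real decoders do
```

A decoder just loops over nibbles, feeding each output sample back as the next hist1; real codecs also adapt the step (or scale/filter) as they go, which is where the families below differ.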
IMA ADPCM
Defined by Intel/DVI and recommended by the IMA (Interactive Multimedia Association).
Main decoding depends on:
- step_table and step_index_table config
- nibble expand with a series of ops, then modifying previous sample
The codec is fairly simple and tends to be a bit noisy (precision is not high). This is (theoretically) improved in variations with "headered frames".
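The core expand step, per the standard IMA ADPCM algorithm (tables as published by the IMA; the variations listed below tweak exactly these details):

```python
# standard IMA ADPCM tables
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
    724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024,
    3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8,
               -1, -1, -1, -1, 2, 4, 6, 8]

def ima_expand(nibble, sample, index):
    """Expand one 4-bit code using the previous sample and step index as state."""
    step = STEP_TABLE[index]
    diff = step >> 3                # shifts/adds approximate (code + 0.5) * step / 4
    if nibble & 4: diff += step
    if nibble & 2: diff += step >> 1
    if nibble & 1: diff += step >> 2
    sample = sample - diff if nibble & 8 else sample + diff
    sample = max(-32768, min(32767, sample))    # clamp to 16-bit
    index = max(0, min(88, index + INDEX_TABLE[nibble & 0xF]))
    return sample, index
```

Note the state is just (sample, index); this is why headerless IMA can't seek safely, while "headered frames" variants store a fresh sample/index per frame to reset from.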
Main codecs
- IMA: base raw codec, no frames
- DVI IMA: same with different nibble order
- 3DS, SNDS, OTNS, WV6, ALP, FFTA2, etc: change main decoding slightly
- UBI IMA: custom file header and decoding
- MS-IMA: headered frames, odd samples per frame
- Reflections IMA: custom data layout
- NDS IMA: even samples per frame
- RAD IMA, DAT4 IMA
- Apple/Quicktime IMA: smaller header
- XBOX-IMA: fixed-size frames, even samples per frame
- FSB-IMA: different header format
- Wwise IMA: machine endian
- H4M IMA: variable frame formats controlled by blocks
- OKI (aka Dialogic ADPCM/VOX): smaller step table, 12-bit output
- PC-FX: modified decoding, buggy
- 'OKI 16': 16-bit output
- YAMAHA: custom tables
- AICA: minor 'filtering' differences
- 'framed' YAMAHA
- NXAP: variation
BRR ADPCM
Defined by Sony's researchers (mid-1980s) and patented (expired?); unsure of the actual name. BRR = Bit Rate Reduction.
Main decoding depends on:
- coefs/filter and shift/scale config
- scaled nibble, adjusted with hist1 sample * coef1 + hist2 sample * coef2
The codec is a bit more complex and allows better 'fine tuning' vs IMA.
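A sketch of this expand step, using the (f0, f1) filter pairs commonly documented for Sony's PS-ADPCM/VAG (other codecs in this family swap in their own tables and framing, and real decoders differ in rounding details):

```python
# 2-tap prediction filter pairs (f0, f1), as commonly documented for PS-ADPCM/VAG
FILTERS = [(0, 0), (60, 0), (115, -52), (98, -55), (122, -60)]

def ps_expand(nibble, shift, filter_idx, hist1, hist2):
    """Expand one 4-bit code; shift and filter_idx come from the frame header byte."""
    n = nibble - 16 if nibble & 8 else nibble   # sign-extend the 4-bit code
    scaled = (n << 12) >> shift                 # scale by the frame's shift value
    f0, f1 = FILTERS[filter_idx]
    prediction = (hist1 * f0 + hist2 * f1) >> 6  # hist1*coef1 + hist2*coef2, /64 fixed point
    sample = scaled + prediction
    return max(-32768, min(32767, sample))       # becomes hist1; old hist1 becomes hist2
```

The per-frame shift plus the choice of filter is the 'fine tuning' the text mentions: the encoder picks whichever predicts the frame's samples best.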
Codecs
- BRR / SNES ADPCM: simple layout, 4-coef table, shifts
- XA (PS1/CD-i): complex layout for CD format quirks
- ADP/DTK: simple layout, int hist clamping
- PS-ADPCM (aka VAG): simple layout, 5-coef table, SPU flags
- PS-ADPCM with bad flags: crafty devs reuse flags for other purposes
- PS-ADPCM with configurable frame size, no flags
- AFC/XMD/ASF/LSF/L5-555/PROCYON: quirky but close to XA
- FADPCM: complex layout
- EA-XA: slightly modified decoding
- MAXIS XA: modified layout
- EA-XA v2: has PCM blocks
- EA-XAS v0: has header frames
- EA-XAS v1: complex layout
- MS-ADPCM: configurable frames, scales, complex fixed table (theoretically configurable)
- Cricket Audio MSADPCM: minor variation
- ADX: scales, 2-coef table per file
- GC-ADPCM (aka DSP): scales, 16-coef table per file
- DSP with subinterleave
Other ADPCM
Custom, or unique enough.
- MTA: YAMAHA-like with multi tables
- MTA2: EA-XAS v1-like with shift tables
- HEVAG: PS-ADPCM-like with multi tables and 4 hist samples
- MC3: 3-bit ADPCM
- Westwood: VBR, multi-mode
- ACM: multi-mode, unknown
- ESS: Eugen Systems multi-pass ADPCM
- 8-bit XA
- Circus/NWA/other Japanese makers' A/DPCM: often weird and non-useful variations
Speech codecs
Speech codecs are different in that they use the characteristics of human speech (such as more mids) to compress. Since the human voice is more predictable and simpler than music, you can do things that don't make sense in other codecs. Techniques to achieve this can be quite different.
- EA-MT
- Speex
- ITU-T G.722.1 annex C (Polycom Siren14): MLT/IMLT based (somewhat MDCT-like)
- ITU-T G.719 annex B (Polycom Siren22): improved Siren14, almost a general audio codec
- SILK
MDCT codecs
(bear in mind my understanding of these codecs is limited and there can be inaccuracies and errors)
Very roughly, DCT/MDCT is a math function that simplifies data (for audio or any kind of file), allowing further and better compression techniques to be applied. But rather than encoding all data 1:1 (=lossless), parts that can be removed while still sounding ok enough to human ears (analyzed with "psychoacoustics") are discarded (=lossy), improving compression.
When encoding, audio is (more or less):
- divided into frames of a number of samples
- simplified through DCT/MDCT or similar math, converting samples ("time domain") into statistics ("frequency domain")
- classified into "bands" (rough grouping of signals)
- trimmed with psychoacoustics
- stats simplified/compressed/codified using as few bits as possible (codebooks)
- put into a custom bitstream (where data is read in variable amounts, like 2 bits, then 6 bits, then 12 bits, etc)
Decoding reverses those steps.
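The MDCT core of those steps can be shown with a naive direct transform (a sketch only; real codecs use fast FFT-based versions, and window shapes vary per codec). With a sine window applied at both analysis and synthesis, 50%-overlapped blocks reconstruct the input exactly, the "TDAC" property these codecs rely on:

```python
import math

def mdct(x, N):
    # forward MDCT: 2N windowed samples -> N frequency coefficients
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, N):
    # inverse MDCT: N coefficients -> 2N aliased samples; the alias cancels on overlap-add
    return [(2.0 / N) * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]

N = 8
# sine window, satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 == 1
window = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]
signal = [math.sin(0.3 * t) for t in range(4 * N)]

out = [0.0] * (4 * N)
for start in (0, N, 2 * N):  # 50%-overlapped blocks of 2N samples
    coeffs = mdct([signal[start + n] * window[n] for n in range(2 * N)], N)
    y = imdct(coeffs, N)
    for n in range(2 * N):
        out[start + n] += y[n] * window[n]

# samples covered by two overlapping blocks come back exactly (TDAC)
err = max(abs(out[t] - signal[t]) for t in range(N, 3 * N))
```

The lossy part is not the transform itself (as above, it's invertible): it's quantizing the coefficients to few bits, guided by psychoacoustics, before bitstream packing.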
Codecs under this family do similar steps, but each uses its own collection of tricks and they can be quite different. For example, since audio from the L and R channels is often very similar, it can be partially grouped to improve compression (joint stereo), though this is not always used. Or audio volume could be scaled down first to get better compression, then scaled back up when decompressing.
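Joint stereo in its simplest mid/side form (a minimal sketch; real codecs decide per band or per frame whether to use it):

```python
def ms_encode(left, right):
    # mid = channel average, side = half the difference;
    # similar L/R means tiny side values, which compress much better
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    # exact inverse: L = M + S, R = M - S
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```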
- MPEG: CBR/VBR
  - MPEG Audio Layer I (MP1): the original
  - MPEG Audio Layer II (MP2): more complex, more samples per frame
  - AHX: fake 'deflated' frames
  - MPEG Audio Layer III (MP3): hacky MP2 extension, even more samples per frame
  - EA-MP3: PCM blocks
  - EALayer3: PCM blocks, simplified bitstream, can output 576 samples
- RELIC: somewhat MPEG-like, mono, simplistic
- Musepack (MPC): MPEG-like
- AC3
- AAC: robust, simplified vs MPEG
- HCA: CBR, cleaner and simpler decoding bitstream vs others
- Ogg Vorbis: VBR, weird Ogg layout, per-song codebooks (allows fine-tuning compression)
  - many simple encrypted/obfuscated variations just to make them harder to play outside the game
  - FSB5 Vorbis: simplified layout, common codebooks (many)
  - Wwise Vorbis: simplified layout, trimmed bitstream, common codebooks (few)
  - OGL Vorbis: simplified layout
- CELT: VBR, weird (never finalized, so there are many Xiph variations)
  - FSB CELT: simplified layout
  - CELT (for audio) along with SILK (for speech) were absorbed into Opus
- Ogg Opus: CBR/VBR, CELT+SILK variable modes, complex
  - Switch (NX) Opus: simplified layout
  - EA Opus: simplified layout
  - UE4 Opus: simplified layout
  - Exient Opus: simplified layout
- WMA: VBR, complex
- WMA Pro: VBR, more complex, multichannel support
  - XMA1/2: same as WMA Pro with fixed config/frames, stereo-pairs multichannel
  - EA-XMA: 'deflated' frames
- ATRAC3: CBR, simple/weird
- ATRAC3Plus: CBR, a bit less weird
- ATRAC9: CBR, multi-pass, complex
- Bink Audio: VBR
MP3 VS OGG?? Note that these codecs "trim with psychoacoustics", but what exactly is trimmed is not specified by the format. This means one MP3 encoder may decide to trim some things and another MP3 encoder other things, plus they may use the MP3 format in slightly different ways (there is room to use variations of tricks). Add to that bitrate settings (aka how much room the encoder has to play around). The same thing happens with OGG. So basically, comparing MP3 (the format) to OGG (the format) is not very useful; it's better to compare "MP3 encoded with X at Y bitrate" vs "OGG encoded with X at Y bitrate", since a crap encoder will sound like crap no matter the format.
A problem common to all these codecs is that decoding depends on previous frames. This causes a "delay" (silence) of at least 1 frame before getting audible sound data. Since samples per frame can be somewhat high (like ~1000 samples), this is not great for small, immediate SFX or gapless tracks. To solve it, the encoder usually specifies how long this silence is, so the player skips it when decoding. If your player doesn't understand this, though, you don't get proper gapless audio (vgmstream tries its best to handle it, since it's very important for looped audio, while for example FFmpeg gets this wrong in several codecs).
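A sketch of that skip logic (hypothetical names; the actual fields and where they live depend on the format, e.g. LAME tags in MP3 or pre-skip in Ogg Opus):

```python
def trim_decoded(frames, encoder_delay, total_samples):
    """Hypothetical post-decode trim for gapless playback.

    frames: list of decoded per-frame sample lists
    encoder_delay: priming samples the encoder prepended (to be skipped)
    total_samples: real stream length stated by the header (cuts end padding)
    """
    pcm = [s for frame in frames for s in frame]
    return pcm[encoder_delay:encoder_delay + total_samples]
```

A player that ignores encoder_delay/total_samples outputs the priming silence and the final frame's padding, which is what breaks seamless loops.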