Character encoding auto-detection in JavaScript (port of Python's chardet)

aadsm/jschardet

JsChardet

Port of Python's chardet (https://github.com/chardet/chardet).

License

LGPL

How To Use It

Node

npm install jschardet
var jschardet = require("jschardet")

// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }

// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }

// "<string>Martin Kühl</string>" in windows-1252
jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
//   {encoding: "windows-1252", confidence: 0.95},
//   {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
//   {encoding: "SHIFT_JIS", confidence: 0.01}
// ]
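
The examples above pass strings in which each character stands for one byte. When the data arrives as raw bytes instead (for instance a Uint8Array in the browser), it may need to be converted to that form first. The following is a minimal sketch under that assumption; toBinaryString is a hypothetical helper name and is not part of jschardet.

```javascript
// Hypothetical helper (not part of jschardet): turn raw bytes into a
// "binary string" with one character per byte (char codes 0-255),
// which is the shape of the escaped strings used in the examples.
function toBinaryString(bytes) {
  const chars = new Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    chars[i] = String.fromCharCode(bytes[i]);
  }
  return chars.join("");
}

// First two characters of the UTF-8 example, as raw bytes.
const utf8Bytes = new Uint8Array([0xc3, 0xa0, 0xc3, 0xad]);
const input = toBinaryString(utf8Bytes); // equal to "\xc3\xa0\xc3\xad"

// With jschardet loaded, the converted string could then be passed on:
// jschardet.detect(input)
```

A plain loop is used rather than spreading the whole array into String.fromCharCode, which can exceed the engine's argument limit on large inputs.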

Browser

Copy and include jschardet.min.js in your web page.

This library is also available on cdnjs at https://cdnjs.cloudflare.com/ajax/libs/jschardet/1.4.1/jschardet.min.js

Options

// See all information related to the confidence levels of each encoding.
// This is useful to see why you're not getting the expected encoding.
jschardet.enableDebug();

// The default minimum accepted confidence level is 0.20, but sometimes this is
// too strict, especially when dealing with files that contain mostly numbers.
// Set it to 0 to always get a result, or to any other value that works
// for you.
jschardet.detect(str, { minimumThreshold: 0 });

// Restrict which encodings to detect. This can be useful when jschardet
// gives a higher probability to encodings that you never use.
jschardet.detect(str, { detectEncodings: ["UTF-8", "windows-1252"] });
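
With a low minimumThreshold, a result may come back with very little confidence, so callers may want their own fallback rule. The sketch below is one possible approach, not part of jschardet: chooseEncoding, its default fallback of "UTF-8", and its 0.5 cutoff are all illustrative assumptions. It also assumes that a failed detection yields an object whose encoding is null.

```javascript
// Hypothetical helper (not part of jschardet): pick the detected encoding
// only when it is present and confident enough, otherwise use a fallback.
function chooseEncoding(result, fallback = "UTF-8", minConfidence = 0.5) {
  if (!result || !result.encoding) {
    return fallback; // nothing was detected
  }
  if (typeof result.confidence !== "number" || result.confidence < minConfidence) {
    return fallback; // detected, but not trustworthy enough
  }
  return result.encoding;
}

// Result objects shaped like the ones shown in the examples above.
chooseEncoding({ encoding: "Big5", confidence: 0.99 });      // "Big5"
chooseEncoding({ encoding: "SHIFT_JIS", confidence: 0.01 }); // "UTF-8"
chooseEncoding({ encoding: null, confidence: 0 });           // "UTF-8"
```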

Supported Charsets

  • Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
  • EUC-KR and ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
  • ISO-8859-2 and windows-1250 (Hungarian)
  • ISO-8859-5 and windows-1251 (Bulgarian)
  • windows-1252
  • ISO-8859-7 and windows-1253 (Greek)
  • ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)
  • UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM)
  • UTF-16 BE or LE (with a BOM)
  • UTF-8 (with or without a BOM)
  • ASCII

Technical Information

I haven't been able to create tests to correctly detect:

  • ISO-2022-CN
  • windows-1250 in Hungarian
  • windows-1251 in Bulgarian
  • windows-1253 in Greek
  • EUC-CN

Development

Use npm run dist to update the distribution files. They're available at https://github.com/aadsm/jschardet/tree/master/dist.

Authors

Ported from Python to JavaScript by António Afonso (https://github.com/aadsm/jschardet)

Transformed into an npm package by Markus Ast (https://github.com/brainafk)