UTF Unknown

Detect the character set of files, streams, and other byte sequences.

Detection of character sets with a simple and redesigned interface.

This package is based on Ude and since version 2 also on uchardet, which are ports of the Mozilla Universal Charset Detector.

The interface and other classes have been redesigned, so the library is easier to use and follows a better object-oriented design (OOD). Unit tests and CI have been added.

Features:

New API
Moved to .NET Standard
Added more unit tests
Builds on CI (AppVeyor)
Strong named
Documentation added
Multiple bugs from Ude fixed

Supported Platforms

.NET 5+
.NET Standard 1.0+
.NET Core 3.0+
.NET Framework 4.0+

Remarks: You can still register your own EncodingProvider so that the Encoding.GetEncoding(...) method looks in it first.
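
As a minimal sketch of registering a provider: the example below uses `CodePagesEncodingProvider.Instance`, the standard .NET code-pages provider. That choice is an assumption made for illustration, since this README does not name a specific provider, and the `EncodingSetup` helper is hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingSetup
{
    // Hypothetical helper: registers the standard code-pages provider
    // (an assumption, not something this README prescribes), then
    // resolves an encoding by name through Encoding.GetEncoding.
    public static Encoding GetLegacyEncoding(string name)
    {
        // Registering the same provider more than once has no further effect.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(name);
    }
}
```

For example, `EncodingSetup.GetLegacyEncoding("windows-1251")` would be expected to return the Cyrillic code page 1251 once the provider is registered.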

Usage

Use the static DetectFromX methods of CharsetDetector.

// Detect from a file (.NET Standard 1.3+ or .NET Framework 4+)
DetectionResult result = CharsetDetector.DetectFromFile("path/to/file.txt"); // or pass FileInfo

// Detect from a stream (.NET Standard 1.3+ or .NET Framework 4+)
result = CharsetDetector.DetectFromStream(stream);

// Detect from bytes
result = CharsetDetector.DetectFromBytes(byteArray);

// Get the best detection
DetectionDetail resultDetected = result.Detected;

// Get the alias of the found encoding
string encodingName = resultDetected.EncodingName;

// Get the System.Text.Encoding of the found encoding (can be null if not available)
Encoding encoding = resultDetected.Encoding;

// Get the confidence of the found encoding (between 0 and 1)
float confidence = resultDetected.Confidence;

// Get all the details of the result
IList<DetectionDetail> allDetails = result.Details;
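
Because the detected Encoding can be null and the confidence ranges from 0 to 1, callers may want a fallback. The sketch below is one possible approach, not part of the library: the `DetectionFallback` helper, the UTF-8 fallback, and the 0.5 threshold are all assumptions chosen for illustration.

```csharp
using System.Text;

public static class DetectionFallback
{
    // Hypothetical helper. Pass in the values read from the usage above,
    // e.g. resultDetected.Encoding and resultDetected.Confidence.
    // The default threshold of 0.5 is an arbitrary, illustrative value.
    public static Encoding Choose(Encoding detected, float confidence, float minConfidence = 0.5f)
    {
        // Fall back to UTF-8 when no encoding is available
        // or the detector is not confident enough.
        if (detected == null || confidence < minConfidence)
        {
            return Encoding.UTF8;
        }
        return detected;
    }
}
```

The returned encoding can then be passed to APIs such as `File.ReadAllText(path, encoding)`. A suitable threshold depends on the input data and is best tuned against real samples.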

Docs

The article "A composite approach to language/encoding detection" describes the charset detection algorithms implemented by the library.

The following charsets are supported

Encodings with BOM: utf-7, utf-8, utf-16be/utf-16le, utf-32be/utf-32le, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431, gb18030.

Encodings without a BOM are listed in the table below, grouped by language:

Language	Encodings
International (Unicode)	`utf-8`
Arabic	`iso-8859-6`, `windows-1256`
Bulgarian	`iso-8859-5`, `windows-1251`
Chinese	`iso-2022-cn`, `big5`, `euc-tw`, `gb18030`, `hz-gb-2312`
Croatian	`iso-8859-2`, `iso-8859-13`, `iso-8859-16`, `windows-1250`, `ibm852`, `x-mac-ce`
Czech	`windows-1250`, `iso-8859-2`, `ibm852`, `x-mac-ce`
Danish	`iso-8859-1`, `iso-8859-15`, `windows-1252`
English	`ascii`
Esperanto	`iso-8859-3`
Estonian	`iso-8859-4`, `iso-8859-13`, `windows-1252`, `windows-1257`
Finnish	`iso-8859-1`, `iso-8859-4`, `iso-8859-9`, `iso-8859-13`, `iso-8859-15`, `windows-1252`
French	`iso-8859-1`, `iso-8859-15`, `windows-1252`
German	`iso-8859-1`, `windows-1252`
Greek	`iso-8859-7`, `windows-1253`
Hebrew	`iso-8859-8`, `windows-1255`
Hungarian	`iso-8859-2`, `windows-1250`
Irish Gaelic	`iso-8859-1`, `iso-8859-9`, `iso-8859-15`, `windows-1252`
Italian	`iso-8859-1`, `iso-8859-3`, `iso-8859-9`, `iso-8859-15`, `windows-1252`
Japanese	`iso-2022-jp`, `shift-jis`, `euc-jp`
Korean	`iso-2022-kr`, `euc-kr`/`uhc`, `cp949`
Lithuanian	`iso-8859-4`, `iso-8859-10`, `iso-8859-13`
Latvian	`iso-8859-4`, `iso-8859-10`, `iso-8859-13`
Maltese	`iso-8859-3`
Polish	`iso-8859-2`, `iso-8859-13`, `iso-8859-16`, `windows-1250`, `ibm852`, `x-mac-ce`
Portuguese	`iso-8859-1`, `iso-8859-9`, `iso-8859-15`, `windows-1252`
Romanian	`iso-8859-2`, `iso-8859-16`, `windows-1250`, `ibm852`
Russian	`iso-8859-5`, `koi8-r`, `windows-1251`, `x-mac-cyrillic`, `ibm855`, `ibm866`
Slovak	`windows-1250`, `iso-8859-2`, `ibm852`, `x-mac-ce`
Slovene	`iso-8859-2`, `iso-8859-16`, `windows-1250`, `ibm852`, `x-mac-ce`
Spanish	`iso-8859-1`, `iso-8859-15`, `windows-1252`
Swedish	`iso-8859-1`, `iso-8859-4`, `iso-8859-9`, `iso-8859-15`, `windows-1252`
Thai	`tis-620`, `iso-8859-11`
Turkish	`iso-8859-3`, `iso-8859-9`
Vietnamese	`viscii`, `windows-1258`
Others	`windows-1252`

Remarks: Some encoding aliases are not available: cp949, iso-2022-cn, euc-tw, iso-8859-10, iso-8859-16, viscii, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431. For some of them, DetectionDetail.Encoding returns a suitable replacement:

cp949: use ks_c_5601-1987
iso-2022-cn: use x-cp50227
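
The replacements above can be sketched as a small name lookup for code that resolves encodings from EncodingName itself. This is an illustration of the idea, not the library's implementation: the `EncodingFallback` helper is hypothetical, and registering `CodePagesEncodingProvider.Instance` is an assumption about how these legacy code pages are made available.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingFallback
{
    // Replacement names taken from the list above.
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "cp949", "ks_c_5601-1987" },
            { "iso-2022-cn", "x-cp50227" },
        };

    // Hypothetical helper: returns null when no encoding can be resolved.
    public static Encoding Resolve(string detectedName)
    {
        // Assumption: legacy code pages come from the standard provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string name = Replacements.TryGetValue(detectedName, out string mapped)
            ? mapped
            : detectedName;
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws for names it does not recognize.
            return null;
        }
    }
}
```

Names without a listed replacement, such as viscii, fall through to the null result, so callers still need to handle a missing encoding.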

License

The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

Test data has been extracted from Wikipedia and Project Gutenberg books and is subject to their licenses.
