How character sets work
Firebird uses character encodings in two different areas:
- The database encoding defines the character set in which all `CHAR`, `VARCHAR`, and `BLOB SUB_TYPE 1` fields are physically stored on disk. The default database encoding is specified during database creation; it is also possible to specify character sets on a per-column basis.
- The client connection encoding defines the character set in which the client sends and expects to receive character data. This encoding may or may not match the database default encoding.
The Firebird RDBMS performs translation between the character set of the client connection and the character set of the stored content. The list of allowed character sets, as well as the allowed translations between them, is specified in the fbintl shared library located in the `intl/` directory of the Firebird installation. There is also a special character set `NONE` that tells Firebird not to interpret the contents of a character field.
Firebird uses the following algorithm when performing translations:
- If the source and target character sets match, send the content unchanged.
- If the target character set is `NONE`, send the source content unchanged.
- If the source character set is `NONE`, send the source content unchanged.
- If there is a direct translation rule between the source and target character sets, use that rule.
- If there is no direct translation rule, check whether there is a rule to translate the source character set into `UTF8` (or `UNICODE_FSS`) and a rule to translate from `UTF8` (or `UNICODE_FSS`) into the target character set. If yes, use these two rules for translation.
- If no suitable translation rule can be found, throw an exception.
This algorithm is simple and straightforward, but it can create unexpected results in more complex cases, where multiple translations are required for a single result set row.
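The steps above can be sketched as a simple rule lookup. This is an illustrative model only: the rule table below is invented for the example and is not the real fbintl rule set.

```java
import java.util.Set;

// Illustrative sketch of Firebird's translation-rule resolution; the
// DIRECT_RULES table is a made-up example, not actual fbintl contents.
public class CharsetTranslation {
    static final Set<String> DIRECT_RULES = Set.of(
            "ISO8859_1->WIN1252", "ISO8859_1->UTF8", "UTF8->WIN1251");

    static boolean hasDirectRule(String source, String target) {
        return DIRECT_RULES.contains(source + "->" + target);
    }

    /** Describes how content would be translated from source to target. */
    static String resolve(String source, String target) {
        if (source.equals(target)) return "unchanged";            // sets match
        if ("NONE".equals(target)) return "unchanged";            // target is NONE
        if ("NONE".equals(source)) return "unchanged";            // source is NONE
        if (hasDirectRule(source, target)) return "direct";       // direct rule
        if (hasDirectRule(source, "UTF8") && hasDirectRule("UTF8", target)) {
            return "via UTF8";                                    // two-step rule
        }
        throw new IllegalArgumentException(
                "No translation from " + source + " to " + target);
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO8859_1", "WIN1251")); // via UTF8
    }
}
```

Note that with this resolution order, a `NONE` endpoint always short-circuits the lookup, which is exactly why no translation errors are ever reported for `NONE` connections.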
Java introduces additional complexity when the `NONE` character set is used. The reason is that Java internally stores all strings in Unicode format, and the application must tell the JVM which character encoding to use when interpreting byte content. When the `NONE` character set is used, the driver does not know how to interpret the received data. The two choices left to the driver are to construct a string using the default character encoding of the JVM, or to use the hint specified in the connection properties. The default encoding usually matches the regional settings of the operating system and can be accessed from within the JVM through the `file.encoding` system property. At this point a second conversion happens: the JVM interprets the received content as if it were stored using the default VM character set.

It is clear that this interpretation inside the JVM can lead to errors when the same content is accessed from two or more Java Virtual Machines that have different configurations.
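A minimal sketch of this failure mode: the same raw byte from a `NONE` column decodes to different characters depending on which charset a JVM applies. The example uses explicit `Charset` names in place of the actual JVM defaults, so it is reproducible anywhere.

```java
import java.nio.charset.Charset;

public class NoneCharsetDemo {
    /** Decode raw column bytes the way a JVM with the given default charset would. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xE9 }; // a single byte as stored in a NONE column
        // A JVM defaulting to Latin-1 sees 'é'; one defaulting to windows-1251 sees 'й'
        System.out.println(decode(raw, "ISO-8859-1") + " vs " + decode(raw, "windows-1251"));
    }
}
```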
The most severe consequences of this phenomenon occur on Linux and other Unix platforms, where it is possible that the regional settings are not configured and the default "C" locale is used. If, for example, the default character set of the database is `NONE` and the connection encoding is also `NONE`, no translation errors or problems occur between the driver and the database server. Jaybird correctly receives the contents of the database and keeps them internally as a byte array. However, when this content is accessed from the Java application, Jaybird converts that byte array into a `java.lang.String` instance using the default JVM encoding, which in this case leads to a situation where all non-ASCII characters are replaced with question marks ("?").
Therefore, an application should use the `NONE` character encoding for a database and a connection only when at least one of the following conditions is met:
- The database will contain only ASCII characters.
- It is guaranteed that all Java Virtual Machines accessing the database will have the same default encoding, and that this encoding can correctly handle all characters stored in the database.
As a partial workaround, one can specify the encoding that should be used to interpret bytes coming from the server in the `charSet` connection property. The following rules are used when interpreting the `encoding` and `charSet` properties:
- When only the `encoding` property is specified, the driver uses the default mapping between server and Java encodings. When the `encoding` property is not set, or is set to `NONE`, the default JVM encoding is used to interpret bytes coming from the server.
- When only the `charSet` property is specified, the driver uses the reverse mapping to determine the connection encoding for the server, and interprets the byte stream according to the value of the property.
- When both the `encoding` and `charSet` properties are specified, the driver sets the connection encoding according to the value of the `encoding` property, but interprets the byte stream according to the `charSet` property.
The last case is the most powerful, but also the most dangerous. When used properly it can solve problems with legacy databases; when used incorrectly it can trash the contents of the database.
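The third case could be set up roughly as follows. This is a sketch only: the credentials and JDBC URL are placeholders, and whether `windows-1252` is the right interpretation depends entirely on how the legacy data was originally written.

```java
import java.util.Properties;

public class ConnectionCharsetExample {
    /**
     * Properties for the "both properties" case: the server is told the
     * connection encoding is NONE, while the driver decodes the byte
     * stream as windows-1252 on the Java side.
     */
    static Properties legacyRescueProperties() {
        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");          // placeholder credentials
        props.setProperty("password", "masterkey");
        props.setProperty("encoding", "NONE");        // connection encoding sent to the server
        props.setProperty("charSet", "windows-1252"); // how the driver decodes the byte stream
        return props;
    }

    public static void main(String[] args) {
        Properties props = legacyRescueProperties();
        // Example URL for illustration; adjust host, port and path to your setup:
        // String url = "jdbc:firebirdsql://localhost:3050/employee";
        // Connection con = DriverManager.getConnection(url, props);
        System.out.println(props.getProperty("encoding") + " / " + props.getProperty("charSet"));
    }
}
```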
Unfortunately, we cannot recommend using the `UNICODE_FSS` character set as the database default encoding in versions prior to Firebird 2.0 because of two limitations:
- `UNICODE_FSS` characters can occupy up to 3 bytes, which decreases the allowed maximum size of columns and indexes.
- The `UNICODE_FSS` character set does not define any collation and does not produce correct results when used with the `ORDER BY` clause.
Firebird 2.0 has better support for Unicode, as well as an increased maximum size of the index entry (it must be smaller than a quarter of the database page size; e.g. the maximum size of indexed columns with a 4k page is 1k).
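The interaction between the two limits can be worked out with simple arithmetic. This sketch applies only the quarter-page rule stated above and the 3-bytes-per-character worst case of `UNICODE_FSS`; the exact limits also depend on per-version index overhead not modeled here.

```java
public class IndexSizeEstimate {
    /** Upper bound on index key bytes: a quarter of the database page size. */
    static int maxIndexKeyBytes(int pageSize) {
        return pageSize / 4;
    }

    /** Rough character capacity for a charset using up to bytesPerChar bytes per character. */
    static int maxIndexedChars(int pageSize, int bytesPerChar) {
        return maxIndexKeyBytes(pageSize) / bytesPerChar;
    }

    public static void main(String[] args) {
        System.out.println(maxIndexKeyBytes(4096));   // 1024 key bytes for a 4k page
        System.out.println(maxIndexedChars(4096, 3)); // roughly 341 UNICODE_FSS characters
    }
}
```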
See the list of available character encodings and their mappings to Java character sets.