Basics of Locales

Revision as of 10:22, 25 February 2015



Written By : Girish Ramakrishnan, ForwardBias Technologies

Language id

A language id identifies a human language and its written form. It comprises a language tag and one or more subtags that narrow the language down to a specific dialect or variation. For example, en-IN refers to the variety of English spoken in India. "BCP47":http://www.rfc-editor.org/rfc/bcp/bcp47.txt specifies the best current practice for specifying language codes. Per the document, the two- or three-letter language code is to be picked from "ISO639-1 or ISO639-2":http://www.loc.gov/standards/iso639-2/php/English_list.php. The subtag can be a country code picked from "ISO3166-1":http://www.iso.org/iso/english_country_names_and_code_elements. The language tag and subtag have to be separated by a hyphen.
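As a rough illustration of the structure described above, the following Python sketch splits a simple language id into its language code and country subtag. The function name split_language_id is hypothetical, and the parsing is deliberately simplified: it assumes well-formed input and ignores the script, variant and extension subtags that BCP47 also allows.

```python
def split_language_id(language_id):
    """Split a simple language id such as "en-IN" into (language, region).

    Simplified on purpose: only a two-letter country subtag is recognised;
    every other kind of subtag is skipped.
    """
    subtags = language_id.split("-")
    language = subtags[0].lower()
    region = None
    for subtag in subtags[1:]:
        # A two-letter alphabetic subtag is treated as an ISO 3166-1 code.
        if len(subtag) == 2 and subtag.isalpha():
            region = subtag.upper()
            break
    return language, region


if __name__ == "__main__":
    print(split_language_id("en-IN"))
```

A production parser would follow the full BCP47 grammar rather than this length-based shortcut.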

Locale id

A locale id identifies a set of user preferences that help software represent data such as numbers, currency symbols, date and time formats, translated text and sort order. Whereas the purpose of a language id is only to specify a language or dialect, the purpose of a locale id is to also provide the cultural context.

The representation of locale ids depends on the operating system or library. As an example, following the "CLDR specification":http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers, the locale id "en-IN_GB" specifies the English language variant spoken in India (en-IN) by a person living in Great Britain. "Glibc":http://www.gnu.org/software/gettext/manual/gettext.html#Locale-Names specifies locale ids using the "languagecode_countrycode.charset@modifier" format.
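To make the Glibc format concrete, here is a small Python sketch that breaks a "languagecode_countrycode.charset@modifier" name into its parts. The function name parse_locale_name and the regular expression are illustrative assumptions, not part of any library; special names such as "C" or "POSIX" are intentionally not matched.

```python
import re

# language, then optional _COUNTRY, optional .charset, optional @modifier
_LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale_name(name):
    """Return the parts of a Glibc-style locale name, or None if it does not fit."""
    match = _LOCALE_NAME.match(name)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(parse_locale_name("en_IN.utf8"))
```

Parts that are absent from the name come back as None, so a caller can tell "no charset given" apart from an empty one.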

Locales

A locale id serves to represent the customs and notations of a certain group of people. The term locale is used to designate the customs and notations of a specific user. For example, a user can prefer the application to be translated into Swedish but prefer a 24-hour format for time and expect the thousands separator in numbers to be "," as opposed to ".".

Locale ids are thus provided for each "category" - numbers, currency symbols, collation order, etc. All these preferences put together form a user's locale.

The "C" locale

Many standard C library functions are locale-aware - strtod, scanf, etc. For example, they need the locale to determine whether "." or "," is the decimal separator. The ISO C standard defines a locale called the C locale, which provides neutral default settings for these functions. Its conventions largely match those of English locales; for example, "." is the decimal separator.

All programs use the C locale on startup. One needs to figure out the user's locale (discussed in the next section) and explicitly set it up in the application.
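The two steps above can be sketched with Python's locale module, which is a thin wrapper over the C library's setlocale and localeconv. This is only an illustration under that assumption; the helper names decimal_point_in_c_locale and adopt_user_locale are hypothetical. Passing an empty string to setlocale asks the C library to pick the locale up from the environment.

```python
import locale


def decimal_point_in_c_locale():
    """Return the decimal separator that the "C" locale prescribes."""
    previous = locale.setlocale(locale.LC_NUMERIC)  # query only, no change
    locale.setlocale(locale.LC_NUMERIC, "C")
    try:
        return locale.localeconv()["decimal_point"]
    finally:
        # Put the numeric category back the way we found it.
        locale.setlocale(locale.LC_NUMERIC, previous)


def adopt_user_locale():
    """Switch every category to the user's locale; None if it is unavailable."""
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None


if __name__ == "__main__":
    print(decimal_point_in_c_locale())
    print(adopt_user_locale())
```

The try/except is there because the environment may name a locale that is not installed on the machine, in which case the process simply stays in its current locale.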

Getting the locale information

On Linux, a program gets the locale information by reading various environment variables. How these environment variables are set up is distribution-specific. The 'locale' program prints the locale information found in the environment. In brief, the LANG variable provides the locale id for all categories in a single shot. One can, however, specify different locales for specific categories by setting "LC_xxx":http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02 variables.
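The lookup order can be sketched as follows. Besides LANG and the per-category LC_xxx variables mentioned above, POSIX also defines LC_ALL, which overrides both; that variable is an addition to the text here. The function name effective_locale_id is hypothetical, and the final fallback to "C" is an assumption that matches Glibc, since POSIX leaves the default implementation-defined.

```python
import os


def effective_locale_id(category, environ=None):
    """Return the locale id that applies to one category, e.g. "LC_TIME".

    Order of precedence: LC_ALL, then the category variable, then LANG.
    Empty values are treated as unset.
    """
    env = os.environ if environ is None else environ
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"  # assumed default when nothing is set


if __name__ == "__main__":
    for name in ("LC_NUMERIC", "LC_TIME", "LC_COLLATE", "LC_MESSAGES"):
        print(name, effective_locale_id(name))
```

Passing a dictionary instead of the real environment makes the precedence easy to experiment with.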

On Windows, GetUserDefaultLCID and GetLocaleInfo can be used to get the locale information.

On Mac OS X, CFLocaleGetIdentifier will return the locale id.

Local 8-bit encoding

In the pre-Unicode world, countries, organizations and OEMs eager to support different languages extended the ASCII encoding to suit their purposes. In such schemes, the characters 0-127 were identical to those defined by ASCII. The characters 128-255 were defined to suit each OEM's needs. This means that the interpretation of byte 250 in a file depends on the character set that was in use when the file was created. This character encoding information is referred to as the "local 8-bit encoding".
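A short Python sketch shows how the same byte reads differently under different 8-bit character sets. The three encodings chosen here (Latin-1, the original IBM PC code page 437, and Windows code page 1251) are merely examples, and the function name interpret_byte is hypothetical.

```python
def interpret_byte(value, encodings=("latin-1", "cp437", "cp1251")):
    """Decode one byte value under each of the given 8-bit encodings."""
    raw = bytes([value])
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    for name, character in interpret_byte(250).items():
        print(name, repr(character))
```

Byte 250 comes out as a different character in each of the three encodings, while any byte below 128 decodes to the same ASCII character in all of them.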

The 8-bit encoding information is also used by devices like the console (cmd.exe) and terminals (konsole). These programs interpret the data sent to them based on the local 8-bit encoding.

On Linux, file names have no concept of character set. They are just a sequence of bytes, which then gets interpreted as a string based on the local 8-bit encoding (on Windows, NTFS stores file names as UTF-16). The same is true of file contents: they are uninterpreted byte streams.

The local 8-bit encoding to be used is sometimes considered to be part of the locale information. For example, the locale id "en_IN.utf8" identifies UTF-8 as the local encoding to be used. All file contents, file names and terminal input/output are then to be interpreted as UTF-8.

When printing on the console, one should never write UTF-8 (QString::toUtf8) unconditionally. One should instead write local 8-bit data (QString::toLocal8Bit), even though it is very likely that the local 8-bit encoding is UTF-8.
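The same idea can be sketched outside Qt. The Python helpers below are an analogue of the advice above, not a description of how QString::toLocal8Bit is implemented: they ask the locale for its preferred encoding and encode the text with it before writing raw bytes to the console. The names to_local_8bit and print_local_8bit are hypothetical, and replacing unencodable characters with "?" is a choice made for this sketch.

```python
import locale
import sys


def to_local_8bit(text, encoding=None):
    """Encode text in the locale's preferred encoding (or an explicit one)."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # Characters the encoding cannot represent become "?" in this sketch.
    return text.encode(encoding, errors="replace")


def print_local_8bit(text):
    """Write text to standard output as local 8-bit bytes."""
    sys.stdout.buffer.write(to_local_8bit(text + "\n"))
    sys.stdout.buffer.flush()


if __name__ == "__main__":
    print_local_8bit("Basics of Locales")
```

On a system whose local encoding is UTF-8 the bytes written are the same as UTF-8 output, but on a Latin-1 or code page console the text still arrives in a form the terminal can display.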

Further reading