QString

Latest revision as of 11:54, 19 April 2019

Written by Girish Ramakrishnan, ForwardBias Technologies

The fundamentals of encoding are covered in Basics of String Encoding. QString stores Unicode strings. Because it stores Unicode, a QString knows which characters its contents represent. This is in contrast to a C-style string (char*), which carries no knowledge of its own encoding. A QString can be rendered on the screen or sent to a printer, provided there is a font to display the characters it holds. All user-visible strings in Qt are stored in QString.

Internally, QString stores the string in the UTF-16 encoding. Each 16-bit UTF-16 code unit is represented by a QChar. One main reason for using UTF-16 as the internal representation is that it makes it fast to pass strings to the native Unicode APIs on Mac OS X and Windows (which expect UTF-16).

For processing a C-style char-pointer or an array of bytes, QByteArray should be used instead of QString. See Using QByteArray for more details.

Using C-style strings with QString

QString string("Qt");

The above code is saved in a file with some encoding, called the input charset. The compiler generates code that puts the C-style string "Qt" in memory, possibly in a different encoding, called the exec charset. At run time, QString gets a pointer to this memory location and needs to interpret the bytes and convert them to Unicode.

To convert the C-style string to Unicode, QString needs to know the exec charset. By default, Qt assumes that this is ASCII. Internally, this conversion uses the same code path as QString::fromAscii(). QString::fromAscii(), in turn, decodes the characters as Latin-1 (ASCII is a subset of Latin-1, so the two are compatible). It is thus possible to get away with placing Latin-1 characters in C-strings.
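The reason Latin-1 decoding also handles ASCII can be sketched in a few lines. The following is only an illustration written against the C++ standard library, not Qt's implementation; the function name decodeLatin1 is hypothetical.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch (not Qt code): decode a NUL-terminated byte string as
// Latin-1 into UTF-16 code units. Every Latin-1 byte value equals its Unicode
// code point, so decoding is a plain widening of each byte. ASCII occupies
// the lower half (0x00-0x7F) of that range, which is why both decode alike.
std::u16string decodeLatin1(const char *bytes)
{
    assert(bytes != nullptr);
    std::u16string out;
    for (const char *p = bytes; *p != '\0'; ++p) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(*p)));
    }
    return out;
}
```

A byte such as 0xE9 becomes the code unit U+00E9, whereas the same byte inside a UTF-8 string would be the start of a multi-byte sequence, which is why the decoder has to know the exec charset.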

QTextCodec::setCodecForCStrings(exec-charset) can be used to change the encoding that Qt uses to decode C-style strings. Calling this function makes QString::fromAscii() decode C-style strings using the new charset (in other words, it doesn't decode ASCII anymore).

The only reason to use QTextCodec::setCodecForCStrings() is when the exec charset is not ASCII. A common case is source files that contain non-ASCII characters. Such source files are saved as UTF-8 and the exec charset of the compiler is set to UTF-8. QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8")) can then be used to make Qt interpret all the char* pointers correctly as UTF-8.

Even though QTextCodec::setCodecForCStrings() is a nice convenience, it is recommended to use only ASCII characters in source files. The C++ standard only mandates ASCII support and does not specify what encodings are to be supported by the compiler. A string may be initialized with the euro character (U+20AC) in any of the following ways:

QString euro1 = QString::fromUtf8("\u20AC"); // \u means a Unicode sequence as defined by the C++ standard; it encodes the code point in UTF-8 (when the exec charset is UTF-8)
QString euro2 = QChar(0x20AC);
static const char utf8_euro[] = "\342\202\254"; // Euro symbol as UTF-8 bytes in octal (0xE2 0x82 0xAC)
QString euro3 = QString::fromUtf8(utf8_euro, sizeof(utf8_euro) - 1); // - 1 excludes the terminating NUL

All of the above techniques work with a source file that contains only ASCII characters.
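To see where the octal bytes \342\202\254 come from, the UTF-8 encoding of a code point can be computed by hand. The sketch below uses only the C++ standard library; encodeUtf8 is a hypothetical helper and not a Qt function.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch (not Qt code): encode one Unicode code point as UTF-8.
std::string encodeUtf8(char32_t cp)
{
    // Only valid scalar values: up to U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+20AC the three-byte branch yields 0xE2 0x82 0xAC, which is \342 \202 \254 in octal.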

Unicode methods in QString

A QChar represents a 16-bit UTF-16 code unit, which is a complete Unicode code point for every character in the Basic Multilingual Plane. QString::unicode() returns the QChars of a QString. QString::utf16() returns const ushort *. Notice that the function is not named toUtf16(), because no conversion is involved: the internal representation of QString is already UTF-16.

QString::normalized() can be used for Unicode composition and decomposition.

A QChar is always 16 bits wide. A character outside the Basic Multilingual Plane is represented as a surrogate pair of two QChars. QChar::isHighSurrogate() and QChar::isLowSurrogate() tell which half of a pair a QChar is. QChar::unicode() returns the 16-bit value. QChar::cell() and QChar::row() return the lower byte and the higher byte of the QChar, respectively.
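The surrogate mechanism can be made concrete with a small sketch. It is written against the C++ standard library only; the free functions below are hypothetical stand-ins that mirror the QChar member names mentioned above, not Qt's own code.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch (not Qt code): classify 16-bit units by surrogate range.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Encode one code point as one or two UTF-16 code units.
std::u16string encodeUtf16(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: split the 20-bit offset into two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}

// Split a 16-bit unit into its higher byte (row) and lower byte (cell).
unsigned char row(char16_t u)  { return static_cast<unsigned char>(u >> 8); }
unsigned char cell(char16_t u) { return static_cast<unsigned char>(u & 0xFF); }
```

With these definitions, U+1D11E (musical G clef) becomes the pair 0xD834, 0xDD1E, while the euro sign U+20AC stays a single unit with row 0x20 and cell 0xAC.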

QString::length() returns the number of QChars. When the string contains supplementary characters (which occupy two QChars each), the length is therefore larger than the number of actual characters.
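The difference between the number of 16-bit units and the number of characters can be sketched as follows. This is an illustration using std::u16string from the C++ standard library as a stand-in for a UTF-16 string; countCodePoints is a hypothetical helper, not a Qt function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative sketch (not Qt code): count Unicode code points in a UTF-16
// string by treating each well-formed surrogate pair as one character.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextLow = i + 1 < s.size()
                             && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && nextLow)
            ++i; // skip the low half of the pair
        ++n;
    }
    assert(n <= s.size()); // never more code points than code units
    return n;
}
```

For a string holding the letter "a" followed by U+1D11E, size() reports three units while countCodePoints() reports two characters.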

QString::toUtf8(), QString::fromUtf8(), QString::toUcs4(), QString::fromUcs4() help in UTF-8 and UTF-32 conversion.

Disabling QString(char *)

Even though the automatic conversion from a C-style string to QString is convenient, it is often the source of subtle bugs when using third-party libraries. Qt provides an option to disable this automatic conversion. For example,

void gitCallback(const char *data)
{
    QString string = data; // compile error. makes the author think about encoding of 'data'
    // …
}

Compile errors from above make the programmer rethink about using QString (maybe a QByteArray is a better option) and also try to figure out the encoding of the C-style string.

By defining the macro QT_NO_CAST_FROM_ASCII, the automatic conversion from C-strings to QString using QString::fromAscii() is disabled and results in a compile error. After adding the define, the above code should be changed to

if (fruit == QString::fromUtf8("banana")) { /* … */ } // make explicit mention of encoding
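The idea behind the macro, forcing every byte string through a named decoding function, can be sketched without Qt. The Text class below is entirely hypothetical and only mimics the effect: it offers no constructor taking const char *, so the caller has to name the encoding.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical class for illustration only; it is not part of Qt.
class Text
{
public:
    Text() = default;

    // The only way in from bytes is a factory that names the encoding.
    static Text fromLatin1(const char *bytes)
    {
        assert(bytes != nullptr);
        Text t;
        for (const char *p = bytes; *p != '\0'; ++p)
            t.m_units.push_back(static_cast<char16_t>(static_cast<unsigned char>(*p)));
        return t;
    }

    const std::u16string &units() const { return m_units; }

private:
    std::u16string m_units; // UTF-16 code units
};

bool operator==(const Text &a, const Text &b) { return a.units() == b.units(); }

// "Text t = data;" with a const char *data does not compile:
static_assert(!std::is_convertible<const char *, Text>::value,
              "byte strings must be decoded explicitly");
```

A comparison then reads `fruit == Text::fromLatin1("banana")`, which has the same shape as the QString::fromUtf8() call above.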