Converting a QString to a QByteArray with either UTF-8 or Latin1 encoding

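I want to convert a QString to either a UTF-8 or a Latin1 QByteArray, but so far everything I get is UTF-8. I am testing with characters from the upper part of Latin1, above 0x7f; the German ü is a good example. If I do this:
QString name("\u00fc"); // U+00FC = ü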
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();
I get the following output:
utf8 "ü" "c3bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc" 
As you can see, I get the UTF-8 sequence 0xc3bc everywhere, whereas I expected Latin1 0xfc in steps 2 and 3. My guess is that I should get something like this:
utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc" 
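To make the expected bytes concrete, here is a minimal sketch in plain standard C++ (no Qt) that encodes the code point U+00FC both ways. The helpers encodeUtf8, encodeLatin1 and toHex are hypothetical names written for this illustration; they are not Qt functions.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: Latin1 stores code points 0..255 as one byte each;
// anything larger has no Latin1 representation and becomes '?' here.
std::string encodeLatin1(std::uint32_t cp) {
    return std::string(1, cp <= 0xFF ? static_cast<char>(cp) : '?');
}

// Hypothetical helper: render raw bytes as lowercase hex.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With these helpers, toHex(encodeUtf8(0xFC)) gives "c3bc" and toHex(encodeLatin1(0xFC)) gives "fc", which are the values expected in the listing above.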
What is going on here? Thanks. Links to some character tables: http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.html http://www.utf8-zeichentabelle.de/ This code was built and run on an Ubuntu 10.04 based system.
$> uname -a
Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
$> env | grep LANG
LANG=en_US.utf8
If I try to use
utf8.append(name.toUtf8());
I get this output:
utf8 "ü" "c383c2bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc" 
So latin1 is unicode and utf8 is double encoded... Does this depend on some system setting? If I run this (I could not get .name() to build)
qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();
then I get:
system name: "en_US" 
codecForCStrings: 0x0 
codecForLocale: "System" 
Solution: if I state explicitly that I am using UTF-8, so that the different classes know about it, then it works.
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();

QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("latin1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();
Then I get this output:
system name: "en_US" 
codecForCStrings: "UTF-8" 
codecForLocale: "UTF-8" 
utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc" 
Which looks the way it should.
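Things to know:

Execution character set

The C++ standard defines something called the execution character set, which describes how string and character literals are represented in the binary the compiler produces. You can read about it in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor manual on the http://gcc.gnu.org site.

Question: What does the string literal "\u00fc" produce?

Answer: That depends on the execution character set. With gcc (which is what you are using) it defaults to UTF-8 unless you specify something else with the -fexec-charset option. You can read about this and other options controlling the preprocessing phase in subsection 3.11 Options Controlling the Preprocessor of section 3 GCC Command Options in the GCC manual on the http://gcc.gnu.org site. Now that we know the execution character set is UTF-8, we know that "\u00fc" is translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc.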
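The effect of the execution character set can be observed without Qt by dumping the bytes the compiler stored for the literal. The following is a small sketch; kLiteral and hexOf are hypothetical names chosen for this example.

```cpp
#include <string>

// The bytes stored for this literal are chosen by the compiler according
// to its execution character set (UTF-8 by default with gcc).
static const char kLiteral[] = "\u00fc";

// Hypothetical helper: render raw bytes as lowercase hex.
std::string hexOf(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Built with gcc and its default settings, hexOf(kLiteral) should yield "c3bc". Building the same file with -fexec-charset=ISO-8859-1 would be expected to store the single byte fc instead.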
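QString::QString ( const char * str ) and QByteArray & QByteArray::append ( const QString & str ) depend on global state

The QString constructor taking a char * calls QString QString::fromAscii ( const char * str, int size = -1 ), which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) (if any codec has been set) or behaves the same as QString QString::fromLatin1 ( const char * str, int size = -1 ) (if no codec has been set).

Question: Which codec will the QString constructor use to decode the two-byte sequence (0xc3 0xbc) it receives?

Answer: By default no codec is set with QTextCodec::setCodecForCStrings(), which is why Latin1 is used to decode the byte sequence. Because 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should be familiar to you, as it comes straight from the answer to your earlier question), we get a QString containing these two characters.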
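The double encoding seen in the question (c383c2bc) follows directly from this. Below is a sketch in plain standard C++, not Qt code, that decodes the two UTF-8 bytes as if they were Latin1 and then encodes the resulting code points as UTF-8 again. decodeLatin1 and encodeUtf8Small are hypothetical helpers; the encoder only handles code points below U+0800, which is enough for anything a Latin1 decode can produce.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Latin1 decoding: every byte maps to the code point with the same value.
std::vector<std::uint32_t> decodeLatin1(const std::string &bytes) {
    std::vector<std::uint32_t> codepoints;
    for (unsigned char b : bytes) {
        codepoints.push_back(b);
    }
    return codepoints;
}

// UTF-8 encoding restricted to code points below U+0800 (one or two bytes).
// Larger code points are skipped in this simplified sketch.
std::string encodeUtf8Small(const std::vector<std::uint32_t> &codepoints) {
    std::string out;
    for (std::uint32_t cp : codepoints) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Feeding "\xc3\xbc" through decodeLatin1 gives the code points U+00C3 and U+00BC (Ã and ¼). Passing those to encodeUtf8Small gives the four bytes c3 83 c2 bc, the same sequence the question reports for utf8.append(name.toUtf8()).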
qDebug() is not 8-bit clean

You should not use the QDebug class to output anything outside ASCII. You have no guarantee of what you will get.

Test program:
#include <QtCore>

void dbg(char const * rawInput, QString s) {

    QString codepoints;
    foreach(QChar chr, s) {
        codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
    }

    qDebug() << "Input: " << rawInput
             << ", "
             << "Unicode codepoints: " << codepoints;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    qDebug() << "system name:"
             << QLocale::system().name();

    for (int i = 1; i <= 5; ++i) {

        switch(i) {

        case 1:
            qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
            break;
        case 2:
            qDebug() << "\nWith codecForCStrings set to UTF-8\n";
            QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
            break;
        case 3:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
            break;
        case 4:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
            break;
        }

        qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                           ? QTextCodec::codecForCStrings()->name()
                                           : "NOT SET");
        qDebug() << "codecForLocale:"   << (QTextCodec::codecForLocale()
                                           ? QTextCodec::codecForLocale()->name()
                                           : "NOT SET");

        // First argument: the literal as written in the source (escaped for display).
        // Second argument: the QString built from the actual literal.
        qDebug() << "\n1. Using QString::QString(char const *)";
        dbg("\\u00fc", QString("\u00fc"));
        dbg("\\xc3\\xbc", QString("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

        qDebug() << "\n2. Using QString::fromUtf8(char const *)";
        dbg("\\u00fc", QString::fromUtf8("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

        qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
        dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
    }

    return app.exec();
}
Output with mingw 4.4.0 on Windows XP:
system name: "pl_PL"

Without codecForCStrings (default is Latin1)

codecForCStrings: "NOT SET"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

With codecForCStrings set to UTF-8

codecForCStrings: "UTF-8"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
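The fffd rows are consistent with the raw ü in the source file having been stored as the single byte 0xfc, which is not a valid UTF-8 sequence on its own, so a UTF-8 decoder substitutes the replacement character U+FFFD. That reading is an inference from the output, not something the test program states. The sketch below uses a deliberately simplified decoder (hypothetical name decodeUtf8Simple, one- and two-byte sequences only) to show both cases.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified UTF-8 decoder: understands 1- and 2-byte sequences only.
// Any other byte is reported as U+FFFD, the replacement character.
std::vector<std::uint32_t> decodeUtf8Simple(const std::string &bytes) {
    std::vector<std::uint32_t> codepoints;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        const bool hasNext = i + 1 < bytes.size();
        const unsigned char b1 =
            hasNext ? static_cast<unsigned char>(bytes[i + 1]) : 0;
        if (b0 < 0x80) {
            codepoints.push_back(b0);                      // plain ASCII
        } else if (b0 >= 0xC2 && b0 <= 0xDF && hasNext && (b1 & 0xC0) == 0x80) {
            codepoints.push_back(((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu));
            ++i;                                           // consumed two bytes
        } else {
            codepoints.push_back(0xFFFD);                  // invalid here
        }
    }
    return codepoints;
}
```

Decoding "\xc3\xbc" yields the single code point fc, while decoding the lone byte "\xfc" yields fffd, matching the fromUtf8 rows. The 102 13d rows would likewise fit a locale codec such as Windows-1250, where 0xc3 and 0xbc map to U+0102 and U+013D, which is plausible for a pl_PL Windows system.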
I would like to thank thiago, cbreak, peppe and heinz from the #qt freenode.org IRC channel for showing me the issues involved here and helping me understand them.
