Strange characters on web page after HTML Tidy

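I fetch content (product descriptions, for example) through Amazon Web Services. Because the content coming from Amazon is often very badly marked up, it ends up breaking my page layout. So I wrote a function that uses HTML Tidy to "clean" the content. The strange thing is that everything seems to work when I test it separately from my application. But inside my application (which runs on CodeIgniter), the function seems to return odd characters. The code below is my test script. It outputs what I think I need. In my application, I fetch the description from my database, clean it, and then display it on my web page. For example, after cleaning,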
document’s
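(you can see this word in the sample below) becomes
documentâ€™s
(again, only in the actual application, not in the test code; the function is identical in both). Any ideas? Here is my test function: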
    $amazon_content = <<<AMAZON
JavaScript is the brains of your Web page—it enables you to modify a document’s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br><br>Today’s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br><br>Scriptin' with JavaScript and Ajax will teach you how to:<br><ul><li>Start developing with JavaScript fast!</li></ul><ul><li>Write lightweight but powerful object-oriented code </li></ul><ul><li>Modify the Document Object Model </li></ul><ul><li>“Progressively enhance” your pages with JavaScript to provide the highest levels of accessibility to all users</li></ul><ul><li>Learn sophisticated techniques for making your pages respond to user actions</li></ul><ul><li>Use the downloadable Scriptin’ library of helper functions to speed development and ensure cross-browser compatibility</li></ul><ul><li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li></ul><ul><li>Create powerful interface interactions, such as sliding panels and tree menus</li></ul><ul><li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li></ul><ul><li>Build an online application that looks and responds like a regular desktop application</li></ul><ul><li>Easily adapt the Scriptin’ code examples for use in your own projects—download them at www.scriptinwithajax.com</li></ul><br>
AMAZON;

    echo '<textarea cols="150" rows="12">' . $amazon_content . '</textarea>';
    echo '<textarea cols="150" rows="12">' . get_sanitized_amazon_content($amazon_content) . '</textarea>';
    echo  get_sanitized_amazon_content($amazon_content);

    function get_sanitized_amazon_content($amazon_content)
    {
        $tidy_config             = array(
            'bare' => TRUE,
            'clean' => TRUE,
            'drop-empty-paras' => TRUE,
            'drop-font-tags' => TRUE,
            'drop-proprietary-attributes' => TRUE,
            'enclose-text' => TRUE,
            'fix-backslash' => TRUE,
            'fix-bad-comments' => TRUE,
            'fix-uri' => TRUE,
            'hide-comments' => TRUE,
            'hide-endtags' => TRUE,
            'logical-emphasis' => TRUE,
            'lower-literals' => TRUE,
            'merge-divs' => TRUE,
            'output-xhtml' => TRUE,
            'quote-ampersand' => TRUE,
            'quote-marks' => TRUE,
            'show-body-only' => TRUE,
            'word-2000' => TRUE
        );
        $tidy                    = new tidy();
        $sanitized_amazon_markup = $tidy->repairString($amazon_content, $tidy_config);

        // Replace carriage returns, line feeds, tabs with single space
        $sanitized_amazon_markup = preg_replace('/\r|\n|\t/', ' ', $sanitized_amazon_markup);

        // Removes unnecessary tags
        // TODO: get complete list; put in an array
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'div');
        $sanitized_amazon_markup = strip_tag($sanitized_amazon_markup, 'span');

        // Replace double spaces with single space
        $sanitized_amazon_markup = preg_replace('/ {2,}/i', ' ', $sanitized_amazon_markup);

        // Remove leading and trailing space
        $sanitized_amazon_markup = trim($sanitized_amazon_markup);

        return $sanitized_amazon_markup;
    }

    function strip_tag($tagged_content, $tag_name)
    {
        return preg_replace('%<[ \t\r\n]*/?[ \t\r\n]*' . $tag_name . '.*?>%i', '', $tagged_content);
    }
Update: this is what I get in my application:
<p>JavaScript is the brains of your Web page&acirc;&euro;&quot;it enables you to modify a document&acirc;&euro;&trade;s structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin&#39; with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today&acirc;&euro;&trade;s application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin&#39; with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>&acirc;&euro;&oelig;Progressively enhance&acirc;&euro; your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin&acirc;&euro;&trade; library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin&acirc;&euro;&trade; code examples for use in your own 
projects&acirc;&euro;&quot;download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>
This is what I get outside the application:
<p>JavaScript is the brains of your Web page-it enables you to modify a document's structure, styling, and content in response to user actions without requesting new pages from the server. Scriptin' with JavaScript and Ajax teaches you how to master this powerful and elegant language so you can develop intuitive user interactions that take the user experience to new levels of sophistication and responsiveness.<br /> <br /> Today's application-like Web experiences (such as Salesforce.com and Google Maps) and Web 2.0 sites (such as Flickr.com and Twitter) are powered by JavaScript and Ajax. Using the techniques shown in this book, you will be able to start creating similar experiences in the sites you design.<br /> <br /> Scriptin' with JavaScript and Ajax will teach you how to:<br /></p> <ul> <li>Start developing with JavaScript fast!</li> </ul> <ul> <li>Write lightweight but powerful object-oriented code</li> </ul> <ul> <li>Modify the Document Object Model</li> </ul> <ul> <li>"Progressively enhance" your pages with JavaScript to provide the highest levels of accessibility to all users</li> </ul> <ul> <li>Learn sophisticated techniques for making your pages respond to user actions</li> </ul> <ul> <li>Use the downloadable Scriptin' library of helper functions to speed development and ensure cross-browser compatibility</li> </ul> <ul> <li>Use Ajax scripting techniques to update specific areas of the page with data from the server</li> </ul> <ul> <li>Create powerful interface interactions, such as sliding panels and tree menus</li> </ul> <ul> <li>Evaluate frameworks such as jQuery and Prototype to find the best one for your needs</li> </ul> <ul> <li>Build an online application that looks and responds like a regular desktop application</li> </ul> <ul> <li>Easily adapt the Scriptin' code examples for use in your own projects-download them at www.scriptinwithajax.com</li> </ul> <p><br /></p>
    
Answer:
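The "-" between "page" and "it" is not a plain minus sign (ASCII 0x2d) but a long dash (specifically U+2014, EM DASH). Encoded in UTF-8, it is a three-byte sequence: 0xe2 0x80 0x94. If you interpret that sequence as Windows-1252, you get: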
0xe2 => â => &acirc;
0x80 => € => &euro;
0x94 => (some variant of) double quote => &quot;
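The byte-by-byte mapping above can be reproduced in a few lines of PHP. This is a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative and not taken from the question. It treats the three UTF-8 bytes of the em dash as Windows-1252 and shows which entities come out.

```php
<?php
// U+2014 EM DASH written out as its three UTF-8 bytes.
$em_dash = "\xE2\x80\x94";

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8.
// Each original byte becomes its own character.
$misread = mb_convert_encoding($em_dash, 'UTF-8', 'Windows-1252');

echo bin2hex($em_dash), "\n";                            // e28094
echo mb_strlen($misread, 'UTF-8'), "\n";                 // 3
echo htmlentities($misread, ENT_QUOTES, 'UTF-8'), "\n";  // &acirc;&euro;&rdquo;
```

Here the third character comes out as &rdquo; (the Windows-1252 character at 0x94 is U+201D, a right double quotation mark). The &quot; in the question's output is likely the result of Tidy's 'bare' and 'quote-marks' options downgrading that curly quote to a straight one before escaping it.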
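So you have an encoding problem. You are getting UTF-8 as input, but it is being interpreted as Windows-1252. Tidy is converting the non-ASCII parts to HTML entities, as it should. As for why this happens inside your application and not outside it, there are a few possibilities. One is that you do not have the same locale/encoding configuration inside and outside the application. Another is that when you test outside the application, you are not getting exactly the same data as the data that comes from the web, i.e. the data you get is in a different encoding (it may have been converted along the way).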
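One way to narrow down which of these possibilities applies is to compare the raw bytes in both environments, and to tell Tidy the input encoding explicitly. The following is a sketch under stated assumptions: describe_encoding is a hypothetical helper (not part of the question's code), mbstring is assumed to be installed, and the Tidy call only runs if the tidy extension is loaded. tidy::repairString accepts an encoding name as its third argument; whether passing 'utf8' resolves the problem in the CodeIgniter application is not confirmed in this thread.

```php
<?php
// Hypothetical helper: summarize the bytes of a string so the same call
// can be run inside and outside the application and the results compared.
function describe_encoding(string $text): array
{
    return [
        // True if the bytes form a well-formed UTF-8 sequence.
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'),
        // True if no byte has the high bit set (pure 7-bit ASCII).
        'ascii_only' => !preg_match('/[\x80-\xFF]/', $text),
        // Hex dump of the first 16 bytes for a direct comparison.
        'hex_sample' => bin2hex(substr($text, 0, 16)),
    ];
}

// "page" + EM DASH (as UTF-8 bytes) + "it", as in the Amazon description.
$sample = "page\xE2\x80\x94it";
print_r(describe_encoding($sample));

// If the input really is UTF-8, say so when calling Tidy.
if (class_exists('tidy')) {
    $tidy = new tidy();
    echo $tidy->repairString($sample, ['show-body-only' => true], 'utf8');
}
```

If the hex sample differs between the test script and the application (for example, because the database connection converts the text), the difference is in the data; if it is identical, the difference is more likely in how the bytes are interpreted.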
