Error resolving Wikipedia URLs with Unicode characters using Java URL

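I'm having trouble fetching Wikipedia URLs that contain Unicode. Given a page title such as 1992\u201393_UE_Lleida_seasonnow, the plain URL is http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow and, after URLEncoder (set to UTF-8), it becomes http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow. When I try to resolve either URL, I get nothing. If I paste the URLs into a browser I also get nothing; I only get the page when I paste in the actual Unicode character. Does Wikipedia have some weird way of encoding Unicode in URLs, or am I just doing something dumb? Here is the code I'm using: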

URL url = new URL("http://en.wikipedia.org/wiki/" + x);
System.out.println("trying " + url);

// Attempt to open the wiki page
InputStream is;
try {
    is = url.openStream();
} catch (Exception e) {
    return null;
}


4 answers

琶竞捆栓

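The correct URI is http://en.wikipedia.org/wiki/1992%E2%80%9393_UE_Lleida_seasonnow. Many browsers display the literal character instead of the percent-encoded escape sequence because that is considered more user-friendly. A correctly encoded URI, however, must percent-encode any character that is not allowed in the path component: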

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters
   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>
   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   pct-encoded   = "%" HEXDIG HEXDIG
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

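The URI class can help you with such sequences. From its documentation: "Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set."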

String literal = "http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow";
URI uri = new URI(literal);
System.out.println(uri.toASCIIString());

You can read more about URI encoding here.
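To make the grammar quoted above concrete, here is a minimal sketch of percent-encoding a single path segment by hand: every UTF-8 octet that is not an unreserved character, a sub-delim, ":" or "@" is written as "%" plus two hex digits. The class and method names (PathSegmentEncoder, encodeSegment) are made up for illustration and are not part of any library; the sketch assumes the input is raw text, so an existing "%" is itself encoded as %25.

```java
import java.nio.charset.StandardCharsets;

public class PathSegmentEncoder {

    // Punctuation that the pchar rule allows unescaped:
    // unreserved ("-", ".", "_", "~"), sub-delims, ":" and "@".
    private static final String ALLOWED_PUNCTUATION = "-._~!$&'()*+,;=:@";

    // Hypothetical helper: percent-encodes one path segment as UTF-8 octets.
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            boolean alpha = (octet >= 'a' && octet <= 'z') || (octet >= 'A' && octet <= 'Z');
            boolean digit = octet >= '0' && octet <= '9';
            if (alpha || digit || ALLOWED_PUNCTUATION.indexOf(octet) >= 0) {
                out.append((char) octet);                   // allowed as-is
            } else {
                out.append(String.format("%%%02X", octet)); // pct-encoded
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2013 is the en dash; its UTF-8 octets are E2 80 93.
        System.out.println(encodeSegment("2009\u201310_UE_Lleida_season"));
    }
}
```

In practice the URI class shown above does this work for you; the hand-written loop is only meant to show which octets end up escaped.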

雄鞋谋塘

"Does Wikipedia have some weird way of encoding Unicode in URLs?" It's not weird; it is standard use of IRIs. The IRI:

http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season

which contains a Unicode en dash, is equivalent to the URI:

http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
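The equivalence of the two forms can be checked in both directions with java.net.URI. The sketch below is only an illustration (the class name IriUriEquivalence and its two helper methods are hypothetical): toASCIIString() turns the IRI into the percent-encoded URI, and getPath() on the encoded URI decodes the escaped octets back into the en dash.

```java
import java.net.URI;

public class IriUriEquivalence {

    // IRI -> ASCII-only URI: non-ASCII characters become UTF-8 percent-escapes.
    static String toAscii(String iri) {
        return URI.create(iri).toASCIIString();
    }

    // ASCII-only URI -> decoded path: getPath() decodes escaped octets,
    // whereas getRawPath() would return them unchanged.
    static String decodedPath(String asciiUri) {
        return URI.create(asciiUri).getPath();
    }

    public static void main(String[] args) {
        String iri = "http://en.wikipedia.org/wiki/2009\u201310_UE_Lleida_season";
        String uri = "http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season";

        System.out.println(toAscii(iri));
        System.out.println(toAscii(iri).equals(uri));
        System.out.println(decodedPath(uri).equals("/wiki/2009\u201310_UE_Lleida_season"));
    }
}
```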

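You can put the IRI form in a link and it will work in modern browsers. Many network libraries, however, including Java's, as well as older browsers, require ASCII-only URIs. (Modern browsers will still show the pretty IRI version in the address bar even if you linked to the encoded URI version.) In general, to convert an IRI to a URI, you apply the IDN algorithm to the hostname and URL-encode every other non-ASCII character as UTF-8 bytes. In your case that would be: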

String urlencoded = URLEncoder.encode(x, "utf-8").replace("+", "%20");
URL url = new URL("http://en.wikipedia.org/wiki/" + urlencoded);

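Note: replacing + with %20 is needed for values of x that contain spaces. URLEncoder performs application/x-www-form-urlencoded encoding, as used in query strings. In a path segment like this one, the +-means-space rule does not apply; spaces in a path must be encoded with normal URL encoding, as %20. Then again, in the specific case of Wikipedia, spaces are replaced with underscores for readability, so it is probably better to replace "+" with "_" directly. The %20 version still works, because Wikipedia redirects from it to the underscore version.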
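Putting the recipe and the note together, a minimal sketch might look like the following. It assumes the title is raw, unencoded text; the class and method names (WikipediaUrls, asciiHost, encodeTitle, pageUrl) are hypothetical, and the host bücher.example is only a placeholder to show the IDN step. Spaces are turned into underscores before encoding, so URLEncoder never produces a "+" for them.

```java
import java.io.UnsupportedEncodingException;
import java.net.IDN;
import java.net.URLEncoder;

public class WikipediaUrls {

    // IDN step: converts a Unicode hostname to its ASCII (Punycode) form.
    // ASCII-only hostnames come back unchanged.
    static String asciiHost(String host) {
        return IDN.toASCII(host);
    }

    // Path step: Wikipedia-style underscores for spaces, then UTF-8 percent-encoding.
    // URLEncoder escapes more characters than a path strictly requires
    // (for example ":" becomes %3A), which is still a valid encoding.
    static String encodeTitle(String title) {
        String withUnderscores = title.trim().replace(' ', '_');
        try {
            return URLEncoder.encode(withUnderscores, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    static String pageUrl(String title) {
        return "http://" + asciiHost("en.wikipedia.org") + "/wiki/" + encodeTitle(title);
    }

    public static void main(String[] args) {
        System.out.println(pageUrl("2009\u201310 UE Lleida season"));
        System.out.println(asciiHost("b\u00fccher.example")); // placeholder host
    }
}
```

Because the title is encoded as a whole, a "/" inside it would also be escaped; whether that is what you want depends on the pages you are fetching.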

才脊烽馈低

Here is a simple algorithm for encoding URLs that contain Unicode, so that you can retrieve them with HttpURLConnection:

import static org.junit.Assert.*;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

import org.apache.commons.lang.CharUtils;
import org.junit.Test;

public class InternationalURLEncoderTest {

    static String encodeUrl(String urlToEncode) throws UnsupportedEncodingException {
        // Split on "/" while keeping each "/" as its own array element
        String[] pathSegments = urlToEncode.split("((?<=/)|(?=/))");
        StringBuilder encodedUrlBuilder = new StringBuilder();
        for (String pathSegment : pathSegments) {
            // Only segments that contain a non-ASCII character are encoded
            boolean needsEncoding = false;
            for (char ch : pathSegment.toCharArray()) {
                if (!CharUtils.isAscii(ch)) {
                    needsEncoding = true;
                    break;
                }
            }
            // Name UTF-8 explicitly; the one-argument encode() is deprecated
            // and uses the platform default encoding
            String encodedSegment = needsEncoding
                    ? URLEncoder.encode(pathSegment, "UTF-8") : pathSegment;
            encodedUrlBuilder.append(encodedSegment);
        }
        return encodedUrlBuilder.toString();
    }

    @Test
    public void test() throws Exception {
        assertEquals(
                "http://www.chinatimes.com/realtimenews/%E5%8D%97%E6%8A%95%E4%B8%80%E8%90%AC%E5%A4%9A%E6%88%B6%E5%A4%A7%E5%81%9C%E9%9B%BB-%E4%B9%9D%E6%88%90%E4%BB%A5%E4%B8%8A%E6%81%A2%E5%BE%A9%E4%BE%9B%E9%9B%BB-20130603003259-260401",
                encodeUrl("http://www.chinatimes.com/realtimenews/南投一萬多戶大停電-九成以上恢復供電-20130603003259-260401"));
        assertEquals("http://www.ttv.com.tw/",
                encodeUrl("http://www.ttv.com.tw/"));
        assertEquals("http://www.ttv.com.tw",
                encodeUrl("http://www.ttv.com.tw"));
        assertEquals("http://www.rt-drive.com.tw/shopping/?st=16",
                encodeUrl("http://www.rt-drive.com.tw/shopping/?st=16"));
    }

}

The algorithm draws on other answers for splitting the string and for detecting Unicode characters.

氮顺

Here is a simpler way to do the URL encoding from Chi's answer:

static String encodeUrl(String urlToEncode) throws URISyntaxException {
    return new URI(urlToEncode).toASCIIString();
}

See this answer for clarification.
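One caveat with the single-argument constructor is that it rejects characters that are illegal in a URI, such as a space, with a URISyntaxException. If the page title may contain spaces, a possible variation is the multi-argument URI constructor, which quotes illegal characters in the path before toASCIIString() encodes the non-ASCII ones. This is only a sketch: the class and method names (TitleToUri, wikiUri) are hypothetical, and it assumes the title is raw text, because this constructor always escapes "%", so an already-encoded title would be encoded twice.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class TitleToUri {

    // Hypothetical helper: builds an ASCII-only URI for a raw page title.
    static String wikiUri(String title) {
        try {
            // The (scheme, host, path, fragment) constructor quotes characters
            // that are illegal in a path, so a space becomes %20.
            URI uri = new URI("http", "en.wikipedia.org", "/wiki/" + title, null);
            // toASCIIString() then percent-encodes the remaining non-ASCII characters.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URI for title: " + title, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wikiUri("2009\u201310 UE Lleida season"));
    }
}
```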

Tags: java, url, unicode, utf_8, wikipedia
