Creating a Guava Splitter that handles quoted strings

I would like to create a Guava Splitter for Java that treats a Java string as a single block. For instance, I would like the following assertion to be true:

@Test
public void testSplitter() {
  String toSplit = "a,b,\"c,d\\\"\",e";
  List<String> expected = ImmutableList.of("a", "b", "c,d\"", "e");

  Splitter splitter = Splitter.onPattern(...);
  List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));

  assertEquals(expected, actual);
}

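I can write a regex that finds all the elements and ignores the ',' characters, but I can't find a regex that would act as the separator for a Splitter. If that's not possible, please just say so, and I'll build the list from the findAll regex instead.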
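For reference, the "build the list from a findAll regex" fallback could look roughly like the sketch below. It uses only java.util.regex. The class name QuotedTokenFinder, the method findTokens, and the token pattern are illustrative assumptions, not part of the question. It assumes each token is either fully double-quoted (with a backslash escaping the next character) or contains no quotes at all, and it skips empty fields such as the one in a,,b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenFinder {
    // Group 1: body of a double-quoted string, where a backslash escapes the next char.
    // Group 2: a run of characters that are not commas (an unquoted token).
    static final Pattern TOKEN =
        Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|([^,]+)");

    static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted token: the quotes are already excluded; turn \x into x.
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2).trim());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same input as the test in the question.
        System.out.println(findTokens("a,b,\"c,d\\\"\",e"));
    }
}
```

Unlike a separator-based split, this approach also strips the surrounding quotes and unescapes the content, which is what the expected list in the test asks for.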


5 replies

喷乡顾沥沪

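It seems like you should be using a CSV library such as opencsv for this. Separating values and handling cases like quoted blocks is exactly what such libraries are for.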

疏腔傻小雹

This is a Guava feature request: http://code.google.com/p/guava-libraries/issues/detail?id=412

习让休堂溯

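I had the same problem (except that I don't need to support escaped quote characters). I didn't want to add another library for something this simple. Then it occurred to me that what I needed was a mutable CharMatcher. As with Bart Kiers's solution, it keeps the quote characters.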

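public static Splitter quotableComma() {
    // "on" is presumably a static import of Splitter.on.
    // The matcher keeps state between calls, so this Splitter is not thread-safe.
    return on(new CharMatcher() {
        private boolean inQuotes = false;

        @Override
        public boolean matches(char c) {
            if ('"' == c) {
                inQuotes = !inQuotes; // every quote toggles the quoted state
            }
            if (inQuotes) {
                return false; // never split inside quotes
            }
            return (',' == c);
        }
    });
}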

@Test
public void testQuotableComma() throws Exception {
    String toSplit = "a,b,\"c,d\",e";
    List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
    Splitter splitter = Splitters.quotableComma();
    List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
    assertEquals(expected, actual);
}

扇献隙

You can split on the following pattern:

\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)

which might look (a bit) friendlier with the (?x) flag:

(?x)            # enable comments, ignore space-literals
\s*,\s*         # match a comma optionally surrounded by space-chars
(?=             # start positive look ahead
  (             #   start group 1
    (           #     start group 2
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 2, and repeat it zero or more times
    "           #     match a quote
    (           #     start group 3
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 3, and repeat it zero or more times
    "           #     match a quote
  )*            #   end group 1, and repeat it zero or more times
  (             #   open group 4
    \\["\\]     #     match an escaped quote or backslash
    |           #     OR
    [^"\\]      #     match any char other than a quote or backslash
  )*            #   end group 4, and repeat it zero or more times
  $             #   match the end-of-input
)               # end positive look ahead

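But even in this commented version, it's still a monster. In plain English, the regex can be explained as follows: match a comma that is optionally surrounded by space characters, but only if, looking ahead from that comma all the way to the end of the string, there are zero or an even number of quotes, ignoring escaped quotes and escaped backslashes. After seeing this, you might agree with ColinD (I do!) that some sort of CSV parser is the way to go in this case. Note that the regex above leaves the quotes around the tokens, i.e., the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows: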

a
b
"c,d\""
e
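To try the pattern outside Guava, here is a minimal sketch using only java.util.regex. The class name QuoteAwareRegexSplit and the method split are made up for illustration; the pattern is the one above, written as a Java string literal (each backslash doubled, each quote escaped).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QuoteAwareRegexSplit {
    // The separator pattern from this answer as a Java string literal.
    static final Pattern SEPARATOR = Pattern.compile(
        "\\s*,\\s*(?=((\\\\[\"\\\\]|[^\"\\\\])*\"(\\\\[\"\\\\]|[^\"\\\\])*\")*(\\\\[\"\\\\]|[^\"\\\\])*$)");

    static List<String> split(String input) {
        // Pattern.split drops trailing empty strings, like String.split.
        return Arrays.asList(SEPARATOR.split(input));
    }

    public static void main(String[] args) {
        // Prints the four tokens listed above, quotes and escape included.
        for (String token : split("a,b,\"c,d\\\"\",e")) {
            System.out.println(token);
        }
    }
}
```

The same pattern string should also be usable with Splitter.onPattern, as in the question's test, although the quoted token would still need its quotes stripped and its escapes resolved afterwards.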

了驳

A slight improvement on @Rage-Steel's answer.

final static CharMatcher notQuoted = new CharMatcher() {
    private boolean inQuotes = false;

    @Override
    public boolean matches(char c) {
        if ('"' == c) {
            inQuotes = !inQuotes; // every quote toggles the quoted state
        }
        return !inQuotes; // true only outside quotes
    }
};

final static Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|"))).trimResults().omitEmptyStrings();

Then,

public static void main(String[] args) {
    final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";

    List<String> sputnik = SPLITTER.splitToList(toSplit);
    for (String s : sputnik)
        System.out.println(s);
}

Mind the thread safety (or, to put it simply, there isn't any).
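If the shared mutable matcher is a concern, one way around it is to keep the quote state in a local variable. The sketch below is plain Java rather than Guava, and the names QuoteAwareSplit and split are hypothetical. It mimics the splitter above (split on any of the given separator characters outside quotes, trim results, omit empty strings, keep the quote characters), but since all state lives inside the method call it can be used from several threads at once.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    static List<String> split(String input, String separators) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // local state: nothing is shared between calls
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // every quote toggles the quoted state
            }
            if (!inQuotes && separators.indexOf(c) >= 0) {
                addIfNotEmpty(tokens, current); // separator outside quotes ends the token
                current.setLength(0);
            } else {
                current.append(c); // quotes are kept as part of the token
            }
        }
        addIfNotEmpty(tokens, current);
        return tokens;
    }

    // Trims the token and skips it when nothing is left.
    private static void addIfNotEmpty(List<String> tokens, StringBuilder current) {
        String token = current.toString().trim();
        if (!token.isEmpty()) {
            tokens.add(token);
        }
    }

    public static void main(String[] args) {
        String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";
        for (String s : split(toSplit, " ,;|")) {
            System.out.println(s);
        }
    }
}
```

Like the matcher-based version, this does not handle escaped quotes, and an unbalanced quote swallows the rest of the input into one token. It does not, however, carry a stale inQuotes flag over to the next string.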
