Parsing an InputStream for multiple patterns

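I am parsing an InputStream for certain patterns in order to extract values from it. For example, I would have something like

<span class="filename"><a href="http://example.com/foo">foo</a>

I don't want to use a full-fledged HTML parser, because I am not interested in the document structure, only in a few well-defined bits of information (only their order matters). Currently I use a very simple approach: for each pattern I have an object that holds a char[] for the opening and the closing 'tag' (in the example, the opening would be […] and the closing would be " to get the URL) and a position marker. For each character read from the InputStream, I iterate over all patterns and call their match(char) function, which returns true once the opening pattern has matched. From then on I collect the following characters in a StringBuilder until match(char) on the now-active pattern returns true again. I then call a function with the pattern's ID and the string that was read, to process it further. This works fine in most cases, but I would like to include regular expressions in the patterns, so that I could also match something like

<span class="filename" id="234217"><a href="http://example.com/foo">foo</a>

At this point I was sure I would be reinventing the wheel, since this has most certainly been done before, and I really don't want to write my own regex parser. However, I could not find anything that does what I need. Unfortunately, the […] class matches only one pattern, not a list of patterns. What alternatives could I use? It should be lightweight and work on Android.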
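For reference, the character-by-character approach described in the question can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the class name `DelimiterPattern`, the `extract` helper, the `"id:value"` result format and the delimiters used in `main` are all assumptions made for the example. It reads from a `Reader`; an InputStream could be wrapped in an `InputStreamReader`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class StreamPatternDemo {

    /** Hypothetical holder for one opening/closing delimiter pair plus a position marker. */
    static final class DelimiterPattern {
        final int id;
        final char[] open;
        final char[] close;
        private char[] target; // delimiter currently being matched
        private int pos;       // number of characters of target matched so far

        DelimiterPattern(int id, String open, String close) {
            this.id = id;
            this.open = open.toCharArray();   // assumed non-empty
            this.close = close.toCharArray(); // assumed non-empty
            this.target = this.open;
        }

        /** Consumes one character; returns true when the current delimiter has just completed. */
        boolean match(char c) {
            if (c == target[pos]) {
                pos++;
            } else {
                // Naive restart: only safe for delimiters that do not overlap themselves.
                pos = (c == target[0]) ? 1 : 0;
            }
            if (pos == target.length) {
                pos = 0;
                target = (target == open) ? close : open; // wait for the other delimiter next
                return true;
            }
            return false;
        }

        /** Forgets any partial match and waits for the opening delimiter again. */
        void reset() {
            pos = 0;
            target = open;
        }
    }

    /** Returns "id:value" for every value found between an opening and a closing delimiter. */
    static List<String> extract(Reader in, List<DelimiterPattern> patterns) {
        List<String> found = new ArrayList<>();
        DelimiterPattern active = null;
        StringBuilder sb = new StringBuilder();
        try {
            int r;
            while ((r = in.read()) != -1) {
                char c = (char) r;
                if (active == null) {
                    for (DelimiterPattern p : patterns) {
                        if (p.match(c)) {
                            active = p;
                            break;
                        }
                    }
                    if (active != null) {
                        // One pattern opened: drop the partial matches of all the others.
                        for (DelimiterPattern p : patterns) {
                            if (p != active) p.reset();
                        }
                        sb.setLength(0);
                    }
                } else {
                    sb.append(c);
                    if (active.match(c)) {
                        sb.setLength(sb.length() - active.close.length); // strip closing delimiter
                        found.add(active.id + ":" + sb);
                        active = null;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a>";
        DelimiterPattern href =
            new DelimiterPattern(1, "<span class=\"filename\"><a href=\"", "\"");
        System.out.println(extract(new StringReader(html), Collections.singletonList(href)));
    }
}
```

Because the delimiters are fixed character arrays, this sketch cannot cope with the extra id attribute in the second snippet, which is exactly the limitation the question is about.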

3 replies

缝皋

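Do you mean that you want to match any <span> element with a given class attribute, regardless of any other attributes it may have? That is easy enough: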

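import java.io.File;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Scan the file for each "filename" span and print the href of the link that follows it.
Scanner sc = new Scanner(new File("test.txt"), "UTF-8");
Pattern p = Pattern.compile(
    "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
);
// A horizon of 0 searches the rest of the input without a bound.
while (sc.findWithinHorizon(p, 0) != null)
{
  MatchResult m = sc.match();
  System.out.println(m.group(1)); // group 1 is the href value
}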

The file "test.txt" contains the text of your question, and the output is:

http://example.com/foo
and closing
http://example.com/foo
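To check the same pattern against both snippets from the question, here is a minimal self-contained sketch that reads from an InputStream instead of a file. The class name `SpanHrefScan`, the helper `hrefs` and the second URL (`/bar`, changed so the two results can be told apart) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class SpanHrefScan {
    // Same expression as above: any <span> with class="filename", then the href of the next <a>.
    private static final Pattern FILENAME_LINK = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\"");

    /** Collects the href of every link that directly follows a "filename" span. */
    static List<String> hrefs(InputStream in) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            // Horizon 0 means no bound, so the scanner may buffer input while searching.
            while (sc.findWithinHorizon(FILENAME_LINK, 0) != null) {
                result.add(sc.match().group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
            "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a>\n"
          + "<span class=\"filename\" id=\"234217\"><a href=\"http://example.com/bar\">bar</a>\n";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(hrefs(in));
    }
}
```

Both spans are matched, with or without the extra id attribute, because `[^>]*` absorbs whatever else appears inside the tag.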

犀寺扦

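You are looking for the Scanner.useDelimiter(Pattern) API. You will have to use a pattern string in which the individual patterns are separated by OR (|). Such a pattern can get very complicated very quickly, though.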
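As a rough illustration of OR-ing several patterns into one expression, the sketch below gives each alternative its own capturing group and checks which group is non-null to tell the patterns apart. Note that it swaps in Scanner.findWithinHorizon rather than useDelimiter, since that makes the matched text directly available. The class name `AlternationDemo`, the second pattern (a span with class "size") and the sample input are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Two alternatives joined with |; each one has its own capturing group.
    // Group 1: href of the link inside a "filename" span.
    // Group 2: text of a "size" span (a made-up second pattern for illustration).
    private static final Pattern COMBINED = Pattern.compile(
        "<span[^>]*class=\"filename\"[^>]*>\\s*<a[^>]*href=\"([^\"]+)\""
        + "|<span[^>]*class=\"size\"[^>]*>([^<]*)</span>");

    /** Returns the matches in input order, each prefixed with the name of its pattern. */
    static List<String> scan(Scanner sc) {
        List<String> hits = new ArrayList<>();
        while (sc.findWithinHorizon(COMBINED, 0) != null) {
            MatchResult m = sc.match();
            // Only the group belonging to the alternative that matched is non-null.
            if (m.group(1) != null) {
                hits.add("filename:" + m.group(1));
            } else {
                hits.add("size:" + m.group(2));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<span class=\"filename\"><a href=\"http://example.com/foo\">foo</a></span>"
                    + "<span class=\"size\">12 KB</span>";
        System.out.println(scan(new Scanner(html)));
    }
}
```

With only two alternatives the expression is still readable; every further pattern adds another group to keep track of, which is how such a combined pattern becomes hard to maintain.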

弦砂牧扁

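You are right to think that this has been done before :) What you are describing is a tokenizing and parsing problem, so I suggest you consider JavaCC. JavaCC has something of a learning curve while you get to grips with its grammar syntax, so below is an implementation to get you started. The grammar is a cut-down version of the standard JavaCC grammar for HTML. You can add more productions to match other patterns.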

options {
  JDK_VERSION = "1.5";
  static = false;
}

PARSER_BEGIN(eg1)
import java.util.*;
public class eg1 {
  private String currentTag;
  private String currentSpanClass;
  private String currentHref;

  public static void main(String args []) throws ParseException {
    System.out.println("Starting parse");
    eg1 parser = new eg1(System.in);
    parser.parse();
    System.out.println("Finishing parse");
  }
}

PARSER_END(eg1)

// Between tags: skip whitespace and <! ... > declarations.
SKIP :
{
    <       ( " " | "\t" | "\n" | "\r" )+   >
|   <       "<!" ( ~[">"] )* ">"            >
}

// A tag opener switches to the TAG state; anything else up to the next "<" is text.
TOKEN :
{
    <STAGO:     "<"                 >   : TAG
|   <ETAGO:     "</"                >   : TAG
|   <PCDATA:    ( ~["<"] )+         >
}

// Tag names: the two we care about, plus a catch-all for every other alphanumeric name.
<TAG> TOKEN [IGNORE_CASE] :
{
    <A:      "a"              >   : ATTLIST
|   <SPAN:   "span"           >   : ATTLIST
|   <DONT_CARE: (["a"-"z"] | ["0"-"9"])+  >   : ATTLIST
}

<ATTLIST> SKIP :
{
    <       " " | "\t" | "\n" | "\r"    >
|   <       "--"                        >   : ATTCOMM
}

// Inside a tag: attribute names, "=" before a value, and ">" to return to DEFAULT.
<ATTLIST> TOKEN :
{
    <TAGC:      ">"             >   : DEFAULT
|   <A_EQ:      "="             >   : ATTRVAL

|   <#ALPHA:    ["a"-"z","A"-"Z","_","-","."]   >
|   <#NUM:      ["0"-"9"]                       >
|   <#ALPHANUM: <ALPHA> | <NUM>                 >
|   <A_NAME:    <ALPHA> ( <ALPHANUM> )*         >

}

// Attribute values: single-quoted, double-quoted, or unquoted.
<ATTRVAL> TOKEN :
{
    <CDATA:     "'"  ( ~["'"] )* "'"
        |       "\"" ( ~["\""] )* "\""
        | ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
                            >   : ATTLIST
}

// Skip everything between a pair of "--" markers inside a tag.
<ATTCOMM> SKIP :
{
    <       ( ~["-"] )+         >
|   <       "-" ( ~["-"] )+         >
|   <       "--"                >   : ATTLIST
}



void attribute(Map<String,String> attrs) :
{
    Token n, v = null;
}
{
    n=<A_NAME> [ <A_EQ> v=<CDATA> ]
    {
        String attval;
        if (v == null) {
            attval = "#DEFAULT";
        } else {
            attval = v.image;
            if( attval.startsWith("\"") && attval.endsWith("\"") ) {
              attval = attval.substring(1,attval.length()-1);
            } else if( attval.startsWith("'") && attval.endsWith("'") ) {
              attval = attval.substring(1,attval.length()-1);
            }
        }
        if( attrs!=null ) attrs.put(n.image.toLowerCase(),attval);
    }
}

void attList(Map<String,String> attrs) : {}
{
    ( attribute(attrs) )+
}


void tagAStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <A> [ attList(attrs) ] <TAGC>
    {
      currentHref=attrs.get("href");
      if( currentHref != null && "filename".equals(currentSpanClass) )
      {
        System.out.println("Found URL: "+currentHref);
      }
    }
}

void tagAEnd() : {}
{
    <ETAGO> <A> <TAGC>
    {
      currentHref=null;
    }
}

void tagSpanStart() : {
  Map<String,String> attrs = new HashMap<String,String>();
}
{
    <STAGO> <SPAN> [ attList(attrs) ] <TAGC>
    {
      currentSpanClass=attrs.get("class");
    }
}

void tagSpanEnd() : {}
{
    <ETAGO> <SPAN> <TAGC>
    {
      currentSpanClass=null;
    }
}

void tagDontCareStart() : {}
{
   <STAGO> <DONT_CARE> [ attList(null) ] <TAGC>
}

void tagDontCareEnd() : {}
{
   <ETAGO> <DONT_CARE> <TAGC>
}

void parse() : {}
{
    (
      LOOKAHEAD(2) tagAStart() |
      LOOKAHEAD(2) tagAEnd() |
      LOOKAHEAD(2) tagSpanStart() |
      LOOKAHEAD(2) tagSpanEnd() |
      LOOKAHEAD(2) tagDontCareStart() |
      LOOKAHEAD(2) tagDontCareEnd() |
      <PCDATA>
    )*
}
