What is the most efficient way to search a block of text against an array of regular expressions?

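I'm looking for the most efficient way to search a block of text (±1/2 KB) against a large number of regular expressions stored in an array. Sample code: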

patterns = [/patternA/i, /patternB/i, /patternC/m, ..., /patternN/i]
content  = "Lorem ipsum dolor sit amet, consectetur... officiam id est laborum."
r = patterns.collect{ |pattern|
  pattern unless ( content =~ pattern ).blank?
}.compact

Here, r now contains the patterns that matched the content string.

4 Answers

盛虱

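If you are only interested in whether any of the patterns match the text, consider combining all the patterns into one large regex with the regex 'or' operator (|) and compiling that large regex once. For example, if your patterns are A, B and C, create a single regex of the form A|B|C. Sorry, I don't know Ruby, but hopefully you can translate this into code :) Side note: this is how Mercurial handled its .hgignore file the last time I worked on it. In that case, matching 1000 file names against one big regex was more efficient than matching them against hundreds of smaller regexes one by one.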
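In Ruby, the combination described above can be sketched with Regexp.union, which joins patterns with | and keeps each pattern's own flags. The sample patterns and text here are made up for illustration, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Hypothetical sample data, not taken from the question.
patterns = [/lorem/i, /dolor\s+sit/, /^consectetur/]
content  = "Lorem ipsum dolor sit amet,\nconsectetur adipisicing elit."

# Regexp.union builds one alternation; per-pattern flags such as /i
# are preserved inside the combined regex.
combined = Regexp.union(patterns)

# A single match call answers "does any pattern match?".
any_match = combined.match?(content)
no_match  = combined.match?("nothing relevant here")
```

This only tells you that something matched, not which pattern did.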

诫商

Solution 1: Do this:

r = patterns.select{|pattern| content =~ pattern}

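Since the string is large, it may be better to implement this as a method on String rather than somewhere else, because passing a large argument seems to be slow.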

class String
  def filter_patterns patterns
    patterns.select{|r| self =~ r}
  end
end

and use it like this:

content.filter_patterns(patterns)
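A variant of Solution 1 can be sketched with Regexp#match?, which returns a plain boolean and does not populate the match globals. It assumes Ruby 2.4 or later; the sample patterns and text are hypothetical.

```ruby
# Hypothetical sample data.
patterns = [/lorem/i, /xyz/, /amet/]
content  = "Lorem ipsum dolor sit amet"

# match? returns true or false without setting $~ and friends.
r = patterns.select { |pattern| pattern.match?(content) }
```

Whether this is measurably faster for 1 to 2 KB inputs would need benchmarking on your own data.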

Solution 2: This has the limitation that none of the regexes may contain named/numbered captures.

combined_regex = Regexp.new(patterns.map{|r| "(?=[.\n]*(#{r.source}))?"}.join)
content =~ combined_regex

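If the regexes inside patterns contain named/numbered captures, the following part will break. If there were a way to know how many potential captures each regex has, that would solve the problem.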

# Capture 0 is the whole match, so pattern i corresponds to capture i + 1.
r = patterns.select.with_index{|pattern, i| Regexp.last_match[i + 1]}
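One possible way to learn how many captures a regex has is sketched below: append an empty alternative so the regex always matches the empty string, then count the entries in the resulting MatchData. The helper name capture_count and the sample patterns are hypothetical. The index arithmetic assumes each pattern is wrapped in one extra group, that no pattern uses named groups (Ruby stops capturing unnamed groups once a named group is present), and that no pattern relies on numbered backreferences, which would shift.

```ruby
# Count capture groups by matching an always-succeeding variant of the
# regex against the empty string and inspecting the MatchData.
def capture_count(regex)
  Regexp.new("(?:#{regex})|").match("").captures.size
end

# Hypothetical sample patterns with 2, 0 and 1 capture groups.
patterns = [/(a)(b)/, /c/i, /(d)|e/]

# If every pattern is wrapped in one extra group inside a combined regex,
# pattern i's wrapper group lands at wrapper_index[i].
next_index = 1
wrapper_index = patterns.map do |pattern|
  current = next_index
  next_index += 1 + capture_count(pattern)
  current
end
```

With these offsets, the lookup would use Regexp.last_match[wrapper_index[i]] in place of a fixed per-pattern index.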

Addition: Given:

dogs = {
  'saluki' => 'Hounds',
  'russian wolfhound' => 'Hounds',
  'italian greyhound' => 'Hounds',
   ..
}
content = "Running in the fields at great speeds, the sleek saluki dog comes from..."

you can do this:

combined_regex =
    Regexp.new(dogs.keys.map{|w| "(?=[.\n]*(#{w}))?"}.join, Regexp::IGNORECASE)
content =~ combined_regex
# Select from the hash keys (not patterns); capture 0 is the whole match.
r = dogs.keys.select.with_index{|w, i| Regexp.last_match[i + 1]}
"This article talks about #{r.collect{|x| dogs[x]}.to_sentence}."
=> "This article talks about Hounds."

To avoid output like "This article talks about Hounds, Hounds and Hounds.", you may want to add uniq after mapping the breeds to their groups.

"This article talks about #{r.collect{|x| dogs[x]}.uniq.to_sentence}."

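A simpler variant for the dog-breed case can be sketched with Regexp.union and String#scan, without lookaheads or capture indexes. This is an alternative technique, not the answer's method. It assumes the hash keys are lowercase literal strings, uses only the three breeds shown above, and uses join instead of the Rails-only to_sentence.

```ruby
dogs = {
  'saluki' => 'Hounds',
  'russian wolfhound' => 'Hounds',
  'italian greyhound' => 'Hounds'
}
content = "Running in the fields at great speeds, the sleek saluki dog comes from..."

# Regexp.union escapes each key; rebuild the regex to add IGNORECASE.
breed_regex = Regexp.new(Regexp.union(dogs.keys).source, Regexp::IGNORECASE)

# scan returns every matched breed; map to its group and de-duplicate.
groups = content.scan(breed_regex).map { |breed| dogs[breed.downcase] }.uniq
sentence = "This article talks about #{groups.join(', ')}."
```
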
誓猎贰

How about:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [ /(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/ ]

regex = Regexp.union(targets)

hits = []
text.scan(regex) { |a| hits += a.each_with_index.to_a }
r = hits.select{ |w,i| w }.map{ |w,i| targets[i]} # => [/([Ll]or[eu]m)/, /(ips.m)/, /(am?et)/, /(elit)/, /(magna)/]

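This returns the matched patterns in the order in which the words were found in the text. There may also be a way to do this using named captures.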
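A named-capture version might look like the sketch below. The group names p0, p1, ... are hypothetical. It relies on Ruby's rule that unnamed groups stop capturing once a named group is present, so each pattern's inner parentheses do not interfere. Like the scan above, it reports only the alternative that wins at each position, so a pattern whose matches always overlap an earlier hit would be missed.

```ruby
text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [/(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/]

# Wrap every pattern in its own named group: p0, p1, ...
combined = Regexp.new(
  targets.each_with_index.map { |t, i| "(?<p#{i}>#{t})" }.join('|')
)

matched = []
text.scan(combined) do
  m = Regexp.last_match
  # Exactly one alternative matched; find the named group that captured text.
  name = m.names.find { |n| m[n] }
  matched << targets[name[1..-1].to_i]
end
matched = matched.uniq
```
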

届甸衬丝蚕

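What you need is exactly what a lexer is designed for: it makes a single pass over the input and picks out matches for a set of regular expressions from the input stream. Unfortunately, I couldn't find a good lexer gem for Ruby that lets you define your own lexer. I'll update this answer if I find one.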
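Ruby's standard library does ship StringScanner (strscan), a common building block for hand-written lexers. The sketch below walks the text once from left to right and records which pattern matches at each position. It is not a true single-pass lexer, since it still tries each pattern in turn at every position; the sample data is borrowed from the previous answer.

```ruby
require 'strscan'

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [/(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/]

scanner = StringScanner.new(text)
found = []
until scanner.eos?
  # StringScanner#scan only matches at the current scan position.
  hit = targets.find do |t|
    token = scanner.scan(t)
    token && !token.empty?   # ignore empty matches so the loop always advances
  end
  if hit
    found << hit
  else
    scanner.getch            # no pattern starts here; skip one character
  end
end
found = found.uniq
```
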

arrays

search

performance

ruby

regex
