Best way to match a large list against strings in Python

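I have a Python list of about 700 terms that I would like to use as metadata for some database entries in Django. I want to match the terms in the list against the entry descriptions to see whether any of them occur, but there are a few issues. My first issue is that the list contains some multi-word terms that include words from other list entries. An example is: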

Intrusion
Intrusion Detection

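I've played with re.findall, but it doesn't get me very far, since in the example above it matches both Intrusion and Intrusion Detection. I only want to match "Intrusion Detection" and not "Intrusion". Is there a better way to do this kind of matching? I thought about trying NLTK, but it didn't look like it could help with this type of matching.

Edit: To make this clearer, I have a list of 700 terms such as firewall or intrusion detection. I want to match the words in this list against descriptions stored in my database to see whether any of them occur, and I will use the matching terms as metadata. So if I have the following string: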

There are many types of intrusion detection devices in production today.

and if I have a list with the following terms:

Intrusion
Intrusion Detection

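I would like to match "Intrusion Detection" but not "Intrusion". Ideally I would also like to match singular/plural forms, but I may be getting ahead of myself. The idea behind all of this is to collect all the matches in a list and then process them.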

2 answers

鞘垒飘

If you need more flexibility when matching entry descriptions, you can combine nltk and re:

from nltk.stem import PorterStemmer
import re

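Suppose you have different descriptions of the same event, i.e. a rewrite of the system. You can use a stemmer from nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, and so on.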

master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add a 'match anything after it' expression to each of the stemmed words
# and join the result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

Output:

['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']

['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
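
Note that the `.*` in these patterns is greedy and unanchored, so a match runs to the end of the sentence, and a stem can also match in the middle of a longer word. Below is a minimal sketch of a tighter variant. It assumes the stems have already been computed (they are hard-coded here so the example runs without nltk), and the helper name `build_pattern` is made up for illustration:

```python
import re

def build_pattern(stems):
    # \b anchors each stem at the start of a word, \w* absorbs the suffix
    # (plural, -ed, -ing, ...), and the lazy .*? allows other words between
    # the stems without running to the end of the sentence.
    parts = [r'\b' + re.escape(stem) + r'\w*' for stem in stems]
    return re.compile('.*?'.join(parts), flags=re.IGNORECASE)

# Stems are hard-coded for illustration; in the answer they come from PorterStemmer.
pattern = build_pattern(['approv', 'rewrit'])
m = pattern.search('The CTO is about to approve a complete rewrite of the system')
print(m.group(0))  # approve a complete rewrite
```

Whether the lazy gap between stems is desirable depends on how loosely you want multi-word terms to match; replacing `.*?` with `\s+` would require the words to be adjacent.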

Edit: to see which of the terms caused the match:

for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:

TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls

俯乡骚钵皆

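The question is unclear, but from what I understand you have a master list of terms, say one term per line. You also have a list of test data, some of which will be in the master list and some of which won't. You want to check whether each piece of test data is in the master list and, if it is, perform a task.

Assuming your master list looks like this:

intrusion detection
firewall
FooBar

and your test data looks like this:

intrusion
Intrusion Detection
foo
bar

this simple script should point you in the right direction: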

#!/usr/bin/env python

import sys

def main():
  '''usage: tester.py masterList testList'''

  # build the master list: strip the trailing newline and lower-case
  # each term so that "Intrusion" and "intrusion" compare equal
  with open(sys.argv[1], 'r') as masterListFile:
    masterList = [line.strip().lower() for line in masterListFile]

  # run the test; the files are closed automatically by the with blocks
  with open(sys.argv[2], 'r') as testListFile:
    for line in testListFile:
      term = line.strip().lower()
      if term in masterList:
        print(term, "in master list!")
        # perhaps grab your metadata using a LIKE %% query
      else:
        print("OH NO!", term, "not found!")

if __name__ == \'__main__\':
  main()

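Sample output:

OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!

There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that large), consider using a dictionary; I think it will be faster, especially if you plan to query a database. The dictionary structure might look like {term: information about term}.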
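
A minimal sketch of the dictionary idea, assuming lower-cased terms as keys. The metadata values and the helper name `lookup` are made up for illustration. A dict lookup takes constant time on average, whereas `term in masterList` scans the whole list:

```python
# Hypothetical metadata; the keys are the lower-cased terms.
term_info = {
    'intrusion detection': {'category': 'security'},
    'firewall': {'category': 'network'},
    'foobar': {'category': 'misc'},
}

def lookup(term):
    # Normalise the input the same way as the keys;
    # returns None when the term is not in the dictionary.
    return term_info.get(term.strip().lower())

for candidate in ['intrusion', 'Intrusion Detection', 'foo', 'bar']:
    info = lookup(candidate)
    if info is not None:
        print(candidate, '->', info)
    else:
        print('OH NO!', candidate, 'not found!')
```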
