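How do I use OpenNLP in Java?

I want to POS-tag English sentences and then do some processing on them. I would like to use OpenNLP. I have it installed, and when I run the command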

I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt

it POS-tags the input in Text.txt and prints this output:

    Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.


Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s

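I hope this means it is installed correctly? Now, how do I do this POS tagging from inside a Java application? I have added the opennlp-tools, jwnl, and maxent jars to the project, but how do I invoke the POS tagger?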

3 answers

嗜蒂谷尘旱

Here is some (old) sample code I put together, with modernized code to follow:

package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
    public static void main(String[] args) throws IOException {
        // Load the pre-trained English POS model from the working directory.
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new StringReader(input));

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            // Split on whitespace, then tag each token.
            String[] whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(whitespaceTokenizerLine);

            // POSSample.toString() renders the sentence as word_TAG pairs.
            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(sample.toString());

            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
    }
}

The output is:

Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN

Average: 76.9 sent/s 
Total: 1 sent
Runtime: 0.013s

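This is basically adapted from the POSTaggerTool class that ships with OpenNLP. sample.getTags() returns a String array holding the tags themselves. This approach requires direct file access to the trained model data, which is really lame. An updated codebase is a little different (and probably more useful). First, a Maven POM: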

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

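Here is the code, written as a test, so it lives in the project's test source tree (with the Maven layout above, that is conventionally src/test/java, in a directory matching the package):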

package org.javachannel.opennlp.example;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;

public class POSTest {
    private void download(String url, File destination) throws IOException {
        URL website = new URL(url);
        // try-with-resources closes both the network channel and the output file.
        try (ReadableByteChannel rbc = Channels.newChannel(website.openStream());
             FileOutputStream fos = new FileOutputStream(destination)) {
            fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        }
    }

    @DataProvider
    Object[][] getCorpusData() {
        return new Object[][][]{{{
                "Can anyone help me dig through OpenNLP's horrible documentation?"
        }}};
    }

    @Test(dataProvider = "getCorpusData")
    public void showPOS(Object[] input) throws IOException {
        File modelFile = new File("en-pos-maxent.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
        }
        POSModel model = new POSModel(modelFile);
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        perfMon.start();
        Stream.of(input).map(line -> {
            String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
            String[] tags = tagger.tag(whitespaceTokenizerLine);

            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);

            perfMon.incrementCounter();
            return sample.toString();
        }).forEach(System.out::println);
        perfMon.stopAndPrintFinalResult();
    }
}

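This code does not actually test anything (it is a smoke test, if anything), but it should serve as a starting point. Another (potentially) nice thing is that it downloads the model for you if you do not already have it.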
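If you want the download-if-missing step outside of a test, here is a minimal sketch using only the JDK. ModelFetcher and fetchIfAbsent are hypothetical names, not part of OpenNLP, and writing to a temporary file first is my own assumption about what is sensible: it keeps an interrupted download from being mistaken for a complete model on the next run.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of OpenNLP.
public class ModelFetcher {

    /**
     * Copies the resource at sourceUrl to destination unless destination
     * already exists. Returns true if a copy was made, false if the
     * existing file was reused.
     */
    public static boolean fetchIfAbsent(String sourceUrl, Path destination) {
        if (Files.exists(destination)) {
            return false; // reuse the copy that is already on disk
        }
        Path dir = destination.toAbsolutePath().getParent();
        try {
            // Download next to the destination, then move into place.
            Path partial = Files.createTempFile(dir, "model-", ".part");
            try (InputStream in = URI.create(sourceUrl).toURL().openStream()) {
                Files.copy(in, partial, StandardCopyOption.REPLACE_EXISTING);
                Files.move(partial, destination, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                // No-op after a successful move; cleans up after a failure.
                Files.deleteIfExists(partial);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the model URL from the test above, a call would look like ModelFetcher.fetchIfAbsent("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", Paths.get("en-pos-maxent.bin")).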

钨蜡唤喉晤

The URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html no longer works. I found the same material on slide 14 of http://www.slideshare.net/gagan1667/opennlp-demo
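In case the slide link goes away as well, here is a small lookup sketch covering only the tags that appear in the outputs on this page. The descriptions are the standard Penn Treebank ones; the class name PennTags and its methods are hypothetical, and this is not a copy of what the slide shows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a partial Penn Treebank glossary, not the full tag set.
public class PennTags {
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();

    static {
        DESCRIPTIONS.put("CD", "cardinal number");
        DESCRIPTIONS.put("FW", "foreign word");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("MD", "modal");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("PRP", "personal pronoun");
        DESCRIPTIONS.put("PRP$", "possessive pronoun");
        DESCRIPTIONS.put("VB", "verb, base form");
        DESCRIPTIONS.put("VBP", "verb, non-3rd person singular present");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put(".", "sentence-final punctuation");
    }

    /** Returns the description for a tag, or "unknown tag" if it is not listed. */
    public static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    /** Expands a tagger token such as "name_NN" to "name (noun, singular or mass)". */
    public static String explain(String taggedToken) {
        // The tag follows the last underscore, so words containing '_' still work.
        int split = taggedToken.lastIndexOf('_');
        if (split < 0) {
            return taggedToken;
        }
        String word = taggedToken.substring(0, split);
        String tag = taggedToken.substring(split + 1);
        return word + " (" + describe(tag) + ")";
    }
}
```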

郸身

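The answers above show how to use OpenNLP's existing models, but if you need to train your own model, the following may help. Here is a detailed tutorial with full code: https://dataturks.com/blog/opennlp-pos-tagger-training-java-example.php

Depending on your domain, you can build the dataset automatically or manually. Building such a dataset by hand can be really painful; a POS tagging (annotation) tool can make the process easier.

Training data format

The training data is passed as a text file in which each line is one data item. Every word in the line should be labeled in the format "word_LABEL", with the word and the label name separated by an underscore '_'.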

anki_Brand overdrive_Brand
just_ModelName dance_ModelName 2018_ModelName
aoc_Brand 27"_ScreenSize monitor_Category
horizon_ModelName zero_ModelName dawn_ModelName
cm_Unknown 700_Unknown modem_Category
computer_Category
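Before training, it can help to check that every token in the file really has the word_LABEL shape. Below is a minimal JDK-only sketch; TrainingLineParser is a hypothetical name, and treating the text after the last underscore as the label is my assumption about how such lines should be read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator for word_LABEL training lines; not part of OpenNLP.
public class TrainingLineParser {

    /**
     * Splits one training line into {word, label} pairs.
     * Throws IllegalArgumentException for a blank line or for any token
     * that is missing a word or a label.
     */
    public static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // The label follows the last underscore in the token.
            int split = token.lastIndexOf('_');
            if (split <= 0 || split == token.length() - 1) {
                throw new IllegalArgumentException("Expected word_LABEL but got: " + token);
            }
            pairs.add(new String[] {token.substring(0, split), token.substring(split + 1)});
        }
        return pairs;
    }
}
```

Running each line of the data file through parse and reporting the line number of the first exception is usually enough to catch stray spaces or missing labels before the trainer sees them.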

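Train the model

The important class here is POSModel, which holds the actual model. We use the POSTaggerME class to build the model. Below is the code to build a model from a training data file: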

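import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Trains a POS model from a file of word_LABEL lines; returns null on failure.
public POSModel train(String filepath) {
  TrainingParameters parameters = TrainingParameters.defaultParams();
  parameters.put(TrainingParameters.ITERATIONS_PARAM, "100");

  // Open a fresh stream on every call, so the data can be re-read if the stream is reset.
  InputStreamFactory inputFactory = () -> new FileInputStream(filepath);

  try {
    ObjectStream<String> lineStream = new PlainTextByLineStream(inputFactory, StandardCharsets.UTF_8);
    ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
    try {
      return POSTaggerME.train("en", sampleStream, parameters, new POSTaggerFactory());
    } finally {
      sampleStream.close(); // also closes the underlying line stream
    }
  } catch (IOException e) {
    e.printStackTrace();
    return null;
  }
}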

Use the model to do tagging

Finally, we can see how the model can be used to tag unseen queries:

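import java.util.Arrays;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.Sequence;

public void doTagging(POSModel model, String input) {
    // Tokenize once on single spaces, matching the training data format.
    String[] tokens = input.trim().split(" ");
    POSTaggerME tagger = new POSTaggerME(model);
    // topKSequences returns several candidate tag sequences, not just one.
    Sequence[] sequences = tagger.topKSequences(tokens);
    for (Sequence s : sequences) {
        List<String> tags = s.getOutcomes();
        System.out.println(Arrays.asList(tokens) + " =>" + tags);
    }
}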
