
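1. What is Lucene

Lucene provides a simple yet powerful application programming interface for full-text indexing [turning unstructured file content into structured data (much like database records)] and for searching. In the Java ecosystem, Lucene is a mature, free, open-source tool. It is currently, and has been for the past several years, the most popular free information retrieval library for Java.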

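1.1 Full-text search

What is full-text search? Suppose you want to find a string in a file. The most direct approach is to scan from the beginning until you find it. For small files this is simple and practical, but for large files it becomes slow. The reverse problem, finding which files contain a given string (for example, which files contain springboot), works the same way: on a disk holding tens of gigabytes, scanning every file is very inefficient.

The data in files is unstructured, that is, it has no structure to speak of (unlike the information in a database, which can be matched and queried row by row). To solve the efficiency problem described above, we first extract part of the information from the unstructured data and reorganize it so that it has some structure (put plainly, we turn it into row-by-row data like that of a relational database). We then search this structured data, which is comparatively fast. This is full-text search: first build an index (a table-like structure holding the keywords extracted from the files), then search the index.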

1.2 How Lucene builds an index

So how does Lucene build an index? Suppose there are two articles with the following content:

Article 1: Tom lives in Guangzhou, I live in Guangzhou too.

Article 2: He once lived in Shanghai.

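The first step is to pass each document to the tokenizer (Tokenizer), which splits the document into individual words and removes punctuation and stop words. Stop words are words that carry no particular meaning, such as a, the and too in English; in this walkthrough, in and once are treated as stop words as well. Tokenization yields tokens (Token):

Article 1 after tokenization: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]

Article 2 after tokenization: [He] [lived] [Shanghai]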

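The tokens are then passed to the linguistic processor (Linguistic Processor). For English, this component typically converts letters to lowercase, reduces words to their root form by stripping endings (stemming, e.g. "lives" to "live"), and maps irregular forms to their root (lemmatization, e.g. "drove" to "drive"). The result is a list of terms (Term):

Article 1 after processing: [tom] [live] [guangzhou] [i] [live] [guangzhou]

Article 2 after processing: [he] [live] [shanghai]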
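The tokenizing and linguistic-processing steps above can be imitated in a few lines of plain Java. This is only a toy sketch, not Lucene's analyzer: the class name ToyAnalyzer, the stop-word set (in, once, too) and the two-entry root table are assumptions chosen so that the two example articles come out as shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; it does not use any Lucene class.
class ToyAnalyzer {
    // Stop words assumed for this walkthrough only (not Lucene's default list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("in", "once", "too"));

    // Tiny lookup table standing in for a real stemmer or lemmatizer.
    private static final Map<String, String> ROOTS = new HashMap<>();
    static {
        ROOTS.put("lives", "live");
        ROOTS.put("lived", "live");
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is not a letter, which also drops punctuation.
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String lower = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            // Replace known inflected forms with their root; keep other words as-is.
            terms.add(ROOTS.getOrDefault(lower, lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [tom, live, guangzhou, i, live, guangzhou]
        System.out.println(analyze("Tom lives in Guangzhou, I live in Guangzhou too."));
        // prints [he, live, shanghai]
        System.out.println(analyze("He once lived in Shanghai."));
    }
}
```

A real analyzer applies general rules rather than a lookup table, but the order of operations (split, filter, normalize) is the same idea.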

Finally, the terms are passed to the indexer (Indexer), which produces the following index structure:

Term	Article[frequency]	Positions (per article)
guangzhou	1[2]	3,6
he	2[1]	1
i	1[1]	4
live	1[2],2[1]	2,5;2
shanghai	2[1]	3
tom	1[1]	1
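A structure like the table above can be rebuilt with sorted maps from the Java standard library. The sketch below is an illustration under stated assumptions, not Lucene's on-disk format: ToyInvertedIndex and its methods are hypothetical names, and positions are counted from 1 to match the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index; it does not use any Lucene class.
class ToyInvertedIndex {
    // term -> (article number -> positions); TreeMap keeps the terms in sorted order.
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    void add(int articleId, List<String> terms) {
        for (int i = 0; i < terms.size(); i++) {
            postings.computeIfAbsent(terms.get(i), t -> new TreeMap<>())
                    .computeIfAbsent(articleId, id -> new ArrayList<>())
                    .add(i + 1); // 1-based position, as in the table
        }
    }

    // Returns article number -> positions for the term, or an empty map if it is unknown.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, new TreeMap<>());
    }

    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, Arrays.asList("tom", "live", "guangzhou", "i", "live", "guangzhou"));
        index.add(2, Arrays.asList("he", "live", "shanghai"));
        for (String term : index.terms()) {
            // The frequency of a term in an article is the size of its position list.
            System.out.println(term + "\t" + index.lookup(term));
        }
    }
}
```

For the term live, the lookup returns positions 2 and 5 in article 1 and position 2 in article 2, which matches the corresponding row of the table.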

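This is the core of Lucene's index structure. The keywords are sorted in character order, so Lucene can use binary search to locate a keyword quickly. In the implementation, Lucene stores the three columns above as a term dictionary file (Term Dictionary), a frequency file (frequencies) and a position file (positions). The term dictionary stores not only each keyword but also pointers into the frequency and position files; following these pointers yields the keyword's frequency and position information.

A search first runs a binary search on the dictionary to find the term, then follows the pointer into the frequency file to read all article numbers and returns them. The term can then be located inside each article using the recorded positions. Building the index for the first time may therefore be slow, but later searches reuse the existing index instead of rebuilding it, so they are fast.

Now that we understand how Lucene tokenizes text and builds an index, we will integrate Lucene into Spring Boot and implement index creation (think of it as tokenizing the content of each file and storing it, in order, in a database table) and search.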
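The lookup described above can be sketched with a sorted array and the standard library's binary search. This is a simplified illustration, not Lucene's actual file layout: ToyTermDictionary is a hypothetical name, and the array index plays the role of the pointer from the dictionary into the frequency file.

```java
import java.util.Arrays;

// Hypothetical term dictionary for the two example articles; no Lucene class is used.
class ToyTermDictionary {
    // Terms in sorted order, as binary search requires.
    private static final String[] TERMS =
            {"guangzhou", "he", "i", "live", "shanghai", "tom"};

    // ARTICLES[i] holds the article numbers that contain TERMS[i].
    private static final int[][] ARTICLES =
            {{1}, {2}, {1}, {1, 2}, {2}, {1}};

    static int[] search(String term) {
        // Arrays.binarySearch returns a negative value when the term is absent.
        int slot = Arrays.binarySearch(TERMS, term);
        return slot >= 0 ? ARTICLES[slot] : new int[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(search("live")));    // prints [1, 2]
        System.out.println(Arrays.toString(search("beijing"))); // prints []
    }
}
```

Each lookup inspects only a logarithmic number of dictionary entries, which is why searching the index is much cheaper than scanning every file.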

2. Integrating Lucene in Spring Boot

First, add the Lucene dependencies. There are several of them:

<!-- Lucene core -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Lucene query parser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Common analyzers (English) -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Highlighting of matched terms -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>5.3.1</version>
</dependency>
<!-- Chinese word segmentation -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>5.3.1</version>
</dependency>

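The last dependency adds Chinese word segmentation, because the default analyzers target English.

2.2 Quick start

As discussed above, full-text search has two steps: build the index, then search it. To try this out, we create two Java classes, one that builds the index and one that searches it.

2.2.1 Building the index

Put a few files of your own into the F:\java\lucence\data directory and create an Indexer class that builds the index. The constructor initializes the standard analyzer and creates the index writer.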

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Paths;

public class Indexer {
    /* writer: the index writer. It builds the index, i.e. it extracts the terms of each file
       and records how often and where each term occurs, and in which file. */
    private IndexWriter writer;

    /**
     * Constructor: creates the IndexWriter.
     * @param indexDir directory in which the index files are stored
     * @throws IOException if the index directory cannot be opened
     */
    public Indexer(String indexDir) throws IOException {
        // Open the folder that will hold the index
        Directory dir = FSDirectory.open(Paths.get(indexDir));

        // Standard analyzer: drops whitespace and stop words such as is, a, the
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Configure the writer to use the standard analyzer when indexing
        IndexWriterConfig config = new IndexWriterConfig(analyzer);

        // Create the index writer
        writer = new IndexWriter(dir, config);
    }

    /**
     * Builds the document for a file. A document holds fields, much like a row in a database table.
     * @param file the file to index
     * @return a document with the fields contents, fileName and fullPath
     * @throws Exception if the file cannot be read
     */
    private Document getDocument(File file) throws Exception {
        Document doc = new Document();
        // Think of doc as one row of a database table with three columns:
        //   contents: the text of the file
        //   fileName: the file name
        //   fullPath: the full path of the file

        // Add the contents; a Reader-based TextField is indexed but not stored
        doc.add(new TextField("contents", new FileReader(file)));
        // Add the file name and store its value in the index
        doc.add(new TextField("fileName", file.getName(), Field.Store.YES));
        // Add the file path and store its value in the index
        doc.add(new TextField("fullPath", file.getCanonicalPath(), Field.Store.YES));
        return doc;
    }

    /**
     * Indexes a single file.
     * @param file the file to index
     * @throws Exception if the file cannot be read or indexed
     */
    private void indexFile(File file) throws Exception {
        System.out.println("Indexing file: " + file.getCanonicalPath());
        // Build the document for this file with getDocument above
        Document doc = getDocument(file);
        // Add the document to the index
        writer.addDocument(doc);
    }
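    /**
     * Indexes all files in the given directory.
     * @param dataDir directory containing the files to index
     * @return the number of documents in the index
     * @throws Exception if a file cannot be read or indexed
     */
    public int indexAll(String dataDir) throws Exception {
        // List the entries of dataDir; listFiles() returns null if dataDir is not a directory
        File[] files = new File(dataDir).listFiles();
        int numDocs = 0;
        try {
            if (null != files) {
                for (File file : files) {
                    // Skip subdirectories: FileReader can only open regular files
                    if (file.isFile()) {
                        // One document per file: file name, file path and file contents
                        indexFile(file);
                    }
                }
                numDocs = writer.numDocs();
            }
        } finally {
            // Close the writer even if indexing fails, so the index lock is released
            writer.close();
        }
        // Return the number of indexed documents
        return numDocs;
    }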

}

Generating the index:

public class MakeIndexer {
    public static void main(String[] args) {
        // Directory in which the index is stored
        String indexDir = "F:\\java\\lucence";
        // Directory containing the files to index
        String dataDir = "F:\\java\\lucence\\data";

        Indexer indexer = null;
        int indexedNum = 0;
        // Record the time at which indexing starts
        long startTime = System.currentTimeMillis();
        try {
            // Build the index
            indexer = new Indexer(indexDir);
            indexedNum = indexer.indexAll(dataDir);
        }
        catch (Exception e) {
            e.printStackTrace();
        }

        long endTime = System.currentTimeMillis();
        System.out.println("Indexing took " + (endTime - startTime) + " ms");
        System.out.println("Indexed " + indexedNum + " file(s)");
    }
}

The search class:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void search(String indexDir, String q) throws Exception {
        // Open the index location and a reader on it; both are closed automatically
        try (Directory dir = FSDirectory.open(Paths.get(indexDir));
             IndexReader reader = DirectoryReader.open(dir)) {
            // Build the IndexSearcher
            IndexSearcher searcher = new IndexSearcher(reader);
            // Standard analyzer: drops whitespace and stop words such as is, a, the
            Analyzer analyzer = new StandardAnalyzer();
            // Query parser; the default field is contents (defined when the index was built)
            QueryParser parser = new QueryParser("contents", analyzer);
            // Parse the query string q that was passed in to obtain a Query object
            Query query = parser.parse(q);
            // Record the time at which the search starts
            long startTime = System.currentTimeMillis();
            // Run the query and keep the top 10 hits in docs
            TopDocs docs = searcher.search(query, 10);
            // Record the time at which the search ends
            long endTime = System.currentTimeMillis();
            System.out.println("Matching " + q + " took " + (endTime - startTime) + " ms");
            System.out.println("Found " + docs.totalHits + " record(s)");

            // Walk through the hits
            for (ScoreDoc scoreDoc : docs.scoreDocs) {
                // scoreDoc.doc is the document ID; use it to fetch the document
                Document doc = searcher.doc(scoreDoc.doc);
                // fullPath is a stored field defined at indexing time; any other stored field can be read the same way
                System.out.println(doc.get("fullPath"));
            }
        }
    }
}

Running a search:

public class SerchIndexer {
    public static void main(String[] args) {
        String indexDir = "F:\\java\\lucence";
        // The string to search for
        String q = "thank";
        try{
            Searcher.search(indexDir,q);
        }
        catch (Exception e){
            e.printStackTrace();
        }
    }
}

Running the search prints:
Matching thank took 35 ms
Found 1 record(s)
F:\java\lucence\data\2.txt

Integrating Lucene in Spring Boot to implement full-text search