Design and Implementation of a Search Engine Based on HBase

Abstract: The web offers abundant resources, but finding information in it efficiently is difficult, and building a search engine is the most practical way to address this problem. The term "search engine" usually refers to a full-text search engine, which collects tens of millions of web pages from the Internet and indexes every word (that is, every keyword) on each page. When a user searches for a keyword, all pages containing it are retrieved and displayed as search results. This thesis studies the basic implementation of such a search engine. First, for data acquisition, a Nutch crawler system is deployed; once the crawl command is executed, web pages are fetched automatically in the background and stored in a pseudo-distributed HBase database. Second, NoSQL-based storage is used: a pseudo-distributed environment comprising Hadoop, HBase, and ZooKeeper is deployed, the data fetched by Nutch is stored in HBase, and an indexer indexes the stored data. Keyword indexing is also provided, although this work is still limited: it implements a basic index over the data rather than a true inverted index. An external request interface is exposed over HTTP, which keeps the coupling between components low and lets each component be maintained more freely and independently. Requests are executed internally by keyword, and the retrieved data is displayed on the front end. In addition, deploying the pseudo-distributed runtime environment and configuring the development, debugging, and test environments form the basis for the normal operation of all components. In summary, this thesis implements a search engine framework based on NoSQL technology: web page data is crawled and stored in a pseudo-distributed NoSQL database, and a Java class provides the front end with a query interface that retrieves the keyword-indexed data and presents it to the user.

Keywords: Search engine; Web crawler; Index; Retrieval; NoSQL; HBase
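To make the idea of keyword indexing concrete, the following is a minimal in-memory sketch in Java of how a word can be mapped to the pages that contain it. It is an illustration only and not the thesis's implementation: the class name `KeywordIndex`, its methods, and the example URLs are hypothetical, the data lives in a `HashMap` rather than in HBase, and tokenization simply splits on non-alphanumeric characters (so Chinese text would need a proper word segmenter).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy keyword index (hypothetical): maps each word to the URLs of pages containing it. */
public class KeywordIndex {
    // word -> URLs of pages containing that word, in insertion order
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the page text into lowercase words and records the URL under each word. */
    public void addPage(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(url);
            }
        }
    }

    /** Returns the URLs of all indexed pages containing the keyword (case-insensitive). */
    public List<String> search(String keyword) {
        Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        KeywordIndex index = new KeywordIndex();
        index.addPage("http://example.com/a", "HBase stores crawled pages");
        index.addPage("http://example.com/b", "Nutch crawls pages into HBase");
        System.out.println(index.search("hbase")); // both example URLs
        System.out.println(index.search("nutch")); // only the second URL
    }
}
```

In a system like the one described, the same word-to-pages mapping would be persisted in the database and queried through the HTTP interface rather than held in memory.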

Abstract (Chinese)

Abstract (English)

Contents

1 Introduction

1.1 Background and significance of the topic

1.2 State of development in China and abroad

1.3 Research methods and tools used in the thesis

1.4 Basic approach and logical structure of the thesis

2 Feasibility study

2.1 Overview

2.2 Feasibility analysis

2.3 Conclusion

3 Search engine analysis

3.1 Search engine architecture

3.2 Search engine workflow

3.3 Open issues from the search engine analysis

4 Search engine design

4.1 Crawler system

4.2 Storage

5 Search engine implementation

5.1 Ubuntu and application configuration

5.2 Hadoop configuration

5.3 HBase configuration

5.4 Server startup scripts

5.5 Nutch configuration

References

Acknowledgements

Appendix A
