Keywords: search engine; web crawler; retrieval; NoSQL; HBase
Design and Implementation of a Search Engine Based on HBase
Abstract: The Internet holds abundant resources, but finding information in it effectively is difficult, and building a search engine is the best way to solve this problem. A search engine usually refers to a full-text search engine that collects tens of millions of web pages from the Internet and indexes every word (i.e. keyword) in each page. When a user searches for a keyword, all pages containing that keyword are retrieved and displayed as search results. This paper studies the basic implementation of a search engine. First, for data fetching, a Nutch crawler system is deployed; once the crawl command is executed, a background process automatically crawls web pages and stores them in a pseudo-distributed HBase database. Second, for storage, a pseudo-distributed NoSQL environment consisting of Hadoop, HBase, and ZooKeeper is deployed. The data captured by Nutch is stored in the pseudo-distributed NoSQL database HBase, and the indexer indexes the stored data. Third, for keyword indexing, the work completed so far is limited: it provides a basic index over the data rather than a true inverted index. An external request interface is exposed over HTTP, so that the components are loosely coupled and each can be maintained freely and independently. A request carries keywords, the query is executed internally, and the fetched data is displayed at the front end. In addition, deploying the pseudo-distributed runtime environment required by the engine, including the pseudo-distributed NoSQL configuration and the development, debugging, and test environments, is the basis for the normal operation of all components. In summary, this paper implements a search engine framework based on NoSQL technology: crawled web page data is stored in a pseudo-distributed NoSQL database, and a Java class provides a query interface to the front end, obtains the keyword-indexed data, and shows it to the user.
Keywords: search engine; index; retrieval; NoSQL; HBase
Contents
Abstract (in Chinese)
Abstract (in English)
Contents
1 Introduction
1.1 Background and significance of the topic
1.2 Development status at home and abroad
1.3 Research methods and tools used in this paper
1.4 Basic approach and logical structure of the paper
2 Feasibility study
2.1 Overview
2.2 Feasibility analysis
2.3 Conclusion
3 Search engine analysis
3.1 Architecture of the search engine
3.2 Workflow of the search engine
3.3 Open issues from the search engine analysis
4 Search engine design
4.1 Crawler system
4.2 Storage
5 Search engine implementation
5.1 Ubuntu and application configuration
5.2 Hadoop configuration
5.3 HBase configuration
5.4 Server startup script
5.5 Nutch configuration
References
Acknowledgements
Appendix A