
Research on real-time webpage classification model based on the theme

Published: 2014-02-28  Views: 1804  Downloads: 525
Authors: MA Jianhong, ZHANG Chenguang, QIU Jiying
Affiliation: School of Computer Science and Software Engineering, Hebei University of Technology
Abstract: This paper studies the general classification model and analyzes its shortcomings for real-time webpage classification. On this basis, it proposes a theme-based webpage classification model that is better suited to real-time classification. First, a theme crawler for a vertical search engine is built with Nutch; it crawls webpages on the Internet continuously, which keeps the collected pages current. Second, theme denoising is applied to the Nutch crawl results to remove pages that are irrelevant to the classification. Finally, the crawled webpages are classified. Experiments show that the model greatly improves both the speed and the accuracy of webpage classification. For the large data volumes that real-time webpage classification must handle, the model effectively optimizes the input samples and saves computing time.
Key words: computer application; theme; classification; real-time classification
Issue: No. 4, February 2014
Citation: MA Jianhong, ZHANG Chenguang, QIU Jiying. Research on real-time webpage classification model based on the theme[J]. 中国科技论文在线精品论文, 2014, 7(4): 339-344.
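
The abstract describes a three-stage pipeline: crawl, theme denoising, then classification. The following is a minimal sketch of the last two stages, not the authors' implementation. The abstract does not say how theme denoising or classification is performed, so the keyword-ratio filter, the keyword-count classifier, the names `THEME_TERMS`, `CATEGORY_TERMS`, `denoise`, `classify`, and `run_pipeline`, the threshold value, and the sample pages are all assumptions made for illustration. The in-memory list `CRAWLED_PAGES` stands in for the output of a Nutch crawl.

```python
import re
from collections import Counter

# Hypothetical theme vocabulary; the paper does not list its theme terms.
THEME_TERMS = {"football", "basketball", "match", "league", "player"}

# Hypothetical per-category keywords standing in for a trained classifier.
CATEGORY_TERMS = {
    "football": {"football", "goal", "striker"},
    "basketball": {"basketball", "dunk", "rebound"},
}

# Stand-in for crawler output (in the paper, pages come from Nutch).
CRAWLED_PAGES = [
    "The striker scored a late goal in the football league match",
    "Cheap flights and hotel deals for your summer holiday",
    "A huge dunk and a key rebound decided the basketball match",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def theme_score(tokens):
    """Return the fraction of tokens that belong to the theme vocabulary."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in THEME_TERMS)
    return hits / len(tokens)


def denoise(pages, threshold=0.1):
    """Keep only pages whose theme score reaches the (assumed) threshold."""
    return [p for p in pages if theme_score(tokenize(p)) >= threshold]


def classify(page):
    """Assign the category whose keywords occur most often in the page."""
    counts = Counter(tokenize(page))
    scores = {
        label: sum(counts[t] for t in terms)
        for label, terms in CATEGORY_TERMS.items()
    }
    return max(scores, key=scores.get)


def run_pipeline(pages, threshold=0.1):
    """Denoise the crawled pages, then classify the ones that remain."""
    return [(p, classify(p)) for p in denoise(pages, threshold)]


if __name__ == "__main__":
    for page, label in run_pipeline(CRAWLED_PAGES):
        print(label, "<-", page)
```

Filtering before classification reflects the abstract's claim that the model optimizes the input samples: the off-theme travel page never reaches the classifier, so the classifier processes fewer pages.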
 