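A General Focused Crawler Based on URL Pattern Paths
Published: 2012-10-31 Views: 1683 Downloads: 619
Authors: | LIU Minghai, ZHANG Ming, LIU Jinbao |
Affiliation: | School of Electronics Engineering and Computer Science, Peking University |
Abstract: | This paper proposes a focused crawler that requires no human involvement and automatically generates URL patterns to build its crawl paths. Unlike other methods, which generate URL patterns by clustering, it uses a new method based on a URL pattern tree. A candidate URL pattern set is generated first, and URL patterns are then selected using structural features of the web pages; the selection follows the minimum description length (MDL) principle, a classic model from information theory. Experiments show that a focused crawler that builds its paths from URL patterns can effectively crawl all pages in a site that are of the same type as the given sample pages, and that it is applicable to most sites. |
Keywords: | computer network; focused crawling; URL pattern; minimum description length |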
Title: | A general focused crawler based on URL pattern path |
Author: | LIU Minghai, ZHANG Ming, LIU Jinbao |
Organization: | School of Electronics Engineering and Computer Science, Peking University |
Abstract: | This paper presents a focused crawler that requires no human intervention and builds its crawl paths from URL patterns. Unlike existing clustering-based approaches to URL pattern construction, it uses a URL pattern tree. A URL pattern tree is first constructed from URL syntax, and the best URL pattern set is then selected using the minimum description length (MDL) principle from information theory. Experiments show that the URL pattern path based focused crawler can collect all pages that match the given samples, and that it is applicable to most websites. |
Key words: | computer network; focused crawler; URL pattern; minimum description length |
Issue: | No. 20, October 2012 |
Citation: | LIU Minghai, ZHANG Ming, LIU Jinbao. A general focused crawler based on URL pattern path [J]. 中国科技论文在线精品论文, 2012, 5(20): 1955-1962. |
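
The abstract describes generating candidate URL patterns and choosing among them by minimum description length. The following Python sketch illustrates that general idea only; it is not the authors' algorithm, which works on a URL pattern tree and page structure features that this record does not detail. The `<num>` wildcard, the bit costs, the example URLs, and the names `generalize`, `exact_path`, and `description_length` are all assumptions made for illustration.

```python
import math
import re
from urllib.parse import urlparse

# Hypothetical sample URLs; example.com is a placeholder site.
URLS = [
    "http://example.com/item/101.html",
    "http://example.com/item/102.html",
    "http://example.com/item/203.html",
    "http://example.com/about.html",
]


def exact_path(url: str) -> str:
    """Candidate model 1: every distinct path is its own pattern."""
    return urlparse(url).path


def generalize(url: str) -> str:
    """Candidate model 2: replace each run of digits in the path with <num>."""
    return re.sub(r"\d+", "<num>", urlparse(url).path)


def description_length(urls, to_pattern) -> float:
    """Toy two-part MDL cost in bits: L(model) + L(data | model).

    Assumed costs: 8 bits per pattern character, log2(#patterns) bits to
    name the pattern of each URL, and log2(10) bits for every digit that
    the pattern hides behind a wildcard.
    """
    if not urls:
        return 0.0
    patterns = {to_pattern(u) for u in urls}
    model_bits = 8 * sum(len(p) for p in patterns)
    index_bits = len(urls) * math.log2(len(patterns))
    residual_bits = 0.0
    for u in urls:
        path_digits = sum(ch.isdigit() for ch in urlparse(u).path)
        pattern_digits = sum(ch.isdigit() for ch in to_pattern(u))
        residual_bits += (path_digits - pattern_digits) * math.log2(10)
    return model_bits + index_bits + residual_bits


# Pick the candidate pattern generator with the smaller total cost.
best = min((exact_path, generalize), key=lambda f: description_length(URLS, f))
```

On these sample URLs the generalized model is cheaper, because one `/item/<num>.html` pattern replaces three near-identical paths at the price of spelling out a few digits. A real crawler would compare many more candidate sets and would need a more careful encoding of the wildcard contents.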
