[Repost] Scraping the cnblogs home page with HtmlAgilityPack, a simple implementation (2014-03-31) – DavionKnight – cnblogs (博客园)

I. Time:

2014-03-31 18:08:32. Quitting time again. It has been a busy day and I am tired; I did not even have lunch.

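II. Background:

Back when Windows 8 first came out, I had the idea of building a third-party cnblogs client app. My skills were not up to it and too many other things got in the way, so it went nowhere. Recently I decided to build a website of my own, and for that I wanted to scrape content from cnblogs, because reading blogs every day helps me grow and teaches me a lot.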

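III. Approach:

My first plan was to download the page myself and parse it with regular expressions, but I ran into all sorts of problems once I got to parsing the titles, and it was clear that this approach would not be very efficient either. So I switched to HtmlAgilityPack for the parsing (download: http://htmlagilitypack.codeplex.com/, an open-source DLL, and a very good one). Microsoft's own mshtml would also work, but of the two I found HtmlAgilityPack the better choice, so that is what I used.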
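To illustrate the idea before the full listing, here is a minimal sketch of how HtmlAgilityPack turns an HTML string into a tree that can be queried with XPath. The markup, the `post` class name, and the `XPathSketch` helper are made up for illustration; they are not taken from the cnblogs page.

```csharp
using System;
using HtmlAgilityPack;

static class XPathSketch
{
    // Returns the text of the first post link, or null if the page has none.
    public static string ExtractTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode link = doc.DocumentNode.SelectSingleNode("//div[@class='post']/h3/a");
        return link == null ? null : link.InnerText;
    }

    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html =
            "<html><body><div class='post'>" +
            "<h3><a href='http://example.com/1'>First post</a></h3>" +
            "<p>Summary text</p></div></body></html>";

        Console.WriteLine(ExtractTitle(html));
    }
}
```

Compared with a regular expression, the XPath query does not care about whitespace, attribute order, or line breaks inside the markup, which is where hand-written patterns tend to fail.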

IV. A simple implementation:

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace 测试
{
    class Program
    {
        static void Main(string[] args)
        {
            // Earlier attempt with HttpWebRequest, kept for reference (it needs
            // "using System.IO;"). It did not work with the "#p2" address; see note 1.
            //HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.cnblogs.com/#p2");
            //HttpWebResponse response = (HttpWebResponse)request.GetResponse();
            //Stream recStream = response.GetResponseStream();
            //StreamReader sr = new StreamReader(recStream, Encoding.UTF8);
            //String content = sr.ReadToEnd();

            const string url = "http://www.cnblogs.com/sitehome/p/2";
            const int count = 20; // number of posts read from the page

            // Download the page as UTF-8 text.
            string html;
            using (WebClient wc = new WebClient())
            {
                wc.Encoding = Encoding.UTF8;
                html = wc.DownloadString(url);
            }

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Absolute XPath of the post list; it depends on the page layout
            // at the time of writing.
            string listNode = "/html/body/div[1]/div[4]/div[6]";
            string[] title = new string[count];
            string[] digest = new string[count];
            string[] time = new string[count];
            string[] uriList = new string[count];

            // Matches a timestamp such as " 2014-03-31 18:08".
            Regex r = new Regex(@"\s20\d\d-\d\d-\d\d\s\d\d:\d\d");

            for (int i = 0; i < count; i++)
            {
                string item = listNode + "/div[" + (i + 1) + "]/div[2]";
                HtmlNode node;

                // SelectSingleNode returns null when the path matches nothing,
                // so every lookup is guarded.
                node = doc.DocumentNode.SelectSingleNode(item + "/h3[1]");
                title[i] = node == null ? "" : node.InnerText;

                node = doc.DocumentNode.SelectSingleNode(item + "/p[1]");
                digest[i] = node == null ? "" : node.InnerText;

                node = doc.DocumentNode.SelectSingleNode(item + "/div[1]");
                time[i] = node == null ? "" : r.Match(node.InnerText).Value.Trim();

                node = doc.DocumentNode.SelectSingleNode(item + "/h3[1]/a[1]");
                uriList[i] = node == null ? "" : node.GetAttributeValue("href", "");
            }

            foreach (string str2 in title)
            {
                Console.WriteLine(str2);
            }
            foreach (string str2 in uriList)
            {
                Console.WriteLine(str2);
            }
            foreach (string str2 in time)
            {
                Console.WriteLine(str2);
            }
            Console.ReadKey();
        }
    }
}
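The timestamp step in the listing can be tried on its own. The sketch below applies the same kind of pattern to a sample string and converts the match into a `DateTime`. The sample text and the `TimestampSketch` helper are assumptions made for illustration; the real footer text on the page may look different.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class TimestampSketch
{
    // Same pattern as in the listing, with groups so the leading
    // whitespace is not part of the captured value.
    static readonly Regex Pattern =
        new Regex(@"\s(20\d\d-\d\d-\d\d)\s(\d\d:\d\d)");

    // Returns true and sets 'when' if the text contains a valid timestamp.
    public static bool TryExtract(string text, out DateTime when)
    {
        when = default(DateTime);
        Match m = Pattern.Match(text);
        if (!m.Success) return false;

        string candidate = m.Groups[1].Value + " " + m.Groups[2].Value;
        return DateTime.TryParseExact(candidate, "yyyy-MM-dd HH:mm",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out when);
    }

    static void Main()
    {
        // Hypothetical footer text of one post entry.
        DateTime when;
        if (TryExtract("posted 2014-03-31 18:08 comments(0)", out when))
        {
            Console.WriteLine(when.ToString("yyyy-MM-dd HH:mm"));
        }
    }
}
```

Parsing the match into a `DateTime` also rejects strings that fit the digit pattern but are not real dates.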

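V. Result:

VI. Brief explanation:

The first 20 lines are the post titles.

The middle 20 lines are the post URLs.

The last 20 lines are the timestamps.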
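If one line per post is easier to read than three separate blocks, the parallel arrays can be combined into one record per post. This is only a sketch: the `Post` class, the `PostList.Combine` helper, and the sample values are hypothetical and not part of the original program.

```csharp
using System;
using System.Collections.Generic;

class Post
{
    public string Title;
    public string Url;
    public string Time;
}

static class PostList
{
    // Combines the three parallel arrays into one record per post,
    // skipping slots whose title was never filled in.
    public static List<Post> Combine(string[] titles, string[] urls, string[] times)
    {
        var posts = new List<Post>();
        int n = Math.Min(titles.Length, Math.Min(urls.Length, times.Length));
        for (int i = 0; i < n; i++)
        {
            if (string.IsNullOrEmpty(titles[i])) continue;
            posts.Add(new Post
            {
                Title = titles[i].Trim(),
                Url = urls[i] ?? "",
                Time = (times[i] ?? "").Trim()
            });
        }
        return posts;
    }

    static void Main()
    {
        // Made-up sample data in place of the scraped arrays.
        string[] titles = { " First post ", null };
        string[] urls = { "http://example.com/1", null };
        string[] times = { " 2014-03-31 18:08", null };

        foreach (Post p in Combine(titles, urls, times))
        {
            Console.WriteLine(p.Time + "  " + p.Title + "  " + p.Url);
        }
    }
}
```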

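VII. Notes:

1. http://www.cnblogs.com/sitehome/p/2 is the second page of the home page listing. I first tried http://www.cnblogs.com/#p2, which never worked, and only understood why after reading the JavaScript in the page source. The part after # is a URL fragment, which is not sent to the server, so the request always returns the first page; the script on the page is apparently what loads the real address, /sitehome/p/2.

2. My skills are limited, so please go easy on me if this is not to your taste. Suggestions and discussion are welcome.

3. I have started a new group for making friends, learning, and career discussion, in the hope that members will help and learn from each other. Idle lurkers please stay away; if you are interested, speak up!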
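Note 1 can be checked with `System.Uri`, which separates the part of an address that goes into the request from the fragment that stays on the client. The `PageUrls.ForPage` helper is a hypothetical convenience that follows the /sitehome/p/N pattern from note 1; whether every page number is valid on the site is an assumption.

```csharp
using System;

static class PageUrls
{
    // Hypothetical helper: address of listing page n,
    // following the /sitehome/p/N pattern from note 1.
    public static string ForPage(int n)
    {
        return "http://www.cnblogs.com/sitehome/p/" + n;
    }

    static void Main()
    {
        var withFragment = new Uri("http://www.cnblogs.com/#p2");

        // The path and query are what an HTTP request asks the server for.
        Console.WriteLine(withFragment.PathAndQuery);

        // The fragment is kept by the client and never reaches the server.
        Console.WriteLine(withFragment.Fragment);

        // Addresses that do carry the page number in the path.
        for (int page = 1; page <= 3; page++)
        {
            Console.WriteLine(ForPage(page));
        }
    }
}
```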
