[Repost] Extracting a novel from Baidu Tieba with Python - Mikel

[Repost] Extracting a novel from Baidu Tieba with Python – ToddNet2012 – cnblogs (博客园).

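This blog mostly publishes articles about Windows Phone development and is updated in sync with http://blog.sina.com.cn/u/2391033251 . Today, though, I am going off topic to post a piece of Python code. The program fetches an HTML page from Baidu Tieba, parses it with Beautiful Soup, and extracts the posts in the thread. What is it for? Reading novels, actually. I would like to make it more automated, but I have not been learning Python for long, so this is as far as I have got for now. I will consider adding features when I get the chance.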

The code is as follows:

#-*- encoding: utf-8 -*-
import urllib2
import re
from BeautifulSoup import BeautifulSoup

def stripHTMLTags(html):
    '''strip html tags; from http://goo.gl/EaYp5'''
    return re.sub('<([^!>]([^>]|\n)*)>', '', html)

def fetch_tieba(url, localfile, ignoreFansReq=False):
    '''fetch the url resource and save to localfile'''

    # fetch the url resource and encode to utf-8
    html = urllib2.urlopen(url).read()
    html = unicode(html, 'gb2312', 'ignore').encode('utf-8', 'ignore')

    # extract the main content
    content = BeautifulSoup(html).findAll(attrs={'class': 'd_post_content'})

    # write the content to localfile; binary mode keeps the explicit
    # '\r\n' from being expanded to '\r\r\n' on Windows
    with open(localfile, 'wb') as myfile:
        for item in content:
            # The tag literal replaced here was lost when this page was
            # extracted; a line-break tag (<br>, <br/> or <br />) is assumed.
            item_formatted = stripHTMLTags(
                re.sub(r'<br\s*/?>', '\r\n', str(item)))
            # optionally skip short filler posts (length check only)
            if ignoreFansReq and len(item_formatted) < 100:
                continue
            myfile.write(item_formatted)
            myfile.write('\r\n')
            print item_formatted

def main():
    urlTarget = "http://tieba.baidu.com/p/1234371208"
    localfileTarget = './xiaoshuo2.txt'
    fetch_tieba(url=urlTarget, localfile=localfileTarget, ignoreFansReq=True)

if __name__ == "__main__":
    main()
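
The listing above targets Python 2.7 and BeautifulSoup 3.2.0. For readers on Python 3, the extraction step could be sketched roughly as follows, using the standard library's html.parser in place of Beautiful Soup. This is a swapped-in technique, not the author's code: the names PostExtractor and extract_posts are made up for illustration, the class name d_post_content is taken from the listing, and the sample HTML is invented. Fetching and decoding the page are left out.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every element whose class list has target_class."""

    # Tags that never have a closing tag, so they must not change the depth.
    VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link'}

    def __init__(self, target_class='d_post_content'):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.parts = []   # text pieces of the post currently being read
        self.posts = []   # finished posts, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            if tag == 'br' and self.depth:
                self.parts.append('\r\n')
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth:
            self.depth += 1          # nested element inside a post
        elif self.target_class in classes:
            self.depth = 1           # entering a post container

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br />
        if tag == 'br' and self.depth:
            self.parts.append('\r\n')

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:          # left the post container
            self.posts.append(''.join(self.parts))
            self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_posts(html_text, target_class='d_post_content'):
    """Return the plain text of each matching post as a list of strings."""
    parser = PostExtractor(target_class)
    parser.feed(html_text)
    parser.close()
    return parser.posts


if __name__ == '__main__':
    # Invented sample resembling two post containers and one unrelated div.
    sample = ('<div class="d_post_content">Chapter 1<br>It was late.</div>'
              '<div class="other">sidebar</div>'
              '<div class="d_post_content">Chapter 2<br />Morning came.</div>')
    for post in extract_posts(sample):
        print(repr(post))
```

Tracking the nesting depth means markup inside a post, such as a span or a link, stays part of that post's text, while anything outside the matching containers is ignored.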

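Brief notes:
1 My development environment is Windows 7 Ultimate 32-bit + Python 2.7. The code depends on Beautiful Soup; the version used is BeautifulSoup-3.2.0.
2 Parameters of the fetch_tieba function: url is the address of the target Tieba thread; localfile is the path of the local file to save to; ignoreFansReq skips filler posts such as replies squeezed in between chapters and requests for followers (it only checks the length of the text, which is very crude).
3 The code is time-sensitive: if the DOM of Baidu Tieba pages changes, the program may stop working.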
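
Notes 2 and 3 can be illustrated with a small Python 3 sketch. It is not the author's code: select_chapters, keep_post and min_chars are hypothetical names, and the sample posts are invented. One difference worth knowing: the original compares the length of a UTF-8 byte string under Python 2, where a Chinese character takes three bytes, whereas len() on a Python 3 str counts characters, so the same threshold of 100 is stricter here.

```python
def keep_post(text, min_chars=100):
    """Return True if the post is long enough to look like story text."""
    return len(text.strip()) >= min_chars


def select_chapters(posts, min_chars=100):
    """Keep long posts; fail loudly if the page yielded no posts at all.

    An empty input usually means the selector no longer matches the page,
    for example because the site's markup has changed.
    """
    if not posts:
        raise ValueError('no posts found; the page markup may have changed')
    return [post for post in posts if keep_post(post, min_chars)]


if __name__ == '__main__':
    # Invented sample: two short filler replies and one long post.
    posts = ['first!', 'follow me please', 'Once upon a time... ' * 10]
    chapters = select_chapters(posts)
    print(len(chapters))
```

Raising an error on an empty result makes a markup change visible immediately, instead of silently producing an empty output file.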
