Learning Web Scraping: Beautiful Soup

Textbook: Web Scraping with Python, 2nd Edition (Chinese edition: 《Python网络爬虫权威指南(第2版)》)

Goal: scrape the Bangumi (bgm.tv) ranking list

Known issue: the last 507 pages of the ranking are all blank, so no page-count detection is implemented.

Source code

import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re


def askurl(url):
    head = {  # Fake browser headers; this part is copied from https://blog.csdn.net/bookssea/article/details/107309591
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }

    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

    # print("URL request succeeded")
    return html


def get_page():
    # Build the URLs for the first 276 pages of the ranking
    urls = ['https://bgm.tv/anime/browser?sort=rank&page={}'.format(str(i)) for i in range(1, 277)]
    # Uncomment to check the generated URLs
    # print(urls)
    return urls


def clearblankline():
    global i
    with open('1.txt', 'r', encoding='utf-8') as fr, open('bgm排行榜.txt', 'w', encoding='utf-8') as fd:
        for text in fr.readlines():
            if text.split():
                fd.write(text)
                i += 1
                if i % 4 == 0:
                    fd.write('\n')
    print('Output written successfully....')


def show():
    global a
    print(a)
    a += 1


if __name__ == "__main__":
    a = 1  # page counter used by show()
    i = 0  # line counter used by clearblankline()
    for url in get_page():
        html = askurl(url)  # fetch the page source
        bs = BeautifulSoup(html, 'html.parser')

        for name in bs.find_all('li', class_=re.compile('(item )(odd|even)( clearit)')):
            # print(name.get_text())
            txtfile = open("1.txt", 'a', encoding='utf-8')
            txtfile.write(name.get_text())
            txtfile.close()
        show()
    print("Scraping complete")
    clearblankline()

Source code analysis

The askurl() function

Requests the page's HTML source. Since this is copied verbatim from a snippet found online, I won't analyze it in detail; it can be used as-is.
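For comparison, the same fetch can also be written with the third-party requests library, which handles headers, timeouts, and HTTP errors a bit more concisely. This is only a rough sketch assuming requests is installed; the function name askurl_requests is mine and is not part of the original project:

import requests

def askurl_requests(url):
    # Same fake browser User-Agent as in askurl()
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }
    try:
        # timeout avoids hanging forever on a dead connection
        response = requests.get(url, headers=head, timeout=10)
        response.raise_for_status()  # raise on HTTP 4xx/5xx instead of printing the code
        return response.text
    except requests.RequestException as e:
        print(e)
        return ""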

The get_page() function

Purpose: build the URLs for multiple ranking pages.
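Since the page range (1 to 276) is hard-coded, one way to address the known issue mentioned above would be to stop as soon as a page yields no ranking items. A minimal sketch that reuses askurl() and the same regex from the script; the helper name get_pages_until_empty is hypothetical, not from the original code:

def get_pages_until_empty(max_pages=600):
    # Walk the ranking page by page and stop at the first page
    # that yields no entries, instead of hard-coding 276 pages.
    pattern = re.compile('(item )(odd|even)( clearit)')
    for page in range(1, max_pages + 1):
        url = 'https://bgm.tv/anime/browser?sort=rank&page={}'.format(page)
        bs = BeautifulSoup(askurl(url), 'html.parser')
        items = bs.find_all('li', class_=pattern)
        if not items:  # a blank page means we are past the end of the ranking
            break
        yield url, items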

The clearblankline() function

Purpose: open the raw output file, remove blank lines, and write a new file formatted with a blank line after every 4 lines.

text.split() checks whether a line has any content and is truthy only for non-blank lines; fd.write() writes the original text, and a newline character "\n" is inserted after every 4 written lines.
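To make the every-4-lines logic concrete, here is the same idea applied to an in-memory list of lines. The sample lines are made up for illustration and are not real scraped output:

lines = ['Title A\n', '\n', 'rank 1\n', '9.0\n', '(1000 votes)\n',
         'Title B\n', '\n', 'rank 2\n', '8.9\n', '(900 votes)\n']

count = 0
out = []
for text in lines:
    if text.split():              # truthy only for non-blank lines
        out.append(text)
        count += 1
        if count % 4 == 0:        # extra '\n' after every 4 kept lines
            out.append('\n')

print(''.join(out), end='')
# Title A          <- 4 kept lines, then a blank separator line
# rank 1
# 9.0
# (1000 votes)
#
# Title B
# rank 2
# 8.9
# (900 votes)
#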

The show() function

Purpose: show how fast the crawl is progressing (prints an incrementing page counter).

The main block

bs.find_all('li', class_=re.compile('(item )(odd|even)( clearit)'))

A regular expression is used to match the class attribute when searching.
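On bgm.tv each ranked entry is rendered as an <li> whose class is either "item odd clearit" or "item even clearit", which is why the pattern allows both the odd and even variants. A self-contained illustration on a toy fragment; the HTML below is made up and much simpler than the real page:

from bs4 import BeautifulSoup
import re

html = ('<ul>'
        '<li class="item odd clearit">Entry 1</li>'
        '<li class="item even clearit">Entry 2</li>'
        '<li class="sidebar">not a ranking entry</li>'
        '</ul>')
bs = BeautifulSoup(html, 'html.parser')

# Same pattern as the script: accepts both the odd and even row variants
for li in bs.find_all('li', class_=re.compile('(item )(odd|even)( clearit)')):
    print(li.get_text())      # Entry 1, Entry 2

# A CSS selector expresses the same "has classes item and clearit" condition
for li in bs.select('li.item.clearit'):
    print(li.get_text())      # Entry 1, Entry 2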

Source code repository:

https://github.com/zero617/Crawler_Learning/tree/main/bgm