
Textbook used: 《Python网络爬虫权威指南(第2版)》 (Web Scraping with Python, 2nd Edition)

Goal: crawl the bangumi (bgm.tv) ranking list

Known issue: the last 507 pages of the ranking list are blank, so no page-count check was implemented; the crawler simply fetches the first 276 pages.

Source code

import urllib.request
import urllib.error
from bs4 import BeautifulSoup
import re


def askurl(url):
    head = {  # Browser headers to mimic a real browser; copied from https://blog.csdn.net/bookssea/article/details/107309591
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }

    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

    # print("URL request succeeded")
    return html


def get_page():
    # Build the URLs for pages 1-276 of the ranking
    urls = ['https://bgm.tv/anime/browser?sort=rank&page={}'.format(str(i)) for i in range(1, 277)]
    # Print to verify
    # print(urls)
    return urls


def clearblankline():
    global i
    with open('1.txt', 'r', encoding='utf-8') as fr, open('bgm排行榜.txt', 'w', encoding='utf-8') as fd:
        for text in fr.readlines():
            if text.split():          # non-blank line
                fd.write(text)
                i += 1
                if i % 4 == 0:        # insert a blank line after every 4 written lines
                    fd.write('\n')
    print('Output written....')


def show():
    global a
    print(a)
    a += 1


if __name__ == "__main__":
    a = 1
    i = 0
    for url in get_page():
        html = askurl(url)  # fetch the page source
        bs = BeautifulSoup(html, 'html.parser')

        for name in bs.find_all('li', class_=re.compile('(item )(odd|even)( clearit)')):
            # print(name.get_text())
            txtfile = open("1.txt", 'a', encoding='utf-8')
            txtfile.write(name.get_text())
            txtfile.close()
        show()
    print("Crawl finished")
    clearblankline()

Source code analysis

The askurl() function

Requests the page's HTML source. Since it is copied verbatim from code found online, I won't analyze it in detail; it can be used as-is.
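
As a quick sanity check, here is a hypothetical test call (not part of the crawler itself), using the same first-page URL that get_page() generates:

# Hypothetical quick test: fetch the first ranking page and peek at the result.
html = askurl('https://bgm.tv/anime/browser?sort=rank&page=1')
print(len(html))   # 0 means the request failed, since askurl returns "" on error
print(html[:100])  # beginning of the page source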

The get_page() function

Purpose: generate the URLs of the pages to crawl, as verified in the interpreter session below.
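
Because range(1, 277) is half-open, the list comprehension produces exactly 276 URLs:

>>> urls = get_page()
>>> len(urls)
276
>>> urls[0]
'https://bgm.tv/anime/browser?sort=rank&page=1'
>>> urls[-1]
'https://bgm.tv/anime/browser?sort=rank&page=276'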

The clearblankline() function

Purpose: open the intermediate file, strip blank lines, and write a new file with a blank line inserted after every 4 lines of content.

text.split() checks whether the line has any content: a non-blank line yields a non-empty (truthy) list, while a blank line yields an empty (falsy) one. fd.write() writes the original text, and a newline "\n" is inserted after every 4 written lines.
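
A minimal illustration of that truthiness test (the line contents are made-up placeholders, not actual crawler output):

>>> '   \n'.split()                 # blank line -> empty list, falsy, skipped
[]
>>> 'example entry  8.8\n'.split()  # line with content -> non-empty list, truthy, kept
['example', 'entry', '8.8']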

The show() function

Purpose: print a running counter to show crawl progress.

Main program

bs.find_all('li', class_=re.compile('(item )(odd|even)( clearit)'))

A regular expression is used in the search to match the ranking-list items.
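
A small sketch of which class strings the pattern accepts, run with only the standard re module; the two accepted values simply reflect the odd/even alternation written into the pattern itself:

import re

pattern = re.compile('(item )(odd|even)( clearit)')

print(bool(pattern.match('item odd clearit')))   # True
print(bool(pattern.match('item even clearit')))  # True
print(bool(pattern.match('item clearit')))       # False - neither "odd" nor "even"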

Full source code:

https://github.com/zero617/Crawler_Learning/tree/main/bgm