BeautifulSoup (bs4) Explained in Detail

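BeautifulSoup is a Python library whose main purpose is extracting data from web pages. The official documentation describes it this way: BeautifulSoup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands the user the data they want to extract. Because it is simple, a complete program can be written with very little code.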

Installing bs4
Install it directly with the pip install command:

pip install beautifulsoup4
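The lxml parser
lxml is a high-performance Python library for processing XML and HTML documents. Compared with bs4 on its own, lxml offers more features (such as XPath) and higher performance, which is especially noticeable on large documents. lxml can be used together with bs4 as its parser, or on its own.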

Installing lxml
Install it with pip install as well:

pip install lxml
The sections below use it together with bs4.
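
Since lxml is a separate install, a script may want to fall back to Python's built-in parser when lxml is not available. The following is a minimal sketch of that idea; the helper name make_soup is hypothetical (it is not part of bs4), while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser from the standard library
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

The two parsers can build slightly different trees for malformed markup, so it is usually safer to pick one parser and use it consistently across a project.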

BeautifulSoup methods for navigating structured data
.title: get the title tag

from bs4 import BeautifulSoup

html_doc = """....
"""
# Create a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
#output-><title>The Dormouse's story</title>
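
The html_doc string above is elided. For readers who want something they can run, here is a sketch that assumes the document is the "three sisters" sample, which is consistent with the outputs shown in this article; the exact markup is an assumption. It uses Python's built-in html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Assumed sample document; the original html_doc is not shown in full
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title)         # the <title> tag itself
print(soup.title.string)  # the text inside it
print(soup.p)             # the first <p> tag in the document
```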

.name: get the name of the document or tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.name)
print(soup.name)
#output->title
#[document]

.string/.text: get the text content of a tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.string)
print(soup.title.text)
#output->The Dormouse's story
#The Dormouse's story
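
The two attributes agree for the title tag because it contains a single piece of text. They differ once a tag has more than one child, as the small sketch below tries to show; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the <p> has two children, a text node and a <b> tag
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)  # <b> has exactly one child, so .string returns its text
print(soup.p.string)  # <p> has several children, so .string is None
print(soup.p.text)    # .text joins all the text found beneath the tag
```

In practice this means .text (or get_text()) is the safer choice when the inner structure of a tag is not known in advance.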

.p: get the first p tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p)
#output-><p class="title"><b>The Dormouse's story</b></p>
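.find_all(name, attrs={}): get all matching tags. Parameters: name is the tag name, such as 'a' for a tags or 'p' for p tags; attrs={} is a dictionary of attribute filters, such as attrs={'class': 'story'}

# Create a BeautifulSoup object, using lxml as the parser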
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find_all('p'))
print(soup.find_all('p', attrs={'class': 'title'}))
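
To make the filtering concrete, here is a short self-contained sketch. The markup is a made-up, trimmed-down variant of the sample document, and it also shows the class_ keyword and the limit parameter, which find_all accepts in addition to attrs.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's full html_doc
html = """
<p class="title">Heading</p>
<p class="story">First <a class="sister" id="link1" href="https://example.com/elsie">Elsie</a></p>
<p class="story">Second <a class="sister" id="link2" href="https://example.com/lacie">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                              # every <p> tag
story_p = soup.find_all("p", attrs={"class": "story"})  # filter by attribute dict
sisters = soup.find_all("a", class_="sister")           # class_ avoids the reserved word
first_only = soup.find_all("p", limit=1)                # stop after the first match

print(len(all_p), len(story_p), [a["id"] for a in sisters], len(first_only))
```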

.find(name, attrs={}): get the first element that matches the conditions

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.find(id="link1"))
#output-><a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
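
One detail worth keeping in mind is that find returns None when nothing matches, so chaining an attribute access onto a failed lookup raises an error. A minimal sketch of a guarded lookup, using made-up one-line markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1" href="https://example.com/elsie">Elsie</a>', "html.parser"
)

link = soup.find("a", id="link1")  # first matching tag
missing = soup.find("table")       # no <table> here, so this is None

# Check for None before reading attributes from the result
href = link.get("href") if link is not None else None
print(href)
print(missing)
```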

.parent: get the parent tag

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title.parent)
#output-><head><title>The Dormouse's story</title></head>
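
Beyond the direct parent, the .parents attribute walks all the way up the tree. The sketch below uses a shortened, assumed version of the sample document to show the chain of ancestors for the title tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

title = soup.title
print(title.parent.name)  # the direct parent of <title>

# .parents yields every ancestor, ending at the BeautifulSoup object itself
ancestors = [tag.name for tag in title.parents]
print(ancestors)
```

The last entry is the BeautifulSoup object, whose name is the special value [document].
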
.p['class']: get the value of the class attribute

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.p["class"])
#output->['title']
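
The result is a list because class is a multi-valued attribute in HTML. The following sketch, with made-up markup, contrasts it with a single-valued attribute and shows .get as a way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two classes and an id
soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = soup.p

print(p["class"])     # multi-valued attribute, returned as a list
print(p["id"])        # single-valued attribute, returned as a string
print(p.get("href"))  # missing attribute: .get returns None

try:
    p["href"]         # square brackets raise KeyError when the attribute is missing
except KeyError:
    print("no href attribute")
```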

.get_text(): get all the text content in the document

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
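
The raw output keeps every line break and space from the source markup. get_text also accepts a separator and a strip flag, which can tidy this up; the sketch below uses made-up markup to show the effect.

```python
from bs4 import BeautifulSoup

# Illustrative markup with padding around the text
soup = BeautifulSoup("<div><p> One </p><p> Two </p></div>", "html.parser")

raw = soup.get_text()                 # whitespace preserved as written
clean = soup.get_text(" ", strip=True)  # strip each piece, join with one space

print(repr(raw))
print(repr(clean))
```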

Find the links of all a tags in the document

a_tags = soup.find_all('a')
for a_tag in a_tags:
    print(a_tag.get("href"))
#output->https://example.com/elsie
#https://example.com/lacie
#https://example.com/tillie

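Kinds of objects in BeautifulSoup
When you parse an HTML or XML document with BeautifulSoup, it converts the whole document into a tree structure in which every node (tag, text, comment) is represented as a Python object.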
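
As a rough illustration of those node types, the sketch below parses a made-up fragment and checks which class each node belongs to. The class names Tag, NavigableString, and Comment come from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative fragment: one tag holding a text node and a comment
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

tag = soup.p
text, comment = tag.contents  # the two children of <p>

print(isinstance(soup, BeautifulSoup))    # the whole document
print(isinstance(tag, Tag))               # an element such as <p>
print(isinstance(text, NavigableString))  # a piece of text
print(isinstance(comment, Comment))       # a comment, a special kind of string
```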

BeautifulSoup's tree structure
In an HTML document, the root node is usually the html tag, and the remaining tags and text content are nodes beneath it.

Consider the following HTML document:

<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <h1>The Dormouse's story</h1>
        <p>Once upon a time...</p>
    </body>
</html>

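After BeautifulSoup parses it, html is the root node, and head and body, which sit next to each other, are its child nodes. The same reasoning applies further down the tree.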
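
The parent and child relationships described above can be checked in code. This is a sketch under a few assumptions: child_tags is a hypothetical helper, and html.parser is used so that it runs without lxml. Whitespace between tags becomes text nodes, so the helper keeps only Tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <h1>The Dormouse's story</h1>
        <p>Once upon a time...</p>
    </body>
</html>"""
soup = BeautifulSoup(html, "html.parser")


def child_tags(node):
    """Hypothetical helper: names of the direct children that are tags."""
    return [child.name for child in node.children if isinstance(child, Tag)]


print(child_tags(soup))       # children of the document object
print(child_tags(soup.html))  # children of <html>
print(child_tags(soup.body))  # children of <body>
```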
