Introduction

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

I’ll be using this HTML document as an example throughout this article:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

BeautifulSoup Object

Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to an XML or HTML tag in the original document:

Accessing .p returns the first <p> tag in the document:

tag = soup.p

Name

Every tag has a name, accessible as .name:

print(tag.name)
# p

Attributes

A tag can have any number of attributes. Beautiful Soup stores them in a dictionary, which you can access directly as .attrs:

print(tag.attrs)
# {'class': ['title']}

You can also access an individual value by indexing the dictionary with its key:

print(tag.attrs['class'])
# ['title']

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string
# 'Extremely bold'

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
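Since the BeautifulSoup object has no corresponding tag in the document, it is given the special name "[document]". A minimal sketch, assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

# The BeautifulSoup object behaves like a Tag, but since it doesn't
# correspond to any real tag it gets the special name "[document]".
doc = BeautifulSoup("<html><head></head><body></body></html>", 'html.parser')
print(doc.name)
# [document]
```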

Comments

There are a few leftover bits. The main one you’ll probably encounter is the comment. The Comment object is just a special type of NavigableString.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
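Because Comment is a NavigableString, its value is the comment's text with the <!-- --> delimiters stripped. A small sketch reusing the same markup:

```python
from bs4 import BeautifulSoup

# Accessing the comment as a string yields the text between the
# <!-- and --> markers, with the markers themselves removed.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment)
# Hey, buddy. Want to buy a used parser?
```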

Navigating the tree

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

soup.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

Going down

.contents and .children

A tag’s children are available as a list called .contents:

print(tag)
print(tag.contents)
print(tag.contents[0])
# <p class="title"><b>The Dormouse's story</b></p>
# [<b>The Dormouse's story</b>]
# <b>The Dormouse's story</b>

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

for child in tag.children:
    print(child)

.descendants

The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

p_tag = soup.p
print(p_tag)
for child in p_tag.descendants:
    print(child)

# <p class="title"><b>The Dormouse's story</b></p>
# <b>The Dormouse's story</b>
# The Dormouse's story

.string

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag = soup.title
print(title_tag.contents[0])
print(title_tag.string)
# The Dormouse's story
# The Dormouse's story
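If a tag contains more than one thing, it’s ambiguous what .string should refer to, so it is defined to be None. A minimal sketch:

```python
from bs4 import BeautifulSoup

# The <p> tag here has two children, so .string can't pick one
# unambiguously and returns None instead.
soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", 'html.parser')
print(soup.p.string)
# None
```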

Going up

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is defined as None:

print(soup.parent)
# None

.parents

You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document to the very top of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]

Going sideways

The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag (the <a> tag). We call them siblings.

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
# <b>
# text1
# </b>
# <c>
# text2
# </c>
# </a>

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>
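In real documents, a tag’s siblings are often strings of whitespace or punctuation rather than other tags. A small sketch using markup like the "three sisters" document above:

```python
from bs4 import BeautifulSoup

# The first <a> tag's next sibling is not the second <a> tag, but the
# comma-and-newline string sitting between them in the markup.
html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
print(repr(soup.a.next_sibling))
# ',\n'
```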

Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: find() and find_all().

I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

find_all()

A string

soup.find_all('b')
# [<b>The Dormouse's story</b>]

A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter “b”; in this case, the <body> tag and the <b> tag:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

The name argument

Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

The keyword arguments

If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

This code finds all tags whose id attribute has a value, regardless of what the value is:

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The string argument

With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

find()

The find_all() method scans the entire document looking for results, but sometimes you only want to find one result.

These two lines of code are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, and find() just returns the result.

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:

print(soup.find("nosuchtag"))
# None
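Note that attribute access like soup.title is itself a shortcut for calling find() with that tag name, so the two spellings return the same tag. A minimal sketch:

```python
from bs4 import BeautifulSoup

# soup.title is shorthand for soup.find("title"); both resolve to
# the same Tag object in the parse tree.
soup = BeautifulSoup("<html><head><title>Hi</title></head></html>",
                     'html.parser')
print(soup.title == soup.find("title"))
# True
```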