BeautifulSoup
Introduction
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
I’ll be using the following HTML document as an example throughout this document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
BeautifulSoup Object
Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
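As a minimal sketch of the parsing step, the snippet below builds a BeautifulSoup object from a much shorter, made-up document. The names sample_html and sample_soup are assumptions for illustration, and "html.parser" is Python's built-in parser, chosen here only because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, much shorter than the document used in the text.
sample_html = (
    "<html><head><title>Sample page</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the markup into a nested tree of Python objects.
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_text = sample_soup.title.string       # text inside <title>
first_paragraph = sample_soup.p.get_text()  # text inside the first <p>
print(title_text, "|", first_paragraph)
```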
Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
soup.p returns the first <p> tag in the document:
tag = soup.p
Name
Every tag has a name, accessible as .name:
print(tag.name)
Attributes
A tag may have any number of attributes. They are stored in a dictionary, which you can access directly as .attrs:
print(tag.attrs)
You can also look up a single attribute’s value by its key:
print(tag.attrs['class'])
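To make the name and attribute lookups concrete, here is a small sketch on a made-up <p> tag (the markup and variable names are assumptions, not part of the example document). It also shows that class is treated as a multi-valued attribute, so its value comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this illustration.
attr_soup = BeautifulSoup('<p class="title intro" id="first">Hi</p>', "html.parser")
attr_tag = attr_soup.p

tag_name = attr_tag.name           # the tag's name
tag_id = attr_tag["id"]            # shorthand for attr_tag.attrs["id"]
tag_classes = attr_tag["class"]    # multi-valued attribute: a list of class names
missing_href = attr_tag.get("href")  # .get() returns None for a missing attribute

print(tag_name, tag_id, tag_classes, missing_href)
```

Using .get() rather than square brackets avoids a KeyError when an attribute may be absent.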
NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
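A short sketch of what you get back from .string, reusing the same <b> markup under different variable names (bold_soup and bold_text are my own names). It checks the type and converts the result to a plain Python string.

```python
from bs4 import BeautifulSoup, NavigableString

bold_soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# The text inside the tag is wrapped in a NavigableString.
bold_text = bold_soup.b.string
is_navigable = isinstance(bold_text, NavigableString)

# str() gives an ordinary string that no longer references the parse tree.
plain_text = str(bold_text)
print(is_navigable, plain_text)
```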
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Comments
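Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the comment. The Comment object is just a special type of NavigableString.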
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
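Continuing from that markup, a minimal sketch of how the comment shows up in the tree. The names comment_soup and comment are assumptions of mine.

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
comment_soup = BeautifulSoup(markup, "html.parser")

# The comment is the only child of <b>, so it is available as .string.
comment = comment_soup.b.string
is_comment = isinstance(comment, Comment)

print(is_comment, comment)
```

Checking isinstance(..., Comment) is one way to skip comments when you only want visible text.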
Navigating the tree
soup.title
Going down
.contents and .children
A tag’s direct children are available in a list called .contents:
print(tag)
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in tag.children:
    print(child)
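Here is a small sketch contrasting the two on a made-up list (the <ul> markup and variable names are assumptions). .contents gives a list you can index, while .children is consumed by iterating.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <ul> with two direct children.
list_soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul_tag = list_soup.ul

child_list = ul_tag.contents                             # a real list
child_names = [child.name for child in ul_tag.children]  # iterate lazily
item_texts = [child.string for child in ul_tag.children]

print(len(child_list), child_names, item_texts)
```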
.descendants
The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:
p_tag = soup.p
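To see the difference from .contents, this sketch counts both on a made-up nested paragraph (markup and names are assumptions). The <p> tag has one direct child but four descendants, because strings count as nodes too.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: <p> contains <b>, which contains text and an <i> tag.
nested_soup = BeautifulSoup("<p><b>bold <i>italic</i></b></p>", "html.parser")
nested_p = nested_soup.p

direct_count = len(nested_p.contents)         # only <b>
all_descendants = list(nested_p.descendants)  # <b>, "bold ", <i>, "italic"

# Keep only the Tag objects among the descendants.
descendant_tags = [node.name for node in all_descendants if isinstance(node, Tag)]
print(direct_count, len(all_descendants), descendant_tags)
```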
.string
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
title_tag = soup.title
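A quick sketch of both cases, using two made-up snippets (the names are assumptions). When a tag has more than one child, .string is None, and get_text() is one way to collect the text anyway.

```python
from bs4 import BeautifulSoup

# One child that is a string: .string returns it.
single_soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")
single_string = single_soup.title.string

# Several children: .string is ambiguous, so it is None.
mixed_soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")
mixed_string = mixed_soup.p.string
mixed_text = mixed_soup.p.get_text()

print(single_string, mixed_string, mixed_text)
```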
Going up
You can access an element’s parent with the .parent attribute. The parent of a top-level tag like <html> is the BeautifulSoup object itself:
html_tag = soup.html
And the .parent of a BeautifulSoup object is defined as None:
print(soup.parent)
.parents
You can iterate over all of an element’s parents with .parents. This example uses .parents to travel from an <a> tag buried deep within the document, to the very top of the document:
link = soup.a
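The walk upward can be sketched like this on a small made-up document (deep_soup and deep_link are my own names). The last parent is the BeautifulSoup object, whose name is "[document]" and whose own .parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an <a> tag nested three levels deep.
deep_soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)
deep_link = deep_soup.a

# Collect the name of every ancestor, nearest first.
parent_names = [parent.name for parent in deep_link.parents]
top_parent = deep_soup.parent

print(parent_names, top_parent)
```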
Going sideways
In the document below, the <b> tag and the <c> tag are at the same level: they’re both direct children of the same <a> tag. We call them siblings.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
.next_sibling and .previous_sibling
You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:
sibling_soup.b.next_sibling
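A minimal sketch of moving sideways in both directions, using a cleaned-up version of the sibling markup (the variable names are assumptions). At either end of the row of siblings, the attribute is None.

```python
from bs4 import BeautifulSoup

sib_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

next_of_b = sib_soup.b.next_sibling      # the <c> tag
prev_of_c = sib_soup.c.previous_sibling  # the <b> tag

# <b> has nothing before it and <c> has nothing after it.
before_b = sib_soup.b.previous_sibling
after_c = sib_soup.c.next_sibling

print(next_of_b, prev_of_c, before_b, after_c)
```

In real documents the neighbor is often a whitespace string rather than a tag, so it is worth checking what you actually got back.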
Searching the tree
Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: find() and find_all().
I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
find_all()
A string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will match against that exact tag name. This code finds all the <b> tags in the document:
soup.find_all('b')
A regular expression
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method. This code finds all the tags whose names start with the letter “b”; in this case, the <body> tag and the <b> tag:
import re
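As a sketch of the regular expression filter, the snippet below runs two patterns against a small made-up document (regex_soup and the markup are assumptions). Because search() is used, an unanchored pattern matches anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup containing <html>, <body>, <b>, and <a> tags.
regex_soup = BeautifulSoup(
    "<html><body><b>bold</b><a href='#'>link</a></body></html>", "html.parser"
)

# Anchored pattern: tag names that start with "b".
starts_with_b = [tag.name for tag in regex_soup.find_all(re.compile("^b"))]

# Unanchored pattern: tag names that contain "t" anywhere.
contains_t = [tag.name for tag in regex_soup.find_all(re.compile("t"))]

print(starts_with_b, contains_t)
```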
A list
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the <a> tags and all the <b> tags:
soup.find_all(["a", "b"])
True
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:
for tag in soup.find_all(True):
    print(tag.name)
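The list and True filters described above can be compared side by side on a small made-up document (the markup and names are assumptions). Results come back in document order, not in the order of the list you pass in.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one <a> and one <b> inside a paragraph.
filter_soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a> and <b>bold</b></p></body></html>",
    "html.parser",
)

# A list matches any tag whose name equals one of the items.
a_and_b = [tag.name for tag in filter_soup.find_all(["b", "a"])]

# True matches every tag, but none of the text strings.
every_tag = [tag.name for tag in filter_soup.find_all(True)]

print(a_and_b, every_tag)
```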
The name argument
Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names.
soup.find_all("title")
The keyword arguments
If you pass in a value for an argument called id, Beautiful Soup will filter against each tag’s ‘id’ attribute:
soup.find_all(id='link2')
If you pass in a value for href, Beautiful Soup will filter against each tag’s ‘href’ attribute:
soup.find_all(href=re.compile("elsie"))
This code finds all tags whose id attribute has a value, regardless of what the value is:
soup.find_all(id=True)
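Here is a self-contained sketch of the three keyword filters above, run on three made-up links (the URLs, ids, and variable names are assumptions). The third link deliberately has no id attribute.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; only the first two carry an id.
links_html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie">Tillie</a>'
)
links_soup = BeautifulSoup(links_html, "html.parser")

by_id = [a.string for a in links_soup.find_all(id="link2")]
by_href = [a["id"] for a in links_soup.find_all(href=re.compile("elsie"))]
with_any_id = [a["id"] for a in links_soup.find_all(id=True)]

print(by_id, by_href, with_any_id)
```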
The string argument
With string you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:
soup.find_all(string="Elsie")
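A few string searches, sketched on made-up links (the markup and names are assumptions). With string alone you get the matching text nodes back; combined with a tag name, you get the tags whose .string matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with three short link texts.
names_soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>',
    "html.parser",
)

exact = names_soup.find_all(string="Elsie")
from_list = names_soup.find_all(string=["Tillie", "Elsie"])
by_pattern = names_soup.find_all(string=re.compile("^L"))

# Adding a tag name returns Tag objects instead of strings.
tags_with_text = names_soup.find_all("a", string="Elsie")
matched_ids = [tag["id"] for tag in tags_with_text]

print(exact, from_list, by_pattern, matched_ids)
```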
find()
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result.
These two lines of code are nearly equivalent:
soup.find_all('title', limit=1)
soup.find('title')
The only difference is that find_all() returns a list containing the single result, and find() just returns the result.
If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None:
print(soup.find("nosuchtag"))
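The difference in return values matters most when nothing matches. This sketch, on a made-up document with my own variable names, shows both methods and one way to guard against None before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single <title> tag.
find_soup = BeautifulSoup(
    "<html><head><title>Sample</title></head><body></body></html>", "html.parser"
)

as_list = find_soup.find_all("title", limit=1)  # list with one Tag
as_tag = find_soup.find("title")                # the Tag itself

no_list = find_soup.find_all("nosuchtag")       # empty list
no_tag = find_soup.find("nosuchtag")            # None

# Guard against None before calling methods on the result.
safe_text = no_tag.get_text() if no_tag is not None else ""
print(as_list, as_tag, no_list, no_tag, repr(safe_text))
```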
