What we’re going to be doing

I’m going to walk you through, step by step, how to assemble a full list of all pages in a website’s sitemap using Python with BeautifulSoup4. I will assume some basic knowledge of Python on the reader’s part, so I won’t go into great detail on the basics.

This will only work on sites with XML sitemaps; if a site doesn’t have one, this approach can’t work. If you need to get all (or as many as possible) of the links from a site that doesn’t have a sitemap, then stay tuned, as I’ll have a post on that shortly.

Setting things up

First things first, load up a Python file and make sure you install the following packages via pip:
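
pip install beautifulsoup4 requests lxml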

Once this is done, we’ll import the modules we need at the head of our file. (Note that lxml only needs to be installed, not imported; BeautifulSoup picks it up behind the scenes when we ask for the ‘xml’ parser.)

Now that we have our imports set up, we’re going to create our function for getting sitemaps. This function will accept a URL as a parameter and return a list of URLs that belong to that sitemap. I will then call the function from an if statement that ensures the code only runs when the file is executed directly.

from bs4 import BeautifulSoup
import requests

def get_urls_from_sitemap(url) -> list:
    return []

if __name__ == '__main__':
    url = 'http://jackwhitworth.com/sitemap.xml'
    links = get_urls_from_sitemap(url)
    print(links)

Get the XML sitemap data

Now that we have everything set up nicely, we’re going to use the requests module to send a GET request to the URL provided. Once that’s done, we’ll use BeautifulSoup4 to parse the text we receive from the request and print the links to the console.

from bs4 import BeautifulSoup
import requests

def get_urls_from_sitemap(url) -> list:
    # Send our GET request and parse the response with BS4
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'xml')

    # Find all <loc> tags and print their contents if they contain .xml
    for item in soup.find_all('loc'):
        if '.xml' in item.text:
            print(item.text)

    return []

if __name__ == '__main__':
    url = 'http://jackwhitworth.com/sitemap.xml'
    links = get_urls_from_sitemap(url)
    print(links)

Now, if you run this, you’ll see a short list of what look like three extra sitemaps. This is because most sitemaps like this are actually sitemap indexes, split by content type. So, if you look at my website’s sitemap, you’ll see that it links out to nested sub-sitemaps, and those are what contain the data we want.
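
For reference, a sitemap index looks something like the snippet below. This is a trimmed, illustrative example (the sub-sitemap names vary from site to site), not my site’s exact output:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://jackwhitworth.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://jackwhitworth.com/page-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://jackwhitworth.com/category-sitemap.xml</loc>
  </sitemap>
</sitemapindex>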

Moving forward, once we have these three links, we’ll run a similar piece of code again to load each new sitemap with a GET request, parse it, and assemble a single list of links from all of the sub-sitemaps.

from bs4 import BeautifulSoup
import requests

def get_urls_from_sitemap(url) -> list:
    # Send our GET request and parse the response with BS4
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'xml')

    # Set up a list for all of the links
    website_links = []

    # Find all <loc> tags that point to a .xml sub-sitemap
    for item in soup.find_all('loc'):
        if '.xml' in item.text:

            # Send another GET request to the sub-sitemap and parse it too
            r = requests.get(item.text)
            new_soup = BeautifulSoup(r.text, 'xml')

            for new_item in new_soup.find_all('loc'):
                website_links.append(new_item.text)

    return website_links

if __name__ == '__main__':
    url = 'http://jackwhitworth.com/sitemap.xml'
    links = get_urls_from_sitemap(url)
    for link in links:
        print(link)

Exception handling and final touches

I’ve found, when running this code across hundreds of websites, that there are a few exceptions you can encounter. The first is that if the page 404s (or the request fails entirely), there is no sitemap to parse. The second is that you can hit a TypeError when a <loc> tag’s contents aren’t the string you expect. To handle these, we can put some simple try/except blocks in place; I’ve also added a raise_for_status() call so that an error response like a 404 is treated the same as a failed request. Another thing to be prepared for is that, if a site doesn’t have sub-sitemaps like mine does, you’ll want to add its links to the list straight away. Finally, for good measure, I like to add print statements to provide feedback so the user is aware of what the code is doing while it runs.

Final Code

from bs4 import BeautifulSoup
import requests

def get_urls_from_sitemap(url) -> list:
    print(f"Getting sitemap for {url}")
    # Send our GET request, treating error responses (like a 404) as failures,
    # and return an empty list if the request fails
    try:
        r = requests.get(url)
        r.raise_for_status()
    except requests.exceptions.RequestException:
        return []
    soup = BeautifulSoup(r.text, 'xml')
    # Set up a list for all of the links
    website_links = []
    for item in soup.find_all('loc'):
        try:
            # <loc> tags pointing at .xml files are sub-sitemaps,
            # so fetch and parse each of those too
            if '.xml' in item.text:
                r = requests.get(item.text)
                new_soup = BeautifulSoup(r.text, 'xml')
                for new_item in new_soup.find_all('loc'):
                    website_links.append(new_item.text)
            # Otherwise the <loc> tag is a page URL, so add it straight away
            else:
                website_links.append(item.text)
        except TypeError:
            pass
    print(f"Found {len(website_links)} links")
    return website_links

if __name__ == '__main__':
    url = 'http://jackwhitworth.com/sitemap.xml'
    links = get_urls_from_sitemap(url)
    for link in links:
        print(link)
