Overcoming pandas.read_html Limitations Using Python Web Scraping
The pandas read_html function is an extremely useful tool for quickly extracting HTML tables from web pages. It allows you to pull tabular data from HTML content in just one line of code. However, read_html has some limitations. This tutorial will guide you through some of these challenges and provide solutions to overcome them.
For the purposes of this tutorial, we’ll extract data from a sample HTML file served locally at http://localhost/test.html.
Handling Dynamic Content
Web pages often contain dynamic content, where parts of the page are loaded or modified by JavaScript after the initial request, so the table you want may not exist in the raw HTML at all. In this case, we can use a combination of BeautifulSoup and Selenium to interact with the dynamic content.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'

# Launch Firefox and open the page
driver = webdriver.Firefox()
driver.get(url)

# Click the button that loads extra rows into the table
button = driver.find_element(By.ID, 'loadMoreButton')
button.click()

# Hand the fully rendered HTML to BeautifulSoup
html_content = driver.page_source
soup = BeautifulSoup(html_content, "lxml")

# Extract the first table and parse it with pandas
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

driver.quit()
Output
  Column1 Column2 Column3
0      a1      b1      c1
1      a2      b2      c2
2      a3      b3      c3
3      a4      b4      c4
4      a5      b5      c5
In this script, Selenium is not just used to open the webpage, but also to interact with it. The loadMoreButton element is clicked to load additional data into the table, which is then extracted using BeautifulSoup and pandas.read_html.
Form Submission and Authentication
Another limitation of pandas.read_html is that it does not support form submissions or authentication. Both of these tasks can be performed using Selenium and BeautifulSoup. Here is an example of form submission and authentication:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'

driver = webdriver.Firefox()
driver.get(url)

# Fill in the login form
username = driver.find_element(By.ID, 'username')
password = driver.find_element(By.ID, 'password')
username.send_keys('test_user')
password.send_keys('test_password')

# Submit the form
login_button = driver.find_element(By.ID, 'login')
login_button.click()

# Parse the page we land on after logging in
html_content = driver.page_source
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])

driver.quit()
Output
  Column1 Column2 Column3
0      a1      b1      c1
1      a2      b2      c2
2      a3      b3      c3
In the above script, Selenium is used to automate the process of logging into the website. The script finds the elements for the username and password fields, inputs the login credentials, then clicks the login button.
Once the script is authenticated, it fetches the page’s HTML content and extracts the HTML table data.
Complex CSS Selectors
While pandas.read_html works great with well-structured tables, it falls short when the table you need can only be located through complex CSS selectors. BeautifulSoup comes in handy for this, allowing us to use CSS selectors to navigate and search the parse tree. Here’s an example of using CSS selectors with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

# Select the table with class "table-class" that is a direct
# child of a div with class "content", then parse it with pandas
table = soup.select('div.content > table.table-class')[0]
dfs = pd.read_html(str(table))
print(dfs[0])
Output
  Column1 Column2 Column3
0      a1      b1      c1
1      a2      b2      c2
2      a3      b3      c3
In this script, BeautifulSoup’s select function is used with a CSS selector that targets the first table with the class table-class inside a div with the class content. This level of precision in targeting specific parts of the HTML content is not achievable with pandas.read_html alone: its match and attrs arguments can filter tables by their text or attributes, but not by where they sit in the document tree.
Multi-page Scraping
A common issue with web scraping is dealing with paginated content. pandas.read_html does not have built-in support for automatically navigating through multiple pages. We can use Scrapy, a powerful Python scraping framework, to handle this:
import scrapy
import pandas as pd

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://localhost/test.html']

    def parse(self, response):
        # Extract every table on the current page
        dfs = pd.read_html(response.text)
        print(dfs[0])

        # Follow the "next page" link, if there is one
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Output
Page 1
  Column1 Column2 Column3
0      a1      b1      c1
1      a2      b2      c2
2      a3      b3      c3
...
Page 2
  Column1 Column2 Column3
0      a4      b4      c4
1      a5      b5      c5
2      a6      b6      c6
...
...
In the above script, a Scrapy spider is created to navigate through multiple pages. The parse method is called for the initial URL and for each subsequent page. It extracts the HTML table from the current page using pandas.read_html and then finds the link to the next page using a CSS selector. If a next page is found, the spider follows the link and the process repeats.
Handling Non-tabular Data
Another limitation of pandas.read_html is that it only extracts tabular data. If the data you are interested in is stored in another HTML structure, like a list, you will need another tool. Here’s an example of using BeautifulSoup to extract a list of items:
from bs4 import BeautifulSoup
import requests

url = 'http://localhost/test.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

# Grab every li inside the ul with class "item-list"
items = soup.select('ul.item-list > li')
items_text = [item.get_text() for item in items]
print(items_text)
Output
['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
In this script, BeautifulSoup’s select function is used with a CSS selector that targets all li elements within a ul with the class item-list. The text of each item is extracted into a list.
HTTP Headers and Cookies
One more limitation of pandas.read_html is that it doesn’t allow controlling HTTP headers or cookies, which are often necessary for accessing certain web pages. We can use the requests library to handle this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://localhost/test.html'
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'session_id': '1234567890'}

# Send custom headers and cookies along with the request
response = requests.get(url, headers=headers, cookies=cookies)

soup = BeautifulSoup(response.text, "lxml")
table = soup.find_all('table')[0]
dfs = pd.read_html(str(table))
print(dfs[0])
Output
  Column1 Column2 Column3
0      a1      b1      c1
1      a2      b2      c2
2      a3      b3      c3
In this script, requests.get is used to fetch the HTML content, but this time we pass a dictionary of HTTP headers and a dictionary of cookies. This allows us to, for example, pretend to be a certain type of browser or maintain a session across multiple requests. The HTML content is then parsed with BeautifulSoup and the table is extracted with pandas.read_html.
Respecting robots.txt
Web scraping should always be performed in a respectful and ethical manner. Most websites include a robots.txt file that states which parts of the website should not be crawled or scraped. To respect these rules, you can manually check the robots.txt file (typically found at the website’s root, like http://localhost/robots.txt), or use a library like robotexclusionrulesparser to automatically respect the rules:
from robotexclusionrulesparser import RobotExclusionRulesParser

url = 'http://localhost'

# Download and parse the site's robots.txt
rp = RobotExclusionRulesParser()
rp.fetch(url + '/robots.txt')

# Check whether crawlers are allowed to fetch the test page
can_fetch_page = rp.can_fetch('*', url + '/test.html')
print(can_fetch_page)
Output
True
In this script, we use the RobotExclusionRulesParser to fetch and parse the robots.txt file from the target website. The can_fetch method is used to check if a specific page can be scraped according to the robots.txt rules. The '*' argument means that we’re checking the rules that apply to all web crawlers. If the output is True, it means that we’re allowed to scrape the page.
Remember, it’s crucial to respect these rules not only out of respect for the website’s operators, but also to avoid potential legal issues.