📋 Prerequisites
- Basic Python knowledge
- Familiarity with HTML structure
🎯 What You'll Learn
- Understand the basics of web scraping
- Use requests and BeautifulSoup to collect data
- Parse HTML content and extract relevant information
- Store collected data for analysis
Introduction
Data is essential for machine learning projects, and often, the required data isn’t readily available in clean datasets. Web scraping allows you to collect data from websites for your analysis and projects.
This tutorial will teach you the basics of web scraping using Python’s requests and BeautifulSoup libraries.
What is Web Scraping?
Web scraping is the process of programmatically extracting data from websites by sending HTTP requests, parsing HTML, and retrieving the required data points.
Libraries We Will Use
- requests: To send HTTP requests and fetch webpage content.
- BeautifulSoup: To parse HTML and extract data.
Example: Scraping Quotes from a Website
1️⃣ Install Required Libraries
pip install requests beautifulsoup4
2️⃣ Import Libraries
import requests
from bs4 import BeautifulSoup
3️⃣ Fetch and Parse Webpage
We will scrape quotes from http://quotes.toscrape.com:
url = "http://quotes.toscrape.com"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.prettify()[:500])  # Preview the first 500 characters of HTML
else:
    print("Failed to retrieve the page")
4️⃣ Extract Quotes
quotes = soup.find_all('span', class_='text')
for quote in quotes:
    print(quote.text)
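The same pattern extends to other fields on the page. Here is a small sketch, reusing the soup object from step 3, that pairs each quote with its author; it assumes every quote sits inside a div with class "quote" containing a small tag with class "author", which matches the markup of quotes.toscrape.com at the time of writing:

# Walk each enclosing quote block so text and author stay paired
for block in soup.find_all('div', class_='quote'):
    text = block.find('span', class_='text').text
    author = block.find('small', class_='author').text
    print(f"{text} - {author}")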
5️⃣ Save Data for Analysis
import pandas as pd

quote_list = [quote.text for quote in quotes]
df = pd.DataFrame({'Quotes': quote_list})
df.to_csv('quotes.csv', index=False)
print("Quotes saved to quotes.csv")
Ethical Considerations
✅ Always check the website’s robots.txt before scraping.
✅ Throttle your requests so you don’t overload the server.
✅ Scrape responsibly and follow each website’s terms of use.
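As a practical sketch of the first two points, Python's standard-library robotparser can check robots.txt before each fetch, and a short pause keeps the crawl polite (the one-second delay is an arbitrary example value):

import time
import requests
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://quotes.toscrape.com/robots.txt")
robots.read()

url = "http://quotes.toscrape.com/page/1/"
if robots.can_fetch("*", url):  # "*" means "any user agent"
    response = requests.get(url)
    time.sleep(1)  # Pause between requests so we don't overload the server
else:
    print("robots.txt disallows fetching this URL")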
Conclusion
🎉 You have learned:
✅ What web scraping is and when to use it.
✅ How to fetch and parse HTML content using requests and BeautifulSoup.
✅ How to extract and save scraped data for further analysis.
What’s Next?
- Explore scraping more complex data structures and paginated data (a short pagination sketch follows this list).
- Learn to handle dynamic content using Selenium or Playwright for JavaScript-heavy websites.
- Use scraped data to power your data analysis and machine learning projects.
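For the pagination point above, here is a hedged sketch that follows the site's "Next" link until it disappears; it assumes quotes.toscrape.com keeps its current li class="next" markup for that link:

import requests
from bs4 import BeautifulSoup

base_url = "http://quotes.toscrape.com"
page = "/page/1/"
all_quotes = []

while page:
    soup = BeautifulSoup(requests.get(base_url + page).text, 'html.parser')
    all_quotes.extend(q.text for q in soup.find_all('span', class_='text'))
    # Stop when the "Next" button is no longer present on the page
    next_link = soup.find('li', class_='next')
    page = next_link.find('a')['href'] if next_link else None

print(f"Collected {len(all_quotes)} quotes")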
Join our SuperML Community to share your scraping projects, get feedback, and learn collaboratively!
Happy Scraping! 🎉