apify-sdk-python/docs/03_guides/01_beautifulsoup_httpx.mdx at c8ddfc953b690befd456519b71b5eedb2171b413 · apify/apify-sdk-python

id	beautifulsoup-httpx
title	Use BeautifulSoup with HTTPX
description	Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX.

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import BeautifulSoupHttpxExample from '!!raw-loader!roa-loader!./code/01_beautifulsoup_httpx.py';

In this guide, you'll learn how to use the BeautifulSoup library with the HTTPX library in your Apify Actors.

Introduction

BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
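To make the navigating and searching idioms concrete, here is a minimal sketch that parses a small HTML string. The HTML snippet is a made-up illustration, and the built-in `html.parser` backend is chosen only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/docs">Docs</a>
    <a href="https://example.com/blog">Blog</a>
    <a>No href here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Navigate the tree: the first <title> tag is reachable as an attribute.
page_title = soup.title.string if soup.title else None

# Search the tree: collect the href of every <a> tag that has one.
links = [anchor['href'] for anchor in soup.find_all('a', href=True)]

print(page_title)
print(links)
```

Passing `href=True` to `find_all` skips anchors without an `href` attribute, so the list comprehension never raises a `KeyError`.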

HTTPX is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.

To create an Actor that uses these libraries, start from the BeautifulSoup & Python Actor template. The template comes with BeautifulSoup and HTTPX preinstalled, so you can begin development immediately.

Example Actor

Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses HTTPX for fetching pages and BeautifulSoup for parsing their content to extract titles and links to other pages.

{BeautifulSoupHttpxExample}

Conclusion

In this guide, you learned how to use BeautifulSoup with HTTPX in your Apify Actors. By combining these libraries, you can fetch pages and extract data from their HTML or XML content, which makes it easy to build web scraping tasks in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
