The secret to efficiently scraping React apps without a headless browser
If you’ve ever tried to build a web-scraping project, you’ve probably run into issues with dynamically rendered content, which is common in single-page applications (SPAs). Powered by technologies like Next.js and React, these SPAs offer seamless user experiences but pose unique challenges for web scrapers.
I’ll walk you through how I solved this challenge without using a headless browser, allowing me to keep resource costs low and continue web-scraping at scale. I use dead-simple tooling - Python, the requests library, and some string manipulation - to make the magic happen.
Why are SPAs harder to scrape?
Traditional websites load all of their content synchronously and reload the entire page for each user navigation. Unlike these static sites, SPAs are a breed of modern web applications that load a single HTML page and dynamically update content with JavaScript as users interact with them. Even modern non-SPA websites now render more and more of their content through AJAX requests to APIs and content management systems like Contentful.
The challenge with scraping SPAs stems from the fact that traditional web-scraping tools, such as BeautifulSoup, often struggle to capture data from dynamically rendered content, because they do not execute the JavaScript embedded in <script> tags.
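To see the problem concretely, here's a minimal sketch of what a parser is left with when pointed at a typical SPA shell (the HTML snippet is a made-up example):

from bs4 import BeautifulSoup

# A typical SPA response: an empty mount point plus a JS bundle reference
html = "<div id='root'></div><script src='/bundle.js'></script>"
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("#root").get_text())  # empty string: the content only exists after the JS runs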
The usual approach is to use a headless browser like Selenium, Puppeteer, or Playwright and scrape after the page has rendered completely. These tools load the entire web page, including executing JavaScript, rendering styles, and making AJAX requests. This approach can be resource-intensive and sluggish, particularly when dealing with unoptimized React code.
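For reference, the headless route looks something like this (a minimal sketch using Playwright, one of the tools mentioned above; it assumes the browser binaries are installed):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa")
    page.wait_for_load_state("networkidle")  # wait for AJAX-driven rendering to settle
    html_content = page.content()  # fully rendered HTML, at the cost of running a whole browser
    browser.close()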
The Solution
Instead of rendering the page, we can send a simple GET request to its URL. The HTML response often already contains the data we’re looking for, serialized inside the bundled JavaScript.
import requests

url = "https://example.com/spa"
response = requests.get(url)  # plain GET: no browser, no JS execution
html_content = response.text  # raw HTML, including the embedded script tags
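In practice you may want a couple of hardening tweaks. For example, some sites reject the default requests User-Agent; the header value below is just an illustrative placeholder:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
html_content = response.text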
Extracting React/Next.js Data from HTML
Here’s where the magic happens. Our goal is to extract React data from the plaintext response. SPAs frequently load data within script tags, and we’ll employ regular expressions (regex) to locate and extract these script tags.
import re
import json

def extractor(html_content: str) -> str:
    # Find the inline script tag that bootstraps the React app. The pattern is
    # site-specific; inspect your target page's source to find the right anchor.
    # Note the non-greedy .*? so we stop at the first closing </script> tag.
    react_app_script = re.search(
        r"<script>document\.getElementById.*?</script>", html_content
    )
    if react_app_script is None:
        raise ValueError("bootstrap script tag not found in page HTML")
    # Slice out the JSON payload: drop everything up to the second "{" (the
    # function boilerplate before the data), cut at the site-specific marker
    # that follows the payload, and re-wrap the result in braces.
    return (
        "{"
        + react_app_script.group()
        .split("{", 2)[2]
        .split("};window.initilizeAppWithHandoffState", 1)[0]
        + "}"
    )

# Parse the recovered JSON string into a Python object
react_data = json.loads(extractor(html_content))
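The exact markers above are specific to the site I was scraping. Next.js sites tend to make this even easier: they usually ship their page data in a well-known <script id="__NEXT_DATA__"> tag, so a sketch like the following often works (adjust the pattern to your target site):

next_data_match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    html_content,
    re.DOTALL,
)
if next_data_match:
    next_data = json.loads(next_data_match.group(1))  # page props, query, buildId, etc.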
Handling Authentication and Cookies
Some SPAs require authentication or rely on cookies before they’ll serve data. The session management features of the requests library handle both: a Session object persists cookies across requests, letting you log in once and send authenticated requests afterwards.
# Create a session that persists cookies across requests
session = requests.Session()

# Authenticate (if needed); the session stores any cookies the server sets
login_data = {
    'username': 'your_username',
    'password': 'your_password'
}
session.post('https://example.com/login', data=login_data)

# Subsequent requests automatically carry the session cookies
response = session.get('https://example.com/protected_data')
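If the site keys access off a cookie rather than a login form, you can set it on the session directly. The cookie name below is hypothetical; copy the real one from your browser's dev tools:

# Set a cookie manually instead of posting credentials
session.cookies.set('session_token', 'your_token_value', domain='example.com')
response = session.get('https://example.com/protected_data')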
Conclusion
A little string manipulation goes a long way. By harnessing the power of regex, you can build an efficient, lightweight approach to extracting data from SPA sites.
Chrome Extension
When I was building an ingestion engine for an app I was working on, I wanted a quick way to view the hidden JSON blob for the page I was viewing, to see if it held the data I wanted. I built a Chrome extension that displays this data and lets you easily copy or export it to help when writing scraping applications. It’s available on the Chrome Web Store; check it out!
Disclaimer
Always respect website terms of service and robots.txt files when scraping data from the web.