For a complete overview, see our web scraping API guide.
Playwright web scraping helps you crawl modern websites that rely on complex JavaScript rendering. Instead of reverse-engineering every parameter, you can load the page in a real browser engine and extract the final DOM. As a result, you spend less time fighting front-end complexity and more time building a stable data pipeline.
In this tutorial, you will learn Playwright web scraping on Node.js: installation, dynamic rendering, data extraction, login handling, pagination and infinite scroll, plus basic data cleaning and storage.
Outbound references:
- Playwright official docs: https://playwright.dev/
- Playwright GitHub repo (stars): https://github.com/microsoft/playwright
Headless Browser Options (Updated Snapshot)
If you compare tools by ecosystem maturity and release cadence, Playwright remains a top choice for cross-browser automation.
| Tool | Supported Languages | Supported Browsers | GitHub Stars | Latest Release Date |
|---|---|---|---|---|
| Playwright | JavaScript, Python, C#, Java | Chromium-based browsers, Firefox, WebKit-based browsers | 60,300 | March 3, 2024 |
| Selenium | Java, Python, JavaScript, C#, Ruby | Chromium-based browsers, Firefox, WebKit-based browsers | 29,000 | February 18, 2024 |
| Puppeteer | JavaScript | Chrome, Chromium, Firefox (experimental) | 86,400 | March 15, 2024 |
| Cypress | JavaScript | Chrome, Chromium, Edge, Firefox | 45,900 | March 13, 2024 |
| chromedp | Go | Chrome | 10,200 | August 5, 2023 |
| Splash | Python | Custom engine | 4,000 | June 16, 2020 |
| headless_chrome | Rust | Chrome, Chromium | 2,000 | January 27, 2024 |
| HTMLUnit | Java | Rhino engine | 806 | March 13, 2024 |
- Web scraping engineers have long favored Selenium because of its history: first released in 2004 and still actively maintained, it has accumulated years of fixes and tutorials. However, some websites now deploy anti-bot measures that specifically target Selenium-driven browsers; for example, browsing with Selenium may trigger Cloudflare's or Google's human-verification challenges. Playwright, as a newer headless-browser tool, presents a fingerprint that these systems do not yet fully profile, so it is less likely to trip them. It also performs noticeably better than the older Selenium:
- Selenium: based on the WebDriver client-server model, it requires installing browser drivers and incurs communication latency.
- Playwright: interacts directly with the browser engine, with reported speed gains of 30%-50% and roughly 40% lower resource consumption in concurrent testing.
Playwright is a cross-language automation tool that officially supports four languages: JavaScript/TypeScript, Python, Java, and C#. This article covers Playwright web scraping, data cleaning, and common anti-detection techniques on the Node.js platform.
Installation
- Make sure you are running a supported Node.js version: 18, 20, or 22.
npm install playwright@latest
npx playwright install
Then check the version:
npx playwright --version
At the time of writing, the latest version is 1.53.1.
Create a Project
Initialize the project using npm:
mkdir playwright-project && cd playwright-project
npm init -y
Install the dependencies:
npm install playwright
npm install fs-extra
Now use Playwright to fetch the content of the GitHub Trending page. Contents of main.js:
const { chromium } = require('playwright');
const path = require('path');

(async () => {
  // Initialize browser
  const browser = await chromium.launch({
    headless: false, // Set to true for headless mode
    slowMo: 500,     // Slow down operations for observation
  });
  const page = await browser.newPage();
  try {
    // Navigate to the GitHub Trending page
    await page.goto('https://github.com/trending', { waitUntil: 'networkidle' });
    // Wait for dynamic content to load
    await page.waitForSelector('article.Box-row');
  } catch (error) {
    console.error('Error during scraping:', error);
  } finally {
    // Close browser
    await browser.close();
  }
})();
Run it:
node main.js
After a moment, Playwright will launch a Chromium window and open the github.com/trending page.

If you don't want a browser window to pop up while the program runs, set headless to true:
const browser = await chromium.launch({
  headless: true, // No visible browser window
  slowMo: 500,    // Slow down operations for observation
});
Then we can start extracting the data on the page.
page.$$eval() evaluates a function over all elements matching a selector; the $$ prefix corresponds to document.querySelectorAll().
await page.$$eval('article.Box-row', (articles) => {
  return articles.map((article) => {
    // Process each article element
    // ...
  });
});

articles is an array with one element per repository listed on the page.

Next, parse the content of each repository and collect the results into a projects array:
const projects = await page.$$eval('article.Box-row', (articles) => {
  return articles.map((article) => {
    // Extract the project name and link
    const titleElement = article.querySelector('h2 a');
    const title = titleElement?.textContent.trim().replace(/\s+/g, ' ');
    const repoUrl = `https://github.com${titleElement?.getAttribute('href')}`;
    // Extract the description
    const description = article.querySelector('p.col-9')?.textContent.trim();
    // Extract the programming language and star count
    const lang = article.querySelector('span[itemprop="programmingLanguage"]')?.textContent;
    const stars = article.querySelector('a[href$="/stargazers"]')?.textContent.trim();
    // Extract the stars gained today
    const todayStars = article.querySelector('span.float-right')?.textContent.trim();
    return { title, repoUrl, description, lang, stars, todayStars };
  });
});
The above code parses the fields title, repoUrl, description, lang, stars, and todayStars.
For simplicity, we save the data to a local JSON file with fs-extra (add const fs = require('fs-extra'); at the top of main.js):
console.log(`Successfully scraped ${projects.length} projects`);
// Save data to a JSON file
const outputPath = path.join(__dirname, 'github_trending.json');
await fs.writeJSON(outputPath, projects, { spaces: 2 });
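If you'd rather not depend on fs-extra, Node's built-in fs.promises API does the same job; a minimal equivalent:
const fsp = require('fs').promises;

// Serialize with two-space indentation, matching fs-extra's { spaces: 2 }
await fsp.writeFile(outputPath, JSON.stringify(projects, null, 2), 'utf8');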
Login Verification
If the content you want to scrape requires a login, you can have Playwright fill in the username and password automatically.

// Log in to GitHub
await page.goto('https://github.com/login');
await page.fill('input[name="login"]', 'your_username');
await page.fill('input[name="password"]', 'your_password');
await page.click('input[type="submit"]');
The code above logs you in to GitHub automatically (accounts protected by two-factor authentication will need additional handling).
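Logging in on every run is slow and increases the chance of triggering verification. A common pattern is to save the authenticated session once and reuse it on later runs via Playwright's storageState (a sketch; 'auth.json' is an arbitrary file name):
// After a successful login, persist cookies and localStorage once
await page.context().storageState({ path: 'auth.json' });

// On later runs, start from the saved session instead of logging in again
const context = await browser.newContext({ storageState: 'auth.json' });
const loggedInPage = await context.newPage();
await loggedInPage.goto('https://github.com'); // already authenticated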
Handling Pagination
If the content is split across many pages, you can click through them programmatically:

// click "Next" button
const nextButton = await page.$('a[rel="next"]');
if (nextButton) {
await nextButton.click();
await page.waitForLoadState();
}
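To walk through every page, wrap that check in a loop that extracts data, then clicks Next until the link disappears (a sketch; the extraction step is left as a comment):
while (true) {
  // ... extract the data you need from the current page here ...
  const nextButton = await page.$('a[rel="next"]');
  if (!nextButton) break; // no "Next" link means we reached the last page
  await nextButton.click();
  await page.waitForLoadState('networkidle'); // let the next page finish loading
}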
Handling Dynamically Loaded Content (Infinite Scrolling)
Some pages don't paginate with buttons; they load the next batch of content automatically as you scroll. Playwright handles this easily:
// Simulate scrolling to the bottom
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});
await page.waitForTimeout(2000); // Wait for the content to load
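For feeds that keep loading new items, repeat the scroll until the page height stops growing (a minimal sketch; the 2-second wait is an assumption you may need to tune per site):
let previousHeight = 0;
while (true) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break; // no new content loaded
  previousHeight = currentHeight;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000); // give lazy-loaded content time to arrive
}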
Evading Anti-Scraping Mechanisms with Playwright
- Modify the User-Agent. If you don't change it, Playwright's default User-Agent is:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/138.0.7204.23 Safari/537.36
The HeadlessChrome token is an easy giveaway, so it is best to replace the User-Agent:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  // Create a context with a custom User-Agent
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
  });
  const page = await context.newPage();
  await page.goto('https://httpbin.org/user-agent');
  await page.screenshot({ path: 'user-agent.png' });
  await browser.close();
})();
- Add random delays to avoid overly frequent requests. Rapidly hitting the same URL in a short period makes it easy for the server to flag the client as a scraper:
await page.waitForTimeout(Math.random() * 1000 + 500);
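A small helper keeps such pauses consistent across a scraping loop (a sketch; the 500-1500 ms range is an arbitrary choice, and the URLs are placeholders):
// Pause for a random duration between min and max milliseconds
const randomPause = (page, min = 500, max = 1500) =>
  page.waitForTimeout(min + Math.random() * (max - min));

// Usage between navigations
await page.goto('https://example.com/page/1');
await randomPause(page);
await page.goto('https://example.com/page/2');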
- Use proxy IPs in Playwright. If a single IP address hits a website too many times in a short period, the server can easily detect and block it. Proxy IPs solve this problem. Example code for using an HTTP proxy in Playwright:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://proxy.example.com:8080', // Proxy server address
      // username: 'user', // Optional: proxy username
      // password: 'pass', // Optional: proxy password
    }
  });
  const page = await browser.newPage();
  await page.goto('https://api.ipify.org?format=json');
  console.log(await page.content()); // Verify the proxy IP is being used
  await browser.close();
})();
If you are using a SOCKS5 proxy, the example is as follows (note that Chromium does not support authentication for SOCKS5 proxies, so the credentials may be ignored):
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'socks5://socks.example.com:1080', // SOCKS5 proxy
      username: 'username',
      password: 'password'
    }
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
Playwright can also use a different proxy for each website. For example, use proxy A for website A and proxy B for website B, or skip the proxy for website B entirely to save proxy traffic:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  // Context 1: use proxy A
  const context1 = await browser.newContext({
    proxy: { server: 'http://proxy-a.example.com:8080' }
  });
  const page1 = await context1.newPage();
  await page1.goto('https://example-a.com');
  // Context 2: use proxy B
  const context2 = await browser.newContext({
    proxy: { server: 'http://proxy-b.example.com:8080' }
  });
  const page2 = await context2.newPage();
  await page2.goto('https://example-b.com');
  await browser.close();
})();
- Modify Playwright's fingerprint information.
Modify the screen resolution and window size.
Sometimes front-end code checks the current window size; if the window is too small, the visitor may be flagged as a scraping program. Configure the Playwright viewport to match a normal browser window:
const context = await browser.newContext({
  viewport: { width: 1920, height: 1080 } // Simulate a desktop browser
  // Or simulate a mobile browser instead:
  // viewport: { width: 375, height: 812 } // iPhone X size
});
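Instead of hand-picking mobile dimensions, you can also use Playwright's built-in device descriptors, which bundle a matching viewport, user agent, and touch support (assuming a Playwright version that ships the 'iPhone 13' descriptor):
const { devices } = require('playwright');

// devices['iPhone 13'] bundles viewport, userAgent, deviceScaleFactor, and touch support
const context = await browser.newContext({ ...devices['iPhone 13'] });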
WebRTC
WebRTC can leak your real IP address even behind a proxy; you can disable it with an init script that stubs out the API:
const { webkit } = require('playwright');

(async () => {
  const browser = await webkit.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  // Inject JavaScript before any page script runs to stub out WebRTC
  await page.addInitScript(() => {
    // Override RTCPeerConnection with a no-op implementation
    window.RTCPeerConnection = function () {
      return {
        createDataChannel: () => {},
        createOffer: () => Promise.resolve({}),
        createAnswer: () => Promise.resolve({}),
        setLocalDescription: () => Promise.resolve({}),
        setRemoteDescription: () => Promise.resolve({}),
        addIceCandidate: () => Promise.resolve({})
      };
    };
  });
  // ...
})();
Canvas fingerprint
Modify the Canvas fingerprint by injecting random noise into the pixel data:
await page.addInitScript(() => {
  // Modify the Canvas fingerprint
  const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData;
  CanvasRenderingContext2D.prototype.getImageData = function (x, y, width, height) {
    // Add random noise to each pixel
    const result = originalGetImageData.apply(this, arguments);
    const data = result.data;
    for (let i = 0; i < data.length; i += 4) {
      data[i] = data[i] + (Math.random() - 0.5) * 10;         // Red
      data[i + 1] = data[i + 1] + (Math.random() - 0.5) * 10; // Green
      data[i + 2] = data[i + 2] + (Math.random() - 0.5) * 10; // Blue
    }
    return result;
  };
});
Hardware fingerprint
Modify the hardware fingerprint:
await page.addInitScript(() => {
  // Override the reported CPU core count
  Object.defineProperty(navigator, 'hardwareConcurrency', {
    get: () => 4 // Simulate a 4-core CPU
  });
  // Override the reported device memory
  Object.defineProperty(navigator, 'deviceMemory', {
    get: () => 8 // Simulate 8 GB of memory
  });
});
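Both overrides can also be registered once per context with context.addInitScript, so every page created from that context inherits them; for example:
// Register the overrides once on the context; all of its pages inherit them
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 4 });
  Object.defineProperty(navigator, 'deviceMemory', { get: () => 8 });
});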
- Use extensions (Chromium only). You can further modify fingerprints by loading Chrome extensions. Note that Playwright only loads extensions in a persistent, non-headless Chromium context:
const context = await chromium.launchPersistentContext('./user-data', {
  headless: false, // Extensions require a visible, persistent context
  args: [
    '--disable-blink-features=AutomationControlled', // Hide the automation fingerprint
    '--disable-extensions-except=./path/to/your/extension',
    '--load-extension=./path/to/your/extension' // Load a Chrome extension
  ]
});
That wraps up this basic Playwright tutorial: another powerful tool for the web scraping arsenal.