
Playwright Web Scraping in Node.js: Render Pages, Extract Data, and Store Results

Bypass JavaScript protections effortlessly using Playwright, a fast headless-browser framework well suited to scraping. Learn:

- Auto-login & pagination handling
- Proxy rotation & fingerprint spoofing
- Anti-detection techniques
- Why it can be 30-50% faster than Selenium

Perfect for scraping modern JS-heavy sites!

2025-10-15

For a complete overview, see our web scraping API guide.

Playwright web scraping helps you crawl modern websites that rely on complex JavaScript rendering. Instead of reverse-engineering every parameter, you can load the page in a real browser engine and extract the final DOM. As a result, you spend less time fighting front-end complexity and more time building a stable data pipeline.

In this tutorial, you will learn Playwright web scraping on Node.js: installation, dynamic rendering, data extraction, login handling, pagination and infinite scroll, plus basic data cleaning and storage.

Headless browser options (updated snapshot)

If you compare tools by ecosystem maturity and release cadence, Playwright remains a top choice for cross-browser automation. 

| Tool | Supported Languages | Supported Browsers | GitHub Stars | Latest Release Date |
| --- | --- | --- | --- | --- |
| Playwright | JavaScript, Python, C#, Java | Chromium-based browsers, Firefox, WebKit-based browsers | 60,300 | March 3, 2024 |
| Selenium | Java, Python, JavaScript, C#, Ruby | Chromium-based browsers, Firefox, WebKit-based browsers | 29,000 | February 18, 2024 |
| Puppeteer | JavaScript | Chrome, Chromium, Firefox (experimental) | 86,400 | March 15, 2024 |
| Cypress | JavaScript | Chrome, Chromium, Edge, Firefox | 45,900 | March 13, 2024 |
| chromedp | Go | Chrome | 10,200 | August 5, 2023 |
| Splash | Python | Custom engine | 4,000 | June 16, 2020 |
| Headless Chrome | Rust | Chrome, Chromium | 2,000 | January 27, 2024 |
| HTMLUnit | Java | Rhino engine | 806 | March 13, 2024 |

Install Playwright

npm install playwright@latest
npx playwright install

Then check the version:

npx playwright --version

At the time of writing, the latest version is 1.53.1.

Create a Project

Initialize the project with npm:

mkdir playwright-project && cd playwright-project
npm init -y

Install the dependencies:

npm install playwright
npm install fs-extra

Use Playwright to get the content of the GitHub Trending page:

Contents of the main.js file:

const { chromium } = require('playwright');
const fs = require('fs-extra');
const path = require('path');

(async () => {
  // Initialize browser
  const browser = await chromium.launch({
    headless: false, // Set to true for headless mode
    slowMo: 500,     // Slow down operations for observation
  });
  const page = await browser.newPage();

  try {
    // Navigate to GitHub Trending page
    await page.goto('https://github.com/trending', { waitUntil: 'networkidle' });

    // Wait for dynamic content to load
    await page.waitForSelector('article.Box-row');

  } catch (error) {
    console.error('Error during scraping:', error);
  } finally {
    // Close browser
    await browser.close();
  }
})();

Run it:

node main.js

Wait a moment, and Playwright will launch a Chromium window and open the github.com/trending page.

If you don’t want a browser window to pop up while the program runs, set headless to true:

  const browser = await chromium.launch({
    headless: true, // Run without a visible browser window
    slowMo: 500,    // Slow down operations for observation
  });

Now we can start extracting data from the page.

page.$$eval() evaluates a function against all elements matching a CSS selector; the $$ prefix is the equivalent of document.querySelectorAll().

const projects = await page.$$eval('article.Box-row', (articles) => {
  return articles.map((article) => {
    // Process each article element
    // ...
  });
});

articles is an array with one element per repository listed on the page.

Next, start parsing the content of each repository.

const projects = await page.$$eval('article.Box-row', (articles) => {
  return articles.map((article) => {
    // Extract project name and link
    const titleElement = article.querySelector('h2 a');
    const title = titleElement?.textContent.trim().replace(/\s+/g, ' ');
    const repoUrl = `https://github.com${titleElement?.getAttribute('href')}`;

    // Extract description
    const description = article.querySelector('p.col-9')?.textContent.trim();

    // Extract programming language and star count
    const lang = article.querySelector('span[itemprop="programmingLanguage"]')?.textContent;
    const stars = article.querySelector('a[href$="/stargazers"]')?.textContent.trim();

    // Extract stars gained today
    const todayStars = article.querySelector('span.float-right')?.textContent.trim();

    return { title, repoUrl, description, lang, stars, todayStars };
  });
});

The above code parses the fields title, repoUrl, description, lang, stars, and todayStars.

For simplicity, we use a local JSON file to save the above data:

    console.log(`Successfully scraped ${projects.length} projects`);

    // Save data to JSON file
    const outputPath = path.join(__dirname, 'github_trending.json');
    await fs.writeJSON(outputPath, projects, { spaces: 2 });
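
The scraped fields are still raw strings, e.g. stars like "12,345" and todayStars like "512 stars today". If you want typed values in the JSON, a minimal cleaning pass, assuming those formats, can run before the write (then pass cleaned instead of projects to fs.writeJSON):

// Normalize raw strings into numbers before saving
// (assumes stars like "12,345" and todayStars like "512 stars today")
const cleaned = projects.map((p) => ({
  ...p,
  stars: p.stars ? parseInt(p.stars.replace(/,/g, ''), 10) : null,
  todayStars: p.todayStars
    ? parseInt((p.todayStars.match(/[\d,]+/)?.[0] ?? '').replace(/,/g, ''), 10) || null
    : null,
}));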

Login Verification

If the content you want to scrape requires a login, you can use Playwright to fill in the username and password automatically.

// log in to GitHub
await page.goto('https://github.com/login');
await page.fill('input[name="login"]', 'your_username');
await page.fill('input[name="password"]', 'your_password');
await page.click('input[type="submit"]');
await page.waitForLoadState('networkidle'); // wait for the post-login redirect

The above code logs you in to GitHub automatically.
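
Rather than logging in on every run, you can persist the authenticated session with Playwright's storage state. A minimal sketch, where the file name auth.json is an arbitrary choice:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // ... perform the login steps shown above ...

  // Save cookies and local storage to disk
  await context.storageState({ path: 'auth.json' });

  // Later runs can reuse the saved session instead of logging in again
  const authedContext = await browser.newContext({ storageState: 'auth.json' });
  const authedPage = await authedContext.newPage();
  await authedPage.goto('https://github.com');

  await browser.close();
})();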

Handling Pagination

If the content spans many pages, you can click through them with the "Next" button:

// click "Next" button
const nextButton = await page.$('a[rel="next"]');
if (nextButton) {
  await nextButton.click();
  await page.waitForLoadState();
}
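
To walk through every page, wrap this in a loop that stops once no "Next" link remains. A sketch, where scrapePage() is a hypothetical helper that extracts data from the current page:

// Collect results across all pages
const results = [];
while (true) {
  results.push(...await scrapePage(page)); // scrapePage() is your own extraction helper
  const nextButton = await page.$('a[rel="next"]');
  if (!nextButton) break; // no more pages
  await nextButton.click();
  await page.waitForLoadState();
}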

Handling Dynamically Loaded Content (Infinite Scrolling)

Some pages are not paginated with buttons; instead, they load the next batch of content automatically as you scroll. Playwright handles this easily.

// Simulate scrolling to the bottom
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight);
});
await page.waitForTimeout(2000); // Wait for the content to load
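
To load everything, repeat the scroll until the page height stops growing. A minimal sketch:

// Scroll until no new content is appended
let previousHeight = 0;
while (true) {
  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break; // nothing new was loaded
  previousHeight = currentHeight;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000); // give new content time to load
}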

Evading Anti-Scraping Mechanisms with Playwright

  1. Modify the User-Agent. If you don’t change it, Playwright’s default User-Agent is:
   User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/138.0.7204.23 Safari/537.36

The HeadlessChrome string easily exposes automation, so it’s best to replace the User-Agent:

   const { chromium } = require('playwright');

   (async () => {
     const browser = await chromium.launch();

     // create custom User-Agent context
     const context = await browser.newContext({
       userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
     });

     const page = await context.newPage();
     await page.goto('https://httpbin.org/user-agent');
     await page.screenshot({ path: 'user-agent.png' });

     await browser.close();
   })();
  2. Add random delays to avoid overly frequent requests. If you hit the same URL rapidly in a short period, the server can easily identify the traffic as a scraper.
   await page.waitForTimeout(Math.random() * 1000 + 500);
  3. Use proxy IPs in Playwright. If you access a website too many times within a certain period, the same IP address is easily detected and blocked by the server. In this case, proxy IPs solve the problem. Example code for using an HTTP proxy in Playwright:
   const { chromium } = require('playwright');

   (async () => {
     const browser = await chromium.launch({
       proxy: {
         server: 'http://proxy.example.com:8080',  // Proxy server address
         // username: 'user',  // optional: proxy username
         // password: 'pass',  // optional: proxy password
       }
     });

     const page = await browser.newPage();
     await page.goto('https://api.ipify.org?format=json');
     console.log(await page.content());  // verify that the proxy's IP is reported

     await browser.close();
   })();

If you are using a SOCKS5 proxy, the example code is as follows:

   const { chromium } = require('playwright');

   (async () => {
     const browser = await chromium.launch({
       proxy: {
         server: 'socks5://socks.example.com:1080',  // SOCKS5 proxy
         username: 'username',
         password: 'password'
       }
     });

     const page = await browser.newPage();
     await page.goto('https://example.com');

     await browser.close();
   })();

Playwright can also use a different proxy IP for each website.

For example, use proxy A for website A and proxy B for website B, or skip the proxy for website B entirely to save proxy traffic:

   const { chromium } = require('playwright');

   (async () => {
     const browser = await chromium.launch();

     // Context 1: Use proxy A
     const context1 = await browser.newContext({
       proxy: { server: 'http://proxy-a.example.com:8080' }
     });
     const page1 = await context1.newPage();
     await page1.goto('https://example-a.com');

     // Context 2: Use proxy B
     const context2 = await browser.newContext({
       proxy: { server: 'http://proxy-b.example.com:8080' }
     });
     const page2 = await context2.newPage();
     await page2.goto('https://example-b.com');

     await browser.close();
   })();
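
The summary at the top mentions proxy rotation. One simple approach, assuming a pool of proxy servers (the addresses below are placeholders), is to open a fresh context with a different proxy for each batch of requests:

   const { chromium } = require('playwright');

   (async () => {
     // Hypothetical proxy pool; replace with real proxy addresses
     const proxies = [
       'http://proxy-1.example.com:8080',
       'http://proxy-2.example.com:8080',
       'http://proxy-3.example.com:8080',
     ];

     const browser = await chromium.launch();

     for (const [i, server] of proxies.entries()) {
       // Each context gets its own proxy, cookies, and cache
       const context = await browser.newContext({ proxy: { server } });
       const page = await context.newPage();
       await page.goto('https://api.ipify.org?format=json');
       console.log(`proxy ${i}:`, await page.textContent('body'));
       await context.close();
     }

     await browser.close();
   })();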
  4. Modify Playwright’s fingerprint information.

Modify the screen resolution and window size.

Sometimes front-end code checks the current window size, and if the window is too small, the visitor may be identified as a scraping program. You can configure Playwright’s browser window to a normal desktop or mobile size:

// Simulate a desktop browser
const desktopContext = await browser.newContext({
  viewport: { width: 1920, height: 1080 }
});

// Or simulate a mobile browser (iPhone X size)
const mobileContext = await browser.newContext({
  viewport: { width: 375, height: 812 }
});

WebRTC

WebRTC can expose your real IP address even when you use a proxy. It can be disabled through extensions or by injecting a script that stubs out the API:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.addInitScript(() => {
    // Override RTCPeerConnection with a stub so no real
    // peer connection (and no IP leak) can occur
    window.RTCPeerConnection = function () {
      return {
        createDataChannel: () => {},
        createOffer: () => Promise.resolve({}),
        createAnswer: () => Promise.resolve({}),
        setLocalDescription: () => Promise.resolve({}),
        setRemoteDescription: () => Promise.resolve({}),
        addIceCandidate: () => Promise.resolve({})
      };
    };
  });
  // ...
})();

Canvas fingerprint

Modify the Canvas fingerprint by adding random noise to the pixel data:

await page.addInitScript(() => {
  // modify the Canvas fingerprint
  const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData;
  CanvasRenderingContext2D.prototype.getImageData = function(x, y, width, height) {
    // add random noise
    const result = originalGetImageData.apply(this, arguments);
    const data = result.data;
    for (let i = 0; i < data.length; i += 4) {
      data[i] = data[i] + (Math.random() - 0.5) * 10;     // Red
      data[i + 1] = data[i + 1] + (Math.random() - 0.5) * 10; // Green
      data[i + 2] = data[i + 2] + (Math.random() - 0.5) * 10; // Blue
    }
    return result;
  };
});

Hardware fingerprint

Modify the hardware fingerprint:

await page.addInitScript(() => {
  // change the reported CPU core count
  Object.defineProperty(navigator, 'hardwareConcurrency', {
    get: () => 4 // simulate 4 core CPU
  });

  // change the reported memory size
  Object.defineProperty(navigator, 'deviceMemory', {
    get: () => 8 // simulate 8GB memory
  });
});

Use extensions (Chromium only). You can further modify fingerprints by loading Chrome extensions:

// Extensions only work with a persistent context in Playwright
const context = await chromium.launchPersistentContext('./user-data', {
  headless: false,  // extensions require a headed browser
  args: [
    '--disable-blink-features=AutomationControlled',       // hide the automation fingerprint
    '--disable-extensions-except=./path/to/your/extension',
    '--load-extension=./path/to/your/extension'            // load a Chrome extension
  ]
});

That wraps up this basic Playwright tutorial, adding another powerful tool to your web scraping arsenal.