
Data Cleaning and Processing Libraries in the Node.js Ecosystem – A Counterpart to Python’s pandas

Explore data cleaning and analytics in the Node.js ecosystem. Use TypeScript, Streams and backpressure, csv-parse/Papa Parse, Cheerio/Playwright for scraping, Ajv for validation, Danfo.js/Apache Arrow/DuckDB for tabular processing, and Prisma/PostgreSQL pipelines. Orchestrate jobs with BullMQ/Redis and node-cron, deploy on serverless, and build reliable, scalable ETL with testing, logging, and observability.

2025-12-01

Node.js has become a popular choice for building efficient web crawlers due to its non-blocking I/O and event-driven nature.

It is particularly well-suited for handling high-concurrency network requests, as it can initiate multiple requests simultaneously through asynchronous operations within a single thread. This avoids the waiting and blocking issues inherent in traditional synchronous programming, significantly improving data crawling efficiency. Additionally, the rich ecosystem of third-party libraries in Node.js (such as axios for HTTP requests, cheerio for HTML parsing, and puppeteer for dynamic page rendering) further lowers the barrier to crawler development, making it widely used in the field of data collection.
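
To make that concrete, here is a minimal sketch of concurrent crawling with axios and cheerio. The URLs are placeholders and error handling is deliberately thin; treat it as a starting point rather than a production crawler.

const axios = require("axios");
const cheerio = require("cheerio");

// Placeholder URLs -- replace with the pages you actually want to crawl.
const urls = [
  "https://example.com/page-1",
  "https://example.com/page-2",
  "https://example.com/page-3",
];

// Fetch one page and pull out its <title> text.
async function fetchTitle(url) {
  const response = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(response.data);
  return { url, title: $("title").text().trim() };
}

async function crawl() {
  // Promise.allSettled fires all requests at once instead of waiting for each in turn.
  const results = await Promise.allSettled(urls.map(fetchTitle));
  for (const result of results) {
    if (result.status === "fulfilled") {
      console.log(result.value.url, "->", result.value.title);
    } else {
      console.error("Request failed:", result.reason.message);
    }
  }
}

crawl();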

Typical Application Scenarios

  1. Data Collection and Analysis
  2. Content Synchronization and Migration
  3. Monitoring and Early Warning
  4. Search Engines and Vertical Domain Applications

Node.js can rival Python in the field of web crawling, with an increasing number of third-party libraries and an increasingly mature ecosystem.

In the field of data analysis, Python has a powerful tool – pandas. So, is there a comparable library in Node.js?

In the Node.js ecosystem, Danfo.js is currently the library most similar to Python’s pandas. Its design philosophy, API style, and core functionalities (such as the DataFrame data structure) are highly modeled after pandas, making it especially suitable for developers familiar with pandas to get started quickly.

Core Features of Danfo.js (Compared with pandas)

  1. Consistent Core Data Structures: It provides DataFrame (two-dimensional tabular data) and Series (one-dimensional sequences), which correspond exactly to the core structures of pandas, supporting row indexes, column names, and data types (such as int32, float64, string, etc.). A short sketch of both structures follows this list.
  2. Highly Similar API Design: The naming and usage of common data-processing methods are almost identical to those in pandas, which keeps the learning curve low; readCSV, head, dropNa, groupby, and sortValues all mirror their pandas counterparts.
  3. Support for Mainstream Formats and Data Operations: It supports multiple data sources such as CSV, JSON, and plain arrays, and allows direct operations on a DataFrame such as adding/deleting rows and columns, sorting, and slicing, covering the core data-processing scenarios of pandas.
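
As a quick illustration of these two structures, here is a small sketch that builds a DataFrame and a Series in memory (the data is invented purely for demonstration):

const dfd = require("danfojs-node");

// A DataFrame built from a plain object of column arrays.
const df = new dfd.DataFrame({
  name: ["Alice", "Bob", "Charlie"],
  age: [28, 35, 31],
});
df.print();
console.log(df.shape);   // [ 3, 2 ]
console.log(df.columns); // [ 'name', 'age' ]

// A Series is a single one-dimensional column.
const ages = new dfd.Series([28, 35, 31]);
console.log(ages.mean()); // average of the values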

Example Code (Compared with pandas)

Suppose we have a CSV file with the following content:

name,age,department,salary,hire_date
Alice,28,Engineering,75000,$2020-03-15
Bob,35,Marketing,62000,2018-07-22
Charlie,,Sales,58000,2021-01-30  
David,42,Engineering,90000,2016-09-05
Eve,31,Marketing,68000,2019-11-10
Frank,35,Sales,55000,2020-05-18
Grace,29,Engineering,82000,2019-02-28
Bob,35,Marketing,62000,2018-07-22  
Henry,forty,HR,48000,2022-04-01  
Ivy,27,Marketing,,2021-08-12 
Jack,33,Engineering,88000,2017-12-03

The headers in the CSV data are name, age, department, salary, and hire_date.

Then we use danfojs-node in Node.js for the analysis. The examples below target danfojs-node 1.2.0, the current version at the time of writing, because some of the functions used here only behave correctly in recent releases. I ran into quite a few issues at this stage: many examples available online were written for older versions of danfojs-node and are incompatible with the current API.
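
One way to avoid that kind of mismatch is to pin the version explicitly when installing:

npm install danfojs-node@1.2.0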

For example, exceptions like this may occur: “An error occurred: TypeError: noNaDf.dropDuplicates is not a function.”
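
If you do need de-duplication on a DataFrame with this version, one workaround is to round-trip through plain row objects. The helper below is hypothetical (not part of the danfojs API) and assumes that dfd.toJSON returns an array of row objects in its default output format:

const dfd = require("danfojs-node");

// Hypothetical helper: drop fully duplicated rows, keeping the first occurrence.
function dropDuplicateRows(df) {
  const rows = dfd.toJSON(df); // assumed: array of { column: value } row objects
  const seen = new Set();
  const unique = rows.filter((row) => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  return new dfd.DataFrame(unique);
}

With that caveat noted, here is the complete example: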

const dfd = require("danfojs-node");

async function main() {
  // Read CSV and create DataFrame (similar to pandas.read_csv)
  const df = await dfd.readCSV("data.csv");

  // View basic data information (similar to df.info())
  console.log("Data shape:", df.shape);
  console.log("Data columns:", df.columns);
  console.log("Data types:");
  df.ctypes.print();

  console.log("Data preview:");
  df.head().print(); // Display the first few rows of data, similar to df.head()

  // Data cleaning: drop rows with missing values, then reset the index
  const cleanedDf = df.dropNa().resetIndex({ drop: true }); // Reset index after dropNa so query works later
  cleanedDf.head().print();
  // Filter rows where age > 30 and salary > 50000 (boolean masking, similar to pandas)
  const filteredDf = cleanedDf.query(cleanedDf["age"].gt(30).and(cleanedDf["salary"].gt(50000)));
  console.log("Filtered Data:");
  filteredDf.print();
  // Group aggregation (similar to df.groupby("department").mean())
  const groupedDf = cleanedDf.groupby(["department"]).col(["salary"]).mean();
  // Save the result (similar to df.to_csv())
  dfd.toCSV(groupedDf, { filePath: "result.csv" });
  console.log("Result saved to result.csv");
}

main()
  .then(() => {
    console.log("Data processing complete");
  })
  .catch((err) => {
    console.error("An error occurred:", err);
  });

Output results:

Data shape: [ 11, 5 ]
Data columns: [ 'name', 'age', 'department', 'salary', 'hire_date' ]
Data types:
╔════════════╤═════════╗
║ name       │ string  ║
╟────────────┼─────────╢
║ age        │ string  ║
╟────────────┼─────────╢
║ department │ string  ║
╟────────────┼─────────╢
║ salary     │ float32 ║
╟────────────┼─────────╢
║ hire_date  │ string  ║
╚════════════╧═════════╝

Data preview:
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Charlie           │ null              │ Sales             │ 58000             │ 2021-01-30        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Filtered Data:
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Result saved to result.csv
Data processing complete

It can be seen that the functions of danfojs-node are quite similar to those of Python’s pandas.

For example, dropNa() removes rows that contain missing values. Once rows have been dropped, the row indexes are no longer contiguous, and calling query directly on such a DataFrame results in an error.

Such as: “An error occurred: TypeError: Cannot read properties of undefined (reading ‘0’)”

In this case, it is necessary to reset the index: .resetIndex({ drop: true });

  const groupedDf = cleanedDf.groupby(["department"]).col(["salary"]).mean();

Here, we aggregate based on the “department” column, grouping data from the same department together, and then calculate the average of the “salary” to obtain the average salary for each department.
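
If you need more than one statistic per group, the Groupby object also exposes an agg method that takes a map of column names to operations (a sketch based on the danfo.js groupby documentation, reusing cleanedDf from above):

// Average salary and head count per department in a single pass.
const deptStats = cleanedDf
  .groupby(["department"])
  .agg({ salary: "mean", name: "count" });
deptStats.print();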

To sort by “salary” in descending order, use the DataFrame.sortValues function:

const sortResult = cleanedDf.sortValues("salary", { ascending: false });

Running result:

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 5          │ Grace             │ 29                │ Engineering       │ 82000             │ 2019-02-28        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 7          │ Henry             │ forty             │ HR                │ 48000             │ 2022-04-01        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Apply

Danfo.js also provides an apply function for element-wise processing of column data.

Suppose we want to add a new column named "salaryLevel" that categorizes each salary into one of three groups (low, medium, high) based on its amount.

For instance, salaries of 50,000 or less are "Low", salaries between 50,000 and 80,000 are "Medium", and salaries above 80,000 are "High".

// Categorize a single salary value into a level; Series.apply runs this on every element.
function markSalaryLevel(salary) {
  if (salary > 80000) {
    return "High";
  } else if (salary > 50000) {
    return "Medium";
  } else {
    return "Low";
  }
}

const salaryLevels = cleanedDf["salary"].apply(markSalaryLevel);
cleanedDf.addColumn("salaryLevel", salaryLevels, { inplace: true });

Output results:

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         │ salaryLevel║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        │ High       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 5          │ Grace             │ 29                │ Engineering       │ 82000             │ 2019-02-28        │ High       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 7          │ Henry             │ forty             │ HR                │ 48000             │ 2022-04-01        │ Low        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        │ High       ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧════════════╝

The last column is the newly added “salaryLevel”, whose values are determined based on the “salary”.

It should be noted here:

Like pandas, Danfo.js locates rows with iloc and loc: iloc selects by integer row position, while loc selects by row index label (both also accept column selections).
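
A minimal sketch of the difference, reusing cleanedDf from the example above:

// iloc: select by integer position (row numbers and column numbers).
cleanedDf.iloc({ rows: [0, 2], columns: [0, 3] }).print();

// loc: select by index labels and column names.
cleanedDf.loc({ rows: [0, 2], columns: ["name", "salary"] }).print();

// Both also accept string slices, e.g. the first three rows by position:
cleanedDf.iloc({ rows: ["0:3"] }).print();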

Summary:

Danfo.js is the Node.js library that comes closest to pandas and is currently the most complete option for complex tabular data cleaning and analysis in a Node.js environment. If you are already familiar with pandas, you can migrate to Danfo.js almost seamlessly.