
Data Cleaning and Processing Libraries in the Node.js Ecosystem – A Counterpart to Python’s pandas

Explore data cleaning and analytics in the Node.js ecosystem. Use TypeScript, Streams and backpressure, csv-parse/Papa Parse, Cheerio/Playwright for scraping, Ajv for validation, Danfo.js/Apache Arrow/DuckDB for tabular processing, and Prisma/PostgreSQL pipelines. Orchestrate jobs with BullMQ/Redis and node-cron, deploy on serverless, and build reliable, scalable ETL with testing, logging, and observability.

2025-12-01

Node.js has become a popular choice for building efficient web crawlers due to its non-blocking I/O and event-driven nature.

It is particularly well-suited for handling high-concurrency network requests, as it can initiate multiple requests simultaneously through asynchronous operations within a single thread. This avoids the waiting and blocking issues inherent in traditional synchronous programming, significantly improving data crawling efficiency. Additionally, the rich ecosystem of third-party libraries in Node.js (such as axios for HTTP requests, cheerio for HTML parsing, and puppeteer for dynamic page rendering) further lowers the barrier to crawler development, making it widely used in the field of data collection.
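
To make that concrete, here is a minimal sketch of concurrent crawling with axios and cheerio. The URLs are placeholders and error handling is deliberately thin; treat it as a starting point rather than a production crawler.

const axios = require("axios");
const cheerio = require("cheerio");

// Placeholder URLs -- replace with the pages you actually want to crawl.
const urls = [
  "https://example.com/page-1",
  "https://example.com/page-2",
  "https://example.com/page-3",
];

// Fetch one page and pull out its <title> text.
async function fetchTitle(url) {
  const response = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(response.data);
  return { url, title: $("title").text().trim() };
}

async function crawl() {
  // Promise.allSettled fires all requests at once instead of waiting for each in turn.
  const results = await Promise.allSettled(urls.map(fetchTitle));
  for (const result of results) {
    if (result.status === "fulfilled") {
      console.log(result.value.url, "->", result.value.title);
    } else {
      console.error("Request failed:", result.reason.message);
    }
  }
}

crawl();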

Typical Application Scenarios

  1. Data Collection and Analysis
  2. Content Synchronization and Migration
  3. Monitoring and Early Warning
  4. Search Engines and Vertical Domain Applications

Node.js can rival Python in the field of web crawling, with an increasing number of third-party libraries and an increasingly mature ecosystem.

In the field of data analysis, Python has a powerful tool – pandas. So, is there a comparable library in Node.js?

In the Node.js ecosystem, Danfo.js is currently the library most similar to Python’s pandas. Its design philosophy, API style, and core functionalities (such as the DataFrame data structure) are highly modeled after pandas, making it especially suitable for developers familiar with pandas to get started quickly.

Core Features of Danfo.js (Compared with pandas)

  1. Consistent Core Data Structures: It provides DataFrame (two-dimensional tabular data) and Series (one-dimensional sequences), which correspond exactly to the core structures of pandas, supporting row indexes, column names, and data types (such as int32, float64, string, etc.). A short sketch of both structures follows this list.
  2. Highly Similar API Design: The naming and usage of common data-processing methods are almost identical to those in pandas, which keeps the learning curve low; readCSV, head, dropNa, groupby, and sortValues all mirror their pandas counterparts.
  3. Support for Mainstream Formats and Data Operations: It supports multiple data sources such as CSV, JSON, and plain arrays, and allows direct operations on a DataFrame such as adding/deleting rows and columns, sorting, and slicing, covering the core data-processing scenarios of pandas.
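
As a quick illustration of these two structures, here is a small sketch that builds a DataFrame and a Series in memory (the data is invented purely for demonstration):

const dfd = require("danfojs-node");

// A DataFrame built from a plain object of column arrays.
const df = new dfd.DataFrame({
  name: ["Alice", "Bob", "Charlie"],
  age: [28, 35, 31],
});
df.print();
console.log(df.shape);   // [ 3, 2 ]
console.log(df.columns); // [ 'name', 'age' ]

// A Series is a single one-dimensional column.
const ages = new dfd.Series([28, 35, 31]);
console.log(ages.mean()); // average of the values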

Example Code (Compared with pandas)

Suppose we have a CSV file with the following content:

name,age,department,salary,hire_date
Alice,28,Engineering,75000,$2020-03-15
Bob,35,Marketing,62000,2018-07-22
Charlie,,Sales,58000,2021-01-30  
David,42,Engineering,90000,2016-09-05
Eve,31,Marketing,68000,2019-11-10
Frank,35,Sales,55000,2020-05-18
Grace,29,Engineering,82000,2019-02-28
Bob,35,Marketing,62000,2018-07-22  
Henry,forty,HR,48000,2022-04-01  
Ivy,27,Marketing,,2021-08-12 
Jack,33,Engineering,88000,2017-12-03

The headers in the CSV data are name, age, department, salary, and hire_date.

Then we use danfojs-node in Node.js for the analysis. The examples below target danfojs-node 1.2.0, the current version at the time of writing, because some of the functions used here only behave correctly in recent releases. I ran into quite a few issues at this stage: many examples available online were written for older versions of danfojs-node and are incompatible with the current API.
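
One way to avoid that kind of mismatch is to pin the version explicitly when installing:

npm install danfojs-node@1.2.0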

For example, exceptions like this may occur: “An error occurred: TypeError: noNaDf.dropDuplicates is not a function.”
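
If you do need de-duplication on a DataFrame with this version, one workaround is to round-trip through plain row objects. The helper below is hypothetical (not part of the danfojs API) and assumes that dfd.toJSON returns an array of row objects in its default output format:

const dfd = require("danfojs-node");

// Hypothetical helper: drop fully duplicated rows, keeping the first occurrence.
function dropDuplicateRows(df) {
  const rows = dfd.toJSON(df); // assumed: array of { column: value } row objects
  const seen = new Set();
  const unique = rows.filter((row) => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  return new dfd.DataFrame(unique);
}

With that caveat noted, here is the complete example: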

const dfd = require("danfojs-node");

async function main() {
  // Read CSV and create DataFrame (similar to pandas.read_csv)
  const df = await dfd.readCSV("data.csv");

  // View basic data information (similar to df.info())
  console.log("Data shape:", df.shape);
  console.log("Data columns:", df.columns);
  console.log("Data types:");
  df.ctypes.print();

  console.log("Data preview:");
  df.head().print(); // Display the first few rows of data, similar to df.head()

  // Data cleaning: drop rows with missing values, then reset the index
  const cleanedDf = df.dropNa().resetIndex({ drop: true }); // Reset index after dropNa so query works later
  cleanedDf.head().print();
  // Filter rows where age > 30 and salary > 50000 (boolean masking, similar to pandas)
  const filteredDf = cleanedDf.query(cleanedDf["age"].gt(30).and(cleanedDf["salary"].gt(50000)));
  console.log("Filtered Data:");
  filteredDf.print();
  // Group aggregation (similar to df.groupby("department").mean())
  const groupedDf = cleanedDf.groupby(["department"]).col(["salary"]).mean();
  // Save the result (similar to df.to_csv())
  dfd.toCSV(groupedDf, { filePath: "result.csv" });
  console.log("Result saved to result.csv");
}

main()
  .then(() => {
    console.log("Data processing complete");
  })
  .catch((err) => {
    console.error("An error occurred:", err);
  });

Output results:

Data shape: [ 11, 5 ]
Data columns: [ 'name', 'age', 'department', 'salary', 'hire_date' ]
Data types:
╔════════════╤═════════╗
║ name       │ string  ║
╟────────────┼─────────╢
║ age        │ string  ║
╟────────────┼─────────╢
║ department │ string  ║
╟────────────┼─────────╢
║ salary     │ float32 ║
╟────────────┼─────────╢
║ hire_date  │ string  ║
╚════════════╧═════════╝

Data preview:
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ Charlie           │ null              │ Sales             │ 58000             │ 2021-01-30        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Filtered Data:
╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Result saved to result.csv
Data processing complete

It can be seen that the functions of danfojs-node are quite similar to those of Python’s pandas.

For example, dropNa() removes rows that contain missing values. Once rows have been dropped, the row indexes are no longer contiguous, and calling query directly on such a DataFrame results in an error.

Such as: “An error occurred: TypeError: Cannot read properties of undefined (reading ‘0’)”

In this case, it is necessary to reset the index: .resetIndex({ drop: true });

  const groupedDf = cleanedDf.groupby(["department"]).col(["salary"]).mean();

Here, we aggregate based on the “department” column, grouping data from the same department together, and then calculate the average of the “salary” to obtain the average salary for each department.
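
If you need more than one statistic per group, the Groupby object also exposes an agg method that takes a map of column names to operations (a sketch based on the danfo.js groupby documentation, reusing cleanedDf from above):

// Average salary and head count per department in a single pass.
const deptStats = cleanedDf
  .groupby(["department"])
  .agg({ salary: "mean", name: "count" });
deptStats.print();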

To sort by “salary” in descending order, use the DataFrame.sortValues function:

const sortResult = cleanedDf.sortValues("salary", { ascending: false });

Running result:

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 5          │ Grace             │ 29                │ Engineering       │ 82000             │ 2019-02-28        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 7          │ Henry             │ forty             │ HR                │ 48000             │ 2022-04-01        ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Apply

Danfo.js also provides an apply function for element-wise processing of column data.

Suppose we want to add a new column named "salaryLevel" that categorizes each salary into one of three groups (low, medium, high) based on its amount.

For instance, salaries of 50,000 or less are "Low", salaries between 50,000 and 80,000 are "Medium", and salaries above 80,000 are "High".

// Categorize a single salary value into a level; Series.apply runs this on every element.
function markSalaryLevel(salary) {
  if (salary > 80000) {
    return "High";
  } else if (salary > 50000) {
    return "Medium";
  } else {
    return "Low";
  }
}

const salaryLevels = cleanedDf["salary"].apply(markSalaryLevel);
cleanedDf.addColumn("salaryLevel", salaryLevels, { inplace: true });

Output results:

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤═══════════════════╤════════════╗
║            │ name              │ age               │ department        │ salary            │ hire_date         │ salaryLevel║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 0          │ Alice             │ 28                │ Engineering       │ 75000             │ $2020-03-15       │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 1          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 2          │ David             │ 42                │ Engineering       │ 90000             │ 2016-09-05        │ High       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 3          │ Eve               │ 31                │ Marketing         │ 68000             │ 2019-11-10        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 4          │ Frank             │ 35                │ Sales             │ 55000             │ 2020-05-18        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 5          │ Grace             │ 29                │ Engineering       │ 82000             │ 2019-02-28        │ High       ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 6          │ Bob               │ 35                │ Marketing         │ 62000             │ 2018-07-22        │ Medium     ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 7          │ Henry             │ forty             │ HR                │ 48000             │ 2022-04-01        │ Low        ║
╟────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────╢
║ 8          │ Jack              │ 33                │ Engineering       │ 88000             │ 2017-12-03        │ High       ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧═══════════════════╧════════════╝

The last column is the newly added “salaryLevel”, whose values are determined based on the “salary”.

It should be noted here:

Like pandas, Danfo.js locates rows with iloc and loc: iloc selects by integer row position, while loc selects by row index label (both also accept column selections).
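
A minimal sketch of the difference, reusing cleanedDf from the example above:

// iloc: select by integer position (row numbers and column numbers).
cleanedDf.iloc({ rows: [0, 2], columns: [0, 3] }).print();

// loc: select by index labels and column names.
cleanedDf.loc({ rows: [0, 2], columns: ["name", "salary"] }).print();

// Both also accept string slices, e.g. the first three rows by position:
cleanedDf.iloc({ rows: ["0:3"] }).print();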

Summary:

Danfo.js is the Node.js library that comes closest to pandas and is currently the most complete option for complex tabular data cleaning and analysis in a Node.js environment. If you are already familiar with pandas, you can migrate to Danfo.js almost seamlessly.