General Data Processing Operators
2025-06-09
Overview
DataFlow currently supports text data processing at the data point level, categorized into three types: refiners, deduplicators, and filters.
| Type | Count | Description |
|---|---|---|
| Refiners | 16 | Improves the content of data points through processing and augmentation without altering the total count. |
| Deduplicators | 6 | Removes duplicate data points using methods such as hashing. |
| Filters | 42 | Filters data points based on thresholds and other criteria. |
Refiners
| Name | Applicable Type | Description | Repository or Paper |
|---|---|---|---|
| CondorRefiner | SFT | Generates evaluations and rewrites of SFT responses using LLM APIs to improve QA quality. | Paper |
| LowercaseRefiner | NLP | Converts text fields to lowercase. | - |
| PIIAnonymizeRefiner | Pre-training | Anonymizes Personally Identifiable Information (PII), such as names and locations, to protect privacy. | Code |
| RemovePunctuationRefiner | NLP | Removes punctuation from text. | - |
| RemoveNumberRefiner | NLP | Removes numeric characters from text. | - |
| RemoveExtraSpacesRefiner | NLP, Pre-training | Replaces multiple consecutive spaces with a single space and trims leading/trailing spaces. | - |
| RemoveRepetitionsPunctuationRefiner | NLP | Removes repeated punctuation, e.g., "!!!" becomes "!". | - |
| RemoveEmojiRefiner | Pre-training | Removes emojis from text, e.g., "😀". | Code |
| RemoveEmoticonsRefiner | Pre-training | Removes emoticons such as ":-)", using a predefined list. | Code |
| RemoveContractionsRefiner | NLP | Expands contractions in text, e.g., "can't" becomes "cannot". | Code |
| HtmlUrlRemoverRefiner | Pre-training | Removes URLs and HTML tags from text. | - |
| TextNormalizationRefiner | NLP | Normalizes formats for dates, currencies, etc., in text. | - |
| NERRefiner | NLP | Uses Named Entity Recognition (NER) to identify and mask specific entities in text. | Code |
| StemmingLemmatizationRefiner | NLP | Performs stemming or lemmatization on text. | Code |
| SpellingCorrectionRefiner | NLP, Pre-training | Corrects spelling errors in text using SymSpell. | Code |
| RemoveStopwordsRefiner | NLP | Removes stopwords (e.g., "the", "is") from text. | Code |
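To make the refiner concept concrete, the behavior of an operator like RemoveExtraSpacesRefiner can be sketched in a few lines of plain Python. This is an illustrative sketch only, not DataFlow's actual implementation; the function name `remove_extra_spaces` is chosen for this example.

```python
import re

def remove_extra_spaces(text: str) -> str:
    """Collapse any run of whitespace into a single space and trim both ends.

    Mirrors what a whitespace-normalizing refiner does to one data point:
    the content is cleaned, but the number of data points is unchanged.
    """
    return re.sub(r"\s+", " ", text).strip()

print(remove_extra_spaces("  multiple   spaces\t and  tabs  "))
# -> "multiple spaces and tabs"
```

A real refiner would apply such a transformation to a configured text field across every row of the dataset.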
Deduplicators
| Name | Type | Description | Repository or Paper |
|---|---|---|---|
| HashDeduplicator | Exact Deduplication | Uses various hash functions (e.g., MD5, SHA256, XXH3_128) to remove duplicate data based on exact hash value comparison. Suitable for small-scale simple deduplication. | - |
| CCNetDeduplicator | Exact Deduplication | Compares the first 64 bits of the SHA-1 hash to identify duplicate text, balancing security and computational efficiency. | - |
| NgramHashDeduplicator | Near Deduplication | Combines n-gram techniques with hashing to detect duplicates based on multiple hash comparisons of n-gram segments. Useful for identifying near-duplicates. | Paper |
| SemDeduplicator | Near Deduplication | Uses semantic similarity based on BERT embeddings and cosine similarity to detect duplicates. Ideal for detecting semantically similar but differently phrased text. | Paper Code |
| SimHashDeduplicator | Near Deduplication | Uses the SimHash algorithm to detect similar text based on Hamming distance of fingerprints. Efficient for large-scale data deduplication. | Paper |
| MinHashDeduplicator | Near Deduplication | Combines MinHash and LSH to compare sets with minimal memory usage and computation cost, detecting similarity between sets. | Paper |
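The near-deduplication idea behind MinHashDeduplicator can be illustrated with a small, self-contained MinHash sketch. This is a didactic example under simplifying assumptions (word 3-gram shingles, MD5-based permutations, no LSH banding), not DataFlow's implementation:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash_signature(items: set, num_perm: int = 64) -> list:
    """For each of num_perm seeded hash functions, keep the minimum hash value.

    Two sets agree on a given position with probability equal to their
    Jaccard similarity, which is what makes the signature comparable.
    """
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
print(estimated_jaccard(a, b))  # high value: near-duplicate texts
```

A production deduplicator additionally buckets signatures with LSH so that only candidate pairs, not all pairs, are compared.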
Filters
| Name | Applicable Type | Description | Repository or Paper |
|---|---|---|---|
| GeneralFilter | Any DataFrame | Supports flexible filtering of the DataFrame using one or more custom lambda functions. | - |
| LanguageFilter | Pre-training, SFT | Filters specific languages using the fasttext language identification model. | Huggingface |
| BlocklistFilter | Pre-training, SFT | Filters data points using a blocklist (e.g., List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words). | Code |
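The lambda-based filtering style that GeneralFilter supports can be sketched with plain pandas. Note this is a conceptual sketch, not DataFlow's exact API; the `rules` list and column names here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["short", "a much longer data point", ""],
    "score": [0.2, 0.9, 0.5],
})

# Each rule maps the DataFrame to a boolean mask; a row survives
# only if it passes every rule.
rules = [
    lambda d: d["text"].str.len() > 0,  # drop empty text
    lambda d: d["score"] >= 0.5,        # keep high-scoring rows
]

mask = pd.Series(True, index=df.index)
for rule in rules:
    mask &= rule(df)

filtered = df[mask]
print(filtered)  # only the row passing both rules remains
```

Composing small boolean rules this way keeps each criterion independently testable while the operator handles applying them in sequence.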
Additionally, Open-DataFlow-Eval supports filtering data points based on scores from single data point scorers, with 18 supported scorers.
```yaml
DeitaQualityFilter:
  min_score: 1
  max_score: 5
  scorer_args:
    device: 'cuda:0'
    model_name: 'hkust-nlp/deita-quality-scorer'
    max_length: 512
```

You can set min/max scores and scorer parameters in `scorer_args` for filtering. For more information on supported scorers, refer to the evaluation algorithm documentation (excluding the Diversity part).
In addition, heuristic rule filtering plays an important role in screening pre-training data. Our development here was greatly inspired by the Dingo Data Quality Evaluation Tool: we have integrated 22 of the rule-based filtering algorithms used in Dingo into dataflow/operators/filter/GeneralText/heuristics.py, where the filter names can also be found. For details, please refer to the Rules Documentation.
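As an illustration of what a heuristic rule looks like, the sketch below checks mean word length, a common signal for detecting gibberish or boilerplate in pre-training corpora. This is a generic example with invented thresholds, not a rule copied from Dingo or from heuristics.py:

```python
def mean_word_length_ok(text: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    """Heuristic rule: keep text whose mean word length is in [lo, hi].

    Very low means suggest fragmented tokens; very high means suggest
    unsegmented strings (URLs, hashes, concatenated junk).
    """
    words = text.split()
    if not words:
        return False  # empty data points are filtered out
    mean = sum(len(w) for w in words) / len(words)
    return lo <= mean <= hi

print(mean_word_length_ok("a normal english sentence here"))      # passes
print(mean_word_length_ok("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"))  # filtered
```

Rule-based filters like this are cheap to run at scale, which is why they are typically applied before any model-based scoring.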
All 42 data filters mentioned above share the same YAML invocation method.

