General Data Processing Operators
2025-06-09
Overview
DataFlow currently supports text data processing at the data point level, categorized into three types: refiners, deduplicators, and filters.
| Type | Count | Description |
|---|---|---|
| Refiners | 16 | Improves the content of data points through processing and augmentation without altering the total count. |
| Deduplicators | 6 | Removes duplicate data points using methods such as hashing. |
| Filters | 42 | Filters data points based on thresholds and other criteria. |
Refiners
| Name | Applicable Type | Description | Repository or Paper |
|---|---|---|---|
| CondorRefiner | SFT | Generates evaluations and rewrites of SFT responses using LLM APIs to improve QA quality. | Paper |
| LowercaseRefiner | NLP | Converts text fields to lowercase. | - |
| PIIAnonymizeRefiner | Pre-training | Anonymizes Personally Identifiable Information (PII), such as names and locations, to protect privacy. | Code |
| RemovePunctuationRefiner | NLP | Removes punctuation from text. | - |
| RemoveNumberRefiner | NLP | Removes numeric characters from text. | - |
| RemoveExtraSpacesRefiner | NLP, Pre-training | Replaces multiple consecutive spaces with a single space and trims leading/trailing spaces. | - |
| RemoveRepetitionsPunctuationRefiner | NLP | Removes repeated punctuation, e.g., "!!!" becomes "!". | - |
| RemoveEmojiRefiner | Pre-training | Removes emojis from text, e.g., "😀". | Code |
| RemoveEmoticonsRefiner | Pre-training | Removes emoticons such as ":-)", using a predefined list. | Code |
| RemoveContractionsRefiner | NLP | Expands contractions in text, e.g., "can't" becomes "cannot". | Code |
| HtmlUrlRemoverRefiner | Pre-training | Removes URLs and HTML tags from text. | - |
| TextNormalizationRefiner | NLP | Normalizes formats for dates, currencies, etc., in text. | - |
| NERRefiner | NLP | Uses Named Entity Recognition (NER) to identify and mask specific entities in text. | Code |
| StemmingLemmatizationRefiner | NLP | Performs stemming or lemmatization on text. | Code |
| SpellingCorrectionRefiner | NLP, Pre-training | Corrects spelling errors in text using SymSpell. | Code |
| RemoveStopwordsRefiner | NLP | Removes stopwords (e.g., "the", "is") from text. | Code |
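To make the refiner concept concrete, the behavior of an operator like RemoveExtraSpacesRefiner can be sketched in a few lines of plain Python. This is an illustrative sketch only, not DataFlow's actual implementation; the function name `remove_extra_spaces` is chosen for this example.

```python
import re

def remove_extra_spaces(text: str) -> str:
    """Collapse any run of whitespace into a single space and trim both ends.

    Mirrors what a whitespace-normalizing refiner does to one data point:
    the content is cleaned, but the number of data points is unchanged.
    """
    return re.sub(r"\s+", " ", text).strip()

print(remove_extra_spaces("  multiple   spaces\t and  tabs  "))
# -> "multiple spaces and tabs"
```

A real refiner would apply such a transformation to a configured text field across every row of the dataset.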
Deduplicators
| Name | Type | Description | Repository or Paper |
|---|---|---|---|
| HashDeduplicator | Exact Deduplication | Uses various hash functions (e.g., MD5, SHA256, XXH3_128) to remove duplicate data based on exact hash value comparison. Suitable for small-scale simple deduplication. | - |
| CCNetDeduplicator | Exact Deduplication | Compares the first 64 bits of the SHA-1 hash to identify duplicate text, balancing security and computational efficiency. | - |
| NgramHashDeduplicator | Near Deduplication | Combines n-gram techniques with hashing to detect duplicates based on multiple hash comparisons of n-gram segments. Useful for identifying near-duplicates. | Paper |
| SemDeduplicator | Near Deduplication | Uses semantic similarity based on BERT embeddings and cosine similarity to detect duplicates. Ideal for detecting semantically similar but differently phrased text. | Paper Code |
| SimHashDeduplicator | Near Deduplication | Uses the SimHash algorithm to detect similar text based on Hamming distance of fingerprints. Efficient for large-scale data deduplication. | Paper |
| MinHashDeduplicator | Near Deduplication | Combines MinHash and LSH to compare sets with minimal memory usage and computation cost, detecting similarity between sets. | Paper |
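The near-deduplication idea behind MinHashDeduplicator can be illustrated with a small, self-contained MinHash sketch. This is a didactic example under simplifying assumptions (word 3-gram shingles, MD5-based permutations, no LSH banding), not DataFlow's implementation:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Split text into overlapping word n-grams (shingles)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash_signature(items: set, num_perm: int = 64) -> list:
    """For each of num_perm seeded hash functions, keep the minimum hash value.

    Two sets agree on a given position with probability equal to their
    Jaccard similarity, which is what makes the signature comparable.
    """
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
print(estimated_jaccard(a, b))  # high value: near-duplicate texts
```

A production deduplicator additionally buckets signatures with LSH so that only candidate pairs, not all pairs, are compared.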
Filters
| Name | Applicable Type | Description | Repository or Paper |
|---|---|---|---|
| GeneralFilter | Any DataFrame | Supports flexible filtering of the DataFrame using one or more custom lambda functions. | - |
| LanguageFilter | Pre-training, SFT | Filters specific languages using the fasttext language identification model. | Huggingface |
| BlocklistFilter | Pre-training, SFT | Filters data points using a blocklist (e.g., List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words). | Code |
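The lambda-based filtering style that GeneralFilter supports can be sketched with plain pandas. Note this is a conceptual sketch, not DataFlow's exact API; the `rules` list and column names here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["short", "a much longer data point", ""],
    "score": [0.2, 0.9, 0.5],
})

# Each rule maps the DataFrame to a boolean mask; a row survives
# only if it passes every rule.
rules = [
    lambda d: d["text"].str.len() > 0,  # drop empty text
    lambda d: d["score"] >= 0.5,        # keep high-scoring rows
]

mask = pd.Series(True, index=df.index)
for rule in rules:
    mask &= rule(df)

filtered = df[mask]
print(filtered)  # only the row passing both rules remains
```

Composing small boolean rules this way keeps each criterion independently testable while the operator handles applying them in sequence.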
Additionally, Open-DataFlow-Eval supports filtering data points based on scores from single data point scorers, with 18 supported scorers.
```yaml
DeitaQualityFilter:
  min_score: 1
  max_score: 5
  scorer_args:
    device: 'cuda:0'
    model_name: 'hkust-nlp/deita-quality-scorer'
    max_length: 512
```

You can set min/max scores and scorer parameters in `scorer_args` for filtering. For more information on supported scorers, refer to the evaluation algorithm documentation (excluding the Diversity part).
In addition, heuristic rule filtering plays an important role in screening pre-training data. Our development here was greatly inspired by the Dingo Data Quality Evaluation Tool: we have integrated 22 of the rule-based filtering algorithms used in Dingo into dataflow/operators/filter/GeneralText/heuristics.py, where the filter names can also be found. For details, please refer to the Rules Documentation.
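As an illustration of what a heuristic rule looks like, the sketch below checks mean word length, a common signal for detecting gibberish or boilerplate in pre-training corpora. This is a generic example with invented thresholds, not a rule copied from Dingo or from heuristics.py:

```python
def mean_word_length_ok(text: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    """Heuristic rule: keep text whose mean word length is in [lo, hi].

    Very low means suggest fragmented tokens; very high means suggest
    unsegmented strings (URLs, hashes, concatenated junk).
    """
    words = text.split()
    if not words:
        return False  # empty data points are filtered out
    mean = sum(len(w) for w in words) / len(words)
    return lo <= mean <= hi

print(mean_word_length_ok("a normal english sentence here"))      # passes
print(mean_word_length_ok("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"))  # filtered
```

Rule-based filters like this are cheap to run at scale, which is why they are typically applied before any model-based scoring.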
All 42 data filters mentioned above share the same YAML invocation method.

