Agent-Quick Start


2025-06-19

This guide will help you quickly get started with the 5 core functional modules of the DataFlow Agent platform.

1. Pipeline Recommendation

Feature Overview

Automatically recommends and generates a suitable DataFlow Pipeline from the user's natural-language description, including operator selection, parameter configuration, and code generation.

Use Cases

Quickly build data processing workflows
Get intelligent recommendations when you are unfamiliar with specific operators
Automated Pipeline generation

Input Parameters

Basic Configuration

Target Description (Required)
- Describe the data processing goal you want to achieve
- Example: "Give me 5 logically consistent operators, filter and deduplicate!"
- Example: "Clean, deduplicate, and classify text data"
Input JSONL File Path (Required)
- Data file for testing the Pipeline
- Format: One JSON object per line
- Default: {project_root}/tests/test.jsonl
Session ID
- Session identifier for caching and tracking
- Default: "default"
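
The input file format described above (one JSON object per line) can be produced and checked with a few lines of Python. This is a minimal sketch; the helper names and the sample records are hypothetical and not part of DataFlow.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL layout described above)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample records; real test data will have its own fields.
sample = [{"text": "hello world"}, {"text": "hello world"}, {"text": "another line"}]
jsonl_path = Path(tempfile.mkdtemp()) / "test.jsonl"
write_jsonl(jsonl_path, sample)
loaded = read_jsonl(jsonl_path)
```

Reading the file back before pointing the Agent at it is a cheap way to catch malformed lines early.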

API Configuration

Primary Model Configuration

Chat API URL: LLM service address
- Default: http://123.129.219.111:3000/v1/
API Key: Access key
Model Name: e.g., gpt-4o, qwen-max, llama3, etc.
- Default: gpt-4o

Embedding Model Configuration

Embedding API URL: Embedding model service address (optional, uses primary API if empty)
Embedding Model Name: e.g., text-embedding-3-small
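
The fallback rule for the embedding URL (use the primary API when the field is empty) might look like the sketch below. The function name is hypothetical; the platform's actual resolution logic is not shown in this guide.

```python
def resolve_embedding_url(chat_api_url, embedding_api_url=""):
    """Return the embedding URL, falling back to the chat URL when it is empty."""
    embedding_api_url = (embedding_api_url or "").strip()
    return embedding_api_url or chat_api_url
```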

Debug Configuration

Enable Debug Mode: Whether to enable automatic debugging and fixing
Debug Mode Execution Count: 1-10 times, default 2

Output Results

1. Pipeline Code (Generated Code)

# Auto-generated Python code
# Contains complete Pipeline definition and execution logic

2. Execution Log

Detailed log of Pipeline execution process
Contains execution status of each operator
Error messages and debug information

3. Agent Results

{
  "recommender": {...},
  "pipeline_builder": {...},
  "operator_executor": {...}
}

Detailed execution results of each Agent node
Includes recommended operator list, build process, etc.

Usage Steps

Enter your requirements in the "Target Description" box
Configure API information (URL, Key, Model)
(Optional) Configure embedding model and debug options
Click "Generate Pipeline" button
View generated code and execution results

2. Operator Development

Feature Overview

Automatically generates new DataFlow operator code based on user requirements, including operator implementation, test code, and debugging.

Use Cases

Create custom data processing operators
Extend DataFlow functionality
Rapid prototyping

Input Parameters

Basic Configuration

Target Description (Required)
- Describe the operator's functionality and purpose
- Example: "Create an operator for sentiment analysis of text"
- Example: "Implement a data deduplication operator supporting multi-field combination deduplication"
Operator Category
- Category the operator belongs to, used to match similar operators as reference
- Default: "Default"
- Options: "filter", "mapper", "aggregator", etc.
Test Data File Path (JSONL)
- Data file for testing the operator
- Default: {project_root}/tests/test.jsonl

API Configuration

Chat API URL: LLM service address
API Key: Access key (uses environment variable DF_API_KEY if empty)
Model Name: Default gpt-4o
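
The API key fallback to the DF_API_KEY environment variable can be sketched as follows. Only the variable name comes from this guide; the function and its error message are assumptions for illustration.

```python
import os


def resolve_api_key(form_value="", env=None):
    """Prefer the key entered in the form; otherwise fall back to DF_API_KEY."""
    env = os.environ if env is None else env
    key = (form_value or "").strip() or env.get("DF_API_KEY", "")
    if not key:
        raise ValueError("No API key: fill in the field or set DF_API_KEY")
    return key
```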

Advanced Configuration

Output Language: en (English) or zh (Chinese)
Enable Debug Mode: Automatically execute and fix code errors
Maximum Debug Rounds: 1-10 times, default 3
Output File Path: Location to save generated code (optional)
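
Debug mode with a bounded number of rounds is essentially an execute-and-repair loop. The sketch below illustrates the idea under stated assumptions: `execute` and `fix` are hypothetical callables standing in for the operator executor and the LLM-based fixer, and the toy versions exist only to make the example runnable.

```python
def run_with_debug(code, execute, fix, max_rounds=3):
    """Run `code`; on failure ask `fix` for a new version, up to max_rounds runs."""
    if not 1 <= max_rounds <= 10:
        raise ValueError("max_rounds must be between 1 and 10")
    history = []
    for round_no in range(1, max_rounds + 1):
        ok, stderr = execute(code)
        history.append({"round": round_no, "success": ok, "stderr": stderr})
        if ok:
            break
        if round_no < max_rounds:
            # Only repair when another run is left to verify the repair.
            code = fix(code, stderr)
    return code, history


# Toy stand-ins (hypothetical): a local executor and a trivial "fixer".
def toy_execute(code):
    try:
        exec(code, {})
        return True, ""
    except Exception as exc:
        return False, repr(exc)


def toy_fix(code, stderr):
    return code.replace("1 / 0", "1 / 1")


final_code, history = run_with_debug("x = 1 / 0", toy_execute, toy_fix, max_rounds=3)
```

A higher round limit gives the fixer more chances but also costs more LLM calls, which is why the range is capped.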

Output Results

1. Generated Code

# Complete operator implementation code
class YourOperator(Operator):
    def __init__(self, ...):
        ...
    
    def run(self, dataset, ...):
        ...

2. Matched Operators

[
  {
    "op_name": "similar_operator_1",
    "similarity": 0.85,
    "description": "..."
  }
]

List of similar operators matched by the system
Used as reference and learning material
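
This guide does not say how the similarity score is computed. One common approach, shown here purely as an assumption, is cosine similarity between an embedding of the target description and an embedding of each operator description. The vectors below are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_operators(query_vec, operator_vecs, top_k=3):
    """Rank operators by similarity to the query, highest first."""
    scored = [
        {"op_name": name, "similarity": round(cosine_similarity(query_vec, vec), 2)}
        for name, vec in operator_vecs.items()
    ]
    return sorted(scored, key=lambda item: item["similarity"], reverse=True)[:top_k]


# Toy 2-dimensional vectors for illustration only.
toy_vectors = {
    "similar_operator_1": [1.0, 0.0],
    "similar_operator_2": [0.0, 1.0],
}
matches = match_operators([1.0, 0.0], toy_vectors, top_k=1)
```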

3. Execution Results

{
  "success": true,
  "output": {...},
  "stderr": "",
  "stdout": "..."
}

Operator execution status
Output data preview
Error messages (if any)
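
A result with the `success`, `stdout`, and `stderr` fields shown above can be produced by running code in a child process. This is a sketch of that general technique using the standard library, not the platform's actual executor; `run_snippet` is a hypothetical name.

```python
import subprocess
import sys


def run_snippet(code, timeout=30):
    """Run Python source in a child process and capture its output streams."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "success": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


ok_result = run_snippet("print('hello')")
bad_result = run_snippet("raise ValueError('boom')")
```

When `success` is false, `stderr` carries the traceback, which is the first thing to read before adjusting parameters or enabling debug mode.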

4. Debug Information

{
  "round": 2,
  "input_key": "text",
  "available_keys": ["text", "label"],
  "stdout": "...",
  "stderr": "..."
}

Detailed information about the debugging process
Input and output of each debug round

5. Agent Results

Execution details of each Agent node
Includes matching, writing, execution, debugging phases

6. Execution Log

Complete execution process log
Contains detailed information of all phases

Usage Steps

Describe operator functionality in detail in "Target Description"
Select appropriate operator category
Configure API information
(Optional) Enable debug mode to automatically fix errors
Click "Generate Operator" button
View generated code and test results
If modifications are needed, adjust the parameters and regenerate

3. Manual Orchestration

Feature Overview

Manually select and assemble operators through a visual interface to build custom Pipelines, with support for drag-and-drop sorting and parameter configuration.

Use Cases

Precise control of Pipeline structure
Reuse existing operators
Rapid prototype validation
Learn operator usage methods

Input Parameters

API and File Configuration

Chat API URL: LLM service address
API Key: Access key
Model Name: Default gpt-4o
Input JSONL File Path: Test data file

Operator Selection and Configuration

Step 1: Select Operator

Select category from "Operator Category" dropdown
- e.g., filter, mapper, deduplicator, etc.
Select specific operator from "Operator" dropdown
- System automatically displays parameter description for the operator

Step 2: Configure Parameters

Prompt Template (Optional)
- If operator supports Prompt templates, a dropdown selector will appear
- Automatically updates to __init__() parameters after selection

__init__() Parameters (JSON Format)

{
  "param1": "value1",
  "param2": 123,
  "prompt_template": "module.PromptClass"
}

Operator initialization parameters
Must be valid JSON object

run() Parameters (JSON Format)

{
  "input_key": "text",
  "output_key": "processed_text",
  "batch_size": 32
}

Operator runtime parameters
Must be valid JSON object
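
Both parameter boxes must contain a valid JSON object, so it can help to validate the text before pasting it in. The sketch below is one way to do that; the function name is hypothetical, and treating an empty box as `{}` is an assumption.

```python
import json


def parse_params(raw, field_name):
    """Parse a parameter box; the content must be a JSON object (a dict)."""
    text = (raw or "").strip() or "{}"  # assumption: empty input means no parameters
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{field_name}: invalid JSON ({exc.msg})") from exc
    if not isinstance(value, dict):
        raise ValueError(
            f"{field_name}: expected a JSON object, got {type(value).__name__}"
        )
    return value


run_params = parse_params('{"input_key": "text", "batch_size": 32}', "run()")
```

Note that a JSON array or a bare string parses successfully but is still rejected, because the operator expects named parameters.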

Step 3: Add to Pipeline

Click "➕ Add Operator to Pipeline" button
Operator will be added to Pipeline sequence

Step 4: Adjust Order

In Pipeline visualization area, drag operator cards to adjust order
System automatically renumbers

Step 5: Auto-linking

System automatically analyzes input/output relationships between operators
Displays link status:
- 🔗 Linked: Output key successfully matched to next operator's input
- ⚠️ Pending: Input is empty or unmatched
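
The link status described above can be approximated by comparing each operator's output key with the next operator's input key. This is a simplified sketch under that assumption; the real analysis may consider more than the `input_key` and `output_key` fields.

```python
def link_status(pipeline):
    """Return 'linked' or 'pending' for each step after the first."""
    statuses = []
    for prev, cur in zip(pipeline, pipeline[1:]):
        produced = prev["run_params"].get("output_key")
        wanted = cur["run_params"].get("input_key")
        # Linked only when the input is non-empty and matches the previous output.
        linked = bool(wanted) and wanted == produced
        statuses.append("linked" if linked else "pending")
    return statuses


# Hypothetical three-step pipeline; operator names are placeholders.
toy_pipeline = [
    {"op_name": "A", "run_params": {"input_key": "text", "output_key": "clean_text"}},
    {"op_name": "B", "run_params": {"input_key": "clean_text", "output_key": "label"}},
    {"op_name": "C", "run_params": {"input_key": "", "output_key": "score"}},
]
statuses = link_status(toy_pipeline)
```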

Output Results

1. Current Pipeline (Visual Display)

Each operator displayed as a card, containing:
- Step number
- Operator name
- __init__() parameter preview
- run() parameter preview
- Connection status with previous step

2. Current Pipeline (JSON Format)

[
  {
    "op_name": "TextCleanerOperator",
    "init_params": {...},
    "run_params": {...},
    "_incoming_links": [
      {
        "input_key": "text",
        "value": "raw_text",
        "output_keys": ["output"]
      }
    ]
  }
]

3. Generated Code

# Complete Pipeline execution code
from dataflow import Dataset
from dataflow.operators import *

# Load data
dataset = Dataset.load("input.jsonl")

# Execute Pipeline
dataset = TextCleanerOperator(...).run(dataset, ...)
dataset = DeduplicatorOperator(...).run(dataset, ...)
...

# Save results
dataset.save("output.jsonl")

4. Processing Result Data (First 100 Records)

[
  {"text": "processed text 1", "label": "A"},
  {"text": "processed text 2", "label": "B"},
  ...
]

5. Output File Path

Location where processed data is saved

Usage Steps

1. Configure API information and input file path
2. Select operator category and specific operator
3. Edit __init__() and run() parameters (JSON format)
4. Click "➕ Add Operator to Pipeline"
5. Repeat steps 2-4 to add more operators
6. Drag to adjust operator order (optional)
7. Check auto-link status, ensure parameters are correct
8. Click "🚀 Run Pipeline"
9. View generated code and execution results

Advanced Tips

Clear Pipeline: Click "🗑️ Clear Pipeline" button
Parameter Reuse: System automatically links previous operator's output key to next operator's input
Debugging: If execution fails, check error messages in log, adjust parameters and retry

4. Prompt Template Generation

Use Cases

Create high-quality Prompt templates for operators
Optimize existing Prompt effectiveness
Rapid Prompt design iteration
Generate test code and data

Input Parameters

Runtime Configuration

Chat API Base URL: LLM service address
- Default: http://123.129.219.111:3000/v1/
Chat API Key: Access key
Model: Model name, default gpt-4o
Language: Prompt language, zh (Chinese) or en (English)

Prompt Configuration

Task Description (Required)
- Describe in detail the task the Prompt should complete
- Example: "Perform sentiment analysis on user input text, determine if positive, negative, or neutral"
- Example: "Rewrite product descriptions into more attractive marketing copy"
Operator Name (op-name) (Required)
- Name of the Prompt class
- Example: SentimentAnalysisPrompt
- Example: MarketingCopywriterPrompt

Output Format (Optional)

Specify the format of Prompt output

Example:

{
  "sentiment": "positive/negative/neutral",
  "confidence": 0.95
}

Parameter List (Optional)
- Parameters needed by Prompt template, separated by comma, space, or newline
- Example: text, language, style
- Example:
```
input_text
target_audience
tone
```
File Output Root Path (Optional)
- Directory to save generated files
- Default: ./pa_cache
Delete Test Files After Generation
- Whether to delete the test files once generation finishes (the path placeholder is kept)
- Default: Enabled
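
The Parameter List field accepts names separated by commas, spaces, or newlines. A sketch of how such input could be split is shown below; the function name is hypothetical, and dropping duplicates is an added assumption.

```python
import re


def parse_parameter_list(raw):
    """Split on commas and any whitespace, dropping empty entries and duplicates."""
    names = [name for name in re.split(r"[,\s]+", raw.strip()) if name]
    return list(dict.fromkeys(names))  # keeps the first occurrence, preserves order


inline = parse_parameter_list("text, language, style")
multiline = parse_parameter_list("input_text\ntarget_audience\ntone")
```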

Output Results

1. Prompt File Path

Location of generated Prompt template file
Example: ./pa_cache/prompts/SentimentAnalysisPrompt.py

2. Test Data File Path

Auto-generated test data file
Example: ./pa_cache/test_data/test_data.jsonl

3. Test Code File Path

Auto-generated test code
Example: ./pa_cache/tests/test_prompt.py

4. Test Data Preview

[
  {"text": "This product is great!", "language": "en"},
  {"text": "Quality is terrible", "language": "en"},
  {"text": "It's okay", "language": "en"}
]

5. Test Results Preview

[
  {
    "input": {"text": "This product is great!"},
    "output": {
      "sentiment": "positive",
      "confidence": 0.92
    }
  }
]

6. Prompt Code Preview

from dataflow_agent.promptstemplates import PromptTemplate

class SentimentAnalysisPrompt(PromptTemplate):
    """Sentiment Analysis Prompt Template"""
    
    def __init__(self):
        super().__init__()
        self.system_prompt = "You are a sentiment analysis expert..."
        self.user_prompt_template = "Please analyze the sentiment of the following text: {text}"
    
    def format(self, text: str, **kwargs) -> str:
        return self.user_prompt_template.format(text=text)

7. Test Code Preview

import json
from your_prompt import SentimentAnalysisPrompt

# Load test data
with open("test_data.jsonl", encoding="utf-8") as f:
    test_data = [json.loads(line) for line in f if line.strip()]

# Test Prompt
prompt = SentimentAnalysisPrompt()
for item in test_data:
    result = prompt.format(**item)
    print(result)

Multi-round Rewriting Feature

In the right-side conversation area, you can:

View Initial Generation Results
- Prompt code
- Test results
Propose Improvements
- Describe the changes you want in the conversation input box
- Examples:
  - "Add recognition of sarcastic tone"
  - "Change output format to return only positive/negative/neutral string"
  - "Add confidence threshold, return uncertain when below 0.7"
Send Rewrite Instructions
- Click "Send Rewrite Instruction" button
- System regenerates Prompt based on feedback
Iterative Optimization
- View updated code and test results
- Continue proposing improvements
- Repeat until satisfied
Clear Session
- Click "Clear Session" button to start over

Usage Steps

Initial Generation

Configure API information (URL, Key, Model)
Fill in task description and operator name
(Optional) Specify output format and parameter list
Click "Generate Prompt Template" button
View generated Prompt code and test results

Multi-round Optimization

1. Enter improvement suggestions in right-side dialog box
2. Click "Send Rewrite Instruction"
3. View updated code and test results
4. Repeat steps 1-3 until satisfied

Using Generated Prompt

Get file location from "Prompt File Path"
Import Prompt class into your operator
Specify prompt_template in operator's __init__()
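
The `prompt_template` value is written as a dotted path such as "module.PromptClass". How DataFlow resolves that path is not shown in this guide; the sketch below uses the standard `importlib` approach as an assumption, and demonstrates it on a standard-library class so it runs anywhere.

```python
import importlib


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class; a generated Prompt class
# would be loaded the same way once its module is importable.
ordered_dict_cls = load_class("collections.OrderedDict")
```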

5. Web Collection and Conversion

Use Cases

Quickly build training datasets
Collect domain-specific data
Dataset format conversion
Batch download and processing

Input Parameters

Collection Configuration

Target Description (Required)
- Describe the type of data you want to collect
- Example: "Collect Python code example datasets"
- Example: "Collect Chinese conversation data for training chatbots"
- Example: "Collect image classification datasets with cat and dog pictures"
Data Category
- PT: Pre-Training data
- SFT: Supervised Fine-Tuning data
- Default: SFT
Dataset Quantity Limit (Per Keyword)
- Number of datasets returned per search keyword
- Range: 1-50
- Default: 5
- Note: For reference only, actual quantity may vary based on search results
Dataset Size Range
- Filter datasets by size range
- Options:
  - n<1K: Less than 1000 records
  - 1K<n<10K: 1000-10000 records
  - 10K<n<100K: 10000-100000 records
  - 100K<n<1M: 100000-1000000 records
  - n>1M: More than 1000000 records
- Default: 1K<n<10K
Download Subtask Limit
- Limit the number of download tasks finally executed
- Leave empty for no limit
- Used to control download scale and time
Maximum Dataset Size
- Size limit for single dataset
- Enter value then select unit (B/KB/MB/GB/TB)
- Leave empty for no limit
Download Directory
- Root directory for data storage
- Default: downloaded_data
Prompt Language
- zh: Chinese
- en: English
- Default: zh
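
To pick a Dataset Size Range for a known record count, the buckets listed above can be expressed as a small lookup. The labels come from this guide; how exact boundary values such as 1000 are treated is an assumption (here the lower bound is inclusive).

```python
def size_bucket(n):
    """Map a record count to one of the size-range labels listed above."""
    if n < 1_000:
        return "n<1K"
    if n < 10_000:
        return "1K<n<10K"
    if n < 100_000:
        return "10K<n<100K"
    if n < 1_000_000:
        return "100K<n<1M"
    return "n>1M"
```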

LLM Configuration

CHAT_API_URL: LLM service address
- Default: http://123.129.219.111:3000/v1/chat/completions
CHAT_API_KEY: Access key
CHAT_MODEL: Model name
- Default: deepseek-chat

Other Environment Configuration

HF_ENDPOINT: HuggingFace mirror address
- Default: https://hf-mirror.com
KAGGLE_USERNAME: Kaggle username
KAGGLE_KEY: Kaggle API key
TAVILY_API_KEY: Tavily search API key

RAG Configuration

RAG_EBD_MODEL: Embedding model name
- Default: text-embedding-3-large
RAG_API_URL: RAG service address
RAG_API_KEY: RAG API key
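
The environment variables above, with their documented defaults, could be gathered as in the following sketch. The variable names and default values are taken from this guide; `load_settings` itself is a hypothetical helper, not a DataFlow function.

```python
import os

# Defaults as documented above.
DEFAULTS = {
    "CHAT_API_URL": "http://123.129.219.111:3000/v1/chat/completions",
    "CHAT_MODEL": "deepseek-chat",
    "HF_ENDPOINT": "https://hf-mirror.com",
    "RAG_EBD_MODEL": "text-embedding-3-large",
}
# Variables without a documented default; empty string when unset.
OPTIONAL_KEYS = [
    "CHAT_API_KEY", "KAGGLE_USERNAME", "KAGGLE_KEY",
    "TAVILY_API_KEY", "RAG_API_URL", "RAG_API_KEY",
]


def load_settings(env=None):
    """Read settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name) or default for name, default in DEFAULTS.items()}
    settings.update({name: env.get(name, "") for name in OPTIONAL_KEYS})
    return settings


settings = load_settings({"CHAT_MODEL": "gpt-4o"})
```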

Advanced Configuration (Collapsible)

Web Collection Advanced Configuration

Download Task Max Loop Count: 1-50, default 10
- Controls maximum retry count for each download task
Research Phase Max Loop Count: 1-50, default 15
- Maximum loop count for research phase, allows visiting more websites
Search Engine: tavily / duckduckgo / jina
- Default: tavily
Use Jina Reader: Whether to use Jina Reader to extract web content
- Default: Enabled
- Advantages: Fast, structured (Markdown format)
Enable RAG Enhancement: Whether to use RAG to refine content
- Default: Enabled
Parallel Page Processing Count: 1-20, default 5
- Number of pages processed in parallel
- Recommendation: 3-10 (adjust based on network and machine performance)
Disable Cache: Whether to disable HuggingFace and Kaggle cache
- Default: Enabled
- When enabled, uses temporary directory and auto-cleans after download
Temporary Directory: Custom temporary directory path
- Leave empty to use default temporary directory
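
The Parallel Page Processing Count corresponds to a worker-pool size. As an illustration only, the sketch below uses a thread pool from the standard library; `process_pages` and `fake_handler` are hypothetical, and the platform's own concurrency model may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(urls, handler, parallel=5):
    """Apply `handler` to each URL using at most `parallel` worker threads."""
    if not 1 <= parallel <= 20:
        raise ValueError("parallel must be between 1 and 20")
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(handler, urls))


def fake_handler(url):
    """Stand-in for fetching and parsing a page; performs no network access."""
    return {"url": url, "length": len(url)}


pages = process_pages(
    ["https://example.com/a", "https://example.com/b"], fake_handler, parallel=2
)
```

More workers shorten wall-clock time only until bandwidth or the remote site's rate limits become the bottleneck, which is why the recommended range is lower than the allowed maximum.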

Data Conversion Advanced Configuration

Conversion Model Temperature: 0.0-2.0, default 0.0
- Model temperature parameter during data conversion
Conversion Max Token Count: 512-8192, default 4096
- Maximum token count during data conversion
Max Sampling Length (Characters): 50-1000, default 200
- Maximum sampling length for each field
Sampling Record Count: 1-10, default 3
- Number of sampling records for analysis
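
The two sampling settings limit how much of a dataset is shown to the model during analysis. A sketch of that idea, assuming the first records are taken and string fields are truncated (the actual sampling strategy is not documented here):

```python
def sample_records(records, max_records=3, max_chars=200):
    """Take the first `max_records` records and truncate long string fields."""
    sampled = []
    for record in records[:max_records]:
        sampled.append({
            key: value[:max_chars] if isinstance(value, str) else value
            for key, value in record.items()
        })
    return sampled


# Hypothetical records for illustration.
rows = [{"text": "x" * 500, "id": i} for i in range(10)]
preview = sample_records(rows, max_records=3, max_chars=200)
```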

Output Results

1. Execution Log (Real-time Streaming Output)

============================================================
Starting Web Collection and Conversion Workflow
============================================================
Target: Collect Python code example datasets
Category: SFT
Download Directory: downloaded_data

【Web Collection Configuration】
  - Search Engine: tavily
  - Download Subtask Limit: No limit
  - Task Max Loop Count: 10
  - Research Phase Max Loop Count: 15
  - Use Jina Reader: Yes
  - Enable RAG: Yes
  - Parallel Pages: 5
  - Disable Cache: Yes

【Data Conversion Configuration】
  - Model Temperature: 0.0
  - Max Token Count: 4096
  - Max Sampling Length: 200
  - Sampling Record Count: 3

Dataset Size Limit: No limit
============================================================

2025-01-23 10:00:00 [INFO] Starting dataset search...
2025-01-23 10:00:05 [INFO] Found 15 candidate datasets
2025-01-23 10:00:10 [INFO] Starting download dataset 1/5...
2025-01-23 10:01:00 [INFO] Dataset 1 download complete
...
2025-01-23 10:15:00 [INFO] Starting data conversion...
2025-01-23 10:20:00 [INFO] Data conversion complete
Workflow execution complete!

2. Result Summary

{
  "download_dir": "downloaded_data",
  "processed_output": "downloaded_data/processed_output",
  "category": "SFT",
  "language": "zh",
  "chat_model": "deepseek-chat",
  "max_download_subtasks": null,
  "max_dataset_size_bytes": null,
  "max_dataset_size_unit": null,
  "max_dataset_size_value": null
}
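
The `max_dataset_size_bytes` field in the summary is derived from the value and unit entered under Maximum Dataset Size. The conversion might look like the sketch below. Whether the platform uses binary multiples (1 KB = 1024 B, as assumed here) or decimal ones is not stated in this guide.

```python
# Assumption: binary multiples. Swap in powers of 1000 if decimal units apply.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(value, unit):
    """Convert a size and unit to bytes; None means no limit."""
    if value is None:
        return None
    return int(value * UNIT_FACTORS[unit.upper()])
```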

Output File Structure

downloaded_data/
├── raw/                          # Raw downloaded data
│   ├── dataset_1/
│   │   ├── data.jsonl
│   │   └── metadata.json
│   ├── dataset_2/
│   └── ...
└── processed_output/             # Converted unified format data
    ├── combined.jsonl           # Combined data
    ├── train.jsonl              # Training set (if split)
    ├── validation.jsonl         # Validation set (if split)
    └── metadata.json            # Metadata information

Usage Steps

Basic Usage

Describe the type of data to collect in detail in "Target Description"
Select data category (PT or SFT)
Configure dataset quantity and size limits
Configure LLM API information
(Optional) Configure keys for Kaggle, Tavily, and other services
Click "Start Web Collection and Conversion" button
View execution log in real-time
View result summary after completion
Check collected data in download directory

Advanced Usage

Expand "⚙️ Advanced Configuration" area
Adjust according to needs:
- Search engine selection
- Parallel processing count
- Cache strategy
- Data conversion parameters
Execute collection task
Adjust parameters based on log to optimize results

Notes

API Keys
- Ensure necessary API keys are configured
- Tavily for search, Kaggle for downloading Kaggle datasets
Network Environment
- If you are in China, using the HuggingFace mirror is recommended
- Adjust parallel count to suit network bandwidth
Storage Space
- Ensure sufficient disk space
- Large datasets may require several GB of space
Execution Time
- The collection process may take anywhere from minutes to hours
- Limit the download task count to keep the run time under control
Data Quality
- Enabling RAG enhancement can improve data quality
- Adjust sampling parameters to balance quality and speed

FAQ

Q1: How to obtain API keys?

OpenAI/GPT: Visit OpenAI Platform
Tavily: Visit Tavily
Kaggle: Visit Kaggle Settings

Q2: How to choose the right model?

Quick Prototyping: gpt-3.5-turbo, deepseek-chat
High-Quality Output: gpt-4o, claude-3-opus
Chinese Optimization: qwen-max, deepseek-chat

Q3: What to do if Pipeline execution fails?

Check error messages in execution log
Confirm input data format is correct
Check operator parameter configuration
Enable debug mode for automatic fixing
View Agent results for detailed errors

Q4: How to improve data collection quality?

Use more precise target descriptions
Enable RAG enhancement
Adjust dataset size range
Increase sampling record count
Use more powerful LLM models

Q5: Can generated code be used directly?

Pipeline Recommendation: The code can be run directly, but validate it on test data first
Operator Development: Test the code first and adjust it manually if necessary
Manual Orchestration: The generated code has been tested and can be used directly
Prompt Templates: Run several rounds of optimization before using them in production