# Fields
How the worker generates new field definitions using AI, updates the dataset schema, and optionally backfills existing records with values for the new fields.
## Overview

The `ai_generate_fields` task type lets users add new fields to a dataset by describing what they need in natural language. The worker uses AI to generate appropriate field definitions (name, type, description), appends them to the dataset's schema, and can optionally populate existing records with values for the new fields.

**Schema-first approach:** Unlike `ai_generate_records`, which works with an existing schema to produce data rows, `ai_generate_fields` modifies the schema itself. The worker updates the `datasets.fields` column directly, so the new fields are immediately available for all future operations.
## Task Configuration

| Property | Value |
|---|---|
| Task Type | `ai_generate_fields` |
| Created By | `PATCH /api/datasets/:id` with `ai` body |
| Tables Used | `tasks`, `task_events`, `datasets` (update `fields`), `dataset_records` (backfill) |
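For illustration, the request body that creates this task might look like the following, sent to `PATCH /api/datasets/ds_abc123`. Nesting under an `ai` key is an assumption based on the "with `ai` body" note above; the fields mirror the input schema below:

```json
{
  "ai": {
    "prompt": "Add industry classification and founding year fields",
    "model": "gpt-5-nano",
    "backfill": true
  }
}
```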
### Input Schema

The task's `input` column contains the validated generation request. The API server pre-populates `existingFields` and `recordCount` from the dataset.
```json
{
  "prompt": "Add industry classification and founding year fields",
  "model": "gpt-5-nano",
  "temperature": 0.5,
  "web_search": {
    "enabled": true,
    "sources": 3,
    "include_urls": [],
    "exclude_urls": [],
    "x_search": false,
    "citations": true
  },
  "backfill": true,
  "datasetId": "ds_abc123",
  "existingFields": [
    { "name": "Company", "type": "text" },
    { "name": "Price", "type": "number" }
  ],
  "recordCount": 150
}
```
| Field | Type | Description |
|---|---|---|
| `prompt` | string | User prompt describing what fields to generate |
| `model` | string | Model ID to use for generation (determines provider) |
| `temperature` | number? | Sampling temperature (0-2) |
| `web_search` | object? | Web search config (same shape as `ai_generate_records`) |
| `backfill` | boolean | If true and records exist, generate values for the new fields across existing records |
| `datasetId` | string | Target dataset ID to update fields on |
| `existingFields` | `FieldDefinition[]` | Current field definitions on the dataset (for context) |
| `recordCount` | integer? | Number of existing records (helps estimate backfill work) |
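For reference, a minimal TypeScript sketch of this input shape. The type names are invented for illustration; the properties mirror the table above:

```typescript
// Hypothetical type names; shapes mirror the documented input schema.
interface WebSearchConfig {
  enabled: boolean;
  sources: number;
  include_urls: string[];
  exclude_urls: string[];
  x_search: boolean;
  citations: boolean;
}

interface FieldDefinition {
  name: string;
  type: string; // e.g. "text", "number", "integer", "boolean", "date"
  description?: string;
}

interface GenerateFieldsInput {
  prompt: string;
  model: string;
  temperature?: number; // 0-2
  web_search?: WebSearchConfig;
  backfill: boolean;
  datasetId: string;
  existingFields: FieldDefinition[];
  recordCount?: number;
}
```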
### Output Schema

Written to `task.output` when the task completes successfully:
```json
{
  "fieldsAdded": [
    { "name": "Industry", "type": "text", "description": "Industry classification" },
    { "name": "Founded", "type": "integer", "description": "Year company was founded" }
  ],
  "recordsBackfilled": 150,
  "tokenUsage": {
    "input_tokens": 9800,
    "output_tokens": 6200,
    "total_tokens": 16000
  }
}
```
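And a matching sketch of the output shape (type names again invented; `FieldDefinition` is from the input sketch above):

```typescript
// Hypothetical type names; shapes mirror the documented output schema.
interface TokenUsage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}

interface GenerateFieldsOutput {
  fieldsAdded: FieldDefinition[];
  recordsBackfilled: number;
  tokenUsage: TokenUsage;
}
```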
## Field Generation
The worker uses AI to translate the user's natural language prompt into structured field definitions:
- The existing field schema is passed as context so the AI understands the dataset's current structure.
- The AI generates new field definitions with a `name`, `type`, and `description` for each field.
- The worker validates that no generated field names collide with existing fields.
- Valid new fields are appended to the dataset's `fields` column in the `datasets` table.
**Supported field types:** The AI can generate fields of any type supported by the dataset schema, including `text`, `number`, `integer`, `boolean`, `date`, `url`, `email`, and more. Field naming follows the conventions of existing fields in the dataset.
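A minimal sketch of the collision check and append step might look like this. The helper name and the case-insensitive comparison are assumptions; `FieldDefinition` is the type from the input sketch:

```typescript
// Sketch: drop generated fields whose names collide with existing ones.
// Case-insensitive comparison is an assumption, not a documented rule.
function mergeNewFields(
  existing: FieldDefinition[],
  generated: FieldDefinition[],
): FieldDefinition[] {
  const taken = new Set(existing.map((f) => f.name.toLowerCase()));
  const accepted: FieldDefinition[] = [];

  for (const field of generated) {
    const key = field.name.toLowerCase();
    if (taken.has(key)) continue; // skip name collisions
    taken.add(key);
    accepted.push(field);
  }
  return accepted;
}

// The worker would then persist [...existing, ...accepted] to the
// dataset's `fields` column in the `datasets` table.
```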
## Backfill Pattern

When `backfill` is true and existing records are present, the worker populates the new fields across all existing records:
- First, generate and validate the new field definitions, then update the dataset's `fields` column (appending the new fields to the existing array).
- Fetch existing records in batches (e.g., 25 at a time) to manage token limits.
- For each batch: read the existing record data, generate values for the new fields via AI, and update each record's `data` column.
- Update task progress after each batch.
- Token usage is accumulated across field generation and all backfill batches.
- If `backfill` is false or omitted, only the field definitions are added; existing records are not modified.
**Web search during backfill:** When `web_search.enabled` is true, the worker searches for relevant information before generating field values for each batch. This helps ground values in real data (e.g., looking up a company's industry or founding year) rather than relying on model knowledge alone.
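Putting the steps together, a condensed sketch of the backfill loop. Every `declare`d helper here is hypothetical; the batch size, cancellation check, and progress math follow the description above:

```typescript
// Hypothetical helpers — assumed for illustration, not a documented API.
declare function isCancelled(taskId: string): Promise<boolean>;
declare function fetchRecordBatch(
  datasetId: string,
  offset: number,
  limit: number,
): Promise<{ id: string; data: Record<string, unknown> }[]>;
declare function generateValuesForBatch(
  records: { id: string; data: Record<string, unknown> }[],
  newFields: FieldDefinition[],
  input: GenerateFieldsInput,
): Promise<Record<string, Record<string, unknown>>>; // record id -> new values
declare function updateRecordData(id: string, data: Record<string, unknown>): Promise<void>;
declare function setProgress(taskId: string, pct: number): Promise<void>;

const BATCH_SIZE = 25; // matches the batch size suggested above

async function backfillNewFields(
  taskId: string,
  input: GenerateFieldsInput,
  newFields: FieldDefinition[],
): Promise<number> {
  const total = input.recordCount ?? 0;
  if (total === 0) return 0;

  let processed = 0;
  for (let offset = 0; offset < total; offset += BATCH_SIZE) {
    // Cancellation is checked between batches.
    if (await isCancelled(taskId)) break;

    const records = await fetchRecordBatch(input.datasetId, offset, BATCH_SIZE);
    if (records.length === 0) break;

    // One AI call per batch, optionally grounded by web search results.
    const values = await generateValuesForBatch(records, newFields, input);

    for (const record of records) {
      // Merge generated values into the record's `data` column.
      await updateRecordData(record.id, { ...record.data, ...values[record.id] });
    }

    processed += records.length;
    await setProgress(taskId, Math.round((processed / total) * 100));
  }
  return processed;
}
```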
## Provider Routing

The `model` string in the task input determines which AI provider handles the request:
| Model Prefix | Provider | Notes |
|---|---|---|
| `gpt-*`, `o3-*`, `o4-*` | OpenAI | Default provider for most tasks |
| `claude-*` | Anthropic | Strong at structured data generation |
| `gemini-*` | Google Gemini | Good for high-volume tasks |
| `grok-*` | xAI (Grok) | Required for X/Twitter search (`x_search`) |
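A sketch of prefix-based routing consistent with this table; the function and provider identifiers are illustrative:

```typescript
// Sketch: resolve the provider from the model ID prefix, per the table above.
type Provider = "openai" | "anthropic" | "google" | "xai";

function resolveProvider(model: string): Provider {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "google";
  if (model.startsWith("grok-")) return "xai";
  // gpt-*, o3-*, and o4-* route to OpenAI; treating OpenAI as the fallback
  // for unrecognized prefixes is an assumption.
  return "openai";
}

// resolveProvider("gpt-5-nano") -> "openai"
```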
## Shared Requirements

- **Token usage:** Accumulated across field generation and all backfill batches, written to `task.output.tokenUsage`.
- **Progress:** Updated via `task.progress` (0-100) after each backfill batch.
- **Cancellation:** Task status is checked between batches; processing stops if the task is cancelled.
- **Error handling:** On failure, status is set to `failed`, the error message is recorded, and `completed_at` is set.
- **Structured output:** All AI calls use JSON schema for parseable responses.
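As a sketch, the failure and completion requirements could be wrapped around the main pipeline like this (both `declare`d helpers are assumptions; the status values and `completed_at` stamp follow the list above):

```typescript
// Hypothetical helpers — assumed for illustration.
declare function updateTask(taskId: string, patch: Record<string, unknown>): Promise<void>;
declare function generateFieldsAndBackfill(
  taskId: string,
  input: GenerateFieldsInput,
): Promise<GenerateFieldsOutput>;

async function runGenerateFieldsTask(taskId: string, input: GenerateFieldsInput) {
  try {
    const output = await generateFieldsAndBackfill(taskId, input); // main pipeline
    await updateTask(taskId, {
      status: "completed",
      output, // includes fieldsAdded, recordsBackfilled, tokenUsage
      completed_at: new Date().toISOString(),
    });
  } catch (err) {
    // On failure: mark failed, record the error message, stamp completed_at.
    await updateTask(taskId, {
      status: "failed",
      error: err instanceof Error ? err.message : String(err),
      completed_at: new Date().toISOString(),
    });
  }
}
```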