Fields
How field generation works at the worker level — generating new field definitions, updating dataset schemas, and backfilling existing records.
Overview
The ai_generate_fields task type generates new field definitions using AI and updates the dataset schema. Optionally, if backfill is true and existing records are present, the worker generates values for the new fields across all existing records.
This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.
Schema-first approach: Unlike ai_generate_records which works with an existing schema to produce data rows, ai_generate_fields modifies the schema itself. The worker updates the datasets.fields column directly, so the new fields are immediately available for all future operations.
Database Tables
datasets table: The full datasets table schema, including the fields column that this handler writes, is documented on the Datasets page. The ai_generate_fields handler reads options->>'locked' to detect locked datasets and updates datasets.fields to append the newly generated field definitions.
dataset_records Table
During backfill, the worker updates existing records to add values for the new fields.
| Column | Type | Description |
|---|---|---|
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values; the worker merges new field values into this object |
| _meta | jsonb (nullable) | Citation and generation metadata |
| deleted_at | timestamp (nullable) | Soft-delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the record |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated the record |
Task Configuration
| Property | Value |
|---|---|
| Task Type | ai_generate_fields |
| Created By | POST /api/datasets/:id/fields/generate |
| Tables Used | tasks, task_events, datasets (update fields), dataset_records (backfill) |
Input Schema
The task's input column contains the validated generation request. AI configuration is wrapped in an ai object. The API server pre-populates existingFields and recordCount from the dataset before creating the task.
{
"count": 5,
"backfill": true,
"ai": {
"prompt": "Add industry classification and founding year fields",
"model": "gpt-5-nano",
"temperature": 0.5,
"streaming": false,
"max_output_tokens": 5000
},
"datasetId": "ds_abc123",
"existingFields": [
{ "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text" },
{ "id": "fld_g7h8i9j0k1l2", "name": "Price", "type": "number" }
],
"recordCount": 150
}
| Field | Type | Description |
|---|---|---|
| count | number | Number of fields to generate (1-20, default 5). Always present in task input. |
| backfill | boolean | If true and records exist, generate values for the new fields across existing records. Always present (the API server defaults it to true). |
| ai | object | AI configuration for field generation. Always present. |
| ai.prompt | string | User prompt describing what fields to generate |
| ai.model | string | Model ID to use (may be "auto"; use task.metadata.resolvedModel instead) |
| ai.temperature | number? | Sampling temperature (0-2) |
| ai.streaming | boolean? | Whether the worker should stream the AI response |
| ai.max_output_tokens | number? | Maximum output token limit for the AI response |
| datasetId | string | Target dataset ID to update fields on |
| existingFields | FieldDefinition[] | Current field definitions on the dataset (for context) |
| recordCount | number | Number of existing records (always populated by the API server) |
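Expressed as a TypeScript interface, the input looks roughly like this (a sketch derived from the table above; the `FieldDefinition` stand-in is detailed under Field Definition Shape below):

```typescript
// Minimal stand-in; the full shape is under Field Definition Shape below.
type FieldDefinition = { id: string; name: string; type: string } & Record<string, unknown>;

// Sketch of the validated task input, derived from the table above.
interface GenerateFieldsInput {
  count: number;                     // 1-20, default 5
  backfill: boolean;                 // API server defaults this to true
  ai: {
    prompt: string;
    model: string;                   // may be "auto"; prefer task.metadata.resolvedModel
    temperature?: number;            // 0-2
    streaming?: boolean;
    max_output_tokens?: number;
  };
  datasetId: string;
  existingFields: FieldDefinition[]; // current fields, for context
  recordCount: number;               // pre-populated by the API server
}
```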
Task Metadata
The API server writes a metadata object on the task row when creating it. The input.model field may contain "auto" (the default) — workers should use the resolved values from metadata instead.
{
"resolvedModel": "gpt-4o",
"resolvedProvider": "openai"
}
| Field | Type | Description |
|---|---|---|
| resolvedModel | string | The actual model ID to use for AI calls (resolved from "auto" or validated) |
| resolvedProvider | string | The provider to route to (e.g., "openai", "anthropic", "google", "openrouter") |
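A small guard makes the rule concrete (a sketch; the inline types mirror the tables above):

```typescript
// Returns the model/provider the worker should actually call.
// input.ai.model may be the literal "auto", so the resolved metadata wins.
function resolveModel(
  task: { metadata?: { resolvedModel?: string; resolvedProvider?: string } },
  input: { ai: { model: string } }
): { model: string; provider: string } {
  const model = task.metadata?.resolvedModel ?? input.ai.model;
  const provider = task.metadata?.resolvedProvider;
  if (model === "auto" || !provider) {
    throw new Error("task metadata is missing resolvedModel/resolvedProvider");
  }
  return { model, provider };
}
```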
Output Schema
Written to task.output when the task completes successfully:
{
"fieldsAdded": ["Industry", "Founded"],
"recordsBackfilled": 150,
"tokenUsage": {
"input_tokens": 9800,
"output_tokens": 6200,
"total_tokens": 16000
}
}
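As a TypeScript shape (a sketch of the JSON above):

```typescript
// Shape written to task.output on success.
interface GenerateFieldsOutput {
  fieldsAdded: string[];       // names of the newly added fields
  recordsBackfilled: number;   // 0 when backfill is skipped
  tokenUsage: {
    input_tokens: number;
    output_tokens: number;
    total_tokens: number;      // input_tokens + output_tokens
  };
}
```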
Task Events
Field generation is not a streaming task, so workers are not required to write task_events rows. However, workers may optionally write events for observability. Recommended event types:
| Event Type | Data | When |
|---|---|---|
| fields_generated | { "fields": [...] } | After generating and validating new field definitions |
| backfill_progress | { "batch": N, "totalBatches": N, "recordsUpdated": N } | After each backfill batch completes |
| done | { "model": "...", "provider": "..." } | On successful completion (before updating task status) |
Progress should always be tracked via task.progress (0-100) regardless of whether events are written.
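A worker that opts in to events might write them like this. This is a sketch assuming a node-postgres pool; the `task_events` column names (`task_id`, `type`, `data`) are assumptions based on the tables listed under Task Configuration:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

// Optional observability event plus the mandatory progress update.
// task_events column names here are assumptions; adapt to your schema.
async function reportBackfillBatch(
  taskId: string, batch: number, totalBatches: number, recordsUpdated: number
): Promise<void> {
  await pool.query(
    `INSERT INTO task_events (task_id, type, data) VALUES ($1, $2, $3)`,
    [taskId, "backfill_progress", { batch, totalBatches, recordsUpdated }]
  );
  // Progress is always tracked on the task row, events or not.
  await pool.query(
    `UPDATE tasks SET progress = $1 WHERE id = $2`,
    [Math.round((batch / totalBatches) * 100), taskId]
  );
}
```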
Field Definition Shape
Each generated field must follow this structure. The worker is responsible for generating valid field IDs and ensuring names don't collide with existing fields.
[
{
"id": "fld_Kx9mR2vL4wQn",
"name": "Industry",
"type": "text",
"description": "The company's primary industry sector",
"enum": ["SaaS", "Fintech", "Healthcare", "E-commerce", "AI/ML", "Other"],
"required": true,
"ai": { "search": { "citations": true } }
},
{
"id": "fld_pQ3rS7tU8vWx",
"name": "Founded Year",
"type": "integer",
"description": "Year the company was founded",
"exampleValue": 2015,
"ai": { "search": { "citations": true } }
},
{
"id": "fld_aB4cD9eF0gHi",
"name": "Sentiment Score",
"type": "number",
"description": "Computed sentiment of the company's market position (-1 to 1)",
"ai": { "prompt": "Rate the company's market sentiment from -1 (negative) to 1 (positive)" }
},
{
"id": "fld_jK5lM1nO2pQr",
"name": "Personal Notes",
"type": "text",
"description": "User's personal notes about the company"
}
]
| Property | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Workers must generate: "fld_" + 12 random alphanumeric chars (a-z, A-Z, 0-9). Must be unique across all fields in the dataset. |
| name | string | Yes | Field name. Must not collide with any name in existingFields. Can contain spaces. Must start with a letter or underscore and must not end with a space. |
| type | string | Yes | One of: text, number, integer, boolean, date, datetime, email, url, uuid, array, object. Do NOT generate formula fields. |
| description | string? | No | Human-readable description of the field's purpose. Recommended for all generated fields. |
| required | boolean? | No | Whether this field must have a value on every record. Default: false. |
| unique | boolean? | No | Whether values must be unique across all records. Default: false. Only for scalar types. |
| default | any? | No | Default value used when a record is created without this field. Must match the field's type. |
| autoGenerate | string? | No | Auto-generate the value on record creation. One of "uuid", "createdAt", "updatedAt". |
| enum | any[]? | No | Allowed values list. When set, record values must be one of these. |
| exampleValue | any? | No | Example value showing the expected format. Must match the field's type. If enum is set, must be one of the enum values. |
| ai | object? | No | AI generation config for this field during record generation. See AI Property on Generated Fields below. |
ID Generation
Workers must generate field IDs themselves — the API server does not assign them. Use "fld_" + 12 random characters from a-zA-Z0-9. Example: "fld_Kx9mR2vL4wQn". Each ID must be unique within the dataset's field array.
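A minimal generator, using Node's crypto module for unbiased random characters (a sketch; any CSPRNG-backed approach works):

```typescript
import { randomInt } from "node:crypto";

const ALPHABET =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

// "fld_" + 12 random alphanumeric characters, e.g. "fld_Kx9mR2vL4wQn".
function generateFieldId(): string {
  let suffix = "";
  for (let i = 0; i < 12; i++) suffix += ALPHABET[randomInt(ALPHABET.length)];
  return "fld_" + suffix;
}

// Guarantee uniqueness within the dataset's full field array.
function uniqueFieldId(existingIds: Set<string>): string {
  let id = generateFieldId();
  while (existingIds.has(id)) id = generateFieldId();
  existingIds.add(id);
  return id;
}
```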
Name Collision Handling
If the AI generates a field name that already exists in existingFields, the worker should either retry generation with an instruction to avoid those names, or append a suffix (e.g., "Industry_2"). Do not silently overwrite existing field definitions.
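The suffix fallback can be as simple as the sketch below (retrying generation with an avoid-list is equally valid):

```typescript
// Returns a name that does not collide with existing field names,
// appending _2, _3, ... as needed. Never overwrites an existing field.
function dedupeFieldName(name: string, existingNames: Set<string>): string {
  if (!existingNames.has(name)) return name;
  let n = 2;
  while (existingNames.has(`${name}_${n}`)) n++;
  return `${name}_${n}`;
}
```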
AI Property on Generated Fields
When generating new field definitions, workers must decide whether and how to set the ai property. This property controls how the field behaves during future AI record generation. The decision should be based on the nature of the field:
| Field Nature | ai Value | Examples |
|---|---|---|
| Factual / real-world data | { "search": { "citations": true } } | Company Name, CEO, Contact Email, Founded Year, Address, Phone Number, News Headline, Population, Stock Price |
| Synthesizable / analytical | {} or { "prompt": "..." } | Sentiment Score, Category, Summary, Tags, Risk Level, Recommendation, Classification |
| User-input / subjective | Omit the ai property entirely | Personal Notes, My Rating, Comments, Custom Label, Internal Memo, User Feedback |
Never set ai: false on generated fields. This value is reserved for users who explicitly want to permanently exclude a field from AI generation. Workers should not make this decision — either omit the ai property (field uses default AI settings) or set it to an appropriate config object.
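A defensive sanitizer can enforce this rule on AI output before the fields are written (a sketch):

```typescript
// ai: false is reserved for users; strip it if the model emits it anyway.
// Omitting ai leaves the field on default AI settings.
function stripReservedAiFlag(field: { ai?: unknown }): void {
  if (field.ai === false) delete field.ai;
}
```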
ai.search Sub-Object
The search property within ai configures web search grounding for the field. Its presence signals that the field's data should come from web sources rather than model knowledge.
| Property | Type | Description |
|---|---|---|
| sources | number? | Number of web sources to fetch per search (1-50) |
| include_urls | string[]? | Restrict search to these URLs/domains |
| exclude_urls | string[]? | Exclude these URLs/domains from search |
| x_search | boolean? | Include X.com posts in search results |
| citations | boolean? | Require per-field citation metadata. Always set to true for factual fields. |
During backfill, the worker reads each new field's ai.search config to decide whether to use web search for that field's values. Fields are grouped by search config and web search is run once per group.
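Grouping can key on a canonical serialization of the search config, so fields sharing an identical config share one search. A sketch (the `FieldDef` type is a minimal stand-in):

```typescript
type FieldDef = {
  id: string;
  name: string;
  ai?: { search?: Record<string, unknown>; prompt?: string };
};

// Order-independent serialization so identical configs produce identical keys.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value && typeof value === "object") {
    const entries = Object.keys(value as object).sort()
      .map(k => `${JSON.stringify(k)}:${canonical((value as any)[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// One group per distinct ai.search config; non-search fields are skipped.
function groupBySearchConfig(fields: FieldDef[]): Map<string, FieldDef[]> {
  const groups = new Map<string, FieldDef[]>();
  for (const field of fields) {
    const search = field.ai?.search;
    if (!search) continue; // synthetic and user-input fields need no search
    const key = canonical(search);
    groups.set(key, [...(groups.get(key) ?? []), field]);
  }
  return groups;
}
```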
Processing Steps
Complete step-by-step flow for processing an ai_generate_fields task (a condensed code sketch follows the list):
- Fetch the dataset by `input.datasetId`. If the dataset is locked (`options->>'locked'` is true), fail the task with error "Dataset is locked". If `deleted_at` is not null, fail with "Dataset has been deleted".
- Read the resolved model from `task.metadata.resolvedModel` and `task.metadata.resolvedProvider`. Never use `input.ai.model` directly (it may be "auto").
- Generate field definitions via AI using structured output (a JSON array). Pass the user prompt (`input.ai.prompt`), `input.count` (the number of fields to generate), the existing field names (for context and collision avoidance), and the dataset title.
- Generate field IDs for each new field: `"fld_"` + 12 random alphanumeric chars.
- Validate names: ensure no generated field name collides with any name in `input.existingFields`. If a collision is detected, skip the colliding field.
- Update the dataset: append the new field definitions to the existing fields JSONB array. Write the `fields_generated` event.
- Backfill: when `input.backfill = true` and `input.recordCount > 0`, batch through existing records and generate values for the new fields (see Backfill Pattern below).
- Write output to `task.output` and set `status = 'completed'`, `completed_at = NOW()`.
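A condensed sketch of the flow above, reusing `GenerateFieldsInput`, `FieldDef`, and `resolveModel` from the earlier sketches. All other helper names are illustrative stand-ins for logic described elsewhere on this page:

```typescript
type Task = {
  id: string;
  created_by: string;
  input: GenerateFieldsInput; // see Input Schema above
  metadata?: { resolvedModel?: string; resolvedProvider?: string };
};

// Illustrative helpers; each maps to a section of this page.
declare function fetchDataset(id: string): Promise<{ locked: boolean; deleted_at: string | null }>;
declare function generateFieldDefs(model: string, provider: string, input: GenerateFieldsInput): Promise<FieldDef[]>;
declare function appendFields(datasetId: string, fields: FieldDef[], userId: string): Promise<void>;
declare function backfillRecords(task: Task, fields: FieldDef[]): Promise<number>;
declare function completeTask(id: string, output: object): Promise<void>;
declare function failTask(id: string, error: string): Promise<void>;

async function processGenerateFields(task: Task): Promise<void> {
  const input = task.input;

  const dataset = await fetchDataset(input.datasetId);
  if (dataset.locked) return failTask(task.id, "Dataset is locked");
  if (dataset.deleted_at) return failTask(task.id, "Dataset has been deleted");

  const { model, provider } = resolveModel(task, input); // never trust "auto"

  // Structured output: the AI returns a JSON array of field definitions.
  const generated = await generateFieldDefs(model, provider, input);

  // Drop any field whose name collides with an existing one.
  const existing = new Set(input.existingFields.map(f => f.name));
  const fields = generated.filter(f => !existing.has(f.name));

  await appendFields(input.datasetId, fields, task.created_by); // fields || $newFields::jsonb

  let recordsBackfilled = 0;
  if (input.backfill && input.recordCount > 0) {
    recordsBackfilled = await backfillRecords(task, fields);
  }
  await completeTask(task.id, {
    fieldsAdded: fields.map(f => f.name),
    recordsBackfilled,
  });
}
```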
SQL: Update Dataset Fields
Append new field definitions to the existing JSONB array:
UPDATE datasets
SET fields = fields || $newFields::jsonb,
updated_at = NOW(),
updated_by = $userId
WHERE id = $datasetId;
-- $newFields is a JSON array of the new FieldDefinition objects
-- $userId is task.created_by
SQL: Backfill a Record
Merge new field values and metadata into an existing record:
UPDATE dataset_records
SET data = data || $newFieldValues::jsonb,
_meta = COALESCE(_meta, '{}'::jsonb) || $newMeta::jsonb,
updated_at = NOW(),
updated_by = $userId
WHERE id = $recordId;
-- $newFieldValues is { "Industry": "SaaS", "Founded": 2015 }
-- $newMeta merges into existing _meta, preserving prior citations
-- $userId is task.created_by
Error Handling
- If the dataset is locked or deleted, set `task.status = 'failed'`, `task.error` to the reason, and `task.completed_at = NOW()`.
- If a backfill batch fails, keep the previously updated records and continue to the next batch. Report the failure in a `backfill_progress` event.
- Check `task.status` for cancellation every few batches. If cancelled, write partial output and set the status to 'cancelled'.
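The terminal-failure path might look like this (a sketch; `pool` is a node-postgres pool as in the event example above):

```typescript
// Terminal failure: record the reason and stamp completion.
// Columns match the error-handling rules above (status, error, completed_at).
async function failTask(taskId: string, reason: string): Promise<void> {
  await pool.query(
    `UPDATE tasks SET status = 'failed', error = $1, completed_at = NOW()
     WHERE id = $2`,
    [reason, taskId]
  );
}
```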
Backfill Pattern
When `backfill` is true and `recordCount > 0`, the worker populates the new fields across all existing records:
- First, generate and validate the new field definitions. Update the dataset's `fields` column using the append SQL (`fields || $newFields::jsonb`).
- Group the new fields by their `ai.search` config: fields with a `search` object are grouped for web search; fields with `ai: {}` or `ai: { "prompt": "..." }` are synthetic; fields without `ai` use default synthetic generation.
- Fetch existing records in batches (25 at a time) to manage token limits.
- For each batch: run web search for the search-enabled field groups, read the existing record data, generate new field values via AI (separating synthetic vs. web-search-grounded fields in the prompt), and update each record using the merge SQL.
- Each backfilled record's `_meta` always includes `generatedBy`, `taskId`, `model`, and `generatedAt`. Existing per-field citations are preserved (merged, not overwritten).
- Update task progress after each batch.
- Accumulate token usage across field generation and all backfill batches.
- If `backfill` is false or omitted, or `recordCount` is 0, only the field definitions are added; existing records are not modified.
Web search during backfill: When a new field has an ai.search config, the worker runs web search before generating values for each batch. Fields are grouped by their search config (sources, include/exclude URLs, x_search) so fields with the same config share a single search query. This grounds factual fields in real data rather than relying on model knowledge alone.
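Putting the rules together, the batch loop might be sketched as follows. Helper names are illustrative; the 25-record batch size and the per-batch ordering come from the list above, and `Task`, `FieldDef`, and `groupBySearchConfig` are reused from the earlier sketches:

```typescript
const BATCH_SIZE = 25; // keeps per-call token usage manageable

// Illustrative helpers for the web search, AI call, merge SQL, and task I/O.
declare function fetchRecordBatch(datasetId: string, offset: number, limit: number):
  Promise<Array<{ id: string; data: object }>>;
declare function runWebSearch(groups: Map<string, FieldDef[]>, records: object[]): Promise<object | null>;
declare function generateValues(records: object[], fields: FieldDef[], search: object | null):
  Promise<Record<string, object>>; // record id -> new field values
declare function mergeRecord(recordId: string, values: object, meta: object, userId: string): Promise<void>;
declare function isCancelled(taskId: string): Promise<boolean>;
declare function setProgress(taskId: string, pct: number): Promise<void>;

async function backfillRecords(task: Task, fields: FieldDef[]): Promise<number> {
  const { datasetId, recordCount } = task.input;
  const totalBatches = Math.ceil(recordCount / BATCH_SIZE);
  let updated = 0;

  for (let batch = 1; batch <= totalBatches; batch++) {
    if (await isCancelled(task.id)) break; // cancellation check between batches

    const records = await fetchRecordBatch(datasetId, (batch - 1) * BATCH_SIZE, BATCH_SIZE);

    // One web search per distinct ai.search config (see the grouping sketch).
    const search = await runWebSearch(groupBySearchConfig(fields), records);
    const values = await generateValues(records, fields, search);

    for (const record of records) {
      const meta = {
        generatedBy: "ai",
        taskId: task.id,
        model: task.metadata?.resolvedModel,
        generatedAt: new Date().toISOString(),
      };
      await mergeRecord(record.id, values[record.id] ?? {}, meta, task.created_by);
      updated++;
    }
    await setProgress(task.id, Math.round((batch / totalBatches) * 100));
  }
  return updated;
}
```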
Record _meta
When backfilling records, workers update each record's _meta column with generation metadata:
{
"generatedBy": "ai",
"taskId": "task_abc123",
"model": "gpt-4o",
"generatedAt": "2025-01-15T10:30:00Z",
"citations": {
"Industry": [
{ "url": "https://example.com/article", "title": "Industry Report", "snippet": "..." }
],
"Founded": [
{ "url": "https://example.com/company", "title": "Company Profile" }
]
}
}
| Field | Type | Description |
|---|---|---|
| generatedBy | string | Always "ai" for AI-generated data |
| taskId | string? | The task ID that produced this data |
| model | string? | Model used for generation |
| generatedAt | string? | ISO 8601 timestamp of when the data was generated |
| citations | Record<string, Citation[]>? | Per-field citation arrays. Each citation has url (required) plus optional title and snippet. Only present for fields that used web search. |
If the record already has a _meta value (e.g., from prior AI generation), merge the new metadata — do not overwrite existing citations for other fields.
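Note that the SQL `||` operator only merges top-level keys, so writing a fresh `citations` object via the merge SQL alone would clobber citations recorded for other fields. One approach is to merge in the worker before writing (a sketch):

```typescript
type Citation = { url: string; title?: string; snippet?: string };

type RecordMeta = {
  generatedBy?: string;
  taskId?: string;
  model?: string;
  generatedAt?: string;
  citations?: Record<string, Citation[]>;
};

// jsonb || replaces top-level keys wholesale, so combine citations here first;
// otherwise the new citations object would overwrite the existing one.
function mergeMeta(existing: RecordMeta | null, incoming: RecordMeta): RecordMeta {
  return {
    ...existing,
    ...incoming,
    citations: {
      ...(existing?.citations ?? {}),
      ...(incoming.citations ?? {}),
    },
  };
}
```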
Provider Routing
The API server resolves the provider via task.metadata.resolvedProvider. For backward compatibility, the worker also detects the provider from the model prefix:
| Model Prefix | Provider | Notes |
|---|---|---|
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |

Provider mapping from API server names: openai → openai, anthropic → anthropic, google → gemini, openrouter → xai.
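The fallback detection might look like this (a sketch; the default follows the table above, and the name mapping mirrors the line above it):

```typescript
// Fallback for task rows without metadata.resolvedProvider.
function detectProvider(model: string): string {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // gpt-*, o3-*, o4-*, and the default for most tasks
}

// API server name -> worker provider name, per the mapping above.
const PROVIDER_MAP: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};
```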
Shared Requirements
- Token usage: accumulated across field generation and all backfill batches, written to `task.output.tokenUsage`
- Progress: updated via `task.progress` (0-100) after each backfill batch
- Cancellation: task status is checked between batches; processing stops if cancelled
- Error handling: on failure, the status is set to `failed`, the error message is recorded, and `completed_at` is set
- Structured output: all AI calls use a JSON schema for parseable responses