Fields

How the worker generates new field definitions using AI, updates the dataset schema, and optionally backfills existing records with values for the new fields.

Overview

The ai_generate_fields task type lets users add new fields to a dataset by describing what they need in natural language. The worker uses AI to generate appropriate field definitions (name, type, description), appends them to the dataset's schema, and can optionally populate existing records with values for the new fields.

Schema-first approach: Unlike ai_generate_records, which works with an existing schema to produce data rows, ai_generate_fields modifies the schema itself. The worker updates the datasets.fields column directly, so the new fields are immediately available to all future operations.
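
A minimal sketch of that schema update, assuming the fields column is Postgres JSONB and a node-postgres (`pg`) client; the worker's actual persistence layer isn't specified in this doc:

```ts
import { Pool } from "pg";

const pool = new Pool();

// Append new field definitions to datasets.fields. With a JSONB array,
// `||` concatenates, so existing definitions are preserved.
async function appendFields(
  datasetId: string,
  newFields: Array<{ name: string; type: string; description?: string }>,
): Promise<void> {
  await pool.query(
    "UPDATE datasets SET fields = fields || $1::jsonb WHERE id = $2",
    [JSON.stringify(newFields), datasetId],
  );
}
```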

Task Configuration

| Property | Value |
| --- | --- |
| Task Type | `ai_generate_fields` |
| Created By | `PATCH /api/datasets/:id` with `ai` body |
| Tables Used | `tasks`, `task_events`, `datasets` (update `fields`), `dataset_records` (backfill) |
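
For example, a client might enqueue the task like this; the `ai` body follows the input schema below, while the `{ taskId }` response shape is an assumption:

```ts
// Hypothetical client call: PATCH the dataset with an `ai` body to enqueue
// an ai_generate_fields task.
const res = await fetch("/api/datasets/ds_abc123", {
  method: "PATCH",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    ai: {
      prompt: "Add industry classification and founding year fields",
      model: "gpt-5-nano",
      backfill: true,
    },
  }),
});
const { taskId } = await res.json(); // response shape assumed
```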

Input Schema

The task's input column contains the validated generation request. The API server pre-populates existingFields and recordCount from the dataset.

```json
{
  "prompt": "Add industry classification and founding year fields",
  "model": "gpt-5-nano",
  "temperature": 0.5,
  "web_search": {
    "enabled": true,
    "sources": 3,
    "include_urls": [],
    "exclude_urls": [],
    "x_search": false,
    "citations": true
  },
  "backfill": true,
  "datasetId": "ds_abc123",
  "existingFields": [
    { "name": "Company", "type": "text" },
    { "name": "Price", "type": "number" }
  ],
  "recordCount": 150
}
```

| Field | Type | Description |
| --- | --- | --- |
| `prompt` | string | User prompt describing what fields to generate |
| `model` | string | Model ID to use for generation (determines provider) |
| `temperature` | number? | Sampling temperature (0-2) |
| `web_search` | object? | Web search config (same shape as `ai_generate_records`) |
| `backfill` | boolean | If true and records exist, generate values for the new fields across existing records |
| `datasetId` | string | Target dataset ID to update fields on |
| `existingFields` | FieldDefinition[] | Current field definitions on the dataset (for context) |
| `recordCount` | integer? | Number of existing records (helps estimate backfill work) |
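
The same input expressed as a TypeScript type sketch; the FieldType union lists only the types named in this doc, and the real set is larger:

```ts
// Only the field types named in this doc; the actual set is larger.
type FieldType =
  | "text" | "number" | "integer" | "boolean" | "date" | "url" | "email";

interface FieldDefinition {
  name: string;
  type: FieldType;
  description?: string;
}

interface WebSearchConfig {
  enabled: boolean;
  sources: number;
  include_urls: string[];
  exclude_urls: string[];
  x_search: boolean;
  citations: boolean;
}

interface GenerateFieldsInput {
  prompt: string;                    // what fields to generate
  model: string;                     // determines provider (see Provider Routing)
  temperature?: number;              // 0-2
  web_search?: WebSearchConfig;      // same shape as ai_generate_records
  backfill: boolean;                 // populate existing records when true
  datasetId: string;                 // target dataset
  existingFields: FieldDefinition[]; // pre-populated by the API server
  recordCount?: number;              // pre-populated by the API server
}
```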

Output Schema

Written to task.output when the task completes successfully:

```json
{
  "fieldsAdded": [
    { "name": "Industry", "type": "text", "description": "Industry classification" },
    { "name": "Founded", "type": "integer", "description": "Year company was founded" }
  ],
  "recordsBackfilled": 150,
  "tokenUsage": {
    "input_tokens": 9800,
    "output_tokens": 6200,
    "total_tokens": 16000
  }
}
```
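
And the corresponding output shape as a type sketch, mirroring the example above:

```ts
interface TokenUsage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}

interface GenerateFieldsOutput {
  fieldsAdded: FieldDefinition[]; // FieldDefinition as sketched above
  recordsBackfilled: number;      // 0 when backfill is false or no records exist
  tokenUsage: TokenUsage;
}
```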

Field Generation

The worker uses AI to translate the user's natural language prompt into structured field definitions:

  1. The existing field schema is passed as context so the AI understands the dataset's current structure.
  2. The AI generates new field definitions with name, type, and description for each field.
  3. The worker validates that no generated field names collide with existing fields.
  4. Valid new fields are appended to the dataset's fields column in the datasets table.

Supported field types: The AI can generate fields of any type supported by the dataset schema, including text, number, integer, boolean, date, url, email, and more. Field naming follows the conventions of existing fields in the dataset.
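
A sketch of steps 2-4 above, reusing appendFields from the earlier sketch; generateFieldDefinitions is a hypothetical stand-in for the AI call, and case-insensitive collision checking is an assumption:

```ts
// Hypothetical AI call: prompt + existing schema in, new definitions out.
declare function generateFieldDefinitions(
  prompt: string,
  existing: FieldDefinition[],
): Promise<FieldDefinition[]>;

async function addGeneratedFields(
  input: GenerateFieldsInput,
): Promise<FieldDefinition[]> {
  const generated = await generateFieldDefinitions(input.prompt, input.existingFields);

  // Step 3: drop any generated field whose name collides with an existing one.
  const taken = new Set(input.existingFields.map((f) => f.name.toLowerCase()));
  const fresh = generated.filter((f) => !taken.has(f.name.toLowerCase()));

  // Step 4: append the surviving definitions to datasets.fields.
  await appendFields(input.datasetId, fresh);
  return fresh;
}
```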

Backfill Pattern

When backfill is true and existing records are present, the worker populates new fields across all existing records:

  1. Generate and validate the new field definitions, then update the dataset's fields column by appending them to the existing array.
  2. Fetch existing records in batches (e.g., 25 at a time) to manage token limits.
  3. For each batch: read existing record data, generate new field values via AI, update each record's data column.
  4. Update task progress after each batch.
  5. Token usage is accumulated across field generation and all backfill batches.
  6. If backfill is false or omitted, only the field definitions are added — existing records are not modified.

```
Step 1  AI generates field definitions from prompt
          ↓
Step 2  Validate no name collisions with existing fields
          ↓
Step 3  Append new fields to datasets.fields column
          ↓
Step 4  If backfill=true and records exist:
          Batch 1/6 → Read 25 records → AI generates values → Update records → progress=17%
          Batch 2/6 → Read 25 records → AI generates values → Update records → progress=33%
          Batch 3/6 → Read 25 records → AI generates values → Update records → progress=50%
          ...
          ↓
Write output { fieldsAdded, recordsBackfilled, tokenUsage }
Set status = 'completed'
```
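
A sketch of that loop under the flow's assumptions (batches of 25, progress reported per batch); fetchRecordsPage, generateValuesForBatch, updateRecordData, and setTaskProgress are hypothetical stand-ins for the worker's real data-access and AI layers:

```ts
declare function fetchRecordsPage(
  datasetId: string, offset: number, limit: number,
): Promise<Array<{ id: string; data: Record<string, unknown> }>>;
declare function generateValuesForBatch(
  records: Array<{ id: string; data: Record<string, unknown> }>,
  newFields: FieldDefinition[],
  input: GenerateFieldsInput,
): Promise<Map<string, Record<string, unknown>>>;
declare function updateRecordData(recordId: string, patch: Record<string, unknown>): Promise<void>;
declare function setTaskProgress(taskId: string, percent: number): Promise<void>;

const BATCH_SIZE = 25;

async function backfillRecords(
  taskId: string,
  input: GenerateFieldsInput,
  newFields: FieldDefinition[],
): Promise<number> {
  const total = input.recordCount ?? 0;
  const batches = Math.ceil(total / BATCH_SIZE);
  let backfilled = 0;

  for (let i = 0; i < batches; i++) {
    const records = await fetchRecordsPage(input.datasetId, i * BATCH_SIZE, BATCH_SIZE);
    if (records.length === 0) break;

    // One AI call per batch; web search may run first (see the note below).
    const values = await generateValuesForBatch(records, newFields, input);
    for (const [recordId, patch] of values) {
      await updateRecordData(recordId, patch); // merge into the record's data column
    }

    backfilled += records.length;
    await setTaskProgress(taskId, Math.round((100 * (i + 1)) / batches));
  }
  return backfilled;
}
```

With 150 records and batches of 25, this yields six batches and the 17% / 33% / 50% progress steps shown in the flow above.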

Web search during backfill: When web_search.enabled is true, the worker searches for relevant information before generating field values for each batch. This helps ground values in real data (e.g., looking up a company's industry or founding year) rather than relying on model knowledge alone.
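
One way that grounding step might look; searchWeb is hypothetical, and the query construction is illustrative only:

```ts
declare function searchWeb(
  query: string,
  opts: { limit: number; includeUrls?: string[]; excludeUrls?: string[] },
): Promise<string[]>;

// Build a short grounding context from web snippets before value generation.
async function groundingContext(
  record: { data: Record<string, unknown> },
  newFields: FieldDefinition[],
  cfg: WebSearchConfig,
): Promise<string> {
  if (!cfg.enabled) return "";
  const subject = Object.values(record.data).slice(0, 2).join(" ");
  const query = `${subject} ${newFields.map((f) => f.name).join(" ")}`;
  const snippets = await searchWeb(query, {
    limit: cfg.sources,
    includeUrls: cfg.include_urls,
    excludeUrls: cfg.exclude_urls,
  });
  return snippets.join("\n");
}
```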

Provider Routing

The model string in the task input determines which AI provider handles the request:

| Model Prefix | Provider | Notes |
| --- | --- | --- |
| `gpt-*`, `o3-*`, `o4-*` | OpenAI | Default provider for most tasks |
| `claude-*` | Anthropic | Strong at structured data generation |
| `gemini-*` | Google Gemini | Good for high-volume tasks |
| `grok-*` | xAI (Grok) | Required for X/Twitter search (`x_search`) |
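
A routing sketch matching the table; the provider labels are illustrative, not SDK identifiers:

```ts
type Provider = "openai" | "anthropic" | "gemini" | "xai";

function resolveProvider(model: string): Provider {
  if (/^(gpt-|o3-|o4-)/.test(model)) return "openai";
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // default provider, per the table
}

// resolveProvider("gpt-5-nano") === "openai"
// resolveProvider("grok-4")     === "xai"
```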

Shared Requirements