Fields

How field generation works at the worker level — generating new field definitions, updating dataset schemas, and backfilling existing records.

Overview

The ai_generate_fields task type generates new field definitions using AI and updates the dataset schema. If backfill is true and the dataset already has records, the worker also generates values for the new fields across all existing records.

This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.

Schema-first approach: Unlike ai_generate_records which works with an existing schema to produce data rows, ai_generate_fields modifies the schema itself. The worker updates the datasets.fields column directly, so the new fields are immediately available for all future operations.

Database Tables

datasets table: The full datasets table schema, including the fields column that this handler writes, is documented on the Datasets page. The ai_generate_fields handler reads options->>'locked' to detect locked datasets and updates datasets.fields to append the newly generated field definitions.

dataset_records Table

During backfill, the worker updates existing records to add values for the new fields.

| Column | Type | Description |
| --- | --- | --- |
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values — worker merges new field values into this object |
| _meta | jsonb (nullable) | Citation and generation metadata |
| deleted_at | timestamp (nullable) | Soft delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the record |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated the record |

Task Configuration

| Property | Value |
| --- | --- |
| Task Type | ai_generate_fields |
| Created By | POST /api/datasets/:id/fields/generate |
| Tables Used | tasks, task_events, datasets (update fields), dataset_records (backfill) |

Input Schema

The task's input column contains the validated generation request. AI configuration is wrapped in an ai object. The API server pre-populates existingFields and recordCount from the dataset before creating the task.

```json
{
  "count": 5,
  "backfill": true,
  "ai": {
    "prompt": "Add industry classification and founding year fields",
    "model": "gpt-5-nano",
    "temperature": 0.5,
    "streaming": false,
    "max_output_tokens": 5000
  },
  "datasetId": "ds_abc123",
  "existingFields": [
    { "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text" },
    { "id": "fld_g7h8i9j0k1l2", "name": "Price", "type": "number" }
  ],
  "recordCount": 150
}
```

| Field | Type | Description |
| --- | --- | --- |
| count | number | Number of fields to generate (1-20, default 5). Always present in task input. |
| backfill | boolean | If true and records exist, generate values for new fields across existing records. Always present (API server defaults to true). |
| ai | object | AI configuration for field generation. Always present. |
| ai.prompt | string | User prompt describing what fields to generate |
| ai.model | string | Model ID to use (may be "auto" — use task.metadata.resolvedModel instead) |
| ai.temperature | number? | Sampling temperature (0-2) |
| ai.streaming | boolean? | Whether the worker should stream the AI response |
| ai.max_output_tokens | number? | Maximum output token limit for the AI response |
| datasetId | string | Target dataset ID to update fields on |
| existingFields | FieldDefinition[] | Current field definitions on the dataset (for context) |
| recordCount | number | Number of existing records (always populated by API server) |
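Although the API server validates and defaults these values before creating the task, a defensive worker can re-normalize them on read. A minimal sketch, assuming the bounds in the table above (the helper name is hypothetical):

```typescript
// Defensive normalization of input.count (1-20, default 5).
// The API server already applies these defaults; this is belt-and-braces.
function normalizeCount(count?: number): number {
  if (count === undefined || Number.isNaN(count)) return 5; // default
  return Math.min(20, Math.max(1, Math.floor(count)));      // clamp to 1-20
}
```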

Task Metadata

The API server writes a metadata object on the task row when creating it. The input.model field may contain "auto" (the default) — workers should use the resolved values from metadata instead.

```json
{
  "resolvedModel": "gpt-4o",
  "resolvedProvider": "openai"
}
```

| Field | Type | Description |
| --- | --- | --- |
| resolvedModel | string | The actual model ID to use for AI calls (resolved from "auto" or validated) |
| resolvedProvider | string | The provider to route to (e.g., "openai", "anthropic", "google", "openrouter") |

Output Schema

Written to task.output when the task completes successfully:

```json
{
  "fieldsAdded": ["Industry", "Founded"],
  "recordsBackfilled": 150,
  "tokenUsage": {
    "input_tokens": 9800,
    "output_tokens": 6200,
    "total_tokens": 16000
  }
}
```
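The tokenUsage figure is the sum across the field-generation call and every backfill batch. A sketch of the accumulation, with shapes assumed from the example above (the helper name is hypothetical):

```typescript
interface TokenUsage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}

// Accumulate usage across the field-generation call and each backfill batch.
function addUsage(total: TokenUsage, batch: TokenUsage): TokenUsage {
  return {
    input_tokens: total.input_tokens + batch.input_tokens,
    output_tokens: total.output_tokens + batch.output_tokens,
    total_tokens: total.total_tokens + batch.total_tokens,
  };
}
```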

Task Events

Field generation is not a streaming task, so workers are not required to write task_events rows. However, workers may optionally write events for observability. Recommended event types:

| Event Type | Data | When |
| --- | --- | --- |
| fields_generated | { "fields": [...] } | After generating and validating new field definitions |
| backfill_progress | { "batch": N, "totalBatches": N, "recordsUpdated": N } | After each backfill batch completes |
| done | { "model": "...", "provider": "..." } | On successful completion (before updating task status) |

Progress should always be tracked via task.progress (0-100) regardless of whether events are written.

Field Definition Shape

Each generated field must follow this structure. The worker is responsible for generating valid field IDs and ensuring names don't collide with existing fields.

```json
[
  {
    "id": "fld_Kx9mR2vL4wQn",
    "name": "Industry",
    "type": "text",
    "description": "The company's primary industry sector",
    "enum": ["SaaS", "Fintech", "Healthcare", "E-commerce", "AI/ML", "Other"],
    "required": true,
    "ai": { "search": { "citations": true } }
  },
  {
    "id": "fld_pQ3rS7tU8vWx",
    "name": "Founded Year",
    "type": "integer",
    "description": "Year the company was founded",
    "exampleValue": 2015,
    "ai": { "search": { "citations": true } }
  },
  {
    "id": "fld_aB4cD9eF0gHi",
    "name": "Sentiment Score",
    "type": "number",
    "description": "Computed sentiment of the company's market position (-1 to 1)",
    "ai": { "prompt": "Rate the company's market sentiment from -1 (negative) to 1 (positive)" }
  },
  {
    "id": "fld_jK5lM1nO2pQr",
    "name": "Personal Notes",
    "type": "text",
    "description": "User's personal notes about the company"
  }
]
```

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Workers must generate: "fld_" + 12 random alphanumeric chars (a-z, A-Z, 0-9). Must be unique across all fields in the dataset. |
| name | string | Yes | Field name. Must not collide with any name in existingFields. Can contain spaces. Must start with a letter or underscore and must not end with a space. |
| type | string | Yes | One of: text, number, integer, boolean, date, datetime, email, url, uuid, array, object. Do NOT generate formula fields. |
| description | string? | No | Human-readable description of the field's purpose. Recommended for all generated fields. |
| required | boolean? | No | Whether this field must have a value on every record. Default: false. |
| unique | boolean? | No | Whether values must be unique across all records. Default: false. Only for scalar types. |
| default | any? | No | Default value used when a record is created without this field. Must match the field's type. |
| autoGenerate | string? | No | Auto-generate value on record creation. One of "uuid", "createdAt", "updatedAt". |
| enum | any[]? | No | Allowed values list. When set, record values must be one of these. |
| exampleValue | any? | No | Example value showing expected format. Must match the field's type. If enum is set, must be one of the enum values. |
| ai | object? | No | AI generation config for this field during record generation. See AI Property on Generated Fields below. |

ID Generation

Workers must generate field IDs themselves — the API server does not assign them. Use "fld_" + 12 random characters from a-zA-Z0-9. Example: "fld_Kx9mR2vL4wQn". Each ID must be unique within the dataset's field array.
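The rule above can be sketched as a small helper. Math.random is used here for brevity; a real worker might prefer crypto-strength randomness (the helper names are hypothetical):

```typescript
const ID_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

// Generate "fld_" + 12 random alphanumeric characters.
function generateFieldId(): string {
  let suffix = "";
  for (let i = 0; i < 12; i++) {
    suffix += ID_ALPHABET[Math.floor(Math.random() * ID_ALPHABET.length)];
  }
  return "fld_" + suffix;
}

// Retry until the ID is unique within the dataset's field array.
function uniqueFieldId(existingIds: Set<string>): string {
  let id = generateFieldId();
  while (existingIds.has(id)) id = generateFieldId();
  return id;
}
```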

Name Collision Handling

If the AI generates a field name that already exists in existingFields, the worker should either retry generation with an instruction to avoid those names, or append a suffix (e.g., "Industry_2"). Do not silently overwrite existing field definitions.
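The suffix strategy can be sketched as follows, using exact-match comparison against existing names (the helper name is hypothetical; whether comparison should be case-insensitive is not specified by this document):

```typescript
// Resolve a generated name against existing field names: keep it if free,
// otherwise append the first free numeric suffix ("Industry_2", "Industry_3", ...).
function resolveFieldName(name: string, existingNames: string[]): string {
  const taken = new Set(existingNames);
  if (!taken.has(name)) return name;
  let i = 2;
  while (taken.has(`${name}_${i}`)) i++;
  return `${name}_${i}`;
}
```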

AI Property on Generated Fields

When generating new field definitions, workers must decide whether and how to set the ai property. This property controls how the field behaves during future AI record generation. The decision should be based on the nature of the field:

| Field Nature | ai Value | Examples |
| --- | --- | --- |
| Factual / real-world data | { "search": { "citations": true } } | Company Name, CEO, Contact Email, Founded Year, Address, Phone Number, News Headline, Population, Stock Price |
| Synthesizable / analytical | {} or { "prompt": "..." } | Sentiment Score, Category, Summary, Tags, Risk Level, Recommendation, Classification |
| User-input / subjective | Omit the ai property entirely | Personal Notes, My Rating, Comments, Custom Label, Internal Memo, User Feedback |

Never set ai: false on generated fields. This value is reserved for users who explicitly want to permanently exclude a field from AI generation. Workers should not make this decision — either omit the ai property (field uses default AI settings) or set it to an appropriate config object.
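The guard against ai: false can be expressed as a small post-processing step on the model's output. A sketch under assumed shapes (the helper name is hypothetical):

```typescript
// Sanitize the ai property on a generated field definition:
// - ai: false is reserved for users, so the worker converts it to "omitted"
// - null/undefined (user-input fields) stay omitted
// - any config object passes through unchanged
function sanitizeAi(ai: unknown): object | undefined {
  if (ai === false || ai === null || ai === undefined) return undefined;
  if (typeof ai !== "object") return undefined; // discard malformed values
  return ai as object;
}
```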

ai.search Sub-Object

The search property within ai configures web search grounding for the field. Its presence signals that the field's data should come from web sources rather than model knowledge.

| Property | Type | Description |
| --- | --- | --- |
| sources | number? | Number of web sources to fetch per search (1-50) |
| include_urls | string[]? | Restrict search to these URLs/domains |
| exclude_urls | string[]? | Exclude these URLs/domains from search |
| x_search | boolean? | Include X.com posts in search results |
| citations | boolean? | Require per-field citation metadata. Always set to true for factual fields. |

During backfill, the worker reads each new field's ai.search config to decide whether to use web search for that field's values. Fields are grouped by search config and web search is run once per group.
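The grouping step can be sketched by keying each field on a canonical serialization of its ai.search config, so equivalent configs share one search per batch. Field shapes are assumed from this page; the helper names are hypothetical:

```typescript
interface SearchConfig {
  sources?: number;
  include_urls?: string[];
  exclude_urls?: string[];
  x_search?: boolean;
  citations?: boolean;
}
interface NewField {
  name: string;
  ai?: { search?: SearchConfig; prompt?: string };
}

// Canonical key: sort top-level keys so equivalent configs compare equal.
function searchKey(cfg: SearchConfig): string {
  return JSON.stringify(
    Object.fromEntries(Object.entries(cfg).sort(([a], [b]) => (a < b ? -1 : 1)))
  );
}

// Fields with a search object group by config; all others are "synthetic".
function groupBySearchConfig(fields: NewField[]): Map<string, NewField[]> {
  const groups = new Map<string, NewField[]>();
  for (const f of fields) {
    const key = f.ai?.search ? searchKey(f.ai.search) : "synthetic";
    const group = groups.get(key) ?? [];
    group.push(f);
    groups.set(key, group);
  }
  return groups;
}
```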

Processing Steps

Complete step-by-step flow for processing an ai_generate_fields task:

  1. Fetch the dataset by input.datasetId. If locked = true, fail the task with error "Dataset is locked". If deleted_at is not null, fail with "Dataset has been deleted".
  2. Read resolved model from task.metadata.resolvedModel and task.metadata.resolvedProvider. Never use input.ai.model directly (may be "auto").
  3. Generate field definitions via AI using structured output (JSON array). Pass the user prompt (input.ai.prompt), input.count (number of fields to generate), existing field names (for context and collision avoidance), and the dataset title.
  4. Generate field IDs for each new field: "fld_" + 12 random alphanumeric chars.
  5. Validate names — ensure no generated field name collides with any name in input.existingFields. If collision detected, skip the colliding field.
  6. Update the dataset — append new field definitions to the existing fields JSONB array. Write the fields_generated event.
  7. If backfill — when input.backfill = true and input.recordCount > 0, batch through existing records and generate values for the new fields (see Backfill Pattern below).
  8. Write output to task.output and set status = 'completed', completed_at = NOW().

SQL: Update Dataset Fields

Append new field definitions to the existing JSONB array:

```sql
UPDATE datasets
SET    fields = fields || $newFields::jsonb,
       updated_at = NOW(),
       updated_by = $userId
WHERE  id = $datasetId;

-- $newFields is a JSON array of the new FieldDefinition objects
-- $userId is task.created_by
```

SQL: Backfill a Record

Merge new field values and metadata into an existing record:

```sql
UPDATE dataset_records
SET    data = data || $newFieldValues::jsonb,
       _meta = COALESCE(_meta, '{}'::jsonb) || $newMeta::jsonb,
       updated_at = NOW(),
       updated_by = $userId
WHERE  id = $recordId;

-- $newFieldValues is { "Industry": "SaaS", "Founded": 2015 }
-- $newMeta merges into existing _meta, preserving prior citations
-- $userId is task.created_by
```

Error Handling

Backfill Pattern

When backfill is true and recordCount > 0, the worker populates new fields across all existing records:

  1. First, generate and validate the new field definitions. Update the dataset's fields column using append SQL (fields || $newFields::jsonb).
  2. Group new fields by their ai.search config: fields with a search object are grouped for web search, fields with ai: {} or ai: { "prompt": "..." } are synthetic, fields without ai use default synthetic generation.
  3. Fetch existing records in batches (25 at a time) to manage token limits.
  4. For each batch: run web search for search-enabled field groups, read existing record data, generate new field values via AI (separating synthetic vs. web-search-grounded fields in the prompt), update each record using merge SQL.
  5. Each backfilled record's _meta always includes generatedBy, taskId, model, and generatedAt. Existing per-field citations are preserved (merged, not overwritten).
  6. Update task progress after each batch.
  7. Token usage is accumulated across field generation and all backfill batches.
  8. If backfill is false, omitted, or recordCount is 0, only the field definitions are added — existing records are not modified.

```
Step 1  AI generates field definitions from prompt (with ai.search config)
  ↓
Step 2  Validate no name collisions with existing fields
  ↓
Step 3  Assign field IDs (fld_ + 12 random chars)
  ↓
Step 4  Append new fields to datasets.fields column
  ↓
Step 5  If backfill=true and recordCount > 0:
  ↓
Batch 1/6 → Web search (for search fields) → AI generates values → Update records → progress=17%
Batch 2/6 → Web search (for search fields) → AI generates values → Update records → progress=33%
Batch 3/6 → Web search (for search fields) → AI generates values → Update records → progress=50%
...
  ↓
→ Write output { fieldsAdded, recordsBackfilled, tokenUsage }
→ Set status = 'completed'
```

Web search during backfill: When a new field has an ai.search config, the worker runs web search before generating values for each batch. Fields are grouped by their search config (sources, include/exclude URLs, x_search) so fields with the same config share a single search query. This grounds factual fields in real data rather than relying on model knowledge alone.
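The batch and progress arithmetic used above (25 records per batch, progress as a rounded percentage of completed batches) can be sketched as:

```typescript
// 150 records at a batch size of 25 yields 6 batches, matching the diagram.
function totalBatches(recordCount: number, batchSize = 25): number {
  return Math.ceil(recordCount / batchSize);
}

// task.progress is 0-100; update after each completed batch.
function progressAfterBatch(batch: number, batches: number): number {
  return Math.round((batch / batches) * 100);
}
```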

Record _meta

When backfilling records, workers update each record's _meta column with generation metadata:

```json
{
  "generatedBy": "ai",
  "taskId": "task_abc123",
  "model": "gpt-4o",
  "generatedAt": "2025-01-15T10:30:00Z",
  "citations": {
    "Industry": [
      { "url": "https://example.com/article", "title": "Industry Report", "snippet": "..." }
    ],
    "Founded": [
      { "url": "https://example.com/company", "title": "Company Profile" }
    ]
  }
}
```

| Field | Type | Description |
| --- | --- | --- |
| generatedBy | string | Always "ai" for AI-generated data |
| taskId | string? | The task ID that produced this data |
| model | string? | Model used for generation |
| generatedAt | string? | ISO 8601 timestamp of when the data was generated |
| citations | Record<string, Citation[]>? | Per-field citation arrays. Each citation has url (required), title, and snippet (optional). Only present for fields that used web search. |

If the record already has a _meta value (e.g., from prior AI generation), merge the new metadata — do not overwrite existing citations for other fields.
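The merge rule can be sketched with shapes assumed from the table above (the helper name is hypothetical). Top-level metadata from the new generation wins, but citations are merged per field so prior citations survive:

```typescript
type Citation = { url: string; title?: string; snippet?: string };

interface RecordMeta {
  generatedBy?: string;
  taskId?: string;
  model?: string;
  generatedAt?: string;
  citations?: Record<string, Citation[]>;
}

// Merge new generation metadata into an existing _meta value without
// dropping citations previously recorded for other fields.
function mergeMeta(existing: RecordMeta | null, incoming: RecordMeta): RecordMeta {
  return {
    ...existing,
    ...incoming,
    citations: {
      ...(existing?.citations ?? {}),
      ...(incoming.citations ?? {}),
    },
  };
}
```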

Provider Routing

The API server resolves the provider and writes it to task.metadata.resolvedProvider. For backward compatibility, the worker can also detect the provider from the model prefix:

| Model Prefix | Provider | Notes |
| --- | --- | --- |
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |

Provider mapping from API server names: openai→openai, anthropic→anthropic, google→gemini, openrouter→xai.
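The routing logic above can be sketched as follows: the resolved metadata value wins when present (mapped through the API-server name table), and the prefix rules serve as the fallback. The helper name and the final "openai" fallback are assumptions:

```typescript
// Map API-server provider names to worker routing names, then fall back to
// model-prefix detection for backward compatibility.
const API_TO_WORKER: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};

function resolveProvider(model: string, resolvedProvider?: string): string {
  if (resolvedProvider) return API_TO_WORKER[resolvedProvider] ?? resolvedProvider;
  if (/^(gpt-|o3-|o4-)/.test(model)) return "openai";
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // assumed fallback; OpenAI is listed as the default provider
}
```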

Shared Requirements