Fields
How field generation works at the worker level — generating new field definitions, updating dataset schemas, and backfilling existing records.
Overview
The ai_generate_fields task type generates new field definitions using AI and updates the dataset schema. Optionally, if backfill is true and existing records are present, the worker generates values for the new fields across all existing records.
This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.
Schema-first approach: Unlike ai_generate_records which works with an existing schema to produce data rows, ai_generate_fields modifies the schema itself. The worker updates the datasets.fields column directly, so the new fields are immediately available for all future operations.
Database Tables
datasets table: The full datasets table schema, including the fields column that this handler writes, is documented on the Datasets page. The ai_generate_fields handler reads options->>'locked' to detect locked datasets and updates datasets.fields to append the newly generated field definitions.
dataset_records Table
During backfill, the worker updates existing records to add values for the new fields.
| Column | Type | Description |
|---|---|---|
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values; the worker merges new field values into this object |
| _meta | jsonb (nullable) | Citation and generation metadata |
| deleted_at | timestamp (nullable) | Soft-delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the record |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated the record |
Task Configuration
| Property | Value |
|---|---|
| Task Type | ai_generate_fields |
| Created By | POST /api/datasets/:id/fields/generate |
| Tables Used | tasks, task_events, datasets (update fields), dataset_records (backfill) |
Input Schema
The task's input column contains the validated generation request. AI configuration is wrapped in an ai object. The API server pre-populates existingFields and recordCount from the dataset before creating the task.
{
"count": 5,
"backfill": true,
"ai": {
"prompt": "Add industry classification and founding year fields",
"model": "gpt-5-nano",
"temperature": 0.5,
"streaming": false,
"max_output_tokens": 5000
},
"datasetId": "ds_abc123",
"existingFields": [
{ "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text" },
{ "id": "fld_g7h8i9j0k1l2", "name": "Price", "type": "number" }
],
"recordCount": 150
}
| Field | Type | Description |
|---|---|---|
| count | number | Number of fields to generate (1-20, default 5). Always present in task input. |
| backfill | boolean | If true and records exist, generate values for the new fields across existing records. Always present (the API server defaults it to true). |
| ai | object | AI configuration for field generation. Always present. |
| ai.prompt | string | User prompt describing what fields to generate |
| ai.model | string | Model ID to use (may be "auto"; use task.metadata.resolvedModel instead) |
| ai.temperature | number? | Sampling temperature (0-2) |
| ai.streaming | boolean? | Whether the worker should stream the AI response |
| ai.max_output_tokens | number? | Maximum output token limit for the AI response |
| datasetId | string | Target dataset ID to update fields on |
| existingFields | FieldDefinition[] | Current field definitions on the dataset (for context) |
| recordCount | number | Number of existing records (always populated by the API server) |
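Expressed as a TypeScript interface, the input looks roughly like this (a sketch derived from the table above; the `FieldDefinition` stand-in is detailed under Field Definition Shape below):

```typescript
// Minimal stand-in; the full shape is under Field Definition Shape below.
type FieldDefinition = { id: string; name: string; type: string } & Record<string, unknown>;

// Sketch of the validated task input, derived from the table above.
interface GenerateFieldsInput {
  count: number;                     // 1-20, default 5
  backfill: boolean;                 // API server defaults this to true
  ai: {
    prompt: string;
    model: string;                   // may be "auto"; prefer task.metadata.resolvedModel
    temperature?: number;            // 0-2
    streaming?: boolean;
    max_output_tokens?: number;
  };
  datasetId: string;
  existingFields: FieldDefinition[]; // current fields, for context
  recordCount: number;               // pre-populated by the API server
}
```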
Task Metadata
The API server writes a metadata object on the task row when creating it. The input.model field may contain "auto" (the default) — workers should use the resolved values from metadata instead.
{
"resolvedModel": "gpt-4o",
"resolvedProvider": "openai"
}
| Field | Type | Description |
|---|---|---|
| resolvedModel | string | The actual model ID to use for AI calls (resolved from "auto" or validated) |
| resolvedProvider | string | The provider to route to (e.g., "openai", "anthropic", "google", "openrouter") |
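A small guard makes the rule concrete (a sketch; the inline types mirror the tables above):

```typescript
// Returns the model/provider the worker should actually call.
// input.ai.model may be the literal "auto", so the resolved metadata wins.
function resolveModel(
  task: { metadata?: { resolvedModel?: string; resolvedProvider?: string } },
  input: { ai: { model: string } }
): { model: string; provider: string } {
  const model = task.metadata?.resolvedModel ?? input.ai.model;
  const provider = task.metadata?.resolvedProvider;
  if (model === "auto" || !provider) {
    throw new Error("task metadata is missing resolvedModel/resolvedProvider");
  }
  return { model, provider };
}
```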
Output Schema
Written to task.output when the task completes successfully:
{
"fieldsAdded": ["Industry", "Founded"],
"recordsBackfilled": 150,
"tokenUsage": {
"input_tokens": 9800,
"output_tokens": 6200,
"total_tokens": 16000
}
}
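As a TypeScript shape (a sketch of the JSON above):

```typescript
// Shape written to task.output on success.
interface GenerateFieldsOutput {
  fieldsAdded: string[];       // names of the newly added fields
  recordsBackfilled: number;   // 0 when backfill is skipped
  tokenUsage: {
    input_tokens: number;
    output_tokens: number;
    total_tokens: number;      // input_tokens + output_tokens
  };
}
```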
Task Events
Field generation is not a streaming task, so workers are not required to write task_events rows. However, workers may optionally write events for observability. Recommended event types:
| Event Type | Data | When |
|---|---|---|
| fields_generated | { "fields": [...] } | After generating and validating new field definitions |
| backfill_progress | { "batch": N, "totalBatches": N, "recordsUpdated": N } | After each backfill batch completes |
| done | { "model": "...", "provider": "..." } | On successful completion (before updating task status) |
Progress should always be tracked via task.progress (0-100) regardless of whether events are written.
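A worker that opts in to events might write them like this. This is a sketch assuming a node-postgres pool; the `task_events` column names (`task_id`, `type`, `data`) are assumptions based on the tables listed under Task Configuration:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

// Optional observability event plus the mandatory progress update.
// task_events column names here are assumptions; adapt to your schema.
async function reportBackfillBatch(
  taskId: string, batch: number, totalBatches: number, recordsUpdated: number
): Promise<void> {
  await pool.query(
    `INSERT INTO task_events (task_id, type, data) VALUES ($1, $2, $3)`,
    [taskId, "backfill_progress", { batch, totalBatches, recordsUpdated }]
  );
  // Progress is always tracked on the task row, events or not.
  await pool.query(
    `UPDATE tasks SET progress = $1 WHERE id = $2`,
    [Math.round((batch / totalBatches) * 100), taskId]
  );
}
```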
Field Definition Shape
Each generated field must follow this structure. The worker is responsible for generating valid field IDs and ensuring names don't collide with existing fields.
[
{
"id": "fld_Kx9mR2vL4wQn",
"name": "Industry",
"type": "text",
"description": "The company's primary industry sector",
"enum": ["SaaS", "Fintech", "Healthcare", "E-commerce", "AI/ML", "Other"],
"required": true,
"ai": { "search": { "citations": true } }
},
{
"id": "fld_pQ3rS7tU8vWx",
"name": "Founded Year",
"type": "integer",
"description": "Year the company was founded",
"exampleValue": 2015,
"ai": { "search": { "citations": true } }
},
{
"id": "fld_aB4cD9eF0gHi",
"name": "Sentiment Score",
"type": "number",
"description": "Computed sentiment of the company's market position (-1 to 1)",
"ai": { "prompt": "Rate the company's market sentiment from -1 (negative) to 1 (positive)" }
},
{
"id": "fld_jK5lM1nO2pQr",
"name": "Personal Notes",
"type": "text",
"description": "User's personal notes about the company"
}
]
| Property | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Workers must generate: "fld_" + 12 random alphanumeric chars (a-z, A-Z, 0-9). Must be unique across all fields in the dataset. |
| name | string | Yes | Field name. Must not collide with any name in existingFields. Can contain spaces. Must start with a letter or underscore and must not end with a space. |
| type | string | Yes | One of: text, number, integer, boolean, date, datetime, email, url, uuid, array, object. Do NOT generate formula fields. |
| description | string? | No | Human-readable description of the field's purpose. Recommended for all generated fields. |
| required | boolean? | No | Whether this field must have a value on every record. Default: false. |
| unique | boolean? | No | Whether values must be unique across all records. Default: false. Only for scalar types. |
| default | any? | No | Default value used when a record is created without this field. Must match the field's type. |
| autoGenerate | string? | No | Auto-generate the value on record creation. One of "uuid", "createdAt", "updatedAt". |
| enum | any[]? | No | Allowed values list. When set, record values must be one of these. |
| exampleValue | any? | No | Example value showing the expected format. Must match the field's type. If enum is set, must be one of the enum values. |
| ai | object? | No | AI generation config for this field during record generation. See AI Property on Generated Fields below. |
ID Generation
Workers must generate field IDs themselves — the API server does not assign them. Use "fld_" + 12 random characters from a-zA-Z0-9. Example: "fld_Kx9mR2vL4wQn". Each ID must be unique within the dataset's field array.
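A minimal generator, using Node's crypto module for unbiased random characters (a sketch; any CSPRNG-backed approach works):

```typescript
import { randomInt } from "node:crypto";

const ALPHABET =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

// "fld_" + 12 random alphanumeric characters, e.g. "fld_Kx9mR2vL4wQn".
function generateFieldId(): string {
  let suffix = "";
  for (let i = 0; i < 12; i++) suffix += ALPHABET[randomInt(ALPHABET.length)];
  return "fld_" + suffix;
}

// Guarantee uniqueness within the dataset's full field array.
function uniqueFieldId(existingIds: Set<string>): string {
  let id = generateFieldId();
  while (existingIds.has(id)) id = generateFieldId();
  existingIds.add(id);
  return id;
}
```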
Name Collision Handling
If the AI generates a field name that already exists in existingFields, the worker should either retry generation with an instruction to avoid those names, or append a suffix (e.g., "Industry_2"). Do not silently overwrite existing field definitions.
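The suffix fallback can be as simple as the sketch below (retrying generation with an avoid-list is equally valid):

```typescript
// Returns a name that does not collide with existing field names,
// appending _2, _3, ... as needed. Never overwrites an existing field.
function dedupeFieldName(name: string, existingNames: Set<string>): string {
  if (!existingNames.has(name)) return name;
  let n = 2;
  while (existingNames.has(`${name}_${n}`)) n++;
  return `${name}_${n}`;
}
```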
AI Property on Generated Fields
When generating new field definitions, workers must decide whether and how to set the ai property. This property controls how the field behaves during future AI record generation. The decision should be based on the nature of the field:
| Field Nature | ai Value | Examples |
|---|---|---|
| Factual / real-world data | { "search": { "citations": true } } | Company Name, CEO, Contact Email, Founded Year, Address, Phone Number, News Headline, Population, Stock Price |
| Synthesizable / analytical | {} or { "prompt": "..." } | Sentiment Score, Category, Summary, Tags, Risk Level, Recommendation, Classification |
| User-input / subjective | Omit the ai property entirely | Personal Notes, My Rating, Comments, Custom Label, Internal Memo, User Feedback |
Never set ai: false on generated fields. This value is reserved for users who explicitly want to permanently exclude a field from AI generation. Workers should not make this decision — either omit the ai property (field uses default AI settings) or set it to an appropriate config object.
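A defensive sanitizer can enforce this rule on AI output before the fields are written (a sketch):

```typescript
// ai: false is reserved for users; strip it if the model emits it anyway.
// Omitting ai leaves the field on default AI settings.
function stripReservedAiFlag(field: { ai?: unknown }): void {
  if (field.ai === false) delete field.ai;
}
```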
ai.search Sub-Object
The search property within ai configures web search grounding for the field. Its presence signals that the field's data should come from web sources rather than model knowledge.
| Property | Type | Description |
|---|---|---|
| sources | number? | Number of web sources to fetch per search (1-50) |
| include_urls | string[]? | Restrict search to these URLs/domains |
| exclude_urls | string[]? | Exclude these URLs/domains from search |
| x_search | boolean? | Include X.com posts in search results |
| citations | boolean? | Require per-field citation metadata. Always set to true for factual fields. |
During backfill, the worker reads each new field's ai.search config to decide whether to use web search for that field's values. Fields are grouped by search config and web search is run once per group.
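Grouping can key on a canonical serialization of the search config, so fields sharing an identical config share one search. A sketch (the `FieldDef` type is a minimal stand-in):

```typescript
type FieldDef = {
  id: string;
  name: string;
  ai?: { search?: Record<string, unknown>; prompt?: string };
};

// Order-independent serialization so identical configs produce identical keys.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value && typeof value === "object") {
    const entries = Object.keys(value as object).sort()
      .map(k => `${JSON.stringify(k)}:${canonical((value as any)[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// One group per distinct ai.search config; non-search fields are skipped.
function groupBySearchConfig(fields: FieldDef[]): Map<string, FieldDef[]> {
  const groups = new Map<string, FieldDef[]>();
  for (const field of fields) {
    const search = field.ai?.search;
    if (!search) continue; // synthetic and user-input fields need no search
    const key = canonical(search);
    groups.set(key, [...(groups.get(key) ?? []), field]);
  }
  return groups;
}
```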
Processing Steps
Complete step-by-step flow for processing an ai_generate_fields task (a condensed code sketch follows the list):
- Fetch the dataset by `input.datasetId`. If the dataset is locked (`options->>'locked'` is true), fail the task with error "Dataset is locked". If `deleted_at` is not null, fail with "Dataset has been deleted".
- Read the resolved model from `task.metadata.resolvedModel` and `task.metadata.resolvedProvider`. Never use `input.ai.model` directly (it may be "auto").
- Generate field definitions via AI using structured output (a JSON array). Pass the user prompt (`input.ai.prompt`), `input.count` (the number of fields to generate), the existing field names (for context and collision avoidance), and the dataset title.
- Generate field IDs for each new field: `"fld_"` + 12 random alphanumeric chars.
- Validate names: ensure no generated field name collides with any name in `input.existingFields`. If a collision is detected, skip the colliding field.
- Update the dataset: append the new field definitions to the existing fields JSONB array. Write the `fields_generated` event.
- Backfill: when `input.backfill = true` and `input.recordCount > 0`, batch through existing records and generate values for the new fields (see Backfill Pattern below).
- Write output to `task.output` and set `status = 'completed'`, `completed_at = NOW()`.
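A condensed sketch of the flow above, reusing `GenerateFieldsInput`, `FieldDef`, and `resolveModel` from the earlier sketches. All other helper names are illustrative stand-ins for logic described elsewhere on this page:

```typescript
type Task = {
  id: string;
  created_by: string;
  input: GenerateFieldsInput; // see Input Schema above
  metadata?: { resolvedModel?: string; resolvedProvider?: string };
};

// Illustrative helpers; each maps to a section of this page.
declare function fetchDataset(id: string): Promise<{ locked: boolean; deleted_at: string | null }>;
declare function generateFieldDefs(model: string, provider: string, input: GenerateFieldsInput): Promise<FieldDef[]>;
declare function appendFields(datasetId: string, fields: FieldDef[], userId: string): Promise<void>;
declare function backfillRecords(task: Task, fields: FieldDef[]): Promise<number>;
declare function completeTask(id: string, output: object): Promise<void>;
declare function failTask(id: string, error: string): Promise<void>;

async function processGenerateFields(task: Task): Promise<void> {
  const input = task.input;

  const dataset = await fetchDataset(input.datasetId);
  if (dataset.locked) return failTask(task.id, "Dataset is locked");
  if (dataset.deleted_at) return failTask(task.id, "Dataset has been deleted");

  const { model, provider } = resolveModel(task, input); // never trust "auto"

  // Structured output: the AI returns a JSON array of field definitions.
  const generated = await generateFieldDefs(model, provider, input);

  // Drop any field whose name collides with an existing one.
  const existing = new Set(input.existingFields.map(f => f.name));
  const fields = generated.filter(f => !existing.has(f.name));

  await appendFields(input.datasetId, fields, task.created_by); // fields || $newFields::jsonb

  let recordsBackfilled = 0;
  if (input.backfill && input.recordCount > 0) {
    recordsBackfilled = await backfillRecords(task, fields);
  }
  await completeTask(task.id, {
    fieldsAdded: fields.map(f => f.name),
    recordsBackfilled,
  });
}
```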
SQL: Update Dataset Fields
Append new field definitions to the existing JSONB array:
UPDATE datasets
SET fields = fields || $newFields::jsonb,
updated_at = NOW(),
updated_by = $userId
WHERE id = $datasetId;
-- $newFields is a JSON array of the new FieldDefinition objects
-- $userId is task.created_by
SQL: Backfill a Record
Merge new field values and metadata into an existing record:
UPDATE dataset_records
SET data = data || $newFieldValues::jsonb,
_meta = COALESCE(_meta, '{}'::jsonb) || $newMeta::jsonb,
updated_at = NOW(),
updated_by = $userId
WHERE id = $recordId;
-- $newFieldValues is { "Industry": "SaaS", "Founded": 2015 }
-- $newMeta merges into existing _meta, preserving prior citations
-- $userId is task.created_by
Error Handling
- If the dataset is locked or deleted, set `task.status = 'failed'`, `task.error` to the reason, and `task.completed_at = NOW()`.
- If a backfill batch fails, keep the previously updated records and continue to the next batch. Report the failure in a `backfill_progress` event.
- Check `task.status` for cancellation every few batches. If cancelled, write partial output and set the status to 'cancelled'.
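The terminal-failure path might look like this (a sketch; `pool` is a node-postgres pool as in the event example above):

```typescript
// Terminal failure: record the reason and stamp completion.
// Columns match the error-handling rules above (status, error, completed_at).
async function failTask(taskId: string, reason: string): Promise<void> {
  await pool.query(
    `UPDATE tasks SET status = 'failed', error = $1, completed_at = NOW()
     WHERE id = $2`,
    [reason, taskId]
  );
}
```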
Backfill Pattern
When `backfill` is true and `recordCount > 0`, the worker populates the new fields across all existing records:
- First, generate and validate the new field definitions. Update the dataset's `fields` column using the append SQL (`fields || $newFields::jsonb`).
- Group the new fields by their `ai.search` config: fields with a `search` object are grouped for web search; fields with `ai: {}` or `ai: { "prompt": "..." }` are synthetic; fields without `ai` use default synthetic generation.
- Fetch existing records in batches (25 at a time) to manage token limits.
- For each batch: run web search for the search-enabled field groups, read the existing record data, generate new field values via AI (separating synthetic vs. web-search-grounded fields in the prompt), and update each record using the merge SQL.
- Each backfilled record's `_meta` always includes `generatedBy`, `taskId`, `model`, and `generatedAt`. Existing per-field citations are preserved (merged, not overwritten).
- Update task progress after each batch.
- Accumulate token usage across field generation and all backfill batches.
- If `backfill` is false or omitted, or `recordCount` is 0, only the field definitions are added; existing records are not modified.
Web search during backfill: When a new field has an ai.search config, the worker runs web search before generating values for each batch. Fields are grouped by their search config (sources, include/exclude URLs, x_search) so fields with the same config share a single search query. This grounds factual fields in real data rather than relying on model knowledge alone.
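Putting the rules together, the batch loop might be sketched as follows. Helper names are illustrative; the 25-record batch size and the per-batch ordering come from the list above, and `Task`, `FieldDef`, and `groupBySearchConfig` are reused from the earlier sketches:

```typescript
const BATCH_SIZE = 25; // keeps per-call token usage manageable

// Illustrative helpers for the web search, AI call, merge SQL, and task I/O.
declare function fetchRecordBatch(datasetId: string, offset: number, limit: number):
  Promise<Array<{ id: string; data: object }>>;
declare function runWebSearch(groups: Map<string, FieldDef[]>, records: object[]): Promise<object | null>;
declare function generateValues(records: object[], fields: FieldDef[], search: object | null):
  Promise<Record<string, object>>; // record id -> new field values
declare function mergeRecord(recordId: string, values: object, meta: object, userId: string): Promise<void>;
declare function isCancelled(taskId: string): Promise<boolean>;
declare function setProgress(taskId: string, pct: number): Promise<void>;

async function backfillRecords(task: Task, fields: FieldDef[]): Promise<number> {
  const { datasetId, recordCount } = task.input;
  const totalBatches = Math.ceil(recordCount / BATCH_SIZE);
  let updated = 0;

  for (let batch = 1; batch <= totalBatches; batch++) {
    if (await isCancelled(task.id)) break; // cancellation check between batches

    const records = await fetchRecordBatch(datasetId, (batch - 1) * BATCH_SIZE, BATCH_SIZE);

    // One web search per distinct ai.search config (see the grouping sketch).
    const search = await runWebSearch(groupBySearchConfig(fields), records);
    const values = await generateValues(records, fields, search);

    for (const record of records) {
      const meta = {
        generatedBy: "ai",
        taskId: task.id,
        model: task.metadata?.resolvedModel,
        generatedAt: new Date().toISOString(),
      };
      await mergeRecord(record.id, values[record.id] ?? {}, meta, task.created_by);
      updated++;
    }
    await setProgress(task.id, Math.round((batch / totalBatches) * 100));
  }
  return updated;
}
```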
Record _meta
When backfilling records, workers update each record's _meta column with generation metadata:
{
"generatedBy": "ai",
"taskId": "task_abc123",
"model": "gpt-4o",
"generatedAt": "2025-01-15T10:30:00Z",
"citations": {
"Industry": [
{ "url": "https://example.com/article", "title": "Industry Report", "snippet": "..." }
],
"Founded": [
{ "url": "https://example.com/company", "title": "Company Profile" }
]
}
}
| Field | Type | Description |
|---|---|---|
| generatedBy | string | Always "ai" for AI-generated data |
| taskId | string? | The task ID that produced this data |
| model | string? | Model used for generation |
| generatedAt | string? | ISO 8601 timestamp of when the data was generated |
| citations | Record<string, Citation[]>? | Per-field citation arrays. Each citation has url (required) plus optional title and snippet. Only present for fields that used web search. |
If the record already has a _meta value (e.g., from prior AI generation), merge the new metadata — do not overwrite existing citations for other fields.
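Note that the SQL `||` operator only merges top-level keys, so writing a fresh `citations` object via the merge SQL alone would clobber citations recorded for other fields. One approach is to merge in the worker before writing (a sketch):

```typescript
type Citation = { url: string; title?: string; snippet?: string };

type RecordMeta = {
  generatedBy?: string;
  taskId?: string;
  model?: string;
  generatedAt?: string;
  citations?: Record<string, Citation[]>;
};

// jsonb || replaces top-level keys wholesale, so combine citations here first;
// otherwise the new citations object would overwrite the existing one.
function mergeMeta(existing: RecordMeta | null, incoming: RecordMeta): RecordMeta {
  return {
    ...existing,
    ...incoming,
    citations: {
      ...(existing?.citations ?? {}),
      ...(incoming.citations ?? {}),
    },
  };
}
```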
Provider Routing
The API server resolves the provider via task.metadata.resolvedProvider. For backward compatibility, the worker also detects the provider from the model prefix:
| Model Prefix | Provider | Notes |
|---|---|---|
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |

Provider mapping from API server names: openai → openai, anthropic → anthropic, google → gemini, openrouter → xai.
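The fallback detection might look like this (a sketch; the default follows the table above, and the name mapping mirrors the line above it):

```typescript
// Fallback for task rows without metadata.resolvedProvider.
function detectProvider(model: string): string {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // gpt-*, o3-*, o4-*, and the default for most tasks
}

// API server name -> worker provider name, per the mapping above.
const PROVIDER_MAP: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};
```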
Shared Requirements
- Token usage: accumulated across field generation and all backfill batches, written to `task.output.tokenUsage`
- Progress: updated via `task.progress` (0-100) after each backfill batch
- Cancellation: task status is checked between batches; processing stops if cancelled
- Error handling: on failure, the status is set to `failed`, the error message is recorded, and `completed_at` is set
- Structured output: all AI calls use a JSON schema for parseable responses