Records

How record operations work at the worker level — generating, validating, deduplicating, and inserting records with citation metadata.

Overview

The ai_generate_records task type generates new records using AI based on a prompt and the dataset's field definitions, then inserts them directly into dataset_records. It supports chunked generation, deduplication, web search grounding, and per-field citation metadata.

This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.

Multi-provider support: The worker routes AI calls to the correct provider based on the model string in the task input. Supported providers: OpenAI (gpt-*), Anthropic (claude-*), Google Gemini (gemini-*), and xAI/Grok (grok-*). X/Twitter search (x_search) specifically requires a Grok model.

Database Tables

Workers handling record tasks need access to these tables in addition to tasks and task_events:

dataset_records Table

Each row is a single record in a dataset. Workers insert new rows here during ai_generate_records and update existing rows during ai_generate_fields backfill.

Column | Type | Description
------ | ---- | -----------
id | varchar (UUID) | Primary key, auto-generated
dataset_id | varchar | FK to datasets.id
data | jsonb | The record's field values as a JSON object
_meta | jsonb (nullable) | Citation and generation metadata (see below)
deleted_at | timestamp (nullable) | Soft delete timestamp
created_at | timestamp | Auto-set on insert
created_by | varchar | User ID who created the task (use task.created_by)
updated_at | timestamp | Updated on every change
updated_by | varchar | User ID who last updated

_meta Column

Every AI-generated record must populate the _meta column with per-field citations and generation provenance. _meta is a separate JSONB column on dataset_records, kept outside the data column so user data stays clean.

{
  "citations": {
    "Company": [{ "url": "https://example.com/page", "title": "Source Page", "snippet": "Acme Corp is..." }],
    "Price": [{ "url": "https://pricing.example.com", "title": "Pricing", "snippet": "$29/mo" }]
  },
  "generatedBy": "ai",
  "taskId": "task_xyz789",
  "model": "gpt-5-nano",
  "generatedAt": "2026-02-22T10:30:00.000Z"
}
Field | Type | Description
----- | ---- | -----------
citations | Record<string, {url, title?, snippet?}[]> | Per-field citation sources from web search. Key is the field name; value is an array of sources.
generatedBy | string | Always "ai" for AI-generated records
taskId | string | ID of the task that generated this record
model | string | AI model used for generation
generatedAt | string | ISO 8601 timestamp of when the record was generated
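
A worker helper that assembles this object might look like the following sketch; buildRecordMeta and the Citation/RecordMeta type names are illustrative, not part of the codebase:

```typescript
// Illustrative types mirroring the _meta schema documented above.
interface Citation {
  url: string;
  title?: string;
  snippet?: string;
}

interface RecordMeta {
  citations: Record<string, Citation[]>;
  generatedBy: "ai";
  taskId: string;
  model: string;
  generatedAt: string;
}

// Assemble the _meta value for one generated record. `model` should be
// the resolved model (never "auto"); `citations` maps field names to
// the web-search sources that grounded each value.
function buildRecordMeta(
  taskId: string,
  model: string,
  citations: Record<string, Citation[]> = {}
): RecordMeta {
  return {
    citations,
    generatedBy: "ai",
    taskId,
    model,
    generatedAt: new Date().toISOString(),
  };
}
```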

ai_generate_records

This task type generates dataset records using AI and inserts them directly into dataset_records. The worker reads the dataset's field definitions, generates records in chunks, deduplicates against existing data, and writes each record with citation metadata in the _meta column.

Property | Value
-------- | -----
Task Type | ai_generate_records
Created By | POST /api/datasets/:id/records with ai body
Tables Used | tasks, task_events, dataset_records (insert)

Input Schema

The task's input column contains the validated generation request. The API server pre-populates fields, selectedFields, fieldSources, and existingUniqueValues from the dataset before creating the task.

{
  "prompt": "Generate 50 SaaS companies with pricing data",
  "count": 50,
  "model": "gpt-5-nano",
  "temperature": 0.7,
  "streaming": true,
  "max_output_tokens": 10000,
  "web_search": {
    "enabled": true,
    "sources": 5,
    "include_urls": ["https://example.com"],
    "exclude_urls": [],
    "x_search": false,
    "citations": true
  },
  "chunk_size": 50,
  "datasetId": "ds_abc123",
  "fields": [
    { "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text", "required": true, "unique": true, "ai": { "search": { "sources": 10, "citations": true } } },
    { "id": "fld_b2c3d4e5f6g7", "name": "Price", "type": "number", "ai": { "search": {} } },
    { "id": "fld_c3d4e5f6g7h8", "name": "Website", "type": "url", "ai": { "search": {} } },
    { "id": "fld_d4e5f6g7h8i9", "name": "ID", "type": "uuid", "autoGenerate": "uuid" },
    { "id": "fld_e5f6g7h8i9j0", "name": "Category", "type": "text", "enum": ["B2B", "B2C", "Enterprise"], "ai": { "model": "claude-sonnet-4-20250514" } },
    { "id": "fld_f6g7h8i9j0k1", "name": "Notes", "type": "text", "ai": false },
    { "id": "fld_g7h8i9j0k1l2", "name": "Revenue Score", "type": "formula", "formula": "Price * 12" }
  ],
  "selectedFields": ["Company", "Price", "Website", "Category"],
  "fieldSources": {
    "Company": "web_search",
    "Price": "web_search",
    "Website": "web_search",
    "Category": "web_search"
  },
  "fieldWebSearchConfigs": {
    "Company": { "sources": 10, "citations": true }
  },
  "fieldModels": {
    "Category": "claude-sonnet-4-20250514"
  },
  "fieldPrompts": {
    "Company": "Research the company and provide their official description"
  },
  "fieldTemperatures": {
    "Company": 0.5
  },
  "fieldStreaming": {
    "Company": true
  },
  "fieldMaxOutputTokens": {
    "Company": 2000
  },
  "existingUniqueValues": {
    "Company": ["Acme Corp", "Globex"]
  }
}

Model resolution: The API server writes task.metadata.resolvedModel (concrete model ID, never "auto") and task.metadata.resolvedProvider (e.g., "openai", "anthropic", "google", "openrouter"). The worker reads these first, falling back to input.model and prefix-based detection for backward compatibility. The resolved model is used in _meta.model on each inserted record.
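
This resolution order can be sketched as follows; the TaskLike shape and function names are illustrative:

```typescript
// Illustrative task shape; only the fields relevant to model resolution.
interface TaskLike {
  metadata?: { resolvedModel?: string; resolvedProvider?: string };
  input: { model: string };
}

// Legacy fallback: detect the provider from the model prefix.
function providerFromPrefix(model: string): string {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "google";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // gpt-*, o3-*, o4-*, and the default
}

// Resolved metadata wins; input.model plus prefix detection is the
// backward-compatible fallback for tasks created before resolution existed.
function resolveModel(task: TaskLike): { model: string; provider: string } {
  const model = task.metadata?.resolvedModel ?? task.input.model;
  const provider = task.metadata?.resolvedProvider ?? providerFromPrefix(model);
  return { model, provider };
}
```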

Field | Type | Description
----- | ---- | -----------
prompt | string? | User prompt describing what records to generate
count | integer | Total number of records to generate
model | string | Requested model ID (may be "auto"). Authoritative model/provider come from task.metadata.resolvedModel / resolvedProvider.
temperature | number? | Request-level sampling temperature (0-2)
streaming | boolean? | Request-level streaming preference
max_output_tokens | number? | Request-level maximum output tokens. When provided, used as the token limit for AI generation calls instead of the estimated default.
web_search | object? | Web search config. enabled is auto-set to true when any field uses web_search. Sub-fields: sources, include_urls, exclude_urls, x_search, citations
chunk_size | integer | Records per generation chunk (default 50)
datasetId | string | Target dataset ID to insert records into
fields | FieldDefinition[] | Complete field schema. Each field has an id (stable ID like "fld_...") and optionally ai (false or {model?, prompt?, search?}). Includes autoGenerate, enum, formula, exampleValue, description.
selectedFields | string[] | Field names to generate via AI. Pre-resolved by the API server (always present). Defaults to all non-formula fields where ai is not false.
fieldSources | object | Pre-resolved map of field name → source ("synthetic" or "web_search"). Authoritative source for each field.
fieldWebSearchConfigs | object? | Per-field web search preferences (from the field's ai.search). Merged with request-level web_search, with request-level taking priority.
fieldModels | object? | Pre-resolved map of field name → preferred model ID (from the field's ai.model config).
fieldPrompts | object? | Pre-resolved map of field name → custom prompt guidance (from the field's ai.prompt config). Appended to the AI generation instructions for that field.
fieldTemperatures | object? | Pre-resolved map of field name → temperature override (from the field's ai.temperature config, 0-2).
fieldStreaming | object? | Pre-resolved map of field name → streaming preference (from the field's ai.streaming config).
fieldMaxOutputTokens | object? | Pre-resolved map of field name → max output token limit (from the field's ai.max_output_tokens config).
existingUniqueValues | object | Map of unique field names to arrays of existing values, used for deduplication
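
The merge rule for per-field web search settings noted above (request-level takes priority) could be implemented as follows; mergeWebSearch and the WebSearchConfig shape are illustrative:

```typescript
// Subset of the web_search config documented above.
interface WebSearchConfig {
  sources?: number;
  include_urls?: string[];
  exclude_urls?: string[];
  x_search?: boolean;
  citations?: boolean;
}

// Per-field values act as defaults; any key the request-level config
// sets explicitly overrides the per-field value.
function mergeWebSearch(
  fieldConfig: WebSearchConfig | undefined,
  requestConfig: WebSearchConfig | undefined
): WebSearchConfig {
  const merged: WebSearchConfig = { ...(fieldConfig ?? {}) };
  for (const [key, value] of Object.entries(requestConfig ?? {})) {
    if (value !== undefined) (merged as Record<string, unknown>)[key] = value;
  }
  return merged;
}
```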

Non-Selected Field Handling

Fields present in fields but absent from selectedFields are never sent to the AI. Instead, the worker assigns their values using this priority order:

  1. If field.type === "formula" → skip entirely (do not include in data object)
  2. If field.autoGenerate === "uuid" → generate a UUID v4 string
  3. If field.autoGenerate === "createdAt" or "updatedAt" → current ISO 8601 datetime
  4. If field.default is defined → use that value
  5. Otherwise → set to null
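
The priority chain above can be sketched as a single helper; nonSelectedValue and the FieldDef shape are illustrative, and the inline UUID generator merely keeps the sketch dependency-free:

```typescript
// Illustrative field shape; only the properties the priority chain reads.
interface FieldDef {
  name: string;
  type: string;
  autoGenerate?: "uuid" | "createdAt" | "updatedAt";
  default?: unknown;
}

// Minimal UUID v4 generator so the sketch stays dependency-free; a real
// worker would likely use crypto.randomUUID() instead.
function uuidv4(): string {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = (Math.random() * 16) | 0;
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}

// Value for a field not in selectedFields; `undefined` means the field
// is skipped entirely (formula fields are computed elsewhere).
function nonSelectedValue(field: FieldDef): unknown {
  if (field.type === "formula") return undefined;
  if (field.autoGenerate === "uuid") return uuidv4();
  if (field.autoGenerate === "createdAt" || field.autoGenerate === "updatedAt") {
    return new Date().toISOString();
  }
  if (field.default !== undefined) return field.default;
  return null;
}
```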

Output Schema

Written to task.output when the task completes successfully:

{
  "inserted": 48,
  "skipped": 2,
  "skippedReasons": [
    { "data": { "Company": "Acme Corp", "Price": 29, "Category": "B2B" }, "reason": "Duplicate value 'Acme Corp' for unique field 'Company'" },
    { "data": { "Company": "Globex", "Price": 49, "Category": "Enterprise" }, "reason": "Duplicate value 'Globex' for unique field 'Company'" }
  ],
  "totalChunks": 1,
  "tokenUsage": {
    "input_tokens": 12500,
    "output_tokens": 8400,
    "total_tokens": 20900
  }
}

Chunked Generation

Large generation requests are split into chunks to stay within LLM token limits and provide incremental progress:

Chunk 1/2 → Generate 50 records → Validate → Dedup → Insert → progress=50%
Chunk 2/2 → Generate 50 records → Validate → Dedup → Insert → progress=100%
→ Write output { inserted, skipped, skippedReasons, totalChunks, tokenUsage }
→ Set status = 'completed'
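
The chunk plan itself is simple arithmetic; a sketch with illustrative function names:

```typescript
// Split a total record count into per-chunk generation targets.
function planChunks(count: number, chunkSize: number): number[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: number[] = [];
  for (let remaining = count; remaining > 0; remaining -= chunkSize) {
    chunks.push(Math.min(chunkSize, remaining));
  }
  return chunks;
}

// Progress percentage to report after finishing chunk `index`.
function progressAfter(index: number, chunks: number[], total: number): number {
  const done = chunks.slice(0, index + 1).reduce((sum, n) => sum + n, 0);
  return Math.round((done / total) * 100);
}
```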

Deduplication

For fields marked as unique in the field definitions, the worker applies a three-layer deduplication strategy:

  1. Pre-prompt: Pass existingUniqueValues to the AI prompt so it avoids generating known duplicates. Also include values generated in previous chunks.
  2. Post-validate within batch: After generation, check for duplicates within the current chunk.
  3. Post-validate against DB: Check generated unique values against existing database records.

Skipped records and reasons are tracked in output.skipped and output.skippedReasons.
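
Layers 2 and 3 can share one helper that carries a running set of seen values, seeded from existingUniqueValues plus earlier chunks; dedupeRecords is an illustrative name, and the skip reason mirrors the skippedReasons format shown in the output schema:

```typescript
type Row = Record<string, unknown>;

// Drop rows whose value for `field` was already seen, either earlier in
// this batch or in `seen` (existing DB values plus prior chunks).
// Mutates `seen` so the next chunk keeps deduplicating correctly.
function dedupeRecords(
  records: Row[],
  field: string,
  seen: Set<string>
): { kept: Row[]; skippedReasons: { data: Row; reason: string }[] } {
  const kept: Row[] = [];
  const skippedReasons: { data: Row; reason: string }[] = [];
  for (const record of records) {
    const value = String(record[field]);
    if (seen.has(value)) {
      skippedReasons.push({
        data: record,
        reason: `Duplicate value '${value}' for unique field '${field}'`,
      });
    } else {
      seen.add(value);
      kept.push(record);
    }
  }
  return { kept, skippedReasons };
}
```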

Anti-Hallucination

When web search is enabled, the worker follows a search-then-extract pipeline to prevent hallucinated data: it retrieves web sources first, then extracts field values only from the retrieved content, recording the source URLs as per-field citations.

X/Twitter search: When x_search is true, the task must use a Grok model (grok-*) since xAI has native access to X.com content. Other providers cannot access X/Twitter data.

Shared Requirements

Provider Routing

The API server resolves the provider via task.metadata.resolvedProvider. For backward compatibility, the worker also detects the provider from the model prefix:

Model Prefix | Provider | Notes
------------ | -------- | -----
gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks
claude-* | Anthropic | Strong at structured data generation
gemini-* | Google Gemini | Good for high-volume tasks
grok-* | xAI (Grok) | Required for X/Twitter search (x_search)

All providers are accessed through an OpenAI-compatible SDK, differing only in base URL and API key. The API server's provider names map to worker providers as follows: openai → openai, anthropic → anthropic, google → gemini, openrouter → xai.
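
A minimal sketch of that name mapping, assuming the worker keys its SDK clients by these provider slots; PROVIDER_MAP and workerProvider are illustrative names:

```typescript
// API-server provider name → worker provider slot (base URL + API key).
const PROVIDER_MAP: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};

// Fail loudly on unknown names rather than silently defaulting.
function workerProvider(apiServerName: string): string {
  const provider = PROVIDER_MAP[apiServerName];
  if (provider === undefined) {
    throw new Error(`Unknown provider name from API server: ${apiServerName}`);
  }
  return provider;
}
```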