Records

How record operations work at the worker level — generating, validating, deduplicating, and inserting records with citation metadata.

Overview

The ai_generate_records task type generates new records using AI based on a prompt and the dataset's field definitions, then inserts them directly into dataset_records. It supports chunked generation, deduplication, web search grounding, and per-field citation metadata.

This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.

Multi-provider support: The worker routes AI calls to the correct provider based on the model string in the task input. Supported providers: OpenAI (gpt-*), Anthropic (claude-*), Google Gemini (gemini-*), and xAI/Grok (grok-*). X/Twitter search (x_search) specifically requires a Grok model.

Database Tables

Workers handling record tasks need access to these tables in addition to tasks and task_events:

dataset_records Table

Each row is a single record in a dataset. Workers insert new rows here during ai_generate_records and update existing rows during ai_generate_fields backfill.

| Column | Type | Description |
| --- | --- | --- |
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values as a JSON object |
| _meta | jsonb (nullable) | Citation and generation metadata (see below) |
| deleted_at | timestamp (nullable) | Soft delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the task (use task.created_by) |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated |

_meta Column

Every AI-generated record must have a _meta column containing per-field citations and generation provenance. This is a separate JSONB column on dataset_records, not inside the data column, to keep user data clean.

```json
{
  "citations": {
    "Company": [{ "url": "https://example.com/page", "title": "Source Page", "snippet": "Acme Corp is..." }],
    "Price": [{ "url": "https://pricing.example.com", "title": "Pricing", "snippet": "$29/mo" }]
  },
  "generatedBy": "ai",
  "taskId": "task_xyz789",
  "model": "gpt-5-nano",
  "generatedAt": "2026-02-22T10:30:00.000Z"
}
```
| Field | Type | Description |
| --- | --- | --- |
| citations | Record<string, {url, title?, snippet?}[]> | Per-field citation sources from web search. Key is the field name, value is an array of sources. |
| generatedBy | string | Always "ai" for AI-generated records |
| taskId | string | ID of the task that generated this record |
| model | string | AI model used for generation |
| generatedAt | string | ISO 8601 timestamp when the record was generated |
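As a concrete illustration, a worker might assemble the _meta value like this. This is a hypothetical helper, not the actual implementation; the Citation and RecordMeta types mirror the schema documented above, and the field names in the usage are illustrative.

```typescript
type Citation = { url: string; title?: string; snippet?: string };

interface RecordMeta {
  citations: Record<string, Citation[]>;
  generatedBy: "ai";
  taskId: string;
  model: string;
  generatedAt: string;
}

// Build the _meta value for one AI-generated record. Citations are keyed by
// field name, matching the schema in the table above.
function buildMeta(
  citations: Record<string, Citation[]>,
  taskId: string,
  model: string,
): RecordMeta {
  return {
    citations,
    generatedBy: "ai",
    taskId,
    model,
    generatedAt: new Date().toISOString(),
  };
}
```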

ai_generate_records

This task type generates dataset records using AI and inserts them directly into dataset_records. The worker reads the dataset's field definitions, generates records in chunks, deduplicates against existing data, and writes each record with citation metadata in the _meta column.

| Property | Value |
| --- | --- |
| Task Type | ai_generate_records |
| Created By | POST /api/datasets/:id/records with ai body |
| Tables Used | tasks, task_events, dataset_records (insert) |

Input Schema

The task's input column contains the validated generation request. The API server pre-populates fields, selectedFields, fieldSources, and existingUniqueValues from the dataset before creating the task.

```json
{
  "prompt": "Generate 50 SaaS companies with pricing data",
  "count": 50,
  "model": "gpt-5-nano",
  "temperature": 0.7,
  "web_search": {
    "enabled": true,
    "sources": 5,
    "include_urls": ["https://example.com"],
    "exclude_urls": [],
    "x_search": false,
    "citations": true
  },
  "chunk_size": 10,
  "datasetId": "ds_abc123",
  "fields": [
    { "name": "Company", "type": "text", "required": true, "unique": true, "aiGenerate": true, "aiSource": "web_search" },
    { "name": "Price", "type": "number", "aiGenerate": true, "aiSource": "web_search" },
    { "name": "Website", "type": "url", "aiGenerate": true, "aiSource": "web_search" },
    { "name": "ID", "type": "uuid", "autoGenerate": "uuid" },
    { "name": "Category", "type": "text", "enum": ["B2B", "B2C", "Enterprise"], "aiGenerate": true },
    { "name": "Notes", "type": "text", "aiGenerate": false },
    { "name": "Revenue Score", "type": "formula", "formula": "Price * 12" }
  ],
  "selectedFields": ["Company", "Price", "Website", "Category"],
  "source_override": "web_search",
  "fieldSources": {
    "Company": "web_search",
    "Price": "web_search",
    "Website": "web_search",
    "Category": "web_search"
  },
  "existingUniqueValues": {
    "Company": ["Acme Corp", "Globex"]
  }
}
```
| Field | Type | Description |
| --- | --- | --- |
| prompt | string | User prompt describing what records to generate |
| count | integer | Total number of records to generate |
| model | string | Model ID to use for generation (determines provider) |
| temperature | number? | Sampling temperature (0-2) |
| web_search | object? | Web search config. enabled is auto-set to true when any field uses web_search. Other fields: sources, include_urls, exclude_urls, x_search, citations |
| chunk_size | integer | Records per generation chunk (default 10) |
| datasetId | string | Target dataset ID to insert records into |
| fields | FieldDefinition[] | Complete field schema for the dataset, including all fields (even non-selected). Includes properties like autoGenerate, enum, formula, exampleValue. |
| selectedFields | string[] | Field names to generate via AI. Always present. Formula fields are always excluded. |
| fieldSources | object | Pre-resolved map of field name → source ("synthetic" or "web_search"). Authoritative source for each field. |
| source_override | string? | When present, all fields use this source. Already resolved into fieldSources by the API server. |
| existingUniqueValues | object | Map of unique field names to arrays of existing values, for deduplication |

Non-Selected Field Handling

Fields in fields but NOT in selectedFields are not sent to the AI. Instead, the worker assigns values using this priority:

  1. If field.type === "formula" → skip entirely (do not include in data object)
  2. If field.autoGenerate === "uuid" → generate a UUID v4 string
  3. If field.autoGenerate === "createdAt" or "updatedAt" → current ISO 8601 datetime
  4. If field.default is defined → use that value
  5. Otherwise → set to null

Output Schema

Written to task.output when the task completes successfully:

```json
{
  "inserted": 47,
  "skipped": 3,
  "skippedReasons": [
    "Duplicate value 'Acme Corp' for unique field 'Company'",
    "Duplicate value 'Globex' for unique field 'Company'",
    "Duplicate within batch: 'NewCo' for field 'Company'"
  ],
  "totalChunks": 5,
  "tokenUsage": {
    "input_tokens": 12500,
    "output_tokens": 8400,
    "total_tokens": 20900
  }
}
```

Chunked Generation

Large generation requests are split into chunks to stay within LLM token limits and provide incremental progress:

Chunk 1/5 → Generate 10 records → Validate → Dedup → Insert → progress=20%
Chunk 2/5 → Generate 10 records → Validate → Dedup → Insert → progress=40%
Chunk 3/5 → Generate 10 records → Validate → Dedup → Insert → progress=60%
Chunk 4/5 → Generate 10 records → Validate → Dedup → Insert → progress=80%
Chunk 5/5 → Generate 10 records → Validate → Dedup → Insert → progress=100%
→ Write output { inserted, skipped, skippedReasons, totalChunks, tokenUsage }
→ Set status = 'completed'
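The chunking loop might look like the sketch below. The generateChunk, insertRecords, and reportProgress callbacks are hypothetical stand-ins for the worker's actual generation, insertion, and progress-reporting steps; only the chunk-sizing and percentage math follow the description above.

```typescript
// Split a generation request of `count` records into chunks of `chunkSize`,
// reporting progress after each chunk. Returns the number of chunks processed.
async function generateInChunks(
  count: number,
  chunkSize: number,
  generateChunk: (n: number) => Promise<object[]>, // generate → validate → dedup
  insertRecords: (records: object[]) => Promise<void>,
  reportProgress: (pct: number) => void,
): Promise<number> {
  const totalChunks = Math.ceil(count / chunkSize);
  let generated = 0;
  for (let i = 0; i < totalChunks; i++) {
    const n = Math.min(chunkSize, count - generated); // last chunk may be smaller
    const records = await generateChunk(n);
    await insertRecords(records);
    generated += n;
    reportProgress(Math.round((100 * (i + 1)) / totalChunks));
  }
  return totalChunks;
}
```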

Deduplication

For fields marked as unique in the field definitions, the worker applies a three-layer deduplication strategy:

  1. Pre-prompt: Pass existingUniqueValues to the AI prompt so it avoids generating known duplicates. Also include values generated in previous chunks.
  2. Post-validate within batch: After generation, check for duplicates within the current chunk.
  3. Post-validate against DB: Check generated unique values against existing database records.

Skipped records and reasons are tracked in output.skipped and output.skippedReasons.
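Layers 2 and 3 can be sketched as one pass that seeds a seen-set with the database's known values (layer 3) and then extends it as the batch is scanned (layer 2). The function and variable names are illustrative; existingUniqueValues has the same shape as in the task input.

```typescript
// Drop records whose unique-field values collide with existing DB values or
// with earlier records in the same batch; collect a reason for each skip.
function dedupeBatch(
  records: Record<string, unknown>[],
  uniqueFields: string[],
  existingUniqueValues: Record<string, string[]>,
): { kept: Record<string, unknown>[]; skippedReasons: string[] } {
  // Seed each field's seen-set with values already in the database.
  const seen: Record<string, Set<string>> = {};
  for (const f of uniqueFields) seen[f] = new Set(existingUniqueValues[f] ?? []);

  const kept: Record<string, unknown>[] = [];
  const skippedReasons: string[] = [];
  for (const rec of records) {
    let reason: string | null = null;
    for (const f of uniqueFields) {
      const v = String(rec[f]);
      if (seen[f].has(v)) {
        reason = `Duplicate value '${v}' for unique field '${f}'`;
        break;
      }
    }
    if (reason) {
      skippedReasons.push(reason);
    } else {
      for (const f of uniqueFields) seen[f].add(String(rec[f])); // within-batch dedup
      kept.push(rec);
    }
  }
  return { kept, skippedReasons };
}
```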

Anti-Hallucination

When web search is enabled, the worker follows a search-then-extract pipeline to prevent hallucinated data: it retrieves real sources first, then extracts field values only from what those sources contain, attaching the sources as per-field citations in _meta.

X/Twitter search: When x_search is true, the task must use a Grok model (grok-*) since xAI has native access to X.com content. Other providers cannot access X/Twitter data.

Shared Requirements

Provider Routing

The model string in the task input determines which AI provider handles the request:

| Model Prefix | Provider | Notes |
| --- | --- | --- |
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |

All providers are accessed via the OpenAI-compatible SDK with different base URLs and API keys. The worker resolves the correct client at runtime based on the model prefix.
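A minimal sketch of the prefix-based routing, plus the x_search constraint described earlier. The prefix → provider mapping follows the table above; the Provider names and the validation helper are illustrative, not the worker's actual API.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "xai";

// Resolve the provider from the model string's prefix. Anything not matching
// a known non-OpenAI prefix falls through to OpenAI (gpt-*, o3-*, o4-*).
function resolveProvider(model: string): Provider {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai";
}

// x_search requires xAI, since only Grok has native access to X.com content.
function validateXSearch(model: string, xSearch: boolean): void {
  if (xSearch && resolveProvider(model) !== "xai") {
    throw new Error("x_search requires a Grok model (grok-*)");
  }
}
```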