Records
How record operations work at the worker level — generating, validating, deduplicating, and inserting records with citation metadata.
Overview
The ai_generate_records task type generates new records using AI based on a prompt and the dataset's field definitions, then inserts them directly into dataset_records. It supports chunked generation, deduplication, web search grounding, and per-field citation metadata.
This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.
Multi-provider support: The worker routes each AI call to the correct provider, using task.metadata.resolvedProvider when present and otherwise falling back to the model string prefix in the task input. Supported providers: OpenAI (gpt-*), Anthropic (claude-*), Google Gemini (gemini-*), and xAI/Grok (grok-*). X/Twitter search (x_search) specifically requires a Grok model.
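For orientation, here is a minimal TypeScript sketch of that lifecycle loop. The helper names (pollForTask, claimTask, and so on) are hypothetical stand-ins for the worker's task-queue layer, not real API names:

```typescript
// Hypothetical task-queue helpers backed by the tasks table.
interface Task {
  id: string;
  type: string;
  input: unknown;
  metadata?: { resolvedModel?: string; resolvedProvider?: string };
  created_by: string;
}
declare function pollForTask(types: string[]): Promise<Task | null>;
declare function claimTask(taskId: string): Promise<boolean>;
declare function processRecordTask(task: Task): Promise<unknown>;
declare function completeTask(taskId: string, output: unknown): Promise<void>;
declare function failTask(taskId: string, error: string): Promise<void>;
declare function sleep(ms: number): Promise<void>;

async function workerLoop(): Promise<void> {
  while (true) {
    const task = await pollForTask(["ai_generate_records"]);
    if (!task) { await sleep(1000); continue; }      // queue empty: back off briefly
    if (!(await claimTask(task.id))) continue;       // another worker claimed it first
    try {
      const output = await processRecordTask(task); // chunked generation + inserts
      await completeTask(task.id, output);          // writes task.output, marks completed
    } catch (err) {
      await failTask(task.id, err instanceof Error ? err.message : String(err));
    }
  }
}
```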
Database Tables
Workers handling record tasks need access to these tables in addition to tasks and task_events:
dataset_records Table
Each row is a single record in a dataset. Workers insert new rows here during ai_generate_records and update existing rows during ai_generate_fields backfill.
| Column | Type | Description |
|---|---|---|
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values as a JSON object |
| _meta | jsonb (nullable) | Citation and generation metadata (see below) |
| deleted_at | timestamp (nullable) | Soft delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the task (use task.created_by) |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated the record |
_meta Column
Every AI-generated record must carry a _meta value containing per-field citations and generation provenance. _meta is a separate JSONB column on dataset_records rather than part of the data column, which keeps user-facing field values clean.
```json
{
"citations": {
"Company": [{ "url": "https://example.com/page", "title": "Source Page", "snippet": "Acme Corp is..." }],
"Price": [{ "url": "https://pricing.example.com", "title": "Pricing", "snippet": "$29/mo" }]
},
"generatedBy": "ai",
"taskId": "task_xyz789",
"model": "gpt-5-nano",
"generatedAt": "2026-02-22T10:30:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| citations | Record<string, {url, title?, snippet?}[]> | Per-field citation sources from web search. The key is the field name; the value is an array of sources. |
| generatedBy | string | Always "ai" for AI-generated records |
| taskId | string | ID of the task that generated this record |
| model | string | AI model used for generation |
| generatedAt | string | ISO 8601 timestamp when the record was generated |
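For reference, the _meta shape can also be written as a TypeScript type. This is a sketch derived from the table above, not a published type from the codebase:

```typescript
// One citation source found via web search.
interface CitationSource {
  url: string;
  title?: string;
  snippet?: string;
}

// Shape of the _meta JSONB column on AI-generated dataset_records rows.
interface RecordMeta {
  citations: Record<string, CitationSource[]>; // field name → sources
  generatedBy: "ai";                           // always "ai" for AI-generated records
  taskId: string;                              // task that generated this record
  model: string;                               // resolved model ID (never "auto")
  generatedAt: string;                         // ISO 8601 generation timestamp
}
```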
ai_generate_records
This task type generates dataset records using AI and inserts them directly into dataset_records. The worker reads the dataset's field definitions, generates records in chunks, deduplicates against existing data, and writes each record with citation metadata in the _meta column.
| Property | Value |
|---|---|
| Task Type | ai_generate_records |
| Created By | POST /api/datasets/:id/records with ai body |
| Tables Used | tasks, task_events, dataset_records (insert) |
Input Schema
The task's input column contains the validated generation request. The API server pre-populates fields, selectedFields, fieldSources, and existingUniqueValues from the dataset before creating the task.
```json
{
"prompt": "Generate 50 SaaS companies with pricing data",
"count": 50,
"model": "gpt-5-nano",
"temperature": 0.7,
"streaming": true,
"max_output_tokens": 10000,
"web_search": {
"enabled": true,
"sources": 5,
"include_urls": ["https://example.com"],
"exclude_urls": [],
"x_search": false,
"citations": true
},
"chunk_size": 50,
"datasetId": "ds_abc123",
"fields": [
{ "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text", "required": true, "unique": true, "ai": { "search": { "sources": 10, "citations": true } } },
{ "id": "fld_b2c3d4e5f6g7", "name": "Price", "type": "number", "ai": { "search": {} } },
{ "id": "fld_c3d4e5f6g7h8", "name": "Website", "type": "url", "ai": { "search": {} } },
{ "id": "fld_d4e5f6g7h8i9", "name": "ID", "type": "uuid", "autoGenerate": "uuid" },
{ "id": "fld_e5f6g7h8i9j0", "name": "Category", "type": "text", "enum": ["B2B", "B2C", "Enterprise"], "ai": { "model": "claude-sonnet-4-20250514" } },
{ "id": "fld_f6g7h8i9j0k1", "name": "Notes", "type": "text", "ai": false },
{ "id": "fld_g7h8i9j0k1l2", "name": "Revenue Score", "type": "formula", "formula": "Price * 12" }
],
"selectedFields": ["Company", "Price", "Website", "Category"],
"fieldSources": {
"Company": "web_search",
"Price": "web_search",
"Website": "web_search",
"Category": "web_search"
},
"fieldWebSearchConfigs": {
"Company": { "sources": 10, "citations": true }
},
"fieldModels": {
"Category": "claude-sonnet-4-20250514"
},
"fieldPrompts": {
"Company": "Research the company and provide their official description"
},
"fieldTemperatures": {
"Company": 0.5
},
"fieldStreaming": {
"Company": true
},
"fieldMaxOutputTokens": {
"Company": 2000
},
"existingUniqueValues": {
"Company": ["Acme Corp", "Globex"]
}
}
```
Model resolution: The API server writes task.metadata.resolvedModel (concrete model ID, never "auto") and task.metadata.resolvedProvider (e.g., "openai", "anthropic", "google", "openrouter"). The worker reads these first, falling back to input.model and prefix-based detection for backward compatibility. The resolved model is used in _meta.model on each inserted record.
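A small sketch of that resolution order; detectProviderFromModel is sketched in the Provider Routing section at the end of this page:

```typescript
declare function detectProviderFromModel(model: string): string; // prefix-based fallback

// Resolution order: task metadata first, then input.model with prefix detection.
function resolveModelAndProvider(
  metadata: { resolvedModel?: string; resolvedProvider?: string } | undefined,
  input: { model: string },
): { model: string; provider: string } {
  const model = metadata?.resolvedModel ?? input.model;           // resolved is never "auto"
  const provider = metadata?.resolvedProvider ?? detectProviderFromModel(model);
  return { model, provider };                                     // model lands in _meta.model
}
```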
| Field | Type | Description |
|---|---|---|
| prompt | string? | User prompt describing what records to generate |
| count | integer | Total number of records to generate |
| model | string | Requested model ID (may be "auto"). The authoritative model/provider come from task.metadata.resolvedModel / resolvedProvider. |
| temperature | number? | Request-level sampling temperature (0-2) |
| streaming | boolean? | Request-level streaming preference |
| max_output_tokens | number? | Request-level maximum output tokens. When provided, used as the token limit for AI generation calls instead of the estimated default. |
| web_search | object? | Web search config. enabled is auto-set to true when any field uses web_search. Sub-fields: sources, include_urls, exclude_urls, x_search, citations |
| chunk_size | integer | Records per generation chunk (default 50) |
| datasetId | string | Target dataset ID to insert records into |
| fields | FieldDefinition[] | Complete field schema. Each field has id (a stable ID like "fld_...") and optionally ai (false or {model?, prompt?, search?}). Also includes autoGenerate, enum, formula, exampleValue, description. |
| selectedFields | string[] | Field names to generate via AI. Pre-resolved by the API server (always present). Defaults to all non-formula fields where ai is not false. |
| fieldSources | object | Pre-resolved map of field name → source ("synthetic" or "web_search"). The authoritative source for each field. |
| fieldWebSearchConfigs | object? | Per-field web search preferences (from each field's ai.search). Merged with the request-level web_search, with request-level taking priority. |
| fieldModels | object? | Pre-resolved map of field name → preferred model ID (from the field's ai.model config). |
| fieldPrompts | object? | Pre-resolved map of field name → custom prompt guidance (from the field's ai.prompt config). Appended to the AI generation instructions for that field. |
| fieldTemperatures | object? | Pre-resolved map of field name → temperature override (from the field's ai.temperature config, 0-2). |
| fieldStreaming | object? | Pre-resolved map of field name → streaming preference (from the field's ai.streaming config). |
| fieldMaxOutputTokens | object? | Pre-resolved map of field name → max output token limit (from the field's ai.max_output_tokens config). |
| existingUniqueValues | object | Map of unique field names to arrays of existing values, used for deduplication |
Non-Selected Field Handling
Fields present in fields but NOT in selectedFields are not sent to the AI. Instead, the worker assigns values using this priority (sketched in code after the list):
- If field.type === "formula" → skip entirely (do not include in the data object)
- If field.autoGenerate === "uuid" → generate a UUID v4 string
- If field.autoGenerate === "createdAt" or "updatedAt" → current ISO 8601 datetime
- If field.default is defined → use that value
- Otherwise → set to null
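A sketch of that priority order, assuming a FieldDefinition shape with autoGenerate and default properties; returning undefined here means "omit the field from the data object entirely":

```typescript
import { randomUUID } from "node:crypto";

interface FieldDefinition {
  name: string;
  type: string;
  autoGenerate?: "uuid" | "createdAt" | "updatedAt";
  default?: unknown;
}

function valueForNonSelectedField(field: FieldDefinition): unknown {
  if (field.type === "formula") return undefined;       // formulas are never materialized
  if (field.autoGenerate === "uuid") return randomUUID();
  if (field.autoGenerate === "createdAt" || field.autoGenerate === "updatedAt") {
    return new Date().toISOString();                    // current ISO 8601 datetime
  }
  if (field.default !== undefined) return field.default;
  return null;                                          // no source of truth: explicit null
}
```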
Output Schema
Written to task.output when the task completes successfully:
```json
{
"inserted": 47,
"skipped": 3,
"skippedReasons": [
{ "data": { "Company": "Acme Corp", "Price": 29, "Category": "B2B" }, "reason": "Duplicate value 'Acme Corp' for unique field 'Company'" },
{ "data": { "Company": "Globex", "Price": 49, "Category": "Enterprise" }, "reason": "Duplicate value 'Globex' for unique field 'Company'" }
],
"totalChunks": 5,
"tokenUsage": {
"input_tokens": 12500,
"output_tokens": 8400,
"total_tokens": 20900
}
}
```
Chunked Generation
Large generation requests are split into chunks to stay within LLM token limits and provide incremental progress:
- Break count into chunks of chunk_size (e.g., 100 records with a chunk_size of 50 = 2 chunks)
- For each chunk: generate records via AI, validate against the field schema, insert into dataset_records
- Update task progress after each chunk: Math.round((chunkIndex + 1) / totalChunks * 100)
- Accumulate token usage across all chunks
- If a chunk fails, previously inserted records are kept and the error is reported (see the loop sketch after this list)
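A loop sketch under those rules; generateChunk, insertRecords, and updateTaskProgress are hypothetical helpers, while the chunk math and progress formula follow the list above:

```typescript
interface TokenUsage { input_tokens: number; output_tokens: number; total_tokens: number; }

declare function generateChunk(size: number, avoid: string[]): Promise<{ records: object[]; usage: TokenUsage }>;
declare function insertRecords(datasetId: string, records: object[]): Promise<void>;
declare function updateTaskProgress(taskId: string, progress: number): Promise<void>;

async function generateInChunks(taskId: string, datasetId: string, count: number, chunkSize: number): Promise<TokenUsage> {
  const totalChunks = Math.ceil(count / chunkSize);      // e.g. 100 records / 50 = 2 chunks
  const usage: TokenUsage = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  const seen: string[] = [];                             // unique values from earlier chunks

  for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
    const size = Math.min(chunkSize, count - chunkIndex * chunkSize); // last chunk may be short
    const { records, usage: u } = await generateChunk(size, seen);
    await insertRecords(datasetId, records);             // earlier inserts survive later failures
    // (this chunk's unique-field values would be appended to `seen` here)
    usage.input_tokens += u.input_tokens;
    usage.output_tokens += u.output_tokens;
    usage.total_tokens += u.total_tokens;
    await updateTaskProgress(taskId, Math.round(((chunkIndex + 1) / totalChunks) * 100));
  }
  return usage;
}
```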
Deduplication
For fields marked as unique in the field definitions, the worker applies a three-layer deduplication strategy:
- Pre-prompt: Pass existingUniqueValues to the AI prompt so it avoids generating known duplicates; also include values generated in previous chunks.
- Post-validate within batch: After generation, check for duplicates within the current chunk.
- Post-validate against DB: Check generated unique values against existing database records.
Skipped records and reasons are tracked in output.skipped and output.skippedReasons.
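The two post-generation layers can be sketched as a single pass that seeds a per-field set with known values (existingUniqueValues plus prior chunks) and then grows it within the batch. This is an illustrative implementation, not the worker's exact code:

```typescript
interface SkippedRecord { data: Record<string, unknown>; reason: string; }

function dedupeChunk(
  records: Record<string, unknown>[],
  uniqueFields: string[],
  existing: Record<string, string[]>, // field name → values already taken (DB + prior chunks)
): { kept: Record<string, unknown>[]; skipped: SkippedRecord[] } {
  const kept: Record<string, unknown>[] = [];
  const skipped: SkippedRecord[] = [];
  const seen = new Map<string, Set<string>>(
    uniqueFields.map((f) => [f, new Set(existing[f] ?? [])]),
  );

  for (const record of records) {
    const clash = uniqueFields.find((f) => seen.get(f)!.has(String(record[f])));
    if (clash) {
      skipped.push({
        data: record,
        reason: `Duplicate value '${String(record[clash])}' for unique field '${clash}'`,
      });
      continue;
    }
    for (const f of uniqueFields) seen.get(f)!.add(String(record[f])); // in-batch dedup
    kept.push(record);
  }
  return { kept, skipped };
}
```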
Anti-Hallucination
When web search is enabled, the worker follows a search-then-extract pipeline to prevent hallucinated data:
- Search-then-extract: When web_search.enabled is true, search for relevant sources first, then pass the retrieved content as grounding context for the AI model.
- Citation-required mode: When web_search.citations is true, instruct the model to cite sources for each field value. Citations are stored in _meta.citations.
- Null over guessing: The model should output null for fields where no reliable source is found, rather than hallucinating values.
- Never fabricate: URLs, statistics, and factual claims must come from actual sources.
X/Twitter search: When x_search is true, the task must use a Grok model (grok-*) since xAI has native access to X.com content. Other providers cannot access X/Twitter data.
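One way to express these rules is as instructions appended to the generation prompt; the wording below is illustrative, not the worker's actual prompt text:

```typescript
// Build anti-hallucination instructions for a grounded generation call.
function groundingInstructions(citationsRequired: boolean): string {
  const lines = [
    "Use ONLY the provided search results as your source of facts.",
    "If no reliable source covers a field, output null for that field.",
    "Never invent URLs, statistics, or other factual claims.",
  ];
  if (citationsRequired) {
    lines.push("For every field value, cite the source URL(s) it came from.");
  }
  return lines.join("\n");
}
```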
Shared Requirements
- Token usage: Accumulated across all AI calls and written to task.output.tokenUsage
- Progress: Updated via task.progress (0-100) after each chunk/batch
- Cancellation: Task status is checked every few chunks; processing stops if the task has been cancelled
- Error handling: On failure, status is set to failed, the error message is recorded, and completed_at is set (see the sketch after this list)
- Structured output: All AI calls use JSON schema for parseable responses
Provider Routing
The API server resolves the provider via task.metadata.resolvedProvider. For backward compatibility, the worker also detects the provider from the model prefix:
| Model Prefix | Provider | Notes |
|---|---|---|
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |
All providers are accessed via the OpenAI-compatible SDK with different base URLs and API keys. Provider mapping from API server names: openai→openai, anthropic→anthropic, google→gemini, openrouter→xai.
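A sketch of both routing steps, prefix detection and the API-server name mapping, derived from the table and paragraph above:

```typescript
// Prefix-based provider detection (backward-compatibility fallback).
function detectProviderFromModel(model: string): "openai" | "anthropic" | "gemini" | "xai" {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // gpt-*, o3-*, o4-*, and the default for anything else
}

// Map API-server provider names to the worker's internal provider keys.
const PROVIDER_MAP: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};
```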