Records
How record operations work at the worker level — generating, validating, deduplicating, and inserting records with citation metadata.
Overview
The ai_generate_records task type generates new records using AI based on a prompt and the dataset's field definitions, then inserts them directly into dataset_records. It supports chunked generation, deduplication, web search grounding, and per-field citation metadata.
This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.
Multi-provider support: The worker routes AI calls to the correct provider based on the model string in the task input. Supported providers: OpenAI (gpt-*), Anthropic (claude-*), Google Gemini (gemini-*), and xAI/Grok (grok-*). X/Twitter search (x_search) specifically requires a Grok model.
Database Tables
Workers handling record tasks need access to these tables in addition to tasks and task_events:
dataset_records Table
Each row is a single record in a dataset. Workers insert new rows here during ai_generate_records and update existing rows during ai_generate_fields backfill.
| Column | Type | Description |
|---|---|---|
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values as a JSON object |
| _meta | jsonb (nullable) | Citation and generation metadata (see below) |
| deleted_at | timestamp (nullable) | Soft delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the task (use task.created_by) |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated |
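For illustration, here is a minimal sketch of the insert a worker might issue per generated record. It assumes a node-postgres Pool and that id, created_at, and updated_at are filled by column defaults; the client choice and helper name are assumptions, only the column names come from the table above.

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings from PG* env vars (assumption)

// Insert one generated record; `data` holds the field values, `meta` the _meta object.
// Assumes id/created_at/updated_at are populated by column defaults, per the table above.
async function insertRecord(
  datasetId: string,
  createdBy: string,
  data: Record<string, unknown>,
  meta: Record<string, unknown>,
): Promise<string> {
  const { rows } = await pool.query(
    `INSERT INTO dataset_records (dataset_id, data, _meta, created_by, updated_by)
     VALUES ($1, $2, $3, $4, $4)
     RETURNING id`,
    [datasetId, JSON.stringify(data), JSON.stringify(meta), createdBy],
  );
  return rows[0].id as string;
}
```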
_meta Column
Every AI-generated record must carry a _meta value containing per-field citations and generation provenance. _meta is a separate JSONB column on dataset_records, not part of the data column, which keeps user data clean.
{
"citations": {
"Company": [{ "url": "https://example.com/page", "title": "Source Page", "snippet": "Acme Corp is..." }],
"Price": [{ "url": "https://pricing.example.com", "title": "Pricing", "snippet": "$29/mo" }]
},
"generatedBy": "ai",
"taskId": "task_xyz789",
"model": "gpt-5-nano",
"generatedAt": "2026-02-22T10:30:00.000Z"
}
| Field | Type | Description |
|---|---|---|
| citations | Record<string, {url, title?, snippet?}[]> | Per-field citation sources from web search. Key is field name, value is array of sources. |
| generatedBy | string | Always "ai" for AI-generated records |
| taskId | string | ID of the task that generated this record |
| model | string | AI model used for generation |
| generatedAt | string | ISO 8601 timestamp when the record was generated |
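The same structure can be expressed as TypeScript types for validation in worker code. This is a sketch inferred from the table above, not a published type from the codebase:

```typescript
// One citation source attached to a single field value.
interface CitationSource {
  url: string;
  title?: string;
  snippet?: string;
}

// Shape of the _meta JSONB value for AI-generated records.
interface RecordMeta {
  citations: Record<string, CitationSource[]>; // field name -> sources
  generatedBy: "ai";
  taskId: string;
  model: string;
  generatedAt: string; // ISO 8601
}
```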
ai_generate_records
This task type generates dataset records using AI and inserts them directly into dataset_records. The worker reads the dataset's field definitions, generates records in chunks, deduplicates against existing data, and writes each record with citation metadata in the _meta column.
| Property | Value |
|---|---|
| Task Type | ai_generate_records |
| Created By | POST /api/datasets/:id/records with ai body |
| Tables Used | tasks, task_events, dataset_records (insert) |
Input Schema
The task's input column contains the validated generation request. The API server pre-populates fields, selectedFields, fieldSources, and existingUniqueValues from the dataset before creating the task.
{
"prompt": "Generate 50 SaaS companies with pricing data",
"count": 50,
"model": "gpt-5-nano",
"temperature": 0.7,
"web_search": {
"enabled": true,
"sources": 5,
"include_urls": ["https://example.com"],
"exclude_urls": [],
"x_search": false,
"citations": true
},
"chunk_size": 10,
"datasetId": "ds_abc123",
"fields": [
{ "name": "Company", "type": "text", "required": true, "unique": true, "aiGenerate": true, "aiSource": "web_search" },
{ "name": "Price", "type": "number", "aiGenerate": true, "aiSource": "web_search" },
{ "name": "Website", "type": "url", "aiGenerate": true, "aiSource": "web_search" },
{ "name": "ID", "type": "uuid", "autoGenerate": "uuid" },
{ "name": "Category", "type": "text", "enum": ["B2B", "B2C", "Enterprise"], "aiGenerate": true },
{ "name": "Notes", "type": "text", "aiGenerate": false },
{ "name": "Revenue Score", "type": "formula", "formula": "Price * 12" }
],
"selectedFields": ["Company", "Price", "Website", "Category"],
"source_override": "web_search",
"fieldSources": {
"Company": "web_search",
"Price": "web_search",
"Website": "web_search",
"Category": "web_search"
},
"existingUniqueValues": {
"Company": ["Acme Corp", "Globex"]
}
}
| Field | Type | Description |
|---|---|---|
| prompt | string | User prompt describing what records to generate |
| count | integer | Total number of records to generate |
| model | string | Model ID to use for generation (determines provider) |
| temperature | number? | Sampling temperature (0-2) |
| web_search | object? | Web search config. enabled is auto-set to true when any field uses web_search. Other fields: sources, include_urls, exclude_urls, x_search, citations |
| chunk_size | integer | Records per generation chunk (default 10) |
| datasetId | string | Target dataset ID to insert records into |
| fields | FieldDefinition[] | Complete field schema for the dataset, including all fields (even non-selected). Includes properties like autoGenerate, enum, formula, exampleValue. |
| selectedFields | string[] | Field names to generate via AI. Always present. Formula fields are always excluded. |
| fieldSources | object | Pre-resolved map of field name → source ("synthetic" or "web_search"). Authoritative source for each field. |
| source_override | string? | When present, all fields use this source. Already resolved into fieldSources by the API server. |
| existingUniqueValues | object | Map of unique field names to arrays of existing values, for deduplication |
Non-Selected Field Handling
Fields present in fields but NOT in selectedFields are not sent to the AI. Instead, the worker assigns values using this priority (a code sketch follows the list):
- If field.type === "formula" → skip entirely (do not include in data object)
- If field.autoGenerate === "uuid" → generate a UUID v4 string
- If field.autoGenerate === "createdAt" or "updatedAt" → use the current ISO 8601 datetime
- If field.default is defined → use that value
- Otherwise → set to null
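A minimal sketch of that priority order, assuming a FieldDefinition shape with the type, autoGenerate, and default properties shown in the input schema; the helper name is illustrative:

```typescript
import { randomUUID } from "crypto";

interface FieldDefinition {
  name: string;
  type: string;
  autoGenerate?: "uuid" | "createdAt" | "updatedAt";
  default?: unknown;
}

// Resolve the value for a field that was not sent to the AI.
// Returns undefined for formula fields so they are omitted from `data`.
function resolveNonSelectedField(field: FieldDefinition): unknown {
  if (field.type === "formula") return undefined;          // skip entirely
  if (field.autoGenerate === "uuid") return randomUUID();  // UUID v4
  if (field.autoGenerate === "createdAt" || field.autoGenerate === "updatedAt") {
    return new Date().toISOString();                       // current ISO 8601 datetime
  }
  if (field.default !== undefined) return field.default;   // declared default value
  return null;                                             // fall back to null
}
```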
Output Schema
Written to task.output when the task completes successfully:
{
"inserted": 47,
"skipped": 3,
"skippedReasons": [
"Duplicate value 'Acme Corp' for unique field 'Company'",
"Duplicate value 'Globex' for unique field 'Company'",
"Duplicate within batch: 'NewCo' for field 'Company'"
],
"totalChunks": 5,
"tokenUsage": {
"input_tokens": 12500,
"output_tokens": 8400,
"total_tokens": 20900
}
}
Chunked Generation
Large generation requests are split into chunks to stay within LLM token limits and to report incremental progress (a loop sketch follows the list):
- Break count into chunks of chunk_size (e.g., 50 records with chunk_size 10 = 5 chunks)
- For each chunk: generate records via AI, validate against the field schema, insert into dataset_records
- Update task progress after each chunk: Math.round((chunkIndex + 1) / totalChunks * 100)
- Accumulate token usage across all chunks
- If a chunk fails, previously inserted records are kept and the error is reported
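A sketch of that loop under stated assumptions: generation, insertion, and progress updates are passed in as hooks because their real implementations live elsewhere in the worker; only the chunk math and the progress formula come from the list above.

```typescript
interface TokenUsage { input_tokens: number; output_tokens: number; total_tokens: number }

interface ChunkDeps {
  // Hypothetical hooks standing in for the worker's AI call, DB insert, and progress update.
  generateChunk: (size: number, chunkIndex: number) => Promise<{ records: Record<string, unknown>[]; usage: TokenUsage }>;
  insertRecords: (records: Record<string, unknown>[]) => Promise<number>;
  updateProgress: (pct: number) => Promise<void>;
}

// Split `count` records into chunks of `chunkSize`, generating and inserting each chunk
// and reporting progress with the formula shown in the list above.
async function generateInChunks(count: number, chunkSize: number, deps: ChunkDeps) {
  const totalChunks = Math.ceil(count / chunkSize);
  const usage: TokenUsage = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  let inserted = 0;

  for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
    const size = Math.min(chunkSize, count - chunkIndex * chunkSize);
    const chunk = await deps.generateChunk(size, chunkIndex);
    inserted += await deps.insertRecords(chunk.records);

    // Accumulate token usage across chunks.
    usage.input_tokens += chunk.usage.input_tokens;
    usage.output_tokens += chunk.usage.output_tokens;
    usage.total_tokens += chunk.usage.total_tokens;

    await deps.updateProgress(Math.round(((chunkIndex + 1) / totalChunks) * 100));
  }

  return { inserted, totalChunks, tokenUsage: usage };
}
```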
Deduplication
For fields marked as unique in the field definitions, the worker applies a three-layer deduplication strategy:
- Pre-prompt: Pass existingUniqueValues to the AI prompt so it avoids generating known duplicates. Also include values generated in previous chunks.
- Post-validate within batch: After generation, check for duplicates within the current chunk.
- Post-validate against DB: Check generated unique values against existing database records.
Skipped records and reasons are tracked in output.skipped and output.skippedReasons.
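A sketch of the second and third layers (the pre-prompt layer is just prompt text), assuming the known values from existingUniqueValues plus earlier chunks are kept in a per-field Set; all names here are illustrative:

```typescript
// Filter a chunk of generated records, dropping duplicates on unique fields.
// `known` holds existing DB values plus values accepted in earlier chunks.
function dedupeChunk(
  records: Record<string, unknown>[],
  uniqueFields: string[],
  known: Map<string, Set<string>>,
  skippedReasons: string[],
): Record<string, unknown>[] {
  const accepted: Record<string, unknown>[] = [];
  const seenInBatch = new Map(uniqueFields.map((f) => [f, new Set<string>()]));

  for (const record of records) {
    let skip = false;
    for (const field of uniqueFields) {
      const value = String(record[field] ?? "");
      if (known.get(field)?.has(value)) {
        skippedReasons.push(`Duplicate value '${value}' for unique field '${field}'`);
        skip = true;
      } else if (seenInBatch.get(field)!.has(value)) {
        skippedReasons.push(`Duplicate within batch: '${value}' for field '${field}'`);
        skip = true;
      }
      if (skip) break;
    }
    if (skip) continue;

    // Remember accepted values so later records and chunks treat them as known.
    for (const field of uniqueFields) {
      const value = String(record[field] ?? "");
      seenInBatch.get(field)!.add(value);
      known.get(field)?.add(value);
    }
    accepted.push(record);
  }
  return accepted;
}
```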
Anti-Hallucination
When web search is enabled, the worker follows a search-then-extract pipeline to prevent hallucinated data:
- Search-then-extract: When web_search.enabled is true, search for relevant sources first, then pass retrieved content as grounding context for the AI model.
- Citation-required mode: When web_search.citations is true, instruct the model to cite sources for each field value. Citations are stored in _meta.citations.
- Null over guessing: The model should output null for fields where no reliable source is found, rather than hallucinating values.
- Never fabricate: URLs, statistics, and factual claims must come from actual sources.
X/Twitter search: When x_search is true, the task must use a Grok model (grok-*) since xAI has native access to X.com content. Other providers cannot access X/Twitter data.
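One way these rules can be expressed as prompt instructions appended to the generation request; the wording is illustrative, not the worker's actual prompt:

```typescript
// Build grounding/citation instructions appended to the generation prompt.
function buildGroundingInstructions(citationsRequired: boolean): string {
  const lines = [
    "Use ONLY the provided source material for factual field values.",
    "If no reliable source covers a field, output null for that field instead of guessing.",
    "Never fabricate URLs, statistics, or other factual claims.",
  ];
  if (citationsRequired) {
    lines.push("For every non-null field value, list the source URL(s) it was taken from.");
  }
  return lines.join("\n");
}
```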
Shared Requirements
- Token usage: Accumulated across all AI calls, written to task.output.tokenUsage
- Progress: Updated via task.progress (0-100) after each chunk/batch
- Cancellation: Task status is checked every few chunks; processing stops if the task is cancelled
- Error handling: On failure, status is set to failed, the error message is recorded, and completed_at is set
- Structured output: All AI calls use JSON schema for parseable responses
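A small sketch of the cancellation check and failure path, assuming getTaskStatus and markTaskFailed helpers over the tasks table (both names are hypothetical):

```typescript
// Run a sequence of chunk callbacks, polling for cancellation between chunks
// and recording failures on the task row.
async function runWithLifecycle(
  taskId: string,
  chunks: Array<() => Promise<void>>,
  deps: {
    getTaskStatus: (id: string) => Promise<string>;
    markTaskFailed: (id: string, error: string) => Promise<void>; // sets status, error, completed_at
  },
) {
  try {
    for (let i = 0; i < chunks.length; i++) {
      // Check task status every few chunks so cancellation is picked up promptly.
      if (i % 3 === 0 && (await deps.getTaskStatus(taskId)) === "cancelled") return;
      await chunks[i]();
    }
  } catch (err) {
    await deps.markTaskFailed(taskId, err instanceof Error ? err.message : String(err));
  }
}
```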
Provider Routing
The model string in the task input determines which AI provider handles the request:
| Model Prefix | Provider | Notes |
|---|---|---|
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |
All providers are accessed via the OpenAI-compatible SDK with different base URLs and API keys. The worker resolves the correct client at runtime based on the model prefix.
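A sketch of prefix-based client resolution with the openai SDK; the base-URL and API-key environment variable names are placeholders, since the real endpoints and key names live in the worker's configuration:

```typescript
import OpenAI from "openai";

// Resolve an OpenAI-compatible client from the model prefix.
// Env var names and base URLs are placeholders, not the worker's actual config.
function clientForModel(model: string): OpenAI {
  if (model.startsWith("claude-")) {
    return new OpenAI({ apiKey: process.env.ANTHROPIC_API_KEY, baseURL: process.env.ANTHROPIC_BASE_URL });
  }
  if (model.startsWith("gemini-")) {
    return new OpenAI({ apiKey: process.env.GEMINI_API_KEY, baseURL: process.env.GEMINI_BASE_URL });
  }
  if (model.startsWith("grok-")) {
    return new OpenAI({ apiKey: process.env.XAI_API_KEY, baseURL: process.env.XAI_BASE_URL });
  }
  // gpt-*, o3-*, o4-*, and anything unrecognized fall back to OpenAI.
  return new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
}
```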