Records
How record operations work at the worker level — generating, validating, deduplicating, and inserting records with citation metadata.
Overview
The ai_generate_records task type generates new records using AI based on a prompt and the dataset's field definitions, then inserts them directly into dataset_records. It supports chunked generation, deduplication, web search grounding, and per-field citation metadata.
This task follows the same worker lifecycle as other tasks (poll → claim → process → complete), but unlike ai_response, it reads and writes the datasets and dataset_records tables directly.
Multi-provider support: The worker routes each AI call to the correct provider, using task.metadata.resolvedProvider when present and otherwise falling back to the model string prefix in the task input. Supported providers: OpenAI (gpt-*), Anthropic (claude-*), Google Gemini (gemini-*), and xAI/Grok (grok-*). X/Twitter search (x_search) specifically requires a Grok model.
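For orientation, here is a minimal TypeScript sketch of that lifecycle loop. The helper names (pollForTask, claimTask, and so on) are hypothetical stand-ins for the worker's task-queue layer, not real API names:

```typescript
// Hypothetical task-queue helpers backed by the tasks table.
interface Task {
  id: string;
  type: string;
  input: unknown;
  metadata?: { resolvedModel?: string; resolvedProvider?: string };
  created_by: string;
}
declare function pollForTask(types: string[]): Promise<Task | null>;
declare function claimTask(taskId: string): Promise<boolean>;
declare function processRecordTask(task: Task): Promise<unknown>;
declare function completeTask(taskId: string, output: unknown): Promise<void>;
declare function failTask(taskId: string, error: string): Promise<void>;
declare function sleep(ms: number): Promise<void>;

async function workerLoop(): Promise<void> {
  while (true) {
    const task = await pollForTask(["ai_generate_records"]);
    if (!task) { await sleep(1000); continue; }      // queue empty: back off briefly
    if (!(await claimTask(task.id))) continue;       // another worker claimed it first
    try {
      const output = await processRecordTask(task); // chunked generation + inserts
      await completeTask(task.id, output);          // writes task.output, marks completed
    } catch (err) {
      await failTask(task.id, err instanceof Error ? err.message : String(err));
    }
  }
}
```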
Database Tables
Workers handling record tasks need access to these tables in addition to tasks and task_events:
dataset_records Table
Each row is a single record in a dataset. Workers insert new rows here during ai_generate_records and update existing rows during ai_generate_fields backfill.
| Column | Type | Description |
|---|---|---|
| id | varchar (UUID) | Primary key, auto-generated |
| dataset_id | varchar | FK to datasets.id |
| data | jsonb | The record's field values as a JSON object |
| _meta | jsonb (nullable) | Citation and generation metadata (see below) |
| deleted_at | timestamp (nullable) | Soft delete timestamp |
| created_at | timestamp | Auto-set on insert |
| created_by | varchar | User ID who created the task (use task.created_by) |
| updated_at | timestamp | Updated on every change |
| updated_by | varchar | User ID who last updated the record |
_meta Column
Every AI-generated record must carry a _meta value containing per-field citations and generation provenance. _meta is a separate JSONB column on dataset_records rather than part of the data column, which keeps user-facing field values clean.
```json
{
"citations": {
"Company": [{ "url": "https://example.com/page", "title": "Source Page", "snippet": "Acme Corp is..." }],
"Price": [{ "url": "https://pricing.example.com", "title": "Pricing", "snippet": "$29/mo" }]
},
"generatedBy": "ai",
"taskId": "task_xyz789",
"model": "gpt-5-nano",
"generatedAt": "2026-02-22T10:30:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| citations | Record<string, {url, title?, snippet?}[]> | Per-field citation sources from web search. The key is the field name; the value is an array of sources. |
| generatedBy | string | Always "ai" for AI-generated records |
| taskId | string | ID of the task that generated this record |
| model | string | AI model used for generation |
| generatedAt | string | ISO 8601 timestamp when the record was generated |
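For reference, the _meta shape can also be written as a TypeScript type. This is a sketch derived from the table above, not a published type from the codebase:

```typescript
// One citation source found via web search.
interface CitationSource {
  url: string;
  title?: string;
  snippet?: string;
}

// Shape of the _meta JSONB column on AI-generated dataset_records rows.
interface RecordMeta {
  citations: Record<string, CitationSource[]>; // field name → sources
  generatedBy: "ai";                           // always "ai" for AI-generated records
  taskId: string;                              // task that generated this record
  model: string;                               // resolved model ID (never "auto")
  generatedAt: string;                         // ISO 8601 generation timestamp
}
```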
ai_generate_records
This task type generates dataset records using AI and inserts them directly into dataset_records. The worker reads the dataset's field definitions, generates records in chunks, deduplicates against existing data, and writes each record with citation metadata in the _meta column.
| Property | Value |
|---|---|
| Task Type | ai_generate_records |
| Created By | POST /api/datasets/:id/records with ai body |
| Tables Used | tasks, task_events, dataset_records (insert) |
Input Schema
The task's input column contains the validated generation request. The API server pre-populates fields, selectedFields, fieldSources, and existingUniqueValues from the dataset before creating the task.
```json
{
"prompt": "Generate 50 SaaS companies with pricing data",
"count": 50,
"model": "gpt-5-nano",
"temperature": 0.7,
"streaming": true,
"max_output_tokens": 10000,
"web_search": {
"enabled": true,
"sources": 5,
"include_urls": ["https://example.com"],
"exclude_urls": [],
"x_search": false,
"citations": true
},
"chunk_size": 50,
"datasetId": "ds_abc123",
"fields": [
{ "id": "fld_a1b2c3d4e5f6", "name": "Company", "type": "text", "required": true, "unique": true, "ai": { "search": { "sources": 10, "citations": true } } },
{ "id": "fld_b2c3d4e5f6g7", "name": "Price", "type": "number", "ai": { "search": {} } },
{ "id": "fld_c3d4e5f6g7h8", "name": "Website", "type": "url", "ai": { "search": {} } },
{ "id": "fld_d4e5f6g7h8i9", "name": "ID", "type": "uuid", "autoGenerate": "uuid" },
{ "id": "fld_e5f6g7h8i9j0", "name": "Category", "type": "text", "enum": ["B2B", "B2C", "Enterprise"], "ai": { "model": "claude-sonnet-4-20250514" } },
{ "id": "fld_f6g7h8i9j0k1", "name": "Notes", "type": "text", "ai": false },
{ "id": "fld_g7h8i9j0k1l2", "name": "Revenue Score", "type": "formula", "formula": "Price * 12" }
],
"selectedFields": ["Company", "Price", "Website", "Category"],
"fieldSources": {
"Company": "web_search",
"Price": "web_search",
"Website": "web_search",
"Category": "web_search"
},
"fieldWebSearchConfigs": {
"Company": { "sources": 10, "citations": true }
},
"fieldModels": {
"Category": "claude-sonnet-4-20250514"
},
"fieldPrompts": {
"Company": "Research the company and provide their official description"
},
"fieldTemperatures": {
"Company": 0.5
},
"fieldStreaming": {
"Company": true
},
"fieldMaxOutputTokens": {
"Company": 2000
},
"existingUniqueValues": {
"Company": ["Acme Corp", "Globex"]
}
}
```
Model resolution: The API server writes task.metadata.resolvedModel (concrete model ID, never "auto") and task.metadata.resolvedProvider (e.g., "openai", "anthropic", "google", "openrouter"). The worker reads these first, falling back to input.model and prefix-based detection for backward compatibility. The resolved model is used in _meta.model on each inserted record.
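A small sketch of that resolution order; detectProviderFromModel is sketched in the Provider Routing section at the end of this page:

```typescript
declare function detectProviderFromModel(model: string): string; // prefix-based fallback

// Resolution order: task metadata first, then input.model with prefix detection.
function resolveModelAndProvider(
  metadata: { resolvedModel?: string; resolvedProvider?: string } | undefined,
  input: { model: string },
): { model: string; provider: string } {
  const model = metadata?.resolvedModel ?? input.model;           // resolved is never "auto"
  const provider = metadata?.resolvedProvider ?? detectProviderFromModel(model);
  return { model, provider };                                     // model lands in _meta.model
}
```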
| Field | Type | Description |
|---|---|---|
| prompt | string? | User prompt describing what records to generate |
| count | integer | Total number of records to generate |
| model | string | Requested model ID (may be "auto"). The authoritative model/provider come from task.metadata.resolvedModel / resolvedProvider. |
| temperature | number? | Request-level sampling temperature (0-2) |
| streaming | boolean? | Request-level streaming preference |
| max_output_tokens | number? | Request-level maximum output tokens. When provided, used as the token limit for AI generation calls instead of the estimated default. |
| web_search | object? | Web search config. enabled is auto-set to true when any field uses web_search. Sub-fields: sources, include_urls, exclude_urls, x_search, citations |
| chunk_size | integer | Records per generation chunk (default 50) |
| datasetId | string | Target dataset ID to insert records into |
| fields | FieldDefinition[] | Complete field schema. Each field has id (a stable ID like "fld_...") and optionally ai (false or {model?, prompt?, search?}). Also includes autoGenerate, enum, formula, exampleValue, description. |
| selectedFields | string[] | Field names to generate via AI. Pre-resolved by the API server (always present). Defaults to all non-formula fields where ai is not false. |
| fieldSources | object | Pre-resolved map of field name → source ("synthetic" or "web_search"). The authoritative source for each field. |
| fieldWebSearchConfigs | object? | Per-field web search preferences (from each field's ai.search). Merged with the request-level web_search, with request-level taking priority. |
| fieldModels | object? | Pre-resolved map of field name → preferred model ID (from the field's ai.model config). |
| fieldPrompts | object? | Pre-resolved map of field name → custom prompt guidance (from the field's ai.prompt config). Appended to the AI generation instructions for that field. |
| fieldTemperatures | object? | Pre-resolved map of field name → temperature override (from the field's ai.temperature config, 0-2). |
| fieldStreaming | object? | Pre-resolved map of field name → streaming preference (from the field's ai.streaming config). |
| fieldMaxOutputTokens | object? | Pre-resolved map of field name → max output token limit (from the field's ai.max_output_tokens config). |
| existingUniqueValues | object | Map of unique field names to arrays of existing values, used for deduplication |
Non-Selected Field Handling
Fields present in fields but NOT in selectedFields are not sent to the AI. Instead, the worker assigns values using this priority (sketched in code after the list):
- If field.type === "formula" → skip entirely (do not include in the data object)
- If field.autoGenerate === "uuid" → generate a UUID v4 string
- If field.autoGenerate === "createdAt" or "updatedAt" → current ISO 8601 datetime
- If field.default is defined → use that value
- Otherwise → set to null
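A sketch of that priority order, assuming a FieldDefinition shape with autoGenerate and default properties; returning undefined here means "omit the field from the data object entirely":

```typescript
import { randomUUID } from "node:crypto";

interface FieldDefinition {
  name: string;
  type: string;
  autoGenerate?: "uuid" | "createdAt" | "updatedAt";
  default?: unknown;
}

function valueForNonSelectedField(field: FieldDefinition): unknown {
  if (field.type === "formula") return undefined;       // formulas are never materialized
  if (field.autoGenerate === "uuid") return randomUUID();
  if (field.autoGenerate === "createdAt" || field.autoGenerate === "updatedAt") {
    return new Date().toISOString();                    // current ISO 8601 datetime
  }
  if (field.default !== undefined) return field.default;
  return null;                                          // no source of truth: explicit null
}
```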
Output Schema
Written to task.output when the task completes successfully:
```json
{
"inserted": 47,
"skipped": 3,
"skippedReasons": [
{ "data": { "Company": "Acme Corp", "Price": 29, "Category": "B2B" }, "reason": "Duplicate value 'Acme Corp' for unique field 'Company'" },
{ "data": { "Company": "Globex", "Price": 49, "Category": "Enterprise" }, "reason": "Duplicate value 'Globex' for unique field 'Company'" }
],
"totalChunks": 5,
"tokenUsage": {
"input_tokens": 12500,
"output_tokens": 8400,
"total_tokens": 20900
}
}
```
Chunked Generation
Large generation requests are split into chunks to stay within LLM token limits and provide incremental progress:
- Break count into chunks of chunk_size (e.g., 100 records with a chunk_size of 50 = 2 chunks)
- For each chunk: generate records via AI, validate against the field schema, insert into dataset_records
- Update task progress after each chunk: Math.round((chunkIndex + 1) / totalChunks * 100)
- Accumulate token usage across all chunks
- If a chunk fails, previously inserted records are kept and the error is reported (see the loop sketch after this list)
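A loop sketch under those rules; generateChunk, insertRecords, and updateTaskProgress are hypothetical helpers, while the chunk math and progress formula follow the list above:

```typescript
interface TokenUsage { input_tokens: number; output_tokens: number; total_tokens: number; }

declare function generateChunk(size: number, avoid: string[]): Promise<{ records: object[]; usage: TokenUsage }>;
declare function insertRecords(datasetId: string, records: object[]): Promise<void>;
declare function updateTaskProgress(taskId: string, progress: number): Promise<void>;

async function generateInChunks(taskId: string, datasetId: string, count: number, chunkSize: number): Promise<TokenUsage> {
  const totalChunks = Math.ceil(count / chunkSize);      // e.g. 100 records / 50 = 2 chunks
  const usage: TokenUsage = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  const seen: string[] = [];                             // unique values from earlier chunks

  for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
    const size = Math.min(chunkSize, count - chunkIndex * chunkSize); // last chunk may be short
    const { records, usage: u } = await generateChunk(size, seen);
    await insertRecords(datasetId, records);             // earlier inserts survive later failures
    // (this chunk's unique-field values would be appended to `seen` here)
    usage.input_tokens += u.input_tokens;
    usage.output_tokens += u.output_tokens;
    usage.total_tokens += u.total_tokens;
    await updateTaskProgress(taskId, Math.round(((chunkIndex + 1) / totalChunks) * 100));
  }
  return usage;
}
```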
Deduplication
For fields marked as unique in the field definitions, the worker applies a three-layer deduplication strategy:
- Pre-prompt: Pass existingUniqueValues to the AI prompt so it avoids generating known duplicates; also include values generated in previous chunks.
- Post-validate within batch: After generation, check for duplicates within the current chunk.
- Post-validate against DB: Check generated unique values against existing database records.
Skipped records and reasons are tracked in output.skipped and output.skippedReasons.
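The two post-generation layers can be sketched as a single pass that seeds a per-field set with known values (existingUniqueValues plus prior chunks) and then grows it within the batch. This is an illustrative implementation, not the worker's exact code:

```typescript
interface SkippedRecord { data: Record<string, unknown>; reason: string; }

function dedupeChunk(
  records: Record<string, unknown>[],
  uniqueFields: string[],
  existing: Record<string, string[]>, // field name → values already taken (DB + prior chunks)
): { kept: Record<string, unknown>[]; skipped: SkippedRecord[] } {
  const kept: Record<string, unknown>[] = [];
  const skipped: SkippedRecord[] = [];
  const seen = new Map<string, Set<string>>(
    uniqueFields.map((f) => [f, new Set(existing[f] ?? [])]),
  );

  for (const record of records) {
    const clash = uniqueFields.find((f) => seen.get(f)!.has(String(record[f])));
    if (clash) {
      skipped.push({
        data: record,
        reason: `Duplicate value '${String(record[clash])}' for unique field '${clash}'`,
      });
      continue;
    }
    for (const f of uniqueFields) seen.get(f)!.add(String(record[f])); // in-batch dedup
    kept.push(record);
  }
  return { kept, skipped };
}
```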
Anti-Hallucination
When web search is enabled, the worker follows a search-then-extract pipeline to prevent hallucinated data:
- Search-then-extract: When web_search.enabled is true, search for relevant sources first, then pass the retrieved content as grounding context for the AI model.
- Citation-required mode: When web_search.citations is true, instruct the model to cite sources for each field value. Citations are stored in _meta.citations.
- Null over guessing: The model should output null for fields where no reliable source is found, rather than hallucinating values.
- Never fabricate: URLs, statistics, and factual claims must come from actual sources.
X/Twitter search: When x_search is true, the task must use a Grok model (grok-*) since xAI has native access to X.com content. Other providers cannot access X/Twitter data.
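One way to express these rules is as instructions appended to the generation prompt; the wording below is illustrative, not the worker's actual prompt text:

```typescript
// Build anti-hallucination instructions for a grounded generation call.
function groundingInstructions(citationsRequired: boolean): string {
  const lines = [
    "Use ONLY the provided search results as your source of facts.",
    "If no reliable source covers a field, output null for that field.",
    "Never invent URLs, statistics, or other factual claims.",
  ];
  if (citationsRequired) {
    lines.push("For every field value, cite the source URL(s) it came from.");
  }
  return lines.join("\n");
}
```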
Shared Requirements
- Token usage: Accumulated across all AI calls and written to task.output.tokenUsage
- Progress: Updated via task.progress (0-100) after each chunk/batch
- Cancellation: Task status is checked every few chunks; processing stops if the task has been cancelled
- Error handling: On failure, status is set to failed, the error message is recorded, and completed_at is set (see the sketch after this list)
- Structured output: All AI calls use JSON schema for parseable responses
Provider Routing
The API server resolves the provider via task.metadata.resolvedProvider. For backward compatibility, the worker also detects the provider from the model prefix:
| Model Prefix | Provider | Notes |
|---|---|---|
| gpt-*, o3-*, o4-* | OpenAI | Default provider for most tasks |
| claude-* | Anthropic | Strong at structured data generation |
| gemini-* | Google Gemini | Good for high-volume tasks |
| grok-* | xAI (Grok) | Required for X/Twitter search (x_search) |
All providers are accessed via the OpenAI-compatible SDK with different base URLs and API keys. Provider mapping from API server names: openai→openai, anthropic→anthropic, google→gemini, openrouter→xai.
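A sketch of both routing steps, prefix detection and the API-server name mapping, derived from the table and paragraph above:

```typescript
// Prefix-based provider detection (backward-compatibility fallback).
function detectProviderFromModel(model: string): "openai" | "anthropic" | "gemini" | "xai" {
  if (model.startsWith("claude-")) return "anthropic";
  if (model.startsWith("gemini-")) return "gemini";
  if (model.startsWith("grok-")) return "xai";
  return "openai"; // gpt-*, o3-*, o4-*, and the default for anything else
}

// Map API-server provider names to the worker's internal provider keys.
const PROVIDER_MAP: Record<string, string> = {
  openai: "openai",
  anthropic: "anthropic",
  google: "gemini",
  openrouter: "xai",
};
```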