Automate Metadata Labeling (Document Ingestion Pipeline)

This guide shows you how to automatically extract structured metadata from unstructured documents using Sync's ontology-driven metadata extraction, a straightforward way to build a document ingestion pipeline for a wide variety of file types. By the end of this tutorial, you'll understand how to define custom metadata schemas, apply them to documents, and retrieve AI-extracted structured data at scale.

What You'll Learn

  • How to create an ontology that defines document categories and metadata extraction rules
  • How to upload documents to a dataspace and apply categories
  • How to trigger ingestion to make documents AI-ready and extract metadata
  • How to retrieve documents with their extracted metadata and attribution information
  • How to scale this workflow to hundreds or thousands of documents

Prerequisites

Before starting, make sure you have:

  • Completed the Account Setup Guide
  • An active workspace and dataspace
  • Your authentication token ready

Understanding the Core Concepts

Before diving into the implementation, let's briefly cover the two key concepts:

Ontologies

An ontology defines the structure and organization of your content. It consists of:

  • Categories: Types of documents (e.g., "Invoice", "Contract", "Research Paper")
  • Metadata Queries: AI-powered extraction instructions that pull specific data from documents
  • Query Bindings: Rules connecting metadata queries to categories

Ontologies enable Sync to automatically classify documents and extract structured data using AI.

Content

Content represents a document along with all its derivatives, including:

  • The original file
  • Extracted text and embeddings
  • Metadata: Structured data extracted by AI or provided manually
  • Inference Task Executions: Attribution tracking for which AI tasks generated each metadata value

Step 1: Create a Simple Ontology

Let's create an ontology for managing insurance claim documents. We'll define two categories and three metadata queries.

Create the Ontology

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents"
}

Response:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents",
  "createdAt": "2025-01-20T10:00:00Z"
}

Save the id - you'll need it for the next steps. Let's call it {ontologyId}.
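
If you're scripting these calls, here's a minimal Python sketch using the requests library; ACCOUNT_ID and TOKEN are placeholders for your own account ID and bearer token, not values defined by this guide:

import requests

BASE_URL = "https://cloud.syncdocs.ai/api"
ACCOUNT_ID = "your-account-id"   # placeholder: your account ID
TOKEN = "your-token"             # placeholder: your bearer token
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create the ontology and keep its id for the next steps
resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies",
    headers=headers,
    json={
        "name": "Insurance Claims Ontology",
        "description": "Ontology for classifying and extracting data from insurance claim documents",
    },
)
resp.raise_for_status()
ontology_id = resp.json()["id"]

The later sketches in this guide reuse BASE_URL, ACCOUNT_ID, headers, and ontology_id.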

Create Categories

Now create two document categories: "Auto Claim" and "Home Claim".

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation."
}

Response:

{
  "id": "cat-auto-001",
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage...",
  "ontologyId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "boundMetadataQueries": [],
  "createdAt": "2025-01-20T10:05:00Z"
}

Repeat for the "Home Claim" category:

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Home Claim",
  "description": "Insurance claims related to home damage or loss",
  "instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents."
}
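
If you prefer to script category creation, this sketch mirrors the two requests above and stores each category's id for the binding step later; it reuses BASE_URL, ACCOUNT_ID, headers, and ontology_id from the earlier sketch:

category_definitions = [
    {
        "name": "Auto Claim",
        "description": "Insurance claims related to automobile accidents",
        "instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation.",
    },
    {
        "name": "Home Claim",
        "description": "Insurance claims related to home damage or loss",
        "instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents.",
    },
]

# Create each category and record its id by name
category_ids = {}
for category in category_definitions:
    resp = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ontology_id}/categories",
        headers=headers,
        json=category,
    )
    resp.raise_for_status()
    category_ids[category["name"]] = resp.json()["id"]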

Create Metadata Queries

Now let's define three AI-powered metadata extraction queries that will run on our documents:

1. Claim Number

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Claim Number",
  "description": "The unique identifier assigned to this claim",
  "dataType": "SHORT_STRING",
  "instructions": "Extract the claim number from the document. It's typically a sequence of letters and numbers like 'CLM-2024-12345' or similar format."
}

2. Incident Date

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Incident Date",
  "description": "The date when the incident occurred",
  "dataType": "DATE",
  "instructions": "Extract the date of the incident being claimed. Return in ISO 8601 format (YYYY-MM-DD). Look for phrases like 'date of loss', 'incident date', or 'date of accident'."
}

3. Estimated Loss Amount

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Estimated Loss Amount",
  "description": "The estimated financial loss or damage amount",
  "dataType": "NUMBER",
  "instructions": "Extract the total estimated loss or damage amount in dollars. Look for fields like 'estimated loss', 'total damage', or 'claim amount'. Return just the numeric value without currency symbols."
}

Save each metadata query's id - you'll need them for binding.
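
Scripted, the three queries can be created in one loop that collects their ids. This is a sketch that reuses the names from the earlier snippets; the instructions strings are abbreviated here, so substitute the full instructions from the requests above:

# Instructions abbreviated; use the full text from the requests above
metadata_query_definitions = [
    {
        "name": "Claim Number",
        "description": "The unique identifier assigned to this claim",
        "dataType": "SHORT_STRING",
        "instructions": "Extract the claim number from the document...",
    },
    {
        "name": "Incident Date",
        "description": "The date when the incident occurred",
        "dataType": "DATE",
        "instructions": "Extract the date of the incident being claimed...",
    },
    {
        "name": "Estimated Loss Amount",
        "description": "The estimated financial loss or damage amount",
        "dataType": "NUMBER",
        "instructions": "Extract the total estimated loss or damage amount in dollars...",
    },
]

# Create each metadata query and record its id by name
query_ids = {}
for query in metadata_query_definitions:
    resp = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ontology_id}/metadata-queries",
        headers=headers,
        json=query,
    )
    resp.raise_for_status()
    query_ids[query["name"]] = resp.json()["id"]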

Bind Metadata Queries to Categories

Now we'll bind these metadata queries to our categories. Let's make "Claim Number" and "Incident Date" required for both categories:

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories/{categoryId}/metadata-queries/{metadataQueryId}
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "required": true,
  "uniqueIndexElement": false
}

Repeat this for each combination of category and metadata query (a scripted version follows this list). For example:

  • Bind "Claim Number" to "Auto Claim" (required)
  • Bind "Incident Date" to "Auto Claim" (required)
  • Bind "Estimated Loss Amount" to "Auto Claim" (not required)
  • Bind "Claim Number" to "Home Claim" (required)
  • Bind "Incident Date" to "Home Claim" (required)
  • Bind "Estimated Loss Amount" to "Home Claim" (not required)

Step 2: Upload a Document

Now that your ontology is configured, let's upload a claim document to your dataspace.

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}
Authorization: Bearer <your-token>
Content-Type: multipart/form-data

# Form fields:
# file: <binary data of your PDF/image/document>
# categoryId: "cat-auto-001"
# metadata: {}
# fileName: "claim_2024_12345.pdf"
# fileFormat: "application/pdf"

Response:

{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {},
  "inferenceTaskExecutions": {},
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:00:00Z"
}

At this point, the document is uploaded but not yet AI-ready. The metadata field is empty because we haven't run the extraction workflow yet.
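
For reference, here is the same upload as a hedged Python sketch with the requests library; WORKSPACE_ID and DATASPACE_ID are placeholders for your own workspace and dataspace, and the form fields follow the request shown above:

WORKSPACE_ID = "your-workspace-id"   # placeholder: your workspace ID
DATASPACE_ID = "sds-abc12345"        # placeholder: your dataspace ID
WS_URL = f"https://sws-{WORKSPACE_ID}.cloud.syncdocs.ai/api"

# Upload the claim document as multipart/form-data
with open("claim_2024_12345.pdf", "rb") as f:
    resp = requests.post(
        f"{WS_URL}/content/{DATASPACE_ID}",
        headers=headers,
        files={"file": ("claim_2024_12345.pdf", f, "application/pdf")},
        data={
            "categoryId": "cat-auto-001",
            "metadata": "{}",
            "fileName": "claim_2024_12345.pdf",
            "fileFormat": "application/pdf",
        },
    )
resp.raise_for_status()
content_id = resp.json()["contentId"]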

Step 3: Trigger Ingestion

Ingestion is the process that makes your document AI-ready by:

  1. Extracting text from the file (supporting 40+ file formats)
  2. Generating embeddings and semantic indexes
  3. Running all bound metadata queries to extract structured data

First, you'll need a workflow ID. You can create a default ingestion workflow or use an existing one. For this guide, we'll assume you have a workflow ID ({workflowId}).

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}/ingest?workflowId={workflowId}
Authorization: Bearer <your-token>
Content-Type: application/json

{}

Response:

{
  "id": "789e0123-e89b-12d3-a456-426614174002",
  "workflowId": "456f1234-e89b-12d3-a456-426614174001",
  "primaryScopeObjectId": "123e4567-e89b-12d3-a456-426614174000",
  "primaryScopeDataspaceId": "sds-abc12345",
  "status": "started",
  "startedAt": "2025-01-20T11:05:00Z",
  "completedAt": null,
  "errorMessage": null
}

The ingestion process runs asynchronously. Depending on the document size and complexity, it may take from a few seconds to a few minutes.
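
In Python, triggering ingestion and waiting for it to finish might look like the sketch below. The trigger mirrors the request above; the wait is an assumption rather than a documented pattern, since this guide doesn't cover a status endpoint for the execution, so it simply re-reads the content record (the Step 4 endpoint) until metadata appears. It reuses WS_URL, DATASPACE_ID, headers, and content_id from the upload sketch, and WORKFLOW_ID is a placeholder for your own workflow.

import time

WORKFLOW_ID = "your-workflow-id"  # placeholder: your ingestion workflow ID

# Trigger ingestion for the uploaded document
resp = requests.post(
    f"{WS_URL}/content/{DATASPACE_ID}/{content_id}/ingest",
    headers=headers,
    params={"workflowId": WORKFLOW_ID},
    json={},
)
resp.raise_for_status()

def wait_for_metadata(dataspace_id, content_id, timeout_s=600, poll_s=10):
    """Poll the content record until extracted metadata appears, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        r = requests.get(f"{WS_URL}/content/{dataspace_id}/{content_id}", headers=headers)
        r.raise_for_status()
        content = r.json()
        if content.get("metadata"):  # non-empty once extraction has run
            return content
        time.sleep(poll_s)
    raise TimeoutError(f"Ingestion of {content_id} did not complete within {timeout_s}s")

content = wait_for_metadata(DATASPACE_ID, content_id)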

Step 4: Retrieve Content with Extracted Metadata

Once ingestion is complete, you can retrieve the content with its extracted metadata:

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}
Authorization: Bearer <your-token>

Response:

{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {
    "Claim Number": "CLM-2024-12345",
    "Incident Date": "2024-01-15",
    "Estimated Loss Amount": 5250.00
  },
  "inferenceTaskExecutions": {
    "Claim Number": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Incident Date": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Estimated Loss Amount": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"
  },
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:10:00Z"
}

Understanding the Response

  • metadata: Contains the structured data extracted by AI. Each key corresponds to a metadata query name.
  • inferenceTaskExecutions: Maps each metadata field to the task execution ID that generated it. This provides full attribution and traceability. All three fields share the same task execution ID because they were all extracted in a single metadata extraction task.

Viewing Task Execution Details

To see exactly how the AI extracted each value, including confidence scores and justifications, you can fetch the detailed task execution output.

Using the task execution ID from the inferenceTaskExecutions object (in this case a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6):

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/task-execution-outputs/get-metadata-value/{dataspaceId}/a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6
Authorization: Bearer <your-token>

Response:

{
  "taskNameSlug": "get-metadata-value",
  "taskExecutionId": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
  "output": {
    "message": "Successfully extracted metadata for 3 keys",
    "results": {
      "Claim Number": {
        "value": "CLM-2024-12345",
        "confidence": "high",
        "justification": "Found claim number in document header at the top right corner labeled as 'Claim #'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Incident Date": {
        "value": "2024-01-15",
        "confidence": "high",
        "justification": "Date of loss explicitly stated in Section 2 of the form under 'Date of Incident'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Estimated Loss Amount": {
        "value": 5250.00,
        "confidence": "medium",
        "justification": "Estimated repair cost from initial assessment in Section 4. Final amount may differ pending adjuster review.",
        "potentialAlternatives": [4800, 5500],
        "selectionRationale": "Selected the mid-range estimate as the most conservative value between low estimate ($4,800) and high estimate ($5,500)"
      }
    }
  }
}

What This Output Tells You

For each extracted metadata field, the task execution output provides:

  • value: The extracted data that was stored in the content's metadata
  • confidence: AI's confidence level ("high", "medium", or "low")
  • justification: Detailed explanation of where and why the AI extracted this value
  • potentialAlternatives: Other values the AI considered (if applicable)
  • selectionRationale: Why the AI chose this value over the alternatives

This level of detail is invaluable for:

  • Quality assurance: Review extraction accuracy and identify patterns in errors (see the sketch after this list)
  • Compliance: Provide audit trails showing how metadata was derived
  • Model improvement: Use justifications to refine metadata query instructions
  • Debugging: Understand why certain extractions failed or had low confidence
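
As one example of the quality-assurance use case, here is a hedged sketch that fetches the task execution output via the endpoint shown above and surfaces any field whose confidence is not "high" for human review; it reuses WS_URL, DATASPACE_ID, headers, and the content dict from the earlier sketches.

def flag_low_confidence(dataspace_id, task_execution_id, accepted=("high",)):
    """Return extracted fields whose confidence is below the accepted levels."""
    resp = requests.get(
        f"{WS_URL}/task-execution-outputs/get-metadata-value/"
        f"{dataspace_id}/{task_execution_id}",
        headers=headers,
    )
    resp.raise_for_status()
    results = resp.json()["output"]["results"]
    return {name: d for name, d in results.items() if d["confidence"] not in accepted}

# All three fields share one task execution here, so any field's ID works
execution_id = content["inferenceTaskExecutions"]["Claim Number"]
for name, detail in flag_low_confidence(DATASPACE_ID, execution_id).items():
    print(f"Review needed: {name} = {detail['value']} ({detail['justification']})")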

Step 5: Scaling to Hundreds of Documents

Now that you've processed a single document, here's how to scale this workflow to handle hundreds or thousands of documents:

Batch Upload Strategy

  1. Upload documents in parallel: Use multiple concurrent API calls to upload documents faster
  2. Set categories at upload time: If you know the category beforehand, include it in the upload request
  3. Queue ingestion requests: After uploading, trigger ingestion for all documents

Example Batch Processing Script

import time

# The helpers used below (upload_content, trigger_ingestion, all_complete,
# get_content, export_metadata_to_csv) are thin wrappers around the endpoints
# shown in Steps 2-4 and an export of your choice.

# Upload all documents
content_ids = []
for document_file in document_files:
    response = upload_content(
        dataspace_id=dataspace_id,
        file=document_file,
        category_id=category_id,  # Pre-classify if possible
        metadata={}               # Let AI extract everything
    )
    content_ids.append(response['contentId'])

# Trigger ingestion for all documents
for content_id in content_ids:
    trigger_ingestion(
        dataspace_id=dataspace_id,
        content_id=content_id,
        workflow_id=workflow_id
    )

# Poll until every document has finished ingesting
while not all_complete(content_ids):
    time.sleep(10)

# Retrieve all content with extracted metadata
results = []
for content_id in content_ids:
    content = get_content(dataspace_id, content_id)
    results.append(content)

# Export to CSV or database
export_metadata_to_csv(results)
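
export_metadata_to_csv is not part of the Sync API; one possible implementation using Python's standard csv module might look like this, assuming each content item carries the three metadata fields defined earlier:

import csv

def export_metadata_to_csv(results, path="extracted_metadata.csv"):
    """Write one row per document with its identifying fields and extracted metadata."""
    metadata_fields = ["Claim Number", "Incident Date", "Estimated Loss Amount"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["contentId", "fileName"] + metadata_fields)
        writer.writeheader()
        for content in results:
            row = {"contentId": content["contentId"], "fileName": content["fileName"]}
            row.update({key: content["metadata"].get(key) for key in metadata_fields})
            writer.writerow(row)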

Best Practices for Scale

  • Parallel uploads: Upload 10-20 documents concurrently for maximum throughput (see the sketch after this list)
  • Batch size: Process documents in batches of 100-500 to monitor progress
  • Error handling: Implement retry logic for failed uploads or ingestion tasks
  • Status polling: Use the workflow execution endpoint to check batch progress
  • Rate limiting: Respect API rate limits (typically 100 requests/minute per workspace)
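
To make the parallel-upload and retry advice concrete, here is a hedged sketch built on Python's concurrent.futures and the upload_content helper from the batch script above; the worker count and backoff values are illustrative rather than Sync requirements.

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_with_retry(document_file, retries=3, backoff_s=5):
    """Retry a failed upload a few times with a growing backoff before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return upload_content(
                dataspace_id=dataspace_id,
                file=document_file,
                category_id=category_id,
                metadata={},
            )
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * attempt)

# Upload up to 10 documents at a time
content_ids = []
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(upload_with_retry, f) for f in document_files]
    for future in as_completed(futures):
        content_ids.append(future.result()["contentId"])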

Querying Across All Content

Once documents are processed, you can list and filter them across the dataspace using query parameters such as category and page size:

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}?category=Auto%20Claim&pageSize=100
Authorization: Bearer <your-token>

Or use Sync's semantic search and query capabilities to ask natural language questions across your entire corpus:

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/query
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "query": "Find all auto claims from January 2024 with estimated losses over $5000",
  "context": {
    "contentFilters": {
      "categories": ["Auto Claim"]
    }
  }
}
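
The same query can be issued from Python. This sketch reuses WS_URL, DATASPACE_ID, and headers from the earlier snippets and simply prints the raw JSON response, since the response shape isn't covered in this guide:

# Ask a natural language question scoped to the "Auto Claim" category
resp = requests.post(
    f"{WS_URL}/content/{DATASPACE_ID}/query",
    headers=headers,
    json={
        "query": "Find all auto claims from January 2024 with estimated losses over $5000",
        "context": {"contentFilters": {"categories": ["Auto Claim"]}},
    },
)
resp.raise_for_status()
print(resp.json())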

Using the Sync Cloud UI

While this guide focused on the programmatic API approach, Sync Cloud provides a rich web interface for managing ontologies and content:

Ontology Management UI

  • Visual ontology builder: Create and edit categories and metadata queries through an intuitive interface
  • Metadata query playground: Test and iterate on extraction instructions before applying them to production content
  • Excel import: Bulk upload ontologies from Excel spreadsheets with predefined categories and queries
  • Version history: Track changes to your ontology over time

Content Management UI

  • Document viewer: Preview uploaded documents and their extracted metadata side-by-side
  • Metadata editor: Manually correct or supplement AI-extracted metadata
  • Bulk operations: Update categories, trigger re-ingestion, or backfill metadata for multiple documents at once
  • Backfill wizard: After updating an ontology, automatically re-process existing documents to extract new metadata fields

Accessing the UI

  1. Log in to cloud.syncdocs.ai
  2. Navigate to Dataspaces → Select your dataspace
  3. Click Ontology to manage categories and metadata queries
  4. Click Content to view and manage uploaded documents

The UI is particularly useful for:

  • Initial ontology design: Experiment with metadata queries and test them on sample documents
  • Quality assurance: Review AI-extracted metadata and make corrections
  • Iterative improvement: Refine metadata query instructions based on extraction quality

What's Next?

Now that you can automatically extract structured metadata from documents, you can:

  • Create an AI agent that can answer questions using both the original document content and the structured metadata you've extracted.

Explore Advanced Features

  • Ontologies: Learn about version control, subcategories, and progressive ontology expansion
  • Queries: Use extracted metadata to power sophisticated document queries and research workflows
  • Projects: Organize related documents into projects for better workflow management