Automate Metadata Labeling (Document Ingestion Pipeline)

This guide shows you how to automatically extract structured metadata from unstructured documents using Sync's ontology-driven metadata extraction, a straightforward way to build a document ingestion pipeline for a wide variety of file types. By the end of this tutorial, you'll understand how to define custom metadata schemas, apply them to documents, and retrieve AI-extracted structured data at scale.

What You'll Learn

  • How to create an ontology that defines document categories and metadata extraction rules
  • How to upload documents to a dataspace and apply categories
  • How to trigger ingestion to make documents AI-ready and extract metadata
  • How to retrieve documents with their extracted metadata and attribution information
  • How to scale this workflow to hundreds or thousands of documents

Prerequisites

Before starting, make sure you have:

  • Completed the Account Setup Guide
  • An active workspace and dataspace
  • Your authentication token ready

Understanding the Core Concepts

Before diving into the implementation, let's briefly cover the two key concepts:

Ontologies

An ontology defines the structure and organization of your content. It consists of:

  • Categories: Types of documents (e.g., "Invoice", "Contract", "Research Paper")
  • Metadata Queries: AI-powered extraction instructions that pull specific data from documents
  • Query Bindings: Rules connecting metadata queries to categories

Ontologies enable Sync to automatically classify documents and extract structured data using AI.

Content

Content represents a document along with all its derivatives, including:

  • The original file
  • Extracted text and embeddings
  • Metadata: Structured data extracted by AI or provided manually
  • Inference Task Executions: Attribution tracking for which AI tasks generated each metadata value

Step 1: Create a Simple Ontology

Let's create an ontology for managing insurance claim documents. We'll define two categories and three metadata queries.

Create the Ontology

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents"
}

Response:

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents",
  "createdAt": "2025-01-20T10:00:00Z"
}

Save the id - you'll need it for the next steps. Let's call it {ontologyId}.
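
If you're scripting these calls, here's a minimal Python sketch using the requests library; ACCOUNT_ID and TOKEN are placeholders for your own account ID and bearer token, not values defined by this guide:

import requests

BASE_URL = "https://cloud.syncdocs.ai/api"
ACCOUNT_ID = "your-account-id"   # placeholder: your account ID
TOKEN = "your-token"             # placeholder: your bearer token
headers = {"Authorization": f"Bearer {TOKEN}"}

# Create the ontology and keep its id for the next steps
resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies",
    headers=headers,
    json={
        "name": "Insurance Claims Ontology",
        "description": "Ontology for classifying and extracting data from insurance claim documents",
    },
)
resp.raise_for_status()
ontology_id = resp.json()["id"]

The later sketches in this guide reuse BASE_URL, ACCOUNT_ID, headers, and ontology_id.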

Create Categories

Now create two document categories: "Auto Claim" and "Home Claim".

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation."
}

Response:

{
  "id": "cat-auto-001",
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage...",
  "ontologyId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "boundMetadataQueries": [],
  "createdAt": "2025-01-20T10:05:00Z"
}

Repeat for the "Home Claim" category:

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Home Claim",
  "description": "Insurance claims related to home damage or loss",
  "instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents."
}
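
If you prefer to script category creation, this sketch mirrors the two requests above and stores each category's id for the binding step later; it reuses BASE_URL, ACCOUNT_ID, headers, and ontology_id from the earlier sketch:

category_definitions = [
    {
        "name": "Auto Claim",
        "description": "Insurance claims related to automobile accidents",
        "instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation.",
    },
    {
        "name": "Home Claim",
        "description": "Insurance claims related to home damage or loss",
        "instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents.",
    },
]

# Create each category and record its id by name
category_ids = {}
for category in category_definitions:
    resp = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ontology_id}/categories",
        headers=headers,
        json=category,
    )
    resp.raise_for_status()
    category_ids[category["name"]] = resp.json()["id"]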

Create Metadata Queries

Now let's define three AI-powered metadata extraction queries that will run on our documents:

1. Claim Number

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Claim Number",
  "description": "The unique identifier assigned to this claim",
  "dataType": "SHORT_STRING",
  "instructions": "Extract the claim number from the document. It's typically a sequence of letters and numbers like 'CLM-2024-12345' or similar format."
}

2. Incident Date

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Incident Date",
  "description": "The date when the incident occurred",
  "dataType": "DATE",
  "instructions": "Extract the date of the incident being claimed. Return in ISO 8601 format (YYYY-MM-DD). Look for phrases like 'date of loss', 'incident date', or 'date of accident'."
}

3. Estimated Loss Amount

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "name": "Estimated Loss Amount",
  "description": "The estimated financial loss or damage amount",
  "dataType": "NUMBER",
  "instructions": "Extract the total estimated loss or damage amount in dollars. Look for fields like 'estimated loss', 'total damage', or 'claim amount'. Return just the numeric value without currency symbols."
}

Save each metadata query's id - you'll need them for binding.
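
Scripted, the three queries can be created in one loop that collects their ids. This is a sketch that reuses the names from the earlier snippets; the instructions strings are abbreviated here, so substitute the full instructions from the requests above:

# Instructions abbreviated; use the full text from the requests above
metadata_query_definitions = [
    {
        "name": "Claim Number",
        "description": "The unique identifier assigned to this claim",
        "dataType": "SHORT_STRING",
        "instructions": "Extract the claim number from the document...",
    },
    {
        "name": "Incident Date",
        "description": "The date when the incident occurred",
        "dataType": "DATE",
        "instructions": "Extract the date of the incident being claimed...",
    },
    {
        "name": "Estimated Loss Amount",
        "description": "The estimated financial loss or damage amount",
        "dataType": "NUMBER",
        "instructions": "Extract the total estimated loss or damage amount in dollars...",
    },
]

# Create each metadata query and record its id by name
query_ids = {}
for query in metadata_query_definitions:
    resp = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ontology_id}/metadata-queries",
        headers=headers,
        json=query,
    )
    resp.raise_for_status()
    query_ids[query["name"]] = resp.json()["id"]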

Bind Metadata Queries to Categories

Now we'll bind these metadata queries to our categories. Let's make "Claim Number" and "Incident Date" required for both categories:

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories/{categoryId}/metadata-queries/{metadataQueryId}
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "required": true,
  "uniqueIndexElement": false
}

Repeat this for each combination of category and metadata query (a scripted version follows this list). For example:

  • Bind "Claim Number" to "Auto Claim" (required)
  • Bind "Incident Date" to "Auto Claim" (required)
  • Bind "Estimated Loss Amount" to "Auto Claim" (not required)
  • Bind "Claim Number" to "Home Claim" (required)
  • Bind "Incident Date" to "Home Claim" (required)
  • Bind "Estimated Loss Amount" to "Home Claim" (not required)

Step 2: Upload a Document

Now that your ontology is configured, let's upload a claim document to your dataspace.

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}
Authorization: Bearer <your-token>
Content-Type: multipart/form-data

# Form fields:
# file: <binary data of your PDF/image/document>
# categoryId: "cat-auto-001"
# metadata: {}
# fileName: "claim_2024_12345.pdf"
# fileFormat: "application/pdf"

Response:

{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {},
  "inferenceTaskExecutions": {},
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:00:00Z"
}

At this point, the document is uploaded but not yet AI-ready. The metadata field is empty because we haven't run the extraction workflow yet.
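
For reference, here is the same upload as a hedged Python sketch with the requests library; WORKSPACE_ID and DATASPACE_ID are placeholders for your own workspace and dataspace, and the form fields follow the request shown above:

WORKSPACE_ID = "your-workspace-id"   # placeholder: your workspace ID
DATASPACE_ID = "sds-abc12345"        # placeholder: your dataspace ID
WS_URL = f"https://sws-{WORKSPACE_ID}.cloud.syncdocs.ai/api"

# Upload the claim document as multipart/form-data
with open("claim_2024_12345.pdf", "rb") as f:
    resp = requests.post(
        f"{WS_URL}/content/{DATASPACE_ID}",
        headers=headers,
        files={"file": ("claim_2024_12345.pdf", f, "application/pdf")},
        data={
            "categoryId": "cat-auto-001",
            "metadata": "{}",
            "fileName": "claim_2024_12345.pdf",
            "fileFormat": "application/pdf",
        },
    )
resp.raise_for_status()
content_id = resp.json()["contentId"]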

Step 3: Trigger Ingestion

Ingestion is the process that makes your document AI-ready by:

  1. Extracting text from the file (supporting 40+ file formats)
  2. Generating embeddings and semantic indexes
  3. Running all bound metadata queries to extract structured data

First, you'll need a workflow ID. You can create a default ingestion workflow or use an existing one. For this guide, we'll assume you have a workflow ID ({workflowId}).

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}/ingest?workflowId={workflowId}
Authorization: Bearer <your-token>
Content-Type: application/json

{}

Response:

{
  "id": "789e0123-e89b-12d3-a456-426614174002",
  "workflowId": "456f1234-e89b-12d3-a456-426614174001",
  "primaryScopeObjectId": "123e4567-e89b-12d3-a456-426614174000",
  "primaryScopeDataspaceId": "sds-abc12345",
  "status": "started",
  "startedAt": "2025-01-20T11:05:00Z",
  "completedAt": null,
  "errorMessage": null
}

The ingestion process runs asynchronously. Depending on the document size and complexity, it may take from a few seconds to a few minutes.
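
In Python, triggering ingestion and waiting for it to finish might look like the sketch below. The trigger mirrors the request above; the wait is an assumption rather than a documented pattern, since this guide doesn't cover a status endpoint for the execution, so it simply re-reads the content record (the Step 4 endpoint) until metadata appears. It reuses WS_URL, DATASPACE_ID, headers, and content_id from the upload sketch, and WORKFLOW_ID is a placeholder for your own workflow.

import time

WORKFLOW_ID = "your-workflow-id"  # placeholder: your ingestion workflow ID

# Trigger ingestion for the uploaded document
resp = requests.post(
    f"{WS_URL}/content/{DATASPACE_ID}/{content_id}/ingest",
    headers=headers,
    params={"workflowId": WORKFLOW_ID},
    json={},
)
resp.raise_for_status()

def wait_for_metadata(dataspace_id, content_id, timeout_s=600, poll_s=10):
    """Poll the content record until extracted metadata appears, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        r = requests.get(f"{WS_URL}/content/{dataspace_id}/{content_id}", headers=headers)
        r.raise_for_status()
        content = r.json()
        if content.get("metadata"):  # non-empty once extraction has run
            return content
        time.sleep(poll_s)
    raise TimeoutError(f"Ingestion of {content_id} did not complete within {timeout_s}s")

content = wait_for_metadata(DATASPACE_ID, content_id)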

Step 4: Retrieve Content with Extracted Metadata

Once ingestion is complete, you can retrieve the content with its extracted metadata:

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}
Authorization: Bearer <your-token>

Response:

{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {
    "Claim Number": "CLM-2024-12345",
    "Incident Date": "2024-01-15",
    "Estimated Loss Amount": 5250.00
  },
  "inferenceTaskExecutions": {
    "Claim Number": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Incident Date": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Estimated Loss Amount": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"
  },
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:10:00Z"
}

Understanding the Response

  • metadata: Contains the structured data extracted by AI. Each key corresponds to a metadata query name.
  • inferenceTaskExecutions: Maps each metadata field to the task execution ID that generated it. This provides full attribution and traceability. All three fields share the same task execution ID because they were all extracted in a single metadata extraction task.

Viewing Task Execution Details

To see exactly how the AI extracted each value, including confidence scores and justifications, you can fetch the detailed task execution output.

Using the task execution ID from the inferenceTaskExecutions object (in this case a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6):

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/task-execution-outputs/get-metadata-value/{dataspaceId}/a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6
Authorization: Bearer <your-token>

Response:

{
  "taskNameSlug": "get-metadata-value",
  "taskExecutionId": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
  "output": {
    "message": "Successfully extracted metadata for 3 keys",
    "results": {
      "Claim Number": {
        "value": "CLM-2024-12345",
        "confidence": "high",
        "justification": "Found claim number in document header at the top right corner labeled as 'Claim #'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Incident Date": {
        "value": "2024-01-15",
        "confidence": "high",
        "justification": "Date of loss explicitly stated in Section 2 of the form under 'Date of Incident'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Estimated Loss Amount": {
        "value": 5250.00,
        "confidence": "medium",
        "justification": "Estimated repair cost from initial assessment in Section 4. Final amount may differ pending adjuster review.",
        "potentialAlternatives": [4800, 5500],
        "selectionRationale": "Selected the mid-range estimate as the most conservative value between low estimate ($4,800) and high estimate ($5,500)"
      }
    }
  }
}

What This Output Tells You

For each extracted metadata field, the task execution output provides:

  • value: The extracted data that was stored in the content's metadata
  • confidence: AI's confidence level ("high", "medium", or "low")
  • justification: Detailed explanation of where and why the AI extracted this value
  • potentialAlternatives: Other values the AI considered (if applicable)
  • selectionRationale: Why the AI chose this value over the alternatives

This level of detail is invaluable for:

  • Quality assurance: Review extraction accuracy and identify patterns in errors (see the sketch after this list)
  • Compliance: Provide audit trails showing how metadata was derived
  • Model improvement: Use justifications to refine metadata query instructions
  • Debugging: Understand why certain extractions failed or had low confidence
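
As one example of the quality-assurance use case, here is a hedged sketch that fetches the task execution output via the endpoint shown above and surfaces any field whose confidence is not "high" for human review; it reuses WS_URL, DATASPACE_ID, headers, and the content dict from the earlier sketches.

def flag_low_confidence(dataspace_id, task_execution_id, accepted=("high",)):
    """Return extracted fields whose confidence is below the accepted levels."""
    resp = requests.get(
        f"{WS_URL}/task-execution-outputs/get-metadata-value/"
        f"{dataspace_id}/{task_execution_id}",
        headers=headers,
    )
    resp.raise_for_status()
    results = resp.json()["output"]["results"]
    return {name: d for name, d in results.items() if d["confidence"] not in accepted}

# All three fields share one task execution here, so any field's ID works
execution_id = content["inferenceTaskExecutions"]["Claim Number"]
for name, detail in flag_low_confidence(DATASPACE_ID, execution_id).items():
    print(f"Review needed: {name} = {detail['value']} ({detail['justification']})")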

Step 5: Scaling to Hundreds of Documents

Now that you've processed a single document, here's how to scale this workflow to handle hundreds or thousands of documents:

Batch Upload Strategy

  1. Upload documents in parallel: Use multiple concurrent API calls to upload documents faster
  2. Set categories at upload time: If you know the category beforehand, include it in the upload request
  3. Queue ingestion requests: After uploading, trigger ingestion for all documents

Example Batch Processing Script

import time

# The helpers used below (upload_content, trigger_ingestion, all_complete,
# get_content, export_metadata_to_csv) are thin wrappers around the endpoints
# shown in Steps 2-4 and an export of your choice.

# Upload all documents
content_ids = []
for document_file in document_files:
    response = upload_content(
        dataspace_id=dataspace_id,
        file=document_file,
        category_id=category_id,  # Pre-classify if possible
        metadata={}               # Let AI extract everything
    )
    content_ids.append(response['contentId'])

# Trigger ingestion for all documents
for content_id in content_ids:
    trigger_ingestion(
        dataspace_id=dataspace_id,
        content_id=content_id,
        workflow_id=workflow_id
    )

# Poll until every document has finished ingesting
while not all_complete(content_ids):
    time.sleep(10)

# Retrieve all content with extracted metadata
results = []
for content_id in content_ids:
    content = get_content(dataspace_id, content_id)
    results.append(content)

# Export to CSV or database
export_metadata_to_csv(results)
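
export_metadata_to_csv is not part of the Sync API; one possible implementation using Python's standard csv module might look like this, assuming each content item carries the three metadata fields defined earlier:

import csv

def export_metadata_to_csv(results, path="extracted_metadata.csv"):
    """Write one row per document with its identifying fields and extracted metadata."""
    metadata_fields = ["Claim Number", "Incident Date", "Estimated Loss Amount"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["contentId", "fileName"] + metadata_fields)
        writer.writeheader()
        for content in results:
            row = {"contentId": content["contentId"], "fileName": content["fileName"]}
            row.update({key: content["metadata"].get(key) for key in metadata_fields})
            writer.writerow(row)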

Best Practices for Scale

  • Parallel uploads: Upload 10-20 documents concurrently for maximum throughput (see the sketch after this list)
  • Batch size: Process documents in batches of 100-500 to monitor progress
  • Error handling: Implement retry logic for failed uploads or ingestion tasks
  • Status polling: Use the workflow execution endpoint to check batch progress
  • Rate limiting: Respect API rate limits (typically 100 requests/minute per workspace)
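
To make the parallel-upload and retry advice concrete, here is a hedged sketch built on Python's concurrent.futures and the upload_content helper from the batch script above; the worker count and backoff values are illustrative rather than Sync requirements.

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_with_retry(document_file, retries=3, backoff_s=5):
    """Retry a failed upload a few times with a growing backoff before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return upload_content(
                dataspace_id=dataspace_id,
                file=document_file,
                category_id=category_id,
                metadata={},
            )
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * attempt)

# Upload up to 10 documents at a time
content_ids = []
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(upload_with_retry, f) for f in document_files]
    for future in as_completed(futures):
        content_ids.append(future.result()["contentId"])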

Querying Across All Content

Once documents are processed, you can list and filter them across the dataspace using query parameters such as category and page size:

GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}?category=Auto%20Claim&pageSize=100
Authorization: Bearer <your-token>

Or use Sync's semantic search and query capabilities to ask natural language questions across your entire corpus:

POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/query
Authorization: Bearer <your-token>
Content-Type: application/json

{
  "query": "Find all auto claims from January 2024 with estimated losses over $5000",
  "context": {
    "contentFilters": {
      "categories": ["Auto Claim"]
    }
  }
}
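
The same query can be issued from Python. This sketch reuses WS_URL, DATASPACE_ID, and headers from the earlier snippets and simply prints the raw JSON response, since the response shape isn't covered in this guide:

# Ask a natural language question scoped to the "Auto Claim" category
resp = requests.post(
    f"{WS_URL}/content/{DATASPACE_ID}/query",
    headers=headers,
    json={
        "query": "Find all auto claims from January 2024 with estimated losses over $5000",
        "context": {"contentFilters": {"categories": ["Auto Claim"]}},
    },
)
resp.raise_for_status()
print(resp.json())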

Using the Sync Cloud UI

While this guide focused on the programmatic API approach, Sync Cloud provides a rich web interface for managing ontologies and content:

Ontology Management UI

  • Visual ontology builder: Create and edit categories and metadata queries through an intuitive interface
  • Metadata query playground: Test and iterate on extraction instructions before applying them to production content
  • Excel import: Bulk upload ontologies from Excel spreadsheets with predefined categories and queries
  • Version history: Track changes to your ontology over time

Content Management UI

  • Document viewer: Preview uploaded documents and their extracted metadata side-by-side
  • Metadata editor: Manually correct or supplement AI-extracted metadata
  • Bulk operations: Update categories, trigger re-ingestion, or backfill metadata for multiple documents at once
  • Backfill wizard: After updating an ontology, automatically re-process existing documents to extract new metadata fields

Accessing the UI

  1. Log in to cloud.syncdocs.ai
  2. Navigate to Dataspaces → Select your dataspace
  3. Click Ontology to manage categories and metadata queries
  4. Click Content to view and manage uploaded documents

The UI is particularly useful for:

  • Initial ontology design: Experiment with metadata queries and test them on sample documents
  • Quality assurance: Review AI-extracted metadata and make corrections
  • Iterative improvement: Refine metadata query instructions based on extraction quality

What's Next?

Now that you can automatically extract structured metadata from documents, you can:

  • Create an AI agent that can answer questions using both the original document content and the structured metadata you've extracted.

Explore Advanced Features

  • Ontologies: Learn about version control, subcategories, and progressive ontology expansion
  • Queries: Use extracted metadata to power sophisticated document queries and research workflows
  • Projects: Organize related documents into projects for better workflow management