This guide shows you how to automatically extract and generate structured metadata from unstructured documents using Sync's ontology-driven metadata extraction. This is a simple way to enable document ingestion for a large variety of file types. By the end of this tutorial, you'll understand how to define custom metadata schemas, apply them to documents, and retrieve AI-extracted structured data at scale.
- How to create an ontology that defines document categories and metadata extraction rules
- How to upload documents to a dataspace and apply categories
- How to trigger ingestion to make documents AI-ready and extract metadata
- How to retrieve documents with their extracted metadata and attribution information
- How to scale this workflow to hundreds or thousands of documents
Before starting, make sure you have:
- Completed the Account Setup Guide
- An active workspace and dataspace
- Your authentication token ready
Before we dive into the implementation, let's briefly understand the two key concepts:
An ontology defines the structure and organization of your content. It consists of:
- Categories: Types of documents (e.g., "Invoice", "Contract", "Research Paper")
- Metadata Queries: AI-powered extraction instructions that pull specific data from documents
- Query Bindings: Rules connecting metadata queries to categories
Ontologies enable Sync to automatically classify documents and extract structured data using AI.
Content represents a document along with all its derivatives, including:
- The original file
- Extracted text and embeddings
- Metadata: Structured data extracted by AI or provided manually
- Inference Task Executions: Attribution tracking for which AI tasks generated each metadata value
Let's create an ontology for managing insurance claim documents. We'll define two categories and three metadata queries.
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Insurance Claims Ontology",
"description": "Ontology for classifying and extracting data from insurance claim documents"
}
Response:
{
"id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"name": "Insurance Claims Ontology",
"description": "Ontology for classifying and extracting data from insurance claim documents",
"createdAt": "2025-01-20T10:00:00Z"
}
Save the id - you'll need it for the next steps. Let's call it {ontologyId}.
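If you prefer to script these calls, here is a minimal Python sketch of the same request using the requests library. The account ID and token are placeholders you would supply from your own setup.

import os
import requests

# Placeholders: supply your own account ID and bearer token.
BASE_URL = "https://cloud.syncdocs.ai/api"
ACCOUNT_ID = os.environ["SYNC_ACCOUNT_ID"]
HEADERS = {"Authorization": f"Bearer {os.environ['SYNC_TOKEN']}"}

# Create the ontology and capture its id for the following steps.
resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies",
    headers=HEADERS,
    json={
        "name": "Insurance Claims Ontology",
        "description": "Ontology for classifying and extracting data from insurance claim documents",
    },
)
resp.raise_for_status()
ontology_id = resp.json()["id"]
print(f"Created ontology {ontology_id}")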
Now create two document categories: "Auto Claim" and "Home Claim".
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Auto Claim",
"description": "Insurance claims related to automobile accidents",
"instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation."
}
Response:
{
"id": "cat-auto-001",
"name": "Auto Claim",
"description": "Insurance claims related to automobile accidents",
"instructions": "This category includes all claim forms for vehicle damage...",
"ontologyId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
"boundMetadataQueries": [],
"createdAt": "2025-01-20T10:05:00Z"
}
Repeat for the "Home Claim" category:
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Home Claim",
"description": "Insurance claims related to home damage or loss",
"instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents."
}
Now let's define three AI-powered metadata extraction queries that will run on our documents:
1. Claim Number
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Claim Number",
"description": "The unique identifier assigned to this claim",
"dataType": "SHORT_STRING",
"instructions": "Extract the claim number from the document. It's typically a sequence of letters and numbers like 'CLM-2024-12345' or similar format."
}
2. Incident Date
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Incident Date",
"description": "The date when the incident occurred",
"dataType": "DATE",
"instructions": "Extract the date of the incident being claimed. Return in ISO 8601 format (YYYY-MM-DD). Look for phrases like 'date of loss', 'incident date', or 'date of accident'."
}
3. Estimated Loss Amount
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer <your-token>
Content-Type: application/json
{
"name": "Estimated Loss Amount",
"description": "The estimated financial loss or damage amount",
"dataType": "NUMBER",
"instructions": "Extract the total estimated loss or damage amount in dollars. Look for fields like 'estimated loss', 'total damage', or 'claim amount'. Return just the numeric value without currency symbols."
}
Save each metadata query's id - you'll need them for binding.
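If you are scripting this step, the following sketch creates all three queries in a loop and collects their ids for the binding step. It assumes the requests library and the placeholder account ID, token, and {ontologyId} from earlier; the instructions are abbreviated here.

import os
import requests

BASE_URL = "https://cloud.syncdocs.ai/api"
ACCOUNT_ID = os.environ["SYNC_ACCOUNT_ID"]
HEADERS = {"Authorization": f"Bearer {os.environ['SYNC_TOKEN']}"}
ontology_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"  # your {ontologyId}

# The three extraction queries defined above (instructions abbreviated).
query_definitions = [
    {
        "name": "Claim Number",
        "description": "The unique identifier assigned to this claim",
        "dataType": "SHORT_STRING",
        "instructions": "Extract the claim number from the document, e.g. 'CLM-2024-12345'.",
    },
    {
        "name": "Incident Date",
        "description": "The date when the incident occurred",
        "dataType": "DATE",
        "instructions": "Extract the incident date in ISO 8601 format (YYYY-MM-DD).",
    },
    {
        "name": "Estimated Loss Amount",
        "description": "The estimated financial loss or damage amount",
        "dataType": "NUMBER",
        "instructions": "Extract the total estimated loss in dollars as a numeric value.",
    },
]

query_ids = {}
for body in query_definitions:
    resp = requests.post(
        f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ontology_id}/metadata-queries",
        headers=HEADERS,
        json=body,
    )
    resp.raise_for_status()
    query_ids[body["name"]] = resp.json()["id"]  # saved for the binding step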
Now we'll bind these metadata queries to our categories. Let's make "Claim Number" and "Incident Date" required for both categories:
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories/{categoryId}/metadata-queries/{metadataQueryId}
Authorization: Bearer <your-token>
Content-Type: application/json
{
"required": true,
"uniqueIndexElement": false
}
Repeat this binding request for each combination of category and metadata query (a scripted version follows the list below). For example:
- Bind "Claim Number" to "Auto Claim" (required)
- Bind "Incident Date" to "Auto Claim" (required)
- Bind "Estimated Loss Amount" to "Auto Claim" (not required)
- Bind "Claim Number" to "Home Claim" (required)
- Bind "Incident Date" to "Home Claim" (required)
- Bind "Estimated Loss Amount" to "Home Claim" (not required)
Now that your ontology is configured, let's upload a claim document to your dataspace.
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}
Authorization: Bearer <your-token>
Content-Type: multipart/form-data
# Form fields:
# file: <binary data of your PDF/image/document>
# categoryId: "cat-auto-001"
# metadata: {}
# fileName: "claim_2024_12345.pdf"
# fileFormat: "application/pdf"
Response:
{
"contentId": "123e4567-e89b-12d3-a456-426614174000",
"dataspaceId": "sds-abc12345",
"categoryId": "cat-auto-001",
"fileName": "claim_2024_12345.pdf",
"fileFormat": "application/pdf",
"metadata": {},
"inferenceTaskExecutions": {},
"createdAt": "2025-01-20T11:00:00Z",
"updatedAt": "2025-01-20T11:00:00Z"
}
At this point, the document is uploaded but not yet AI-ready. The metadata field is empty because we haven't run the extraction workflow yet.
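If you are uploading from a script, here is a minimal sketch of the same multipart request; the workspace ID, dataspace ID, category ID, and token are placeholders.

import requests

WORKSPACE_URL = "https://sws-<workspaceId>.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": "Bearer <your-token>"}
dataspace_id = "sds-abc12345"

# Send the file plus the form fields shown above as multipart/form-data.
with open("claim_2024_12345.pdf", "rb") as f:
    resp = requests.post(
        f"{WORKSPACE_URL}/content/{dataspace_id}",
        headers=HEADERS,
        files={"file": ("claim_2024_12345.pdf", f, "application/pdf")},
        data={
            "categoryId": "cat-auto-001",
            "metadata": "{}",
            "fileName": "claim_2024_12345.pdf",
            "fileFormat": "application/pdf",
        },
    )
resp.raise_for_status()
content_id = resp.json()["contentId"]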
Ingestion is the process that makes your document AI-ready by:
- Extracting text from the file (supporting 40+ file formats)
- Generating embeddings and semantic indexes
- Running all bound metadata queries to extract structured data
First, you'll need a workflow ID. You can create a default ingestion workflow or use an existing one. For this guide, we'll assume you have a workflow ID ({workflowId}).
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}/ingest?workflowId={workflowId}
Authorization: Bearer <your-token>
Content-Type: application/json
{}
Response:
{
"id": "789e0123-e89b-12d3-a456-426614174002",
"workflowId": "456f1234-e89b-12d3-a456-426614174001",
"primaryScopeObjectId": "123e4567-e89b-12d3-a456-426614174000",
"primaryScopeDataspaceId": "sds-abc12345",
"status": "started",
"startedAt": "2025-01-20T11:05:00Z",
"completedAt": null,
"errorMessage": null
}
The ingestion process runs asynchronously. Depending on the document size and complexity, it may take from a few seconds to a few minutes.
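One simple way to wait for completion is to poll the content endpoint (covered in the next step) until the extracted metadata appears. A minimal sketch, assuming placeholder ids and token:

import time
import requests

WORKSPACE_URL = "https://sws-<workspaceId>.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": "Bearer <your-token>"}
dataspace_id = "sds-abc12345"
content_id = "123e4567-e89b-12d3-a456-426614174000"

# Poll until the bound metadata queries have populated the content's metadata.
for _ in range(60):  # give up after ~10 minutes
    resp = requests.get(f"{WORKSPACE_URL}/content/{dataspace_id}/{content_id}", headers=HEADERS)
    resp.raise_for_status()
    if resp.json().get("metadata"):
        break
    time.sleep(10)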
Once ingestion is complete, you can retrieve the content with its extracted metadata:
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}
Authorization: Bearer <your-token>
Response:
{
"contentId": "123e4567-e89b-12d3-a456-426614174000",
"dataspaceId": "sds-abc12345",
"categoryId": "cat-auto-001",
"fileName": "claim_2024_12345.pdf",
"fileFormat": "application/pdf",
"metadata": {
"Claim Number": "CLM-2024-12345",
"Incident Date": "2024-01-15",
"Estimated Loss Amount": 5250.00
},
"inferenceTaskExecutions": {
"Claim Number": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
"Incident Date": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
"Estimated Loss Amount": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"
},
"createdAt": "2025-01-20T11:00:00Z",
"updatedAt": "2025-01-20T11:10:00Z"
}
- metadata: Contains the structured data extracted by AI. Each key corresponds to a metadata query name.
- inferenceTaskExecutions: Maps each metadata field to the task execution ID that generated it. This provides full attribution and traceability. All three fields share the same task execution ID because they were all extracted in a single metadata extraction task.
To see exactly how the AI extracted each value, including confidence scores and justifications, you can fetch the detailed task execution output.
Using the task execution ID from the inferenceTaskExecutions object (in this case a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6):
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/task-execution-outputs/get-metadata-value/{dataspaceId}/a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6
Authorization: Bearer <your-token>
Response:
{
"taskNameSlug": "get-metadata-value",
"taskExecutionId": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
"output": {
"message": "Successfully extracted metadata for 3 keys",
"results": {
"Claim Number": {
"value": "CLM-2024-12345",
"confidence": "high",
"justification": "Found claim number in document header at the top right corner labeled as 'Claim #'",
"potentialAlternatives": null,
"selectionRationale": null
},
"Incident Date": {
"value": "2024-01-15",
"confidence": "high",
"justification": "Date of loss explicitly stated in Section 2 of the form under 'Date of Incident'",
"potentialAlternatives": null,
"selectionRationale": null
},
"Estimated Loss Amount": {
"value": 5250.00,
"confidence": "medium",
"justification": "Estimated repair cost from initial assessment in Section 4. Final amount may differ pending adjuster review.",
"potentialAlternatives": [4800, 5500],
"selectionRationale": "Selected the mid-range estimate as the most conservative value between low estimate ($4,800) and high estimate ($5,500)"
}
}
}
}
For each extracted metadata field, the task execution output provides:
- value: The extracted data that was stored in the content's metadata
- confidence: The AI's confidence level ("high", "medium", or "low")
- justification: A detailed explanation of where and why the AI extracted this value
- potentialAlternatives: Other values the AI considered (if applicable)
- selectionRationale: Why the AI chose this value over the alternatives
This level of detail is invaluable for:
- Quality assurance: Review extraction accuracy and identify patterns in errors (see the review sketch after this list)
- Compliance: Provide audit trails showing how metadata was derived
- Model improvement: Use justifications to refine metadata query instructions
- Debugging: Understand why certain extractions failed or had low confidence
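For example, a quality assurance pass might fetch the task execution output and flag any field extracted with less than high confidence for human review. A minimal sketch, assuming the placeholder ids used above:

import requests

WORKSPACE_URL = "https://sws-<workspaceId>.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": "Bearer <your-token>"}
dataspace_id = "sds-abc12345"
task_execution_id = "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"

resp = requests.get(
    f"{WORKSPACE_URL}/task-execution-outputs/get-metadata-value/{dataspace_id}/{task_execution_id}",
    headers=HEADERS,
)
resp.raise_for_status()
results = resp.json()["output"]["results"]

# Flag anything below "high" confidence for manual review.
for field, detail in results.items():
    if detail["confidence"] != "high":
        print(f"REVIEW {field}: {detail['value']!r} "
              f"({detail['confidence']}) - {detail['justification']}")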
Now that you've processed a single document, here's how to scale this workflow to handle hundreds or thousands of documents:
- Upload documents in parallel: Use multiple concurrent API calls to upload documents faster
- Set categories at upload time: If you know the category beforehand, include it in the upload request
- Queue ingestion requests: After uploading, trigger ingestion for all documents
import time  # used below for polling

# Note: upload_content, trigger_ingestion, all_complete, check_status, get_content,
# and export_metadata_to_csv are assumed helper wrappers around the API calls
# shown earlier in this guide.

# Upload all documents
content_ids = []
for document_file in document_files:
response = upload_content(
dataspace_id=dataspace_id,
file=document_file,
category_id=category_id, # Pre-classify if possible
metadata={} # Let AI extract everything
)
content_ids.append(response['contentId'])
# Trigger ingestion for all documents
for content_id in content_ids:
trigger_ingestion(
dataspace_id=dataspace_id,
content_id=content_id,
workflow_id=workflow_id
)
# Poll for completion
while not all_complete(content_ids):
time.sleep(10)
check_status(content_ids)
# Retrieve all content with metadata
results = []
for content_id in content_ids:
content = get_content(dataspace_id, content_id)
results.append(content)
# Export to CSV or database
export_metadata_to_csv(results)
- Parallel uploads: Upload 10-20 documents concurrently for maximum throughput
- Batch size: Process documents in batches of 100-500 to monitor progress
- Error handling: Implement retry logic for failed uploads or ingestion tasks (a retry sketch follows this list)
- Status polling: Use the workflow execution endpoint to check batch progress
- Rate limiting: Respect API rate limits (typically 100 requests/minute per workspace)
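As an illustration of the retry and rate-limiting advice above, here is a minimal, hypothetical helper that backs off exponentially on 429 and transient 5xx responses:

import time
import requests

def post_with_retry(url, *, headers=None, max_attempts=5, **kwargs):
    # Retry on rate limiting (429) and transient server errors, backing off 1s, 2s, 4s, ...
    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, **kwargs)
        if resp.status_code not in (429, 500, 502, 503, 504):
            resp.raise_for_status()
            return resp
        time.sleep(2 ** attempt)
    resp.raise_for_status()  # surface the last error if all attempts failed
    return resp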
Once documents are processed, you can query across all of them using category and metadata filters:
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}?category=Auto%20Claim&pageSize=100
Authorization: Bearer <your-token>
Or use Sync's semantic search and query capabilities to ask natural language questions across your entire corpus:
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/query
Authorization: Bearer <your-token>
Content-Type: application/json
{
"query": "Find all auto claims from January 2024 with estimated losses over $5000",
"context": {
"contentFilters": {
"categories": ["Auto Claim"]
}
}
}
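The same query can be issued from a script; a minimal sketch, assuming placeholder workspace and dataspace ids:

import requests

WORKSPACE_URL = "https://sws-<workspaceId>.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": "Bearer <your-token>"}
dataspace_id = "sds-abc12345"

resp = requests.post(
    f"{WORKSPACE_URL}/content/{dataspace_id}/query",
    headers=HEADERS,
    json={
        "query": "Find all auto claims from January 2024 with estimated losses over $5000",
        "context": {"contentFilters": {"categories": ["Auto Claim"]}},
    },
)
resp.raise_for_status()
print(resp.json())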
}While this guide focused on the programmatic API approach, Sync Cloud provides a rich web interface for managing ontologies and content:
- Visual ontology builder: Create and edit categories and metadata queries through an intuitive interface
- Metadata query playground: Test and iterate on extraction instructions before applying them to production content
- Excel import: Bulk upload ontologies from Excel spreadsheets with predefined categories and queries
- Version history: Track changes to your ontology over time
- Document viewer: Preview uploaded documents and their extracted metadata side-by-side
- Metadata editor: Manually correct or supplement AI-extracted metadata
- Bulk operations: Update categories, trigger re-ingestion, or backfill metadata for multiple documents at once
- Backfill wizard: After updating an ontology, automatically re-process existing documents to extract new metadata fields
- Log in to cloud.syncdocs.ai
- Navigate to Dataspaces → Select your dataspace
- Click Ontology to manage categories and metadata queries
- Click Content to view and manage uploaded documents
The UI is particularly useful for:
- Initial ontology design: Experiment with metadata queries and test them on sample documents
- Quality assurance: Review AI-extracted metadata and make corrections
- Iterative improvement: Refine metadata query instructions based on extraction quality
Now that you can automatically extract structured metadata from documents, you can:
Create an AI agent that can answer questions using both the original document content and the structured metadata you've extracted.
- Ontologies: Learn about version control, subcategories, and progressive ontology expansion
- Queries: Use extracted metadata to power sophisticated document queries and research workflows
- Projects: Organize related documents into projects for better workflow management