# Automate Metadata Labeling (Document Ingestion Pipeline)

This guide shows you how to automatically extract and generate structured metadata from unstructured documents using Sync's ontology-driven metadata extraction. This is a simple way to enable document ingestion for a wide variety of file types. By the end of this tutorial, you'll understand how to define custom metadata schemas, apply them to documents, and retrieve AI-extracted structured data at scale.

## What You'll Learn

- How to create an **ontology** that defines document categories and metadata extraction rules
- How to upload documents to a **dataspace** and apply categories
- How to trigger **ingestion** to make documents AI-ready and extract metadata
- How to retrieve documents with their extracted metadata and attribution information
- How to scale this workflow to hundreds or thousands of documents

## Prerequisites

Before starting, make sure you have:

- Completed the [Account Setup Guide](/guides/account-setup)
- An active workspace and dataspace
- Your authentication token ready

## Understanding the Core Concepts

Before we dive into the implementation, let's briefly cover the two key concepts:

### Ontologies

An **[ontology](/concepts/ontologies)** defines the structure and organization of your content. It consists of:

- **Categories**: Types of documents (e.g., "Invoice", "Contract", "Research Paper")
- **Metadata Queries**: AI-powered extraction instructions that pull specific data from documents
- **Query Bindings**: Rules connecting metadata queries to categories

Ontologies enable Sync to automatically classify documents and extract structured data using AI.

### Content

**[Content](/concepts/content)** represents a document along with all its derivatives, including:

- The original file
- Extracted text and embeddings
- **Metadata**: Structured data extracted by AI or provided manually
- **Inference Task Executions**: Attribution tracking for which AI tasks generated each metadata value

## Step 1: Create a Simple Ontology

Let's create an ontology for managing insurance claim documents. We'll define two categories and three metadata queries.

### Create the Ontology

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents"
}
```

**Response:**

```json
{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "name": "Insurance Claims Ontology",
  "description": "Ontology for classifying and extracting data from insurance claim documents",
  "createdAt": "2025-01-20T10:00:00Z"
}
```

Save the `id` from the response; you'll need it for the next steps. Let's call it `{ontologyId}`.
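If you prefer to script these calls rather than issue them by hand, the same request can be made from Python. The snippet below is a minimal sketch using the `requests` library; the `SYNC_TOKEN` and `SYNC_ACCOUNT_ID` environment variables are illustrative placeholders for your own credentials, not names defined by the API.

```python
import os
import requests

# Placeholders - substitute your own values (these names are not defined by the API).
SYNC_TOKEN = os.environ["SYNC_TOKEN"]
ACCOUNT_ID = os.environ["SYNC_ACCOUNT_ID"]

BASE_URL = "https://cloud.syncdocs.ai/api"
HEADERS = {"Authorization": f"Bearer {SYNC_TOKEN}"}

# Create the ontology and capture its id for the following steps.
resp = requests.post(
    f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies",
    headers=HEADERS,
    json={
        "name": "Insurance Claims Ontology",
        "description": "Ontology for classifying and extracting data from insurance claim documents",
    },
)
resp.raise_for_status()
ontology_id = resp.json()["id"]
print(f"Created ontology {ontology_id}")
```

The same pattern (build the URL, send the JSON body, capture the returned `id`) applies to every call in the rest of this step.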
### Create Categories

Now create two document categories: "Auto Claim" and "Home Claim".

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage, personal injury from auto accidents, and related documentation."
}
```

**Response:**

```json
{
  "id": "cat-auto-001",
  "name": "Auto Claim",
  "description": "Insurance claims related to automobile accidents",
  "instructions": "This category includes all claim forms for vehicle damage...",
  "ontologyId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "boundMetadataQueries": [],
  "createdAt": "2025-01-20T10:05:00Z"
}
```

Repeat for the "Home Claim" category:

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Home Claim",
  "description": "Insurance claims related to home damage or loss",
  "instructions": "This category includes property damage claims, theft, fire, water damage, and other home-related incidents."
}
```

### Create Metadata Queries

Now let's define three AI-powered metadata extraction queries that will run on our documents:

**1. Claim Number**

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Claim Number",
  "description": "The unique identifier assigned to this claim",
  "dataType": "SHORT_STRING",
  "instructions": "Extract the claim number from the document. It's typically a sequence of letters and numbers like 'CLM-2024-12345' or similar format."
}
```

**2. Incident Date**

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Incident Date",
  "description": "The date when the incident occurred",
  "dataType": "DATE",
  "instructions": "Extract the date of the incident being claimed. Return in ISO 8601 format (YYYY-MM-DD). Look for phrases like 'date of loss', 'incident date', or 'date of accident'."
}
```

**3. Estimated Loss Amount**

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/metadata-queries
Authorization: Bearer {token}
Content-Type: application/json

{
  "name": "Estimated Loss Amount",
  "description": "The estimated financial loss or damage amount",
  "dataType": "NUMBER",
  "instructions": "Extract the total estimated loss or damage amount in dollars. Look for fields like 'estimated loss', 'total damage', or 'claim amount'. Return just the numeric value without currency symbols."
}
```

Save each metadata query's `id`; you'll need them for binding.

### Bind Metadata Queries to Categories

Now we'll bind these metadata queries to our categories. Let's make "Claim Number" and "Incident Date" required for both categories:

```bash
POST https://cloud.syncdocs.ai/api/accounts/{accountId}/ontologies/{ontologyId}/categories/{categoryId}/metadata-queries/{metadataQueryId}
Authorization: Bearer {token}
Content-Type: application/json

{
  "required": true,
  "uniqueIndexElement": false
}
```

Repeat this for each combination of category and metadata query (a scripted version of these calls is sketched after the list). For example:

- Bind "Claim Number" to "Auto Claim" (required)
- Bind "Incident Date" to "Auto Claim" (required)
- Bind "Estimated Loss Amount" to "Auto Claim" (not required)
- Bind "Claim Number" to "Home Claim" (required)
- Bind "Incident Date" to "Home Claim" (required)
- Bind "Estimated Loss Amount" to "Home Claim" (not required)
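Issuing six nearly identical requests by hand is tedious, so here's a minimal Python sketch that loops over the combinations. The "Home Claim" category ID and the metadata query IDs below are hypothetical placeholders; substitute the IDs returned by your own create calls.

```python
import os
import requests

SYNC_TOKEN = os.environ["SYNC_TOKEN"]
ACCOUNT_ID = os.environ["SYNC_ACCOUNT_ID"]
ONTOLOGY_ID = "3fa85f64-5717-4562-b3fc-2c963f66afa6"  # from the create-ontology response

BASE_URL = "https://cloud.syncdocs.ai/api"
HEADERS = {"Authorization": f"Bearer {SYNC_TOKEN}"}

# Category and metadata query IDs captured from the earlier create calls
# ("cat-home-001" and the "mq-..." values are placeholder examples).
categories = {"Auto Claim": "cat-auto-001", "Home Claim": "cat-home-001"}
metadata_queries = {
    "Claim Number": "mq-claim-number-id",
    "Incident Date": "mq-incident-date-id",
    "Estimated Loss Amount": "mq-loss-amount-id",
}
# Queries that should be required on every category.
required_queries = {"Claim Number", "Incident Date"}

for category_name, category_id in categories.items():
    for query_name, query_id in metadata_queries.items():
        resp = requests.post(
            f"{BASE_URL}/accounts/{ACCOUNT_ID}/ontologies/{ONTOLOGY_ID}"
            f"/categories/{category_id}/metadata-queries/{query_id}",
            headers=HEADERS,
            json={"required": query_name in required_queries, "uniqueIndexElement": False},
        )
        resp.raise_for_status()
        print(f"Bound '{query_name}' to '{category_name}'")
```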
## Step 2: Upload a Document

Now that your ontology is configured, let's upload a claim document to your dataspace.

```bash
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}
Authorization: Bearer {token}
Content-Type: multipart/form-data

# Form fields:
# file: <the document binary>
# categoryId: "cat-auto-001"
# metadata: {}
# fileName: "claim_2024_12345.pdf"
# fileFormat: "application/pdf"
```

**Response:**

```json
{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {},
  "inferenceTaskExecutions": {},
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:00:00Z"
}
```

At this point, the document is **uploaded** but not yet **AI-ready**. The `metadata` field is empty because we haven't run the extraction workflow yet.

## Step 3: Trigger Ingestion

Ingestion is the process that makes your document AI-ready by:

1. Extracting text from the file (supporting 40+ file formats)
2. Generating embeddings and semantic indexes
3. **Running all bound metadata queries** to extract structured data

First, you'll need a workflow ID. You can create a default ingestion workflow or use an existing one. For this guide, we'll assume you already have a workflow ID (`{workflowId}`).

```bash
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}/ingest?workflowId={workflowId}
Authorization: Bearer {token}
Content-Type: application/json

{}
```

**Response:**

```json
{
  "id": "789e0123-e89b-12d3-a456-426614174002",
  "workflowId": "456f1234-e89b-12d3-a456-426614174001",
  "primaryScopeObjectId": "123e4567-e89b-12d3-a456-426614174000",
  "primaryScopeDataspaceId": "sds-abc12345",
  "status": "started",
  "startedAt": "2025-01-20T11:05:00Z",
  "completedAt": null,
  "errorMessage": null
}
```

The ingestion process runs asynchronously. Depending on the document's size and complexity, it may take anywhere from a few seconds to a few minutes.

## Step 4: Retrieve Content with Extracted Metadata

Once ingestion is complete, you can retrieve the content with its extracted metadata:

```bash
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/{contentId}
Authorization: Bearer {token}
```

**Response:**

```json
{
  "contentId": "123e4567-e89b-12d3-a456-426614174000",
  "dataspaceId": "sds-abc12345",
  "categoryId": "cat-auto-001",
  "fileName": "claim_2024_12345.pdf",
  "fileFormat": "application/pdf",
  "metadata": {
    "Claim Number": "CLM-2024-12345",
    "Incident Date": "2024-01-15",
    "Estimated Loss Amount": 5250.00
  },
  "inferenceTaskExecutions": {
    "Claim Number": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Incident Date": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
    "Estimated Loss Amount": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"
  },
  "createdAt": "2025-01-20T11:00:00Z",
  "updatedAt": "2025-01-20T11:10:00Z"
}
```

### Understanding the Response

- **`metadata`**: Contains the structured data extracted by AI. Each key corresponds to a metadata query name.
- **`inferenceTaskExecutions`**: Maps each metadata field to the task execution ID that generated it. This provides full attribution and traceability. All three fields share the same task execution ID because they were all extracted in a single metadata extraction task.
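In a script, you usually won't know exactly when ingestion has finished. One simple approach, assuming the extracted metadata appears on the content record once ingestion completes (as in the response above), is to poll the content endpoint until `metadata` is populated. The sketch below uses placeholder environment variables and IDs; the helper name `wait_for_metadata` is ours, not part of the API.

```python
import os
import time
import requests

SYNC_TOKEN = os.environ["SYNC_TOKEN"]
WORKSPACE_ID = os.environ["SYNC_WORKSPACE_ID"]
DATASPACE_ID = "sds-abc12345"
CONTENT_ID = "123e4567-e89b-12d3-a456-426614174000"

BASE_URL = f"https://sws-{WORKSPACE_ID}.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": f"Bearer {SYNC_TOKEN}"}


def wait_for_metadata(dataspace_id: str, content_id: str, timeout_s: int = 300) -> dict:
    """Poll the content endpoint until AI-extracted metadata shows up (or we time out)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(f"{BASE_URL}/content/{dataspace_id}/{content_id}", headers=HEADERS)
        resp.raise_for_status()
        content = resp.json()
        if content.get("metadata"):
            return content
        time.sleep(10)  # ingestion can take anywhere from seconds to minutes
    raise TimeoutError(f"Ingestion did not complete within {timeout_s} seconds")


content = wait_for_metadata(DATASPACE_ID, CONTENT_ID)
for field, value in content["metadata"].items():
    task_id = content["inferenceTaskExecutions"].get(field)
    print(f"{field}: {value!r} (extracted by task {task_id})")
```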
### Viewing Task Execution Details

To see exactly how the AI extracted each value, including confidence scores and justifications, you can fetch the detailed task execution output. Using the task execution ID from the `inferenceTaskExecutions` object (in this case `a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6`):

```bash
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/task-execution-outputs/get-metadata-value/{dataspaceId}/a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6
Authorization: Bearer {token}
```

**Response:**

```json
{
  "taskNameSlug": "get-metadata-value",
  "taskExecutionId": "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6",
  "output": {
    "message": "Successfully extracted metadata for 3 keys",
    "results": {
      "Claim Number": {
        "value": "CLM-2024-12345",
        "confidence": "high",
        "justification": "Found claim number in document header at the top right corner labeled as 'Claim #'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Incident Date": {
        "value": "2024-01-15",
        "confidence": "high",
        "justification": "Date of loss explicitly stated in Section 2 of the form under 'Date of Incident'",
        "potentialAlternatives": null,
        "selectionRationale": null
      },
      "Estimated Loss Amount": {
        "value": 5250.00,
        "confidence": "medium",
        "justification": "Estimated repair cost from initial assessment in Section 4. Final amount may differ pending adjuster review.",
        "potentialAlternatives": [4800, 5500],
        "selectionRationale": "Selected the mid-range estimate as the most conservative value between low estimate ($4,800) and high estimate ($5,500)"
      }
    }
  }
}
```

### What This Output Tells You

For each extracted metadata field, the task execution output provides:

- **`value`**: The extracted data that was stored in the content's metadata
- **`confidence`**: The AI's confidence level (`"high"`, `"medium"`, or `"low"`)
- **`justification`**: A detailed explanation of where and why the AI extracted this value
- **`potentialAlternatives`**: Other values the AI considered (if applicable)
- **`selectionRationale`**: Why the AI chose this value over the alternatives

This level of detail is invaluable for:

- **Quality assurance**: Review extraction accuracy and identify patterns in errors (see the sketch below)
- **Compliance**: Provide audit trails showing how metadata was derived
- **Model improvement**: Use justifications to refine metadata query instructions
- **Debugging**: Understand why certain extractions failed or had low confidence
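For example, a quality-assurance workflow might pull the task execution output and flag any field that wasn't extracted with high confidence so a reviewer can check it. This is a minimal sketch under the same placeholder assumptions as the earlier snippets; `fields_needing_review` is our own helper name.

```python
import os
import requests

SYNC_TOKEN = os.environ["SYNC_TOKEN"]
WORKSPACE_ID = os.environ["SYNC_WORKSPACE_ID"]
DATASPACE_ID = "sds-abc12345"

BASE_URL = f"https://sws-{WORKSPACE_ID}.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": f"Bearer {SYNC_TOKEN}"}


def fields_needing_review(dataspace_id: str, task_execution_id: str) -> list[dict]:
    """Return extracted fields whose confidence was not 'high', for manual review."""
    resp = requests.get(
        f"{BASE_URL}/task-execution-outputs/get-metadata-value/{dataspace_id}/{task_execution_id}",
        headers=HEADERS,
    )
    resp.raise_for_status()
    results = resp.json()["output"]["results"]

    flagged = []
    for field, details in results.items():
        if details["confidence"] != "high":
            flagged.append(
                {
                    "field": field,
                    "value": details["value"],
                    "confidence": details["confidence"],
                    "justification": details["justification"],
                    "alternatives": details.get("potentialAlternatives"),
                }
            )
    return flagged


# Example usage with the task execution ID from the previous response:
for item in fields_needing_review(DATASPACE_ID, "a1b2c3d4-e5f6-7g8h-9i0j-k1l2m3n4o5p6"):
    print(f"Review needed - {item['field']}: {item['value']} ({item['confidence']})")
```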
## Step 5: Scaling to Hundreds of Documents

Now that you've processed a single document, here's how to scale this workflow to handle hundreds or thousands of documents:

### Batch Upload Strategy

1. **Upload documents in parallel**: Use multiple concurrent API calls to upload documents faster
2. **Set categories at upload time**: If you know the category beforehand, include it in the upload request
3. **Queue ingestion requests**: After uploading, trigger ingestion for all documents

### Example Batch Processing Script

The helper functions in this script (`upload_content`, `trigger_ingestion`, `all_complete`, `check_status`, `get_content`, `export_metadata_to_csv`) are thin wrappers around the API calls shown in Steps 2-4; implement them with whatever HTTP client you prefer.

```python
import time

# Upload all documents
content_ids = []
for document_file in document_files:
    response = upload_content(
        dataspace_id=dataspace_id,
        file=document_file,
        category_id=category_id,  # Pre-classify if possible
        metadata={}  # Let AI extract everything
    )
    content_ids.append(response['contentId'])

# Trigger ingestion for all documents
for content_id in content_ids:
    trigger_ingestion(
        dataspace_id=dataspace_id,
        content_id=content_id,
        workflow_id=workflow_id
    )

# Poll for completion
while not all_complete(content_ids):
    time.sleep(10)
    check_status(content_ids)

# Retrieve all content with metadata
results = []
for content_id in content_ids:
    content = get_content(dataspace_id, content_id)
    results.append(content)

# Export to CSV or database
export_metadata_to_csv(results)
```

### Best Practices for Scale

- **Parallel uploads**: Upload 10-20 documents concurrently for maximum throughput
- **Batch size**: Process documents in batches of 100-500 so you can monitor progress
- **Error handling**: Implement retry logic for failed uploads or ingestion tasks
- **Status polling**: Use the workflow execution endpoint to check batch progress
- **Rate limiting**: Respect API rate limits (typically 100 requests/minute per workspace)

### Querying Across All Content

Once documents are processed, you can list and filter across all of them:

```bash
GET https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}?category=Auto%20Claim&pageSize=100
Authorization: Bearer {token}
```

Or use Sync's semantic search and query capabilities to ask natural language questions across your entire corpus (a Python version of this call is sketched below):

```bash
POST https://sws-{workspaceId}.cloud.syncdocs.ai/api/content/{dataspaceId}/query
Authorization: Bearer {token}
Content-Type: application/json

{
  "query": "Find all auto claims from January 2024 with estimated losses over $5000",
  "context": {
    "contentFilters": {
      "categories": ["Auto Claim"]
    }
  }
}
```
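Here is the same query issued from Python. The response shape of the query endpoint isn't documented in this guide, so this sketch simply prints the raw JSON; the workspace, dataspace, and token variables are placeholders as before.

```python
import os
import json
import requests

SYNC_TOKEN = os.environ["SYNC_TOKEN"]
WORKSPACE_ID = os.environ["SYNC_WORKSPACE_ID"]
DATASPACE_ID = "sds-abc12345"

BASE_URL = f"https://sws-{WORKSPACE_ID}.cloud.syncdocs.ai/api"
HEADERS = {"Authorization": f"Bearer {SYNC_TOKEN}"}

resp = requests.post(
    f"{BASE_URL}/content/{DATASPACE_ID}/query",
    headers=HEADERS,
    json={
        "query": "Find all auto claims from January 2024 with estimated losses over $5000",
        "context": {"contentFilters": {"categories": ["Auto Claim"]}},
    },
)
resp.raise_for_status()
# The response structure isn't shown in this guide, so just inspect the raw payload.
print(json.dumps(resp.json(), indent=2))
```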
## Using the Sync Cloud UI

While this guide focused on the programmatic API approach, Sync Cloud also provides a rich web interface for managing ontologies and content:

### Ontology Management UI

- **Visual ontology builder**: Create and edit categories and metadata queries through an intuitive interface
- **Metadata query playground**: Test and iterate on extraction instructions before applying them to production content
- **Excel import**: Bulk upload ontologies from Excel spreadsheets with predefined categories and queries
- **Version history**: Track changes to your ontology over time

### Content Management UI

- **Document viewer**: Preview uploaded documents and their extracted metadata side-by-side
- **Metadata editor**: Manually correct or supplement AI-extracted metadata
- **Bulk operations**: Update categories, trigger re-ingestion, or backfill metadata for multiple documents at once
- **Backfill wizard**: After updating an ontology, automatically re-process existing documents to extract new metadata fields

### Accessing the UI

1. Log in to [cloud.syncdocs.ai](https://cloud.syncdocs.ai)
2. Navigate to **Dataspaces** → Select your dataspace
3. Click **Ontology** to manage categories and metadata queries
4. Click **Content** to view and manage uploaded documents

The UI is particularly useful for:

- **Initial ontology design**: Experiment with metadata queries and test them on sample documents
- **Quality assurance**: Review AI-extracted metadata and make corrections
- **Iterative improvement**: Refine metadata query instructions based on extraction quality

## What's Next?

Now that you can automatically extract structured metadata from documents, you can:

### [Build a Research Agent](/guides/build-research-agent)

Create an AI agent that can answer questions using both the original document content and the structured metadata you've extracted.

### Explore Advanced Features

- **[Ontologies](/concepts/ontologies)**: Learn about version control, subcategories, and progressive ontology expansion
- **[Queries](/concepts/queries)**: Use extracted metadata to power sophisticated document queries and research workflows
- **[Projects](/concepts/projects)**: Organize related documents into projects for better workflow management