Dataspaces

A Dataspace is Sync's primary data organization unit: a structured repository that combines unstructured content (documents, images, CAD files, videos) with its AI-generated structured derivatives (embeddings, extracted metadata, text, summaries).

Dataspaces provide complete isolation between different business use cases, similar to how separate databases keep different application data segregated. Each dataspace has its own storage, metadata schema, and access controls.

What is a Dataspace?

At a high level, a dataspace is analogous to a table in a data lakehouse, but specifically designed for unstructured content. Each row represents a piece of content (a file), and the columns represent:

  • Original file (stored as a blob)
  • Extracted text (from PDFs, images via OCR, audio transcripts, etc.)
  • Vector embeddings (for semantic search)
  • Structured metadata (both user-provided and AI-extracted)
  • Derivatives (thumbnails, previews, page-level extractions)

This design makes it possible to store, search, and analyze unstructured content as if it were structured data—enabling SQL queries, BI tools, and traditional analytics workflows alongside AI-powered semantic search.
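
To make the table analogy concrete, here is a minimal sketch that models a dataspace as a single SQL table using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and only illustrate the idea; Sync's actual schema is not described in this document.

```python
import sqlite3

# Hypothetical schema for illustration only; not Sync's actual tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE content (
        id TEXT PRIMARY KEY,
        file_name TEXT,       -- stands in for the original file (blob)
        extracted_text TEXT,  -- text from PDFs, OCR, or transcripts
        doc_type TEXT,        -- structured metadata (user-provided or AI-extracted)
        year INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "nda.pdf", "Mutual non-disclosure agreement", "NDA", 2024),
        ("c2", "msa.pdf", "Master services agreement", "MSA", 2023),
        ("c3", "nda-old.pdf", "Non-disclosure agreement", "NDA", 2021),
    ],
)


def count_by_type(connection):
    """Aggregate files by document type, as a BI tool might."""
    cursor = connection.execute(
        "SELECT doc_type, COUNT(*) FROM content GROUP BY doc_type"
    )
    return dict(cursor.fetchall())


def recent_files(connection, doc_type, min_year):
    """Filter files on structured metadata columns."""
    cursor = connection.execute(
        "SELECT file_name FROM content WHERE doc_type = ? AND year >= ? "
        "ORDER BY file_name",
        (doc_type, min_year),
    )
    return [row[0] for row in cursor.fetchall()]
```

Once each file is a row, ordinary aggregation and filtering apply to content that started out as unstructured documents.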

Key Characteristics

Isolated Storage: Content, metadata, and embeddings in one dataspace are fully separated from those in every other dataspace, allowing you to:

  • Segregate data by department, project, or customer
  • Apply different security policies to different dataspaces
  • Scale storage independently for different use cases

Schema Flexibility: Dataspaces support arbitrary metadata: you can attach any JSON structure to your content without defining a schema upfront. Optionally, the ontology associated with the dataspace can be used to:

  • Enforce validation rules on uploaded metadata
  • Automatically extract structured data from documents during ingestion
  • Define categories and taxonomies for your content
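
As a rough illustration of what validation rules on uploaded metadata could look like, the sketch below checks a metadata dictionary against a small rule set. The rule format, field names, and the validate_metadata function are assumptions made for this example; Sync's real ontology format is not shown in this document.

```python
from typing import Any

# Hypothetical rule set: field name -> (expected type, required?).
CONTRACT_RULES: dict[str, tuple[type, bool]] = {
    "contractType": (str, True),
    "effectiveYear": (int, True),
    "counterparty": (str, False),
}


def validate_metadata(
    metadata: dict[str, Any], rules: dict[str, tuple[type, bool]]
) -> list[str]:
    """Return a list of validation errors (empty if the metadata passes)."""
    errors = []
    for field, (expected_type, required) in rules.items():
        if field not in metadata:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(metadata[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    # Fields not named in the rules are accepted, mirroring the
    # "arbitrary metadata" behavior described above.
    return errors
```

In this sketch, unknown fields pass through untouched, so validation tightens only the fields the rules name.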

Multi-Format Support: Dataspaces handle over 40 different file types, including:

  • Documents (PDF, Word, Excel, PowerPoint)
  • Images (JPEG, PNG, TIFF with OCR)
  • CAD files (AutoCAD, SolidWorks)
  • Media (videos with transcripts, audio with speech-to-text)
  • Archives (ZIP, TAR with automatic extraction)

Direct Access: Data in a dataspace can be accessed through:

  • REST API: Full-featured programmatic access
  • SQL: Direct database connections for BI tools and analytics
  • Cloud Provider APIs: Native blob storage access (e.g., S3 API)

Creating a Dataspace

Dataspaces are created via the Admin API and associated with an account and an ontology.

Example: Create a Dataspace

POST https://cloud.syncdocs.ai/api/accounts/{accountId}/dataspaces
Authorization: Bearer <token>
Content-Type: application/json

{
  "name": "Legal Contracts",
  "description": "Repository for all legal contracts and agreements",
  "ontologyId": "ont-3fa85f64-5717-4562-b3fc-2c963f66afa6"
}

Response:

{
  "id": "sds-12345678-abcd-1234-efgh-123456789012",
  "accountId": "acc-98765432-dcba-4321-hgfe-987654321098",
  "name": "Legal Contracts",
  "description": "Repository for all legal contracts and agreements",
  "ontologyId": "ont-3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "createdAt": "2024-10-28T12:00:00Z",
  "updatedAt": "2024-10-28T12:00:00Z"
}

Parameters:

  • name (required): Human-readable name for the dataspace
  • description (required): Description of the dataspace's purpose
  • ontologyId (required): ID of the ontology that defines how content is organized and validated in this dataspace (a UUID with an ont- prefix, as in the example above)
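
The same request can be assembled from Python using only the standard library. This is a minimal sketch: the endpoint, headers, and body fields come from the example above, while the function name build_create_dataspace_request and its argument names are illustrative. The function builds the request without sending it, so it can be inspected or tested offline.

```python
import json
import urllib.request

BASE_URL = "https://cloud.syncdocs.ai/api"


def build_create_dataspace_request(account_id, token, name, description, ontology_id):
    """Build (but do not send) the POST request for creating a dataspace."""
    fields = {"name": name, "description": description, "ontologyId": ontology_id}
    # All three parameters are documented as required.
    missing = [key for key, value in fields.items() if not value]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    return urllib.request.Request(
        url=f"{BASE_URL}/accounts/{account_id}/dataspaces",
        data=json.dumps(fields).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# To send the request (requires network access and a valid token):
#   with urllib.request.urlopen(request) as response:
#       dataspace = json.load(response)
```

Separating request construction from sending keeps the required-parameter check close to the payload and leaves transport concerns such as retries and timeouts to the caller.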

Example: List All Dataspaces

GET https://cloud.syncdocs.ai/api/accounts/{accountId}/dataspaces
Authorization: Bearer <token>

Response:

[
  {
    "id": "sds-12345678...",
    "name": "Legal Contracts",
    "description": "Repository for all legal contracts and agreements",
    "ontologyId": "ont-3fa85f64...",
    "createdAt": "2024-10-28T12:00:00Z"
  },
  {
    "id": "sds-87654321...",
    "name": "Marketing Assets",
    "description": "All marketing collateral and brand assets",
    "ontologyId": "ont-abc12345...",
    "createdAt": "2024-10-25T09:15:00Z"
  }
]
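
A client will typically parse this array and look up a dataspace by name or creation time. The sketch below assumes the response shape shown above; the helper names are illustrative, and the sample payload reuses the truncated IDs exactly as they appear in the example.

```python
import json
from datetime import datetime

SAMPLE_RESPONSE = """
[
  {"id": "sds-12345678...", "name": "Legal Contracts",
   "ontologyId": "ont-3fa85f64...", "createdAt": "2024-10-28T12:00:00Z"},
  {"id": "sds-87654321...", "name": "Marketing Assets",
   "ontologyId": "ont-abc12345...", "createdAt": "2024-10-25T09:15:00Z"}
]
"""


def parse_dataspaces(payload: str) -> list[dict]:
    """Parse the list response and convert createdAt to datetime objects."""
    items = json.loads(payload)
    for item in items:
        # Replace the "Z" suffix so fromisoformat also works before Python 3.11.
        item["createdAt"] = datetime.fromisoformat(
            item["createdAt"].replace("Z", "+00:00")
        )
    return items


def find_by_name(dataspaces: list[dict], name: str):
    """Return the first dataspace with the given name, or None."""
    return next((d for d in dataspaces if d["name"] == name), None)


dataspaces = parse_dataspaces(SAMPLE_RESPONSE)
newest = max(dataspaces, key=lambda d: d["createdAt"])
```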

Using a Dataspace

Once a dataspace is created, you interact with it primarily through a workspace. Workspaces provide the compute layer that enables you to:

  • Upload content to the dataspace
  • Trigger AI-powered ingestion and extraction
  • Query content with natural language
  • Execute workflows and batch operations
  • Access structured metadata via API or SQL

See the Workspaces documentation for details on how workspaces access dataspace content.

Dataspace Architecture

Under the hood, each dataspace consists of:

Blob Storage:

  • Raw uploaded files
  • Extracted text and derivatives
  • Generated thumbnails and previews
  • Accessible via the cloud provider's native APIs

Relational Tables:

  • Content metadata (file names, types, upload dates)
  • User-provided metadata (arbitrary JSON)
  • AI-extracted metadata (from ontology-driven queries)
  • Supports direct SQL queries for analytics

Vector Store:

  • Embeddings for semantic search
  • Multiple chunk sizes for different query types
  • High-performance similarity search indexes

All three storage layers are transparently managed by Sync and accessed through a unified API.
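
To illustrate what similarity search over embeddings means in practice, here is a toy sketch using cosine similarity in plain Python. The three-dimensional vectors and chunk names are made up for the example, and the brute-force scan stands in for whatever index Sync actually uses, which this document does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Made-up embeddings; real ones have hundreds or thousands of dimensions.
CHUNKS = {
    "termination-clause": [0.9, 0.1, 0.0],
    "payment-terms": [0.1, 0.9, 0.1],
    "logo-guidelines": [0.0, 0.1, 0.9],
}
QUERY = [0.8, 0.3, 0.0]
```

A production vector store replaces the linear scan with an approximate nearest-neighbor index, but the ranking principle is the same.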

Use Cases

Department Isolation: Create separate dataspaces for HR, Legal, Finance, and Engineering—each with its own ontology, access controls, and storage.

Customer Segmentation: In multi-tenant applications, create one dataspace per customer to ensure complete data isolation and independent scaling.

Project-Specific Repositories: Create dataspaces for specific initiatives (e.g., "Q4 Product Launch", "Audit 2024") with tailored ontologies and temporary lifecycles.

Compliance & Data Residency: Use separate dataspaces in different regions to help meet data residency and privacy requirements (for example, under GDPR or CCPA).

Coming Soon: Remote Dataspaces

Remote dataspaces will enable you to point Sync at existing repositories without migrating data. Instead of uploading files to Sync's storage, you'll configure a remote dataspace to reference external systems like:

  • SharePoint: Sync content directly from SharePoint sites
  • Cloud Storage: Point to existing S3 buckets, Azure Blob Storage, or Google Cloud Storage
  • Network Drives: Access on-premises file shares
  • Document Management Systems: Integrate with existing ECM platforms

Remote dataspaces will enable Sync's AI capabilities (semantic search, metadata extraction, query agents) while keeping data in its original location. This is ideal for:

  • Organizations with existing large document repositories
  • Compliance scenarios requiring data to remain in specific systems
  • Hybrid deployments with on-premises and cloud storage

Status: Remote dataspaces are currently in development. Contact us to join the early access program.

Next Steps