CordonData is a complete Document Management System — upload, organize, version, share, and annotate. Then layer on AI search, compliance scanning, and OCR to unlock everything inside your files.
Document Management
Upload up to 500MB per file. Organize with nested folders (up to 50 levels deep, 100K+ files per folder). Full version history, granular sharing with VIEW/EDIT/DELETE permissions, and PDF annotation with permanent redaction.
AI-Powered Search
Ask questions in natural language across all your documents. Hybrid search combines semantic vectors with BM25 keywords. Every answer cites the exact source document and page.
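One common way to merge a semantic ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only: this page does not say which fusion method CordonData uses, and the function name, the constant `k=60`, and the sample document IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a
    documented CordonData setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector index and a BM25 index.
semantic_hits = ["policy.pdf", "handbook.pdf", "memo.pdf"]
bm25_hits = ["memo.pdf", "policy.pdf", "invoice.pdf"]
fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still appear further down.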
PII, NHI & Secret Detection
Auto-scan every document for sensitive data before it enters the AI pipeline. Detect SSNs, emails, medical records, API keys, and credentials. Auto-redact or flag for review.
OCR & Document Intelligence
Extract searchable text from scanned PDFs, images, and mixed-content documents. 100+ languages, RTL support, layout-aware parsing for multi-column and table-heavy files.
Full Audit Trail
Every query, retrieval, LLM prompt, and response is logged. Export deterministic audit traces showing exactly which document chunk produced each sentence.
Permission-Safe Retrieval
DMS permissions are the single source of truth. When you share or revoke access, the AI search index updates instantly. Users only see documents they're authorized to access — impossible to bypass.
Connect External Sources
Already have documents elsewhere? Connect to SharePoint, Alfresco, S3, email, file servers, and REST APIs. Index in-place — no file duplication, no data migration.
Model-Agnostic LLM Gateway
Use any LLM — OpenAI, Azure, Anthropic, local Ollama models. Configure per-knowledge-base with automatic fallback across priority tiers.
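As a rough illustration of fallback across priority tiers, the sketch below tries each provider in order and returns the first successful response. The function name and the stand-in providers are hypothetical; no real LLM client is called, and CordonData's actual gateway logic is not shown on this page.

```python
def complete_with_fallback(prompt, tiers):
    """Try each provider in priority order and return the first success.

    `tiers` is a list of (name, callable) pairs. Each callable takes the
    prompt and either returns text or raises an exception.
    """
    errors = []
    for name, call in tiers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


# Stand-in providers for the example (assumptions, not real clients).
def primary(prompt):
    raise TimeoutError("primary unavailable")


def backup(prompt):
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "hello", [("primary", primary), ("backup", backup)]
)
```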
Deploy Anywhere
On-premise, air-gapped, your own cloud (BYOC), or managed single-tenant SaaS. Docker Compose or Kubernetes. AES-256 encryption. Keycloak SSO.
// AGENT_BUILDER
Agent Builder & Admin Platform
Beyond search — CordonData includes a full agent-builder platform for creating custom AI assistants, configuring model pipelines, and managing enterprise knowledge at scale.
Custom AI Agents
Build purpose-specific AI agents with custom system prompts, tool configurations, and knowledge base assignments. Each agent can use different LLM models and retrieval strategies tailored to specific business functions.
Global Model Settings
Configure LLM, embedding, reranker, condenser, and vision models globally across all knowledge bases. Set priority tiers with automatic fallback: for example, OpenAI as the primary tier and local Ollama models as backup.
Visual Workflow Editor
Design complex AI pipelines with a drag-and-drop workflow editor. Chain together data ingestion, text extraction, chunking, embedding, retrieval, and response generation nodes — no code required.
Knowledge Base Management
Create and manage multiple knowledge bases, each with independent data sources, chunking strategies, embedding models, and ACL policies. Monitor indexing status, document counts, and sync health from a unified dashboard.
Processing Pipeline Monitor
Real-time visibility into OCR, compliance scanning, chunking, embedding, and RAG indexing pipelines. Track per-document status, retry failed documents, and monitor throughput across all connected sources.
SSO & Identity Management
Integrated Keycloak SSO with support for Active Directory, LDAP, and OIDC/SAML identity providers. Role-based access control across admin console, chat interface, and API endpoints.
// CONNECTORS
Connect to Everything
CordonData connects to your existing infrastructure through native protocol connectors. No data migration, no file duplication — just secure, in-place indexing.
CMIS
Alfresco, Documentum, FileNet, any CMIS-compliant repository
SharePoint
SharePoint Online & On-Premise via REST API with OAuth2
S3 / Object Storage
AWS S3, MinIO, Azure Blob, any S3-compatible storage
REST API
Any REST API with JSONPath-based field mapping & pagination
Email
IMAP/IMAPS, Office 365, Gmail with attachment extraction
Filesystem
Local & network file systems with recursive directory crawling
SMB / CIFS
Windows file shares and NAS devices via SMB protocol
Local Upload
Direct drag-and-drop upload into the built-in DMS
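To make the REST API connector's field mapping and pagination concrete, here is a minimal sketch. It handles only a simplified dotted subset of JSONPath (such as `$.meta.title`), and the cursor-style pagination, the function names, and the sample pages are all assumptions for illustration rather than the connector's real configuration format.

```python
def get_path(obj, path):
    """Resolve a simplified path like '$.meta.title' (dotted keys only)."""
    keys = [k for k in path.lstrip("$").split(".") if k]
    for key in keys:
        obj = obj[key]
    return obj


def crawl(fetch_page, mapping, items_path, next_path):
    """Yield mapped records from a cursor-paginated API."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        for item in get_path(page, items_path):
            yield {field: get_path(item, path) for field, path in mapping.items()}
        cursor = get_path(page, next_path)
        if cursor is None:
            break


# Two fake pages standing in for HTTP responses (hypothetical data).
PAGES = {
    None: {"data": {"items": [{"id": 1, "meta": {"title": "A"}}]}, "next": "p2"},
    "p2": {"data": {"items": [{"id": 2, "meta": {"title": "B"}}]}, "next": None},
}
records = list(
    crawl(PAGES.get, {"doc_id": "$.id", "title": "$.meta.title"}, "$.data.items", "$.next")
)
```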
Self-Registering Microservice Architecture
Connectors run as independent microservices that self-register with the admin platform on startup. Each connector sends periodic heartbeats and exposes a standard REST API for crawl, test, and status operations. Add new connectors without modifying the core platform — just deploy and register.
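The registration and heartbeat flow described above can be sketched as a small in-memory registry. The class name, the 30-second timeout, and the connector URLs are assumptions for this example; the real admin platform exposes this over REST, which is not modeled here.

```python
import time


class ConnectorRegistry:
    """Tracks connectors that register themselves and send heartbeats."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.connectors = {}  # name -> {"url": ..., "last_seen": ...}

    def register(self, name, base_url):
        self.connectors[name] = {"url": base_url, "last_seen": self.clock()}

    def heartbeat(self, name):
        if name not in self.connectors:
            raise KeyError(f"unknown connector: {name}")
        self.connectors[name]["last_seen"] = self.clock()

    def healthy(self):
        """Return names of connectors seen within the timeout window."""
        now = self.clock()
        return sorted(
            name
            for name, info in self.connectors.items()
            if now - info["last_seen"] <= self.timeout
        )


# A fake clock keeps the example deterministic.
now = [0.0]
registry = ConnectorRegistry(timeout_seconds=30.0, clock=lambda: now[0])
registry.register("sharepoint", "http://sharepoint-connector:8080")
registry.register("smb", "http://smb-connector:8080")
now[0] = 20.0
registry.heartbeat("smb")
now[0] = 40.0
alive = registry.healthy()  # sharepoint was last seen 40s ago, smb 20s ago
```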
Built from the ground up for the security, compliance, and scale requirements of the world's most demanding organizations.
Air-Gapped Ready
Zero external API calls
AES-256 Encryption
At rest & in transit
Docker / Kubernetes
Single-node to multi-AZ
Keycloak SSO
AD, LDAP, OIDC, SAML
On-Premise
Deploy entirely within your data center. Air-gapped operation with no external dependencies. Full control over infrastructure, networking, and data residency.
Bare metal or VM deployment
Docker Compose or Kubernetes
Local LLM inference via Ollama
BYOC (Bring Your Own Cloud)
Deploy inside your own AWS, Azure, or GCP environment. You maintain control of the infrastructure while we provide the software and support.
Your VPC, your security groups
Your IAM roles and policies
Your encryption keys (BYOK)
Managed Single-Tenant
Let us host it for you — in a dedicated, physically isolated environment. No shared databases, no shared indexes, no cross-tenant data leakage.
Dedicated infrastructure per customer
99.9% uptime SLA
Managed updates & monitoring
Enterprise-Grade Transparency
We built CordonData to solve the two biggest blockers for Enterprise AI adoption: Data Security and Hallucinations.
// RETRIEVAL_AUDIT_TRACE
Verifiable Retrieval Audit Trace
LLM hallucinations are unacceptable in the enterprise. CordonData provides a deterministic audit trace for every generated sentence. Instantly verify the exact document, page number, and extracted text chunk the AI used to formulate its response.
Direct links to source files in your DMS
Confidence scoring on vector matches
Exact text chunk highlighting
AI Response
Generated in 1.2s
Based on the current guidelines, the Q3 bonus pool has been increased by 15% across the APAC division [1].
Audit Trace: Reference [1]
MATCH_SCORE: 0.94
DOC: APAC_Project_Phoenix_Launch_Brief.pdf
PAGE: 12 | CHUNK: #402
"...the executive board has approved a 15% increase to the bonus pool specifically allocated for the APAC division following record sales..."
John Smith (Role: HR Director)
Query: "Q3 Layoffs"
Found 4 matching documents.
Indexing Source: Enterprise_DMS/HR_Confidential
Emma Doe (Role: Engineering Intern)
Query: "Q3 Layoffs"
No results found.
Filtered by Index Authorization Rules
// ZERO_TRUST_ACL
Permission-Safe Retrieval Routing
A search engine is only as safe as its weakest access control. While Keycloak handles seamless identity authentication, CordonData’s native authorization engine takes over at the data layer. When a user queries the system, the vector space is dynamically filtered by cross-referencing their username, group, and authority directly against the indexed document metadata.
Secure authentication via Keycloak/Active Directory
Index-level authorization (User/Group matching)
Impossible to bypass via prompt injection
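A minimal sketch of index-level authorization: each indexed chunk carries its allowed users and groups, and the filter runs before ranking, so the text of the query never influences what is visible. The `Chunk` fields, group names, and scores are assumptions for illustration, not CordonData's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    doc: str
    score: float  # similarity to the query, assumed precomputed
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def authorized_search(chunks, username, groups, top_k=5):
    """Drop chunks the caller may not read, then rank what remains."""
    visible = [
        c
        for c in chunks
        if username in c.allowed_users or c.allowed_groups & set(groups)
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


index = [
    Chunk("HR_Confidential/q3_plan.pdf", 0.91, allowed_groups={"hr-directors"}),
    Chunk("handbook.pdf", 0.40, allowed_groups={"all-staff"}),
]
john = authorized_search(index, "john.smith", ["hr-directors", "all-staff"])
emma = authorized_search(index, "emma.doe", ["all-staff"])
```

Because the restricted chunk is removed before any text reaches the LLM, a crafted prompt has nothing to extract.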
// NATIVE_DMS
Built-in Document Management System
CordonData includes a full-featured, enterprise-grade DMS — not just a file picker. Upload, organize, version, share, annotate, and collaborate on documents with fine-grained access control, all within your secure infrastructure.
File Upload & Organization
Upload files up to 500MB each via drag-and-drop or folder upload. Organize with nested folders (up to 50 levels deep, 100K+ files per folder). All files encrypted at rest with AES-256-GCM in MinIO object storage.
Version Control
Upload new versions of any file while preserving full history. View, download, promote, or archive previous versions. Each version is independently tracked with upload timestamps and version labels.
Granular Sharing & Permissions
Share files and folders with specific users at VIEW, EDIT, or DELETE permission levels. Folder sharing cascades to all children. Revoke access instantly — removed users immediately lose search visibility and content access.
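Cascading folder shares can be sketched as a lookup over a path and its ancestors. The example assumes the three levels are ordered VIEW < EDIT < DELETE and that the strongest applicable grant wins; neither rule is stated on this page, and the paths and user name are hypothetical.

```python
from pathlib import PurePosixPath

LEVELS = {"VIEW": 1, "EDIT": 2, "DELETE": 3}  # assumed ordering


def effective_permission(shares, user, path):
    """Return the strongest level granted to `user` on `path` or any ancestor."""
    p = PurePosixPath(path)
    best = None
    for node in [p, *p.parents]:
        level = shares.get((user, str(node)))
        if level and (best is None or LEVELS[level] > LEVELS[best]):
            best = level
    return best


# Shares keyed by (user, path); a folder share applies to everything beneath it.
shares = {
    ("emma", "/projects"): "VIEW",
    ("emma", "/projects/phoenix/brief.pdf"): "EDIT",
}
inherited = effective_permission(shares, "emma", "/projects/other.txt")
direct = effective_permission(shares, "emma", "/projects/phoenix/brief.pdf")
outside = effective_permission(shares, "emma", "/finance/q3.xlsx")
```

Deleting the `/projects` entry from `shares` removes access to every file that relied on it, which is the revocation behavior described above.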
Comments & Collaboration
Leave comments on any file or folder. Threaded discussions keep context with the document. Comments respect the same permission model — only authorized users can view or modify them.
PDF Annotation & Redaction
Annotate PDFs directly in the browser with highlights, text notes, sticky notes, stamps, and permanent redactions. Annotations are burned into a new version and saved back to the DMS with full version history.
Public Share Links
Generate password-protected public share links for external stakeholders. Set expiration dates, enforce password complexity, and revoke links at any time. Public access is isolated from internal permissions.
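A password-protected, expiring share link might be modeled as below. This is a sketch using only the Python standard library; the record layout, the PBKDF2 iteration count, and the function names are assumptions, and password complexity checks are omitted.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

ITERATIONS = 100_000  # assumed work factor for this example


def create_link(password, ttl_hours, now):
    """Create a share-link record with a random token and a salted password hash."""
    salt = secrets.token_bytes(16)
    return {
        "token": secrets.token_urlsafe(32),
        "salt": salt,
        "hash": hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS),
        "expires": now + timedelta(hours=ttl_hours),
        "revoked": False,
    }


def check_access(link, password, now):
    """Allow access only if the link is live and the password matches."""
    if link["revoked"] or now >= link["expires"]:
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), link["salt"], ITERATIONS
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, link["hash"])


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
link = create_link("correct horse battery", ttl_hours=24, now=issued)
```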
Complete Document Lifecycle Management
Create & Upload
Drag-and-drop, folder upload, bulk operations
Move & Copy
Bulk move/copy with async subtree jobs for large folders
Trash & Restore
Soft-delete with restore. Permanent delete with async cleanup
Every file uploaded to the DMS is automatically processed through the full AI pipeline: OCR → compliance scanning → chunking → embedding → vector indexing. Documents appear in the "My Library" knowledge base and are instantly searchable via the chat interface.
Automatic OCR for scanned PDFs and images
PII/NHI/secret scanning before indexing
Real-time RAG status visibility per document
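The pipeline order and per-document status reporting can be sketched as a loop over named stages that stops at the first failure. Stage names follow the sequence above; the handler functions are placeholders and the status values (`INDEXED`, `FAILED`) are assumptions.

```python
STAGES = ["ocr", "compliance_scan", "chunking", "embedding", "vector_indexing"]


def process(document, handlers):
    """Run each stage in order; stop at the first failure and report where."""
    state = document
    for stage in STAGES:
        try:
            state = handlers[stage](state)
        except Exception as exc:
            return {"status": "FAILED", "stage": stage, "error": str(exc)}
    return {"status": "INDEXED", "stage": STAGES[-1], "error": None}


def fail_embedding(_state):
    raise RuntimeError("embedding model unreachable")


# Placeholder handlers that pass the document through unchanged.
ok_handlers = {stage: (lambda state: state) for stage in STAGES}
broken_handlers = {**ok_handlers, "embedding": fail_embedding}

ok_status = process("scan.pdf", ok_handlers)
failed_status = process("scan.pdf", broken_handlers)
```

Recording the failing stage is what makes a targeted retry possible without reprocessing the whole document.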
Permission-Safe by Design
DMS permissions are the single source of truth for AI access control. When a document is shared or revoked, the vector index updates immediately. Users searching via chat only see results from documents they have explicit permission to access.
ACL metadata stored alongside vector embeddings
Instant permission revocation propagates to search
Ask AI never leaks content across permission boundaries
// COMPLIANCE_ENGINE
Automated PII, NHI & Secret Detection
Before any document enters your AI pipeline, CordonData scans, classifies, and redacts sensitive data — ensuring compliance with GDPR, HIPAA, PCI-DSS, and internal data governance policies.
PII Detection
Automatically identify and classify Personally Identifiable Information across all ingested documents — names, addresses, phone numbers, email addresses, social security numbers, passport numbers, driver's license IDs, and more.
SSN, Email, Phone, Passport, DL#, DOB
NHI Detection
Detect Non-public Health Information and protected health data — medical record numbers, health insurance IDs, patient identifiers, diagnosis codes, and clinical trial data — ensuring HIPAA compliance.
MRN, HIPAA, ICD-10, HITECH
Secret & Credential Detection
Scan for leaked API keys, access tokens, database connection strings, private keys, AWS/Azure/GCP credentials, and other secrets accidentally embedded in documents before they reach the AI model.
API Key, Token, ConnStr, PEM
How Compliance Scanning Works
1. Ingest: Document enters the pipeline from any connected source
2. Scan & Classify: Regex + ML models detect PII, NHI, and secrets with confidence scoring
3. Redact or Flag: Auto-redact sensitive spans or flag for manual review based on policy
4. Index Safely: Only sanitized content enters the vector index and LLM context window
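The regex half of the scan-and-redact steps can be sketched as follows. The two patterns are deliberately simple examples (US SSN format and a basic email shape), not CordonData's production rules, and the ML-based detection and confidence scoring mentioned above are not modeled.

```python
import re

# Illustrative patterns only; real detectors need far more coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scan(text):
    """Return (start, end, label) for every pattern match, in document order."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((match.start(), match.end(), label))
    return sorted(findings)


def redact(text, findings):
    """Replace each finding with a labeled placeholder.

    Works from the end of the text so earlier offsets stay valid.
    Assumes findings do not overlap.
    """
    for start, end, label in sorted(findings, reverse=True):
        text = text[:start] + f"[REDACTED:{label}]" + text[end:]
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
findings = scan(sample)
clean = redact(sample, findings)
```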
// DOCUMENT_INTELLIGENCE
Advanced OCR & Document Intelligence
CordonData extracts structured, searchable text from any document format — scanned PDFs, images, handwritten notes, and complex multi-column layouts — using state-of-the-art OCR and document understanding models.
Scanned PDF OCR
Convert image-based PDFs into fully searchable text. Supports multi-page documents, mixed content (text + images), and RTL languages including Arabic and Hebrew.
Image Text Extraction
Extract text from PNG, JPEG, TIFF, and other image formats. Handles low-resolution scans, skewed documents, and complex backgrounds with high accuracy.
Layout-Aware Parsing
Understands multi-column layouts, tables, headers, footnotes, and callout boxes. Preserves reading order and document structure for accurate chunking.
Multilingual OCR
Supports 100+ languages including CJK (Chinese, Japanese, Korean), Arabic, Cyrillic, and Indic scripts. Automatic language detection for mixed-language documents.
Supported Document Formats
PDF (Scanned & Native)
DOCX / DOC
PPTX / PPT
XLSX / XLS
PNG / JPEG / TIFF
HTML / XML
Markdown / Plain Text
EML / MSG (Email)
Built for Regulated Enterprises
CordonData is purpose-built for industries where data security, compliance, and auditability are non-negotiable.
Healthcare & Life Sciences
HIPAA-compliant AI search across clinical notes, research papers, trial data, and patient records. Built-in PHI detection and redaction ensure protected health information never leaks into AI prompts or vector indexes.
Financial Services
PCI-DSS and SOX-compliant document intelligence. Search across trade confirmations, compliance reports, and internal policies with full audit traceability and PII/secret redaction.
Legal & Compliance
Search across case files, contracts, regulatory filings, and e-discovery documents. Every AI-generated answer is backed by a deterministic citation trail to the exact source paragraph.
Government & Defense
Air-gapped, fully on-premise deployment. No external API calls. Classified document handling with role-based access control at the vector index level. FedRAMP-ready architecture.
Manufacturing & Engineering
Search across technical specifications, CAD documentation, SOPs, and maintenance logs. Connect to SharePoint, network file shares, and legacy DMS without migrating data.
Education & Research
AI-powered research across academic papers, grant proposals, and institutional repositories. Respects copyright and access restrictions with document-level permission enforcement.
// ARCHITECTURE
How CordonData Works
A modular, self-hosted platform that connects to your existing infrastructure. Documents stay in place — CordonData indexes and makes them AI-searchable.
Self-Hosted
Runs entirely within your infrastructure — bare metal, VMs, or Kubernetes. No data ever leaves your network. 19 containerized services orchestrated via Docker Compose.
Modular & Swappable
Swap any component: embedding model (OpenAI, Ollama, Cohere), LLM (GPT, Claude, Granite, Qwen), vector DB (OpenSearch, pgvector), or OCR engine.
Permission-Safe by Design
ACLs from source systems (SharePoint, Alfresco, file servers) are preserved and enforced at query time. Users only see documents they have permission to access.
Frequently Asked Questions
Everything you need to know about CordonData's enterprise AI platform.
What makes CordonData different from other enterprise AI search tools?
CordonData is the only platform that combines on-premise RAG, document-level permission enforcement, automated PII/NHI/secret redaction, and advanced OCR in a single self-hosted package. Unlike cloud-only solutions, your data never leaves your infrastructure. Unlike simple RAG wrappers, we provide native connectors to your existing DMS, full audit traceability, and zero-trust retrieval routing.
Can CordonData run completely air-gapped?
Yes. CordonData is designed for air-gapped, offline deployments. You can run the entire stack — OCR, embedding, vector search, LLM inference, and SSO — entirely within your secure network with no external API calls. We support local LLM inference via Ollama and other self-hosted model runtimes.
How does document-level security work?
When documents are indexed, their ACL metadata (owner, group, permissions) is stored alongside the vector embeddings. At query time, the user's identity — authenticated via Keycloak or Active Directory — is cross-referenced against this metadata. The vector search space is dynamically filtered so users only see results from documents they have permission to access. This happens at the index level, making it impossible to bypass via prompt injection.
What document formats and languages do you support?
We support PDF (scanned and native), DOCX, PPTX, XLSX, PNG, JPEG, TIFF, HTML, Markdown, plain text, and email formats (EML/MSG). Our OCR engine supports 100+ languages including CJK, Arabic, Cyrillic, and Indic scripts. We also handle RTL (right-to-left) languages with proper text layer alignment.
How does the PII and secret detection work?
Before any document content enters the vector index or LLM context window, it passes through our compliance scanning pipeline. We use a combination of regex patterns, ML-based named entity recognition, and entropy-based secret detection to identify PII (SSN, email, phone, passport, etc.), NHI (medical records, health IDs), and secrets (API keys, tokens, connection strings). Detected spans can be automatically redacted or flagged for manual review based on your policy configuration.
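Entropy-based secret detection typically flags long tokens whose characters are close to uniformly distributed. The sketch below computes Shannon entropy in bits per character; the minimum length of 20, the threshold of 4.0, and the sample token are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_secret(token, min_length=20, threshold=4.0):
    """Flag long, high-entropy tokens as candidate secrets for review."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold


candidate = "q7Zx2LmP9vT4bK8sW1nR6cY3"  # made-up token, not a real credential
flagged = looks_like_secret(candidate)
ordinary = looks_like_secret("passwordpasswordpassword")
```

Entropy alone produces false positives (hashes, UUIDs) and misses low-entropy passwords, which is why the answer above describes it as one signal combined with regex patterns and NER.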
Can I use my own LLM or embedding model?
Absolutely. CordonData is model-agnostic. You can use OpenAI, Azure OpenAI, Anthropic, local models via Ollama, or any OpenAI-compatible API. The embedding model, reranker, and chat model are all configurable per knowledge base. You maintain full control over which models process your data.
How do I get started?
Join our waitlist or apply for the Design Partner Program. Design partners get white-glove onboarding, direct access to our engineering team, and a lifetime pricing lock. We're looking for forward-thinking enterprises to help us stress-test the platform before the stable 1.0 release.
Build With Us: The Design Partner Program
We will soon launch a stable 1.0 release. We are looking for 3 forward-thinking enterprises to help us stress-test our advanced document extraction and hybrid search indexing pipelines.
What to Expect (v0.8)
Early Access to Core Features: The foundational RAG engine is operational. You'll help us polish the UI and refine edge cases before the public launch.
Collaborative Feedback: Your insights are invaluable. We'll work closely with your team to optimize connector reliability and the overall user experience.
Safe Sandbox Deployment: To ensure zero risk to production data, we ask that you provide a dedicated test environment or mock dataset for our initial connection.
The Benefits
White-Glove Onboarding: Direct installation and identity provider setup by our founding engineering team.
Roadmap Influence: Your feature requests get bumped to the front of the dev queue.
Lifetime Pricing Lock: Design partners secure an exclusive, heavily discounted licensing rate in perpetuity.
Join the Waitlist
Be the first to know when CordonData 1.0 is stable and ready for enterprise deployment.