Document Processing Pipeline
Learn how your documents are transformed from files and web pages into searchable knowledge for your AI voice agent.
View status: Open your Knowledge Base to see document processing status and chunk counts.
Processing Overview
When you upload a document or crawl a URL, it goes through this pipeline:
Input
 ├── PDF/DOCX Upload ─────┐
 └── URL Crawl ───────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │     Parsing     │  Extract text from source
                  └────────┬────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │    Chunking     │  Split into ~500-token pieces
                  └────────┬────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │    Embedding    │  Convert to vectors (OpenAI)
                  └────────┬────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │    Indexing     │  Store in Pinecone
                  └─────────────────┘
Pipeline Stages
1. Parsing
Purpose: Extract raw text from the source.
| Source | Parser | Output |
|---|---|---|
| PDF | PDF.js / pdfparse | Plain text |
| DOCX | mammoth | Plain text + structure |
| TXT/MD | Direct read | Plain text |
| HTML (URL) | Cheerio | Cleaned text |
What's extracted:
- Main content text
- Headings and structure
- Lists and tables (as text)
What's ignored:
- Images (no OCR currently)
- JavaScript/CSS
- Navigation elements
- Headers/footers
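The HTML cleanup step can be sketched as follows. This is a simplified, regex-based illustration of what gets kept and what gets dropped; the real pipeline uses Cheerio, and the `cleanHtml` name and sample page are invented for this example.

```typescript
// Simplified sketch of HTML cleanup (the production parser uses Cheerio).
// Scripts, styles, and navigation/header/footer elements are dropped;
// the text content of everything else is kept.
function cleanHtml(html: string): string {
  return html
    // Remove whole elements the pipeline ignores
    .replace(/<(script|style|nav|header|footer)[\s\S]*?<\/\1>/gi, "")
    // Strip remaining tags, keeping their text content
    .replace(/<[^>]+>/g, " ")
    // Collapse whitespace
    .replace(/\s+/g, " ")
    .trim();
}

const page = `<html><head><style>p{color:red}</style></head>
<body><nav>Home | About</nav><h1>Return Policy</h1>
<p>Returns accepted within 30 days.</p>
<script>trackPageView()</script></body></html>`;

console.log(cleanHtml(page)); // "Return Policy Returns accepted within 30 days."
```

Note that only the heading and paragraph text survive; the nav links, stylesheet, and tracking script are gone.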
2. Chunking
Purpose: Split text into smaller, retrievable pieces.
| Parameter | Value | Description |
|---|---|---|
| Chunk Size | ~500 tokens | Approximate size per chunk |
| Overlap | ~50 tokens | Context preserved between chunks |
| Boundary | Sentences | Chunks end at sentence boundaries |
Why chunk?
- LLM context limits: Entire documents won't fit in the LLM's context window
- Precision: Smaller chunks = more targeted retrieval
- Relevance: Retrieve only relevant sections
Example chunking:
Original document (1500 words):
┌────────────────────────────────────────┐
│ Section 1: Introduction (300 words)    │
│ Section 2: Features (500 words)        │
│ Section 3: Pricing (400 words)         │
│ Section 4: FAQ (300 words)             │
└────────────────────────────────────────┘
After chunking (4 chunks):
┌────────────────┐  ┌────────────────┐
│ Chunk 1        │  │ Chunk 2        │
│ Intro + start  │  │ Features       │
│ of Features    │  │ continued     │
└────────────────┘  └────────────────┘
┌────────────────┐  ┌────────────────┐
│ Chunk 3        │  │ Chunk 4        │
│ Pricing        │  │ FAQ            │
└────────────────┘  └────────────────┘
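The chunking step above can be sketched as a simple loop: accumulate sentences until the chunk budget is reached, then carry the last ~50 tokens forward as overlap. This is an illustrative sketch, not the production implementation; it approximates one word as one token for simplicity.

```typescript
// Illustrative chunker: split at sentence boundaries into ~500-token
// chunks with ~50 tokens of overlap. Tokens are approximated as words.
const CHUNK_TOKENS = 500;
const OVERLAP_TOKENS = 50;

function approxTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length; // rough: one word ≈ one token
}

function chunkText(text: string): string[] {
  // Naive sentence splitter: break after ., !, or ? followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const t = approxTokens(sentence);
    if (currentTokens + t > CHUNK_TOKENS && current.length > 0) {
      chunks.push(current.join(" "));
      // Carry trailing sentences forward until ~50 tokens of overlap
      const overlap: string[] = [];
      let overlapTokens = 0;
      for (let i = current.length - 1; i >= 0 && overlapTokens < OVERLAP_TOKENS; i--) {
        overlap.unshift(current[i]);
        overlapTokens += approxTokens(current[i]);
      }
      current = overlap;
      currentTokens = overlapTokens;
    }
    current.push(sentence);
    currentTokens += t;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Because chunks end at sentence boundaries, actual chunk sizes vary around the 500-token target rather than matching it exactly.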
3. Embedding
Purpose: Convert text chunks to vectors for semantic search.
| Setting | Value |
|---|---|
| Model | OpenAI text-embedding-3-small |
| Dimensions | 1536 |
| Batch Size | 100 chunks |
How it works:
Text: "Our return policy allows returns within 30 days"
              │
              ▼
    OpenAI Embedding API
              │
              ▼
Vector: [0.023, -0.156, 0.089, ..., 0.045]  (1536 dimensions)
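Since chunks are embedded in batches of 100, the pipeline first groups them. A minimal sketch of that batching (the actual embedding API call, one request per batch, is omitted):

```typescript
// Group items into batches of up to `batchSize`. Each batch would become
// one embeddings API request in the real pipeline.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// 250 chunks → 3 requests: 100 + 100 + 50
console.log(toBatches(Array.from({ length: 250 }, (_, i) => `chunk ${i}`)).length); // 3
```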
Why embeddings?
- Similar meaning = similar vectors
- Enables semantic search (understanding, not just keywords)
- "return policy" matches "refund rules"
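"Similar meaning = similar vectors" is measured with cosine similarity. The toy 4-dimensional vectors below are made up for illustration (real embeddings have 1536 dimensions), but the comparison works the same way:

```typescript
// Cosine similarity: 1.0 means same direction (same meaning),
// values near 0 mean unrelated content.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical embeddings: "return policy" and "refund rules" point in
// nearly the same direction; "pricing plans" points elsewhere.
const returnPolicy = [0.9, 0.1, 0.0, 0.1];
const refundRules  = [0.8, 0.2, 0.1, 0.1];
const pricingPlans = [0.1, 0.0, 0.9, 0.2];

console.log(cosineSimilarity(returnPolicy, refundRules));  // high (~0.98)
console.log(cosineSimilarity(returnPolicy, pricingPlans)); // low  (~0.13)
```

This is why a question about "refund rules" retrieves the return-policy chunk even though the words don't match.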
4. Indexing
Purpose: Store vectors for fast similarity search.
| Setting | Value |
|---|---|
| Database | Pinecone |
| Namespace | Per knowledge base |
| Metadata | Document ID, text, source |
Stored metadata per chunk:
{
  "documentId": "doc_abc123",
  "documentName": "Return Policy",
  "knowledgeBaseId": "kb_xyz789",
  "chunkIndex": 0,
  "text": "Our return policy allows...",
  "wordCount": 150,
  "sourceType": "file",
  "sourceUrl": null
}
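A chunk plus its embedding could be packaged into a vector record like this. The metadata fields mirror the example above; the `toRecord` helper and the ID scheme are illustrative (the actual upsert via the Pinecone client is omitted):

```typescript
// Sketch: package one chunk + embedding as a vector record for indexing.
interface ChunkRecord {
  id: string;
  values: number[]; // 1536-dimensional embedding
  metadata: {
    documentId: string;
    documentName: string;
    knowledgeBaseId: string;
    chunkIndex: number;
    text: string;
    wordCount: number;
    sourceType: "file" | "url";
    sourceUrl: string | null;
  };
}

function toRecord(
  documentId: string,
  documentName: string,
  knowledgeBaseId: string,
  chunkIndex: number,
  text: string,
  embedding: number[],
): ChunkRecord {
  return {
    // Deterministic ID so re-processing a document overwrites old vectors
    id: `${documentId}#${chunkIndex}`,
    values: embedding,
    metadata: {
      documentId,
      documentName,
      knowledgeBaseId,
      chunkIndex,
      text,
      wordCount: text.split(/\s+/).filter(Boolean).length,
      // Fixed to the file-upload case in this sketch
      sourceType: "file",
      sourceUrl: null,
    },
  };
}
```

Storing the chunk text in metadata is what lets retrieval return readable passages, not just vector IDs.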
Document Status
Each document displays a status badge:
| Status | Icon | Description | Duration |
|---|---|---|---|
| Pending | Clock | Queued for processing | Seconds |
| Processing | Spinner | Pipeline in progress | Seconds to minutes |
| Completed | Checkmark | Ready for retrieval | - |
| Failed | X | Error occurred | - |
| Stuck | Warning | Processing timed out (>10 min) | - |
Status Flow
PENDING ──► PROCESSING ──► COMPLETED
                │
                ├──► FAILED (error)
                │
                └──► STUCK (timeout)
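The status flow above can be sketched as an allowed-transitions map. The status names match the badges in the table; the retry transitions (FAILED/STUCK back to PENDING) are an assumption based on the Retry behavior described below, not a confirmed implementation detail:

```typescript
// Illustrative state machine for document status.
type Status = "PENDING" | "PROCESSING" | "COMPLETED" | "FAILED" | "STUCK";

const TRANSITIONS: Record<Status, Status[]> = {
  PENDING: ["PROCESSING"],
  PROCESSING: ["COMPLETED", "FAILED", "STUCK"],
  // Assumed: Retry re-queues failed or stuck documents
  FAILED: ["PENDING"],
  STUCK: ["PENDING"],
  COMPLETED: [],
};

function canTransition(from: Status, to: Status): boolean {
  return TRANSITIONS[from].includes(to);
}
```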
Handling Failed Documents
When a document fails:
- View error: Click the info icon to see the error message
- Fix issue: Address the cause (file format, content, etc.)
- Retry: Click the "Retry" button
Common failures:
- Invalid file format
- Empty content
- Embedding API timeout
- Network issues
See Troubleshooting for solutions.
Handling Stuck Documents
Documents stuck in "Processing" for over 10 minutes are marked as "Stuck":
- Identify: Look for yellow "Stuck" badge
- Retry: Click the "Retry" button
- Monitor: Watch for successful completion
Why documents get stuck:
- Server restart during processing
- Embedding API timeout
- Database connection lost
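The 10-minute stuck check amounts to a simple elapsed-time test. The field names (`status`, `processingStartedAt`) are illustrative, not the actual schema:

```typescript
// Sketch: flag documents still PROCESSING past the 10-minute timeout.
const STUCK_TIMEOUT_MS = 10 * 60 * 1000;

interface DocumentRow {
  status: string;
  processingStartedAt: number; // epoch milliseconds
}

function isStuck(doc: DocumentRow, now: number = Date.now()): boolean {
  return doc.status === "PROCESSING" && now - doc.processingStartedAt > STUCK_TIMEOUT_MS;
}
```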
Retrieval Process
When your agent needs information during a call:
User: "What's your return policy?"
                │
                ▼
          ┌───────────┐
          │   Embed   │  Convert question to vector
          │   Query   │
          └─────┬─────┘
                │
                ▼
          ┌───────────┐
          │  Vector   │  Find similar chunks in Pinecone
          │  Search   │
          └─────┬─────┘
                │
                ▼
          ┌───────────┐
          │  Rank &   │  Apply minScore filter,
          │  Filter   │  return top K results
          └─────┬─────┘
                │
                ▼
          ┌───────────┐
          │    LLM    │  Generate response using
          │ Response  │  retrieved context
          └───────────┘
Configuration options:
| Setting | Default | Description |
|---|---|---|
| topK | 5 | Number of chunks to retrieve |
| minScore | 0.7 | Minimum relevance score (0-1) |
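The Rank & Filter step with these two settings reduces to a short function. The `Match` shape is illustrative; the real vector store returns its own result format:

```typescript
// Sketch of rank-and-filter: drop matches below minScore,
// then keep the topK highest-scoring chunks.
interface Match {
  text: string;
  score: number; // similarity reported by the vector store, 0-1
}

function rankAndFilter(matches: Match[], topK = 5, minScore = 0.7): Match[] {
  return matches
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Raising minScore trades recall for precision: fewer chunks reach the LLM, but each is more likely to be relevant.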
Learn more: RAG Integration
Processing Time
Typical processing times:
| Document Type | Size | Processing Time |
|---|---|---|
| Small PDF | <1 MB | 5-15 seconds |
| Large PDF | 1-10 MB | 30-90 seconds |
| Single URL | - | 10-30 seconds |
| Sitemap (10 pages) | - | 2-5 minutes |
| Sitemap (50 pages) | - | 10-15 minutes |
Factors affecting time:
- File size and complexity
- Number of pages/chunks
- Current system load
- Embedding API response time
Monitoring Processing
Document List View
In your knowledge base, see:
- Document name and source
- Processing status
- Chunk count (when completed)
- Error messages (when failed)
Chunk Count
The chunk count indicates how many pieces your document was split into:
| Chunk Count | Document Size |
|---|---|
| 1-5 | Small document |
| 6-20 | Medium document |
| 21-50 | Large document |
| 51+ | Very large document |
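A rough chunk-count estimate follows from the chunking parameters: ~500-token chunks with ~50-token overlap give an effective stride of ~450 tokens per additional chunk. Treating one word as roughly one token (a simplification), the sketch below matches the 1500-word example above landing at 4 chunks:

```typescript
// Rough estimate only: actual counts vary with sentence boundaries
// and the real tokenizer.
function estimateChunks(wordCount: number): number {
  const tokens = wordCount; // simplification: one word ≈ one token
  if (tokens <= 500) return 1;
  return 1 + Math.ceil((tokens - 500) / 450);
}

console.log(estimateChunks(300));  // 1
console.log(estimateChunks(1500)); // 4
```

If a document's chunk count is far off from this estimate, the parse may have dropped or duplicated content.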
Error Details
For failed documents:
- Click the info icon
- View full error message
- See processing metadata
- Check source URL (for crawled content)
Best Practices
For Better Chunking
- Use clear headings (H1, H2, H3)
- Write in paragraphs, not walls of text
- Keep related content together
- Avoid very long sentences
For Better Retrieval
- One topic per document
- Use specific, descriptive titles
- Include common question phrases
- Avoid abbreviations without definitions
For Monitoring
- Check processing status after uploads
- Retry failed documents promptly
- Review chunk counts for expected values
- Test retrieval with sample questions
Next Steps
- RAG Integration - Configure retrieval settings
- Troubleshooting - Fix processing issues
- Upload Documents - Import files
- Web Crawling - Import web content
Questions about processing? Check Troubleshooting or contact support.