Web Crawling for Knowledge Base
Import content directly from websites into your knowledge base. Crawl single pages, multiple URLs, or entire websites using sitemaps.
Start crawling: Open your Knowledge Base, select a knowledge base, and click Add Web Content.
Looking to upload files instead? See Upload Documents.
Why Crawl Websites?
| File Uploads | Web Crawling |
|---|---|
| Manual upload required | Automatic import |
| Point-in-time content | Can re-crawl for updates |
| Best for internal docs | Best for public content |
| Local files | Live websites |
Ideal for:
- FAQ pages that update frequently
- Product documentation websites
- Help center articles
- Blog posts and guides
- Pricing pages
Crawl Methods
Single Page
Import one specific page:
https://example.com/faq
Use when:
- You need a specific page
- Testing before full crawl
- Content is on one page
Sitemap Crawl
Import multiple pages from a sitemap:
https://example.com/sitemap.xml
Use when:
- Importing entire website sections
- Bulk importing many pages
- Content is well-organized
Step-by-Step Guide
1. Open Add Web Content
- Go to Knowledge Base in the sidebar
- Click on your knowledge base
- Find the Add Web Content tab (or expand the upload section)
2. Enter URL
Enter a URL or sitemap URL:
| Input | What Happens |
|---|---|
| https://example.com/faq | Crawls single page |
| https://example.com/sitemap.xml | Parses sitemap, shows pages |
| https://example.com | Attempts to find sitemap |
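The input handling above can be sketched as a simple heuristic. This is an illustrative guess at the classification logic, not the platform's actual implementation; `classify_input` is a hypothetical name:

```python
from urllib.parse import urlparse

def classify_input(url: str) -> str:
    """Guess how a crawler might treat an input URL (illustrative only)."""
    path = urlparse(url).path
    if path.endswith((".xml", ".txt")) and "sitemap" in path.lower():
        return "sitemap"        # parse sitemap, show discovered pages
    if path in ("", "/"):
        return "root"           # attempt to find a sitemap first
    return "single-page"        # crawl just this page

print(classify_input("https://example.com/faq"))          # single-page
print(classify_input("https://example.com/sitemap.xml"))  # sitemap
print(classify_input("https://example.com"))              # root
```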
3. Configure Options (Sitemap)
For sitemap crawls, configure:
| Option | Default | Description |
|---|---|---|
| Max Pages | 50 | Maximum pages to crawl |
| Crawl Delay | 2 seconds | Delay between requests |
| Respect robots.txt | Yes | Honor site's crawl rules |
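A polite crawl loop that honors all three options might look like the sketch below. It uses Python's standard-library `urllib.robotparser`; the `crawl` function and its injected `fetch` callable are hypothetical, kept network-free so the structure is clear:

```python
import time
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a rule checker (stdlib)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def crawl(urls, fetch, rules, max_pages=50, delay=2.0):
    """Fetch up to max_pages URLs, skipping those robots.txt disallows
    and pausing `delay` seconds between requests."""
    results = {}
    for url in urls[:max_pages]:
        if not rules.can_fetch("*", url):
            continue  # Respect robots.txt: Yes
        results[url] = fetch(url)
        time.sleep(delay)
    return results

rules = build_robot_rules("User-agent: *\nDisallow: /private/")
pages = crawl(
    ["https://example.com/faq", "https://example.com/private/admin"],
    fetch=lambda u: f"<html>{u}</html>",  # stub fetcher for the example
    rules=rules,
    delay=0,  # no pause needed for the stub
)
print(list(pages))  # only /faq survives the robots check
```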
4. Preview & Select Pages
For sitemaps, you'll see a list of discovered URLs:
✓ /faq FAQ Page
✓ /pricing Pricing
✓ /features Features
□ /blog/post-1 Blog Post 1
□ /blog/post-2 Blog Post 2
Select which pages to import (all are selected by default, up to max pages).
5. Start Crawling
Click Start Crawling to begin:
- Crawling: Pages are fetched and content extracted
- Processing: Text is chunked and embedded
- Indexing: Vectors stored in Pinecone
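The "Processing" step above splits extracted text into chunks before embedding. A common approach is overlapping character windows; the sizes below are illustrative defaults, not the platform's actual chunking parameters:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Split text into overlapping character windows for embedding.
    Overlap preserves context across chunk boundaries."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("a" * 1200, size=500, overlap=50)
print(len(chunks))  # 3 windows: 0-500, 450-950, 900-1200
```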
6. Monitor Progress
Watch the crawl progress:
Crawling: 15/50 pages
├── ✓ Completed: 12
├── ⏳ Processing: 2
└── ✗ Failed: 1
Content Extraction
The crawler extracts:
| Extracted | Not Extracted |
|---|---|
| Page title | Images |
| Main content | Videos |
| Headings (H1-H6) | Scripts |
| Paragraphs | Styles |
| Lists | Navigation menus |
| Tables | Footers |
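The extract/skip split in the table above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration of the idea, not the crawler's real extractor:

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer"}  # tags the crawler drops

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer subtrees."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

html = "<h1>FAQ</h1><p>Answers here.</p><script>var x=1;</script><nav>Home</nav>"
p = TextExtractor()
p.feed(html)
print(" ".join(p.parts))  # FAQ Answers here.
```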
Extraction Quality
Best extraction:
- Clean HTML with semantic markup
- Blog posts and articles
- Documentation pages
- FAQ pages
Challenging extraction:
- Single-page apps (JavaScript rendered)
- Pages with mostly images
- Login-protected content
- Dynamically loaded content
Sitemap Requirements
Supported Formats
| Format | Extension | Example |
|---|---|---|
| XML Sitemap | .xml | Standard format |
| Sitemap Index | .xml | Links to other sitemaps |
| Text Sitemap | .txt | One URL per line |
Example Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/faq</loc>
<lastmod>2024-01-15</lastmod>
</url>
<url>
<loc>https://example.com/pricing</loc>
<lastmod>2024-01-10</lastmod>
</url>
</urlset>
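A sitemap like the one above can be parsed with Python's standard-library `xml.etree.ElementTree`; note the sitemap namespace must be passed explicitly. This sketch shows the idea, not the platform's parser:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str):
    """Extract <loc> URLs from a standard XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/faq</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
print(sitemap_urls(sitemap))
# ['https://example.com/faq', 'https://example.com/pricing']
```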
Finding Your Sitemap
Common locations:
- https://yoursite.com/sitemap.xml
- https://yoursite.com/sitemap_index.xml
- Check robots.txt for the sitemap location
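Many sites advertise their sitemap in a `Sitemap:` line inside robots.txt. A small sketch for pulling those lines out (`sitemaps_from_robots` is a hypothetical helper, not part of the platform):

```python
def sitemaps_from_robots(robots_txt: str):
    """Return sitemap URLs declared in a robots.txt body.
    The Sitemap directive is matched case-insensitively."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

robots = "User-agent: *\nDisallow: /private/\nSitemap: https://yoursite.com/sitemap.xml"
print(sitemaps_from_robots(robots))  # ['https://yoursite.com/sitemap.xml']
```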
Document Status
Each crawled page shows a status:
| Status | Icon | Meaning | Action |
|---|---|---|---|
| Pending | Clock | Queued | Wait |
| Processing | Spinner | Being processed | Wait |
| Completed | Checkmark | Ready for retrieval | None |
| Failed | X | Crawl or processing error | Retry |
| Stuck | Warning | Processing timed out | Retry |
Retry Failed Documents
If a crawl fails:
- Click the Retry button next to the failed document
- The system re-crawls the URL
- The content is then re-processed and re-indexed
Common failure reasons:
- Temporary network issues
- Rate limiting by the website
- Page structure changed
- Embedding API timeout
See Troubleshooting for more details.
Re-Crawling for Updates
To update crawled content:
- Delete the existing URL documents
- Re-crawl the same URLs
Tip: Keep track of which URLs you've crawled. The document list shows the source URL for each entry.
Best Practices
Start Small
- Crawl a single page first
- Verify content extraction quality
- Then crawl more pages
Respect Rate Limits
- Use the default 2-second delay
- Don't crawl too many pages at once
- Respect robots.txt rules
Select Relevant Pages
Don't import everything. Focus on:
- FAQ and support pages
- Product documentation
- Policy pages
- Key landing pages
Avoid:
- Blog archives (hundreds of old posts)
- Image galleries
- Login/account pages
- Terms and privacy (unless needed)
Monitor Quality
After crawling:
- Check document status for failures
- Test RAG retrieval in the agent
- Remove low-quality extractions
Limits & Quotas
| Resource | Limit |
|---|---|
| Max pages per crawl | 50 |
| Max total documents | 100 per KB |
| Crawl delay | Minimum 1 second |
| Request timeout | 30 seconds per page |
Troubleshooting
"Failed to crawl URL"
Possible causes:
- Website blocks crawlers (check robots.txt)
- Page requires authentication
- Invalid URL format
- Network timeout
Solutions:
- Verify URL is accessible in browser
- Check if site allows crawling
- Try single page instead of sitemap
"No content extracted"
Possible causes:
- JavaScript-rendered content
- Page has no text content
- Content behind login
Solutions:
- Try a different page
- Use file upload for this content
- Check page loads without JavaScript
Stuck documents
Documents stuck in "Processing" for more than 10 minutes can be retried:
- Look for the yellow "Stuck" badge
- Click Retry to re-process
See Troubleshooting for more solutions.
Use Cases
Import Company FAQ
Sitemap: https://company.com/sitemap.xml
Selected pages:
- /faq
- /help/shipping
- /help/returns
- /help/payment
Import Product Docs
Sitemap: https://docs.company.com/sitemap.xml
Max pages: 50
Selected: All pages in /product-guide/
Import Knowledge Center
Single URL: https://support.company.com/knowledge-base
Then selectively crawl individual articles
Next Steps
- Document Processing - How content is indexed
- RAG Integration - Configure retrieval
- Troubleshooting - Fix crawl issues
- Upload Documents - Import local files
Ready to import web content? Open the Edesy Platform to start crawling.