Web Crawling for Knowledge Base
Import content directly from websites into your knowledge base. Crawl single pages, multiple URLs, or entire websites using sitemaps.
Start crawling: Open your Knowledge Base, select a knowledge base, and click Add Web Content.
Looking to upload files instead? See Upload Documents.
Why Crawl Websites?
| File Uploads | Web Crawling |
|---|---|
| Manual upload required | Automatic import |
| Point-in-time content | Can re-crawl for updates |
| Best for internal docs | Best for public content |
| Local files | Live websites |
Ideal for:
- FAQ pages that update frequently
- Product documentation websites
- Help center articles
- Blog posts and guides
- Pricing pages
Crawl Methods
Single Page
Import one specific page:
https://example.com/faq
Use when:
- You need a specific page
- Testing before full crawl
- Content is on one page
Sitemap Crawl
Import multiple pages from a sitemap:
https://example.com/sitemap.xml
Use when:
- Importing entire website sections
- Bulk importing many pages
- Content is well-organized
Step-by-Step Guide
1. Open Add Web Content
- Go to Knowledge Base in the sidebar
- Click on your knowledge base
- Find the Add Web Content tab (or expand the upload section)
2. Enter URL
Enter a URL or sitemap URL:
| Input | What Happens |
|---|---|
| https://example.com/faq | Crawls single page |
| https://example.com/sitemap.xml | Parses sitemap, shows pages |
| https://example.com | Attempts to find sitemap |
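The input handling above can be sketched as a simple heuristic. This is an illustrative guess at the classification logic, not the platform's actual implementation; `classify_input` is a hypothetical name:

```python
from urllib.parse import urlparse

def classify_input(url: str) -> str:
    """Guess how a crawler might treat an input URL (illustrative only)."""
    path = urlparse(url).path
    if path.endswith((".xml", ".txt")) and "sitemap" in path.lower():
        return "sitemap"        # parse sitemap, show discovered pages
    if path in ("", "/"):
        return "root"           # attempt to find a sitemap first
    return "single-page"        # crawl just this page

print(classify_input("https://example.com/faq"))          # single-page
print(classify_input("https://example.com/sitemap.xml"))  # sitemap
print(classify_input("https://example.com"))              # root
```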
3. Configure Options (Sitemap)
For sitemap crawls, configure:
| Option | Default | Description |
|---|---|---|
| Max Pages | 50 | Maximum pages to crawl |
| Crawl Delay | 2 seconds | Delay between requests |
| Respect robots.txt | Yes | Honor site's crawl rules |
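A polite crawl loop that honors all three options might look like the sketch below. It uses Python's standard-library `urllib.robotparser`; the `crawl` function and its injected `fetch` callable are hypothetical, kept network-free so the structure is clear:

```python
import time
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a rule checker (stdlib)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def crawl(urls, fetch, rules, max_pages=50, delay=2.0):
    """Fetch up to max_pages URLs, skipping those robots.txt disallows
    and pausing `delay` seconds between requests."""
    results = {}
    for url in urls[:max_pages]:
        if not rules.can_fetch("*", url):
            continue  # Respect robots.txt: Yes
        results[url] = fetch(url)
        time.sleep(delay)
    return results

rules = build_robot_rules("User-agent: *\nDisallow: /private/")
pages = crawl(
    ["https://example.com/faq", "https://example.com/private/admin"],
    fetch=lambda u: f"<html>{u}</html>",  # stub fetcher for the example
    rules=rules,
    delay=0,  # no pause needed for the stub
)
print(list(pages))  # only /faq survives the robots check
```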
4. Preview & Select Pages
For sitemaps, you'll see a list of discovered URLs:
✓ /faq FAQ Page
✓ /pricing Pricing
✓ /features Features
□ /blog/post-1 Blog Post 1
□ /blog/post-2 Blog Post 2
Select which pages to import (all are selected by default, up to max pages).
5. Start Crawling
Click Start Crawling to begin:
- Crawling: Pages are fetched and content extracted
- Processing: Text is chunked and embedded
- Indexing: Vectors stored in Pinecone
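The "Processing" step above splits extracted text into chunks before embedding. A common approach is overlapping character windows; the sizes below are illustrative defaults, not the platform's actual chunking parameters:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Split text into overlapping character windows for embedding.
    Overlap preserves context across chunk boundaries."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("a" * 1200, size=500, overlap=50)
print(len(chunks))  # 3 windows: 0-500, 450-950, 900-1200
```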
6. Monitor Progress
Watch the crawl progress:
Crawling: 15/50 pages
├── ✓ Completed: 12
├── ⏳ Processing: 2
└── ✗ Failed: 1
Content Extraction
The crawler extracts:
| Extracted | Not Extracted |
|---|---|
| Page title | Images |
| Main content | Videos |
| Headings (H1-H6) | Scripts |
| Paragraphs | Styles |
| Lists | Navigation menus |
| Tables | Footers |
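The extract/skip split in the table above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration of the idea, not the crawler's real extractor:

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer"}  # tags the crawler drops

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer subtrees."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

html = "<h1>FAQ</h1><p>Answers here.</p><script>var x=1;</script><nav>Home</nav>"
p = TextExtractor()
p.feed(html)
print(" ".join(p.parts))  # FAQ Answers here.
```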
Extraction Quality
Best extraction:
- Clean HTML with semantic markup
- Blog posts and articles
- Documentation pages
- FAQ pages
Challenging extraction:
- Single-page apps (JavaScript rendered)
- Pages with mostly images
- Login-protected content
- Dynamically loaded content
Sitemap Requirements
Supported Formats
| Format | Extension | Example |
|---|---|---|
| XML Sitemap | .xml | Standard format |
| Sitemap Index | .xml | Links to other sitemaps |
| Text Sitemap | .txt | One URL per line |
Example Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/faq</loc>
<lastmod>2024-01-15</lastmod>
</url>
<url>
<loc>https://example.com/pricing</loc>
<lastmod>2024-01-10</lastmod>
</url>
</urlset>
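A sitemap like the one above can be parsed with Python's standard-library `xml.etree.ElementTree`; note the sitemap namespace must be passed explicitly. This sketch shows the idea, not the platform's parser:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str):
    """Extract <loc> URLs from a standard XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/faq</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
print(sitemap_urls(sitemap))
# ['https://example.com/faq', 'https://example.com/pricing']
```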
Finding Your Sitemap
Common locations:
- https://yoursite.com/sitemap.xml
- https://yoursite.com/sitemap_index.xml
- Check robots.txt for the sitemap location
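Many sites advertise their sitemap in a `Sitemap:` line inside robots.txt. A small sketch for pulling those lines out (`sitemaps_from_robots` is a hypothetical helper, not part of the platform):

```python
def sitemaps_from_robots(robots_txt: str):
    """Return sitemap URLs declared in a robots.txt body.
    The Sitemap directive is matched case-insensitively."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

robots = "User-agent: *\nDisallow: /private/\nSitemap: https://yoursite.com/sitemap.xml"
print(sitemaps_from_robots(robots))  # ['https://yoursite.com/sitemap.xml']
```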
Document Status
Each crawled page shows a status:
| Status | Icon | Meaning | Action |
|---|---|---|---|
| Pending | Clock | Queued | Wait |
| Processing | Spinner | Being processed | Wait |
| Completed | Checkmark | Ready for retrieval | None |
| Failed | X | Crawl or processing error | Retry |
| Stuck | Warning | Processing timed out | Retry |
Retry Failed Documents
If a crawl fails:
- Click the Retry button next to the failed document
- The system re-crawls the URL
- The content is then re-processed and re-indexed
Common failure reasons:
- Temporary network issues
- Rate limiting by the website
- Page structure changed
- Embedding API timeout
See Troubleshooting for more details.
Re-Crawling for Updates
To update crawled content:
- Delete the existing URL documents
- Re-crawl the same URLs
Tip: Keep track of which URLs you've crawled. The document list shows the source URL for each entry.
Best Practices
Start Small
- Crawl a single page first
- Verify content extraction quality
- Then crawl more pages
Respect Rate Limits
- Use the default 2-second delay
- Don't crawl too many pages at once
- Respect robots.txt rules
Select Relevant Pages
Don't import everything. Focus on:
- FAQ and support pages
- Product documentation
- Policy pages
- Key landing pages
Avoid:
- Blog archives (hundreds of old posts)
- Image galleries
- Login/account pages
- Terms and privacy (unless needed)
Monitor Quality
After crawling:
- Check document status for failures
- Test RAG retrieval in the agent
- Remove low-quality extractions
Limits & Quotas
| Resource | Limit |
|---|---|
| Max pages per crawl | 50 |
| Max total documents | 100 per KB |
| Crawl delay | Minimum 1 second |
| Request timeout | 30 seconds per page |
Troubleshooting
"Failed to crawl URL"
Possible causes:
- Website blocks crawlers (check robots.txt)
- Page requires authentication
- Invalid URL format
- Network timeout
Solutions:
- Verify URL is accessible in browser
- Check if site allows crawling
- Try single page instead of sitemap
"No content extracted"
Possible causes:
- JavaScript-rendered content
- Page has no text content
- Content behind login
Solutions:
- Try a different page
- Use file upload for this content
- Check page loads without JavaScript
Stuck documents
Documents stuck in "Processing" for more than 10 minutes can be retried:
- Look for the yellow "Stuck" badge
- Click Retry to re-process
See Troubleshooting for more solutions.
Use Cases
Import Company FAQ
Sitemap: https://company.com/sitemap.xml
Selected pages:
- /faq
- /help/shipping
- /help/returns
- /help/payment
Import Product Docs
Sitemap: https://docs.company.com/sitemap.xml
Max pages: 50
Selected: All pages in /product-guide/
Import Knowledge Center
Single URL: https://support.company.com/knowledge-base
Then selectively crawl individual articles
Next Steps
- Document Processing - How content is indexed
- RAG Integration - Configure retrieval
- Troubleshooting - Fix crawl issues
- Upload Documents - Import local files
Ready to import web content? Open the Edesy Platform to start crawling.