Website Crawling
Website crawling lets you automatically import web content into your knowledge base. Once configured, the crawler fetches pages from your specified URL and keeps the content synchronized through scheduled updates.
Getting Started
To add a website to your knowledge base:
- Open your knowledge base
- Navigate to the Website Sources section
- Click Add Website Source
- Enter the URL you want to crawl and configure your settings
The system uses Firecrawl to handle the crawling process, ensuring clean extraction of page content suitable for AI processing.
Configuration Options
URL
Enter the starting URL for your crawl. This should be a valid HTTP or HTTPS address. For single pages, simply enter that page's URL. For crawling an entire site, start with the homepage or the section you want to index.
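The "valid HTTP or HTTPS address" requirement can be checked up front. This is an illustrative sketch, not the product's actual validation logic, using Python's standard `urllib.parse`:

```python
from urllib.parse import urlparse

def is_valid_crawl_url(url: str) -> bool:
    """Return True if the URL is an absolute HTTP or HTTPS address."""
    parsed = urlparse(url)
    # Require an http/https scheme and a host (netloc); relative
    # paths like "example.com/docs" parse with an empty netloc.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Examples:
# is_valid_crawl_url("https://example.com/docs") -> True
# is_valid_crawl_url("example.com/docs")         -> False (no scheme)
```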
Crawl Mode
Choose how extensively the crawler should explore:
Single Page Only crawls just the URL you specify—perfect when you only need content from one specific page.
Entire Website follows links from your starting URL to discover and crawl multiple pages across the site. This is ideal for indexing documentation sites, blogs, or any multi-page resource.
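The difference between the two modes comes down to whether the crawler follows discovered links. Below is a minimal sketch of that distinction (not the product's actual crawler); `fetch_links` is a hypothetical callback that returns the links found on a page:

```python
from collections import deque
from typing import Callable, Iterable

def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          entire_site: bool = False,
          max_pages: int = 100) -> list[str]:
    """Visit just start_url (single-page mode), or breadth-first
    traverse linked pages (entire-site mode), up to max_pages."""
    visited: list[str] = []
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        if not entire_site:
            break  # Single Page Only: stop after the starting URL
        for link in fetch_links(url):
            if link not in seen:  # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return visited

# With a tiny link graph a -> {b, c}, b -> {c}:
# crawl("a", ...)                     -> ["a"]
# crawl("a", ..., entire_site=True)   -> ["a", "b", "c"]
```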
Scan Frequency
Determine how often the crawler should revisit your sources to capture updated content:
| Frequency | Best For |
|---|---|
| Manual | Static content you control |
| Hourly | Rapidly changing content |
| Daily | News sites, active blogs |
| Weekly | Documentation, company sites |
| Monthly | Reference material, archives |
| Custom | Set your own interval (days, hours, minutes) |
With the Custom option, you specify the exact number of days, hours, and minutes between crawls.
Start with less frequent scans and adjust based on how often your source content actually changes. This helps manage your crawl quota efficiently.
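For a custom interval, the next scheduled run is simply the last crawl time plus the configured offset. A minimal illustration using Python's `datetime` (the actual scheduler's behavior may differ):

```python
from datetime import datetime, timedelta

def next_crawl_time(last_crawl: datetime,
                    days: int = 0, hours: int = 0,
                    minutes: int = 0) -> datetime:
    """Compute when the next crawl is due for a custom interval
    of the given days, hours, and minutes."""
    return last_crawl + timedelta(days=days, hours=hours, minutes=minutes)

# A crawl at noon with a "1 day, 6 hours" interval is next due at
# 18:00 the following day.
last = datetime(2024, 1, 1, 12, 0)
print(next_crawl_time(last, days=1, hours=6))  # 2024-01-02 18:00:00
```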
Managing URL Discovery
When crawling entire websites, you can control which pages get included through URL mapping and exclusions.
Mapping URLs
Before a full website crawl begins, you can preview which pages will be discovered:
- Click Map URLs to let the system discover all accessible pages
- Review the list of discovered URLs, organized by path segments
- Select any URLs or path groups you want to exclude
- Save your exclusions
This preview helps you avoid crawling irrelevant sections like login pages, admin areas, or duplicate content.
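Organizing discovered URLs by path segment might look like the following sketch, which groups URLs by their first path segment (an assumption about how the URL map is structured, using only the standard library):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_first_segment(urls: list[str]) -> dict[str, list[str]]:
    """Group discovered URLs by the first path segment, so whole
    sections (e.g. everything under /blog) can be reviewed together."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        # The site root has an empty path; file it under "/".
        segment = path.split("/")[0] if path else "/"
        groups[segment].append(url)
    return dict(groups)
```

Grouping this way makes it easy to spot (and exclude) whole sections such as `/archive` or `/tags` before the crawl starts.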
Excluding URLs
After mapping, exclude pages by:
- Selecting individual URLs to skip
- Excluding entire path groups (e.g., exclude everything under /archive/)
- Using the select/deselect all options for quick bulk changes
Excluded URLs won't be crawled or indexed, keeping your knowledge base focused on relevant content.
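The two exclusion styles above, individual URLs and whole path groups, amount to an exact match plus a prefix match. A minimal sketch of that check (an illustration, not the crawler's actual implementation):

```python
def is_excluded(url_path: str,
                excluded_paths: set[str],
                excluded_prefixes: set[str]) -> bool:
    """True if the path is excluded individually (exact match) or
    falls under an excluded path group (prefix match)."""
    if url_path in excluded_paths:
        return True
    return any(url_path.startswith(prefix) for prefix in excluded_prefixes)

# Excluding the /archive/ group skips every page beneath it:
# is_excluded("/archive/2021", set(), {"/archive/"}) -> True
```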
Monitoring Your Crawls
Each website source displays its current status and history.
Status Indicators
| Status | Meaning |
|---|---|
| Success | Crawl completed without errors |
| Pending | Crawl is queued and waiting to start |
| In Progress | Currently crawling pages |
| Failed | Error occurred during crawl |
When a crawl fails, you'll see error details explaining what went wrong—often related to site accessibility, rate limiting, or authentication requirements.
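The failure causes listed above often map directly onto HTTP status codes. A hypothetical helper illustrating that mapping (the product's actual error reporting may classify failures differently):

```python
def diagnose_failure(status_code: int) -> str:
    """Map common HTTP status codes to a likely crawl failure cause."""
    if status_code in (401, 403):
        return "authentication or access restriction"
    if status_code == 429:
        return "rate limiting"
    if status_code == 404:
        return "page not found"
    if status_code >= 500:
        return "site accessibility (server error)"
    return "unknown"
```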
Crawl Information
For each website source you can see:
- The URL being crawled
- When it was last updated
- When the next scheduled crawl will run (if automatic scheduling is enabled)
- The number of pages successfully indexed
Manual Refresh
Need updated content before the next scheduled crawl? Click Start Crawl to trigger an immediate refresh. This is useful when you know the source content has changed and you want those changes reflected in your knowledge base right away.
Editing and Removing Sources
Changing Settings
To modify a website source:
- Click on the website source to open its details
- Update the URL, crawl mode, schedule, or exclusions
- Changes save automatically
If you change the URL significantly, consider re-mapping URLs to update your exclusion list.
Deleting a Source
When you no longer need content from a website:
- Open the website source details
- Click Delete
- Confirm the removal
This removes both the source configuration and all content that was imported from it.
Best Practices
Scope Your Crawls Thoughtfully
Start with the most relevant section of a site rather than crawling everything. A focused knowledge base with high-quality content performs better than one filled with tangentially related pages.
Use Exclusions Strategically
Exclude pages that don't add value:
- Navigation-only pages and site maps
- Login, signup, and account pages
- Duplicate content under different paths
- Auto-generated archives or tag listings
Match Frequency to Content Velocity
Consider how often your sources actually update:
- API documentation might change weekly
- A company blog might post daily
- Reference archives rarely change at all
Setting appropriate frequencies keeps your knowledge base current without wasting crawl resources.
Test with Single Pages First
Before committing to a full site crawl, test with a single page to verify:
- The content extracts cleanly
- The information is useful for your agents
- There are no access issues
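A rough "extracts cleanly" check can be approximated locally before committing to a crawl. This sketch uses Python's standard `html.parser` and a simple length heuristic (the `min_chars` threshold is an arbitrary assumption, not a product setting):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extracts_cleanly(html: str, min_chars: int = 40) -> bool:
    """Heuristic: did the page yield enough visible text to be useful?"""
    parser = TextExtractor()
    parser.feed(html)
    return len(" ".join(parser.chunks)) >= min_chars

# A JavaScript-rendered app shell typically fails this check because
# its static HTML body contains little or no visible text.
```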
Troubleshooting
Crawl Keeps Failing
Verify the URL is publicly accessible—try opening it in an incognito browser window. Some sites block automated access or require authentication. If the site uses JavaScript heavily to render content, it may not crawl well.
Missing Expected Content
Check your exclusion list to ensure you haven't accidentally excluded the pages you need. Also verify that the crawl mode is set to "Entire Website" if you expect multiple pages to be indexed.
Content Seems Outdated
Check when the last successful crawl occurred. If it was recent but content still seems old, the source website may be caching or the specific pages you're looking for might have been excluded. Trigger a manual refresh to pull the latest content.
Too Many Pages Being Crawled
Use URL mapping to review what's being discovered, then add exclusions for sections you don't need. This is common with sites that have extensive archives or auto-generated category pages.