Website Crawling

Website crawling lets you automatically import web content into your knowledge base. Once configured, the crawler fetches pages from your specified URL and keeps the content synchronized through scheduled updates.

Getting Started

To add a website to your knowledge base:

  1. Open your knowledge base
  2. Navigate to the Website Sources section
  3. Click Add Website Source
  4. Enter the URL you want to crawl and configure your settings

The system uses Firecrawl to handle the crawling process, ensuring clean extraction of page content suitable for AI processing.

Configuration Options

URL

Enter the starting URL for your crawl. This should be a valid HTTP or HTTPS address. For single pages, simply enter that page's URL. For crawling an entire site, start with the homepage or the section you want to index.
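The "valid HTTP or HTTPS address" requirement can be checked before submitting a source. A minimal sketch of such a check in Python, using the standard library (the function name `is_valid_crawl_url` is illustrative, not part of the product):

```python
from urllib.parse import urlparse

def is_valid_crawl_url(url: str) -> bool:
    """Accept only absolute HTTP(S) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

For example, `is_valid_crawl_url("https://example.com/docs")` passes, while a bare `example.com` (no scheme) or an `ftp://` address fails.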

Crawl Mode

Choose how extensively the crawler should explore:

Single Page Only crawls just the URL you specify—perfect when you only need content from one specific page.

Entire Website follows links from your starting URL to discover and crawl multiple pages across the site. This is ideal for indexing documentation sites, blogs, or any multi-page resource.
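In Entire Website mode, discovered links are typically resolved against the current page and restricted to the starting site. A rough sketch of that same-host filtering, assuming link discovery works on raw `href` values (the function is illustrative; the actual crawler's logic is not shown here):

```python
from urllib.parse import urljoin, urlparse

def same_site_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against the page URL and keep only links on the
    same host, so the crawl stays within the starting site."""
    host = urlparse(page_url).netloc
    resolved = (urljoin(page_url, h) for h in hrefs)
    return [u for u in resolved if urlparse(u).netloc == host]
```

Relative links like `intro` or `/blog` resolve to the starting host and are kept; links to other domains are dropped.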

Scan Frequency

Determine how often the crawler should revisit your sources to capture updated content:

| Frequency | Best For |
| --- | --- |
| Manual | Static content you control |
| Hourly | Rapidly changing content |
| Daily | News sites, active blogs |
| Weekly | Documentation, company sites |
| Monthly | Reference material, archives |
| Custom | Set your own interval (days, hours, minutes) |

You can also set a custom frequency specifying exact days, hours, and minutes between crawls.
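The custom interval amounts to simple date arithmetic: the next run is the last run plus the configured days, hours, and minutes. A sketch of that computation (the function name is illustrative, not part of the product):

```python
from datetime import datetime, timedelta

def next_crawl_time(last_crawl: datetime,
                    days: int = 0, hours: int = 0, minutes: int = 0) -> datetime:
    """Compute the next scheduled crawl from a custom interval."""
    return last_crawl + timedelta(days=days, hours=hours, minutes=minutes)
```

For instance, a crawl that last ran at noon with a 1-day, 6-hour interval is next due at 18:00 the following day.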

Tip: Start with less frequent scans and adjust based on how often your source content actually changes. This helps manage your crawl quota efficiently.

Managing URL Discovery

When crawling entire websites, you can control which pages get included through URL mapping and exclusions.

Mapping URLs

Before a full website crawl begins, you can preview which pages will be discovered:

  1. Click Map URLs to let the system discover all accessible pages
  2. Review the list of discovered URLs, organized by path segments
  3. Select any URLs or path groups you want to exclude
  4. Save your exclusions

This preview helps you avoid crawling irrelevant sections like login pages, admin areas, or duplicate content.
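The "organized by path segments" view in step 2 can be pictured as grouping each discovered URL under its first path segment. A minimal sketch of that grouping, under the assumption that grouping is by the leading segment only (the function is illustrative):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_first_segment(urls: list[str]) -> dict[str, list[str]]:
    """Group discovered URLs by their first path segment,
    with the site root collected under '/'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        segment = path.split("/")[0] if path else "/"
        groups[segment].append(url)
    return dict(groups)
```

Grouping this way makes it easy to exclude a whole section (say, everything under `blog`) in one action rather than URL by URL.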

Excluding URLs

After mapping, exclude pages by:

  • Selecting individual URLs to skip
  • Excluding entire path groups (e.g., exclude everything under /archive/)
  • Using the select/deselect all options for quick bulk changes

Excluded URLs won't be crawled or indexed, keeping your knowledge base focused on relevant content.
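Conceptually, a path-group exclusion like "everything under /archive/" is a prefix match on the URL path. A sketch of how such an exclusion list might be applied (illustrative only; the product's actual matching rules are not documented here):

```python
from urllib.parse import urlparse

def apply_exclusions(urls: list[str], excluded_prefixes: list[str]) -> list[str]:
    """Drop any URL whose path falls under an excluded path group,
    e.g. everything under /archive/."""
    def excluded(url: str) -> bool:
        path = urlparse(url).path
        return any(path.startswith(prefix) for prefix in excluded_prefixes)
    return [u for u in urls if not excluded(u)]
```

With exclusions `["/archive/", "/login"]`, archive pages and the login page are filtered out while the rest of the site is kept.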

Monitoring Your Crawls

Each website source displays its current status and history.

Status Indicators

| Status | Meaning |
| --- | --- |
| Success | Crawl completed without errors |
| Pending | Crawl is queued and waiting to start |
| In Progress | Currently crawling pages |
| Failed | Error occurred during crawl |

When a crawl fails, you'll see error details explaining what went wrong—often related to site accessibility, rate limiting, or authentication requirements.

Crawl Information

For each website source you can see:

  • The URL being crawled
  • When it was last updated
  • When the next scheduled crawl will run (if automatic scheduling is enabled)
  • The number of pages successfully indexed

Manual Refresh

Need updated content before the next scheduled crawl? Click Start Crawl to trigger an immediate refresh. This is useful when you know the source content has changed and you want those changes reflected in your knowledge base right away.

Editing and Removing Sources

Changing Settings

To modify a website source:

  1. Click on the website source to open its details
  2. Update the URL, crawl mode, schedule, or exclusions
  3. Changes save automatically

If you change the URL significantly, consider re-mapping URLs to update your exclusion list.

Deleting a Source

When you no longer need content from a website:

  1. Open the website source details
  2. Click Delete
  3. Confirm the removal

This removes both the source configuration and all content that was imported from it.

Best Practices

Scope Your Crawls Thoughtfully

Start with the most relevant section of a site rather than crawling everything. A focused knowledge base with high-quality content performs better than one filled with tangentially related pages.

Use Exclusions Strategically

Exclude pages that don't add value:

  • Navigation-only pages and site maps
  • Login, signup, and account pages
  • Duplicate content under different paths
  • Auto-generated archives or tag listings

Match Frequency to Content Velocity

Consider how often your sources actually update:

  • API documentation might change weekly
  • A company blog might post daily
  • Reference archives rarely change at all

Setting appropriate frequencies keeps your knowledge base current without wasting crawl resources.

Test with Single Pages First

Before committing to a full site crawl, test with a single page to verify:

  • The content extracts cleanly
  • The information is useful for your agents
  • There are no access issues

Troubleshooting

Crawl Keeps Failing

Verify the URL is publicly accessible—try opening it in an incognito browser window. Some sites block automated access or require authentication. If the site uses JavaScript heavily to render content, it may not crawl well.

Missing Expected Content

Check your exclusion list to ensure you haven't accidentally excluded the pages you need. Also verify that the crawl mode is set to "Entire Website" if you expect multiple pages to be indexed.

Content Seems Outdated

Check when the last successful crawl occurred. If it was recent but content still seems old, the source website may be serving cached pages, or the specific pages you're looking for might have been excluded. Trigger a manual refresh to pull the latest content.

Too Many Pages Being Crawled

Use URL mapping to review what's being discovered, then add exclusions for sections you don't need. This is common with sites that have extensive archives or auto-generated category pages.