Website Crawling

Website crawling lets you automatically import web content into your knowledge base. Once configured, the crawler fetches pages from your specified URL and keeps the content synchronized through scheduled updates.

Getting Started

To add a website to your knowledge base:

  1. Open your knowledge base
  2. Navigate to the Website Sources section
  3. Click Add Website Source
  4. Enter the URL you want to crawl and configure your settings

The system uses Firecrawl to handle the crawling process, ensuring clean extraction of page content suitable for AI processing.

Configuration Options

URL

Enter the starting URL for your crawl. This should be a valid HTTP or HTTPS address. For single pages, simply enter that page's URL. For crawling an entire site, start with the homepage or the section you want to index.
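The "valid HTTP or HTTPS address" requirement can be checked before submitting a source. A minimal sketch of such a check in Python, using the standard library (the function name `is_valid_crawl_url` is illustrative, not part of the product):

```python
from urllib.parse import urlparse

def is_valid_crawl_url(url: str) -> bool:
    """Accept only absolute HTTP(S) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

For example, `is_valid_crawl_url("https://example.com/docs")` passes, while a bare `example.com` (no scheme) or an `ftp://` address fails.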

Crawl Mode

Choose how extensively the crawler should explore:

Single Page Only crawls just the URL you specify—perfect when you only need content from one specific page.

Entire Website follows links from your starting URL to discover and crawl multiple pages across the site. This is ideal for indexing documentation sites, blogs, or any multi-page resource.
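In Entire Website mode, discovered links are typically resolved against the current page and restricted to the starting site. A rough sketch of that same-host filtering, assuming link discovery works on raw `href` values (the function is illustrative; the actual crawler's logic is not shown here):

```python
from urllib.parse import urljoin, urlparse

def same_site_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against the page URL and keep only links on the
    same host, so the crawl stays within the starting site."""
    host = urlparse(page_url).netloc
    resolved = (urljoin(page_url, h) for h in hrefs)
    return [u for u in resolved if urlparse(u).netloc == host]
```

Relative links like `intro` or `/blog` resolve to the starting host and are kept; links to other domains are dropped.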

Scan Frequency

Determine how often the crawler should revisit your sources to capture updated content:

| Frequency | Best For |
| --- | --- |
| Manual | Static content you control |
| Hourly | Rapidly changing content |
| Daily | News sites, active blogs |
| Weekly | Documentation, company sites |
| Monthly | Reference material, archives |
| Custom | Set your own interval (days, hours, minutes) |

You can also set a custom frequency specifying exact days, hours, and minutes between crawls.
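The custom interval amounts to simple date arithmetic: the next run is the last run plus the configured days, hours, and minutes. A sketch of that computation (the function name is illustrative, not part of the product):

```python
from datetime import datetime, timedelta

def next_crawl_time(last_crawl: datetime,
                    days: int = 0, hours: int = 0, minutes: int = 0) -> datetime:
    """Compute the next scheduled crawl from a custom interval."""
    return last_crawl + timedelta(days=days, hours=hours, minutes=minutes)
```

For instance, a crawl that last ran at noon with a 1-day, 6-hour interval is next due at 18:00 the following day.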

Tip: Start with less frequent scans and adjust based on how often your source content actually changes. This helps manage your crawl quota efficiently.

Managing URL Discovery

When crawling entire websites, you can control which pages get included through URL mapping and exclusions.

Mapping URLs

Before a full website crawl begins, you can preview which pages will be discovered:

  1. Click Map URLs to let the system discover all accessible pages
  2. Review the list of discovered URLs, organized by path segments
  3. Select any URLs or path groups you want to exclude
  4. Save your exclusions

This preview helps you avoid crawling irrelevant sections like login pages, admin areas, or duplicate content.
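The "organized by path segments" view in step 2 can be pictured as grouping each discovered URL under its first path segment. A minimal sketch of that grouping, under the assumption that grouping is by the leading segment only (the function is illustrative):

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_first_segment(urls: list[str]) -> dict[str, list[str]]:
    """Group discovered URLs by their first path segment,
    with the site root collected under '/'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        segment = path.split("/")[0] if path else "/"
        groups[segment].append(url)
    return dict(groups)
```

Grouping this way makes it easy to exclude a whole section (say, everything under `blog`) in one action rather than URL by URL.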

Excluding URLs

After mapping, exclude pages by:

  • Selecting individual URLs to skip
  • Excluding entire path groups (e.g., exclude everything under /archive/)
  • Using the select/deselect all options for quick bulk changes

Excluded URLs won't be crawled or indexed, keeping your knowledge base focused on relevant content.
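Conceptually, a path-group exclusion like "everything under /archive/" is a prefix match on the URL path. A sketch of how such an exclusion list might be applied (illustrative only; the product's actual matching rules are not documented here):

```python
from urllib.parse import urlparse

def apply_exclusions(urls: list[str], excluded_prefixes: list[str]) -> list[str]:
    """Drop any URL whose path falls under an excluded path group,
    e.g. everything under /archive/."""
    def excluded(url: str) -> bool:
        path = urlparse(url).path
        return any(path.startswith(prefix) for prefix in excluded_prefixes)
    return [u for u in urls if not excluded(u)]
```

With exclusions `["/archive/", "/login"]`, archive pages and the login page are filtered out while the rest of the site is kept.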

Monitoring Your Crawls

Each website source displays its current status and history.

Status Indicators

| Status | Meaning |
| --- | --- |
| Success | Crawl completed without errors |
| Pending | Crawl is queued and waiting to start |
| In Progress | Currently crawling pages |
| Failed | Error occurred during crawl |

When a crawl fails, you'll see error details explaining what went wrong—often related to site accessibility, rate limiting, or authentication requirements.

Crawl Information

For each website source you can see:

  • The URL being crawled
  • When it was last updated
  • When the next scheduled crawl will run (if automatic scheduling is enabled)
  • The number of pages successfully indexed

Manual Refresh

Need updated content before the next scheduled crawl? Click Start Crawl to trigger an immediate refresh. This is useful when you know the source content has changed and you want those changes reflected in your knowledge base right away.

Editing and Removing Sources

Changing Settings

To modify a website source:

  1. Click on the website source to open its details
  2. Update the URL, crawl mode, schedule, or exclusions
  3. Changes save automatically

If you change the URL significantly, consider re-mapping URLs to update your exclusion list.

Deleting a Source

When you no longer need content from a website:

  1. Open the website source details
  2. Click Delete
  3. Confirm the removal

This removes both the source configuration and all content that was imported from it.

Best Practices

Scope Your Crawls Thoughtfully

Start with the most relevant section of a site rather than crawling everything. A focused knowledge base with high-quality content performs better than one filled with tangentially related pages.

Use Exclusions Strategically

Exclude pages that don't add value:

  • Navigation-only pages and site maps
  • Login, signup, and account pages
  • Duplicate content under different paths
  • Auto-generated archives or tag listings

Match Frequency to Content Velocity

Consider how often your sources actually update:

  • API documentation might change weekly
  • A company blog might post daily
  • Reference archives rarely change at all

Setting appropriate frequencies keeps your knowledge base current without wasting crawl resources.

Test with Single Pages First

Before committing to a full site crawl, test with a single page to verify:

  • The content extracts cleanly
  • The information is useful for your agents
  • There are no access issues

Troubleshooting

Crawl Keeps Failing

Verify the URL is publicly accessible—try opening it in an incognito browser window. Some sites block automated access or require authentication. If the site uses JavaScript heavily to render content, it may not crawl well.

Missing Expected Content

Check your exclusion list to ensure you haven't accidentally excluded the pages you need. Also verify that the crawl mode is set to "Entire Website" if you expect multiple pages to be indexed.

Content Seems Outdated

Check when the last successful crawl occurred. If it was recent but content still seems old, the source website may be serving cached pages, or the specific pages you're looking for might have been excluded. Trigger a manual refresh to pull the latest content.

Too Many Pages Being Crawled

Use URL mapping to review what's being discovered, then add exclusions for sections you don't need. This is common with sites that have extensive archives or auto-generated category pages.