How to Reduce Training Characters When Scanning a Website in ChatLab
- Why Less Training Can Be Better: RAG and the Problem of Noise
- 🚨 Why Too Much Data Hurts Performance
- ✅ High Signal, Low Noise = Better Answers
- 🔍 TL;DR
- ✅ Use Sitemap Instead of Full Website Scan
- ⚙️ Use Advanced Filters to Fine-Tune Scanning
- 🧩 Include Only Specific URLs
- 🚫 Exclude image links
- 🚫 Exclude Irrelevant URLs
- 🧼 Exclude Useless Page Elements
- 📄 Scrape documents
- 🌍 Simulate Visit From Specific Country (Optional)
- 🕐 Add Delay Between Requests (Optional)
- ❌ Example of using exclude options
- ✅ Summary: Best Practices to Reduce Training Characters
Why Less Training Can Be Better: RAG and the Problem of Noise
ChatLab uses a RAG (Retrieval-Augmented Generation) architecture. This means that instead of memorizing all website content, the AI searches the most relevant pieces of information at the time of the user's question — and only then uses them to generate an answer.
🚨 Why Too Much Data Hurts Performance
More data doesn't always mean better answers. In fact, adding irrelevant or redundant content can confuse the retrieval process. Here's why:
- RAG works by scoring chunks of content for relevance to the user's question. If your site contains many pages or elements with repetitive, vague, or generic text (like menus, footers, press releases, or blog tag pages), they dilute the quality of search results.
- When the chatbot retrieves multiple low-value chunks that share keywords (but lack helpful context), the model might generate inaccurate, overly general, or off-topic answers.
- For example: If your footer contains links like "Home", "Contact", "Privacy", "FAQ" on every page — and these are scanned — they will flood the knowledge base with noise. When a user asks “How can I contact you?”, the bot might respond with a generic paragraph from the footer instead of a helpful phone number or email.
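To make the dilution concrete, here is a toy Python sketch (not ChatLab's actual retrieval code, and all names are made up) of a naive keyword-overlap scorer. With three copies of a scanned footer in the knowledge base, two of the top three retrieved chunks are pure noise:

```python
import re

def score(chunk: str, question: str) -> int:
    """Naive relevance: how many question words also appear in the chunk."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", chunk.lower()))
    return len(q_words & c_words)

question = "How can I contact you?"
chunks = [
    "Home | Contact | Privacy | FAQ",  # footer scanned from page 1
    "Home | Contact | Privacy | FAQ",  # same footer, page 2
    "Home | Contact | Privacy | FAQ",  # same footer, page 3
    "You can contact support at +1 555 0100 or support@example.com",
]

# Retrieve the top 3 chunks: two of the three are footer noise.
top = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:3]
for chunk in top:
    print(score(chunk, question), "|", chunk)
```

Real retrievers use embeddings rather than word overlap, but the dilution effect is the same: repeated boilerplate competes with the one chunk that actually answers the question.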
✅ High Signal, Low Noise = Better Answers
By scanning only useful content — like:
- Product descriptions
- Real FAQ pages
- Support articles
- Policy documents
- Key landing pages
...you help the AI retrieve clear, specific, high-quality context for each user question. This leads to more accurate, confident, and helpful responses.
🔍 TL;DR
More content = more noise. More noise = worse search and worse answers.
Focused, clean, relevant content = best chatbot results
When training your chatbot on a website, it's important to limit the amount of scanned data to what's most useful. This helps reduce token usage, speeds up training, and improves answer quality by focusing only on relevant content.
Here’s how you can optimize your scan using the Advanced settings in the "Train on Website" screen:
✅ Use Sitemap Instead of Full Website Scan
Recommended: Use the sitemap option whenever possible.
📌 Why? Full website scan follows all visible links on the page, including links from footers, menus, and sidebars — which often leads to scanning repetitive or non-valuable content. These sections are typically present on every page and rarely include unique information useful for chat answers.
By using a sitemap (usually located at https://yourdomain.com/sitemap.xml), you:
- Avoid crawling unnecessary pages (e.g. legal, social links, repeated blog tags)
- Control exactly which URLs are scanned
- Prevent accidental overuse of characters from menus, footers, and headers
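For the curious, here is a minimal Python sketch of what "use the sitemap" means: fetch the XML and read the exact list of declared page URLs, instead of discovering links by crawling. ChatLab does this for you automatically; the domain below is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourdomain.com/sitemap.xml"  # placeholder domain
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Each <url><loc>...</loc></url> entry is one page the site wants indexed.
urls = [loc.text for loc in tree.getroot().findall("sm:url/sm:loc", NS)]
print(f"{len(urls)} URLs declared in the sitemap")
```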
⚙️ Use Advanced Filters to Fine-Tune Scanning
🧩 Include Only Specific URLs
Field: Advanced Settings → Include only URLs that contain
Usage: Add semicolon-separated keywords to include only specific URLs in the scan.
Example:
To scan only English help pages:
```
/en;help
```
This will only scan pages whose URL contains /en or help — saving tokens by skipping irrelevant sections.
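ChatLab's exact matching rules are internal, but a semicolon-separated include filter typically behaves like the following sketch: a URL is scanned if it contains at least one of the keywords.

```python
include = "/en;help"
keywords = [k for k in include.split(";") if k]

def should_scan(url: str) -> bool:
    # Keep the URL if any include keyword appears in it as a substring.
    return any(k in url for k in keywords)

print(should_scan("https://example.com/en/pricing"))    # True  (contains /en)
print(should_scan("https://example.com/help/setup"))    # True  (contains help)
print(should_scan("https://example.com/de/impressum"))  # False (skipped)
```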
🚫 Exclude image links
Field: Advanced Settings → Exclude image links
Usage: Turn on "Exclude image links" to omit all image links from the training dataset.
By default, every image link found on the site is extracted into the training dataset. Excluding them can reduce the character count significantly.
🚫 Exclude Irrelevant URLs
Field: Advanced Settings → Exclude URLs that contain
Usage: Add semicolon-separated keywords to skip sections like blogs, images, or news.
Example:
```
/news;/pictures;/press
```
This avoids scanning content-heavy but chatbot-irrelevant sections like news or image galleries.
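The exclude filter is the mirror image. As a sketch (again, the exact precedence between include and exclude filters is up to ChatLab), a URL is skipped if it contains any excluded keyword:

```python
exclude = "/news;/pictures;/press"
blocked = [k for k in exclude.split(";") if k]

def should_scan(url: str) -> bool:
    # Drop the URL if any excluded keyword appears in it as a substring.
    return not any(k in url for k in blocked)

print(should_scan("https://example.com/products/chatbot"))  # True
print(should_scan("https://example.com/news/2024-launch"))  # False (skipped)
```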
🧼 Exclude Useless Page Elements
Field: Advanced Settings → Exclude element IDs
Usage: Prevent scanning of headers, footers, and other layout elements. This also stops links from those areas from being followed in the full website scan.
Examples:
```
header;footer;#menu;.sidebar
```
This will remove HTML blocks that are repeated on every page and generally contain no useful content for your chatbot.
⚠️ Important: If you use full website scan and exclude elements like header, it may prevent the scanner from discovering deeper links (menus are often inside headers). In this case, using a sitemap is the safest and most efficient method.
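Conceptually, excluding layout elements amounts to something like the following sketch, shown here with the third-party BeautifulSoup library (ChatLab performs this server-side; the HTML is a made-up example). Plain names like header match tag names, a # prefix matches an ID, and a . prefix matches a class:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <header>Home | Contact | Privacy | FAQ</header>
  <div id="menu">Products; Pricing; Blog</div>
  <main>Our chatbot answers support questions 24/7.</main>
  <div class="sidebar">Latest posts...</div>
  <footer>© Example Inc.</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for selector in ["header", "footer", "#menu", ".sidebar"]:
    for element in soup.select(selector):
        element.decompose()  # remove the element and everything inside it

print(soup.get_text(strip=True))  # only the <main> content survives
```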
📄 Scrape documents
Field: Advanced Settings → Scrape documents
Usage: When enabled, the system will also extract and process document content. Currently, this option only supports PDF files found on the website.
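For context, extracting text from a PDF conceptually looks like this sketch using the third-party pypdf package (ChatLab's own pipeline is internal, and document.pdf is a placeholder file name):

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")  # placeholder path
# Concatenate the extracted text of every page into one string.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(text)} characters from {len(reader.pages)} pages")
```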
🌍 Simulate Visit From Specific Country (Optional)
If your site shows different content depending on visitor location, you can simulate scanning from a specific country using its 2-letter country code (e.g. US, PL, DE). You can find this option in the advanced settings for Website training.
🕐 Add Delay Between Requests (Optional)
Use the Delay field to prevent overloading your server. For example, add 5 seconds between each scanned page. You can find this option in advanced settings for Website training.
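In miniature, the delay setting amounts to something like this sketch (the URLs are placeholders):

```python
import time
import urllib.request

DELAY_SECONDS = 5  # pause between page fetches, as in the Delay field
urls = [
    "https://example.com/help/getting-started",
    "https://example.com/help/billing",
]

for url in urls:
    with urllib.request.urlopen(url) as resp:
        print(url, len(resp.read()), "bytes")
    time.sleep(DELAY_SECONDS)  # wait before the next request
```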
❌ Example of using exclude options
Let's do a small experiment using "Exclude image links" and "Exclude element IDs" on the page https://help.chatlab.com/. We will check the character count before applying the exclusions, then see by what percentage it drops once they are in place.
- First, we will check the character count without applying any excludes.
As we can see, the character count is 300,042. The Help Center page is not among the largest, but it can certainly be reduced. Let's move on to the next step.
- Application of selected options.
We've chosen very basic options: removing image links plus the header and footer. The "Exclude element IDs" field can be used far more extensively, with longer lists of classes and IDs; here we opted for just two selectors.
- Effects.
We observed a nearly 25% decrease in the character count! A very simple option that took no more than 10 seconds to configure reduced the count by roughly a quarter on a page with 300,000 characters.
Keep in mind that the reduction depends on the page being scanned: its type, its content, and how many images it carries. These simple options won't consistently cut 25% on every page; on one it might be 10%, on another 60%.
✅ Summary: Best Practices to Reduce Training Characters
- 🗂️ Use sitemap instead of full scan for precise control.
- ➕ Use Include URLs to target only valuable content.
- ➖ Use Exclude URLs to skip irrelevant or heavy pages.
- 🚫 Use Exclude element IDs to remove repeated page parts like headers and footers.
- 🧠 Focus the scan on useful content for your bot (e.g. FAQs, product pages, support).