How to Reduce Training Characters When Scanning a Website in ChatLab
- Why Less Training Can Be Better: RAG and the Problem of Noise
- 🚨 Why Too Much Data Hurts Performance
- ✅ High Signal, Low Noise = Better Answers
- 🔍 TL;DR
- ✅ Use Sitemap Instead of Full Website Scan
- ⚙️ Use Advanced Filters to Fine-Tune Scanning
- 🧩 Include Only Specific URLs
- 🚫 Exclude image links
- 🚫 Exclude Irrelevant URLs
- 🧼 Exclude Useless Page Elements
- 📄 Scrape documents
- 🌍 Simulate Visit From Specific Country (Optional)
- 🕐 Add Delay Between Requests (Optional)
- ❌ Example of using exclude options
- ✅ Summary: Best Practices to Reduce Training Characters
Why Less Training Can Be Better: RAG and the Problem of Noise
ChatLab uses a RAG (Retrieval-Augmented Generation) architecture. This means that instead of memorizing all website content, the AI searches the most relevant pieces of information at the time of the user's question — and only then uses them to generate an answer.
🚨 Why Too Much Data Hurts Performance
More data doesn't always mean better answers. In fact, adding irrelevant or redundant content can confuse the retrieval process. Here's why:
- RAG works by scoring chunks of content for relevance to the user's question. If your site contains many pages or elements with repetitive, vague, or generic text (like menus, footers, press releases, or blog tag pages), they dilute the quality of search results.
- When the chatbot retrieves multiple low-value chunks that share keywords (but lack helpful context), the model might generate inaccurate, overly general, or off-topic answers.
- For example: If your footer contains links like "Home", "Contact", "Privacy", "FAQ" on every page — and these are scanned — they will flood the knowledge base with noise. When a user asks “How can I contact you?”, the bot might respond with a generic paragraph from the footer instead of a helpful phone number or email.
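To make the dilution concrete, here is a toy Python sketch (not ChatLab's actual retrieval code, and all names are made up) of a naive keyword-overlap scorer. With three copies of a scanned footer in the knowledge base, two of the top three retrieved chunks are pure noise:

```python
import re

def score(chunk: str, question: str) -> int:
    """Naive relevance: how many question words also appear in the chunk."""
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", chunk.lower()))
    return len(q_words & c_words)

question = "How can I contact you?"
chunks = [
    "Home | Contact | Privacy | FAQ",  # footer scanned from page 1
    "Home | Contact | Privacy | FAQ",  # same footer, page 2
    "Home | Contact | Privacy | FAQ",  # same footer, page 3
    "You can contact support at +1 555 0100 or support@example.com",
]

# Retrieve the top 3 chunks: two of the three are footer noise.
top = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:3]
for chunk in top:
    print(score(chunk, question), "|", chunk)
```

Real retrievers use embeddings rather than word overlap, but the dilution effect is the same: repeated boilerplate competes with the one chunk that actually answers the question.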
✅ High Signal, Low Noise = Better Answers
By scanning only useful content — like:
- Product descriptions
- Real FAQ pages
- Support articles
- Policy documents
- Key landing pages
...you help the AI retrieve clear, specific, high-quality context for each user question. This leads to more accurate, confident, and helpful responses.
🔍 TL;DR
More content = more noise. More noise = worse search and worse answers.
Focused, clean, relevant content = best chatbot results
When training your chatbot on a website, it's important to limit the amount of scanned data to what's most useful. This helps reduce token usage, speeds up training, and improves answer quality by focusing only on relevant content.
Here’s how you can optimize your scan using the Advanced settings in the "Train on Website" screen:
✅ Use Sitemap Instead of Full Website Scan
Recommended: Use the sitemap option whenever possible.
📌 Why? Full website scan follows all visible links on the page, including links from footers, menus, and sidebars — which often leads to scanning repetitive or non-valuable content. These sections are typically present on every page and rarely include unique information useful for chat answers.
By using a sitemap (usually located at https://yourdomain.com/sitemap.xml), you:
- Avoid crawling unnecessary pages (e.g. legal, social links, repeated blog tags)
- Control exactly which URLs are scanned
- Prevent accidental overuse of characters from menus, footers, and headers
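For the curious, here is a minimal Python sketch of what "use the sitemap" means: fetch the XML and read the exact list of declared page URLs, instead of discovering links by crawling. ChatLab does this for you automatically; the domain below is a placeholder.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourdomain.com/sitemap.xml"  # placeholder domain
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Each <url><loc>...</loc></url> entry is one page the site wants indexed.
urls = [loc.text for loc in tree.getroot().findall("sm:url/sm:loc", NS)]
print(f"{len(urls)} URLs declared in the sitemap")
```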
⚙️ Use Advanced Filters to Fine-Tune Scanning
🧩 Include Only Specific URLs
Field: Advanced Settings → Include only URLs that contain
Usage: Add semicolon-separated keywords to include only specific URLs in the scan.
Example:
To scan only English help pages:
```
/en;help
```
This will only scan pages whose URL contains /en or help — saving tokens by skipping irrelevant sections.
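ChatLab's exact matching rules are internal, but a semicolon-separated include filter typically behaves like the following sketch: a URL is scanned if it contains at least one of the keywords.

```python
include = "/en;help"
keywords = [k for k in include.split(";") if k]

def should_scan(url: str) -> bool:
    # Keep the URL if any include keyword appears in it as a substring.
    return any(k in url for k in keywords)

print(should_scan("https://example.com/en/pricing"))    # True  (contains /en)
print(should_scan("https://example.com/help/setup"))    # True  (contains help)
print(should_scan("https://example.com/de/impressum"))  # False (skipped)
```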
🚫 Exclude image links
Field: Advanced Settings → Exclude image links
Usage: Turn on "Exclude image links" to omit all image links from the training dataset.
By default, every image link found on the site is extracted into the training dataset. Excluding them can reduce the character count significantly.
🚫 Exclude Irrelevant URLs
Field: Advanced Settings → Exclude URLs that contain
Usage: Add semicolon-separated keywords to skip sections like blogs, images, or news.
Example:
```
/news;/pictures;/press
```
This avoids scanning content-heavy but chatbot-irrelevant sections like news or image galleries.
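The exclude filter is the mirror image. As a sketch (again, the exact precedence between include and exclude filters is up to ChatLab), a URL is skipped if it contains any excluded keyword:

```python
exclude = "/news;/pictures;/press"
blocked = [k for k in exclude.split(";") if k]

def should_scan(url: str) -> bool:
    # Drop the URL if any excluded keyword appears in it as a substring.
    return not any(k in url for k in blocked)

print(should_scan("https://example.com/products/chatbot"))  # True
print(should_scan("https://example.com/news/2024-launch"))  # False (skipped)
```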
🧼 Exclude Useless Page Elements
Field: Advanced Settings → Exclude element IDs
Usage: Prevent scanning of headers, footers, and other layout elements. This also stops links from those areas from being followed in the full website scan.
Examples:
```
header;footer;#menu;.sidebar
```
This will remove HTML blocks that are repeated on every page and generally contain no useful content for your chatbot.
⚠️ Important: If you use full website scan and exclude elements like header, it may prevent the scanner from discovering deeper links (menus are often inside headers). In this case, using a sitemap is the safest and most efficient method.
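Conceptually, excluding layout elements amounts to something like the following sketch, shown here with the third-party BeautifulSoup library (ChatLab performs this server-side; the HTML is a made-up example). Plain names like header match tag names, a # prefix matches an ID, and a . prefix matches a class:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <header>Home | Contact | Privacy | FAQ</header>
  <div id="menu">Products; Pricing; Blog</div>
  <main>Our chatbot answers support questions 24/7.</main>
  <div class="sidebar">Latest posts...</div>
  <footer>© Example Inc.</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for selector in ["header", "footer", "#menu", ".sidebar"]:
    for element in soup.select(selector):
        element.decompose()  # remove the element and everything inside it

print(soup.get_text(strip=True))  # only the <main> content survives
```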
📄 Scrape documents
Field: Advanced Settings → Scrape documents
Usage: When enabled, the system will also extract and process document content. Currently, this option only supports PDF files found on the website.
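For context, extracting text from a PDF conceptually looks like this sketch using the third-party pypdf package (ChatLab's own pipeline is internal, and document.pdf is a placeholder file name):

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")  # placeholder path
# Concatenate the extracted text of every page into one string.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(text)} characters from {len(reader.pages)} pages")
```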
🌍 Simulate Visit From Specific Country (Optional)
If your site shows different content depending on visitor location, you can simulate scanning from a specific country using its 2-letter country code (e.g. US, PL, DE). You can find this option in the advanced settings for Website training.
🕐 Add Delay Between Requests (Optional)
Use the Delay field to prevent overloading your server. For example, add 5 seconds between each scanned page. You can find this option in advanced settings for Website training.
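In miniature, the delay setting amounts to something like this sketch (the URLs are placeholders):

```python
import time
import urllib.request

DELAY_SECONDS = 5  # pause between page fetches, as in the Delay field
urls = [
    "https://example.com/help/getting-started",
    "https://example.com/help/billing",
]

for url in urls:
    with urllib.request.urlopen(url) as resp:
        print(url, len(resp.read()), "bytes")
    time.sleep(DELAY_SECONDS)  # wait before the next request
```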
❌ Example of using exclude options
Let's do a small experiment using "Exclude image links" and "Exclude element IDs" on the page https://help.chatlab.com/. We will check the character count before applying the exclusions, then see by what percentage it drops once they are in place.
- First, we will check the character count without applying any excludes.
As we can see, the character count is 300,042. The Help Center page is not among the largest, but it can certainly be reduced. Let's move on to the next step.
- Application of selected options.
We've chosen very basic options: removing image links plus the header and footer. The "Exclude element IDs" field can be used far more extensively, with longer lists of classes and IDs; here we opted for just two selectors.
- Effects.
We observed a nearly 25% decrease in the character count! A very simple option that took no more than 10 seconds to configure reduced the count by roughly a quarter on a page with 300,000 characters.
Keep in mind that the reduction depends on the page being scanned: its type, its content, and how many images it carries. These simple options won't consistently cut 25% on every page; on one it might be 10%, on another 60%.
✅ Summary: Best Practices to Reduce Training Characters
- 🗂️ Use sitemap instead of full scan for precise control.
- ➕ Use Include URLs to target only valuable content.
- ➖ Use Exclude URLs to skip irrelevant or heavy pages.
- 🚫 Use Exclude element IDs to remove repeated page parts like headers and footers.
- 🧠 Focus the scan on useful content for your bot (e.g. FAQs, product pages, support).