How to Reduce Training Characters When Scanning a Website in ChatLab
- Why Less Training Can Be Better: RAG and the Problem of Noise
- 🚨 Why Too Much Data Hurts Performance
- ✅ High Signal, Low Noise = Better Answers
- 🔍 TL;DR
- ✅ Use Sitemap Instead of Full Website Scan
- ⚙️ Use Advanced Filters to Fine-Tune Scanning
- 🧩 Include Only Specific URLs
- 🚫 Exclude Irrelevant URLs
- 🧼 Exclude Useless Page Elements
- 🌍 Simulate Visit From Specific Country (Optional)
- 🕐 Add Delay Between Requests (Optional)
- ✅ Summary: Best Practices to Reduce Training Characters
Why Less Training Can Be Better: RAG and the Problem of Noise
ChatLab uses a RAG (Retrieval-Augmented Generation) architecture. This means that instead of memorizing all website content, the AI retrieves the most relevant pieces of information at the moment a user asks a question, and only then uses them to generate an answer.
🚨 Why Too Much Data Hurts Performance
More data doesn't always mean better answers. In fact, adding irrelevant or redundant content can confuse the retrieval process. Here's why:
- RAG works by scoring chunks of content for relevance to the user's question. If your site contains many pages or elements with repetitive, vague, or generic text (like menus, footers, press releases, or blog tag pages), they dilute the quality of search results.
- When the chatbot retrieves multiple low-value chunks that share keywords (but lack helpful context), the model might generate inaccurate, overly general, or off-topic answers.
- For example: if your footer contains links like "Home", "Contact", "Privacy", and "FAQ" on every page, and these are scanned, they will flood the knowledge base with noise. When a user asks "How can I contact you?", the bot might respond with a generic paragraph from the footer instead of a helpful phone number or email. The toy sketch below makes this concrete.
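This dilution is easy to reproduce. Below is a toy Python sketch (not ChatLab's actual retrieval pipeline, just an illustration of the principle) in which a footer scanned from three pages ties with the one genuinely useful chunk:

```python
import string

# Toy keyword-overlap retriever: not ChatLab's actual RAG pipeline. It only
# illustrates how boilerplate repeated on every page crowds the retrieved context.

def tokens(text: str) -> set:
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def score(chunk: str, question: str) -> int:
    """Toy relevance score: how many question words appear in the chunk."""
    return len(tokens(chunk) & tokens(question))

# One genuinely useful chunk, plus the same footer scanned from three pages.
chunks = [
    "Contact support: email help@example.com or call +1 555 0100.",
    "Home | Contact | Privacy | FAQ",  # footer, page 1
    "Home | Contact | Privacy | FAQ",  # footer, page 2
    "Home | Contact | Privacy | FAQ",  # footer, page 3
]

question = "How can I contact you?"
top3 = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:3]
print(top3)  # two of the three retrieved chunks are identical footer noise
```

With a retrieval budget of three chunks, two slots are wasted on identical footer copies, leaving less room for content that could actually answer the question.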
✅ High Signal, Low Noise = Better Answers
By scanning only useful content — like:
- Product descriptions
- Real FAQ pages
- Support articles
- Policy documents
- Key landing pages
...you help the AI retrieve clear, specific, high-quality context for each user question. This leads to more accurate, confident, and helpful responses.
🔍 TL;DR
More content = more noise.
More noise = worse search and worse answers.
Focused, clean, relevant content = best chatbot results
When training your chatbot on a website, it's important to limit the amount of scanned data to what's most useful. This helps reduce character usage, speeds up training, and improves answer quality by focusing only on relevant content.
Here’s how you can optimize your scan using the Advanced settings in the "Train on Website" screen:
✅ Use Sitemap Instead of Full Website Scan
Recommended: Use the sitemap option whenever possible.
📌 Why? Full website scan follows all visible links on the page, including links from footers, menus, and sidebars — which often leads to scanning repetitive or non-valuable content. These sections are typically present on every page and rarely include unique information useful for chat answers.
By using a sitemap (usually located at https://yourdomain.com/sitemap.xml), you:
- Avoid crawling unnecessary pages (e.g. legal, social links, repeated blog tags)
- Control exactly which URLs are scanned
- Prevent accidental overuse of characters from menus, footers, and headers
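If you want to check what a sitemap contains before training, you can list its URLs yourself. Here is a minimal sketch using only Python's standard library (the domain is a placeholder; large sites may use a sitemap index that points to sub-sitemaps, which this sketch does not follow):

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://yourdomain.com/sitemap.xml"  # placeholder domain
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    root = ET.fromstring(resp.read())

# A <urlset> contains <url><loc>...</loc></url> entries; sitemap index files
# use <sitemap><loc>...</loc></sitemap> instead, which this sketch ignores.
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"{len(urls)} URLs listed:")
for url in urls:
    print(url)
```

Reviewing this list first tells you exactly how many pages a scan will cover, and which include/exclude keywords (next section) would trim it further.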
⚙️ Use Advanced Filters to Fine-Tune Scanning
🧩 Include Only Specific URLs
Field: Advanced Settings → Include only URLs that contain
Usage: Add semicolon-separated keywords to include only specific URLs in the scan.
Example: to scan only English help pages:

```
/en;help
```

This will only scan pages containing /en or help in the URL, saving characters by skipping irrelevant sections.
🚫 Exclude Irrelevant URLs
Field: Advanced Settings → Exclude URLs that contain
Usage: Add semicolon-separated keywords to skip sections like blogs, images, or news.
Example:

```
/news;/pictures;/press
```

This avoids scanning content-heavy but chatbot-irrelevant sections like news or image galleries.
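Conceptually, both fields act as substring filters over each candidate URL. The sketch below shows one plausible reading of the semicolon-separated matching (an assumption for illustration; ChatLab's internal matching rules may differ):

```python
# Illustrative include/exclude filtering over candidate URLs. The exact
# matching semantics are an assumption; ChatLab applies its own logic.

def matches_filters(url: str, include: str = "", exclude: str = "") -> bool:
    """Keep a URL if it contains any include keyword and no exclude keyword."""
    include_terms = [t for t in include.split(";") if t]
    exclude_terms = [t for t in exclude.split(";") if t]
    if include_terms and not any(t in url for t in include_terms):
        return False
    return not any(t in url for t in exclude_terms)

urls = [
    "https://yourdomain.com/en/help/getting-started",  # kept: matches /en and help
    "https://yourdomain.com/en/news/2024-update",      # dropped: matches /news
    "https://yourdomain.com/de/ueber-uns",             # dropped: no include keyword
]

kept = [u for u in urls if matches_filters(u, "/en;help", "/news;/pictures;/press")]
print(kept)  # only the English help page passes both filters
```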
🧼 Exclude Useless Page Elements
Field: Advanced Settings → Exclude element IDs
Usage: Prevent scanning of headers, footers, and other layout elements. This also stops links from those areas from being followed in the full website scan.
Example:

```
header;footer;#menu;.sidebar
```
This will remove HTML blocks that are repeated on every page and generally contain no useful content for your chatbot.
⚠️ Important: If you use full website scan and exclude elements like header, it may prevent the scanner from discovering deeper links (menus are often inside headers). In this case, using a sitemap is the safest and most efficient method.
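If you want to estimate how many characters element exclusion saves on your own pages, you can reproduce the effect locally. Here is a sketch using Python with BeautifulSoup (pip install beautifulsoup4), assuming the tokens map onto tag names, #ids, and .classes as CSS selectors; ChatLab performs the actual stripping server-side:

```python
from bs4 import BeautifulSoup

# Minimal page with repeated layout blocks around one useful paragraph.
html = """
<html><body>
  <header id="menu"><a href="/">Home</a><a href="/contact">Contact</a></header>
  <main>Our support team answers within 24 hours.</main>
  <footer class="sidebar">Privacy | FAQ | Press</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Assumption: each token ("header", "#menu", ".sidebar") works as a CSS selector.
for selector in "header;footer;#menu;.sidebar".split(";"):
    for element in soup.select(selector):
        element.decompose()  # remove the element and everything inside it

text = soup.get_text(" ", strip=True)
print(len(text), text)  # only the <main> content remains to be counted
```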
🌍 Simulate Visit From Specific Country (Optional)
If your site shows different content depending on visitor location, you can simulate scanning from a specific country using its 2-letter country code (e.g. US, PL, DE). You can find this option in the advanced settings for Website training.
🕐 Add Delay Between Requests (Optional)
Use the Delay field to prevent overloading your server. For example, add 5 seconds between each scanned page. You can find this option in advanced settings for Website training.
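The effect of the delay is simply to space out fetches so your server never sees a burst of requests. A toy sketch of the same idea (placeholder URLs; ChatLab applies the delay for you when you set the field):

```python
import time
import urllib.request

urls = [
    "https://yourdomain.com/en/help/getting-started",  # placeholder URLs
    "https://yourdomain.com/en/help/billing",
]

DELAY_SECONDS = 5  # mirrors the 5-second example above

for url in urls:
    with urllib.request.urlopen(url) as resp:
        print(url, resp.status, len(resp.read()), "bytes")
    time.sleep(DELAY_SECONDS)  # wait before requesting the next page
```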
✅ Summary: Best Practices to Reduce Training Characters
- 🗂️ Use sitemap instead of full scan for precise control.
- ➕ Use Include URLs to target only valuable content.
- ➖ Use Exclude URLs to skip irrelevant or heavy pages.
- 🚫 Use Exclude element IDs to remove repeated page parts like headers and footers.
- 🧠 Focus the scan on useful content for your bot (e.g. FAQs, product pages, support).