Indexing Content
An important part of building a good search experience is selecting only the relevant content for indexing, e.g. avoiding text that is repeated on multiple pages such as menus and footers.
The Findkit crawler automatically extracts the relevant page content on a best-effort basis, but for production use cases it is recommended to explicitly select the content for indexing with the following methods.
Data Attribute
If you are building the site yourself, the simplest option is to mark the content to be indexed right in the HTML using the data-fdk-content data attribute.
Example
<div class="content" data-fdk-content>The content text...</div>
For the full list of data attributes, see data-attributes.
You can use the attribute multiple times on a single page. The crawler selects the text content from all elements carrying it and combines them into the page index. Use the findkit inspect <url> CLI command to view the currently indexed content of a page and findkit crawl test <url> to see what content would be selected for the index.
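For example, to preview what would be selected from a single page and then inspect what is already in the index (the URL is hypothetical):
findkit crawl test "https://example.com/blog/hello-world"
findkit inspect "https://example.com/blog/hello-world"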
Custom CSS Selector
If you cannot add the data attributes, you can use the content_selector field in findkit.toml to select content using CSS selectors.
Example
[[targets]]
host = "example.com"
content_selector = ".content, .post"
You can define multiple selectors by separating them with commas. If multiple elements match, the content is indexed from all of them.
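For example, with the configuration above, both of these (hypothetical) elements would be picked up and their text combined into the same page index:
<div class="content">Intro text...</div>
<article class="post">Post body text...</article>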
Cleaning Content
Sometimes the selected content contains inner elements that you do not want to index. These elements can be removed from the indexed content with the skip data attribute data-fdk-skip or the cleanup selector.
<div class="content" data-fdk-content>
The content text...
<div class="author-bio" data-fdk-skip>
Author bio that is also on the dedicated author page.
</div>
</div>
or with cleanup_selector in findkit.toml:
[[targets]]
host = "example.com"
content_selector = ".content, .post"
cleanup_selector = ".author-bio"
Title Selector
Findkit tries to determine the page title automatically using built-in default selectors, but this behaviour can be customized with the options listed below.
Default title selectors in order of importance:
- the data-fdk-title data attribute
- the <meta property="og:title"> element
- the <title> element
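For example, given standard head markup like this (hypothetical page), the og:title value wins over the title element:
<head>
  <title>Hello World | Example Site</title>
  <meta property="og:title" content="Hello World" />
</head>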
Meta Tag
The title can be defined directly in the page meta.
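A minimal sketch, assuming the page meta is the JSON script tag described in the meta tag documentation:
<script id="findkit" type="application/json">
  {
    "title": "Custom Title"
  }
</script>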
Meta tag CSS Selector
A title CSS selector can be passed in the page meta.
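A sketch assuming a titleSelector field, named by analogy with the titleSelectorRegex field below:
<script id="findkit" type="application/json">
  {
    "titleSelector": ".post-heading"
  }
</script>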
Data Attribute
The title can be defined with the data-fdk-title data attribute.
<h1 data-fdk-title>Custom Title</h1>
CSS Selector in findkit.toml
You can pass a CSS selector in findkit.toml. The content of the first element matching the selector is chosen as the title.
Example
[[targets]]
host = "example.com"
title_selector = "h1"
Modifying selected title
Once chosen, the title can be modified using one of the options below.
Meta tag titleSelectorRegex
The chosen title can be modified with a regex defined in the page meta.
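A sketch using the same assumed script tag format with the titleSelectorRegex field:
<script id="findkit" type="application/json">
  {
    "titleSelectorRegex": "([^ ]+)"
  }
</script>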
Config title_selector_regex
The chosen title can be modified with a regex defined in findkit.toml.
Example
<h1>Title - Unwanted</h1>
[[targets]]
host = "example.com"
title_selector = "h1"
title_selector_regex = "([^ ]+)"
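Here the capture group matches the first run of non-space characters, so the indexed title becomes "Title" and " - Unwanted" is dropped.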
Skipping Pages
Your website might include pages that you do not want to appear in the search index. For this purpose you can use one of the options listed below.
Robots.txt
Place a /robots.txt file in the website root.
For more information, refer to the documentation.
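For example, a minimal robots.txt that blocks all crawlers from a (hypothetical) /internal/ section:
User-agent: *
Disallow: /internal/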
You can instruct the Findkit crawler not to respect the /robots.txt rules with the respect_robots_txt configuration option.
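A sketch of disabling the rules in findkit.toml, assuming respect_robots_txt is a boolean option on the target:
[[targets]]
host = "example.com"
respect_robots_txt = false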
Meta Robots Tag
By placing a robots meta tag on an individual page you can instruct crawlers, Findkit included, not to index the page or not to follow links on the page.
For more information see Google's documentation.
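For example, the standard robots meta tag that asks crawlers not to index a page or follow its links:
<meta name="robots" content="noindex, nofollow" />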
You can instruct the Findkit crawler not to respect the robots meta rules with the respect_robots_meta configuration option.
Deny Patterns
You can define patterns for the crawler to skip in findkit.toml.
Patterns can be plain strings or regexes given as strings starting with reg:. Regexes must be written as strings without the encapsulating //.
Plain string patterns are matched against the start of the URL pathname. Regular expression patterns are matched against the whole URL pathname.
You can define as many deny patterns as you need, but there are better options available for skipping individual URLs.
Example deny pattern string:
[[targets]]
host = "example.com"
content_selector = ".content, .post"
deny_patterns = ["/secret/"]
Would cause the crawler to skip:
- example.com/secret/
- example.com/secret/first_secret
But would not skip:
- example.com/secret (the deny pattern ends with an extraneous /)
- example.com/some_folder/secret/ (the deny pattern is matched against the start of the URL pathname)
Example deny pattern regular expression:
[[targets]]
host = "example.com"
content_selector = ".content, .post"
deny_patterns = ["reg:/secret"]
Would cause the crawler to skip:
- example.com/secret
- example.com/secret/
- example.com/secret/first_secret
- example.com/some_folder/secret/
- example.com/some_folder/secret/first_secret
- example.com/secretariat/big_red