Options
Options for the findkit.toml configuration file.
Top-Level Options
id: string
The project identifier within the organization.
name: string
Human-readable name of the project. Displayed in the Hub.
description: string
Description of the project. Displayed in the Hub.
ml_model: "openai"
Machine learning model to use for generating text embedding vectors for Semantic AI search. Required with the semantic search param.
ml_model = "openai"
Available only in some subscription plans. See pricing for details.
This option must be set before the first crawl since it is used when initializing the index. If the project is crawled before setting this option, the index must be reset in the project settings to change the model.
schedule_full_crawl: string
Run the full crawl automatically with the given schedule.
Allowed values
weekly
daily
every-3-days
every-2-days
This is the same as running findkit crawl start. Read more from "Running Crawls".
schedule_partial_crawl: string
Run a partial crawl automatically with the given schedule. This is the same as running findkit crawl start --partial. Read more from "Running Crawls".
Allowed values
weekly
daily
every-3-days
every-2-days
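For example, to run a partial crawl every day (using one of the allowed values listed above):
schedule_partial_crawl = "daily"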
targets: object[]
List "targets" aka domains to crawl content from. See [[targets]]
[[targets]]
Options for [[targets]] sections.
This is an array of tables. See the TOML docs on Arrays.
host: string
Target host to crawl. Just a plain domain name without the https:// prefix.
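A minimal target definition could look like this (example.com stands in for your own domain):
[[targets]]
host = "example.com"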
use_sitemap: boolean
Use sitemaps for crawling the site. Reads Sitemap: entries from /robots.txt and, if no entries are found, falls back to /sitemap.xml.
Follows sitemap index files and HTTP redirects.
Defaults to true.
walk_links: boolean
Find site pages by walking the links.
Disabled by default but automatically enabled if no sitemaps are found.
This behaviour can be explicitly disabled by setting it to false.
Setting this to true disables sitemap reading unless use_sitemap = true is explicitly set.
E.g. if you want both sitemap reading and link walking you need to set use_sitemap = true and walk_links = true, as shown in the example below.
For full control over which links are walked it is possible to modify the links array in the html worker handler.
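For example, to combine sitemap reading with link walking as described above:
[[targets]]
host = "example.com"
use_sitemap = true
walk_links = true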
walk_query_strings: boolean
Walk links with query strings as unique pages when walk_links is enabled.
Defaults to true.
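For example, to disable walking query-string links as unique pages:
walk_query_strings = false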
start_paths: string[]
List of paths where link walking starts when walk_links is enabled.
Defaults to /.
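For example, to start link walking from a specific section of the site (the /blog path here is only an illustration):
start_paths = ["/blog"]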
content_selector: string
CSS selector used to select the text content for indexing.
Read more from the Indexing Content page.
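For example, assuming the indexable text lives inside a main element (the selector is only an illustration):
content_selector = "main"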
content_no_highlight_selector: string
Get the value for contentNoHighlight using a CSS selector.
cleanup_selector: string
Remove elements matching this selector from the DOM tree before extracting text.
Read more from the Indexing Content page.
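For example, to drop navigation and footer elements before text extraction (the selectors are only illustrations):
cleanup_selector = "nav, footer"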
respect_robots_meta: boolean
Respect robots meta tags like
<meta name="robots" content="noindex, nofollow" />
Pages that are not indexed because of this tag get a "Denied by robots meta" message.
Defaults to true.
respect_robots_txt: boolean
Respect /robots.txt rules.
Defaults to true.
sitemaps: string[]
Explicitly use sitemaps from these paths. When defined, the sitemap entries in robots.txt are ignored. The paths may reference an actual sitemap or a sitemap index. Only paths are supported.
If this option is not defined and robots.txt does not contain sitemap entries, /sitemap.xml is read.
Example
sitemaps = ["/custom/sitemap.xml", "/another/sitemap.xml"]
When possible it is recommended to use the Sitemap: entries in the robots.txt file instead of this option.
It is the de facto standard for sitemap discovery and is supported by other crawlers, such as Google, as well.
Note that Sitemap: entries can be defined multiple times in the robots.txt file.
User-agent: *
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap2.xml
deny_patterns: string[]
Skip paths matching the given patterns. Matched against the URL pathname.
Supports string prefixes and regexes. See Indexing Content for details.
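For example, to skip some paths by string prefix (the paths are only illustrations; regex patterns are also supported as described on the Indexing Content page):
deny_patterns = ["/wp-admin", "/tag/"]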
max_pages: number
Max pages to crawl. This is a safety limit to make sure the crawler stops if the site generates pages and links infinitely.
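For example, to stop the crawl after ten thousand pages (the limit value is only an illustration):
max_pages = 10000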
cache_bust: boolean
Add a random query string to the crawl HTTP requests. This can cause a lot of load on the target web server as caches will very likely be bypassed, but it can be used to ensure that the crawler always sees the latest version of the pages.
Defaults to false.
crawl_pdfs: boolean
Crawl PDF files too. See the PDF docs for details.
Defaults to false.
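For example, to include PDF files in the crawl:
crawl_pdfs = true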
tags: Array
Array of tagging matchers. Documented on the dedicated page.
request_headers: Object
Request headers to send with the HTTP requests the crawler makes. These can be used, for example, to authenticate the crawler with non-public websites.
Example
[[targets]]
host = "intra.example.com"
# Send basic auth header
request_headers = { Authorization = "Basic ZmluZGtpdDpodW50ZXIyCg==" }
concurrency: number
How many concurrent requests to make to your sites. Defaults to 5, but if you encounter "429 Too Many Requests" or "503 Service Unavailable" errors you might want to lower this value.
crawl_delay: number
If lowering concurrency to 1 is not enough, you can add an additional delay between the requests. Note that this delay is counted towards your subscription crawl time. The delay is set in milliseconds.
Example
[[targets]]
host = "example.com"
# Send only one request every 500ms
concurrency = 1
crawl_delay = 500
request_timeout: number
Set the request timeout in milliseconds. Defaults to 10000 (10 seconds).
Generally you should avoid setting this to a very high value as it can cause the crawler to burn through your crawl time if your website is very slow to respond.
Example
[[targets]]
host = "example.com"
request_timeout = 60000
[search-endpoint]
Search endpoint configuration. Changes to the search endpoint configuration might take up to 10 minutes to propagate.
origin_domains: string[]
List of origin domains from which the search endpoint can be accessed, e.g. the domains where the Findkit UI library can be installed. The domain is validated using the Origin header sent by browsers.
Defaults to the first target host.
Example
[search-endpoint]
origin_domains = ["mysite.example"]
private: boolean
Make the search endpoint private by requiring a JWT token. Must be combined with public_key.
Defaults to false.
public_key: string
When private is set to true, this RS256 public key is used to validate the JWT tokens in the search requests.
See our WordPress plugin for full integration.
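A sketch of a private endpoint configuration; the key value below is only a placeholder for your actual RS256 public key:
[search-endpoint]
private = true
public_key = """
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
"""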
allow_content: boolean
Allow usage of the content query in the search params.
Defaults to false.
Example
[search-endpoint]
allow_content = true