Options
Options for the findkit.toml configuration file.
Top-Level Options
id: string
The project identifier within the organization.
name: string
Human-readable name of the project. Displayed in the Hub.
description: string
Description of the project. Displayed in the Hub.
ml_model: "openai"
Machine learning model to use for generating text embedding vectors for Semantic AI search. Required with the semantic search param.
ml_model = "openai"
Available only in some subscription plans. See pricing for details.
This option must be set before the first crawl since it is used when initializing the index. If the project is crawled before setting this option, the index must be reset in the project settings to change the model.
schedule_full_crawl: string
Run the full crawl automatically with the given schedule.
Allowed values
weekly
daily
every-3-days
every-2-days
This is the same as running findkit crawl start. Read more from "Running Crawls".
schedule_partial_crawl: string
Run a partial crawl automatically with the given schedule. This is the same as running findkit crawl start --partial. Read more from "Running Crawls".
Allowed values
weekly
daily
every-3-days
every-2-days
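For example, to run a partial crawl every day (using one of the allowed values listed above):
schedule_partial_crawl = "daily"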
targets: object[]
List "targets" aka domains to crawl content from. See [[targets]]
[[targets]]
Options for [[targets]] sections.
This is an array of tables. See the TOML docs on Arrays.
host: string
Target host to crawl. Just a plain domain name without the https:// prefix.
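A minimal target definition could look like this (example.com stands in for your own domain):
[[targets]]
host = "example.com"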
use_sitemap: boolean
Use sitemaps for crawling the site. Reads Sitemap: entries from /robots.txt and, if no entries are found, falls back to /sitemap.xml.
Follows sitemap index files and HTTP redirects.
Defaults to true.
walk_links: boolean
Find site pages by walking the links.
Disabled by default but automatically enabled if no sitemaps are found.
This behaviour can be explicitly disabled by setting it to false.
Setting this to true disables sitemap reading unless use_sitemap = true is explicitly set.
E.g. if you want both sitemap reading and link walking you need to set use_sitemap = true and walk_links = true, as shown in the example below.
For full control over which links are walked it is possible to modify the links array in the html worker handler.
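For example, to combine sitemap reading with link walking as described above:
[[targets]]
host = "example.com"
use_sitemap = true
walk_links = true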
walk_query_strings: boolean
Walk links with query strings as unique pages when walk_links is enabled.
Defaults to true.
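For example, to disable walking query-string links as unique pages:
walk_query_strings = false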
start_paths: string[]
List of paths where link walking starts when walk_links is enabled.
Defaults to /.
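For example, to start link walking from a specific section of the site (the /blog path here is only an illustration):
start_paths = ["/blog"]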
content_selector: string
CSS selector used to select the text content for indexing.
Read more from the Indexing Content page.
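For example, assuming the indexable text lives inside a main element (the selector is only an illustration):
content_selector = "main"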
content_no_highlight_selector: string
Get the value for contentNoHighlight using a CSS selector.
cleanup_selector: string
Remove elements matching this selector from the DOM tree before extracting text.
Read more from the Indexing Content page.
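For example, to drop navigation and footer elements before text extraction (the selectors are only illustrations):
cleanup_selector = "nav, footer"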
respect_robots_meta: boolean
Respect robots meta tags like
<meta name="robots" content="noindex, nofollow" />
Pages that are not indexed because of this tag get a "Denied by robots meta" message.
Defaults to true.
respect_robots_txt: boolean
Respect /robots.txt rules.
Defaults to true.
sitemaps: string[]
Explicitly use sitemaps from these paths. When defined, the sitemap entries in robots.txt are ignored. The paths may reference an actual sitemap or a sitemap index. Only paths are supported.
If this option is not defined and robots.txt does not contain sitemap entries, /sitemap.xml is read.
Example
sitemaps = ["/custom/sitemap.xml", "/another/sitemap.xml"]
When possible it is recommended to use the Sitemap: entries in the robots.txt file instead of this option.
It is the de facto standard for sitemap discovery and is supported by other crawlers, such as Google, as well.
Note that Sitemap: entries can be defined multiple times in the robots.txt file.
User-agent: *
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap2.xml
deny_patterns: string[]
Skip paths matching the given patterns. Matched against the URL pathname.
Supports string prefixes and regexes. See Indexing Content for details.
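For example, to skip some paths by string prefix (the paths are only illustrations; regex patterns are also supported as described on the Indexing Content page):
deny_patterns = ["/wp-admin", "/tag/"]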
max_pages: number
Max pages to crawl. This is a safety limit to make sure the crawler stops if the site generates pages and links infinitely.
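For example, to stop the crawl after ten thousand pages (the limit value is only an illustration):
max_pages = 10000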
cache_bust: boolean
Add a random query string to the crawl HTTP requests. This can cause a lot of load on the target web server as caches will very likely be bypassed, but it can be used to ensure that the crawler always sees the latest version of the pages.
Defaults to false.
crawl_pdfs: boolean
Crawl PDF files too. See the PDF docs for details.
Defaults to false.
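For example, to include PDF files in the crawl:
crawl_pdfs = true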
tags: Array
Array of tagging matchers. Documented on the dedicated page.
request_headers: Object
Request headers to send with the HTTP requests the crawler makes. These can be used, for example, to authenticate the crawler with non-public websites.
Example
[[targets]]
host = "intra.example.com"
# Send basic auth header
request_headers = { Authorization = "Basic ZmluZGtpdDpodW50ZXIyCg==" }
concurrency: number
How many concurrent requests to make to your sites. Defaults to 5, but if you encounter "429 Too Many Requests" or "503 Service Unavailable" errors you might want to lower this value.
crawl_delay: number
If lowering concurrency to 1 is not enough, you can add an additional delay between the requests. Note that this delay is counted towards your subscription crawl time. The delay is set in milliseconds.
Example
[[targets]]
host = "example.com"
# Send only one request every 500ms
concurrency = 1
crawl_delay = 500
request_timeout: number
Set the request timeout in milliseconds. Defaults to 10000 (10 seconds).
Generally you should avoid setting this to a very high value as it can cause the crawler to burn through your crawl time if your website is very slow to respond.
Example
[[targets]]
host = "example.com"
request_timeout = 60000
[search-endpoint]
Search endpoint configuration. Changes to the search endpoint configuration might take up to 10 minutes to propagate.
origin_domains: string[]
List of origin domains from which the search endpoint can be accessed, e.g. the domains where the Findkit UI library can be installed. The domain is validated using the Origin header sent by browsers.
Defaults to the first target host.
Example
[search-endpoint]
origin_domains = ["mysite.example"]
private: boolean
Make the search endpoint private by requiring a JWT token. Must be combined with public_key.
Defaults to false.
public_key: string
When private is set to true, this RS256 public key is used to validate the JWT tokens in the search requests.
See our WordPress plugin for full integration.
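A sketch of a private endpoint configuration; the key value below is only a placeholder for your actual RS256 public key:
[search-endpoint]
private = true
public_key = """
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
"""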
allow_content: boolean
Allow usage of the content query in the search params.
Defaults to false.
Example
[search-endpoint]
allow_content = true