Options

Top-Level Options

id: string

The project identifier within the organization.

name: string

Human-readable name of the project. Displayed in the Hub.

description: string

Description of the project. Displayed in the Hub.

schedule_full_crawl: string

Run the full crawl automatically with the given schedule.

Allowed values

  • weekly
  • daily
  • every-3-days
  • every-2-days

This is the same as running findkit crawl start. Read more from "Running Crawls".
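
For example, a minimal top-level sketch with placeholder id, name, and description values:

id = "my-project"
name = "My Project"
description = "Site search for example.com"
schedule_full_crawl = "weekly"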

schedule_partial_crawl: string

Run the partial crawl automatically with the given schedule. This is the same as running findkit crawl start --partial. Read more from "Running Crawls".

Allowed values

  • weekly
  • daily
  • every-3-days
  • every-2-days
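
For example, to combine a weekly full crawl with a daily partial crawl (an illustrative sketch):

schedule_full_crawl = "weekly"
schedule_partial_crawl = "daily"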

targets: object[]

List "targets" aka domains to crawl content from. See [[targets]]

[[targets]]

Options for [[targets]] sections.

This is an array of tables. See the TOML docs on Arrays.
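
For example, two targets are defined by repeating the [[targets]] table (the hosts are placeholders):

[[targets]]
host = "example.com"

[[targets]]
host = "docs.example.com"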

host: string

Target host to crawl. Just a plain domain name without the https:// prefix.

use_sitemap: boolean

Read the site sitemap.

Defaults to true.

walk_links: boolean

Find site pages by walking the links.

Disabled by default, but automatically enabled if no sitemaps are found. This fallback can be disabled by explicitly setting walk_links to false.

start_paths: string[]

List of paths to start link walking from when walk_links is enabled.

Defaults to /
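
For example, a sketch that enables link walking from two placeholder start paths:

[[targets]]
host = "example.com"
walk_links = true
start_paths = ["/docs/", "/blog/"]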

content_selector: string

CSS selector used to select the text content for indexing.

Read more from the Indexing Content page.
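
For example, to index only the text inside a hypothetical main content element (the selector is a placeholder for your site's own markup):

[[targets]]
host = "example.com"
content_selector = "main.content"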

content_no_highlight_selector: string

Get value for contentNoHighlight using a CSS selector.

cleanup_selector: string

CSS selector used to exclude elements from indexing.

Read more from the Indexing Content page.
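
For example, to exclude navigation, footer and cookie banner elements from indexing (the selectors are placeholders):

[[targets]]
host = "example.com"
cleanup_selector = "nav, footer, .cookie-banner"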

respect_robots_meta: boolean

Respect robots meta tags like

<meta name="robots" content="noindex, nofollow" />

Defaults to true.

respect_robots_txt: boolean

Respect /robots.txt rules.

Defaults to true.
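
For example, a hypothetical staging site that blocks all robots could still be crawled by turning both options off:

[[targets]]
host = "staging.example.com"
respect_robots_meta = false
respect_robots_txt = false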

sitemaps: string[]

Explicitly use sitemaps from these paths. When defined, the sitemap entries in robots.txt are ignored. The paths may reference an actual sitemap or a sitemap index. Only paths are supported, not full URLs.

If this option is not defined and robots.txt does not contain sitemap entries, /sitemap.xml is read.

Example

sitemaps = ["/custom/sitemap.xml", "/another/sitemap.xml"]

deny_patterns: string[]

Skip paths matching the given patterns. Matched against the URL pathname.

Supports string prefixes and regexes. See Indexing Content for details.
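
For example, to skip everything under two hypothetical path prefixes (see Indexing Content for regex patterns):

[[targets]]
host = "example.com"
deny_patterns = ["/wp-admin/", "/search/"]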

max_pages: number

Maximum number of pages to crawl. This is a safety limit to make sure the crawler stops if the site generates pages and links infinitely.
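
For example, a hypothetical cap that stops the crawl after 5000 pages:

[[targets]]
host = "example.com"
max_pages = 5000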

cache_bust: boolean

Add a random query string to the crawl HTTP requests. This can cause a lot of load on the target web server, as caches will very likely be bypassed, but it can be used to ensure that the crawler always sees the latest version of the pages.

Defaults to false.

crawl_pdfs: boolean

Crawl PDF files too. See the PDF docs for details.

Defaults to false.

tags: Array

Array of tagging matchers. Documented on the dedicated page.

request_headers: Object

Request headers to be sent with the HTTP requests the crawler sends out. These can be used, for example, to authenticate the crawler with non-public websites.

Example

[[targets]]
host = "intra.example.com"
# Send basic auth header
request_headers = { Authorization = "Basic ZmluZGtpdDpodW50ZXIyCg==" }

concurrency: number

How many concurrent requests to make to your site. Defaults to 5, but if you encounter "429 Too Many Requests" or "503 Service Unavailable" errors you might want to lower this value.

crawl_delay: number

If lowering concurrency to 1 is not enough, you can try adding an additional delay between the requests. The delay is set in milliseconds. Note that the delay is counted towards your subscription crawl time.

Example

[[targets]]
host = "example.com"
# Send only one request every 500ms
concurrency = 1
crawl_delay = 500

request_timeout: number

Set the request timeout in milliseconds. Defaults to 10000 (10 seconds). Generally you should avoid setting this to a very high value, as it can cause the crawler to burn through your crawl time if your website is very slow to respond.

Example

[[targets]]
host = "example.com"
request_timeout = 60000

[search-endpoint]

Search endpoint configuration.

origin_domains: string[]

List of origin domains from which the search endpoint can be accessed, e.g. the domains where the Findkit UI library can be installed. The domain is validated using the Origin header sent by browsers.

This defaults to the first target host.

Example

[search-endpoint]
origin_domains = ["mysite.example"]

private: boolean

Make the search endpoint private by requiring a JWT token. Must be combined with public_key.

Defaults to false.

public_key: string

When private is set to true, this RS256 public key is used to validate the JWT tokens in the search requests.

See our WordPress plugin for full integration.
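
For example, a sketch of a private search endpoint. The key below is a truncated placeholder and the PEM-style multi-line string is an assumption, so substitute your actual RS256 public key in the expected format:

[search-endpoint]
private = true
public_key = """
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
"""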