Options
Top-Level Options
id: string
The project identifier within the organization.
name: string
Human-readable name of the project. Displayed in the Hub.
description: string
Description of the project. Displayed in the Hub.
schedule_full_crawl: string
Run the full crawl automatically with the given schedule.
Allowed values
weekly
daily
every-3-days
every-2-days
This is the same as running findkit crawl start. Read more from "Running Crawls".
schedule_partial_crawl: string
This is the same as running findkit crawl start --partial. Read more from "Running Crawls". See the example after the allowed values list.
Allowed values
weekly
daily
every-3-days
every-2-days
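For example, to run a partial crawl every two days:
schedule_partial_crawl = "every-2-days"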
targets: object[]
List "targets" aka domains to crawl content from. See [[targets]]
[[targets]]
Options for [[targets]]
sections.
This is an array of tables. See the TOML docs on Arrays.
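A minimal target definition looks like this (example.com stands in for your own domain):
[[targets]]
host = "example.com"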
host: string
Target host to crawl. Just a plain domain name without the https://
prefix.
use_sitemap: boolean
Read the site sitemap.
Defaults to true.
walk_links: boolean
Find site pages by walking the links.
Disabled by default but automatically enabled if no sitemaps are found.
This fallback behaviour can be disabled by explicitly setting the option to false.
start_paths: string[]
List of paths to start link walking from when walk_links is enabled.
Defaults to /.
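For example, to enable link walking and start it from a couple of pages (the /blog path is only illustrative):
[[targets]]
host = "example.com"
walk_links = true
start_paths = ["/", "/blog"]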
content_selector: string
CSS selector used to select the text content for indexing.
Read more from the Indexing Content page.
cleanup_selector: string
CSS selector used to exclude elements from indexing.
Read more from the Indexing Content page.
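A sketch combining both selector options; the selector values are illustrative and depend on your site's markup:
[[targets]]
host = "example.com"
content_selector = "main"
cleanup_selector = ".cookie-banner"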
respect_robots_meta: boolean
Respect robots meta tags like
<meta name="robots" content="noindex, nofollow" />
Defaults to true.
respect_robots_txt: boolean
Respect /robots.txt rules.
Defaults to true.
deny_patterns: string[]
Skip paths matching the given patterns. Matched against the URL pathname.
Supports string prefixes and regexes. See Indexing Content for details.
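For example, within a [[targets]] table (both patterns are illustrative; see Indexing Content for the exact matching rules):
deny_patterns = ["/wp-admin/", '\.xml$']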
max_pages: number
Max pages to crawl. This is a safety limit to make sure the crawler stops in case the site generates pages and links infinitely.
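For example, within a [[targets]] table (the limit value here is arbitrary):
max_pages = 1000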
cache_bust: boolean
Add a random query string to the crawl HTTP requests. This can cause a lot of load on the target web server since caches will very likely be bypassed, but it can be used to ensure that the crawler always sees the latest version of the pages.
Defaults to false.
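To enable it within a [[targets]] table:
cache_bust = true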
crawl_pdfs: boolean
Crawl PDF files too. See the PDF docs for details.
Defaults to false.
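To enable PDF crawling within a [[targets]] table:
crawl_pdfs = true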
tags: Array
Array of tagging matchers. Documented on the dedicated page.
request_headers: Object
Request headers to be sent with the HTTP requests the crawler makes. These can be used, for example, to authenticate the crawler with non-public websites.
Example
[[targets]]
host = "intra.example.com"
# Send basic auth header
request_headers = { Authorization = "Basic ZmluZGtpdDpodW50ZXIyCg==" }
concurrency: number
How many concurrent requests to make to your sites. Defaults to 5, but if you encounter "429 Too Many Requests" or "503 Service Unavailable" errors you might want to lower this value.
crawl_delay: number
If lowering concurrency to 1 is not enough, you can try adding an additional delay between the requests. Note that this delay is counted towards your subscription crawl time. The delay is set in milliseconds.
Example
[[targets]]
host = "example.com"
# Send only one request every 500ms
concurrency = 1
crawl_delay = 500
[search-endpoint]
Search endpoint configuration.
origin_domains: string[]
List of origin domains from which the search endpoint can be accessed, e.g. the domains where the Findkit UI library can be installed. The domain is validated using the Origin header sent by browsers.
This defaults to the first target host.
Example
[search-endpoint]
origin_domains = ["mysite.example"]
private: boolean
Make the search endpoint private by requiring a JWT token. Must be combined with public_key.
Defaults to false.
public_key: string
When private is set to true, this RS256 public key is used to validate the JWT tokens in the search requests.
See our WordPress plugin for a full integration.
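A sketch of a private search endpoint, assuming the public key is provided as a PEM-formatted string (the value below is only a placeholder, not a real key):
[search-endpoint]
private = true
public_key = """
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
"""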