Running Crawls
Findkit can crawl pages in three modes. Full, partial and manual and the crawls can be started from the Hub, CLI, REST API or from the schedules depending on the crawl mode.
Full Crawl
Findkit crawler uses following algorithm to find pages to crawl in full mode:
- It checks for
/robots.txt
for what pages it is allowed to crawl - It reads the sitemaps reported in the
Sitemap:
entries in/robots.txt
- If sitemaps are found it only crawls the pages reported by the sitemaps
- If no sitemaps are found it fallbacks to walking links starting from
/
- The start paths can be customised by setting
start_paths
- The start paths can be customised by setting
This behaviour can be customized by setting use_sitemap
and/or walk_links
explicitly.
This will re-index every page on the site every time.
Full crawl can be started from the Hub, CLI, REST API and from a schdule defined in the toml config.
Example
findkit crawl start
Partial Crawl
Partial crawl only checks for sitemaps and only crawls the pages where the last
modified timestamp has changed since the last crawl. This can greatly save
crawl quotas compared to full crawls. The walk_links
option is always
ignored.
Partial crawl can be started from the Hub, CLI, REST API or from a schdule.
Example
findkit crawl start --partial
Manual Crawl
Manual crawl will crawl only the urls you specify when starting the crawl. Manual crawls can be started using CLI and REST API.
Example
findkit crawl start --manual "https://www.example.com/page"
Scheduling
Schedule crawls in the findkit.toml
file using the
schedule_full_crawl
and
schedule_partial_crawl
fields.
The schedules will be active immediately when the TOML file
is deployed with findkit deploy
. If you need to disable the
schedules just comment them out and run the deploy.