Testing Crawls
When working with crawlers you need to be able to check that crawler works correctly on your site without actually updating the production index.
For this we provide the findkit crawl test
subcommand. Example
findkit crawl test "https://docs.findkit.com/crawler/running-crawls/"
This will read current the findkit.toml
file from the current working
directory or one pointed by --path
and will ask Findkit crawler
infrastructure to crawl the given page with current toml config. Eg. you can
test toml changes that are not yet deployed with findkit deploy
. It will report
how it crawled the page.
The output is following
HTTP Status: 200
Crawl Status: ok
Message:
URL: https://docs.findkit.com/crawler/running-crawls/
Title: Running Crawls
Language: en
Created: 2023-10-24T07:54:15.483Z
Modified: 2023-10-24T07:54:15.483Z
Tags: language/en, crawler
Content: Running Crawls Findkit can crawl pages in three modes…
[add --content to show all 1634 characters]
It can also detect bad urls
$ findkit crawl test "https://docs.findkit.com/bad"
HTTP Status: 404
Crawl Status: skipped
Message: Gone
or toml bad toml configs
$ findkit crawl test "https://docs.findkit.com/crawler/running-crawls"
Crawl Status: skipped
Message: Denied by deny pattern: /crawler
or even toml invalid toml configs
$ findkit crawl test "https://docs.findkit.com/crawler/running-crawls"
[ERROR] Invalid TOML syntax at findkit.toml
[ERROR] Unterminated string at row 13, col 25, pos 345:
[ERROR] 12: [[targets]]
[ERROR] 13> host = "docs.findkit.com
[ERROR] ^
[ERROR] 14:
Testing local dev and staging sites
You can run the crawl against localhost with --local-http
and --target
.
findkit crawl test --local-http --target docs.findkit.com "http://localhost:3000/crawler/running-crawls/"
This uses the docs.findkit.com
target from the toml config and makes the HTTP request to
http://localhost:3000/crawler/running-crawls/
directly from the CLI process.
The HTML string is then passed to the Findkit crawler to be parsed. It will see
an artificial response to http://localhost:3000/crawler/running-crawls/
with
the locally fetched HTML. It does itself not make any outgoing network
requests.
If your toml file has only one target the --target
option can omitted when
using --local-http
This can be used crawl any site accessible from you local machine. For example staging sites that are behind firewalls etc.
Testing local files
It is also possible to crawl local files with --file
findkit crawl test --file page.html "https://docs.findkit.com/crawler/running-crawls/"
This can be useful if you need to quickly try out how to fix production site.
For example use curl to download a page
curl "https://docs.findkit.com/crawler/running-crawls/" > page.html
Try making some fixes to the html with your favorite editor
vim page.html
and see if it worked
findkit crawl test --file page.html "https://docs.findkit.com/crawler/running-crawls/"