Worker Handlers
A worker JavaScript file can define the following handlers to control the crawling process.
start(target)
Invoked when a target crawl starts. The crawl is stopped if the handler throws an error.
Params
context
host: string
The host entry defined in the TOML file
mode: string
The crawl mode (full, partial, manual, test)
Return: Promise<void>
Example
export default {
  async start(context) {
    console.log("Starting crawl on", context.host);
  },
};
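Because a thrown error stops the crawl, start() can also act as a guard. The following is a minimal sketch, assuming a hypothetical list of hosts that should never receive a full crawl; the host name and the BLOCKED_HOSTS constant are illustrative and not part of the crawler API.

```javascript
// Hypothetical guard list: the host below is an example value only.
const BLOCKED_HOSTS = new Set(["staging.example.com"]);

const worker = {
  async start(context) {
    // Throwing from start() stops the crawl for this target.
    if (context.mode === "full" && BLOCKED_HOSTS.has(context.host)) {
      throw new Error(`Full crawls are disabled for ${context.host}`);
    }
    console.log("Starting", context.mode, "crawl on", context.host);
  },
};

// In a worker file this object would be the default export:
// export default worker;
```

Other modes and hosts fall through to the log line, and the crawl proceeds as usual.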
end(target)
Emitted when a target crawl ends.
Params
context
host: string
The host entry defined in the TOML file
mode: string
The crawl mode (full, partial, manual, test)
Return: Promise<void>
Example
export default {
  async end(context) {
    console.log("Crawl completed for", context.host);
  },
};
fetch(request, context, next)
Emitted on every request made by the crawler. Can be used to modify or skip the outgoing request and the incoming response.
Params
request: Request
A Fetch API Request object
context
url: string
The request URL
mode: string
The crawl mode (full, partial, manual)
next: (request: Request) => Promise<Response>
Return: Promise<Response>
A Fetch API Response object
Examples
Modify the request
export default {
  async fetch(request, context, next) {
    request.headers.set("Authorization", "***");
    const response = await next(request);
    return response;
  },
};
Skip the network connection on certain URLs
export default {
  async fetch(request, context, next) {
    const url = new URL(request.url);
    // Hide robots.txt from the crawler
    if (url.pathname === "/robots.txt") {
      return new Response("not found", {
        status: 404,
      });
    }
    // Let other requests go out as normal
    return await next(request);
  },
};
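Modify the incoming response

The description above also mentions changing the incoming response. The following is a rough sketch of one way to do that, assuming the pages contain a hypothetical `<!-- preview-banner -->` marker that should be stripped before parsing. It uses only standard Fetch API objects; the marker string is made up for illustration.

```javascript
// Hypothetical marker that the site is assumed to embed in its HTML.
const BANNER_MARKER = "<!-- preview-banner -->";

const worker = {
  async fetch(request, context, next) {
    const response = await next(request);

    // Leave non-HTML responses untouched.
    const type = response.headers.get("content-type") ?? "";
    if (!type.includes("text/html")) {
      return response;
    }

    // Read the body, remove the marker and build a new Response.
    const html = await response.text();
    const headers = new Headers(response.headers);
    // The body length changes, so drop any stale Content-Length header.
    headers.delete("content-length");

    return new Response(html.replaceAll(BANNER_MARKER, ""), {
      status: response.status,
      headers,
    });
  },
};

// In a worker file this object would be the default export:
// export default worker;
```

A response body can be read only once, which is why the handler returns a new Response instead of the one it consumed.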
html(page, context, next)
Emitted when the crawler has parsed an HTML page with a 200 status into a DOM document.
Returns the result for indexing. Can be used to modify the result, for example by adding tags or custom fields, or by cleaning up the title or content. It is also possible to skip a page from indexing by setting status to "skipped".
The next() call executes the cleanup selectors on the DOM document, so custom DOM operations should be done before the next() call in order to run them against the original DOM document.
Params
page
window: Browser-like Window object with a DOM document
context
next: (page) => Promise<Result>
Return: Promise<Result>
Result
status: "ok" | "skipped"
The value is ok when the page can be added to the index and skipped when not
url: string
title: string
The title shown on the search results
content: string
Content used to index the page
language: string
Language of the page. Language analyzer is picked using this value
tags: string[]
links: string[]
Links found on the page. The walk_links option uses this array for link walking. The array can be modified to include only the links you want to walk.
customFields: CustomFields
A CustomFields object
fragments: Fragments[]
See Fragment Pages
Example
export default {
  async html(page, context, next) {
    const price = page.window.document.querySelector(".price")?.innerText;
    const result = await next(page);
    result.customFields.price = {
      type: "number",
      value: Number(price),
    };
    return result;
  },
};
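Skipping a page and narrowing the walked links can be sketched in the same way. The example below is illustrative only: the `.draft-banner` selector and the `/docs/` prefix are hypothetical, and it assumes the entries in links may be either absolute or relative to the page URL.

```javascript
const worker = {
  async html(page, context, next) {
    // Inspect the original DOM before next() runs the cleanup selectors.
    // ".draft-banner" is a hypothetical marker for unpublished pages.
    const isDraft =
      page.window.document.querySelector(".draft-banner") !== null;

    const result = await next(page);

    if (isDraft) {
      // Keep the page out of the index.
      result.status = "skipped";
    }

    // Only walk links under /docs/ (hypothetical path prefix).
    // Resolving against result.url also handles relative links.
    result.links = result.links.filter((link) =>
      new URL(link, result.url).pathname.startsWith("/docs/")
    );

    return result;
  },
};

// In a worker file this object would be the default export:
// export default worker;
```

The DOM check happens before next() and the result is edited after it, matching the ordering described above.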
pdf(pdf, context, next)
Emitted when the crawler has parsed a PDF response with a 200 status.
Params
pdf
text(): Promise<string>
Get the PDF content as a string
context
next: (page) => Promise<Result>
Return: Promise<Result>
Result
The result object is the same as in the html handler.
Example
export default {
  async pdf(pdf, context, next) {
    const result = await next(pdf);
    const url = new URL(context.request.url);
    // Tag PDFs that live under /docs
    if (url.pathname.startsWith("/docs")) {
      result.tags.push("docs");
    }
    return result;
  },
};
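The text() method can be used to decide whether a PDF is worth indexing at all. The following is a minimal sketch, assuming that PDFs with almost no extractable text (for example scanned documents) should be skipped; the MIN_TEXT_LENGTH threshold is an arbitrary, hypothetical value.

```javascript
// Hypothetical threshold: PDFs with less text than this are skipped.
const MIN_TEXT_LENGTH = 200;

const worker = {
  async pdf(pdf, context, next) {
    // Extract the PDF content as a string.
    const text = await pdf.text();
    const result = await next(pdf);

    // Skip PDFs that contain little or no extractable text.
    if (text.trim().length < MIN_TEXT_LENGTH) {
      result.status = "skipped";
    }

    return result;
  },
};

// In a worker file this object would be the default export:
// export default worker;
```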
index(result, context, next)
Emitted before the document is added to the index. Not called on test crawls.
Use cases
- Send the result to external systems using a custom fetch() call
- Skip indexing to the Findkit Index
Params
result: Result
Result returned from the html or pdf handlers
context
url: string
The request URL
mode: string
The crawl mode (full, partial, manual)
next: (request: Result) => Promise<Result>
Return: The result object
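Example
The first use case could look roughly like the sketch below. It assumes a hypothetical webhook endpoint (WEBHOOK_URL) and that calling next(result) continues with normal indexing, as in the other handlers; neither the URL nor the payload shape comes from the crawler itself.

```javascript
// Hypothetical endpoint: replace with the external system's real URL.
const WEBHOOK_URL = "https://example.com/hooks/indexed";

const worker = {
  async index(result, context, next) {
    // Forward a small summary of the document to the external system.
    const response = await fetch(WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        url: result.url,
        title: result.title,
        tags: result.tags,
      }),
    });

    if (!response.ok) {
      // Log the failure but do not block indexing.
      console.error("Webhook failed for", result.url, response.status);
    }

    // Assumed to continue with normal indexing.
    return await next(result);
  },
};

// In a worker file this object would be the default export:
// export default worker;
```

Logging webhook failures instead of throwing is a design choice in this sketch: a problem in the external system then does not affect what ends up in the index.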