Worker Handlers
A worker JavaScript file can define the following handlers to control the crawling process.
start(context)
Invoked when a target crawl starts. The crawl is stopped if the handler throws an error.
Params
- context
  - host: string. The host entry defined in the toml file.
  - mode: string. The crawl mode (full, partial, manual, test).
Return: Promise<void>
Example
export default {
  async start(context) {
    console.log("Starting crawl on", context.host);
  },
};
end(context)
Invoked when a target crawl ends.
Params
- context
  - host: string. The host entry defined in the toml file.
  - mode: string. The crawl mode (full, partial, manual, test).
Return: Promise<void>
Example
export default {
  async end(context) {
    console.log("Crawl completed for", context.host);
  },
};
fetch(request, context, next)
Invoked on every request made by the crawler. Can be used to modify or skip the outgoing request and to modify the incoming response.
Params
- request: Request. A Fetch API Request object.
- context
  - url: string. The request URL.
  - mode: string. The crawl mode (full, partial, manual).
- next: (request: Request) => Promise<Response>
Return: Promise<Response>. A Fetch API Response object.
Examples
Modify the request
export default {
  async fetch(request, context, next) {
    request.headers.set("Authorization", "***");
    const response = await next(request);
    return response;
  },
};
Skip the network connection on certain URLs
export default {
  async fetch(request, context, next) {
    const url = new URL(request.url);
    // Hide robots.txt from the crawler
    if (url.pathname === "/robots.txt") {
      return new Response("not found", {
        status: 404,
      });
    }
    // Let other requests go out as normal
    return await next(request);
  },
};
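Modify the response
A minimal sketch of rewriting the incoming response. Responses returned by next() can have immutable headers, so a copy is made first; the mislabeled-HTML scenario is an assumption for illustration, not part of the crawler API:
export default {
  async fetch(request, context, next) {
    const response = await next(request);
    // Copy the response since headers on a fetched Response may be immutable
    const copy = new Response(response.body, response);
    // Assumed scenario: the origin serves HTML with a wrong content type
    copy.headers.set("content-type", "text/html; charset=utf-8");
    return copy;
  },
};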
html(page, context, next)
Invoked when the crawler has parsed an HTML page with a 200 status into a DOM document.
The handler returns the result used for indexing and can be used to modify it, for example
by adding tags or custom fields, or by cleaning up the title or content. It is also possible
to skip a page from indexing by setting the status to "skipped". The next() call
executes the clean up selectors on the DOM document, so custom DOM operations should
be done before the next() call if they are meant to run on the original
DOM document.
Params
- page
  - window: Browser-like Window object with a DOM document
- context
- next: (page) => Promise<Result>
Return: Promise<Result>
Result
- status: "ok" | "skipped". The value is "ok" when the page can be added to the index and "skipped" when not.
- url: string
- title: string. The title shown on the search results.
- content: string. Content used to index the page.
- language: string. Language of the page. The language analyzer is picked using this value.
- tags: string[]
- links: string[]. Links found on the page. The walk_links option uses this array for link walking. The array can be modified to include only the links you want to walk.
- customFields: CustomFields. A CustomFields object.
- fragments: Fragments[]. See Fragment Pages.
Examples
Add a custom field
export default {
  async html(page, context, next) {
    const price = page.window.document.querySelector(".price")?.innerText;
    const result = await next(page);
    result.customFields.price = {
      type: "number",
      value: Number(price),
    };
    return result;
  },
};
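Skip a page from indexing
A sketch of skipping via the result status, assuming a hypothetical .draft-banner element marks draft pages; the DOM is read before the next() call so the check runs against the original document:
export default {
  async html(page, context, next) {
    // Read the DOM before next() executes the clean up selectors
    const isDraft = Boolean(page.window.document.querySelector(".draft-banner"));
    const result = await next(page);
    if (isDraft) {
      // Keep draft pages out of the index
      result.status = "skipped";
    }
    return result;
  },
};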
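Limit link walking
A sketch of narrowing the links array so that link walking only follows documentation links; the /docs/ filter is an assumption for illustration:
export default {
  async html(page, context, next) {
    const result = await next(page);
    // Walk only links that point to the docs section
    result.links = result.links.filter((link) => link.includes("/docs/"));
    return result;
  },
};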
pdf(pdf, context, next)
Invoked when the crawler has parsed a PDF response with a 200 status.
Params
- pdf
  - text(): Promise<string>. Get the PDF content as a string.
- context
- next: (pdf) => Promise<Result>
Return: Promise<Result>
Result
The result object is the same as in the html handler.
Examples
Tag PDF files by path
export default {
  async pdf(pdf, context, next) {
    const result = await next(pdf);
    const url = new URL(context.url);
    // Tag all PDF files served under /docs
    if (url.pathname.startsWith("/docs")) {
      result.tags.push("docs");
    }
    return result;
  },
};
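Skip PDF files by content
A sketch using pdf.text(), assuming internal documents are stamped with a hypothetical CONFIDENTIAL marker:
export default {
  async pdf(pdf, context, next) {
    // Read the extracted PDF text
    const text = await pdf.text();
    const result = await next(pdf);
    if (text.includes("CONFIDENTIAL")) {
      // Keep stamped documents out of the index
      result.status = "skipped";
    }
    return result;
  },
};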
index(result, context, next)
Invoked before the document is added to the index. Not called on test crawls.
Use cases
- Send the result to external systems using a custom fetch() call
- Skip indexing to the Findkit Index
Params
- result: Result. The Result returned from the html or pdf handlers.
- context
  - url: string. The request URL.
  - mode: string. The crawl mode (full, partial, manual).
- next: (result: Result) => Promise<Result>
Return: Promise<Result>. The result object.
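Example
A minimal sketch of the first use case, mirroring the result to a hypothetical https://example.com/ingest endpoint before letting normal indexing continue:
export default {
  async index(result, context, next) {
    // Send a copy of the result to an external system (hypothetical endpoint)
    await fetch("https://example.com/ingest", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(result),
    });
    // Continue with normal indexing to the Findkit Index
    return await next(result);
  },
};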