Worker Handlers

A worker JavaScript file can define the following handlers to control the crawling process.

`start(target)`

Invoked when a target crawl starts. The crawl is stopped if the handler throws an error.

Params

context
- host: string The host entry defined in the toml file
- mode: string The crawl mode (full, partial, manual, test)

Return: Promise<void>

Example

export default {
    async start(context) {
        console.log("Starting crawl on", context.host);
    },
};
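
Because a thrown error stops the crawl, `start` can act as a guard. The following is a minimal sketch under assumptions: the `BLOCKED_MODES` policy and the error message are illustrative and not part of the crawler. The object is assigned to a `worker` constant only to keep the snippet self-contained; in a worker file it would be the default export.

```javascript
// Hypothetical policy: crawl modes that should never run for this target
const BLOCKED_MODES = ["manual"];

const worker = {
    async start(context) {
        // Throwing here stops the crawl before any request is made
        if (BLOCKED_MODES.includes(context.mode)) {
            throw new Error(
                `Crawl mode "${context.mode}" is disabled for ${context.host}`
            );
        }

        console.log("Starting crawl on", context.host);
    },
};

// In a worker file: export default worker;
```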

`end(target)`

Emitted when a target crawl ends.

Params

context
- host: string The host entry defined in the toml file
- mode: string The crawl mode (full, partial, manual, test)

Return: Promise<void>

Example

export default {
    async end(context) {
        console.log("Crawl completed for", context.host);
    },
};

`fetch(request, context, next)`

Emitted on every request made by the crawler. Can be used to modify or skip the outgoing request and the incoming response.

Params

request: Request A Fetch API Request object
context
- url: string The request url
- mode: string The crawl mode (full, partial, manual)
next: (request: Request) => Promise<Response>

Return: Promise<Response> A Fetch API Response object

Examples

Modify the request

export default {
    async fetch(request, context, next) {
        request.headers.set("Authorization", "***");
        const response = await next(request);
        return response;
    },
};

Skip the network connection on certain URLs

export default {
    async fetch(request, context, next) {
        const url = new URL(request.url);

        // Hide robots.txt from the crawler
        if (url.pathname === "/robots.txt") {
            return new Response("not found", {
                status: 404,
            });
        }

        // Let other requests go out as normal
        return await next(request);
    },
};
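
Modify the incoming response

The examples above cover the outgoing request. This sketch shows the other direction, rewriting the response body after `next()` returns. The `<!-- cookie-banner -->` marker is a made-up placeholder, and it is an assumption that the crawler accepts a freshly constructed `Response` in place of the original one.

```javascript
// Hypothetical marker that should not end up in the indexed content
const BANNER_MARKER = "<!-- cookie-banner -->";

const worker = {
    async fetch(request, context, next) {
        const response = await next(request);

        // Only touch HTML responses; pass everything else through unchanged
        const contentType = response.headers.get("content-type") ?? "";
        if (!contentType.includes("text/html")) {
            return response;
        }

        const html = await response.text();
        const cleaned = html.replaceAll(BANNER_MARKER, "");

        // The body length changed, so drop the stale Content-Length header
        const headers = new Headers(response.headers);
        headers.delete("content-length");

        // A Response body can be read only once, so build a new Response
        return new Response(cleaned, {
            status: response.status,
            headers,
        });
    },
};

// In a worker file: export default worker;
```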

`html(page, context, next)`

Emitted when the crawler has parsed an HTML page with a 200 status into a DOM document. The handler returns the result used for indexing and can modify it, for example by adding tags or custom fields, or by cleaning up the title or content. A page can also be excluded from indexing by setting status to "skipped". The next() call runs the cleanup selectors on the DOM document, so custom DOM operations should be done before calling next() so that they run against the original DOM document.

Params

page
- window: Browser-like Window object with a DOM document
context
- url: string
- request: Request
- response: Response
next: (page) => Promise<Result>

Return: Promise<Result>

Result

status: "ok" | "skipped" The value is "ok" when the page can be added to the index and "skipped" when it cannot
url: string
title: string The title shown on the search results
content: string Content used to index the page
language: string Language of the page. The language analyzer is picked using this value
tags: string[]
links: string[] Links found on the page. The walk_links option uses this array for link walking. The array can be modified to include only the links you want to walk.
customFields: CustomFields A CustomFields object
fragments: Fragments[] See Fragment Pages

Example

export default {
    async html(page, context, next) {
        const price = page.window.document.querySelector(".price")?.innerText;

        const result = await next(page);

        result.customFields.price = {
            type: "number",
            value: Number(price),
        };

        return result;
    },
};
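
The description above also mentions DOM operations before `next()`, skipping pages, and narrowing the `links` array. A rough sketch combining the three follows. The `.advertisement` selector and the `/drafts/` and `/docs/` paths are invented for illustration, and replacing `result.links` with a new array (rather than mutating it in place) is an assumption.

```javascript
const worker = {
    async html(page, context, next) {
        // DOM changes go before next() so they apply to the original document
        const ads = page.window.document.querySelectorAll(".advertisement");
        for (const el of ads) {
            el.remove();
        }

        const result = await next(page);

        // Keep hypothetical draft pages out of the index
        const url = new URL(context.url);
        if (url.pathname.startsWith("/drafts/")) {
            result.status = "skipped";
        }

        // Walk only links under /docs/; relative links are resolved first
        result.links = result.links.filter((link) =>
            new URL(link, context.url).pathname.startsWith("/docs/")
        );

        return result;
    },
};

// In a worker file: export default worker;
```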

`pdf(pdf, context, next)`

Emitted when the crawler has parsed a PDF response with a 200 status.

Params

pdf
- text(): Promise<string>: Get the PDF content as a string
context
- url: string
- request: Request
- response: Response
next: (page) => Promise<Result>

Return: Promise<Result>

Result

The result object is the same as in the html handler.

Example

export default {
    async pdf(pdf, context, next) {
        const result = await next(pdf);

        const url = new URL(context.request.url);

        if (url.pathname.startsWith("/docs")) {
            result.tags.push("docs");
        }

        return result;
    },
};
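
The `pdf.text()` method listed in the params can drive decisions as well. As a sketch only: the `MIN_TEXT_LENGTH` threshold is an arbitrary illustrative value, and it is assumed that setting status to "skipped" works for PDF results the same way it does for HTML results.

```javascript
// Hypothetical threshold: PDFs with less extracted text are not indexed
const MIN_TEXT_LENGTH = 200;

const worker = {
    async pdf(pdf, context, next) {
        const text = await pdf.text();
        const result = await next(pdf);

        // Scanned PDFs without a text layer yield little or no text
        if (text.trim().length < MIN_TEXT_LENGTH) {
            result.status = "skipped";
        }

        return result;
    },
};

// In a worker file: export default worker;
```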

`index(result, context, next)`

Emitted before the document is added to the index. Not called on test crawls.

Use cases

Send the result to external systems using a custom fetch() call
Skip indexing to the Findkit Index

Params

result: Result Result returned from html or pdf handlers
context
- url: string The request url
- mode: string The crawl mode (full, partial, manual)
next: (result: Result) => Promise<Result>

Return: Promise<Result> The result object
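
Example

A minimal sketch of the first use case, sending the result to an external system with a custom `fetch()` call before indexing continues. The `EXPORT_URL` endpoint and the JSON payload shape are hypothetical; adapt them to the receiving system.

```javascript
// Hypothetical endpoint of an external system that receives indexed documents
const EXPORT_URL = "https://search-export.example.com/documents";

const worker = {
    async index(result, context, next) {
        const exportResponse = await fetch(EXPORT_URL, {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({
                url: result.url,
                title: result.title,
                tags: result.tags,
            }),
        });

        // Log export failures but let indexing continue
        if (!exportResponse.ok) {
            console.error("Export failed for", context.url, exportResponse.status);
        }

        // Continue with the normal indexing
        return await next(result);
    },
};

// In a worker file: export default worker;
```

Logging instead of throwing on a failed export is a design choice in this sketch, made so that a problem in the external system does not affect the handler's own result.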
