Worker Runtime

Findkit Workers is a custom V8 runtime which implements a few browser APIs.

We use a recent version of V8, so most modern JavaScript features are available. Note, however, that the runtime is neither Node.js nor a web browser, so there is no require(), import, or any Node.js API available. It also does not execute the JavaScript present on the web pages. Only the code you provide is executed.

JavaScript API

In addition to the standard JavaScript APIs present in V8, the runtime has the following browser APIs:

Partial Fetch API
- fetch(), Request, Response, Headers
- Not all Fetch API features are supported but we are working on adding more
- If you hit any limitations, please contact us so we know what to prioritize, thanks!
URL and URLSearchParams
URLPattern
btoa and atob
AbortSignal and AbortController
TextEncoder and TextDecoder
structuredClone
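
As a rough illustration of how a few of these globals fit together without any imports, the sketch below builds a request URL and a Basic auth header. The endpoint, the credentials and the helper name buildRequestParts are hypothetical and not part of Findkit. The fetch() call is left commented out because the example URL is made up.

```javascript
// Hypothetical helper: builds a URL and headers using only runtime globals.
function buildRequestParts(baseUrl, query, user, password) {
    const url = new URL("/api/search", baseUrl);
    url.search = new URLSearchParams(query).toString();

    const headers = new Headers({
        // btoa() only accepts Latin-1 input, so the credentials are kept ASCII here
        Authorization: "Basic " + btoa(`${user}:${password}`),
    });

    return { url, headers };
}

const original = { q: "worker runtime", page: "2" };

// structuredClone() makes a deep copy, so `original` is left untouched
const query = structuredClone(original);
query.page = "3";

const { url, headers } = buildRequestParts(
    "https://example.com",
    query,
    "user",
    "secret"
);

// Inside an async handler the request could then be sent with:
// const response = await fetch(url, { headers });
```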

Worker DOM

In the html Worker handler you get access to a browser-like window object which contains a document with .querySelector(), .innerText and other common DOM APIs. This DOM implementation is not a full browser DOM but a subset made for Findkit Workers. For example, it does not execute <script> tags or interpret CSS styles.

`.innerText`

The innerText property available on HTMLElement nodes is the way to extract text for indexing. Findkit Crawler uses it internally too. It generates line breaks between block-level elements but not between inline elements.

For example, <div>hello<span>world</span></div> will have an innerText of hello\nworld and <span>hello</span><span>world</span> will have an innerText of helloworld. The caveat is that it does not recognize the CSS display property, so it is important to use semantically correct HTML elements on the page.

If it is not possible to use real block elements you may use the Worker DOM API to manually add spaces or line breaks between elements.
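
One possible way to do this is sketched below. It assumes that the Worker DOM supports querySelectorAll(), createTextNode() and after(), which the text above does not confirm. The helper name separateElements and the span.tag selector are made up for illustration.

```javascript
// Hypothetical helper: inserts a separator text node after every element
// matching `selector`, so that adjacent inline elements do not run together.
// Assumes querySelectorAll(), createTextNode() and after() are available.
function separateElements(document, selector, separator = " ") {
    const elements = Array.from(document.querySelectorAll(selector));

    for (const element of elements) {
        element.after(document.createTextNode(separator));
    }

    // Return the number of separators added, which is handy for debugging
    return elements.length;
}

// Inside the html handler this could be called before next(page):
//
//     separateElements(page.window.document, "span.tag", "\n");
```

Passing the document in as an argument keeps the helper independent of the handler, so it can be tried out against any object that provides the same three methods.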

Cleanup selector

The cleanup_selector actually removes the matched elements from the DOM. So when the worker code needs to access these elements, be sure to read the DOM before calling next().

With the following config and worker:

cleanup_selector = ".price"

export default {
    async html(page, context, next) {
        // Read the DOM before `next()`
        const price = page.window.document.querySelector(".price")?.innerText;

        // The cleanup_selector removes the elements here
        const result = await next(page);

        // The .price is not available anymore

        return result;
    },
};

Limitations

The Worker DOM API does not support the instanceof operator for DOM nodes, for example element instanceof HTMLDivElement. Instead, check the element's tag name with element.tagName === "DIV".
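
A minimal sketch of such a check is shown below. The helper name isTag is hypothetical and not part of the Worker DOM API; it only wraps the tagName comparison described above.

```javascript
// Hypothetical helper: checks an element's type by tag name instead of instanceof.
function isTag(element, tagName) {
    // Guard against null results from querySelector() and non-element nodes
    if (element == null || typeof element.tagName !== "string") {
        return false;
    }

    // Compare case-insensitively so both "div" and "DIV" work as input
    return element.tagName.toUpperCase() === tagName.toUpperCase();
}

// Possible usage inside the html handler:
//
//     const el = page.window.document.querySelector(".content");
//     if (isTag(el, "div")) { /* ... */ }
```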

Using npm modules

If you need to use an npm module, you can use a bundler to include it in your code. Just point workers = [] to the output bundle. We recommend esbuild for bundling. Use the ESM output format.
