Worker Runtime
Findkit Workers is a custom V8 runtime which implements a few browser APIs.
We use a recent version of V8, meaning most modern JavaScript features are
available, but note that the runtime is not Node.js or a
web browser. So there is no require(), import or any other Node.js APIs available. Also, it
does not execute the JavaScript present on the web pages. Only the code you
provide is executed.
JavaScript API
In addition to the standard JavaScript APIs present in V8, the runtime has the following browser APIs:
- Partial Fetch API
  - fetch(), Request, Response, Headers
  - Not all Fetch API features are supported but we are working on adding more
  - If you hit any limitations, please contact us so we know what to prioritize, thanks!
- URL and URLSearchParams
- URLPattern
- btoa and atob
- AbortSignal and AbortController
- TextEncoder and TextDecoder
- structuredClone
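As a rough illustration of the APIs listed above, here is a small sketch that uses URL, URLSearchParams, btoa, TextEncoder and fetch() with an optional AbortSignal. The helper names (buildUrl, basicAuth, utf8ByteLength, fetchJson) and the example.com address are made up for this example and are not part of Findkit.

```javascript
// Hypothetical helper: builds a URL with query parameters using URL and URLSearchParams.
function buildUrl(base, params) {
    const url = new URL(base);
    url.search = new URLSearchParams(params).toString();
    return url.toString();
}

// Hypothetical helper: creates a Basic auth header value with btoa.
// btoa only accepts Latin-1 characters, so this assumes plain ASCII credentials.
function basicAuth(user, password) {
    return "Basic " + btoa(`${user}:${password}`);
}

// Hypothetical helper: byte length of a string when encoded as UTF-8.
function utf8ByteLength(text) {
    return new TextEncoder().encode(text).length;
}

// Hypothetical helper: fetches JSON from an API. The caller may pass an
// AbortSignal (from an AbortController) to cancel the request.
async function fetchJson(url, { headers = {}, signal } = {}) {
    const response = await fetch(url, { headers, signal });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return await response.json();
}
```

For example, buildUrl("https://example.com/api", { q: "hello world" }) gives https://example.com/api?q=hello+world. Since only part of the Fetch API is supported, it is worth testing any fetch() options you rely on.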
Worker DOM
In the html Worker handler you get access to a browser-like window object which contains document
with .querySelector(), .innerText and other common DOM APIs. This DOM implementation is not a full browser DOM but a subset
for Findkit Workers. For example, it does not execute <script> tags or interpret CSS styles.
.innerText
The innerText property available on HTMLElement nodes is the way to extract
text for indexing. Findkit Crawler internally uses this too.
It will generate line breaks between block level elements but not
between inline elements.
For example <div>hello<span>world</span></div> will have innerText of hello\nworld and
<span>hello</span><span>world</span> will have innerText of helloworld. The caveat is
that it does not recognize the CSS defined display property. So it is important to use semantically
correct HTML elements on the page.
If it is not possible to use real block elements you may use the Worker DOM API to manually add spaces or line breaks between elements.
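A minimal sketch of adding separators manually could look like the following. It assumes the Worker DOM supports querySelectorAll(), createTextNode() and appendChild(), and that the text is extracted when next() is called. The addSeparators helper and the .cell selector are hypothetical. Whether a raw "\n" text node survives in innerText is also an assumption to verify, so a plain space is used as the default.

```javascript
// Hypothetical helper: appends a separator text node to every element matching
// `selector`. Returns the number of elements that were modified.
function addSeparators(document, selector, separator = " ") {
    const elements = Array.from(document.querySelectorAll(selector));
    for (const element of elements) {
        element.appendChild(document.createTextNode(separator));
    }
    return elements.length;
}

// Sketch of a worker using the helper. In a real worker file this object
// would be the default export.
const worker = {
    async html(page, context, next) {
        // Modify the DOM before `next()`, assuming the text is extracted after it
        addSeparators(page.window.document, ".cell", "\n");
        return await next(page);
    },
};
```

Passing the document in as an argument keeps the helper easy to try out with a small stand-in object before using it in a real worker.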
Cleanup selector
The cleanup_selector actually removes the matched elements
from the DOM. So when the worker code needs to access these elements, be sure to read the DOM
before calling next().
With the following config:
cleanup_selector = ".price"

export default {
    async html(page, context, next) {
        // Read the DOM before `next()`
        const price = page.window.document.querySelector(".price")?.innerText;
        // The cleanup_selector removes the elements here
        const result = await next(page);
        // The .price is not available anymore
        return result;
    },
};
Limitations
The Worker DOM API does not support the instanceof operator for DOM nodes,
for example element instanceof HTMLDivElement. Instead, check the element's tag name using element.tagName === "DIV".
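The tag name check can be wrapped into a small helper, sketched below. The isTag and onlyTags names are hypothetical, and the sketch assumes the Worker DOM reports tagName in upper case as in the "DIV" example.

```javascript
// Hypothetical helper: checks an element's tag name instead of using instanceof.
// Assumes tagName is reported in upper case, e.g. "DIV".
function isTag(element, name) {
    return element != null && element.tagName === name.toUpperCase();
}

// Hypothetical helper: keeps only the elements with the given tag name.
function onlyTags(elements, name) {
    return Array.from(elements).filter((element) => isTag(element, name));
}
```

With this, isTag(element, "div") can be used in place of element instanceof HTMLDivElement.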
Using npm modules
If you need to use an npm module, you can use a bundler to include it
within your code. Just point the workers = [] option to the output bundle. We
recommend esbuild for bundling. Use the ESM output format.