25 June 2023

Generating Tree-sitter and Grammar Wasm Binaries with Emscripten

While developing Edita I had to get fairly familiar with Wasm concepts and Emscripten, the toolchain that compiles source files (e.g. C) into Wasm and JavaScript files. This post documents some of the issues I encountered and serves as a guide to building and using Wasm files with recent versions of Emscripten on Linux.

Tree-sitter

Edita uses a fork of tree-sitter with the following changes:

  • Fix environment detection in lib/binding_web/binding.js (the old check looked for Node globals like process, which also exist in the renderer process in Electron).

  • Statically link parsers in script/build-wasm to support the following grammars, which use system libraries that aren’t in the main lib/binding_web/exports.json list:

    • Ruby
    • Svelte

    There are two other solutions suggested in the emscripten docs:

    • add EMCC_FORCE_STDLIBS=1 and -s EXPORT_ALL=1 to the emcc command in tree-sitter, or

    • add missing symbols to exports.json (as described in #949)

    but neither of these worked.

    The only disadvantage of static linking is that the grammars will be loaded unconditionally on startup, taking up some of the startup time that we want to use for loading common grammars.

  • Add support for statically linked modules to Language.load in lib/binding_web/binding.js.

Fork: https://github.com/gushogg-blake/tree-sitter.
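
The environment detection fix in the first bullet can be illustrated with a small sketch. This is not the code from the fork: detectEnvironment and its return values are hypothetical names. The only idea it shows is that a check for a browser-like window should come before any check for Node's process global, since an Electron renderer can have both.

```javascript
// Hypothetical helper, not the fork's actual code.
// The global object is passed in so the function can be tried with fake globals.
function detectEnvironment(globals) {
  const hasWindow =
    typeof globals.window === "object" && globals.window !== null;
  const versions = globals.process && globals.process.versions;
  const hasNode = Boolean(versions && typeof versions.node === "string");

  // Check for a window first: an Electron renderer exposes both window
  // and process, and should be treated as a web environment.
  if (hasWindow) return "web";
  if (hasNode) return "node";
  return "unknown";
}

// Plain Node: process but no window.
console.log(detectEnvironment({ process: { versions: { node: "18.0.0" } } })); // "node"
// Electron renderer with Node integration: both are present.
console.log(
  detectEnvironment({ window: {}, process: { versions: { node: "18.0.0" } } })
); // "web"
```

In real code the argument would be globalThis; the fake objects above are only there to make the two cases visible.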

emscripten/emsdk/llvm

A recent version of emcc (emscripten) must also be used, in order to get this change: https://github.com/emscripten-core/emscripten/pull/18382 (use locateFile in dynamic module loader):

emcc -v # 3.1.43-git

Otherwise you won’t be able to use locateFile to control the path that gets requested for side module wasm files, and it will default to something like /name.wasm—which obviously won’t work if your side modules are kept somewhere like /tree-sitter/langs/name.wasm.
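
As a rough illustration, a locateFile function for this layout might look like the sketch below. Emscripten calls locateFile(path, prefix) to resolve files it is about to load. The directory, the main wasm file name, and the assumption that side modules arrive as a bare name such as name.wasm are illustrative guesses based on the paths mentioned above, not something taken from Edita's source.

```javascript
// Hypothetical location for grammar side modules, taken from the example path above.
const LANGS_DIR = "/tree-sitter/langs/";
// The main module's wasm file, assumed to live next to tree-sitter.js.
const MAIN_WASM = "tree-sitter.wasm";

// Emscripten calls locateFile(path, prefix) and requests whatever URL it returns.
function locateFile(path, prefix) {
  const name = path.split("/").pop(); // tolerate a leading directory or slash
  if (name.endsWith(".wasm") && name !== MAIN_WASM) {
    // Side module: redirect to the grammars directory.
    return LANGS_DIR + name;
  }
  // Everything else keeps Emscripten's default location.
  return (prefix || "") + path;
}

console.log(locateFile("name.wasm", "/js/")); // "/tree-sitter/langs/name.wasm"
console.log(locateFile("tree-sitter.wasm", "/js/")); // "/js/tree-sitter.wasm"
```

With web-tree-sitter, a function like this would normally be passed in the options object given to Parser.init, which hands it on to the Emscripten module.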

This probably has to be installed via emsdk:

  • Clone the latest emsdk from https://github.com/emscripten-core/emsdk

  • From there run:

    ./emsdk install emscripten-main-64bit
    ./emsdk activate emscripten-main-64bit

    Note: I’m not sure if emsdk requires an installation step—if you get an error here, check the docs for that.

    The activate command prints instructions for getting the latest emcc onto $PATH.

The latest version of emcc also depends on the latest version of llvm:

./emsdk install llvm-git-main-64bit
./emsdk activate llvm-git-main-64bit

Tools available for installation with emsdk can be seen by running ./emsdk list.

The tree-sitter build scripts should now use the latest version of emcc (you’ll have to reload the shell, or re-run the command printed by ./emsdk activate, to get it onto $PATH).
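
To confirm that the emcc on $PATH is the new one, the version can be read from the text printed by emcc -v. The sketch below parses such text and compares it with a minimum. The 3.1.43 threshold is simply the version shown earlier in this post, not a verified minimum for the locateFile change, and the sample strings are made up for illustration.

```javascript
// Assumption: 3.1.43 is the version shown in this post, not a verified minimum.
const ASSUMED_MINIMUM = "3.1.43";

// Pull the first x.y.z version number out of a piece of text.
function parseEmccVersion(output) {
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(output);
  return match ? match.slice(1, 4).map(Number) : null;
}

// True if version a is greater than or equal to version b.
function atLeast(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true;
}

function emccIsRecentEnough(output, minimum = ASSUMED_MINIMUM) {
  const found = parseEmccVersion(output);
  const wanted = parseEmccVersion(minimum);
  return found !== null && wanted !== null && atLeast(found, wanted);
}

// Illustrative strings, not captured from a real run.
console.log(emccIsRecentEnough("emcc (Emscripten) 3.1.43-git")); // true
console.log(emccIsRecentEnough("emcc (Emscripten) 3.1.20")); // false
```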

Creating tree-sitter.wasm and tree-sitter.js

cd projects/tree-sitter
./script/build-wasm --static

The --static option indicates static linking. The list of grammars to statically link is hard-coded in build-wasm.

The files (tree-sitter.js and tree-sitter.wasm) are created in lib/binding_web.

Creating Grammar Wasm Files

Grammar Wasm files can be built with an unpatched tree-sitter, installed as the tree-sitter-cli npm package:

git clone https://github.com/.../tree-sitter-[lang]
npx tree-sitter build-wasm tree-sitter-[lang]

The language wasm file will be created in the current directory.
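
Once built, a grammar file can be loaded through the web binding. The sketch below assumes the output file is named tree-sitter-[lang].wasm and that the binding exposes Parser.init, Parser.Language.load and setLanguage, as web-tree-sitter does at the time of writing. grammarWasmPath and createParser are hypothetical helper names, and the Parser class is passed in rather than imported so the snippet has no dependencies.

```javascript
// Assumed naming for build-wasm output: tree-sitter-[lang].wasm.
function grammarWasmPath(lang, baseDir) {
  const dir = baseDir.endsWith("/") ? baseDir : baseDir + "/";
  return dir + "tree-sitter-" + lang + ".wasm";
}

// Parser is whatever the web binding exports (for example web-tree-sitter).
async function createParser(Parser, lang, baseDir) {
  await Parser.init(); // loads tree-sitter.wasm
  const language = await Parser.Language.load(grammarWasmPath(lang, baseDir));
  const parser = new Parser();
  parser.setLanguage(language);
  return parser;
}

console.log(grammarWasmPath("ruby", "/tree-sitter/langs"));
// "/tree-sitter/langs/tree-sitter-ruby.wasm"
```

If the files are renamed after building (for example to name.wasm, as in the path mentioned earlier), grammarWasmPath would need to match that scheme instead.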

Misc

Emscripten settings: https://github.com/emscripten-core/emscripten/blob/main/src/settings.js.