Generating Tree-sitter and Grammar Wasm Binaries with Emscripten
While developing Edita I had to get pretty familiar with WASM concepts and Emscripten, the toolchain that compiles source files (e.g. C) into WASM and JavaScript files. This post documents some of the issues I encountered and serves as a guide to building and using WASM files with recent versions of Emscripten on Linux.
Tree-sitter
Edita uses a fork of tree-sitter with the following changes:
- Fix environment detection in lib/binding_web/binding.js (the old check checked for Node globals like process, which are available in the renderer process in Electron).
- Statically link parsers in script/build-wasm to support the following grammars, which use system libraries that aren’t in the main lib/binding_web/exports.json list:
  - Ruby
  - Svelte
  See:
  - https://github.com/emscripten-core/emscripten/issues/8308
  - https://emscripten.org/docs/compiling/Dynamic-Linking.html (System Libraries section)
  - https://github.com/tree-sitter/tree-sitter/issues/949
  There are two other solutions suggested in the emscripten docs:
  - add EMCC_FORCE_STDLIBS=1 and -s EXPORT_ALL=1 to the emcc command in tree-sitter, or
  - add missing symbols to exports.json (as described in #949)
  but neither of these worked.
  The only disadvantage of static linking is that the grammars will be loaded unconditionally on startup, taking up some of the startup time that we want to use for loading common grammars.
- Add support for statically linked modules to Language.load in lib/binding_web/binding.js.
Fork: https://github.com/gushogg-blake/tree-sitter.
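To illustrate the environment detection problem, here is a minimal sketch. It is not the code from the fork: the function name detectEnvironment and the exact checks are assumptions made for illustration. The idea is to test for browser globals first, so that an Electron renderer (which has both window and process) is treated as a web environment rather than Node.

```javascript
// Hypothetical helper, not taken from binding.js: classify the runtime
// from a globals object so that Electron's renderer counts as "web".
function detectEnvironment(globals) {
  const hasWindow = typeof globals.window === "object" && globals.window !== null;
  const hasDocument = typeof globals.document === "object" && globals.document !== null;

  // process.versions.node is a string in Node (and in Electron).
  const versions = globals.process && globals.process.versions;
  const hasNodeProcess = Boolean(versions && typeof versions.node === "string");

  // Check browser globals first: an Electron renderer can expose process too,
  // so a process-only check would misclassify it as Node.
  if (hasWindow && hasDocument) return "web";
  if (hasNodeProcess) return "node";
  return "unknown";
}

// Usage: pass the real global object of whatever runtime this runs in.
console.log(detectEnvironment(globalThis));
```

Taking the globals as a parameter keeps the check easy to exercise with fake objects, which is handy when the same code has to behave in a browser, in Node and in Electron.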
emscripten/emsdk/llvm
A recent version of emcc (emscripten) must also be used, in order to get this change: https://github.com/emscripten-core/emscripten/pull/18382 (use locateFile in dynamic module loader):
emcc -v # 3.1.43-git
Otherwise you won’t be able to use locateFile to control the path that gets requested for side module wasm files. It will default to something like /name.wasm, which obviously won’t work if your side modules are kept somewhere like /tree-sitter/langs/name.wasm.
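As a rough sketch of what such a callback might look like: Emscripten calls Module.locateFile(path, prefix) when it needs the URL of a file and requests whatever the callback returns. The constant names below are hypothetical, and the rule that every .wasm file other than tree-sitter.wasm is a side module is an assumption made for this example. Only the /tree-sitter/langs/ directory comes from the text above.

```javascript
// Assumed layout: grammar side modules live under /tree-sitter/langs/.
const LANGS_DIR = "/tree-sitter/langs/";
// Assumed name of the main module, which should keep its default location.
const MAIN_WASM = "tree-sitter.wasm";

// Emscripten-style locateFile callback: receives the requested file and the
// default prefix (script directory), returns the URL that should be fetched.
function locateFile(path, prefix) {
  const name = path.split("/").pop();

  // Treat every .wasm file other than the main module as a grammar side module.
  if (name.endsWith(".wasm") && name !== MAIN_WASM) {
    return LANGS_DIR + name;
  }

  // Fall back to the default behaviour: prefix + path.
  return prefix + path;
}

console.log(locateFile("name.wasm", "/"));
```

With the web bindings this would typically be passed as an option when initialising the module, for example Parser.init({ locateFile }), assuming the bindings in use forward that option to Emscripten.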
This probably has to be installed via emsdk:
Clone the latest emsdk from https://github.com/emscripten-core/emsdk
From there run:
./emsdk install emscripten-main-64bit
./emsdk activate emscripten-main-64bit
Note: I’m not sure if emsdk requires an installation step—if you get an error here, check the docs for that.
This will give instructions on getting the latest emcc onto $PATH.
The latest version of emcc also depends on the latest version of llvm:
./emsdk install llvm-git-main-64bit
./emsdk activate llvm-git-main-64bit
Tools available for installation with emsdk can be seen by running ./emsdk list.
The tree-sitter build scripts should now use the latest version of emcc (you’ll have to reload the shell or run the command again to get it onto $PATH).
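To confirm which emcc the shell is actually picking up, the version can be read from the output of emcc -v. The following is a rough sketch in Node. The helper names are hypothetical, and it simply extracts the first x.y.z number it finds instead of relying on an exact output format. The comparison target, 3.1.43, is the version shown earlier in this post, which is not necessarily the minimum that contains the locateFile change.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: pull the first x.y.z version number out of some text.
function parseVersion(text) {
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(text);
  return match ? match.slice(1, 4).map(Number) : null;
}

// Hypothetical helper: true if version a is at least version b.
function isAtLeast(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] > b[i];
  }
  return true;
}

// Runs emcc -v and returns the parsed version, or null if emcc cannot be run.
function installedEmccVersion() {
  const result = spawnSync("emcc", ["-v"], { encoding: "utf8" });
  if (result.error) return null;
  // Look at both streams, since the version banner may be written to stderr.
  return parseVersion(`${result.stdout}\n${result.stderr}`);
}

const version = installedEmccVersion();
if (version === null) {
  console.log("emcc not found on $PATH");
} else {
  const ok = isAtLeast(version, [3, 1, 43]);
  console.log(`emcc ${version.join(".")} (${ok ? "at least" : "older than"} 3.1.43)`);
}
```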
Creating tree-sitter.wasm and tree-sitter.js
cd projects/tree-sitter
./script/build-wasm --static
The --static option indicates static linking. The list of grammars to statically link is hard-coded in build-wasm.
The files (tree-sitter.js and tree-sitter.wasm) are created in lib/binding_web.
Creating Grammar Wasm Files
Grammar Wasm files can be created with an unpatched tree-sitter, installed as tree-sitter-cli:
git clone https://github.com/.../tree-sitter-[lang]
npx tree-sitter build-wasm tree-sitter-[lang]
The language wasm file will be created in the current directory.
Misc
Emscripten settings: https://github.com/emscripten-core/emscripten/blob/main/src/settings.js.