A long hiatus: a full summer off and then some contract work.
Some benchmarks of the latest tokenization algorithm show that pure-ASCII tokenization is about 20% faster and UTF-8 tokenization is about 2% slower. So I’ll take it.
The benchmark used perl/docmaker.pl to generate 100 random “docs” in each encoding (pure ASCII and random UTF-8), then timed swish_lint over them with and without the -t option.
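For the curious, a rough sketch of that kind of timing loop is below. The docmaker.pl arguments and the exact swish_lint invocation are placeholders, not the actual commands used here; adjust them to match the real scripts.

<pre>
#!/usr/bin/env perl
# Rough timing harness: generate a corpus of test docs, then time
# swish_lint over it with and without tokenization (-t).
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $dir = shift || 'bench_docs';

# Generate the test corpus. These flags are assumed placeholders;
# check perl/docmaker.pl for its real options.
system("perl perl/docmaker.pl --count 100 --outdir $dir") == 0
    or die "docmaker.pl failed: $?";

# Time swish_lint with and without -t over the generated files.
for my $opts ('', '-t') {
    my $t0 = [gettimeofday];
    system("swish_lint $opts $dir/*") == 0
        or warn "swish_lint $opts exited nonzero";
    printf "swish_lint %-2s : %.3f s\n", $opts, tv_interval($t0);
}
</pre>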
I also recently fixed some failing tests on Linux and a memory warning.