プレーンな Aho-Corasick を実装したけどそんなに早くなかった

TL;DR

TypeScript で Aho-Corasick を実装しました https://github.com/tanishiking/aho-corasick-js
ベンチマークとってみたところだいたいのケースで普通のTrieで複数パターンマッチするほうが早かった https://github.com/tanishiking/aho-corasick-benchmark
(プレーンな) Aho-Corasick は Trie の最悪時間計算量を緩和するものの、実際の検索では failure state への遷移が非常に多く、ループの数は減らせるものの時間がかかってしまう(推測)
状態遷移表(やダブル配列)を使ったより高速な Aho-Corasick の実装や、Boyer-Moore と Aho-Corasick が合体した Commentz-Walter とか、より高速に複数パターンの検索を実装する方法はある

参考

文字列高速検索なんもわからん人は僕の過去のブログ見てみてね
- Coursera の Algorithms on Strings 受けました - たにしきんぐダム
Aho-Corasick を学ぶために(元論文読んでも良いけどそんなにわかり易くない)
AhoCorasick の高速化
- Aho Corasick自动机结合DoubleArrayTrie极速多模式匹配-码农场
- ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} - Andrew Gallant's Blog
より高速な複数パターンの検索
- https://www.researchgate.net/publication/2566530_A_Fast_Algorithm_For_Multi-Pattern_Searching

Aho-Corasick on TypeScript

JSで複数の文字列パターンを高速に検索したいことがあり、有名かつ実装が割と簡単という Aho-Corasick を TypeScript で実装してみた(ナイーブに実装されたものはいくつかnpmに公開されているが、しっかりメンテされていて安心して使えそうなものはなかった)。

github.com www.npmjs.com

(実装してみたと言っても GitHub - robert-bor/aho-corasick: Java implementation of the Aho-Corasick algorithm for efficient string matching による良く出来た Java 実装を部分的に TypeScript に移植しただけなのだが...)

ベンチマーク環境

複数のキーワードを文書から検索する(文書内に複数キーワードが含まれていた場合すべて列挙する)というタスク(rerere から re を検索 -> 検索結果は [re, re, re] というような感じ)
ベンチマークを以下の3つの実装に対して行いました。
- BruteForce(キーワードごとにそれぞれ文書内からindexOfで検索)
  - aho-corasick-benchmark/index.ts at 272b341c64252e1a41c260ee2ead23f06869c324 · tanishiking/aho-corasick-benchmark · GitHub
  - これでは出現箇所ごとに発見したキーワードを列挙することが出来ないが、参考までに
- プレーンな Trie 実装
  - aho-corasick-benchmark/trie.ts at master · tanishiking/aho-corasick-benchmark · GitHub
  - 文書に対して二重ループをまわしていく
- AhoCorasick
  - https://github.com/tanishiking/aho-corasick-js
ベンチマークコードはここに置いてます GitHub - tanishiking/aho-corasick-benchmark: Benchmark for https://github.com/tanishiking/aho-corasick-js
- chance.js を使ってランダムな文字列、ランダムなキーワードを生成し、それをベンチマークに使っています。
- ベンチマークには benchmark.js を使った。

ベンチマーク結果

いくつかのパターンに対してベンチマークをかけてみた。

自然言語っぽい sentence に対するベンチマーク

chance.sentence によって生成した sentence が検索対象
- ちなみにsentenceにより生成されるwordはそれぞれ平均5,6文字だそう。
chance.word で生成した単語(?)郡がキーワード

長さ10000単語のsentenceに対して 10000キーワードを検索

❯ node index.js -k sentence --keywords 10000 --length 10000             
indexOf loop x 2.99 ops/sec ±1.81% (12 runs sampled)
simple trie x 111 ops/sec ±2.07% (67 runs sampled)
aho corasick x 33.11 ops/sec ±2.60% (54 runs sampled)

長さ10000単語のsentenceに対して 500000キーワードを検索

❯ node index.js -k sentence --keywords 500000 --length 10000             
indexOf loop x 0.04 ops/sec ±34.98% (5 runs sampled)
simple trie x 25.42 ops/sec ±14.62% (47 runs sampled)
aho corasick x 11.85 ops/sec ±17.90% (27 runs sampled)

長さ10000単語のsentenceに対して 100キーワードを検索

❯ node index.js -k sentence --keywords 100 --length 10000
indexOf loop x 347 ops/sec ±1.53% (82 runs sampled)
simple trie x 224 ops/sec ±1.03% (82 runs sampled)
aho corasick x 94.35 ops/sec ±2.48% (67 runs sampled)

キーワード数が少ないと単純な indexOf のループが一番早い(それはそう)
Aho-Corasick は総じて Trie に負けてる

機械的な文字列集合(DNA配列とか)(文字種少なめ) に対するベンチマーク

文字種が少なくキーワードが多い場合は Aho-Corasick の failure 遷移がよく効いてくるのではないかと思い、Aho-Corasick に有利な条件のベンチマークを設定してみた。

長さ1000の文字列に対して、10000キーワードで検索

❯ node index.js -k string --keywords 10000 --length 1000 --pool abcdefg 
indexOf loop x 54.19 ops/sec ±15.41% (45 runs sampled)
simple trie x 2,825 ops/sec ±9.39% (75 runs sampled)
aho corasick x 4,001 ops/sec ±3.11% (74 runs sampled)

長さ10000の文字列に対して、10000キーワードで検索

❯ node index.js -k string --keywords 10000 --length 10000 --pool abcdef
indexOf loop x 6.46 ops/sec ±28.38% (20 runs sampled)
simple trie x 305 ops/sec ±2.82% (72 runs sampled)
aho corasick x 223 ops/sec ±16.66% (62 runs sampled)

やはり文字種の少ない環境かつ検索キーワードが多い環境だと failure 遷移が効いてくる(一方で Trie は無駄な遷移が目立ってくる)のか、Aho-Corasick に軍配が上がった
しかし検索対象が長くなると Trie に逆転される...

プレーンな Aho-Corasick が Trie よりも遅い理由の推測

Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case https://www.researchgate.net/publication/2566530_A_Fast_Algorithm_For_Multi-Pattern_Searching

とあるように、Aho-Corasick は Trie による検索の最悪ケース (最長パターンのギリギリまで遷移して、受理状態に到達するギリギリで遷移に失敗するのを繰り返す場合 abcdefg というキーワードを abcdef という文字列から検索するみたいな) の対策にはなっていそうではある。
しかし実際の文字列検索では、確かに Trie は検索対象の文書に対して二重ループを走らせる必要があるものの、内側のループ(Trieで構築されたオートマトン上を遷移する過程)ではだいたいの場合、早々に遷移に失敗するはず
- そのため最悪ケースでは二重ループを繰り返すものの、実際はそんなに時間かからないのではないか。
一方 Aho-Corasick は二重ループを取り除き、一回の検索対象文字列に対する走査ですべての検索キーワードを見つけることができるが、遷移に失敗した場合に(failure遷移先から)遷移に成功するまでfailure遷移を複数回繰り返すことになる。
- そのため検索対象文字列と、検索するキーワードがよく似ており、遷移が成功しやすい環境では failure 遷移による高速化がよく効くものの、遷移が失敗しやすい環境では(複数の) failure 遷移が多くなり、それによる検索速度の劣化が著しいのではないか。

このような原因でプレーンな Aho-Corasick 実装が Trie 実装に負けたのではないかと考えています(もしかすると単純に僕の実装がミスってて不要な状態遷移とかが起きてるのかもしれないが...何度読みなおしてもそのようなことにはなってないように見える)。

詳しいかたいらっしゃったら教えて下さい or 読むべき本/論文などあれば教えて下さい...。

さらに高速な複数文字列パターン検索のために

Aho-Corasick とは別の(亜種の)アルゴリズム

例えば GNU grep には Commentz-Walter algorthm という Aho-Corasick と Boyer-Moore が合体したようなアルゴリズムが使われていたり、いろいろ考えられいる。より高みを目指すならこれらのアルゴリズムを実装したほうが良いのだろう。

こういう文字列アルゴリズムをいっきに学べる良い本ないかなと思っていたら Flexible Pattern Matching Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences をおすすめされたので買いました。

Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences https://t.co/aQaE0ot4IL
この本おすすめ
— takuya-a (@takuya_b) 2020年5月17日

ところで、これらのアルゴリズムは割りと歴史が長く、モダンなCPUが開発された現代に最適な複数文字列検索アルゴリズムはないのだろうかと思っていたらあった。

Intelの開発している Hyperscan という複数文字列検索のための正規表現エンジンが存在し、ripgrep も SIMD が有効な場合は hyperscan の考えをベースに実装した Teddy というライブラリを使って高速な複数文字列検索を行うらしい。

REGEX SET SCANNING WITH HYPERSCAN AND RE2::SET hyperscan 早そう

Aho-Corasick の高速化

では Aho-Corasick はもうpracticalには役に立たないのかというとそんなことはなく、Aho-Corasick にはまだ高速化の余地があるようです。

例えば、今回実装したプレーンなAho-Corasickでは状態遷移部分をHashMapを使って実装していたが

Aho Corasick自动机结合DoubleArrayTrie极速多模式匹配-码农场では状態遷移をダブル配列によって実装することで圧倒的な高速化(状態遷移がひとつ(ふたつ)のテーブルで管理されることになるのでCPUに優しく、キャッシュにも乗りやすい)しているし
ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} - Andrew Gallant's Blog では ripgrep も複数文字列検索では、SIMD命令が使える場合は Teddy を使うが、SIMDが有効でない場合は Aho-Corasick の状態遷移を状態遷移表を使ったバージョンにfallbackしているとある。
- failure 遷移をギュッとコンパイルして状態遷移表に詰め込むことで、状態遷移に失敗した時のfailure遷移回数を1回に抑えるとあるが、どうやっているのだろう...実装を見てみないとわからない GitHub - BurntSushi/aho-corasick: A fast implementation of Aho-Corasick in Rust.
- Commentz-Walter algorthm を使わず、高速化した Aho-Corasick を使ったのはなぜなのだろうか。そっちのほうが早いのか?

We do still have a few tricks up our sleeve though. For example, many Aho-Corasick implementations are built as-if they were tries with back-pointers for their failure transitions. We can actually do better than that. We can compile all of its failure transitions into a DFA with a transition table contiguous in memory. This means that every byte of input corresponds to a single lookup in the transition table to find the next state. We never have to waste time chasing pointers or walking more than one failure transition for any byte in the search text. https://blog.burntsushi.net/ripgrep/

暇があれば Aho-Corasick の高速化や Commentz-Walter の実装と比較。また hyperscan について調べたりしてみたいな。俺たちの高速文字列検索はまだまだこれからだ！