Introduction

In this article, I will describe the implementation of an efficient Aho-Corasick algorithm for pattern matching. In simple words, this algorithm can be used for searching a text for specified keywords. The following code is useful when you have a set of keywords and you want to find all occurrences of a keywords in the text or check if any of the keywords is present in the text. You should use this algorithm especially if you have a large number of keywords that don't change often, because in this case, it is much more efficient than other algorithms that can be simply implemented using the .NET class library.

Aho-Corasick algorithm

In this section, I'll try to describe the concept of this algorithm. For more information and for a more exact explanation, please take a look at the links at the end of this article. The algorithm consists of two parts. The first part is the building of the tree from keywords you want to search for, and the second part is searching the text for the keywords using the previously built tree (state machine). Searching for a keyword is very efficient, because it only moves through the states in the state machine. If a character is matching, it follows goto function otherwise it follows fail function.

Tree building

In the first phase of the tree building, keywords are added to the tree. In my implementation, I use the classStringSearch.TreeNode, which represents one letter. The root node is used only as a place holder and contains links to other letters. Links created in this first step represents the goto function, which returns the next state when a character is matching.
During the second phase, the fail and output functions are found. The fail function is used when a character is not matching and the output function returns the found keywords for each reached state. For example, in the text "SHIS", the failure function is used to exit from the "SHE" branch to "HIS" branch after the first two characters (because the third character is not matching). During the second phase, the BFS (breadth first search) algorithm is used for traversing through all the nodes. Functions are calculated in this order, because the fail function of the specified node is calculated using the fail function of the parent node.
Building of the keyword tree (figure 1 - after the first step, figure 2 - tree with the fail function)

Searching

As I already mentioned, searching only means traversing the previously built keyword tree (state machine). To demonstrate how this algorithm works, let's look at the commented method which returns all the matches of the specified keywords:
// Searches passed text and returns all occurrences of any keyword
// Returns array containing positions of found keywords
public StringSearchResult[] FindAll(string text)
{
  ArrayList ret=new ArrayList(); // List containing results
  TreeNode ptr=_root;            // Current node (state)
  int index=0;                   // Index in text

  // Loop through characters
  while(index//