Word level processing

It's not easy for a word to get into the search index unscathed. Several limitations and transformations can be imposed on an unsuspecting word, and they all lurk in the benign-sounding search_simplify function. To keep things in context, search_simplify gets called from search_index_split, which in turn gets called during the text token phase of search_index. It's a bit like playing mental Twister, isn't it? Let's look at the whole search_simplify function.

The search_total table

Take a look at what the search_total table contains after indexing example nodes 2 and 3:

mysql> SELECT * FROM search_total ORDER BY count DESC;

[Query output not recoverable; columns: word, count]

Figure 20. Global normalized totals for the words in the index.

After each indexing run, every word that has been marked by search_dirty is updated in the search_total table. The count value is a normalization according to Zipf's law, which says that a word's value to the search index is inversely...
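To make the normalization concrete, here is a minimal Python sketch. It assumes the count is computed as log10(1 + 1/max(1, total)), where total is the sum of a word's scores across the index; that formula, and the names zipf_normalize and update_totals, are assumptions for illustration and are not taken from the text above.

```python
import math


def zipf_normalize(total_score):
    """Map a word's summed score to a normalized weight.

    Assumed formula: log10(1 + 1 / max(1, total)). The larger a word's
    total score across the index, the smaller its weight.
    """
    return math.log10(1 + 1 / max(1, total_score))


def update_totals(index_rows, dirty_words):
    """Recompute the normalized count for each dirty word.

    index_rows is a list of (word, score) pairs standing in for the
    per-node index rows; dirty_words plays the role of the words that
    search_dirty has marked. Both shapes are hypothetical.
    """
    totals = {}
    for word in dirty_words:
        total = sum(score for w, score in index_rows if w == word)
        totals[word] = zipf_normalize(total)
    return totals


rows = [("ponies", 1.0), ("ponies", 8.0), ("badgers", 1.0)]
totals = update_totals(rows, {"ponies", "badgers"})
```

In this sketch a word with a high total score, such as "ponies", ends up with a lower count than a rarer word, such as "badgers", which matches the inverse relationship described above.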

The search_node_links table

The tokens in Figure 3 didn't contain any links. Here's some text that links back to node 2, the node that talks about ponies and badgers:

<a href="...">of badgers and ponies</a>

This is in node 3, and the link references node 2. Now look at the contents of the search_node_links table after node 3 has been indexed:

mysql> SELECT * FROM search_node_links;

Figure 18. search_node_links has caption text that will be added to a node's text at indexing time.

The utility of this text only makes sense if we also...

The search_dataset table

Here's what is in the search_dataset table after indexing the tokens in Figure 3:

mysql> SELECT * FROM search_dataset;

Figure 17. The contents of the search_dataset table after processing the tokens in Figure 3.

As you can see, the data column of search_dataset contains all of the words that were encountered in this document, in the order in which they were encountered. This string...
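The shape of the data column can be sketched in a few lines of Python. This is only an illustration: the function name build_dataset is hypothetical, and lowercasing plus whitespace splitting stand in for the real work done by search_simplify and search_index_split, which is more involved.

```python
def build_dataset(tokens):
    """Concatenate every word from the text tokens, in encounter order.

    Repeated words are kept, because the column records the document's
    words as they appear rather than a set of unique terms.
    """
    words = []
    for text in tokens:
        # Stand-in for search_simplify / search_index_split (assumption).
        words.extend(text.lower().split())
    return " ".join(words)


data = build_dataset(["Ponies and", "Badgers and ponies"])
```

Because order and repetition are preserved, the resulting string still reflects word adjacency in the original document.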