How the Indexer Works

The indexer has a preprocessing mode where text is filtered through a set of rules to assign scores. Such rules include dealing with acronyms, URLs, and numerical data. During the preprocessing phase, other modules have a chance to add logic to this process in order to perform their own data manipulations. This comes in handy during language-specific tweaking, as shown here using the contributed Porter-Stemmer module:

résumé → resume (accent removal)

Another such language preprocessing example is word splitting for the Chinese, Japanese, and Korean languages to ensure the character text is correctly indexed.

Tip: The Porter-Stemmer module (http://drupal.org/project/porterstemmer) is an example of a module that provides word stemming to improve English language searching. Likewise, the Chinese Word Splitter module (http://drupal.org/project/csplitter) is an enhanced preprocessor for improving Chinese, Japanese, and Korean searching. A simplified Chinese word splitter is included with the search module and can be enabled on the search settings page.
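The accent-removal step mentioned above can be illustrated with a small sketch. This is a Python model of the idea only, not Drupal's PHP code: the names strip_accents and preprocess are hypothetical, and the chain of steps simply stands in for the way modules can add their own text manipulations during preprocessing.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, steps):
    """Run the text through each preprocessing step in order.

    Each step is a plain function taking and returning a string; this
    stands in (hypothetically) for modules hooking into preprocessing.
    """
    for step in steps:
        text = step(text)
    return text


if __name__ == "__main__":
    # Accent removal followed by lowercasing.
    print(preprocess("Résumé", [strip_accents, str.lower]))
```

Because every step has the same string-in, string-out shape, a language-specific step such as a stemmer or a word splitter could be appended to the list without changing the rest of the chain.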

After the preprocessing phase, the indexer uses HTML tags to find more important words (called tokens) and assigns them adjusted scores based on the default score of the HTML tags and the number of occurrences of each token. These scores will be used to determine the ultimate relevancy of the token. Here's the full list of the default HTML tag scores (they are defined in search_index()):

<h1> = 25
<h2> = 18
<h3> = 15
<h4> = 12
<a> = 10
<h5> = 9
<h6> = 6
<b> = 3
<strong> = 3
<i> = 3
<em> = 3
<u> = 3

Let's grab a chunk of HTML and run it through the indexer to better understand how it works. Figure 12-6 shows an overview of the HTML indexer parsing content, assigning scores to tokens, and storing that information in the database.

[Figure 12-6 shows three stages, Incoming HTML, HTML Indexer, and Indexed Content, applied to the following sample input.]

Drupal 1.0 was released on 01/01/2001. Drupal is most accurately described as a <em>Content Management Framework</em>.

Figure 12-6. Indexing a chunk of HTML and assigning token scores

When the indexer encounters numerical data separated by punctuation, the punctuation is removed and numbers alone are indexed. This makes elements such as dates, version numbers, and IP addresses easier to search for. The middle process in Figure 12-6 shows how a word token is processed when it's not surrounded by HTML. These tokens have a weight of 1. The last row shows content that is wrapped in an emphasis (<em>) tag. The formula for determining the overall score of a token is as follows:

Score = number of matches × weight of the HTML tag
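To make the formula concrete, here is a minimal sketch in Python that applies it to the sample HTML from Figure 12-6. It is a model of the behavior described in the text, not Drupal's search_index() code: the names TAG_WEIGHTS, simplify, TokenScorer, and score_html are hypothetical, only the innermost open tag is considered, and the real indexer's handling of nested or malformed tags may differ.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Default tag weights as listed in the text; untagged words weigh 1.
TAG_WEIGHTS = {
    "h1": 25, "h2": 18, "h3": 15, "h4": 12, "a": 10, "h5": 9,
    "h6": 6, "b": 3, "strong": 3, "i": 3, "em": 3, "u": 3,
}


def simplify(text):
    """Drop punctuation between digits, then split into lowercase tokens."""
    text = re.sub(r"(?<=\d)[^\w\s]+(?=\d)", "", text)  # 01/01/2001 -> 01012001
    return re.findall(r"\w+", text.lower())


class TokenScorer(HTMLParser):
    """Adds the weight of the innermost open tag for every token match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open weighted tags
        self.scores = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        weight = TAG_WEIGHTS[self.open_tags[-1]] if self.open_tags else 1
        for token in simplify(data):
            self.scores[token] += weight  # one match adds one weight


def score_html(html):
    scorer = TokenScorer()
    scorer.feed(html)
    scorer.close()  # flush any buffered trailing text
    return dict(scorer.scores)


if __name__ == "__main__":
    sample = ("Drupal 1.0 was released on 01/01/2001. Drupal is most "
              "accurately described as a <em>Content Management "
              "Framework</em>.")
    for token, score in sorted(score_html(sample).items()):
        print(token, score)
```

In this sketch "drupal" appears twice outside any tag and scores 2, the date collapses to the single token "01012001" with a score of 1, and each word inside the <em> tag scores 3.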

Note that Drupal indexes the filtered output of nodes. For example, if an input filter automatically converts URLs to hyperlinks, or another filter converts line breaks to HTML break and paragraph tags, the indexer sees the content with all that markup in place and can take it into account when assigning scores. The impact of indexing filtered output is greater with a PHP node, which as you may know is simply another input filter option within Drupal. Indexing dynamic content could be a real hassle, but because Drupal's indexer sees only the output of PHP nodes, dynamic PHP nodes are automatically fully searchable.

When the indexer encounters internal links, they too are handled in a special way. If a link points to another node, then the link's words are added to the target node's content, making answers to common questions and relevant information easier to find.

There are two ways to hook into the indexer:

• nodeapi('update index'): You can add data to a node that is otherwise invisible in order to tweak search relevancy. You can see this in action within the Drupal core for taxonomy terms and comments, which technically aren't part of the node object but should influence the search results. These items are added to nodes during the indexing phase using the nodeapi('update index') hook. You may recall that hook_nodeapi() deals only with nodes.

• hook_update_index(): You can use the indexer to index HTML content that is not part of a node using hook_update_index(). For a Drupal core implementation of hook_update_index(), see node_update_index() in modules/node/node.module.

Both of these hooks are called during cron runs in order to index new data. Figure 12-7 shows the order in which these hooks run.

Figure 12-7. Overview of HTML indexing hooks

We'll look at these hooks in more detail in the sections that follow.
