Thursday, September 17, 2009

Integrating Lucene search engine library with PHP

Introduction

This article gives an overview of the Java Lucene search engine library. I describe Lucene and explain why you might choose it over MySQL full-text search.

Let's start with MySQL full-text search. It requires you to define a table column as TEXT or VARCHAR and create a FULLTEXT index on it. Words in this column are indexed only if they are at least a certain length (four characters by default). This is inconvenient, because to change the limit you have to restart the MySQL server with the following option:

[mysqld]
ft_min_word_len=#

The minimum length of the word to be included in a FULLTEXT index. Note: FULLTEXT indexes must be rebuilt after changing this variable.

One benefit of MySQL full-text search is that all the complex indexing logic is hidden inside the database. On the other hand, it is much slower than Lucene, and you have to write MATCH ... AGAINST SQL queries.

Lucene is a Java framework that comes with its own index storage implementation. It is more complex to use, especially if you want to implement your own index storage in a database. Whenever you add a new document to the index, a new index object has to be created so that the new document is included in the Lucene index. Creating that object is not a major overhead, though it can sometimes slow down searching.

Comparison:

Here is a comparison between MySQL full-text search and the Lucene search library:

a. Full-text search in Lucene is much faster than in MySQL. It can also search for keywords in multiple languages, and there is no limitation on keywords.

b. Lucene is much more complex to use than MySQL, but it offers some advanced options, such as:

  • Proximity search - find documents where searchword1 and searchword2 occur within a given distance of each other, for example with one word between them.

  • Wildcard search - find documents containing words that match a pattern such as searchword* or search?word.

  • Fuzzy/similarity search - find documents with words spelled similarly to the search term: roam~ will match roam, foam and so on.

  • Term boosting - you can boost a term to move relevant documents to the top.

  • Sorting and paging - you can sort the search results and paginate them using TopDocs and its scoreDocs array.
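
To make the fuzzy search option above more concrete: Lucene's fuzzy operator is based on edit distance, that is, the number of single-character changes needed to turn one word into another. The following is only a minimal sketch of that idea in plain Java. It is not Lucene's implementation, and the class and method names (EditDistance, distance, within) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class EditDistance {

    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b
        // takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into "" takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidates that are at most maxEdits edits away from term.
    static List<String> within(String term, int maxEdits, List<String> candidates) {
        List<String> matches = new ArrayList<>();
        for (String candidate : candidates) {
            if (distance(term, candidate) <= maxEdits) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```

With this sketch, EditDistance.distance("roam", "foam") is 1, because only the first letter has to change. That is why a fuzzy query for roam~ can also return documents that contain foam.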

Integration

Here is how I integrated Lucene with PHP.

a. To integrate PHP with Java, I used the PHP/Java Bridge library. It is faster and more reliable than direct communication via the Java Native Interface, and it requires no additional components to invoke Java procedures from PHP or PHP procedures from Java. (For more details, see http://php-java-bridge.sourceforge.net/)

b. Configure Tomcat 6 for PHP 5. To install a PHP web application into Tomcat, follow these steps:

  1. Copy the PHP web application JavaBridgeTemplate.war to the Tomcat webapps directory.
  2. Wait two seconds until Tomcat has loaded the web application.
  3. Browse to http://localhost:8080/JavaBridgeTemplate552 and http://127.0.0.1:8080/JavaBridgeTemplate552/test.php to see the PHP info page.

c. Once you have configured the Java bridge, you will see the JavaBridge directory in the deployment folder. Take the Lucene jar file and place it in the deployment directory of the Tomcat server (probably C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\JavaBridge\WEB-INF\lib).

d. Create your own jar file according to your requirements and place it in the same deployment directory of the Tomcat server. For help writing the Lucene code that goes into the jar, refer to the following tutorial: http://www.ibm.com/developerworks/java/library/j-lucene/

e. Create a PHP file and place it in the JavaBridge folder (C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\JavaBridge). The following PHP snippet shows how to call the Java code from PHP:



<?php
// Load the PHP/Java Bridge client library.
require_once("java/Java.inc");

// Include lucenesearch.jar so that the bridge can find the searcher class.
java_require("WEB-INF/lib/lucenesearch.jar");

// Create an instance of the AddressBookSearcher Java class.
$file = new java("AddressBookSearcher");



I hope this tutorial helps you. Thanks for reading my blog.