One of my customers wanted to make their site searchable. They have a lot of content in different places (physical files, other websites, a database, etc.). Trying to search all of these places in real time would be a nightmare…and incredibly slow! So instead, I decided to build a web spider that caches the website content to a local drive on the web server and indexes the content into a Lucene index. The searches are now incredibly fast (under half a second), return relevancy scores and merge results from all of the sources. Some of the code is example code from DotLucene (http://www.dotlucene.net/download/)…most of it is original.
This application shows you how to use Lucene to create a custom advanced search engine. There is far too much code to walk through every part, so I'm only discussing the important pieces.
Download code
Admin page – Allows user to index/re-index websites, delete indexed websites and index physical hard drive folders.
"http://www.brianpautsch.com" is indexed in less than 5 seconds.
"C:\_Websites\" is indexed in less than 5 seconds.
Lines 138-153: AddWebPage method: The spider calls this method for each link found. This method strips off any bookmarks, verifies the file extension is in our list of valid extensions and ensures the site is within our base URL. If all of these tests pass, an instance of WebPageState is created (URL loaded in constructor), a unique ID is assigned and the page is put in the queue of pages that need to be visited and indexed/cached.
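The general shape of that logic might look like the following sketch. This is not the exact source; the class and member names (Spider, WebPageState, m_baseUrl, m_pendingPages, the extension list) are assumptions based on the description above.

    using System;
    using System.Collections.Generic;

    // Hypothetical sketch of the spider's queueing logic described above.
    public class WebPageState
    {
        public WebPageState(string url) { Url = url; }   // URL loaded in the constructor
        public string Url;
        public int Id;
        public string Html;
        public bool ProcessSuccessful;
    }

    public class Spider
    {
        private readonly string m_baseUrl;
        private readonly string[] m_validExtensions = { ".html", ".htm", ".aspx", ".txt" };
        private readonly Queue<WebPageState> m_pendingPages = new Queue<WebPageState>();
        private int m_lastId;

        public Spider(string baseUrl) { m_baseUrl = baseUrl; }

        public void AddWebPage(string url)
        {
            // Strip any bookmark ("page.html#section" -> "page.html").
            int hash = url.IndexOf('#');
            if (hash >= 0)
                url = url.Substring(0, hash);

            // Work out the extension without assuming the URL is a valid file path.
            string ext = "";
            int dot = url.LastIndexOf('.');
            if (dot > url.LastIndexOf('/'))
                ext = url.Substring(dot).ToLower();

            // Only queue pages with a valid extension that live under the base URL
            // (extensionless directory URLs are treated as valid in this sketch).
            bool validExt = ext.Length == 0 || Array.IndexOf(m_validExtensions, ext) >= 0;
            if (!validExt || !url.StartsWith(m_baseUrl, StringComparison.OrdinalIgnoreCase))
                return;

            // Create the page state, assign a unique ID and queue it to be
            // visited, cached and indexed later.
            WebPageState page = new WebPageState(url);
            page.Id = ++m_lastId;
            m_pendingPages.Enqueue(page);
        }
    }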
Lines 171-214: Process method: This method makes a WebRequest to the URL, checks the status code, stores the HTML and sets the process success flags.
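Under the hood that step probably looks something like this sketch, continuing the hypothetical Spider class above (the method body is an assumption, not the exact source):

    using System.IO;
    using System.Net;

    // Hypothetical sketch of Process: fetch the page, check the status code,
    // keep the HTML and set the success flag on the WebPageState.
    private void Process(WebPageState page)
    {
        page.ProcessSuccessful = false;
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(page.Url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                    {
                        page.Html = reader.ReadToEnd();   // raw HTML kept for caching and indexing
                    }
                    page.ProcessSuccessful = true;
                }
            }
        }
        catch (WebException)
        {
            // Broken links, timeouts, 404s, etc. -- leave the flag false and move on.
        }
    }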
Lines 261-269: HandleLinks method: This method uses a regular expression to find all URL links on the page.
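A regular expression along these lines would do the job; the pattern below and the call back into AddWebPage are assumptions, not the exact expression used in the source.

    using System;
    using System.Text.RegularExpressions;

    // Hypothetical sketch of HandleLinks: pull href targets out of the downloaded
    // HTML and hand each one back to AddWebPage for filtering and queueing.
    private static readonly Regex s_hrefRegex = new Regex(
        @"href\s*=\s*[""']?(?<url>[^""'\s>]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private void HandleLinks(WebPageState page)
    {
        foreach (Match match in s_hrefRegex.Matches(page.Html))
        {
            // Resolve relative links against the current page before queueing them.
            Uri absolute = new Uri(new Uri(page.Url), match.Groups["url"].Value);
            AddWebPage(absolute.ToString());
        }
    }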
Lines 272-285: AddWebPageToIndex method: Once all of the information for a page has been gathered, this method is called to add it to the index. Note that fields are added as "UnIndexed", "Text", etc. Here's a little explanation of each (a sketch of the method follows the list):
Field.Keyword – The data is stored and indexed but not tokenized (last modified date, filename)
Field.Text – The data is stored, indexed and tokenized – searchable small text (title)
Field.UnStored – The data is not stored, but it is indexed and tokenized (file content)
Field.UnIndexed – The data is stored but not indexed or tokenized (URL)
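A minimal sketch of AddWebPageToIndex using those DotLucene (Lucene.Net 1.x) field helpers might look like this. The field names, the Title/PlainText/LastModified members on WebPageState and the writer parameter are assumptions, not the actual source:

    using Lucene.Net.Documents;
    using Lucene.Net.Index;

    // Hypothetical sketch of AddWebPageToIndex using the four field types above.
    private void AddWebPageToIndex(IndexWriter writer, WebPageState page)
    {
        Document doc = new Document();

        // Keyword: stored and indexed, not tokenized -- exact-match data.
        doc.Add(Field.Keyword("modified", page.LastModified.ToString("yyyyMMddHHmmss")));

        // Text: stored, indexed and tokenized -- small searchable text.
        doc.Add(Field.Text("title", page.Title));

        // UnStored: indexed and tokenized but not stored -- the page body is
        // searchable, while the original content lives in the cache folder.
        doc.Add(Field.UnStored("text", page.PlainText));

        // UnIndexed: stored only -- the URL is returned with each hit but never searched.
        doc.Add(Field.UnIndexed("url", page.Url));

        writer.AddDocument(doc);
    }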
Index/Cache Storage
As new websites are indexed, they are stored in separate folders under "…\LuceneWeb\WebUI\cache\" and "…\LuceneWeb\WebUI\index\". The current version only allows for one index of the hard drive, which is stored in the "…\LuceneWeb\WebUI\index\folders\" folder. This logic could easily be changed to support multiple indices of the hard drive.
Search page – Allows user to select an index and search it.
"http://www.gotdotnet.com" is searched for "microsoft" – 158 results found in .26 seconds.
"C:\_Websites\" is searched for "search" – 10 results found in .63 seconds.
Lines 145-217: PerformSearch method – This is the main method in this class (the code-behind). It starts by determining the index location and creating an empty DataTable for the results. A basic query is performed (no special filtering, e.g. by date or buckets) and returns a "Hits" object. A QueryHighlighter object is created, and as each result is extracted its contents are highlighted. Finally, each DataRow is added to the DataTable, which is later bound to the Repeater.
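Stripped of the highlighting and paging details, the search flow might be sketched like this. The field names, the "text" default field and the column layout are assumptions, and the QueryHighlighter step is omitted:

    using System.Data;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Hypothetical sketch of PerformSearch: query the chosen index and collect
    // the hits into a DataTable that the results Repeater can bind to.
    private DataTable PerformSearch(string indexPath, string queryText)
    {
        DataTable results = new DataTable();
        results.Columns.Add("title", typeof(string));
        results.Columns.Add("url", typeof(string));
        results.Columns.Add("score", typeof(float));

        IndexSearcher searcher = new IndexSearcher(indexPath);
        try
        {
            // Basic query against the body text -- no date or bucket filtering.
            Query query = QueryParser.Parse(queryText, "text", new StandardAnalyzer());
            Hits hits = searcher.Search(query);

            for (int i = 0; i < hits.Length(); i++)
            {
                Document doc = hits.Doc(i);
                DataRow row = results.NewRow();
                row["title"] = doc.Get("title");
                row["url"] = doc.Get("url");     // stored as UnIndexed during indexing
                row["score"] = hits.Score(i);
                results.Rows.Add(row);
            }
        }
        finally
        {
            searcher.Close();
        }
        return results;
    }

In the actual code-behind the extracted content is also run through the highlighter before the DataTable is bound to the Repeater; that step is left out here for brevity.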