Lucene is a jewel in the open source world: it's a highly scalable, fast search engine written in Java. Its class-per-class C# cousin, Lucene.Net, is also a precious stone in the .NET universe. Over the years, Lucene.Net has gained considerable popularity and is used in a wide range of products.
C|NET, FogBugz on Demand and Wikipedia are just a few sites that use Lucene.Net for indexing and searching their vast content. Even Microsoft has been forced to deal with a competing Lucene-powered Outlook extension that embarrasses the built-in search engine.
It's an extremely useful tool in the belt of any software engineer, not just for its design and implementation of modern information retrieval theories, but for its simple API too.
In my experience, libraries with simple concepts have simple APIs. Lucene is no different, because the search engine concept is very simple: a search engine is a system for finding text in a large set of documents very quickly.
At the heart of the engine is the index (similar to a database table), which contains documents (like database rows) made up of fields (like database columns). To search, one writes a query and gives it to the engine to find matching documents. The query language for a database is SQL; for Lucene it's a query object (you can construct complex queries by composing an object graph of query instances).
Search engines are super fast at finding text because documents are stored in an inverted index (where the terms of each field are tokenized, hashed and sorted at index time), so a search looks up terms directly instead of scanning every document.
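To make the inverted index idea concrete, here is a toy sketch in C#. It is only an illustration under my own assumptions: the `TinyInvertedIndex` class and its naive tokenizer are made up for this post, and Lucene's real index structure is far more sophisticated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index (hypothetical, not Lucene's implementation):
// maps each term to the sorted set of ids of the documents containing it.
public class TinyInvertedIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // Naive analyzer: lowercase, then split on spaces and basic punctuation.
    public static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ',', '.', ';', ':', '!', '?' },
                          StringSplitOptions.RemoveEmptyEntries);
    }

    // Index time: record the document id under every term it contains.
    public void Add(int docId, string text)
    {
        foreach (var term in Tokenize(text))
        {
            SortedSet<int> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new SortedSet<int>();
                postings[term] = docs;
            }
            docs.Add(docId);
        }
    }

    // Search time: one dictionary lookup, no scan over the documents.
    public IEnumerable<int> Search(string term)
    {
        SortedSet<int> docs;
        return postings.TryGetValue(term.ToLowerInvariant(), out docs)
            ? docs
            : Enumerable.Empty<int>();
    }
}

public static class InvertedIndexDemo
{
    public static void Main()
    {
        var index = new TinyInvertedIndex();
        index.Add(1, "Marketing Manager");
        index.Add(2, "Sales Manager");
        index.Add(3, "Owner");

        // Prints the ids of the documents containing "manager".
        Console.WriteLine(string.Join(",", index.Search("manager")));
    }
}
```

The work is done up front at index time, which is why the search itself stays cheap no matter how many documents have been added.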
In contrast to database queries, search engines can calculate relevance scores when searching. This is because they use a better querying model, the vector space model, instead of the classical boolean model. In the vector model, documents and queries are represented as vectors. The similarity between a query and any document can be calculated with simple vector operations. Documents with a higher similarity appear higher in the results. Conversely, databases only know whether a row meets the where criteria or not and cannot compute a relevance score; this true/false classification is how the boolean model got its name.
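Here is a rough sketch of the vector idea in C#, assuming raw term counts as vector weights and cosine similarity as the "simple vector operation". The `VectorSpaceDemo` class is my own illustration; Lucene's actual scoring uses more refined term weighting and extra factors, so treat this as the concept rather than the real formula.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorSpaceDemo
{
    // Build a term-frequency vector: term -> number of occurrences.
    // Assumption: raw counts as weights, whitespace tokenization only.
    public static Dictionary<string, double> ToVector(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .GroupBy(t => t)
                   .ToDictionary(g => g.Key, g => (double)g.Count());
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
    public static double Cosine(Dictionary<string, double> a,
                                Dictionary<string, double> b)
    {
        double dot = 0;
        foreach (var pair in a)
        {
            double weight;
            if (b.TryGetValue(pair.Key, out weight)) dot += pair.Value * weight;
        }
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void Main()
    {
        var query = ToVector("marketing manager");
        var docs = new[] { "Sales Manager", "Owner", "Marketing Manager" };

        // Rank documents by similarity to the query, best match first.
        foreach (var doc in docs.OrderByDescending(d => Cosine(query, ToVector(d))))
            Console.WriteLine("{0:F2}  {1}", Cosine(query, ToVector(doc)), doc);
    }
}
```

With these inputs, "Marketing Manager" scores highest, "Sales Manager" shares one of the two terms and lands in the middle, and "Owner" shares nothing and scores zero. A boolean where clause could only tell you which rows matched, not how well.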
Search engines are rad!! And with Lucene any developer can easily add a search engine to their application.
To satiate the rampant LINQ junkie within, I've been contributing to the LINQ to Lucene project – an open source LINQ provider framework.
My recent contribution to the project makes creating indexes and searching them even easier. Today, I'm going to demonstrate some features I've added so you too can easily Lucene-ify your application.
A common use of Lucene in applications is to complement the database with a Lucene index for searching. LINQ to Lucene offers this functionality out of the box for data classes generated by LINQ to SQL.
Let's start with the Northwind database. By following ScottGu's instructions, you can create a DBML file with generated classes from the database schema. This demonstration only needs the Customer and Order tables.
The next step is to inform LINQ to Lucene which classes to index and how they are indexed using attributes. We do this by extending the generated Customer and Order classes and decorating them with attributes. If you don't know how to create partial classes, see Chris Sainty's post.
- [Document]
- public partial class Customer {
- }
- [Document]
- public partial class Order {
- }
Now we've marked these two classes as documents for indexing. The next step is to specify how each field is indexed. But there is a bit of a problem with properties in generated classes: there is no way to add attributes to generated properties without spoiling the automatically generated files.
Our solution to this is to add the MetadataType property on the DocumentAttribute which tells Lucene to look at another class for the field attributes it needs. Like so:
- [Document(MetadataType = typeof(CustomerMetadata))]
- public partial class Customer {
- }
- [Document(MetadataType = typeof(OrderMetadata))]
- public partial class Order {
- }
Now we can create fake properties on our metadata types to specify field characteristics. The return type of the fake properties doesn't matter; only the name has to match the property on the generated class.
- public class CustomerMetadata {
- [Field(FieldIndex.Tokenized, FieldStore.Yes, IsDefault = true)]
- public object ContactName { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes, IsKey = true)]
- public object CustomerID { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object ContactTitle { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object CompanyName { get; set; }
- }
- public class OrderMetadata {
- [Field(FieldIndex.Tokenized, FieldStore.Yes, IsKey = true)]
- public object OrderID { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes, IsDefault = true)]
- public object CustomerID { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object ShipName { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object ShipAddress { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object ShipCity { get; set; }
- [Field(FieldIndex.Tokenized, FieldStore.Yes)]
- public object ShipCountry { get; set; }
- }
Now we've told Lucene.Net which properties to index and how to index them.
- The FieldIndex property indicates whether or not the field is tokenized (i.e. split into parts). By default, this is UnTokenized to save index space.
- The FieldStore property tells Lucene whether or not to store the original value in the index. Again, by default, this is No to save index space.
- IsKey should be true for the primary key field. Only one property can be marked as the key, so LINQ to SQL classes with composite keys should have a new uniquely identifying property.
- IsDefault tells Lucene which field is the default field for searching.
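If you're wondering how a separate metadata class can describe properties on another class, the name matching can be sketched with plain reflection. This is only my illustration of the pattern: the `DocumentAttribute` and `FieldAttribute` below are cut-down stand-ins declared for the example (not the real LINQ to Lucene types), and `MetadataLookup.FindField` and the `Phone` property are hypothetical.

```csharp
using System;
using System.Reflection;

// Stand-in attributes for illustration only; NOT the real LINQ to Lucene types.
[AttributeUsage(AttributeTargets.Class)]
public class DocumentAttribute : Attribute
{
    public Type MetadataType { get; set; }
}

[AttributeUsage(AttributeTargets.Property)]
public class FieldAttribute : Attribute
{
    public bool IsKey { get; set; }
    public bool IsDefault { get; set; }
}

// Plays the role of a generated class whose properties we cannot annotate.
[Document(MetadataType = typeof(CustomerMetadata))]
public class Customer
{
    public string CustomerID { get; set; }
    public string ContactName { get; set; }
    public string Phone { get; set; } // hypothetical: no metadata declared for it
}

public class CustomerMetadata
{
    [Field(IsKey = true)] public object CustomerID { get; set; }
    [Field(IsDefault = true)] public object ContactName { get; set; }
}

public static class MetadataLookup
{
    // Finds the FieldAttribute for a property by looking up the same property
    // name on the metadata type (or on the document type itself when no
    // metadata type is given). Returns null when nothing is declared.
    public static FieldAttribute FindField(Type documentType, string propertyName)
    {
        var doc = (DocumentAttribute)Attribute.GetCustomAttribute(
            documentType, typeof(DocumentAttribute));
        Type source = (doc != null && doc.MetadataType != null)
            ? doc.MetadataType
            : documentType;

        PropertyInfo property = source.GetProperty(propertyName);
        if (property == null) return null;

        return (FieldAttribute)Attribute.GetCustomAttribute(
            property, typeof(FieldAttribute));
    }

    public static void Main()
    {
        foreach (var p in typeof(Customer).GetProperties())
        {
            var field = FindField(typeof(Customer), p.Name);
            Console.WriteLine("{0}: {1}", p.Name,
                field == null ? "no field metadata" : "field, key=" + field.IsKey);
        }
    }
}
```

Because the lookup goes by property name alone, the type declared on the metadata property is never inspected, which is why `object` works fine there.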
Now we're ready to create the index.
To create the index from a LINQ to SQL Data Context, we use the DatabaseIndexSet like so:
- var dbi = new DatabaseIndexSet(
- @"C:\index\", // index path
- new NorthwindDataContext() // data context instance
- );
- dbi.Write();
By running this code, we'll create an index for the entire contents of the Customer and Order tables. LINQ to Lucene is smart enough to collect all the relevant data from the Northwind database, convert each row to a Lucene Document and add it to the index.
Sweet! We've got an index of the Customer and Order tables.
Now we can search for Customers or Orders.
Let's find all the Customers who are Marketing Managers…
- var mmCustomers = from c in dbi.Get<Customer>()
- where c.ContactTitle == "Marketing Manager"
- select c;
- Console.WriteLine("Marketing Manager Customers: Found {0}", mmCustomers.Count());
- foreach (var customer in mmCustomers) {
- Console.WriteLine(customer.ContactName);
- }
This very simple example demonstrates the query equality operator, but many other types of query are possible (including wildcards, prefixes, proximities). You can find more ways to query on the LINQ to Lucene homepage.
In future posts, I'll demonstrate more complex query examples and how to supply custom formatters for properties.
To see the source of LINQ to Lucene, you can get the latest release. The sample project from this post is available for download here.
NOTE: The sample project uses the Northwind database as a data source. Northwind can be downloaded and installed to your SQL Server instance from here.