Lucene Boot Camp Preclass Training

Welcome, Lucene Boot Campers! I am glad you have signed up to take the Lucene Boot Camp training. I look forward to meeting you and working with you to create awesome Lucene applications. For the day to go smoothly, please take the time to follow the instructions below.

Setup

Two things need to be done for setup: the first is setting up Lucene in your IDE, and the second is getting a document collection.

Boot Camp Files

We will provide you with skeleton files and other information to allow you to focus on Lucene without having to worry about setting up infrastructure. Commands to be run on a command line are in italics.

  1. Download and install Maven 2.0.7 or later (don’t worry, we will tell you all the Maven commands you will need to know) and make sure it is in your path
  2. Check out the Lucene Boot Camp training material from Subversion using the command below. Don’t look at the *Finished files if you can help it!
    1. svn co http://www.lucenebootcamp.com/LuceneBootCamp/
  3. Open a command line and change to the directory you checked these files out into.
  4. Run mvn idea:idea or mvn eclipse:eclipse to generate the project files for your IDE (warning: I am not an Eclipse user, so I may not be able to help with Eclipse issues). Alternatively, if you are using IntelliJ 7.x, create a new project and use Maven as the project model.
    1. Eclipse Users:  See here for information on getting the code to work in Eclipse.
  5. Open the newly created project in your IDE if you haven’t already.
  6. You may want to associate the source code for Lucene with your project by downloading the source at http://www.apache.org/dyn/closer.cgi/lucene/java/ and unpacking it.
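
Before class, it can help to confirm that the command-line tools from the steps above are actually reachable. The following is only a minimal sketch: check_tool is a hypothetical helper name, not part of the Boot Camp files, and it simply asks the shell whether a command can be found on the PATH.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of the Boot Camp files).
# It reports whether the named command can be found on the PATH.
# Returns 0 if the command is found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

# Typical use before class:
#   check_tool mvn
#   check_tool svn
```

If both tools are reported as found, running mvn -version prints the installed Maven version, which should be 2.0.7 or later.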

Content

Please download and unpack the following collections and have them available on your hard drive:

  1. Reuters Collection: http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
  2. Wikipedia Collection (optional): http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
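
For the Reuters download, unpacking can be sketched roughly as follows. This assumes a tar that supports the -C option (both GNU and BSD tar do). unpack_tarball is a hypothetical helper name, and the target directory name reuters is only an assumption borrowed from my own layout described further down.

```shell
#!/bin/sh
# unpack_tarball is a hypothetical helper (not part of the Boot Camp files).
# It extracts a gzip-compressed tar archive into a target directory,
# creating that directory first if it does not exist.
unpack_tarball() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example, assuming the archive was downloaded to the current directory:
#   unpack_tarball reuters21578.tar.gz reuters
```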

You may also want to create a folder containing 10-20 files in the common file formats of: HTML, PDF, MS Word, XML and plain text.

To prepare the Reuters collection, change into the directory where you checked out the Boot Camp files and run:

mvn exec:exec -DreutersColl=<PATH TO REUTERS> -Dout=<PATH TO OUTPUT DIRECTORY>

This command will unpack the Reuters collection that was just downloaded and put it into the output directory specified.

After this runs, set aside 500 of the unpacked Reuters files in a separate “Reuters Small” directory (reuters-small in the listing below).

My directory structure looks like:

  • training - the code checked out from SVN
  • reuters - the original Reuters SGML files
  • reuters-out - the unpacked files created by mvn exec:exec
  • reuters-small - the first 500 files from reuters-out
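
One way to build the reuters-small directory is sketched below. It rests on two assumptions: that reuters-out is a flat directory of files, and that "first" means sorted file name order, which is how the shell expands the * pattern. copy_first_n is a hypothetical helper name, not part of the Boot Camp files.

```shell
#!/bin/sh
# copy_first_n is a hypothetical helper (not part of the Boot Camp files).
# It copies the first N regular files from SRC into DEST, taking files in
# the sorted order produced by shell glob expansion. Subdirectories are
# skipped, so a flat source directory is assumed.
copy_first_n() {
    src="$1"
    dest="$2"
    n="$3"
    mkdir -p "$dest" || return 1
    count=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue             # skip anything that is not a regular file
        [ "$count" -ge "$n" ] && break      # stop once N files have been copied
        cp "$f" "$dest"/ || return 1
        count=$((count + 1))
    done
    return 0
}

# Example, using the directory names from the listing above:
#   copy_first_n reuters-out reuters-small 500
```

Copying rather than moving leaves reuters-out complete, so the full collection is still available for larger indexing runs.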

Luke

Get Andrzej Bialecki’s excellent tool named Luke. Luke is a handy tool for browsing, debugging and testing your Lucene index.

Questions

Email me at the lucenebootcamp.com domain; my username is trainer.

Mailing List

I have set up a mailing list for discussing issues related to Lucene Boot Camp training. This list is not for discussing general Lucene issues; please use the Lucene mailing lists for that. Also, you must be a current or former student of Lucene Boot Camp to subscribe.