— gavinmhackeling.com/blog

Deleting all documents from the local index:

curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
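
For use from a script rather than the shell, here is a minimal Python 3 sketch of a delete-everything request. It assumes the standard Solr XML update message <delete><query>*:*</query></delete> and the local URL used above; the function name build_delete_all_request is hypothetical.

```python
import urllib.request

# Assumed local Solr update endpoint; commit=true makes the delete visible at once.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"

# Delete-by-query message that matches every document in the index.
DELETE_ALL = b"<delete><query>*:*</query></delete>"


def build_delete_all_request(url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request that deletes every document."""
    return urllib.request.Request(
        url,
        data=DELETE_ALL,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


if __name__ == "__main__":
    # Sending the request needs a running Solr instance at the URL above.
    with urllib.request.urlopen(build_delete_all_request()) as response:
        print(response.status)
```

Building the request separately from sending it makes it easy to inspect before pointing it at a real index.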

Adding documents to a remote index:

java -Durl=http://ec2-...amazonaws.com:8983/solr/update -jar post.jar

Installing:

  • http://www.shayanderson.com/linux/install-solr-on-ubuntu-1104-server.htm#comment-524855678
  • Build both Solr (ant clean test) and the example (ant example)
  • Make sure the appropriate port is opened in the EC2 security group
  • To run Jetty/Solr as a background process:
    /example$ java -jar start.jar &
  • To daemonize, add to supervisor.conf:
    [program:apache_solr]
    command=/path/to/my/scripts/folder/apache-solr-supervisor-run.sh
  • Make a script that starts the Solr .jar in Jetty:
    #!/bin/bash
    # enter solr dir
    cd /path/to/my/apache-solr-1.4.1/installation
    # start solr
    java -jar start.jar
  • Make the script executable:
    sudo chmod +x scriptName.sh

  • Reload and restart supervisor processes

Schema notes:

UniqueKey field must be a string.

  • Design schema for documents in collection
  • Preprocess documents, write .xml
  • POST .xml to server
  • Access with Sunburnt from Tornado
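
The "preprocess documents, write .xml" step can be sketched in Python 3 as follows. This assumes Solr's XML update format (an <add> element wrapping <doc> elements made of named <field> elements). The field names "id" and "title" and the function name build_add_xml are hypothetical; real field names have to match the schema.

```python
import xml.etree.ElementTree as ET


def build_add_xml(documents):
    """Serialize a list of dicts as a Solr <add> update message."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            # One <field name="..."> element per schema field.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # ElementTree escapes characters such as & and < in field values.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical document; "id" stands in for the uniqueKey field.
    print(build_add_xml([{"id": "doc-1", "title": "Hello, Solr"}]))
```

The resulting string can be written to a .xml file and uploaded with post.jar or POSTed to the update handler.
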
  • Lucene looks great. However:
  1. I absolutely cannot get PyLucene to build on Ubuntu 12.04. There is a build in a PPA somewhere, but it is for an old version of PyLucene. There are eggs for Windows. There does not seem to be much activity, lots of people are having problems building, and I have not found many examples.
  2. LuPyne installs, but is broken and throws a cryptic message. I have not found any people using LuPyne so I have not worked much on fixing it.
  3. I would use Lucene with Jython, but Jython does not mix with my server, Tornado. AFAIK Jython still implements Python 2.5.
  4. I would run a Solr server separately and query it from the Tornado server using something like Celery, but I do not know how. I need to evaluate whether Tornado would potentially provide performance benefits for my application.
  • Alternatively, I could use an existing service, like Amazon CloudSearch. However, it is not free and I would rather confirm that this part of the application is viable before paying for it.
  • I am going to try Whoosh. It is search in pure Python. It apparently is much slower than Lucene, but it installs, should work with Tornado, and should be the easiest to prototype with. I’ll try building PyLucene again, or try to use the old version from the PPA.

EDIT:

Never mind, I am proceeding with Solr and a Python interface (probably Sunburnt).

  1. Use post.jar to upload and index documents. It is also possible to POST JSON documents or import a .csv file.

Knowetry is a web app that runs on Tornado. It extends the answer-type classifier I discussed previously.

It works like this:

  • The user enters a question
  • The app predicts the type of the answer for the question; this could be “individual,” “vehicle,” “description,” etc.
  • The app looks up a string template for a nonsensical poem associated with the answer type, and completes the template with the question text.
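
The template step can be sketched in Python 3 as follows. The templates, the fallback, and the function name complete_poem are all hypothetical, since the app's real poems are not shown here, and the answer type is assumed to come from the classifier.

```python
from string import Template

# Hypothetical templates keyed by answer type; the real app's poems differ.
POEM_TEMPLATES = {
    "individual": Template("Somebody, somewhere, surely knows:\n$question"),
    "vehicle": Template("It rolls, it floats, it sometimes flies:\n$question"),
    "description": Template("Explain it slowly, explain it twice:\n$question"),
}

# Fallback for answer types that have no template of their own.
DEFAULT_TEMPLATE = Template("Nobody knows, but still we ask:\n$question")


def complete_poem(answer_type, question):
    """Look up the template for the answer type and fill in the question."""
    template = POEM_TEMPLATES.get(answer_type, DEFAULT_TEMPLATE)
    return template.substitute(question=question)


if __name__ == "__main__":
    print(complete_poem("vehicle", "What did Amelia Earhart fly?"))
```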

Viewable at http://23.21.89.91/questionansweranalysis

Source forthcoming.

This project is part of an open-domain question-answering system for natural language questions. To effectively retrieve a correct answer to a question it is helpful to constrain the possible types of answers. A QA system may not employ uniform methods for information retrieval for all types of answers; Evi, for instance, variously queries APIs, structured data sources, and unstructured text depending on the submitted question. This project uses machine learning to classify questions by their expected answer types. I am using the hierarchical classification approach discussed in this paper.


What are Support Vector Machines?

‘Support Vector Machines’ sounds like the type of robots that killed your family during the Great Robot Uprising. Actually, SVMs are machine-learning methods for classification and regression analysis that use supervised learning.

Supervised learning is the task of predicting a target variable Y from an observed data matrix X for a set of samples. Classification associates each sample with one class from a finite set of classes, so Y is a vector of integers representing classes or labels; for instance, identifying a tweet as being positive or negative is a classification problem. Predicting a continuous target variable is called regression. An example of a regression problem is predicting the price of a house based on its location and size.

From http://scikit-learn.org/stable/modules/svm.html:

http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html

Advantages include:

SVMs are more effective than Bayesian classifiers in high-dimensional spaces.

The decision function uses only a subset of the training points, called support vectors, which also makes SVMs memory efficient.

Scikit-learn for Python includes common kernels by default, and custom kernels can also be specified.

Disadvantages include:

If the number of features greatly exceeds the number of samples, the method is likely to perform poorly.
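
To make the idea of support vectors concrete, here is a small pure-Python 3 sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. This is a swapped-in teaching technique, not how scikit-learn fits its models, and in practice a library class such as sklearn.svm.SVC would be the tool to use. The toy data, the hyperparameters, and all function names are assumptions.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss, one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision_value(w, b, xi) < 1
            # The regularization term shrinks w on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if violated:
                # Only samples inside the margin pull the boundary.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def support_vectors(X, y, w, b, tol=0.1):
    """Training points on or inside the margin (approximate, within tol)."""
    return [xi for xi, yi in zip(X, y) if yi * decision_value(w, b, xi) <= 1 + tol]


# Toy, linearly separable data: class +1 upper right, class -1 lower left.
X = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(X, y)
    print("weights:", w, "bias:", b)
    print("approximate support vectors:", support_vectors(X, y, w, b))
```

Only the points closest to the boundary keep triggering updates late in training, which is the sense in which the decision function depends on a subset of the training points.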


I am making these notes public, but some links are to private documents; sorry.


http://www.joelonsoftware.com/articles/Unicode.html


In Part 1 of this series I (incompletely) described the process for creating an EC2 instance and installing an Ubuntu/Nginx/Tornado/Supervisor software stack with the NLTK library for Python. In Part 2 I described configuration settings for Nginx and Supervisor and wrote a basic Tornado application that uses routes, templates and NLTK. In this post I describe how to forward a domain name to your EC2 instance’s address, and how to run multiple instances of your application on multiple processor cores.

Forwarding a domain name to your EC2 instance

First you must map a public IP address to your EC2 private address. In the EC2 tab of the AWS Management Console, select “Elastic IP” from “My Resources” and allocate a new IP address. Then associate the IP address with your EC2 instance. Note that doing so will change your ec2-n-n-n-n.compute-1.amazonaws.com address to reflect the elastic IP. Caution: an Elastic IP address that is not associated with an EC2 instance will cost you $0.01 ($0.02?) per hour for as long as it remains unassociated.

Next, configure your domain name with your registrar to forward the domain to the elastic IP address. It should take only a few minutes for the association of your elastic IP and instance to propagate, and your domain-forwarding should take effect immediately. (Question: if your domain registrar supports forwarding to a domain name rather than to an IP address, can you forward to your EC2 public domain without associating an elastic IP?)

Running multiple instances of Tornado

Pre-forking in Tornado:
Tornado can fork itself before starting the IOLoop, which automatically creates one Tornado process (not a thread) for each CPU core. I have not tried this method yet.
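
A minimal sketch of pre-forking, assuming Tornado's HTTPServer.bind() and start() methods, where start(0) forks one child process per detected CPU core. The handler, the make_app helper, and port 8000 are placeholders, and like the approach itself this sketch is untested on a real deployment. Forking is not available on Windows, and newer Tornado releases document alternatives to bind()/start().

```python
import tornado.httpserver
import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello from a pre-forked worker")


def make_app():
    # Placeholder application with a single route.
    return tornado.web.Application([(r"/", MainHandler)])


if __name__ == "__main__":
    server = tornado.httpserver.HTTPServer(make_app())
    server.bind(8000)
    # 0 means "fork one child process per CPU core"; all children
    # share the listening socket bound above.
    server.start(0)
    tornado.ioloop.IOLoop.current().start()
```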

Using Nginx as a reverse proxy for multiple Tornado processes:
I create multiple instances of my Tornado process (one for each core) and make each process listen to a different port. The port is passed as an argument:

python app.py --port=8005

Nginx works as a load-balancing reverse-proxy for the several Tornado instances.
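
As a rough Python 3 sketch of the one-process-per-core setup, the launcher below starts app.py once per core on consecutive ports. It is a hypothetical stand-in for a process manager such as Supervisor. It assumes app.py parses a --port argument (the listing in this series hard-codes port 8000) and that the Nginx upstream lists the same ports. The base port 8000 and the name build_commands are assumptions.

```python
import multiprocessing
import subprocess
import sys


def build_commands(base_port, count, script="app.py"):
    """Return one argv list per process, each with its own --port value."""
    return [
        [sys.executable, script, "--port=%d" % (base_port + offset)]
        for offset in range(count)
    ]


if __name__ == "__main__":
    # One Tornado process per CPU core, on ports 8000, 8001, ...
    commands = build_commands(8000, multiprocessing.cpu_count())
    processes = [subprocess.Popen(command) for command in commands]
    for process in processes:
        process.wait()
```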


The following is the app.py that Nginx forwards requests to:

import os.path
import tornado.ioloop
import tornado.web
import nltk

class MainHandler(tornado.web.RequestHandler):
    def post(self):
        # Tag the submitted text, then collect the named entities found in it.
        text = self.get_argument("rawtext")
        entities = []
        postags = nltk.pos_tag(nltk.word_tokenize(text))
        # print postags
        for chunk in nltk.ne_chunk(postags):
            # Named-entity subtrees have a 'node' attribute; plain tokens are tuples.
            # (NLTK 3 replaced Tree.node with Tree.label().)
            if hasattr(chunk, 'node'):
                tmp_tree = nltk.Tree(chunk.node, [(' '.join(c[0] for c in chunk.leaves()))])
                entities.append(tmp_tree)
        # print entities
        self.render("home_post.html", text=text, entities=entities)

    def get(self):
        self.render("home_get.html")

handlers = [
    (r"/", MainHandler),
]

settings = dict(
    template_path=os.path.join(os.path.dirname(__file__), "templates"),
)

application = tornado.web.Application(handlers, **settings)

if __name__ == "__main__":
    application.listen(8000)
    tornado.ioloop.IOLoop.instance().start()