Natural Language Processing and Machine Learning, some pointers

October 14th, 2012  |  Published in data

I’ve been doing a lot of natural-language machine-learning work recently, both for clients and in side projects. Mark Needham asked me on Twitter for some pointers to good introductory material. Here’s what I wrote for him:

Nearly all text processing starts by transforming text into vectors.
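As a minimal sketch of the counting approach, here is a bag-of-words example using scikit-learn’s CountVectorizer (my choice of library, not necessarily what the original pointer recommended; the toy documents are mine):

```python
# Bag-of-words vectorisation: each document becomes a vector of word
# counts over a shared vocabulary. Toy documents, not from the post.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per document
```

Each row of the matrix is one document; each column counts one vocabulary word.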

Often the vectors are reweighted with transforms such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms).
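To make the idea concrete, here is a rough sketch of TF-IDF weighting, again assuming scikit-learn; the toy corpus is mine:

```python
# TF-IDF reweighting: words that appear in many documents get a low
# idf (inverse document frequency) and hence a low weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the dog",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)  # rows are documents, columns are words

# "the" occurs in every document, so its idf (hence weight) is lowest;
# words confined to one document, like "mat" or "flew", score highest.
for word, idx in sorted(tfidf.vocabulary_.items()):
    print(f"{word}: idf={tfidf.idf_[idx]:.2f}")
```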

Collocation detection is a technique for spotting when two or more words occur together more often than they do separately (e.g. “wishy-washy” in English). I use it to group words into n-gram tokens, because many NLP techniques treat each word as if it were independent of all the others in a document, ignoring order.
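A rough sketch of collocation detection with NLTK’s bigram finder, scoring pairs by pointwise mutual information (PMI); the toy sentence is mine:

```python
# Bigram collocation detection: find word pairs that co-occur more
# often than their individual frequencies would predict.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

words = ("he made a wishy washy promise and then "
         "another wishy washy excuse").split()

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)  # ignore pairs seen fewer than twice

# Highest-PMI pairs are the candidates to merge into single tokens.
print(finder.nbest(BigramAssocMeasures.pmi, 3))
```

The surviving high-PMI pairs are the ones worth merging into single n-gram tokens before vectorising.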

When you’ve got a lot of text and you don’t know what the patterns in it are, you can run “unsupervised” clustering using Latent Dirichlet allocation, which discovers topics as groups of words that tend to occur together.
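As an illustrative sketch, here is LDA with gensim’s LdaModel (one common implementation; the post doesn’t say which one its link pointed to, and the toy corpus is mine):

```python
# Unsupervised topic discovery with LDA: feed in bag-of-words vectors,
# get back topics as weighted lists of words.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["cat", "dog", "pet", "vet"],
    ["dog", "pet", "food", "vet"],
    ["stock", "market", "trade", "price"],
    ["market", "price", "shares", "trade"],
]

dictionary = Dictionary(docs)                       # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

# Each topic is a weighted list of words; on this toy corpus one should
# come out roughly "pets" and the other roughly "finance".
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```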

Or if you know how your data is divided into topics, otherwise known as “labeled data”, then you can run “supervised” techniques such as training a classifier to predict the labels of new, similar data. I can’t find a really good page on this; I picked up a lot over IM with my friend Ben, who is writing a book coming out next year.
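For illustration, a hedged sketch of the supervised route: TF-IDF features feeding a naive Bayes classifier in scikit-learn (my choices; nothing in the post prescribes a particular algorithm, and the labeled examples are mine):

```python
# Supervised text classification: learn from labeled documents, then
# predict the labels of new, similar documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the cat sat on the mat",
    "dogs love their owners",
    "the stock market fell today",
    "investors traded shares all morning",
]
train_labels = ["pets", "pets", "finance", "finance"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict the labels of unseen documents.
print(model.predict(["my dog chased a cat", "share prices rose"]))
```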

Here are the tools I’ve mostly been using:

Some blogs I like:

MetaOptimize Q+A is the Stack Overflow of ML:

The Mahout in Action book is quite good and practical:
