New Term Extraction tool

We’re happy to announce a new version of our Term Extraction tool. You can try it now.

The older version was written in Python, and many of our users struggled to get it set up. We’re not Python experts ourselves, so we couldn’t do much to help. The new version, in line with the rest of our code, is written in PHP. Setup should be very simple: upload the contents of the package to your server and you should be up and running.

Topia’s Term Extractor

To extract terms from a piece of content, we use Topia’s Term Extractor (thanks to Stephan Richter, Russ Ferriday and the Zope Community), whose documentation describes the extraction process as follows:

“This package determines important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) and some simple statistical analysis to determine the terms and their strength.

“Topia’s Term Extractor tries to produce results somewhere between a POS tagger like TreeTagger and Yahoo Keyword Extraction.

“Since we are only interested in nouns, a very simple POS tagging algorithm can be deployed, which will provide good results most of the time. We then use some simple statistics and linguistics to produce a narrow but strong list of terms for the content.”
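To give a rough feel for the idea described above, here is a minimal Python sketch of noun-based term extraction. It is not Topia’s code or our PHP port: the function name, the hand-assigned POS tags and the two threshold parameters are all assumptions made for illustration. It groups consecutive nouns into candidate terms, counts how often each occurs, and keeps a term if it is long enough (its “strength”, taken here as the number of words) or frequent enough.

```python
from collections import Counter


def extract_noun_terms(tagged_tokens, min_single_occurrences=1, min_words_no_limit=2):
    """Group consecutive nouns into terms and keep the stronger ones.

    tagged_tokens: list of (word, tag) pairs, where noun tags start with "N".
    Both thresholds are illustrative defaults, not values taken from Topia.
    """
    counts = Counter()
    run = []
    # A trailing sentinel flushes a noun run that ends the text.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag.startswith("N"):
            run.append(word.lower())
        elif run:
            counts[" ".join(run)] += 1
            run = []

    terms = []
    for term, occurrences in counts.items():
        strength = len(term.split())  # strength here = number of words
        if strength >= min_words_no_limit or occurrences >= min_single_occurrences:
            terms.append((term, occurrences, strength))
    return terms


# Hand-tagged tokens (an assumption: a real tagger would assign these tags).
tagged = [
    ("Italian", "NNP"), ("sculptors", "NNS"), ("and", "CC"), ("painters", "NNS"),
    ("of", "IN"), ("the", "DT"), ("renaissance", "NN"), ("favored", "VBD"),
    ("the", "DT"), ("Virgin", "NNP"), ("Mary", "NNP"), ("for", "IN"),
    ("inspiration", "NN"), (".", "."),
]

for term, occurrences, strength in extract_noun_terms(tagged):
    print(term, occurrences, strength)
```

Raising min_single_occurrences makes the list narrower: single-word terms then have to repeat before they are kept, while multi-word terms still pass on strength alone.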

The core component in our new version is a PHP port of Topia’s Term Extractor with some of Joseph Turian’s changes applied.

If you’re only interested in the PHP port, it’s free to download from our code repository.

Use as a web service: alternative to Yahoo’s Term Extraction

Our goal is for this tool to run as a web service similar to Yahoo’s Term Extraction service, but one that you control (no corporate APIs or restrictive Terms of Service).

In this version we’ve added support for multiple output formats (JSON, XML, HTML, plain text, serialised PHP) and a Yahoo compatibility mode. If you decide to switch over from Yahoo’s service, it’s as simple as updating the base URL for your requests.

For example, let’s say we want to extract terms from the following piece of text (the example used by Yahoo):

“Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration.”

Here’s what the request might look like for Yahoo:

To switch to our Term Extraction tool, you would simply change the base URL to point to your own copy:

This would return the following response:

{"ResultSet":{"Result":["italian sculptors","virgin mary","painters","renaissance","inspiration"]}}

Note: in this case both services return exactly the same terms, but Yahoo compatibility mode does not mean you’ll always get the same results as Yahoo’s service, only that the results are formatted the same way as Yahoo’s.
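If you call the service from code, the switch can be reduced to a single setting. The Python sketch below is only an illustration under stated assumptions: both base URLs are placeholders, and the parameter names context and output are hypothetical, so check them against your own installation. It builds a request URL from a configurable base URL and reads the terms out of a Yahoo-style JSON response like the one shown above. No request is actually sent.

```python
import json
from urllib.parse import urlencode

# Placeholder base URLs (assumptions, not real endpoints).
YAHOO_STYLE_BASE = "https://yahoo-service.example/term-extraction"
OWN_COPY_BASE = "https://your-server.example/term-extraction/"


def build_request_url(base_url, text, output="json"):
    """Build a GET URL; 'context' and 'output' are assumed parameter names."""
    query = urlencode({"context": text, "output": output})
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


def parse_terms(response_body):
    """Return the list of terms from a Yahoo-style JSON response."""
    data = json.loads(response_body)
    return data["ResultSet"]["Result"]


text = ("Italian sculptors and painters of the renaissance "
        "favored the Virgin Mary for inspiration.")

# Switching services only means passing a different base URL.
url = build_request_url(OWN_COPY_BASE, text)
print(url)

# The response shown earlier, used here in place of a live request.
sample_response = (
    '{"ResultSet":{"Result":["italian sculptors","virgin mary",'
    '"painters","renaissance","inspiration"]}}'
)
terms = parse_terms(sample_response)
print(terms)
```

Because the response format matches, code like parse_terms should not need to change when the base URL does.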