Google Alerts RSS Feed classification
Google Alerts is a simple and useful way to leverage Google's web search infrastructure to monitor news stories or keep current on competitors and industries.
Python's package feedparser makes the retrieval of RSS feeds a breeze. To keep this exercise simple, I process only the feed titles. Because feed titles often come with some HTML tags, I use the package BeautifulSoup to extract the text, without any tags.
import feedparser
from bs4 import BeautifulSoup

# print the titles of the feed items
d = feedparser.parse('http://www.google.com/alerts/feeds/09028826661750699008/13616906508932128630')
for item in d['items']:
    html = item['title']
    soup = BeautifulSoup(html, 'html.parser')
    text_parts = soup.findAll(text=True)
    text = ''.join(text_parts)
    print(text)

Running this code results in a series of feed titles like the following:
Lebron's Nike's Prototyped Using 3D Printer
Your 3D Printer Could Eat Empty Milk Jugs Instead of Expensive Plastic
Giant NASA spider robots could 3D print lunar base
3D printed railroad engine model kits made from insanely hi-rez scans
Stratasys Aims To Lead 3D Printing After Strong Q4
Nanoscribe claims world's fastest commercially available nano-3D
Hundreds of news items about 3D printing appear weekly: let's classify them into the following three classes:
- Innovative fields of application
- Hot companies
- New materials
First of all, we need to save the feed titles into a Comma-Separated Values (CSV) file:
import csv

with open('google_alert_rss_feed.csv', 'w', newline='') as csvfile:
    feed_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for item in d['items']:
        html = item['title']
        soup = BeautifulSoup(html, 'html.parser')
        text_parts = soup.findAll(text=True)
        title = ''.join(text_parts)
        print(title)
        feed_writer.writerow(['class label', title])

The lines in the saved file have this format:
"class label","Giant NASA spider robots could 3D print lunar base"
You now need to edit each row's "class label" and replace it with one of the following three labels:
"application""company""material"
for example:
"application","Giant NASA spider robots could 3D print lunar base"
The Ghib.li algorithm underlying the text classification learns to classify new titles from the manually classified examples that you provide in this file.
Ghib.li web services are managed via Mashape. To get access to the Ghib.li Smart Text Classification API, create a (free) account at Mashape and contact info@ghib.li. There you'll also find the documentation describing all the Smart Text Classification API's endpoints, as well as the terms of use.
Mashape offers free client libraries in several popular languages (see the Mashape documentation, once you are authorized). Alternatively, you can use plain REST, which is what I'm going to do in this example.
To train a classifier you just need to POST the training file you just created to the following endpoint: https://stc.p.mashape.com/<model_id>/train, where '<model_id>' has to be replaced by the name you want to give to the classification model.
import requests

SERVER = 'https://stc.p.mashape.com'
TEST_MODEL_NAME = 'google_alerts_model_1'
TEST_TRAINING_FILE = 'google_alert_rss_feed.csv'
FORMAT = {'SERVER': SERVER, 'TEST_MODEL_NAME': TEST_MODEL_NAME}
HEADERS = {'X-Mashape-Authorization': '<your own Mashape authorization key>'}

response = requests.post('%(SERVER)s/%(TEST_MODEL_NAME)s/train' % FORMAT,
                         files={'file': open(TEST_TRAINING_FILE, 'rb')},
                         headers=HEADERS)

The response.text should look like
{
"kind": "prediction#training",
"id": "google_alerts_model_1"
}

It usually takes a few minutes until the classification model is ready: you can check its status by GETting the following endpoint: https://stc.p.mashape.com/<model_id>/status. The response JSON contains several pieces of information about the model, as well as a trainingStatus, which is RUNNING if the classification model's training has not finished yet and DONE if it's ready to be used.
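The polling just described can be sketched as follows. The endpoint and the RUNNING/DONE values come from the description above; `status_url`, `is_done` and `wait_until_trained` are helper names of my own:

```python
import time

import requests

SERVER = 'https://stc.p.mashape.com'
HEADERS = {'X-Mashape-Authorization': '<your own Mashape authorization key>'}

def status_url(model_id):
    # the status endpoint for a given classification model
    return '%s/%s/status' % (SERVER, model_id)

def is_done(status_json):
    # trainingStatus is RUNNING while training and DONE when the model is ready
    return status_json.get('trainingStatus') == 'DONE'

def wait_until_trained(model_id, poll_seconds=30):
    # GET the status endpoint periodically until training has finished
    while True:
        response = requests.get(status_url(model_id), headers=HEADERS)
        if is_done(response.json()):
            return
        time.sleep(poll_seconds)
```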
By GETting https://stc.p.mashape.com/<model_id>/analysis you can have a summary of the statistical properties of the classification model. You'll find more information in the documentation of the Ghib.li Smart Text Classification hosted at Mashape.
Finally, you can have your freshly trained classifier classify a new title. The text to be classified needs to be turned into JSON format and POSTed to https://stc.p.mashape.com/<model_id>/classify in this way:
import json

import requests

headers = dict(HEADERS)  # copy, so the shared headers dict is not mutated
headers['content-type'] = 'application/json'
payload = json.dumps({'text': 'Stratasys Q4 Tops Street Expectations; Shares Rally'})
response = requests.post('%(SERVER)s/%(TEST_MODEL_NAME)s/classify' % FORMAT,
                         data=payload, headers=headers)

The result of the classification is described in the API documentation hosted at Mashape.
Would you like to give it a try? Contact info@ghib.li