Google Alerts RSS Feed classification
Google Alerts is a simple and useful way to leverage Google's web search infrastructure to monitor news stories or keep current on competitors and industries.
Python's package feedparser makes the retrieval of RSS feeds a breeze. To keep this exercise simple, I process only the feed titles. Because feed titles often come with some HTML tags, I use the package BeautifulSoup to extract the text, without any tags.
import feedparser
from bs4 import BeautifulSoup

# print the titles of the feed items
d = feedparser.parse('http://www.google.com/alerts/feeds/09028826661750699008/13616906508932128630')
for item in d['items']:
    html = item['title']
    soup = BeautifulSoup(html, 'html.parser')
    text_parts = soup.findAll(text=True)
    text = ''.join(text_parts)
    print(text)

Running this code results in a series of feed titles like the following:
Lebron's Nike's Prototyped Using 3D Printer
Your 3D Printer Could Eat Empty Milk Jugs Instead of Expensive Plastic
Giant NASA spider robots could 3D print lunar base
3D printed railroad engine model kits made from insanely hi-rez scans
Stratasys Aims To Lead 3D Printing After Strong Q4
Nanoscribe claims world's fastest commercially available nano-3D
Hundreds of news items about 3D printing appear weekly: let's classify them into the following three classes:
- Innovative fields of application
- Hot companies
- New materials
First of all, we need to save the feed titles into a Comma-Separated Values (CSV) file:
import csv

with open('google_alert_rss_feed.csv', 'w', newline='') as csvfile:
    feed_writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for item in d['items']:
        html = item['title']
        soup = BeautifulSoup(html, 'html.parser')
        text_parts = soup.findAll(text=True)
        title = ''.join(text_parts)
        print(title)
        feed_writer.writerow(['class label', title])

The lines in the saved file have this format:
"class label","Giant NASA spider robots could 3D print lunar base"
You now need to edit each row's "class label" and replace it with one of the following three labels:
"application""company""material"
for example:
"application","Giant NASA spider robots could 3D print lunar base"
The Ghib.li algorithm underlying the text classification learns to classify new titles from the manually classified examples that you provide in this file.
Ghib.li web services are managed via Mashape. To get access to the Ghib.li Smart Text Classification API, create a (free) account at Mashape and contact info@ghib.li. There you'll also find the documentation describing all the Smart Text Classification API's endpoints, as well as the terms of use.
Mashape offers free client libraries in several popular languages (see the Mashape documentation, once you are authorized). Alternatively, you can use plain REST, which is what I'm going to do in this example.
To train a classifier you just need to POST the training file you just created to the following endpoint: https://stc.p.mashape.com/<model_id>/train, where '<model_id>' has to be replaced by the name you want to give to the classification model.
import requests

SERVER = 'https://stc.p.mashape.com'
TEST_MODEL_NAME = 'google_alerts_model_1'
TEST_TRAINING_FILE = 'google_alert_rss_feed.csv'
FORMAT = {'SERVER': SERVER, 'TEST_MODEL_NAME': TEST_MODEL_NAME}
HEADERS = {'X-Mashape-Authorization': '<your own Mashape authorization key>'}

response = requests.post('%(SERVER)s/%(TEST_MODEL_NAME)s/train' % FORMAT,
                         files={'file': open(TEST_TRAINING_FILE, 'rb')},
                         headers=HEADERS)

The response.text should look like
{
"kind": "prediction#training",
"id": "google_alerts_model_1"
}

It usually takes a few minutes until the classification model is ready: you can check its status by GETting the following endpoint: https://stc.p.mashape.com/<model_id>/status. The response JSON contains several pieces of information about the model, as well as a trainingStatus, which is RUNNING if the classification model's training has not finished yet and DONE if it's ready to be used.
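The polling just described can be sketched as follows. The endpoint and the RUNNING/DONE values come from the description above; `status_url`, `is_done` and `wait_until_trained` are helper names of my own:

```python
import time

import requests

SERVER = 'https://stc.p.mashape.com'
HEADERS = {'X-Mashape-Authorization': '<your own Mashape authorization key>'}

def status_url(model_id):
    # the status endpoint for a given classification model
    return '%s/%s/status' % (SERVER, model_id)

def is_done(status_json):
    # trainingStatus is RUNNING while training and DONE when the model is ready
    return status_json.get('trainingStatus') == 'DONE'

def wait_until_trained(model_id, poll_seconds=30):
    # GET the status endpoint periodically until training has finished
    while True:
        response = requests.get(status_url(model_id), headers=HEADERS)
        if is_done(response.json()):
            return
        time.sleep(poll_seconds)
```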
By GETting https://stc.p.mashape.com/<model_id>/analysis you can have a summary of the statistical properties of the classification model. You'll find more information in the documentation of the Ghib.li Smart Text Classification hosted at Mashape.
Finally, you can have your freshly trained classifier classify a new title. The text to be classified needs to be turned into JSON format and POSTed to https://stc.p.mashape.com/<model_id>/classify in this way:
import json

import requests

headers = dict(HEADERS)  # copy, so the shared headers dict is not mutated
headers['content-type'] = 'application/json'
payload = json.dumps({'text': 'Stratasys Q4 Tops Street Expectations; Shares Rally'})
response = requests.post('%(SERVER)s/%(TEST_MODEL_NAME)s/classify' % FORMAT,
                         data=payload, headers=headers)

The result of the classification is described in the API documentation hosted at Mashape.
Would you like to give it a try? Contact info@ghib.li