Errata for Programming Collective Intelligence
I'm making my way through Toby Segaran's excellent new book "Programming Collective Intelligence," and I'm posting here some of the errata I've found in the code so far that haven't yet been reported or published on the O'Reilly site. I'll report them, but I also want to explain them here. (I've had trouble getting the Python code to indent using the code markup plugin. Please let me know if you have suggestions.)
Chapter 3, Discovering Groups
generatefeedvector.py
The main body of this file bombs on
title,wc=getwordcounts(feedurl)
because the URL http://www.techeblog.com/index.php/feed/ toward the bottom of
http://kiwitobes.com/clusters/feedlist.txt
no longer returns an RSS feed. We could remove that URL from feedlist.txt, find the working RSS URL for techeblog, or make our code robust enough to handle this problem in general. For the last option, wrap the getwordcounts call in a try/except block and skip any feed that raises an AttributeError:
try:
    title,wc=getwordcounts(feedurl)
except AttributeError:
    continue
The variable feedlist in the line
frac=float(bc)/feedlist
is referenced but not initialized or computed before that.
The fix is to initialize feedlist to zero and increment it each time a feedurl is processed successfully:
feedlist = 0
for feedurl in file('feedlist.txt'):
    try:
        title,wc=getwordcounts(feedurl)
    except AttributeError:
        continue
    feedlist += 1
    wordcounts[title]=wc
    for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
            apcount[word]+=1
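The two fixes above can be pulled together into a small self-contained sketch. This is only an illustration, not the book's listing: the function names count_blog_words and select_words, the stand-in fake_getwordcounts, and the low/high threshold parameters are all hypothetical, and the counter is called feedcount here rather than feedlist to make clear it is a number. It is written so that it also runs on Python 3, where file() no longer exists.

```python
def count_blog_words(feed_urls, getwordcounts):
    """Collect word counts per blog, skipping feeds that fail to parse."""
    wordcounts = {}  # blog title -> {word: count}
    apcount = {}     # word -> number of blogs using it more than once
    feedcount = 0    # feeds that were actually processed
    for feedurl in feed_urls:
        try:
            title, wc = getwordcounts(feedurl)
        except AttributeError:
            continue  # not a usable feed, move on to the next URL
        feedcount += 1
        wordcounts[title] = wc
        for word, count in wc.items():
            apcount.setdefault(word, 0)
            if count > 1:
                apcount[word] += 1
    return wordcounts, apcount, feedcount


def select_words(apcount, feedcount, low, high):
    """Keep words whose share of blogs lies strictly between low and high."""
    wordlist = []
    for w, bc in apcount.items():
        frac = float(bc) / feedcount
        if low < frac < high:
            wordlist.append(w)
    return wordlist


# Hypothetical stand-in for getwordcounts so the sketch runs offline.
def fake_getwordcounts(url):
    feeds = {
        'good1': ('Blog A', {'python': 3, 'data': 1}),
        'good2': ('Blog B', {'python': 2, 'cluster': 5}),
        'good3': ('Blog C', {'cluster': 2}),
    }
    if url not in feeds:
        raise AttributeError('feed has no title')
    return feeds[url]


urls = ['good1', 'dead-feed', 'good2', 'good3']
wordcounts, apcount, feedcount = count_blog_words(urls, fake_getwordcounts)
words = sorted(select_words(apcount, feedcount, 0.1, 0.7))
```

Counting only the feeds that parsed keeps the denominator honest: a dead URL in feedlist.txt no longer drags every word's fraction down.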
Lastly for Chapter 3, the string handling chokes on a character from one of the feeds that can't be converted from unicode to ascii. I googled for a solution and came up with this simple fix. Change
out = open('blogdata.txt','w')
out.write('Blog')
to
out = codecs.open('blogdata.txt','wb','utf-8')
out.write('Blog')
You must also add this line at the top of the file:
import codecs
I'm not up to speed on unicode so don't ask me how it works; it works.
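For anyone curious about the mechanism, here is a rough sketch of what is most likely going on. A plain file expects bytes, so in Python 2 writing a unicode string to it triggers an implicit conversion to ascii, and that conversion fails on any character outside the ascii range. codecs.open returns a wrapper that encodes everything written to it with the codec you name, here utf-8. The blog title and the temporary file path below are made up for illustration, and the snippet is written to run on Python 3 as well (where the built-in open accepts an encoding argument and does the same job).

```python
import codecs
import os
import tempfile

# Hypothetical blog title containing one non-ascii character (e with acute accent).
title = u'Caf\u00e9 Blog'

# This is the conversion that blows up: ascii has no code for that character.
try:
    title.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Write to a throwaway file through a utf-8 encoding wrapper instead.
path = os.path.join(tempfile.mkdtemp(), 'blogdata.txt')
out = codecs.open(path, 'w', 'utf-8')
out.write(u'Blog')
out.write(u'\n' + title)
out.close()

# Read the file back with the same codec to confirm nothing was lost.
f = codecs.open(path, 'r', 'utf-8')
lines = f.read().split(u'\n')
f.close()

# On disk the accented character takes two bytes in utf-8.
size_on_disk = os.path.getsize(path)
```

The only real design decision is to pick one encoding and use it for both writing and reading; utf-8 is a safe default because it can represent any character a feed is likely to contain.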
That's it for Chapter 3. More later as I make my way through the book. Btw, I just checked Toby's blog and found that you can download the source code.