Python

The Temboz RSS aggregator

2013-03-14: Google’s announcement that their Reader service will be discontinued has spurred interest in Temboz. This software is not dead; in fact I use it daily, but I have not made an official release in a long time. You should use the version from GitHub instead. There are currently a number of bugs which can lead to Temboz locking up and requiring a restart. I am planning on completing my long overdue overhaul before Google’s July deadline.


Introduction

Temboz is an RSS aggregator. It is inspired by FeedOnFeeds (web-based personal aggregator), Google News (two column layout) and TiVo (thumbs up and down). I have been using FeedOnFeeds for some time now, but that software seems to have stopped evolving, and I had a number of optimizations to the user experience I wanted to make.

Features

Already implemented:

  • Multithreaded: downloads feeds in parallel.
  • Built-in web server.
  • Two-column user interface for better readability and information density. Automatic reflow using CSS.
  • Ratings system for articles
  • Real-time hunter-gatherer user interface: items flagged with a “Thumbs down” disappear immediately off the screen (using Dynamic HTML), making room for new articles. No laborious flagging of items as in FeedOnFeeds.
  • Filtering entries (using Python syntax, e.g. ‘Salon’ in feed_title and title == “King Kaufman’s Sports Daily”, or simply by selecting keywords/phrases and hitting “Thumbs down”).
  • Ability to generate an RSS feed from “Thumbs Up” articles, which is why Temboz is a true aggregator, not just a reader.
  • Ad filtering
  • Automatic garbage collection: every day between 3AM and 4AM, uninteresting articles (by default those older than 7 days) are purged of their contents (but not metadata such as titles, permalinks or timestamps) to keep the database size manageable. After 6 months (by default), they are deleted altogether
  • Automatic database backups daily (immediately after garbage collection)
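
As a rough illustration of the garbage-collection policy in the list above, here is a minimal sketch using Python's sqlite3 module. The table and column names (items, content, rating, created) are hypothetical and are not Temboz's actual schema, and "uninteresting" is assumed here to mean any article that was not given a "Thumbs up".

```python
import sqlite3
from datetime import datetime, timedelta

PURGE_AFTER = timedelta(days=7)     # strip contents of uninteresting items
DELETE_AFTER = timedelta(days=182)  # roughly 6 months: delete them entirely


def garbage_collect(db, now):
    """Purge, then delete, old items that were not given a thumbs up.

    Hypothetical schema: items(title, link, content, rating, created),
    where rating > 0 is assumed to mean "Thumbs up" and created is an
    ISO 8601 timestamp string.
    """
    purge_before = (now - PURGE_AFTER).isoformat()
    delete_before = (now - DELETE_AFTER).isoformat()
    # keep metadata (title, link, timestamp), drop only the article body
    db.execute("UPDATE items SET content = NULL "
               "WHERE rating <= 0 AND created < ?", (purge_before,))
    db.execute("DELETE FROM items WHERE rating <= 0 AND created < ?",
               (delete_before,))
    db.commit()


def demo():
    """Run the policy on four made-up articles in an in-memory database."""
    db = sqlite3.connect(':memory:')
    db.execute("CREATE TABLE items (title TEXT, link TEXT, content TEXT, "
               "rating INTEGER, created TEXT)")
    now = datetime(2013, 3, 14, 3, 30)
    rows = [
        ('fresh', 'http://example.com/1', 'body', 0, now - timedelta(days=1)),
        ('stale', 'http://example.com/2', 'body', 0, now - timedelta(days=30)),
        ('ancient', 'http://example.com/3', 'body', 0, now - timedelta(days=400)),
        ('keeper', 'http://example.com/4', 'body', 1, now - timedelta(days=400)),
    ]
    db.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                   [(t, l, c, r, d.isoformat()) for t, l, c, r, d in rows])
    garbage_collect(db, now)
    return db.execute(
        "SELECT title, content FROM items ORDER BY title").fetchall()
```

In this sketch the month-old article keeps its title and link but loses its body, the article older than six months disappears, and the "Thumbs up" article is left alone regardless of age.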

On the to do list:

  • Write better documentation
  • Handle permanent HTTP redirects for feed XML URLs
  • Automatic pacing of feed polling intervals using the average and standard deviation of observed feed item inter-arrival times, to reduce bandwidth usage and load for both client and server. Most feeds should be polled on a daily rather than hourly interval (e.g. my own, since I update once a week on average), but the mechanisms for a feed to indicate its polling rate preferences are quite inconsistent from one flavor of RSS/Atom to another.
  • “Survivor mode” – vote feeds that no longer perform off the aggregator based on relevance statistics.
  • Ability to cluster together articles (I tried a heuristic of looking for common URLs they are all pointing to, but this didn’t work well in practice).
  • Portability to Windows, distribution as a standalone package.
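
The pacing idea on the to-do list could look something like the sketch below. The specific rule (mean gap minus one standard deviation, clamped between hourly and daily) is an assumption made for illustration; the function name and its parameters are hypothetical as well.

```python
import statistics

HOUR = 3600
DAY = 24 * HOUR


def polling_interval(arrival_times, floor=HOUR, ceiling=DAY):
    """Suggest a polling interval in seconds from item arrival timestamps.

    Assumed heuristic: poll a bit more often than the average gap between
    items (mean minus one standard deviation), clamped to [floor, ceiling].
    """
    times = sorted(arrival_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        # not enough history to estimate anything: fall back to the floor
        return floor
    mean = statistics.mean(gaps)
    stdev = statistics.stdev(gaps)
    return max(floor, min(ceiling, mean - stdev))
```

With these defaults, a feed updated once a week ends up polled daily, while a feed that posts every ten minutes is still polled no more than hourly.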

History

I have been using it successfully for well over a year. It still has rough edges, with some administration functions only doable using the SQLite command-line utility. Here is a screen shot showing the reader user interface. The article highlighted in yellow was given a “Thumbs Up”. You can also see the user interface at work in a view of the last 50 articles I flagged as “thumbs up” among the feeds I read.

Screen shots

Click on a screen shot thumbnail for a full-sized version

The first screen shot shows the article reading interface, using a two-column layout. Clicking on the “Thumbs down” icon makes the article disappear, bringing a new one in its place (if available). Clicking on the “Thumbs up” icon highlights it in yellow and flags it as interesting in the database.

view items

The feed summary page shows statistics on feeds, starting with feeds with unread articles, then by alphabetical order. Feeds can be sorted based on other metrics. You have the option of “catching up” with a feed (marking all the articles as read). Feeds with errors are highlighted in red (not shown).

view feeds

Clicking on the “details” link for a feed brings up this page, which allows you to change the title or feed URL, and shows the RSS or Atom fields accessible for filtering.

feed details

Feeds can be filtered using Python expressions.

filtering rules
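
A filtering rule of this kind is just a Python expression evaluated with the article's fields available as variables. The sketch below shows one way this could work; make_filter is a hypothetical helper, not Temboz's actual code, and the field names feed_title and title are taken from the example rule in the feature list.

```python
def make_filter(expression):
    """Compile a filter rule written as a Python expression."""
    code = compile(expression, '<filter rule>', 'eval')

    def matches(item):
        # item is a dict of feed/article fields, exposed as variable names;
        # builtins are removed so that rules only see those fields
        return bool(eval(code, {'__builtins__': {}}, dict(item)))
    return matches


rule = make_filter("'Salon' in feed_title and "
                   "title == \"King Kaufman's Sports Daily\"")
```

Note that removing the builtins keeps rules focused on article fields but is not a security sandbox; this approach only makes sense when the person writing the rules is also the person running the aggregator.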

Known bugs

You can check outstanding bug reports, change requests and more at the public CVStrac site.

Credits

Temboz is written in Python, and leverages Mark Pilgrim’s Ultra-liberal feed parser, SQLite 2.x, and Cheetah.

Download

You can download the current version: temboz-0.8.tar.gz. I welcome any feedback you may have, especially concerning improvements to installation.

The CVS version is far ahead of 0.8 in features. I have not yet had the time to test and document the migration procedure from 0.8 to 1.0, but if you are a new Temboz user I strongly advise you to get a nightly CVS snapshot instead (they are what I run on my own server): temboz-CVS.tar.gz or temboz-CVS.zip.

Updates

For news on Temboz, please subscribe to the RSS feed.

Temboz has a CVStrac where you can submit bug reports or change requests, and a Wiki, where all future documentation will ultimately reside.

Post scriptum

The name “Temboz” is a reference to Malima Temboz, “The mountain that walks”, an elephant whose tormented spirit is the object of Mike Resnick’s excellent SF novel, Ivory.

Data mining Outlook for fun and profit

For a few years now, I have owned the domain name majid.fm. Dot-fm stands for the Federated States of Micronesia, a micro-state in the Pacific Ocean, and they market their domain names to FM radio stations. Those are also my initials. Unfortunately, the registration fees are quite expensive ($200 every two years), and the domain is redundant now that I have acquired majid.info and majid.org (majid.com is reserved by a Malaysian cybersquatter who is demanding a couple thousand dollars for it – I may be vain, but not that vain). I have decided to let the domain lapse when it expires on April 1st.

I used the majid-dot-FM domain for my emails, and set it up so emails sent to anything @majid.fm would be sent to my primary mailbox fazal@majid.fm. For instance, if I registered with Dell, I would give them the email address dell@majid.fm. This was helpful in tracing where I got my email from, and blacklisting companies that started spamming me (they shall remain nameless to protect the guilty yet litigious).

Unfortunately, spammers and some worms attempt dictionary attacks by trying all possible combinations like jim@majid.fm, smith@majid.fm, and so on. My spam filter would catch some, but not all of them, and it would be a terrible hassle. I do not want to have an auto-responder send emails back to people who email me at the old address, as this would at best flood innocent people whose addresses spammers are impersonating, and at worst actually give my new address to the spammers.

My solution to this dilemma is to produce a Python script that scans through all the emails in my Outlook personal folder (PST) files of archived emails and flags everyone who sent me an email, so that I can then manually send them a change of address notification (or in the case of websites and online stores, update my contact info online).

Simply using Outlook’s advanced search function will not work, as in many cases the To: header is set to something other than the address the email is delivered to, such as undisclosed-recipients, or the sender’s address when they send the email to multiple Bcc: recipients (the proper way to proceed when you want to send an email to multiple recipients without giving everyone in the list the email addresses of the other recipients). I actually have to sift through the raw message headers to see the envelope destination address.

Here is a simplified version of olmine.py, the script I used. It requires Python 2.x with the win32all extensions, and Outlook 2000 with the Collaboration Data Objects (CDO) option installed (this is not the default). CDO is required to access the full headers. Of course, this script can be useful for all sorts of social network analysis fun on your own Outlook files, or more prosaically to generate a whitelist of email addresses for your spam filter.

import re, win32com.client

srcs = {}
dsts = {}
pairs = {}

# regular expression that scans for valid email addresses in the headers
m_re = re.compile(r'[-A-Za-z0-9.,_]*@majid\.fm')
# regular expression that strips out headers that can cause false positives
strip_re = re.compile(r'(Message-Id:.*$|In-Reply-To:.*$|References:.*$)',
                      re.IGNORECASE | re.MULTILINE)

def dump_folder(folder):
  """Iterate recursively over the given folder and its subfolders"""
  print '-' * 72
  print folder.Name
  print '-' * 72
  for i in range(1, folder.Messages.Count + 1):
    try:
      # PR_SENDER_EMAIL_ADDRESS
      _from = folder.Messages[i].Fields[0x0C1F001F].Value
      # PR_TRANSPORT_MESSAGE_HEADERS
      headers = folder.Messages[i].Fields[0x7d001e].Value
    except:
      # ignore non-email objects like contacts or calendar entries
      continue
    stripped_headers = strip_re.sub('', headers)
    for _to in m_re.findall(stripped_headers):
      srcs[_from] = srcs.get(_from, 0) + 1
      dsts[_to] = dsts.get(_to, 0) + 1
      if (_from, _to) not in pairs:
        print _from, '->', _to
      pairs[_from, _to] = pairs.get((_from, _to), 0) + 1
  # recurse
  for i in range(1, folder.Folders.Count + 1):
    dump_folder(folder.Folders[i])

# connect to Outlook via CDO
cdo = win32com.client.Dispatch('MAPI.Session')
cdo.Logon()
# iterate over all the open PST files
for i in range(1, cdo.InfoStores.Count + 1):
  store = cdo.InfoStores[i]
  root = store.RootFolder
  m = root.Messages
  store.ID
  print '#' * 72
  print store.Name
  print '#' * 72
  dump_folder(root)
cdo.Logoff()
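
To see what the two regular expressions do without needing Outlook or CDO, they can be run against a raw header block directly. The sample headers below are made up for illustration, and envelope_addresses is a hypothetical helper name; the patterns themselves are copied from the script.

```python
import re

# same two patterns as in the script
m_re = re.compile(r'[-A-Za-z0-9.,_]*@majid\.fm')
strip_re = re.compile(r'(Message-Id:.*$|In-Reply-To:.*$|References:.*$)',
                      re.IGNORECASE | re.MULTILINE)

# made-up headers for a message whose To: line hides the real recipient
SAMPLE_HEADERS = """\
Received: from mail.example.com by mx.example.net
    for <dell@majid.fm>; Mon, 3 Mar 2003 10:00:00 -0800
From: Newsletter <news@example.com>
To: undisclosed-recipients:;
Message-Id: <20030303.1234@majid.fm>
Subject: Special offers
"""


def envelope_addresses(headers):
    """Return the majid.fm addresses found outside the stripped headers."""
    return m_re.findall(strip_re.sub('', headers))
```

Without the stripping step, the Message-Id value would be reported as if it were a recipient address, which is the kind of false positive strip_re is there to prevent.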

Debugging DCOracle2 applications

DCOracle2 is the Oracle interface module for Python I use most often. It is advertised as “beta”, but quite suitable for production use, aside from a few minor rough edges. There are a few others, most notably cx_Oracle, but I can’t vouch for them.

Debugging applications that make use of DCOracle2 can be challenging, as with any database environment, especially in a multi-threaded server context. I have developed a small utility module to aid in development. When it is imported, it will automatically trace all database calls made through DCOracle2, including arguments such as bind variables. More interestingly, it will also automatically run EXPLAIN PLAN on queries taking longer than 2 seconds (by default), to aid in tuning SQL statements. As a side bonus, if run by itself, it provides a (very basic) SQL shell that does offer command-line history and editing, something Oracle hasn’t managed to provide in SQL*Plus in almost 30 years 🙂

This code works with Python 2.2 and DCOracle2 1.1 and 1.3 beta. It will not work with Python 2.1 and earlier.

The latest version of the module file can be downloaded here: debug_ora.py, as well as the RCS repository debug_ora.py,v for those who care about this kind of stuff.

An example run of the module:

% python debug_ora.py scott/tiger@repos
SQL> select ename, job, dname from emp, dept where emp.deptno=dept.deptno;
SQL: Oct-03-2003 17:32:39:897
select ename, job, dname from emp, dept where emp.deptno=dept.deptno
ARG: () {}
SQL: !!!!!!!!!!!!!!!! slow query, time = 0.0 sec
SQL: !!!!!!!!!!!!!!!! execution plan follows
000      SELECT STATEMENT Optimizer=CHOOSE
001        NESTED LOOPS
002 001      TABLE ACCESS (FULL) ON EMP
003 001      TABLE ACCESS (BY INDEX ROWID) ON DEPT
004 003        INDEX (UNIQUE SCAN) ON PK_DEPT

ENAME  JOB       DNAME
------ --------- ----------
SMITH  CLERK     RESEARCH
ALLEN  SALESMAN  SALES
WARD   SALESMAN  SALES
JONES  MANAGER   RESEARCH
MARTIN SALESMAN  SALES
BLAKE  MANAGER   SALES
CLARK  MANAGER   ACCOUNTING
SCOTT  ANALYST   RESEARCH
KING   PRESIDENT ACCOUNTING
TURNER SALESMAN  SALES
ADAMS  CLERK     RESEARCH
JAMES  CLERK     SALES
FORD   ANALYST   RESEARCH
MILLER CLERK     ACCOUNTING
SQL>
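
The general technique (log each statement with its bind variables, time it, and fetch the execution plan when it is slow) can be sketched against any DB-API module. The example below is not debug_ora.py: it uses the standard library's sqlite3 as a stand-in for Oracle, with SQLite's EXPLAIN QUERY PLAN in place of EXPLAIN PLAN, and TracingCursor is a hypothetical class name.

```python
import sqlite3
import time


class TracingCursor(object):
    """Wrap a DB-API cursor, logging each statement and its bind variables."""

    def __init__(self, cursor, threshold=2.0, log=None):
        self.cursor = cursor
        self.threshold = threshold          # seconds before a query is "slow"
        self.log = log if log is not None else []

    def execute(self, sql, args=()):
        self.log.append('SQL: %s' % sql)
        self.log.append('ARG: %r' % (args,))
        start = time.perf_counter()
        self.cursor.execute(sql, args)
        rows = self.cursor.fetchall()
        elapsed = time.perf_counter() - start
        if elapsed >= self.threshold:
            self.log.append('SQL: slow query, time = %.1f sec' % elapsed)
            # SQLite's counterpart to Oracle's EXPLAIN PLAN
            self.cursor.execute('EXPLAIN QUERY PLAN ' + sql, args)
            for step in self.cursor.fetchall():
                self.log.append('PLAN: %r' % (step,))
        return rows


def demo():
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE emp (ename TEXT, job TEXT)')
    db.execute("INSERT INTO emp VALUES ('SMITH', 'CLERK')")
    # a threshold of 0 forces the "slow query" path for demonstration
    cur = TracingCursor(db.cursor(), threshold=0.0)
    rows = cur.execute('SELECT ename FROM emp WHERE job = ?', ('CLERK',))
    return rows, cur.log
```

Setting the threshold to zero, as in demo(), makes every query count as slow, which is a convenient way to check that the plan output is being captured.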

Obtaining tracebacks on other threads than the current thread

Note: this entry was superseded and is maintained only for historical purposes. Among others, the restriction of not being able to find the stack frame for a specific thread has been lifted with changes in Python 2.3.

David Beazley added advanced debugging functions to the Python interpreter, and they have been folded into the 2.2 release.

I used these hooks to build a debugging module that is useful when you are looking for deadlocks in a multithreaded application. It basically has a single function that will return a list of the stack frames for all Python interpreter threads in the process.

Unfortunately, I was unable to find a way to get a stack frame for a specific thread (either by the thread ID or using threading.Thread objects), as Python does not save the thread ID in its thread state.

Of course, I disclaim any liability if this code should crash your system, erase your homework, eat your dog (who also ate your homework) or otherwise have any undesirable effect.

Building and installing

Download threadframe-0.1.tar.gz. You can use the Makefile. I’ve built and tested this only on Solaris 8/x86 and Windows 2000, but the code should be pretty portable. There is a small test program test.py that illustrates how to use this module to dump stack frames of all the Python interpreter threads. A sample run is available for your perusal.

For Windows users, a pre-compiled binary for the standard Python 2.2.1 distribution is available: threadframe.pyd. Just copy this file in any location in your Python path and you should be able to run the test script test.py.
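
On a current CPython no extension module is needed for this kind of dump: the interpreter exposes sys._current_frames(), which maps thread IDs to their topmost stack frames. The sketch below is a modern stand-in for the approach described above rather than the threadframe API itself; dump_all_threads and demo are hypothetical names.

```python
import sys
import threading
import traceback


def dump_all_threads():
    """Return {thread name: formatted stack} for every Python thread."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    stacks = {}
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, 'thread-%d' % thread_id)
        stacks[name] = ''.join(traceback.format_stack(frame))
    return stacks


def demo():
    started = threading.Event()
    release = threading.Event()

    def worker():
        started.set()
        release.wait()          # park here so the dump has something to show

    t = threading.Thread(target=worker, name='worker')
    t.start()
    started.wait()
    try:
        return dump_all_threads()
    finally:
        release.set()
        t.join()
```

The leading underscore in sys._current_frames() signals that it is an implementation detail of CPython, so this should be treated as a debugging aid rather than something to build application logic on.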

Objects are Aristotelian

One of the unquestioned assumptions behind object-oriented programming is that objects are instances of a class, and thus implicitly stay that way. This is akin to the philosophical concept of nature, as in an invariant quality of something, that cannot be changed:

But is there any one thus intended by nature to be a slave, and for whom such a condition is expedient and right, or rather is not all slavery a violation of nature?

There is no difficulty in answering this question, on grounds both of reason and of fact. For that some should rule and others be ruled is a thing not only necessary, but expedient; from the hour of their birth, some are marked out for subjection, others for rule.

Again, the male is by nature superior, and the female inferior; and the one rules, and the other is ruled; this principle, of necessity, extends to all mankind.

It is clear, then, that some men are by nature free, and others slaves, and that for these latter slavery is both expedient and right.

Aristotle, Politics I, 5 (emphasis mine)

Needless to say, this concept is reactionary. One may well object that given slavery’s omnipresence in antiquity, even a great philosopher such as Aristotle could not be entirely free of the prejudices of his time. This conveniently ignores the fact that Aristotle was a pupil of Plato, himself a disgruntled aristocrat who collaborated with Spartans when they overthrew Athenian democracy after the Peloponnesian war, and is arguably one of the theoretical founders of the totalitarian state. I would say it is rather the presumed greatness of Aristotle that should be reexamined, but I digress. For more on this subject, read Karl Popper’s The Open Society and its Enemies – Volume 1, The Spell of Plato.

Thus, OOP carries within it the conservatism of Plato and Aristotle, people who resented how the young Athenian democracy had usurped the aristocracy’s natural (in their eyes) right to rule over others. This is not just an academic consideration. Computer programmers influence society, especially those who work for governmental information systems, and if you consider the Sapir-Whorf hypothesis, the language they use affects the way they think.

This is why I like Python’s ability to morph an object from one class to another:

Python 2.2.1 (#1, Apr 18 2002, 13:06:27)
[GCC 2.95.3 20010315 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> class Slave:
...     def whip(self):
...             return 'Yes, master'
...
>>> class Freeman:
...     def whip(self):
...             return 'Die, fascist scum!'
...
>>> man = Slave()
>>> man.whip()
'Yes, master'
>>> man.__class__ = Freeman
>>> man.whip()
'Die, fascist scum!'