Jun 30, 2008

Yet Another Feed Importer

On Friday I posted another extension to TER. It is called "Yet Another Feed Importer".

Why "yet another"?

I needed an importer that is easy to customize and extend from the outside. In "yafi" (the extension key for the new importer) I provide a framework and use the plugin pattern to do imports. Take a look at the diagram:



The framework reads RSS and Atom streams from the Internet. Each stream is parsed into articles, which are passed one by one to the importers. Each importer converts the article to whatever it wishes (a tt_news item, a page, etc.). Articles arrive as PHP objects, so importers do not have to parse anything themselves.
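To make the plugin idea concrete, here is a minimal sketch of the flow described above. It is written in Python purely for illustration (yafi itself is PHP), and every name in it (`Article`, `Importer`, `NewsImporter`, `run_import`) is hypothetical rather than taken from the extension.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Article:
    # Hypothetical fields; the real yafi article object may look different.
    id: str
    title: str
    link: str
    body: str


class Importer(Protocol):
    """Anything with an import_article() method can be plugged in."""

    def import_article(self, article: Article) -> None: ...


class NewsImporter:
    """Stand-in for an importer that would create news records."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def import_article(self, article: Article) -> None:
        # The importer decides what an article becomes; here, a plain dict.
        self.records.append({"title": article.title, "ext_url": article.link})


def run_import(articles: list[Article], importers: list[Importer]) -> int:
    """Pass each already-parsed article to every registered importer."""
    handled = 0
    for article in articles:
        for importer in importers:
            importer.import_article(article)
            handled += 1
    return handled
```

The framework side (`run_import`) never needs to know what an importer produces, which is what makes adding a new importer possible without touching the feed parsing.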

The import process is highly configurable. The editor can enter the feed URL and feed title, and can specify how often to import the feed in human-like language ("+1 hour 15 minutes"). The backend record for the feed has two informational fields for the editor's convenience: they show when the feed was last imported and the time stamp of the last article.
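As an illustration of how such an interval string can drive scheduling, here is a small Python sketch. PHP's strtotime() accepts relative formats of this kind, but the post does not say how yafi parses them, so the helpers `parse_interval` and `is_due` and the set of supported units are assumptions made for this example.

```python
import re
from datetime import datetime, timedelta

# Units supported by this sketch only; a real parser may accept more.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}
_PART = re.compile(r"([+-]?\d+)\s*(minute|hour|day|week)s?", re.IGNORECASE)


def parse_interval(text: str) -> timedelta:
    """Turn a string such as '+1 hour 15 minutes' into a timedelta."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognized interval: {text!r}")
    total = timedelta()
    for amount, unit in parts:
        total += timedelta(**{_UNITS[unit.lower()]: int(amount)})
    return total


def is_due(last_import: datetime, interval: str, now: datetime) -> bool:
    """A feed is due once the interval has elapsed since its last import."""
    return now >= last_import + parse_interval(interval)
```

For example, `is_due(datetime(2008, 6, 30, 8, 0), "+1 hour 15 minutes", datetime(2008, 6, 30, 9, 15))` is true, while the same check one minute earlier is false.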

A single importer currently ships with the extension. It imports articles into tt_news and is highly configurable too. The editor can choose the tt_news article type (import the full article or just create an external URL), the tt_news category and storage folder, when the item expires, and so on. It is really flexible.

The import runs either through a command line script or through a silent Frontend plugin. The command line is better, especially when combined with a cron job. The Frontend plugin can be used when normal cron jobs are unavailable. The command line script has a convenient option to import only a certain number of feeds at a time. If you have 20-30 streams to import, it is very handy to run the import often but process only some of the feeds on each run. This puts less load on the external web sites and spreads the news more evenly over time.
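One way to picture the "only some feeds per run" option is to pick the feeds that have waited longest. The sketch below is an assumption about how such a limit could work, not a description of the yafi script; `Feed`, `pick_batch` and the `limit` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Feed:
    url: str
    last_import: int  # Unix timestamp of the last import; 0 means never


def pick_batch(feeds: list[Feed], limit: int) -> list[Feed]:
    """Return at most `limit` feeds, the least recently imported first."""
    return sorted(feeds, key=lambda feed: feed.last_import)[:limit]


def run_once(feeds: list[Feed], limit: int, now: int) -> list[str]:
    """Simulate one cron run: import a batch and stamp it with `now`."""
    batch = pick_batch(feeds, limit)
    for feed in batch:
        feed.last_import = now  # a real run would fetch and import here
    return [feed.url for feed in batch]
```

Calling `run_once` repeatedly with a small `limit` cycles through all feeds over several runs, because each imported feed moves to the back of the queue.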

There were several technical challenges in this development. The two most difficult were:
  • Avoiding double imports
  • Handling invalid dates
Avoiding double imports may seem easy to those who are familiar with the RSS/Atom formats. "Just use the article id!" they will say. That works, yes, but only as long as the article id stays the same between imports. I found that the article id sometimes changes, so I had to add some workarounds for it.
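The post does not describe the workarounds themselves. One common approach, shown here strictly as an assumption and not as what yafi does, is to remember a content fingerprint next to the feed-supplied id, so that an article whose id changed is still recognized. The names `article_keys` and `is_duplicate` are hypothetical.

```python
import hashlib


def article_keys(article_id: str, link: str, title: str) -> set[str]:
    """Return identity keys: the feed's own id plus a link/title fingerprint."""
    raw = f"{link.strip()}\n{title.strip().lower()}".encode("utf-8")
    keys = {"fp:" + hashlib.sha1(raw).hexdigest()}
    if article_id:
        keys.add("id:" + article_id)
    return keys


def is_duplicate(seen: set[str], keys: set[str]) -> bool:
    """An article is a duplicate if any of its keys was stored before."""
    return not seen.isdisjoint(keys)


def remember(seen: set[str], keys: set[str]) -> None:
    """Store all keys of an imported article for later runs."""
    seen.update(keys)
```

The trade-off is that an article whose title or link is edited after publication will look new to the fingerprint, so this reduces double imports rather than ruling them out.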

Another pretty nasty problem was that some feeds do not have valid dates. For example, a feed may specify a date like this:

Mon, 30 Jun 2008 08:00:00 GMT +0300

This date is invalid: it has two time zones in it, GMT and +0300. Fortunately, the feed validator helped me discover the problem immediately. Until I fixed this, all such articles were imported as if they came from January 1st, 1970.
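A date that fails to parse often ends up as timestamp 0, which is January 1st, 1970, so it pays to normalize the string first and to fall back to something sensible. The Python sketch below shows one possible repair. The post does not say which fix was used; keeping the numeric offset and dropping the zone name is an assumption, and `parse_feed_date` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Matches a zone name directly followed by a numeric offset at the end,
# e.g. "... GMT +0300". Assumption: the numeric offset is the one to keep.
_DOUBLE_ZONE = re.compile(r"\s+(?:GMT|UTC?)\s+([+-]\d{4})$")


def parse_feed_date(raw: str, fallback: datetime) -> datetime:
    """Parse an RFC 822 style date, tolerating a doubled time zone."""
    cleaned = _DOUBLE_ZONE.sub(r" \1", raw.strip())
    try:
        parsed = parsedate_to_datetime(cleaned)
    except (TypeError, ValueError):
        # Unparseable: use the caller's fallback instead of the 1970 epoch.
        return fallback
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

Passing the time of the import run as `fallback` keeps a broken article near the top of the news list instead of burying it in 1970.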

In general, I liked working on this extension. It was not easy, but I like technical challenges. They help me to grow. They help me to create. They help me to have fun. I am sure this extension will be useful not only for me but also for many other TYPO3 users.
