You've probably already heard of RSS, the XML-based format which allows
Web sites to publish and syndicate the latest content on their site to
all interested parties. RSS is a boon to the lazy Webmaster, because
(s)he no longer has to manually update his or her Web site with new
content.
Instead, all a Webmaster has to do is plug in an RSS client,
point it to the appropriate Web sites, and sit back and let the site
"update itself" with news, weather forecasts, stock market data, and
software alerts. You've already seen, in previous articles,
how you can use the ASP.NET platform to manually parse an RSS feed and
extract information from it by searching for the appropriate elements.
But I'm a UNIX guy, and I have something that's even better than
ASP.NET. It's called Perl.
Installing XML::RSS
Written entirely in Perl, XML::RSS isn't included with Perl by default, and you must install it from CPAN.
Detailed installation instructions are provided in the download
archive, but by far the simplest way to install it is to use the CPAN
shell, as follows:
shell> perl -MCPAN -e shell
If you use the CPAN shell, dependencies will be automatically
downloaded for you (unless you told the shell not to download dependent
modules). If you manually download and install the module, you may need
to download and install the XML::Parser module before XML::RSS can be
installed. The examples in this tutorial also need the LWP::Simple
package, so you should download and install that one too if you don't
already have it.
Basic usage
Listing A
Place the script in your Web server's cgi-bin/ directory. Remember to
make it executable, and then browse to it using your Web browser. After
a short wait for the RSS file to download, you should see something
like Figure A.
The various elements of the RSS feed are converted into Perl structures, and a foreach()
loop is used to iterate over the array of items. Each item contains
properties representing the item name, URL and description; these
properties are used to dynamically build a readable list of news items.
Each time Slashdot updates its RSS feed, the list of items displayed by
the script above will change automatically, with no manual intervention
required.
The script in Listing A will work with other RSS feeds as well: simply alter the URL passed to LWP::Simple's get() function, and watch as the list of items displayed by the script changes.
Tip: Notice that the RSS channel name (and description) can be obtained with the object's channel() method, which accepts any one of three arguments (title, description or link) and returns the corresponding channel value.
Adding multiple sources and optimising performance
Listing B
Figure B shows you what it looks like.
You'll notice, if you're sharp-eyed, that Listing B uses the parsefile()
method to read a local version of the RSS file, instead of using LWP to
retrieve it from the remote site. This revision results in improved
performance, because it does away with the need to generate an internal
request for the RSS data source every time the script is executed.
Fetching the RSS file on each script run not only causes things to go
slow (because of the time taken to fetch the RSS file), but it's also
inefficient; it's unlikely that the source RSS file will change on a
minute-by-minute basis, and by fetching the same data over and over
again, you're simply wasting bandwidth. A better solution is to
retrieve the RSS data source once, save it to a local file, and use
that local file to generate your page.
Depending on how often the source file gets updated, you can
write a simple shell script to download a fresh copy of the file on a
regular basis.
Here's an example of such a script:
#!/bin/bash
/bin/wget http://www.freshmeat.net/backend/fm.rdf -O freshmeat.rdf
This script uses the wget utility (included with most Linux distributions) to download and save the RSS file to disk. Add this to your system crontab, and set it to run on an hourly or daily basis.
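The same "fetch once, reuse the local copy" idea can also be expressed as a freshness check, so the download happens only when the cached file is older than some limit. The following is a minimal sketch in Python rather than shell, under stated assumptions: the function names `cache_is_stale` and `refresh_feed`, the one-hour default, and the temporary-file naming are all illustrative and do not come from the article.

```python
import os
import time
import urllib.request


def cache_is_stale(path, max_age_seconds, now=None):
    """Return True if the cached file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_age_seconds


def refresh_feed(url, path, max_age_seconds=3600):
    """Download url to path only when the cached copy is stale.

    Returns True if a download happened, False if the cache was reused.
    """
    if not cache_is_stale(path, max_age_seconds):
        return False
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # Write to a temporary name first so readers never see a half-written file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
    os.replace(tmp_path, path)
    return True


if __name__ == "__main__":
    # The feed URL is the one used by the wget script above.
    fetched = refresh_feed("http://www.freshmeat.net/backend/fm.rdf", "freshmeat.rdf")
    print("downloaded" if fetched else "cache still fresh")
```

A script like this could be run from cron in place of the wget call; the page-generating script then only ever reads the local file.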
If you find performance unacceptably low even after using local
copies of RSS files, you can take things a step further, by generating
a static HTML snapshot from the script above, and sending that to
clients instead. To do this, comment out the line printing the
"Content-Type" header in the script above and then run the script from
the console, redirecting the output to an HTML file. Here's how:
$ ./rss.cgi > static.html
Now, simply serve this HTML file to your users. Since the file
is a static file and not a script, no server-side processing takes
place before the server transmits it to the client. You can run the
command-line above from your crontab
to regenerate the HTML file on a regular basis. Performance with a
static file should be noticeably better than with a Perl script.
Looks easy? What are you waiting for—get out there and start hooking your site up to your favorite RSS news feeds.
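While recently searching for RSS parsing tools, I found MagPieRSS and Lilina, which is built on top of it. Lilina's main features:
1 Web-based RSS management: add, delete, OPML export, a back-end RSS cache (to avoid putting too much load on the source servers), and a ScriptLet: a bookmarklet-style JS script for instant subscription, similar to the one Del.icio.us offers.
2 Front-end publishing: I turned my own home page into a Lilina page that publishes the weblogs of a few friends I read regularly, which also saves a lot of work updating my own pages. It requires PHP 4.3 + mbstring iconv.
I remember that earlier this year Wen Xin introduced the concept of a personal portal at a CNBlog seminar. As RSS matures within CMS technology, more and more services let individual users build a portal to suit their own needs, which fits the decentralizing trend of the Internet. With the Add to My Yahoo! feature, for example, users can easily subscribe to news from many more data sources. Imagine aggregating and publishing your own del.icio.us bookmarks, flickr photo collection, and Yahoo! news through one such RSS aggregator. How fast would that spread?
Just as software development achieves "Write once, run anywhere" through an intermediate platform or virtual machine, the RSS/XML middle layer gives information publishing "Write once, publish anywhere".
Installing Lilina requires PHP 4.3 or later with support for the iconv and mbstring functions; please check for '--with-iconv' in your PHP build options. The server must also be able to send RPC requests to external servers, which ?1.NET does not support. PowWeb's hosting is quite good: many packages are installed by default, such as iconv and mbstring.
Unpack the installation package (the downloaded file has a .gz extension but is really a .tgz, so it needs renaming) and upload it to the appropriate directory on the server. Note: the cache directory and the current directory must be writable. Then set the parameters in conf.php and you can start using it.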
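He Dong's suggestions to me:
Some planned improvements:
2 Grouping: output the RSS feeds in groups.
Changing the default display: by default Lilina shows only articles published within a recent time window. To use a different period, find the following line and change it.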
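RSS is a lightweight protocol that can aggregate all of your own resources: wiki, blog, and mail. From now on, wherever you write, as long as there is an RSS interface, the content can be re-aggregated and republished in some way, which greatly improves the efficiency of personal knowledge management and of publishing and dissemination.
My earlier understanding of RSS was very shallow: isn't it just a DTD? Only when I really got to know the parsers did I realize how important namespaces are. A good protocol should be like this: not that there is nothing left to add, but that there is certainly nothing left to take away. Achieving that is really very hard.
I will also try the related Java parsers and extend this to the WebLucene project. More resources on Java open-source RSS parsers.
I also found packages for parsing RSS with Perl: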
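Using XML::RSS::Parser::Lite and XML::RSS::Parser
Sample code for XML::RSS::Parser::Lite follows:
#!/usr/bin/perl -w
use strict;
# print blog header
# convert item to <li>
Installation: requires SOAP-Lite.
Advantages: simple to use; supports remote fetching.
Disadvantages: supports only the title, url, and description fields, with no date field.
I plan to use it to design a simple RSS-fetching synchronization service, so that everyone can publish the RSS feeds they subscribe to.
use strict;
# output some values
Advantages: can output the data directly field by field and offers a lower-level interface.
Disadvantages: cannot parse a remote RSS feed directly; the feed must be downloaded first and then parsed.
2004-12-14: Installing Planet: after unpacking, run it directly in that directory: python planet.py examples/config.ini
The output for the default sample feeds then appears in the output directory as index.html, along with opml.xml, rss.xml, and other outputs (a nice touch).
I tested it with a few RSS feeds. The UTF-8 feeds were fine, but the GBK feeds all came out garbled. The only code in planetlib.py that deals with XML character sets is the following, and it appears that everything that is not UTF-8 is treated as iso8859_1:
I plan to study Python's unicode handling soon. It feels like a very concise language, with good try ... except handling and logging.
Doubts about MagPieRSS performance:
As you can see, Lilina's caching mechanism walks the RSS files in the cache directory on every request, and if a cache file has expired it requests the RSS source on the spot. It therefore cannot support many back-end RSS subscriptions or heavy concurrent front-end access, because either causes a lot of I/O.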
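Planet is a back-end script that periodically aggregates the subscribed RSS feeds and writes the result out as static files.
In fact, it is enough to put a wget script in front of MagPieRSS that periodically saves the output of index.php as index.html, and to have every visit go to the cached index.html first. Isn't that the same as Planet generating a static index.html cache every hour?
So on virtual hosts that do not let you configure your own server-side scripts, Planet simply cannot run.
For more on parsing GBK XML in PHP, see:
2004-12-19
Posted by chedong at December 11, 2004 12:34 AM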
TrackBack URL for this entry: Listed below are links to weblogs that reference Lilina: building a personal portal with an RSS aggregator (Write once, publish anywhere):
Analysis of parsing UTF-8 and GBK RSS in MagPieRSS (appendix: character-oriented programming in PHP explained) from 车东BLOG Tracked on December 19, 2004 12:37 AM
Reading blogs with lilina and blogline from Philharmania's Weblog Tracked on December 26, 2004 01:57 PM
Call for RSS feeds from the CNBlog author group from CNBlog: Blog on Blog Tracked on December 26, 2004 07:42 PM
Tracked on January 14, 2005 06:14 PM
MT template changes and interface skin settings from 车东BLOG Tracked on January 17, 2005 01:25 PM
How do I change the default of showing 7 days of news? Thanks.
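Posted by: honren at December 12, 2004 10:20 PM
I have been using lilina for a while.
http://news.yanfeng.org
Posted by: mulberry at December 13, 2004 09:24 AM
Comrade Che, haven't you noticed that your home page has become slow to load since you started using lilina? Give it up, or at least don't use it as the home page; lilina is still technically immature.
Posted by: kalen at December 16, 2004 10:33 AM
You could consider using drupal.
Posted by: shunz at December 28, 2004 06:46 PM
You could try the one I made: http://blog.terac.com. It fetches blogs on a schedule, takes the latest entries from each, sorts and aggregates them, generates static XML, and formats it for display with XSL.
Posted by: andy at January 6, 2005 12:53 PM
Comrade Che Dong, doing it this way is not good :P
The Rich Site Summary (RSS) format, previously known as the RDF Site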
Summary, has quietly become the dominant format for distributing news
headlines on the Web. In this Mother of Perl tutorial, we will write a short Perl script
(less than 100 lines) that retrieves an XML RSS file from the Web or
local file system and converts it to HTML. Using a Server Side Include
(SSI) or similar method, you can easily add news headlines from any
number of sources to your Web site.
Where did RSS come from, you ask? Netscape invented the RSS format for "channels" on Netscape Netcenter (http://my.netscape.com). It was released to the public in March of 1999. The first non-Netscape Web site to incorporate the new format was Scripting News, a popular technology news site run by Dave Winer, president of Userland Software
(think Frontier). Interestingly enough, Scripting News had been using
its own XML format, scriptingNews, since December of 1997.
In May of 1999, Dave Winer released a new version of the
scriptingNews XML format, which added new content-rich elements.
Netscape followed suit by adopting most of the new scriptingNews
elements into RSS 0.91, which was released in July of 1999.
Userland Software also rolled out their own flavor of my.netscape.com. If you haven't already guessed, it's available at http://my.userland.com.
As far as I know, RSS is the most widely used XML format on the
Web today. RSS headlines are available for many popular news sites like
Slashdot,
Forbes, and CNET News.com, and the list is growing daily.
In a time when "stickiness" is a good thing, displaying news headlines
on your Web site can really help give it the extra "oomph" that will
encourage users to return. After all, users can only read your
president's bio but so many times.
For rss2html.pl to work on your system, you should have a recent
version of Perl installed, 5.003 or better. 5.005 is recommended. You
will also need the XML::Parser and XML::RSS modules installed. To install the modules on a *nix system, type:
perl -MCPAN -e "install XML::Parser"
perl -MCPAN -e "install XML::RSS"
If you're using a win32 machine (Win95/98/NT), you should have a recent
installation of ActiveState Perl. If you don't have a recent version,
visit http://www.activestate.com. To install XML::Parser on a win32 machine, type:
ppm install XML-Parser
To install XML::RSS on a win32 machine (you must have a C compiler and nmake):
Next, we'll examine the RSS format in more detail.
December 22, 2004
URL: http://www.builderau.com.au/architect/webservices/0,39024590,39171461,00.htm
Take advantage of the XML::RSS CPAN package, which is specifically designed to read and parse RSS feeds.
RSS parsing in Perl is usually handled by the XML::RSS CPAN
package. Unlike ASP.NET, which comes with a generic XML parser and
expects you to manually write RSS-parsing code, the XML::RSS package is
specifically designed to read and parse RSS feeds. When you give
XML::RSS an RSS feed, it converts the various <item>s in the feed
into array elements, and exposes numerous methods and properties to
access the data in the feed. XML::RSS currently supports versions 0.9,
0.91, and 1.0 of RSS.
cpan> install XML::RSS
For our example, we'll assume that you're interested in displaying
the latest geek news from Slashdot on your site. Slashdot's RSS feed is
available at http://www.slashdot.org/index.rss. The script in Listing A retrieves this feed, parses it, and turns it into a human-readable HTML page using XML::RSS:
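#!/usr/bin/perl
use strict;
use warnings;
# import packages
use XML::RSS;
use LWP::Simple;
# initialize object
my $rss = XML::RSS->new();
# get RSS data
my $raw = get('http://www.slashdot.org/index.rss');
die "Could not retrieve the RSS feed\n" unless defined $raw;
# parse RSS feed
$rss->parse($raw);
# print HTML header and page
# (the HTML tags in this listing were lost in the source; those shown are a reconstruction)
print "Content-Type: text/html\n\n";
print "<html><head></head><body>";
print "<h1>" . $rss->channel('title') . "</h1>";
print "<ul>";
# print titles and URLs of news items
foreach my $item (@{$rss->{'items'}})
{
    my $title = $item->{'title'};
    my $url = $item->{'link'};
    print "<li><a href=\"$url\">$title</a></li>";
}
# print footers
print "</ul></body></html>";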
Figure A: Slashdot RSS feed
Here are some RSS feeds to get you started
So that takes care of adding a feed to your Web site. But hey, why limit yourself to one when you can have many? Listing B, a revision of Listing A,
sets up an array containing the names of many different RSS feeds, and
iterates over the array to produce a page containing multiple channels
of information.
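#!/usr/bin/perl
use strict;
use warnings;
# import packages
use XML::RSS;
# local copies of the RSS feeds, refreshed separately (for example from cron)
# only freshmeat.rdf is named in the article; slashdot.rdf is a placeholder name
my @files = ('freshmeat.rdf', 'slashdot.rdf');
# print HTML header
# (the HTML tags in this listing were lost in the source; those shown are a reconstruction)
print "Content-Type: text/html\n\n";
print "<html><head></head><body>";
# iterate over the array of feeds
foreach my $file (@files)
{
    # initialize a fresh object for each feed
    my $rss = XML::RSS->new();
    # parse the local RSS file
    $rss->parsefile($file);
    print "<h1>" . $rss->channel('title') . "</h1>";
    print "<ul>";
    # print titles and URLs of news items
    foreach my $item (@{$rss->{'items'}})
    {
        my $title = $item->{'title'};
        my $url = $item->{'link'};
        print "<li><a href=\"$url\">$title</a></li>";
    }
    print "</ul>";
}
# print footers
print "</body></html>";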
Figure B: Several RSS feeds
/bin/wget http://www.freshmeat.net/backend/fm.rdf -O freshmeat.rdf
Open-source software supports i18n better and better: PHP 4.3.x built with '--enable-mbstring' '--with-iconv' handles RSS published in both UTF-8 and other Chinese character sets quite well.
Thanks are due to Steve for his transcoding work in PHP for MagPieRSS and for his XML hacking. So far, Add to My Yahoo! still cannot properly handle RSS feeds in the utf-8 character set.
iconv
iconv support                        enabled
iconv implementation                 unknown
iconv library version                unknown

Directive                   Local Value   Master Value
iconv.input_encoding        ISO-8859-1    ISO-8859-1
iconv.internal_encoding     ISO-8859-1    ISO-8859-1
iconv.output_encoding       ISO-8859-1    ISO-8859-1

mbstring
Multibyte Support                    enabled
Japanese support                     enabled
Simplified chinese support           enabled
Traditional chinese support          enabled
Korean support                       enabled
Russian support                      enabled
Multibyte (japanese) regex support   enabled
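1. In the right-hand column, the "sources" section at the top would look better with an image, like the hobby and friends' links sections.
2. The pile of search boxes there is a bit messy; keep only one and move the others to a secondary page.
3. Turn the contact details and the CC notice each into a single line or an image in the right-hand column; the details can go on a secondary page, because I don't think many people read this text closely.
4. If possible, translate the links in Lilina's header into Chinese.
1 Trim overly long summaries.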
$TIMERANGE = ( $_REQUEST['hours'] ? $_REQUEST['hours']*3600 : 3600*24 ) ;
# $Id$
# XML::RSS::Parser::Lite sample
use XML::RSS::Parser::Lite;
use LWP::Simple;
my $xml = get("http://www.klogs.org/index.xml");
my $rp = new XML::RSS::Parser::Lite;
$rp->parse($xml);
print "<a href=\"".$rp->get('url')."\">" . $rp->get('title') . " - " . $rp->get('description') . "</a>\n";
print "<ul>";
for (my $i = 0; $i < $rp->count(); $i++) {
my $it = $rp->get($i);
print "<li><a href=\"" . $it->get('url') . "\">" . $it->get('title') . "</a></li>\n";
}
print "</ul>";
Sample code for XML::RSS::Parser follows:
#!/usr/bin/perl -w
# $Id$
# XML::RSS::Parser sample with Iconv charset convert
use XML::RSS::Parser;
use Text::Iconv;
my $converter = Text::Iconv->new("utf-8", "gbk");
my $p = new XML::RSS::Parser;
my $feed = $p->parsefile('index.xml');
my $title = XML::RSS::Parser->ns_qualify('title',$feed->rss_namespace_uri);
# may cause error this line: print $feed->channel->children($title)->value."\n";
print "item count: ".$feed->item_count()."\n\n";
foreach my $i ( $feed->items ) {
map { print $_->name.": ".$converter->convert($_->value)."\n" } $i->children;
print "\n";
}
From a Trackback on cnblog I learned about the Planet RSS aggregator.
try:
    data = unicode(data, "utf8").encode("utf8")
    logging.debug("Encoding: UTF-8")
except UnicodeError:
    try:
        data = unicode(data, "iso8859_1").encode("utf8")
        logging.debug("Encoding: ISO-8859-1")
    except UnicodeError:
        data = unicode(data, "ascii", "replace").encode("utf8")
        logging.warn("Feed wasn't in UTF-8 or ISO-8859-1, replaced " +
                     "all non-ASCII characters.")
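The excerpt above is Python 2 code, and because any byte sequence decodes as iso8859_1, GBK feeds never reach a decoder that understands them. One possible way around this is to try a list of candidate encodings in order, with GBK ahead of ISO-8859-1. The sketch below is written for Python 3 and is only an illustration: the function name `decode_feed` and the candidate order are assumptions, not part of planetlib.py.

```python
def decode_feed(data, candidates=("utf-8", "gbk", "iso8859_1")):
    """Decode raw feed bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_name) pair. If every candidate fails, non-ASCII
    bytes are replaced, mirroring the last-resort branch of the original code.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("ascii", "replace"), "ascii"


if __name__ == "__main__":
    raw = "中文".encode("gbk")
    text, used = decode_feed(raw)
    print(used, text)
```

Guessing by trial is a heuristic: some ISO-8859-1 byte sequences are also valid GBK, so honouring the encoding declared in the feed's XML prolog, when there is one, is likely to be more reliable.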
The main performance difference between Planet and MagPieRSS lies in the caching mechanism. On using caching to speed up Web services, see: Cacheable CMS design.
Analysis of parsing UTF-8 and GBK RSS in MagPieRSS
As Isaac Mao said at the SocialBrain 2005 discussion: Blog is a 'Window', also could be a 'Bridge'. A blog is the outward-facing "window" of a person or an organization, and RSS makes it easier to combine these windows so that they become "bridges" between them. With such an intermediate publishing layer, blogs go beyond single-point publishing to P2P self-service dissemination, and the importance of RSS for spreading information over the network becomes ever clearer.
Last Modified at December 19, 2004 04:40 PM
Related articles:
Trackback Pings
http://www.chedong.com/cgi-bin/mt3/mt-tb.cgi/27
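My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a careful look at how rss_fetch is used in lilina: the most important thing is to set the RSS output format to 'MAGPIE_OU... [Read More]
After reading an article introducing lilina, I installed a copy myself to try it out. lilina is a PHP-based ... [Read More]
A Lilina RSS aggregator has been set up on CNBLOG. Volunteers, please send me the RSS feeds of your own weblogs or of columns related to cnblog; a reply in the comments is enough.
The main purpose of promoting RSS aggregation tools ... [Read More]
Just add the following statement to the top of index.php; in LILINA you ...
[Read More]
Category index: by default the home page has a monthly archive index but no category index. The manual gives no specific parameter definitions either, so I had to read the source directly. I tried changing Monthly to Category and it actually worked :-) I also found Movable Style, a site of MT styles, ... [Read More]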
Comments
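I changed the UI slightly.
It would be great if you could improve it.
The RSS is already on the web. Aggregating it on your page not only hurts the quality of your own home page but also confuses search engines, producing exactly the effect you denounced: "portal sites damage the enthusiasm for original work". Better not to aggregate!
History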
Required Modules
perl -MCPAN -e "install XML::Parser"
perl -MCPAN -e "install XML::RSS"
ppm install XML-Parser
rss2html.pl
Get the source
This script converts an RSS file on the Web or local file system to HTML.
The first public version of RSS, 0.9, includes basic headline information. Below is an example RSS file for Freshmeat.net, a popular news site for Linux software:
<?xml version="1.0"?>
The first major element is channel, which contains the following elements:
title - the title of the channel
link - the link to the channel Web site
description - short description of the channel
An RSS channel may also contain an image element, as in the example above, which contains the following elements:
title - the text describing the image
url - the URL of the image
link - the URL that the image is linked to
The item element contains the real channel content, which consists of a title and a link element. An RSS file may contain up to 15 items.
An RSS 0.9 file may optionally contain a textinput element, which allows users to type a string into an HTML text input field and submit it via the HTTP GET method to the URL specified in the link element.
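The channel and item structure described above can be walked with any namespace-aware XML parser. The following is a minimal Python sketch using only the standard library; the sample document, its example.com URLs, and the helper names `child_text` and `read_rss09` are made up for illustration and are not the Freshmeat file or any library's API.

```python
import xml.etree.ElementTree as ET

# Default namespace used by RSS 0.9 documents.
RSS09 = "http://my.netscape.com/rdf/simple/0.9/"

# A made-up RSS 0.9 document, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/">
  <channel>
    <title>Example Channel</title>
    <link>http://www.example.com/</link>
    <description>A hypothetical channel</description>
  </channel>
  <item>
    <title>First headline</title>
    <link>http://www.example.com/1</link>
  </item>
  <item>
    <title>Second headline</title>
    <link>http://www.example.com/2</link>
  </item>
</rdf:RDF>"""


def child_text(parent, name):
    """Return the text of the named RSS 0.9 child element, or None if absent."""
    node = parent.find("{%s}%s" % (RSS09, name))
    return node.text if node is not None else None


def read_rss09(xml_text):
    """Collect the channel fields and the (title, link) pair of every item."""
    root = ET.fromstring(xml_text)
    channel = root.find("{%s}channel" % RSS09)
    items = [
        (child_text(item, "title"), child_text(item, "link"))
        for item in root.findall("{%s}item" % RSS09)
    ]
    return {
        "title": child_text(channel, "title"),
        "link": child_text(channel, "link"),
        "description": child_text(channel, "description"),
        "items": items,
    }


if __name__ == "__main__":
    feed = read_rss09(SAMPLE)
    print(feed["title"], feed["link"])
    for title, link in feed["items"]:
        print("-", title, link)
```

Note that in RSS 0.9 the item elements are siblings of channel rather than children of it, which is why the sketch searches for them from the document root.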
Next, we will examine RSS 0.91 which was released by Netscape in July of 1999.
The latest version of RSS added a few new elements. Below is a sample RSS file from XML.com, an excellent XML resource site:
<?xml version="1.0"?>
Notice that there are more descriptive elements for the channel, image, and item elements. These are referred to as "fat elements" because they contain a more detailed description of each channel item.
Now that you've had a chance to glance at two RSS examples, it's time to introduce the XML::RSS module. XML::RSS is a subclass of XML::Parser, a Perl module maintained by Clark Cooper that utilizes James Clark's Expat C library. XML::RSS was developed to simplify the task of manipulating and parsing RSS files. A deep understanding of XML is not a prerequisite for using XML::RSS since the XML details are hidden inside the class interface.
While XML::RSS is capable of creating RSS files, we will be
focusing on parsing existing RSS files in this column. You can read
more about the capabilities of XML::RSS in the module's
documentation or by typing:
perldoc XML::RSS
Well, let's look at the code shall we? Lines 16-17 load the XML::RSS and LWP::Simple modules. We've already talked about XML::RSS in brief, but what does LWP::Simple do? Good question! The answer is simple (puns intended). It's a procedural interface for interacting with a Web server. It's also the little cousin of LWP::UserAgent, a fuller object oriented interface. We'll be using one of the library's subroutines later in the code to fetch an RSS file from the Web.
In lines 20-21 we initialize two variables that we're going to use later.
Line 25 starts the main code body. The first thing we do is verify that the user typed exactly one command-line parameter. This parameter is then assigned to the $arg variable in line 28.
Next we create a new instance of the XML::RSS class and assign the reference to the $rss variable on line 31.
Now we must determine whether the command-line parameter the user entered is an HTTP URL or a file on the local file system (lines 34-46). On line 34, we use a regular expression to look for the characters http:. If the command-line argument starts with these characters, we can safely assume that the user intends to retrieve an RSS file from a Web server. On line 35 we pass the argument to the get() function, which is a part of LWP::Simple, and assign the results to the $content variable. On line 36 we call die() if $content is empty. If this happens, it means there was an error retrieving the RSS file. If the RSS file was downloaded successfully, $rss->parse($content) is called, which parses the RSS file and stores the results in the object's internal structure (line 38).
If the command-line argument does not contain the http: characters, we assume the argument is a file instead of a URL on lines 41-46. The first thing we do is assign the value of $arg to the $file variable and test for the existence of the file (lines 42-43). Then we call $rss->parsefile($file) (line 45), which parses the RSS file and stores the results in the object's internal structure. The parsefile() method parses a file, whereas the parse() method parses the string that's passed to it.
Lastly, we call the print_html subroutine on line 49, which converts the RSS object into nicely formatted HTML.
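The URL-or-file decision walked through above is a small dispatch step that looks much the same in any language. Here is a minimal Python sketch of that logic, not the rss2html.pl source itself; the function name `load_rss` and the error messages are assumptions made for illustration.

```python
import os
import sys
import urllib.request


def load_rss(arg):
    """Return the raw RSS bytes for a command-line argument.

    An argument starting with "http:" is fetched from the Web; anything else
    is treated as a path on the local file system.
    """
    if arg.startswith("http:"):
        with urllib.request.urlopen(arg) as response:
            content = response.read()
        if not content:
            raise RuntimeError("could not retrieve %s" % arg)
        return content
    if not os.path.exists(arg):
        raise FileNotFoundError("file %r does not exist" % arg)
    with open(arg, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    # Exactly one command-line parameter is expected, as in the script above.
    if len(sys.argv) != 2:
        sys.exit("usage: load_rss.py (URL | file)")
    sys.stdout.buffer.write(load_rss(sys.argv[1]))
```

Keeping the fetch and the file read behind one function means the parsing code that follows only ever deals with a single in-memory document, whichever source it came from.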
As you examine this subroutine, you will begin to understand the internal structure of the XML::RSS object. The critical portion of the subroutine is contained on lines 76-79. In this foreach loop, we iterate over each of the RSS items.
Next, let's take a look at rss2html.pl in action.
I've added the following cron jobs that run once per hour on the Webreference server (Scheduler is the NT counterpart):
rss2html.pl http://slashdot.org/slashdot.rdf > slashdot.html
rss2html.pl http://freshmeat.net/backend/fm.rdf > freshmeat.html
rss2html.pl http://www.linuxtoday.com/backend/my-netscape.rdf > linuxtoday.html
rss2html.pl http://www.xml.com/xml/news.rdf > xmlnews.html
rss2html.pl http://www.perlxml.com/rdf/moperl.rdf > mop.html
The commands above fetch the RSS files off the Web and convert them to HTML. Using Server-Side Includes (SSI), I've included the results below:
Well, we've shown in this column that Perl can really pack a wallop in a short amount of code. With rss2html.pl, anyone can automatically add a news feed to their Web site.
For more information on RSS, you might try visiting the following sites:
Level: Introductory
13 Nov 2002
RSS is one of the most successful XML services ever. Despite its chaotic roots, it has become the community standard for exchanging content information across Web sites. Python is an excellent tool for RSS processing, and Mike Olson and Uche Ogbuji introduce a couple of modules available for this purpose.
RSS is an abbreviation with several expansions: "RDF Site Summary," "Really Simple Syndication," "Rich Site Summary," and perhaps others. Behind this confusion of names is an astonishing amount of politics for such a mundane technological area. RSS is a simple XML format for distributing summaries of content on Web sites. It can be used to share all sorts of information including, but not limited to, news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.
RSS was created by Netscape in 1999 to allow content to be gathered from many sources into the Netcenter portal (which is now defunct). The UserLand community of Web enthusiasts became early supporters of RSS, and it soon became a very popular format. The popularity led to strains over how to improve RSS to make it even more broadly useful. This strain led to a fork in RSS development. One group chose an approach based on RDF, in order to take advantage of the great number of RDF tools and modules, and another chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month the battle flared up again with a new version of the non-RDF variant of RSS, which its creators are calling "RSS 2.0."
RSS 0.91 and 1.0 are very popular, and used in numerous portals and Web logs. In fact, the blogging community is a great user of RSS, and RSS lies behind some of the most impressive networks of XML exchange in existence. These networks have grown organically, and are really the most successful networks of XML services in existence. RSS is an XML service by virtue of being an exchange of XML information over an Internet protocol (the vast majority of RSS exchange is simple HTTP GET of RSS documents). In this article, we introduce just a few of the many Python tools available for working with RSS. We don't provide a technical introduction to RSS, because you can find this in so many other articles (see Resources). We recommend first that you gain a basic familiarity with RSS, and that you understand XML. Understanding RDF is not required.
[We consider RSS an 'XML service' rather than a 'Web service' due to the use of XML descriptions but the lack of use of WSDL. -- Editors]
RSS.py
Mark Nottingham's RSS.py is a Python library for RSS processing. It is very complete and well-written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy; just download the Python file from Mark's home page and copy it to somewhere in your PYTHONPATH.
Most users of RSS.py need only concern themselves with two classes it provides: CollectionChannel and TrackingChannel. The latter seems the more useful of the two. TrackingChannel is a data structure that contains all the RSS data indexed by the key of each item. CollectionChannel is a similar data structure, but organized more as RSS documents themselves are, with the top-level channel information pointing to the item details using hash values for the URLs. You will probably use the utility namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses an RSS feed for Python news, and prints out all the information from the various items in a simple listing.
We start by creating a TrackingChannel instance, and then populate it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem an unusual approach to those not used to XML processing techniques, but it is actually a very useful way of being very precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered to be equivalent to an RSS 1.0 one. There is enough data for the application to ignore this distinction, if it likes, by ignoring the namespace portion of each tuple; but the basic API is wedded to the syntax of the original RSS file, so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Notice that we are careful not to assume which properties any particular item might have. We retrieve properties using a safe form that provides a default value if the property is not found, rather than a form that assumes the property is present.
This precaution is necessary because you never know what elements are used in an RSS feed. Listing 2 shows the output from Listing 1.
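To make the tuple-keyed properties and the "safe form" concrete without relying on RSS.py itself, here is a minimal standard-library sketch. It is not RSS.py's API: the helper names `item_properties` and `first_item`, the constants, and the sample feed are all hypothetical. It only shows how (namespace, local name) tuples work as dictionary keys and why a lookup with a default is the safer choice.

```python
import xml.etree.ElementTree as ET

# Namespace of RSS 1.0 elements.
RSS10 = "http://purl.org/rss/1.0/"
RSS10_TITLE = (RSS10, "title")
RSS10_DESC = (RSS10, "description")

# A made-up RSS 1.0 fragment whose only item has no description element.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://www.example.com/1">
    <title>Only a title here</title>
    <link>http://www.example.com/1</link>
  </item>
</rdf:RDF>"""


def item_properties(item):
    """Map (namespace, local name) tuples to the text of an item's children."""
    props = {}
    for child in item:
        if child.tag.startswith("{"):
            namespace, local = child.tag[1:].split("}", 1)
        else:
            namespace, local = None, child.tag
        props[(namespace, local)] = child.text
    return props


def first_item(xml_text):
    """Parse the document and return the property dictionary of its first item."""
    root = ET.fromstring(xml_text)
    return item_properties(root.find("{%s}item" % RSS10))


if __name__ == "__main__":
    item_data = first_item(SAMPLE)
    # Safe form: falls back to a default when the property is absent.
    print("Title:", item_data.get(RSS10_TITLE, "(none)"))
    print("Description:", item_data.get(RSS10_DESC, "(none)"))
    # Unsafe form: item_data[RSS10_DESC] would raise KeyError for this item.
```

Because the namespace is part of the key, an RSS 0.91 title and an RSS 1.0 title never collide; code that does not care about the distinction can compare only the second element of each tuple.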
Of course, you would expect somewhat different output because the news items will have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back to RSS 1.0 format using the output() method. Try this out by writing back out the information parsed in Listing 1. Kick off the script in interactive mode by running: python -i listing1.py. At the resulting Python prompt, run the following example.
The result is an RSS 1.0 document printed out. You must have RSS.py version 0.42 or more recent for this to work. There is a bug in the output() method in earlier versions.
rssparser.py
Mark Pilgrim offers another module for RSS file parsing. It doesn't
provide all the features and options that RSS.py does, but it does
offer a very liberal parser, which deals well with all the confusing
diversity in the world of RSS. To quote from the rssparser.py page:
You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
Then
there are feeds, like Aaron's feed, which are too bleeding edge. He
puts an excerpt in the description element but puts the full text in
the content:encoded element (as CDATA). This is valid RSS 1.0, but
nobody actually uses it (except Aaron), few news aggregators support
it, and many parsers choke on it. Other parsers are confused by the new
elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And
then there's Jon Udell's feed, with the fullitem
element that he just sort of made up.
It's funny to consider this in the light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.
Installing rssparser.py is also very easy. You download the Python file (see Resources), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to your PYTHONPATH. I also suggest getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python, and thus makes fetching RSS feeds less likely to stall the application thread in case of error.
Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.
As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features, and maintains more syntactic information from the RSS feed. The latter is simpler, and a more forgiving parser (the RSS.py parser only accepts well-formed XML).
The output should be the same as in Listing 2.
Conclusion
There are many Python tools for RSS, and we don't have space to cover
them all. Aaron Swartz's page of RSS tools is a good place to start
looking if you want to explore other modules out there. RSS is easy to
work with in Python, because of all the great modules available for it.
The modules hide all the chaos brought about by the history and
popularity of RSS. If your XML services needs mostly involve the
exchange of descriptive information for Web sites, we highly recommend
using the most successful XML service technology in employment.
Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.
This is a "universal" feed parser, suitable for reading syndicated feeds as produced by weblogs, news sites, wikis, and many other types of sites. It handles Atom feeds, CDF, and the nine different versions of RSS.
This project is now hosted at SourceForge. Please check there for updates. This page contains old news and is no longer updated. (2004-06-21)
?者:(x) 车东
最后更斎ͼ(x)2002-08-30 13:18:41
版权声明Q可以Q意{载,转蝲时请务必标明原始出处和作者信?br>
概述Q?b style="color: black; background-color: rgb(255, 255, 102);">CVS是一个C/SpȝQ多个开发h员通过一个中心版本控制系l来记录文g版本Q从而达C证文件同步的目的?
          CVS server (file repository)
            /        |        \
          (version synchronization)
          /          |          \
Developer 1    Developer 2    Developer 3
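The main topics of this article follow. Developers can mostly pick out and read only the sections relevant to them (section 6 among them); CVS administrators need to understand more.
20% of a system's features can usually meet 80% of the needs, and CVS is no exception. What follows are the most commonly used CVS features, probably only a small fraction of all its command options. Get to know the rest through real use: learn as much as you need, and learn more when you need it.
CVS environment initialization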
============
Environment setup: specify the path to the CVS repository, CVSROOT
tcsh
setenv CVSROOT /path/to/cvsroot
bash
CVSROOT=/path/to/cvsroot ; export CVSROOT
Setup for a remote CVS server, covered later:
CVSROOT=:ext:$USER@test.server.address#port:/path/to/cvsroot CVS_RSH=ssh; export
CVSROOT CVS_RSH
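The two variables above can also be assembled programmatically, for example when a script launches cvs for several users. This is a minimal sketch; the function names and the user name "alice" are hypothetical, and the port suffix is deliberately left out because the host part is passed to ssh as-is:

```python
def build_cvsroot(user, host, repo_path):
    """Return an :ext: style CVSROOT string for a remote repository."""
    if not repo_path.startswith("/"):
        raise ValueError("repository path must be absolute")
    return f":ext:{user}@{host}:{repo_path}"


def remote_cvs_env(user, host, repo_path, rsh="ssh"):
    """Return the environment variables cvs needs for access over SSH."""
    return {"CVSROOT": build_cvsroot(user, host, repo_path), "CVS_RSH": rsh}


# Hypothetical user name; host and path are the placeholders used in the text.
env = remote_cvs_env("alice", "test.server.address", "/path/to/cvsroot")
print(env["CVSROOT"])
print(env["CVS_RSH"])
```

The returned dictionary can be merged into os.environ or passed as the env argument when starting cvs from a script.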
Initialization: initialize the CVS repository.
cvs init
First import of a project
cvs import -m "write some comments here" project_name vendor_tag
release_tag
After this runs, all source files and directories are imported into the /path/to/cvsroot/project_name directory.
vendor_tag: vendor tag
release_tag: release tag
Project checkout: export the code from the CVS repository
cvs checkout project_name
cvs creates the project_name directory and exports the latest version of the source code into it. This checkout is not the same concept as check out in Visual SourceSafe: the counterpart of Visual SourceSafe's check out is cvs update, and the counterpart of check in is cvs commit.
Daily use of CVS
=============
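Note: after the first checkout, you no longer synchronize files with cvs checkout. Instead, change into the project_name directory that cvs checkout project_name created and carry out the version synchronization operations (add, modify, delete) on individual files there.
Synchronize files to the latest version: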
cvs update
Without a file name, cvs synchronizes the files in all subdirectories; you can also name a specific file or directory to synchronize:
cvs update file_name
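It is best to do this once every day before starting work, and again before committing your own work to the CVS repository, and to make a habit of "synchronize first, modify later". Unlike Visual SourceSafe, CVS has no concept of file locking; all conflicts are resolved before commit. If someone else has modified the file and committed it to the CVS repository while you were editing, CVS notifies you of the conflict and automatically marks the conflicting part with
<<<<<<< file_name
content in your file
=======
content on cvs server
>>>>>>> revision_number
leaving it to you to decide which of the conflicting content to keep.
Version conflicts generally arise when several people modify the same file, but a project management problem like this should not be left to CVS to solve.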
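To see what has to be resolved, the marked regions can be pulled out of a file with a few lines of code. This is a minimal sketch, assuming the standard markers CVS writes (a line starting with <<<<<<< before your content, ======= as the separator, and >>>>>>> after the repository content); the function name and the sample text are hypothetical:

```python
def split_conflicts(text):
    """Return a list of (local_lines, repository_lines), one per conflict."""
    conflicts = []
    local, repo = [], []
    state = None  # None outside a conflict, "local" or "repo" inside one
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state, local, repo = "local", [], []
        elif line.startswith("=======") and state == "local":
            state = "repo"
        elif line.startswith(">>>>>>>") and state == "repo":
            conflicts.append((local, repo))
            state = None
        elif state == "local":
            local.append(line)
        elif state == "repo":
            repo.append(line)
    return conflicts


# Hypothetical file content after a conflicting cvs update.
sample = """unchanged line
<<<<<<< file_name
content in your file
=======
content on cvs server
>>>>>>> 1.5
"""

for mine, theirs in split_conflicts(sample):
    print("yours:", mine)
    print("repository:", theirs)
```

The sketch only reports the two sides; choosing which one to keep remains a manual decision, as the text says.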
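Commit your changes to the CVS repository:
cvs commit -m "write some comments here" file_name
Note: many CVS actions only take final effect through cvs commit, and it is best to change only one file per commit. Before committing you also need to supply a comment describing the change, to help other developers understand the reason for it. If you commit directly with `cvs commit file_name`, without -m "comments", cvs automatically launches the system's default text editor (usually vi) and asks you to enter a comment.
The quality of comments matters, so you must not only write them but write something meaningful that other developers can readily understand.
A poor comment is hard for other developers to grasp quickly, for example: -m "bug fixed" or even -m ""
A good comment (it can even be in Chinese): -m "Added Email address validation to the user registration process"
Changing the comment of a revision: committing only one file at a time to the CVS repository is a good habit, but sometimes you will forget to name the file and commit several files with the same comment. The following command lets you change the comment of a given revision of a given file: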
cvs admin -m 1.3:"write some comments here" file_name
Adding files
After creating a new file, for example: touch new_file
cvs add new_file
Note: for images, Word documents and other items that are not plain text, you need the cvs add -kb option; otherwise the file may be corrupted.
For example: cvs add -kb new_file.gif
Then commit the change with a comment:
cvs ci -m "write some comments here"
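The choice between a plain cvs add and cvs add -kb is easy to automate in a helper script. This is a minimal sketch that only builds the command line; the extension list and the function name are assumptions for illustration and should be adapted to the project:

```python
import os

# Assumption: an illustrative, non-exhaustive list of binary file types.
BINARY_EXTENSIONS = {".gif", ".jpg", ".png", ".doc", ".zip"}


def cvs_add_command(filename):
    """Return the argument list for adding a file, with -kb for binaries."""
    extension = os.path.splitext(filename)[1].lower()
    command = ["cvs", "add"]
    if extension in BINARY_EXTENSIONS:
        command.append("-kb")  # no keyword expansion, no line-ending conversion
    command.append(filename)
    return command


print(cvs_add_command("new_file.gif"))
print(cvs_add_command("new_file"))
```

The returned list can be handed to subprocess.run from inside a checked-out working directory.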
Deleting files:
After physically deleting a source file, for example: rm file_name
cvs rm file_name
Then commit the change with a comment:
cvs ci -m "write some comments here"
The first two steps above can be combined as:
cvs rm -f file_name
cvs ci -m "why delete file"
Note: many cvs commands have short forms: commit=>ci; update=>up; checkout=>co; remove=>rm;
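When reading scripts written by others it helps to translate the short forms back into the long ones. A minimal sketch covering only the four pairs listed above (the function name is hypothetical):

```python
# The four short forms named in the text, mapped to their long forms.
CVS_ALIASES = {"ci": "commit", "up": "update", "co": "checkout", "rm": "remove"}


def expand_alias(argv):
    """Return argv with a leading short command replaced by its long form."""
    if not argv:
        return []
    return [CVS_ALIASES.get(argv[0], argv[0])] + list(argv[1:])


print(expand_alias(["ci", "-m", "why delete file"]))
```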
Adding a directory:
cvs add dir_name
View the change history:
cvs log file_name
cvs history file_name
Show the differences between two revisions of a file:
cvs diff -r1.3 -r1.5 file_name
Show the differences between the current file (which may already be modified) and the corresponding file in the repository:
cvs diff file_name
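The cvs web interface offers a more convenient way to locate file changes and compare revisions; for installation and setup, see the section on cvsweb below.
The correct way to restore an old version with CVS:
If you use cvs update -r1.2 file.name, the command puts a sticky tag "1.2" on file.name, even though all you meant to do was restore it to revision 1.2.
The correct way to restore a version is: cvs update -p -r1.2 file_name >file_name
If you have accidentally set a sticky tag, clear it with cvs update -A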
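The restore step can be wrapped in a small helper. This is a minimal sketch, assuming cvs is on the PATH and the code runs inside a checked-out working directory; the function names are hypothetical. It captures the output of cvs update -p first and writes the file only if cvs succeeds:

```python
import subprocess


def restore_command(revision, filename):
    """Return the argument list that prints an old revision to stdout."""
    return ["cvs", "update", "-p", f"-r{revision}", filename]


def restore_revision(revision, filename, runner=subprocess.run):
    """Overwrite filename with the given revision without a sticky tag."""
    # check=True raises if cvs fails, so the file is left untouched then.
    result = runner(restore_command(revision, filename),
                    capture_output=True, check=True)
    with open(filename, "wb") as out:
        out.write(result.stdout)


print(restore_command("1.2", "file_name"))
```

Unlike the shell redirect, which truncates file_name before cvs even starts, this version keeps the current content if the command fails.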
Moving and renaming files
cvs has no cvs move or cvs rename, because both operations are carried out by running cvs remove old_file_name and then cvs add new_file_name.
Deleting and moving directories:
The most convenient way is to have the administrator move or delete the corresponding directory directly in CVSROOT. (Every subdirectory of a CVS project is independent, and any of them can be moved under $CVSROOT to become a new standalone project, much as any branch cut from a tree can survive on its own.) After a directory has been changed, ask the developers to check the project out again with cvs checkout project_name, or to synchronize with cvs update -dP.
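CVS Branch: parallel development on multiple branches
=============================
Marking a milestone: the files in a project each have their own revision numbers. When the project reaches a certain stage, you can give all files one common milestone tag, which makes it easy to check the project out by that milestone later and is also the basis for developing multiple branches of the project.
cvs tag release_1_0
Starting a new milestone:
cvs commit -r 2 marks all files as entering 2.x development
Note: a revision in CVS need not have any direct relation to the release version of the software package. But using revision numbers that match the release version for all files makes maintenance easier.
While developing version 2.x of the project you find a problem in 1.x, but 2.x is not yet fit for use, so you create a branch release_1_0_patch from the previously tagged milestone release_1_0: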
cvs rtag -b -r release_1_0 release_1_0_patch proj_dir
Some people first check out the release_1_0_patch branch in a separate directory to fix the urgent problems in 1.0:
cvs checkout -r release_1_0_patch
while everyone else continues developing on the project's main trunk, 2.x.
After fixing the bugs on release_1_0_patch, tag a bug-fix release of 1.0:
cvs tag release_1_0_patch_1
If the 2.0 team decides these fixes are also needed in 2.0, they can merge the changes in release_1_0_patch_1 into the current code from the 2.0 development directory:
cvs update -j release_1_0_patch_1
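The patch-branch workflow above is a fixed sequence of four commands, which can be written down as data for a release script. This is a minimal sketch that only builds the command lists; the function name is hypothetical, and passing the module name to cvs checkout is an assumption added here, since the checkout line in the text omits it:

```python
def patch_branch_commands(milestone, branch, module, fix_tag):
    """Return the cvs commands for a patch branch, in the order they run."""
    return [
        # 1. Create the branch from the milestone tag, directly in the repository.
        ["cvs", "rtag", "-b", "-r", milestone, branch, module],
        # 2. Check the branch out (run this in a separate directory).
        ["cvs", "checkout", "-r", branch, module],
        # 3. After committing the fixes on the branch, tag the fixed state.
        ["cvs", "tag", fix_tag],
        # 4. In the trunk working directory, merge the fixes.
        ["cvs", "update", "-j", fix_tag],
    ]


steps = patch_branch_commands(
    "release_1_0", "release_1_0_patch", "proj_dir", "release_1_0_patch_1"
)
for step in steps:
    print(" ".join(step))
```

Steps 2 to 4 run in different working directories, so a real script would also have to change directory between them.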
Remote authentication for CVS: accessing CVS remotely over SSH
================================
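Using cvs's own remote authentication is cumbersome: you have to define the server and user groups, user names, passwords and so on, and it is not secure. A better approach is to authenticate against the system's local accounts and transport over SSH, by putting the following in /etc/profile on the client machines: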
CVSROOT=:ext:$USER@test.server.address#port:/path/to/cvsroot CVS_RSH=ssh; export
CVSROOT CVS_RSH
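Every local user on every client machine is then mapped to the account of the same name on the CVS server.
If the SSH port on the CVS server is not the default 22, or the default SSH ports of the client and the CVS server differ, then sometimes even after setting: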
:ext:$USER@test.server.address#port:/path/to/cvsroot
it still does not work, for example with the following error message:
ssh: test.server.address#port: Name or service not known
cvs [checkout aborted]: end of file from server (consult above messages if any)
The solution is a script that redirects to the specified port (an alias cannot be used; it fails with a file-not-found error):
Create a file /usr/bin/ssh_cvs:
#!/usr/bin/sh
/path/to/ssh -p 34567 "$@"
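Then: chmod +x /usr/bin/ssh_cvs
and set CVS_RSH=ssh_cvs; export CVS_RSH
Note: port refers to the SSH port of the server in question, not the cvs pserver port.
CVSWEB: making it faster for programmers to compare file changes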
================================
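CVSWEB is the web interface to CVS; it makes locating changes much more efficient for programmers.
For a sample of it in use, see: http://www.freebsd.org/cgi/cvsweb.cgi
Downloading CVSWEB: since its first version, CVSWEB has evolved into many variants with richer features and interfaces. This is the one I personally find easiest to install and set up: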
http://www.spaghetti-code.de/software/linux/cvsweb/
Download and unpack:
tar zxf cvsweb.tgz
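Put the configuration file cvsweb.conf somewhere safe (for example, in the same directory as the apache configuration),
and modify cvsweb.cgi so the CGI can find the configuration file: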
$config = $ENV{'CVSWEB_CONFIG'} || '/path/to/apache/conf/cvsweb.conf';
Change to /path/to/apache/conf and edit cvsweb.conf:
CVSWEB must not be opened to all users indiscriminately, so web user authentication is needed:
First generate the passwd file:
/path/to/apache/bin/htpasswd -c cvsweb.passwd user
Edit httpd.conf and add:
<Directory "/path/to/apache/cgi-bin/cvsweb/">
AuthName "CVS Authorization"
AuthType Basic
AuthUserFile /path/to/cvsweb.passwd
require valid-user
</Directory>
CVS TAGS: who? when?
====================
Putting $Id$ in a comment at the top of each program file is a good habit. cvs automatically expands it to the form file_name version time user_name, for example: cvs_card.txt,v 1.1 2002/04/05 04:24:12 chedong Exp. This information tells you who last modified the file and when.
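Because the expanded keyword has a fixed layout, a script can read the last author and revision straight out of a source file. This is a minimal sketch, assuming the expanded form shown above wrapped in "$Id: ... $"; the pattern and function names are hypothetical:

```python
import re

# Matches an expanded keyword such as
# "$Id: cvs_card.txt,v 1.1 2002/04/05 04:24:12 chedong Exp $".
ID_PATTERN = re.compile(
    r"\$Id: (?P<file>\S+),v (?P<revision>[\d.]+) "
    r"(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<author>\S+) (?P<state>\S+) \$"
)


def parse_id(text):
    """Return the keyword fields as a dict, or None if none is expanded."""
    match = ID_PATTERN.search(text)
    return match.groupdict() if match else None


info = parse_id("# $Id: cvs_card.txt,v 1.1 2002/04/05 04:24:12 chedong Exp $")
print(info["file"], info["revision"], info["author"])
```

A file that still contains the bare, unexpanded keyword yields None, which is a quick way to spot files that were never committed.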
Some commonly used default file templates:
default.php
<?php
/*
* Copyright (c) 2002 Company Name.
* $Header$
*/
?>
====================================
Default.java: note the difference between an ordinary file header comment, which starts with /*, and a JAVADOC comment, which starts with /**
/*
* Copyright (c) 2002 Company Name.
* $Header$
*/
package com.netease;
import java.io.*;
/**
* comments here
*/
public class Default {
/**
*
* @param
* @return
*/
public String toString() {
return super.toString();
}
}
====================================
default.pl:
#!/usr/bin/perl -w
# Copyright (c) 2002 Company Name.
# $Header$
# file comments here
use strict;
CVS vs VSS
===========
CVS has no file locking mode; VSS, on check out, also records that the file is locked by the person who checked it out.
CVS uses update and commit; VSS uses check out and check in.
In CVS, automatic keyword expansion is on by default. This brings a potential problem: a binary file added without -kb may be corrupted when cvs expands keywords in it.
In Visual SourceSafe this feature is called Keyword Expansion. It is off by default and must be turned on through the options, where you also specify the file types to scan for keywords: *.txt, *.java, *.html...
Keywords common to both Visual SourceSafe and CVS are:
$Header$
$Author$
$Date$
$Revision$
Use the common keywords wherever possible so that code can be tracked easily in both CVS and VSS.
Related resources:
CVS home:
http://www.cvshome.org
CVS FAQ:
http://www.loria.fr/~molli/cvs-index.html
Related sites:
http://directory.google.com/Top/Computers/Software/Configuration_Management/Tools/Concurrent_Versions_System/
Free CVS book:
http://cvsbook.red-bean.com/
CVS command quick reference card:
http://www.refcards.com/about/cvs.html
Excerpted from: http://www.chedong.com/tech/cvs_card.html
This command creates an ordinary Unix user named myaccount.
Then create a Samba user based on it:
Or:
The password in Samba is not related to the unix account password.
Note: whenever you update the samba configuration file, you must restart samba with /etc/init.d/samba restart (Debian).
Then in Windows, use the username and the Samba password to map a network drive.