Python exceptions and regular expressions
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html#regex-howto

Example 6.1. Opening a non-existent file
>>> fsock = open("/notthere", "r")
Traceback (innermost last):
File "<interactive input>", line 1, in ?
IOError: [Errno 2] No such file or directory: '/notthere'
>>> try:
...     fsock = open("/notthere")
... except IOError:
...     print "The file does not exist, exiting gracefully"
...
The file does not exist, exiting gracefully
>>> print "This line will always print"
This line will always print
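
The transcript above is Python 2. As a rough sketch of the same idea in Python 3 (an assumption on my part, since these notes were taken with Python 2): there, opening a missing file raises FileNotFoundError, a subclass of OSError, and IOError is kept as an alias of OSError. The helper name read_or_default and the temporary path are hypothetical, used only for illustration.

```python
import os
import tempfile


def read_or_default(path, default=""):
    """Return the text of the file at `path`, or `default` if it is missing."""
    try:
        fsock = open(path, "r")
    except FileNotFoundError:
        # The file is not there: recover instead of crashing.
        print("The file does not exist, exiting gracefully")
        return default
    else:
        # This branch runs only when open() succeeded.
        with fsock:
            return fsock.read()


# A path inside a fresh temporary directory, so it cannot exist yet.
missing = os.path.join(tempfile.mkdtemp(), "notthere")
content = read_or_default(missing)
print("This line will always print")
```

The else clause keeps the "success" code out of the try block, so only the open() call itself is guarded by the exception handler.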

# Bind the name getpass to the appropriate function
try:
    import termios, TERMIOS
except ImportError:
    try:
        import msvcrt
    except ImportError:
        try:
            from EasyDialogs import AskPassword
        except ImportError:
            getpass = default_getpass
        else:
            getpass = AskPassword
    else:
        getpass = win_getpass
else:
    getpass = unix_getpass

Example 6.10. Iterating through a dictionary
>>> import os
>>> for k, v in os.environ.items():
...     print "%s=%s" % (k, v)
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM
USERNAME=mpilgrim

[...snip...]
>>> print "\n".join(["%s=%s" % (k, v)
... for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim
OS=Windows_NT
COMPUTERNAME=MPILGRIM

Example 6.13. Using sys.modules
>>> import sys
>>> import fileinfo
>>> print '\n'.join(sys.modules.keys())
win32api
os.path
os
fileinfo
exceptions

>>> fileinfo
<module 'fileinfo' from 'fileinfo.pyc'>
>>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>

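The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a known class is defined.

Example 6.14. The __module__ class attribute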
>>> from fileinfo import MP3FileInfo
>>> MP3FileInfo.__module__
'fileinfo'
>>> sys.modules[MP3FileInfo.__module__]
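<module 'fileinfo' from 'fileinfo.pyc'>

Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
Combined with the sys.modules dictionary, it gives you a reference to the module in which a class is defined.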
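
The same lookup can be tried without the fileinfo module from the book. This is a small sketch that assumes Python 3 and uses the standard library class json.JSONDecoder as a stand-in for MP3FileInfo; the helper name module_of is hypothetical.

```python
import sys
from json import JSONDecoder


def module_of(cls):
    """Return the module object in which the class `cls` was defined."""
    return sys.modules[cls.__module__]


# JSONDecoder is imported from the json package, but it is defined
# in the submodule json.decoder, and __module__ reports exactly that.
module_name = JSONDecoder.__module__
module = module_of(JSONDecoder)

print(module_name)
print(module.JSONDecoder is JSONDecoder)
```

Note that __module__ names the module where the class was defined, which is not necessarily the name you imported it from.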

Example 6.16. Constructing pathnames
>>> import os
>>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.join("c:\\music\\ap", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3'
>>> os.path.expanduser("~")
'c:\\Documents and Settings\\mpilgrim\\My Documents'
>>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'

Example 7.2. Matching whole words
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT. 3'

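What I really wanted was to match 'ROAD' only when it appears at the end of the string and is a whole word on its own, not part of some longer word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python this is complicated by the fact that the '\' character must itself be escaped inside a string. This is sometimes called the "backslash plague", and it is one reason why regular expressions are somewhat easier in Perl than in Python. On the other hand, Perl mixes regular expressions with its other syntax, so when you hit a bug it can be hard to tell whether it is a syntax error or an error in the regular expression.
To work around the backslash plague, you can use a so-called raw string: just prefix the string with the letter r. This tells Python that nothing in the string should be escaped; '\t' is a tab character, but r'\t' is a real backslash character '\' followed by the letter 't'. I recommend always using raw strings when dealing with regular expressions; otherwise things get confusing very quickly (and regular expressions get confusing quickly enough all by themselves).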
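
To make the difference concrete, here is a short sketch (written for Python 3, which is an assumption; the variable names are mine). It compares the escaped and raw spellings of the same pattern and shows what happens when the backslash is not escaped at all.

```python
import re

s = "100 BROAD ROAD APT. 3"

# '\t' is one character (a tab); r'\t' is two (a backslash and a 't').
tab_len = len("\t")
raw_len = len(r"\t")

# Escaping each backslash by hand and using a raw string give the same pattern.
same_pattern = "\\bROAD\\b" == r"\bROAD\b"

# In an ordinary string '\b' is a backspace character, so this pattern
# looks for literal backspaces around ROAD and matches nothing.
unescaped = re.sub("\bROAD\b", "RD.", s)

# The raw string passes the two-character sequence \b through to the re module.
raw = re.sub(r"\bROAD\b", "RD.", s)

print(tab_len, raw_len, same_pattern)
print(unescaped)
print(raw)
```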

Example 7.4. Checking for hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern, 'MCM')
<SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')
<SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')
<SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')
>>> re.search(pattern, '')
<SRE_Match object at 01071D98>

Example 7.5. The old way: every character optional
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')
>>>

Example 7.6. The new way: from n to m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')
>>>
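
A quick way to convince yourself that the two spellings accept the same strings is to run both over the same candidates. This is only a sketch for Python 3; the names old_style, new_style and results are mine.

```python
import re

old_style = re.compile(r"^M?M?M?$")   # three optional M characters
new_style = re.compile(r"^M{0,3}$")   # between zero and three M characters

results = {}
for candidate in ["", "M", "MM", "MMM", "MMMM"]:
    old_match = old_style.search(candidate) is not None
    new_match = new_style.search(candidate) is not None
    results[candidate] = (old_match, new_match)
    print(repr(candidate), old_match, new_match)
```

Both patterns accept zero to three M characters and reject four.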

The expression for the ones place follows the same pattern. I will skip the details and go straight to the result.

>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
So what does this regular expression look like in the alternative {n,m} syntax? The next example shows the new syntax.

Example 7.8. Validating Roman numerals with {n,m}
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48>
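
Wrapped in a function, the pattern becomes a small validator. This is a sketch for Python 3 under my own naming (ROMAN and is_roman are hypothetical), not code from the book.

```python
import re

# Same pattern as in Example 7.8, compiled once and written as a raw string.
ROMAN = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def is_roman(s):
    """Return True if `s` matches the Roman numeral pattern."""
    return ROMAN.search(s) is not None


for candidate in ["MDLV", "MMDCLXVI", "MCMC", "IIII", ""]:
    print(repr(candidate), is_roman(candidate))
```

One thing to watch: every part of the pattern is optional, so the empty string matches as well. A real validator would probably want to reject it explicitly.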

Example 7.9. Regular expressions with inline comments
>>> pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')
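The most important thing to remember when using verbose regular expressions is that you must pass an extra argument, re.VERBOSE. It is a constant defined in the re module which signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern contains a lot of whitespace (all of which is ignored) and several comments (all of which are ignored too). Once you ignore the whitespace and the comments, it is exactly the same regular expression as in the previous section, but it is much more readable.
>>> re.search(pattern, 'M')
This does not match. Why? Because the re.VERBOSE flag is missing, so re.search treats the pattern as a compact regular expression, in which the whitespace and the hash marks are significant. Python cannot auto-detect whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly mark it as verbose.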
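
The effect of the flag is easy to reproduce with a shortened pattern. This sketch assumes Python 3; the thousands-only pattern and the variable names are mine, chosen to keep the example small.

```python
import re

verbose_pattern = """
    ^           # beginning of string
    M{0,3}      # thousands - 0 to 3 M's
    $           # end of string
    """
compact_pattern = "^M{0,3}$"

# With re.VERBOSE the whitespace and the comments are ignored.
with_flag = re.search(verbose_pattern, "MM", re.VERBOSE)

# Without the flag the newlines, spaces and '#' text are part of the pattern,
# so it can never match a plain 'MM'.
without_flag = re.search(verbose_pattern, "MM")

# The compact spelling needs no flag.
compact = re.search(compact_pattern, "MM")

print(with_flag is not None, without_flag is not None, compact is not None)
```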

Example 7.16. Parsing phone numbers (final version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
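
Calling .groups() directly on the result of search() fails with an AttributeError when nothing matches, because search() returns None. A small wrapper avoids that. This is a sketch for Python 3; PHONE and parse_phone are hypothetical names, and the pattern is the one from Example 7.16.

```python
import re

PHONE = re.compile(r"""
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    """, re.VERBOSE)


def parse_phone(text):
    """Return (area, trunk, rest, extension), or None if nothing matches."""
    match = PHONE.search(text)
    if match is None:
        return None
    return match.groups()


print(parse_phone("work 1-(800) 555.1212 #1234"))
print(parse_phone("800-555-1212"))
print(parse_phone("no number here"))
```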

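You should now be familiar with the following techniques:

^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x character (in other words, it matches x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n times, but no more than m times.
(a|b|c) matches either a, or b, or c.
(x) in general is a remembered group. You can get its value with the groups() method of the object returned by re.search.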
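
The items in the list above can each be checked with a one-line experiment. This is a sketch for Python 3; the sample strings are made up for illustration.

```python
import re

# x? : the 'u' is optional, so both spellings match.
optional = [bool(re.search(r"^colou?r$", w)) for w in ("color", "colour")]

# x* allows zero occurrences, x+ requires at least one.
star = bool(re.search(r"^ab*$", "a"))
plus = bool(re.search(r"^ab+$", "a"))

# x{n,m} : two to four digits, so five digits are rejected.
bounded = [bool(re.search(r"^\d{2,4}$", d)) for d in ("12", "1234", "12345")]

# (a|b|c) : exactly one of the alternatives.
choice = bool(re.search(r"^(cat|dog|bird)$", "dog"))

# \b : only the standalone word ROAD is found, not the one inside BROAD.
whole_words = re.findall(r"\bROAD\b", "BROAD ROAD")

# (x) : remembered groups, with \D+ skipping the non-digits in between.
groups = re.search(r"(\d+)\D+(\d+)", "room 12, floor 3").groups()

print(optional, star, plus, bounded, choice, whole_words, groups)
```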

http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html

Regular expression pattern syntax
Element	Meaning
.	Matches any character except `\n` (if `DOTALL`, also matches `\n`)
^	Matches start of string (if `MULTILINE`, also matches after `\n`)
$	Matches end of string (if `MULTILINE`, also matches before `\n`)
*	Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
+	Matches one or more cases of the previous regular expression; greedy (match as many as possible)
?	Matches zero or one case of the previous regular expression; greedy (match one if possible)
*?, +?, ??	Non-greedy versions of `*`, `+`, and `?` (match as few as possible)
{m,n}	Matches `m` to `n` cases of the previous regular expression (greedy)
{m,n}?	Matches `m` to `n` cases of the previous regular expression (non-greedy)
[...]	Matches any one of a set of characters contained within the brackets
|	Matches expression either preceding it or following it
(...)	Matches the regular expression within the parentheses and also indicates a group
(?iLmsux)	Alternate way to set optional flags; no effect on match
(?:...)	Like `(...)`, but does not indicate a group
(?P<id>...)	Like `(...)`, but the group also gets the name `id`
(?P=id)	Matches whatever was previously matched by group named `id`
(?#...)	Content of parentheses is just a comment; no effect on match
(?=...)	Lookahead assertion; matches if regular expression `...` matches what comes next, but does not consume any part of the string
(?!...)	Negative lookahead assertion; matches if regular expression `...` does not match what comes next, and does not consume any part of the string
(?<=...)	Lookbehind assertion; matches if there is a match for regular expression `...` ending at the current position (`...` must match a fixed length)
(?<!...)	Negative lookbehind assertion; matches if there is no match for regular expression `...` ending at the current position (`...` must match a fixed length)
\number	Matches whatever was previously matched by group numbered `number` (groups are automatically numbered from 1 up to 99)
\A	Matches an empty string, but only at the start of the whole string
\b	Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also `\w`)
\B	Matches an empty string, but not at the start or end of a word
\d	Matches one digit, like the set `[0-9]`
\D	Matches one non-digit, like the set `[^0-9]`
\s	Matches a whitespace character, like the set `[ \t\n\r\f\v]`
\S	Matches a non-whitespace character, like the set `[^ \t\n\r\f\v]`
\w	Matches one alphanumeric character; unless `LOCALE` or `UNICODE` is set, `\w` is like `[a-zA-Z0-9_]`
\W	Matches one non-alphanumeric character, the reverse of `\w`
\Z	Matches an empty string, but only at the end of the whole string
\\	Matches one backslash character
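
A few rows of the table are easier to understand by running them. The sketch below (Python 3, with made-up sample strings) tries greedy versus non-greedy matching, named groups, a named backreference, and a lookahead assertion.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy .* runs to the last '>', non-greedy .*? stops at the first one.
greedy = re.findall(r"<.*>", html)
lazy = re.findall(r"<.*?>", html)

# (?P<id>...) gives each group a name that can be used with group().
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
                 "posted on 2009-08-22")
year, month, day = date.group("year"), date.group("month"), date.group("day")

# (?P=id) matches the same text that the named group matched earlier.
repeated = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it is is wrong").group("word")

# (?=...) checks what follows without consuming it, so '.py' is not
# part of the returned names.
python_files = re.findall(r"\w+(?=\.py\b)", "a.py b.txt c.py")

print(greedy)
print(lazy)
print(year, month, day, repeated, python_files)
```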

Posted on 2009-08-22 23:48 by Frank_Fang. Category: Python Learning