
          Python exceptions and regular expressions
          http://docs.python.org/library/re.html
          http://docs.python.org/howto/regex.html#regex-howto

          Example 6.1. Opening a non-existent file
          >>> fsock = open("/notthere", "r")     
          Traceback (innermost last):
            File "<interactive input>", line 1, in ?
          IOError: [Errno 2] No such file or directory: '/notthere'
          >>> try:
          ...     fsock = open("/notthere")      
          ... except IOError:                    
          ...     print "The file does not exist, exiting gracefully"
          ...
          The file does not exist, exiting gracefully
          >>> print "This line will always print"
          This line will always print


          # Bind the name getpass to the appropriate function
            try:
                import termios, TERMIOS                    
            except ImportError:
                try:
                    import msvcrt                          
                except ImportError:
                    try:
                        from EasyDialogs import AskPassword
                    except ImportError:
                        getpass = default_getpass          
                    else:                                  
                        getpass = AskPassword
                else:
                    getpass = win_getpass
            else:
                getpass = unix_getpass
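
The nested try...except...else chain above can be tried in isolation. The following is a minimal Python 3 sketch, not the original getpass source: the name platform_name is a hypothetical stand-in for the getpass binding, and only the termios and msvcrt imports are kept. Each else clause runs only when its matching try block raised no exception.

```python
# Sketch (Python 3): choose a value based on which platform module imports.
# platform_name is a hypothetical stand-in for the getpass binding.
try:
    import termios                   # available on Unix-like systems
except ImportError:
    try:
        import msvcrt                # available on Windows
    except ImportError:
        platform_name = "default"    # neither module could be imported
    else:
        platform_name = "windows"    # runs only if "import msvcrt" succeeded
else:
    platform_name = "unix"           # runs only if "import termios" succeeded

print(platform_name)
```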

           

          Example 6.10. Iterating through a dictionary
          >>> import os
          >>> for k, v in os.environ.items():      
          ...     print "%s=%s" % (k, v)
          USERPROFILE=C:\Documents and Settings\mpilgrim
          OS=Windows_NT
          COMPUTERNAME=MPILGRIM
          USERNAME=mpilgrim

          [...snip...]
          >>> print "\n".join(["%s=%s" % (k, v)
          ...     for k, v in os.environ.items()])
          USERPROFILE=C:\Documents and Settings\mpilgrim
          OS=Windows_NT
          COMPUTERNAME=MPILGRIM

           

          Example 6.13. Using sys.modules
          >>> import sys
          >>> import fileinfo
          >>> print '\n'.join(sys.modules.keys())
          win32api
          os.path
          os
          fileinfo
          exceptions

          >>> fileinfo
          <module 'fileinfo' from 'fileinfo.pyc'>
          >>> sys.modules["fileinfo"]
          <module 'fileinfo' from 'fileinfo.pyc'>


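          The next example shows how to combine the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a known class is defined.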

          Example 6.14. The __module__ class attribute
          >>> from fileinfo import MP3FileInfo
          >>> MP3FileInfo.__module__             
          'fileinfo'
          >>> sys.modules[MP3FileInfo.__module__]
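          <module 'fileinfo' from 'fileinfo.pyc'>

          Every Python class has a built-in class attribute __module__, which holds the name of the module in which the class is defined.
          Combining it with the sys.modules dictionary gives you a reference to the module in which a class is defined.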
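
Because the fileinfo module is not part of the standard library, the same lookup can be sketched in Python 3 with a standard-library class. Using fractions.Fraction here is an assumption made purely for illustration; any imported class would do.

```python
import sys
from fractions import Fraction

# __module__ holds the name of the module that defines the class.
module_name = Fraction.__module__

# sys.modules maps module names to the already-imported module objects.
module = sys.modules[module_name]

print(module_name)
print(module.Fraction is Fraction)  # the module really does hold this class
```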

           

          Example 6.16. Constructing pathnames
          >>> import os
          >>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3") 
          'c:\\music\\ap\\mahadeva.mp3'
          >>> os.path.join("c:\\music\\ap", "mahadeva.mp3")  
          'c:\\music\\ap\\mahadeva.mp3'
          >>> os.path.expanduser("~")                        
          'c:\\Documents and Settings\\mpilgrim\\My Documents'
          >>> os.path.join(os.path.expanduser("~"), "Python")
          'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'

           

          Example 7.2. Matching whole words
          >>> import re
          >>> s = '100 BROAD'
          >>> re.sub('ROAD$', 'RD.', s)
          '100 BRD.'
          >>> re.sub('\\bROAD$', 'RD.', s) 
          '100 BROAD'
          >>> re.sub(r'\bROAD$', 'RD.', s) 
          '100 BROAD'
          >>> s = '100 BROAD ROAD APT. 3'
          >>> re.sub(r'\bROAD$', 'RD.', s) 
          '100 BROAD ROAD APT. 3'
          >>> re.sub(r'\bROAD\b', 'RD.', s)
          '100 BROAD RD. APT. 3'

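          What I really wanted was to match 'ROAD' only when it appears at the end of the string and is a whole word by itself, not part of some longer word. To express this in a regular expression, you use \b, which means "a word boundary must occur right here". In Python, this is complicated by the fact that the '\' character must itself be escaped inside a string. This is sometimes called the "backslash plague", and it is one reason regular expressions are somewhat easier in Perl than in Python. On the other hand, Perl mixes regular expressions with other syntax, so when you hit a bug it can be hard to tell whether it is a syntax error or an error in the regular expression.
          To work around the backslash plague, you can use a "raw string" by prefixing the string with the letter r. This tells Python that nothing in the string should be escaped: '\t' is a tab character, but r'\t' is really the backslash character '\' followed by the letter 't'. I recommend always using raw strings when working with regular expressions; otherwise things get confusing quickly (and regular expressions get confusing quickly enough on their own).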
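
The difference between escaped and raw strings can be checked directly. This is a small Python 3 sketch; the variable names are illustrative and not from the original text.

```python
import re

# '\t' is a single tab character; r'\t' is a backslash followed by 't'.
tab_length = len('\t')
raw_length = len(r'\t')

# The doubled-backslash form and the raw-string form are the same pattern.
escaped = '\\bROAD\\b'
raw = r'\bROAD\b'
same = (escaped == raw)

s = '100 BROAD ROAD APT. 3'
result = re.sub(raw, 'RD.', s)   # only the whole word ROAD is replaced
print(tab_length, raw_length, same, result)
```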

           

          Example 7.4. Checking for hundreds
          >>> import re
          >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
          >>> re.search(pattern, 'MCM')           
          <SRE_Match object at 01070390>
          >>> re.search(pattern, 'MD')            
          <SRE_Match object at 01073A50>
          >>> re.search(pattern, 'MMMCCC')        
          <SRE_Match object at 010748A8>
          >>> re.search(pattern, 'MCMC')          
          >>> re.search(pattern, '')              
          <SRE_Match object at 01071D98>

           

          Example 7.5. The old way: every character optional
          >>> import re
          >>> pattern = '^M?M?M?$'
          >>> re.search(pattern, 'M')   
          <_sre.SRE_Match object at 0x008EE090>
          >>> pattern = '^M?M?M?$'
          >>> re.search(pattern, 'MM')  
          <_sre.SRE_Match object at 0x008EEB48>
          >>> pattern = '^M?M?M?$'
          >>> re.search(pattern, 'MMM') 
          <_sre.SRE_Match object at 0x008EE090>
          >>> re.search(pattern, 'MMMM')
          >>>


          Example 7.6. The new way: from n to m
          >>> pattern = '^M{0,3}$'      
          >>> re.search(pattern, 'M')   
          <_sre.SRE_Match object at 0x008EEB48>
          >>> re.search(pattern, 'MM')  
          <_sre.SRE_Match object at 0x008EE090>
          >>> re.search(pattern, 'MMM') 
          <_sre.SRE_Match object at 0x008EEDA8>
          >>> re.search(pattern, 'MMMM')
          >>>
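
Examples 7.5 and 7.6 accept exactly the same strings. As a quick Python 3 check (the list of candidate strings is chosen here for illustration), both patterns can be run side by side:

```python
import re

old_pattern = re.compile('^M?M?M?$')   # three separately optional M's
new_pattern = re.compile('^M{0,3}$')   # between zero and three M's

candidates = ['', 'M', 'MM', 'MMM', 'MMMM']
old_results = [bool(old_pattern.search(c)) for c in candidates]
new_results = [bool(new_pattern.search(c)) for c in candidates]

print(old_results)
print(new_results)
```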


          The regular expression for the ones place follows the same pattern, so I will skip the details and show the end result.

          >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
          So what does this regular expression look like in the alternative {n,m} syntax? The next example shows the new syntax.

          Example 7.8. Validating Roman numerals with {n,m}
          >>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
          >>> re.search(pattern, 'MDLV')            
          <_sre.SRE_Match object at 0x008EEB48>
          >>> re.search(pattern, 'MMDCLXVI')        
          <_sre.SRE_Match object at 0x008EEB48>
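
The pattern from Example 7.8 can be wrapped in a small validator. This is a Python 3 sketch, and is_roman_numeral is a hypothetical helper name, not something defined in the source. Note that the pattern as written also accepts the empty string, since every part of it is optional.

```python
import re

# Same pattern as Example 7.8, compiled once for reuse.
ROMAN = re.compile('^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman_numeral(text):
    """Return True if text matches the Roman numeral pattern."""
    return ROMAN.search(text) is not None

for candidate in ['MDLV', 'MMDCLXVI', 'MCMC', 'IIII', '']:
    print(repr(candidate), is_roman_numeral(candidate))
```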


          Example 7.9. Regular expressions with inline comments
          >>> pattern = """
              ^                   # beginning of string
              M{0,3}              # thousands - 0 to 3 M's
              (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                                  #            or 500-800 (D, followed by 0 to 3 C's)
              (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                                  #        or 50-80 (L, followed by 0 to 3 X's)
              (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                                  #        or 5-8 (V, followed by 0 to 3 I's)
              $                   # end of string
              """
          >>> re.search(pattern, 'M', re.VERBOSE)               
          <_sre.SRE_Match object at 0x008EEB48>
          >>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)       
          <_sre.SRE_Match object at 0x008EEB48>
          >>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE) 
          <_sre.SRE_Match object at 0x008EEB48>
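          >>> re.search(pattern, 'M')

          The most important thing to remember when using verbose regular expressions is that you must pass an extra argument: re.VERBOSE, a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored) and several comments (all of which are ignored). Once you ignore the whitespace and the comments, it is exactly the same regular expression as in the previous section, but far more readable.

          The final call, re.search(pattern, 'M') without the flag, does not match. Why? Because without the re.VERBOSE flag, re.search treats the pattern as a compact regular expression. Python cannot auto-detect whether a regular expression is verbose or compact. It assumes every regular expression is compact unless you explicitly mark it as verbose.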
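
The effect of the flag is easy to reproduce. The sketch below (Python 3) uses a shortened pattern that covers only the thousands place, an assumption made to keep the example small. Without re.VERBOSE, the whitespace and the comment text become literal parts of the pattern, so the search fails.

```python
import re

pattern = """
    ^            # beginning of string
    M{0,3}       # thousands: 0 to 3 M's
    $            # end of string
    """

# With the flag, whitespace and comments in the pattern are ignored.
verbose_match = re.search(pattern, 'MM', re.VERBOSE)

# Without the flag, the newline, spaces and comment text must match literally.
compact_match = re.search(pattern, 'MM')

print(verbose_match is not None, compact_match is not None)
```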

           

          Example 7.16. Parsing phone numbers (final version)
          >>> phonePattern = re.compile(r'''
                          # don't match beginning of string, number can start anywhere
              (\d{3})     # area code is 3 digits (e.g. '800')
              \D*         # optional separator is any number of non-digits
              (\d{3})     # trunk is 3 digits (e.g. '555')
              \D*         # optional separator
              (\d{4})     # rest of number is 4 digits (e.g. '1212')
              \D*         # optional separator
              (\d*)       # extension is optional and can be any number of digits
              $           # end of string
              ''', re.VERBOSE)
          >>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()       
          ('800', '555', '1212', '1234')
          >>> phonePattern.search('800-555-1212').groups()
          ('800', '555', '1212', '')
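
Calling groups() directly on the result of search() raises AttributeError when nothing matches, because search() returns None. One way to guard against that is sketched below in Python 3, using the same pattern as Example 7.16. The parse_phone helper is a hypothetical name introduced here.

```python
import re

phone_pattern = re.compile(r'''
    (\d{3})     # area code is 3 digits
    \D*         # optional separator
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits
    \D*         # optional separator
    (\d*)       # extension is optional
    $           # end of string
    ''', re.VERBOSE)

def parse_phone(text):
    """Return the four captured parts as a tuple, or None if no match."""
    match = phone_pattern.search(text)
    return match.groups() if match else None

print(parse_phone('work 1-(800) 555.1212 #1234'))
print(parse_phone('800-555-1212'))
print(parse_phone('no number here'))
```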

           


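          You should now be familiar with the following techniques:

          ^ matches the beginning of a string.
          $ matches the end of a string.
          \b matches a word boundary.
          \d matches any numeric digit.
          \D matches any non-numeric character.
          x? matches an optional x character (in other words, it matches x zero or one times).
          x* matches x zero or more times.
          x+ matches x one or more times.
          x{n,m} matches an x character at least n times, but not more than m times.
          (a|b|c) matches either a or b or c.
          (x) in general is a remembered group. You can get the value it matched with the groups() method of the object returned by re.search.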
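
Each item in the list above can be checked with a one-line test. The following Python 3 sketch gathers them in a dictionary. The sample strings are illustrative, and re.fullmatch is used where the whole string should match.

```python
import re

checks = {
    '^':       re.search(r'^10', '100 BROAD') is not None,
    '$':       re.search(r'AD$', '100 BROAD') is not None,
    r'\b':     re.search(r'\bROAD', '100 BROAD') is None,   # ROAD is inside a word
    r'\d':     re.findall(r'\d', 'a1b2') == ['1', '2'],
    r'\D':     re.findall(r'\D', 'a1b2') == ['a', 'b'],
    'x?':      re.fullmatch(r'colou?r', 'color') is not None,
    'x*':      re.fullmatch(r'ab*', 'a') is not None,       # zero b's is fine
    'x+':      re.fullmatch(r'ab+', 'a') is None,           # at least one b needed
    'x{n,m}':  re.fullmatch(r'M{0,3}', 'MMMM') is None,     # four M's is too many
    '(a|b|c)': re.fullmatch(r'(CM|CD|D)', 'CD') is not None,
    '(x)':     re.search(r'(\d{3})-(\d{4})', '555-1212').groups() == ('555', '1212'),
}

for name, passed in checks.items():
    print(name, passed)
```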

          http://www.woodpecker.org.cn/diveintopython/regular_expressions/phone_numbers.html

          Regular expression pattern syntax

          Element       Meaning
          .             Matches any character except \n (if DOTALL, also matches \n)
          ^             Matches start of string (if MULTILINE, also matches after \n)
          $             Matches end of string (if MULTILINE, also matches before \n)
          *             Matches zero or more cases of the previous regular expression; greedy (match as many as possible)
          +             Matches one or more cases of the previous regular expression; greedy (match as many as possible)
          ?             Matches zero or one case of the previous regular expression; greedy (match one if possible)
          *?, +?, ??    Non-greedy versions of *, +, and ? (match as few as possible)
          {m,n}         Matches m to n cases of the previous regular expression (greedy)
          {m,n}?        Matches m to n cases of the previous regular expression (non-greedy)
          [...]         Matches any one of a set of characters contained within the brackets
          |             Matches expression either preceding it or following it
          (...)         Matches the regular expression within the parentheses and also indicates a group
          (?iLmsux)     Alternate way to set optional flags; no effect on match
          (?:...)       Like (...), but does not indicate a group
          (?P<id>...)   Like (...), but the group also gets the name id
          (?P=id)       Matches whatever was previously matched by group named id
          (?#...)       Content of parentheses is just a comment; no effect on match
          (?=...)       Lookahead assertion; matches if regular expression ... matches what comes next, but does not consume any part of the string
          (?!...)       Negative lookahead assertion; matches if regular expression ... does not match what comes next, and does not consume any part of the string
          (?<=...)      Lookbehind assertion; matches if there is a match for regular expression ... ending at the current position (... must match a fixed length)
          (?<!...)      Negative lookbehind assertion; matches if there is no match for regular expression ... ending at the current position (... must match a fixed length)
          \number       Matches whatever was previously matched by group numbered number (groups are automatically numbered from 1 up to 99)
          \A            Matches an empty string, but only at the start of the whole string
          \b            Matches an empty string, but only at the start or end of a word (a maximal sequence of alphanumeric characters; see also \w)
          \B            Matches an empty string, but not at the start or end of a word
          \d            Matches one digit, like the set [0-9]
          \D            Matches one non-digit, like the set [^0-9]
          \s            Matches a whitespace character, like the set [ \t\n\r\f\v]
          \S            Matches a non-white character, like the set [^ \t\n\r\f\v]
          \w            Matches one alphanumeric character; unless LOCALE or UNICODE is set, \w is like [a-zA-Z0-9_]
          \W            Matches one non-alphanumeric character, the reverse of \w
          \Z            Matches an empty string, but only at the end of the whole string
          \\            Matches one backslash character
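
A few of the less familiar elements in the table are easier to follow with concrete input. The Python 3 sketch below covers greedy versus non-greedy matching, non-capturing and named groups, and lookaround assertions. All sample strings are made up for illustration.

```python
import re

# Greedy .+ takes as much as possible; non-greedy .+? takes as little.
greedy = re.search(r'<.+>', '<a><b>').group()
lazy = re.search(r'<.+?>', '<a><b>').group()

# (?:...) groups without capturing, so only (c) appears in groups().
non_capturing = re.search(r'(?:ab)+(c)', 'ababc').groups()

# (?P<word>...) names a group; (?P=word) matches the same text again.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'this is is a test')

# (?=...) looks ahead without consuming: names followed by ".py".
py_names = re.findall(r'\w+(?=\.py\b)', 'a.py b.txt c.py')

# (?<!...) looks behind: whole numbers not preceded by a dollar sign.
plain_numbers = re.findall(r'(?<!\$)\b\d+', 'cost $30 for 12 items')

print(greedy, lazy, non_capturing, doubled.group('word'), py_names, plain_numbers)
```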

          posted on 2009-08-22 23:48 by Frank_Fang. Category: Python學習 (Learning Python)
