A Practical Introduction to Python Programming
A Practical Introduction to Python Programming Heinold
Raw strings

A lot of the patterns use backslashes. However, backslashes in strings are used for escape characters, like the newline, \n. To get a backslash in a string, we need to write \\. This can quickly clutter up a regular expression. To avoid this, our patterns will be raw strings, where backslashes can appear as is and don’t do anything special. To mark a string as a raw string, preface it with an r, like below:

    s = r'This is a raw string. Backslashes do not do anything special.'

21.2 Syntax

Basic example

We start with a regular expression that mimics the replace method of strings. Here is an example of using replace to replace all occurrences of abc with *:

    'abcdef abcxyz'.replace('abc', '*')

    *def *xyz

Here is the regular expression code that does the same thing:

    re.sub(r'abc', '*', 'abcdef abcxyz')

Square brackets

We can use square brackets to indicate that we only want to match certain letters. Here is an example where we replace every a and d with asterisks:

    re.sub(r'[ad]', '*', 'abcdef')

    *bc*ef

Here is another example, where an asterisk replaces all occurrences of an a, b, or c that is followed by a 1, 2, or 3:

    re.sub(r'[abc][123]', '*', 'a1 + b2 + c5 + x2')

    * + * + c5 + x2

We can give ranges of values, for example [a-j] for the letters a through j. Here are some further examples of ranges:

    Range          Description
    [A-Z]          any capital letter
    [0-9]          any digit
    [A-Za-z0-9]    any letter or digit

A slightly shorter way to match any digit is \d, instead of [0-9].

Matching any character

Use a dot to match (almost) any character. Here is an example:

    re.sub(r'A.B', '*', 'A2B AxB AxxB A$B')

    * * AxxB *

The pattern matches an A followed by almost any single character followed by a B.

Exception: The one character not matched by the dot is the newline character. If you need that to be matched, too, put (?s) at the start of your pattern.
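The exception above is easy to check directly. The following is a small sketch, using a made-up sample string, that runs the same pattern with and without (?s) on text that contains a newline:

```python
import re

# Made-up sample: the first A and B are separated by a newline.
text = 'A\nB AxB'

# Without the flag, the dot refuses the newline, so only AxB is replaced.
without_flag = re.sub(r'A.B', '*', text)

# With (?s) at the start of the pattern, the dot accepts the newline as well.
with_flag = re.sub(r'(?s)A.B', '*', text)

print(repr(without_flag))
print(repr(with_flag))
```

Printing with repr makes the newline visible as \n, which helps when comparing the two results.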
Matching multiple copies of something

Here is an example where we match an A followed by one or more B’s:

    re.sub(r'AB+', '*', 'ABC ABBBBBBC AC')

    *C *C AC

We use the + character to indicate that we want to match one or more B’s here. There are similar things we can use to specify different numbers of B’s here. For instance, using * in place of + will match zero or more B’s. (This means that AC in the example above would be replaced by *C because A counts as an A followed by zero B’s.) Here is a table of what you can do:

    Code     Description
    +        match 1 or more occurrences
    *        match 0 or more occurrences
    ?        match 0 or 1 occurrence
    {m}      match exactly m occurrences
    {m,n}    match between m and n occurrences, inclusive

Here is an example that matches an A followed by three to six B’s:

    re.sub(r'AB{3,6}', '*', 'ABB ABBB ABBBB ABBBBBBBBB')

    'ABB * * *BBB'

Here, we do not match ABB because the A is only followed by two B’s. The next two pieces get matched, as the A is followed by three B’s in the second term, and four B’s in the third. In the last piece, there is an A followed by nine B’s. What gets matched is the A along with the first six B’s.

Note that the matching in the last piece above is greedy; that is, it takes as many B’s as it is allowed. It is allowed to take between three and six B’s, and it takes all six. To get the opposite behavior, to get it to take as few B’s as allowed, add a ?, like below:

    re.sub(r'AB{3,6}?', '*', 'ABB ABBB ABBBB ABBBBBBBBB')

    'ABB * *B *BBBBBB'

Now each match stops after three B’s, so the extra B in the third piece and the remaining six B’s in the last piece are left in the string.

The ? can go after any of the numeric specifiers, like +?, *?, ??, etc.

The | character

The | character acts as an “or.” Here is an example:

    re.sub(r'abc|xyz', '*', 'abcdefxyz123abc')

    '*def*123*'

In the above example, every time we encounter an abc or an xyz, we replace it with an asterisk.
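To see exactly which text the greedy and non-greedy quantifiers and the | character consume, it can help to list the matches instead of replacing them. The sketch below does this with re.findall, a function described in Section 21.5; the variable names are chosen only for illustration.

```python
import re

text = 'ABB ABBB ABBBB ABBBBBBBBB'

# Greedy: each match takes as many B's as allowed, up to six.
greedy = re.findall(r'AB{3,6}', text)

# Non-greedy: each match stops as soon as it has the minimum of three B's.
lazy = re.findall(r'AB{3,6}?', text)

# Alternation: every abc or xyz, in the order they appear.
either = re.findall(r'abc|xyz', 'abcdefxyz123abc')

print(greedy)  # ['ABBB', 'ABBBB', 'ABBBBBB']
print(lazy)    # ['ABBB', 'ABBB', 'ABBB']
print(either)  # ['abc', 'xyz', 'abc']
```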
Matching only at the start or end

Sometimes you don’t want to match every occurrence of something, just one at the very start or the very end of the string. To match only at the start of the string, begin the pattern with the ^ character. To match only at the end of the string, end the pattern with the $ character. Here are some examples:

    re.sub(r'^abc', '*', 'abcdefgabc')
    re.sub(r'abc$', '*', 'abcdefgabc')

    *defgabc
    abcdefg*

Note that ^abc matches nothing at all if the string does not begin with abc, and likewise abc$ matches nothing if the string does not end with abc.

Escaping special characters

We have seen that + and * have special meanings. What if we need to match a plus sign? To do so, use the backslash to escape it, like \+. Here is an example:

    re.sub(r'AB\+', '*', 'AB+C')

    *C

Also, in a pattern, \n represents a newline.

Just a note again about raw strings: if we didn’t use them for the patterns, every backslash would have to be doubled. For instance, r'AB\+' would have to be 'AB\\+'.

Backslash sequences

• \d matches any digit, and \D matches any non-digit. Here is an example:

    re.sub(r'\d', '*', '3 + 14 = 17')
    re.sub(r'\D', '*', '3 + 14 = 17')

    * + ** = **
    3***14***17

• \w matches any letter or number (and the underscore), and \W matches anything else. Here is an example:

    re.sub(r'\w', '*', 'This is a test.  Or is it?')
    re.sub(r'\W', '*', 'This is a test.  Or is it?')

    '**** ** * ****.  ** ** **?'
    'This*is*a*test***Or*is*it*'

  This is a good way to work with words.

• \s matches whitespace, and \S matches non-whitespace. Here is an example:

    re.sub(r'\s', '*', 'This is a test.  Or is it?')
    re.sub(r'\S', '*', 'This is a test.  Or is it?')

    'This*is*a*test.**Or*is*it?'
    '**** ** * *****  ** ** ***'

Preceding and following matches

Sometimes you want to match things if they are preceded or followed by something.

    Code     Description
    (?=)     matches only if followed by
    (?!)     matches only if not followed by
    (?<=)    matches only if preceded by
    (?<!)    matches only if not preceded by

Here is an example that matches the word the only if it is followed by cat:

    re.sub(r'the(?= cat)', '*', 'the dog and the cat')

    'the dog and * cat'

Here is an example that matches the word the only if it is preceded by a space:

    re.sub(r'(?<= )the', '*', 'Athens is the capital.')

    Athens is * capital.

The following example will match the word the only if it is neither preceded nor followed by letters, so you can use it to replace occurrences of the word the, but not occurrences of the within other words.

    re.sub(r'(?<!\w)[Tt]he(?!\w)', '*', 'The cat is on the lathe there.')

    * cat is on * lathe there.

Flags

There are a few flags that you can use to affect the behavior of a regular expression. We look at a few of them here.

• (?i): This is to ignore case. Here is an example:

    re.sub('(?i)ab', '*', 'ab AB')

    * *

• (?s): Recall the . character matches any character except a newline. This flag makes it match newline characters, too.

• (?x): Regular expressions can be long and complicated. This flag allows you to use a more verbose, multi-line format, where whitespace is ignored. You can also put comments in. Here is an example:

    pattern = r"""(?x)[AB]\d+  # Match A or B followed by some digits
                      [CD]\d+  # Match C or D followed by some digits
                      """
    print(re.sub(pattern, '*', 'A3C9 and C1B17'))

    * and C1B17

  The two parts of the pattern must appear one right after the other, so A3C9 is replaced, while C1B17 is left alone because its letters come in the opposite order.

21.3 Summary

There is a lot to remember. Here is a summary:

    Expression    Description
    []            any of the characters inside the brackets
    .             any character except the newline
    +             1 or more of the preceding
    *             0 or more of the preceding
    ?             0 or 1 of the preceding
    {m}           exactly m of the preceding
    {m,n}         between m and n (inclusive) of the preceding
    ?             following +, *, ?, {m}, and {m,n}: take as few as possible
    \             escape special characters
    |             “or”
    ^             (at start of pattern) match only at the start of the string
    $             (at end of pattern) match only at the end of the string
    \d \D         any digit (non-digit)
    \w \W         any letter or number (non-letter or -number)
    \s \S         any whitespace (non-whitespace)
    (?=)          only if followed by
    (?!)          only if not followed by
    (?<=)         only if preceded by
    (?<!)         only if not preceded by
    (?i)          flag to ignore case
    (?s)          flag to make the . match newlines, too
    (?x)          flag to enable verbose style

Here are some short examples:

    Expression         Description
    'abc'              the exact string abc
    '[ABC]'            an A, B, or C
    '[a-zA-Z][0-9]'    match a letter followed by a digit
    'a..'              a followed by any two characters (except newlines)
    'a+'               one or more a’s
    'a*'               any number of a’s, even none
    'a?'               zero or one a
    'a{2}'             exactly two a’s
    'a{2,4}'           two, three, or four a’s
    'a+?'              one or more a’s taking as few as possible
    'a\.'              a followed by a period
    'ab|zy'            an ab or a zy
    '^a'               an a at the start of the string
    'a$'               an a at the end of the string
    '\d'               every digit
    '\w'               every letter or number
    '\s'               every whitespace
    '\D'               everything except digits
    '\W'               everything except letters and numbers
    '\S'               everything except whitespace
    'a(?=b)'           every a followed by a b
    'a(?!b)'           every a not followed by a b
    '(?<=b)a'          every a preceded by a b
    '(?<!b)a'          every a not preceded by a b

Note

Note that in all of the examples in this chapter, we are dealing with non-overlapping patterns. For instance, if we look for the pattern 'aba' in the string 'abababa', we see there are several overlapping matches. All of our matching is done from the left and does not consider overlaps. For instance, we have the following:

    re.sub('aba', '*', 'abababa')

    '*b*'

21.4 Groups

Using parentheses around part of an expression creates a group that contains the text that matches a pattern. You can use this to do more sophisticated substitutions.
Here is an example that converts to lowercase every capital letter that is followed by a lowercase letter:

    def modify(match):
        letter = match.group()
        return letter.lower()

    re.sub(r'([A-Z])[a-z]', modify, 'PEACH Apple ApriCot')

    PEACH apple apricot

The modify function ends up getting called three times, once for each time a match occurs. The re.sub function automatically sends to the modify function a Match object, which we name match. This object contains information about the matching text. The object’s group method returns the matching text itself.

If instead of match.group, we use match.groups, then we can further break down the match according to the groups defined by the parentheses. Here is an example that matches a capital letter followed by a digit and converts each letter to lowercase and adds 10 to each number:

    def modify(match):
        letter, number = match.groups()
        return letter.lower() + str(int(number)+10)

    re.sub(r'([A-Z])(\d)', modify, 'A1 + B2 + C7')

    a11 + b12 + c17

The groups method returns the matching text as tuples. For instance, in the above program the tuples returned are shown below:

    First match:  ('A', '1')
    Second match: ('B', '2')
    Third match:  ('C', '7')

Note also that we can get at this information by passing arguments to match.group. For the first match, match.group(1) is 'A' and match.group(2) is '1'.

21.5 Other functions

• sub: We have seen many examples of sub. One thing we haven’t mentioned is that there is an optional argument count, which specifies how many matches (from the left) to make. Here is an example:

    re.sub(r'a', '*', 'ababababa', count=2)

    '*b*bababa'

• findall: The findall function returns a list of all the matches found.
  Here is an example:

    re.findall(r'[AB]\d', 'A3 + B2 + A9')

    ['A3', 'B2', 'A9']

  As another example, to find all the words in a string, you can do the following:

    re.findall(r'\w+', s)

  This is better than using s.split() because split does not handle punctuation, while the regular expression does.

• split: The split function is analogous to the string method split. The regular expression version allows us to split on something more general than the string method does. Here is an example that splits an algebraic expression at + or -.

    re.split(r'\+|\-', '3x+4y-12x^2+7')

    ['3x', '4y', '12x^2', '7']

• match and search: These are useful if you just want to know if a match occurs. The difference between these two functions is that match only checks to see if the beginning of the string matches the pattern, while search searches through the string until it finds a match. Both return None if they fail to find a match and a Match object if they do find a match. Here are examples:

    if re.match(r'ZZZ', 'abc ZZZ xyz'):
        print('Match found at beginning.')
    else:
        print('No match at beginning.')
    if re.search(r'ZZZ', 'abc ZZZ xyz'):
        print('Match found in string.')
    else:
        print('No match found.')

    No match at beginning.
    Match found in string.

  The Match object returned by these functions has group information in it. Say we have the following:

    a = re.search(r'([ABC])(\d)', '= A3+B2+C8')
    a.group()
    a.group(1)
    a.group(2)

    'A3'
    'A'
    '3'

  Remember that re.search will only report on the first match it finds in the string.

• finditer: This returns an iterator of Match objects that we can loop through, like below:

    for s in re.finditer(r'([AB])(\d)', 'A3+B4'):
        print(s.group(1))

    A
    B

  Note that this is a little more general than the findall function in that findall returns the matching strings, whereas finditer returns something like a list of Match objects, which give us access to group information.

• compile: If you are going to be reusing the same pattern, you can save a little time by first compiling the pattern, as shown below:

    pattern = re.compile(r'[AB]\d')
    pattern.sub('*', 'A3 + B4')
    pattern.sub('x', 'A8 + B9')

    * + *
    x + x

  When you compile an expression, for many of the methods you can specify optional starting and ending indices in the string. Here is an example:

    pattern = re.compile(r'[AB]\d')
    pattern.findall('A3+B4+C9+D8', 2, 6)

    ['B4']

21.6 Examples

Roman Numerals

Here we use regular expressions to convert Roman numerals into ordinary numbers.

    import re

    d = {'M':1000, 'CM':900, 'D':500, 'CD':400, 'C':100, 'XC':90,
         'L':50, 'XL':40, 'X':10, 'IX':9, 'V':5, 'IV':4, 'I':1}

    pattern = re.compile(r"""(?x)
                         (M{0,3})(CM)?
                         (CD)?(D)?(C{0,3})
                         (XC)?(XL)?(L)?(X{0,3})
                         (IX)?(IV)?(V)?(I{0,3})""")

    num = input('Enter Roman numeral: ').upper()
    m = pattern.match(num)
    total = 0  # running sum (named total to avoid hiding the built-in sum)
    for x in m.groups():
        if x is not None and x != '':
            if x in ['CM', 'CD', 'XC', 'XL', 'IX', 'IV']:
                total += d[x]
            elif x[0] in 'MDCLXVI':
                total += d[x[0]] * len(x)
    print(total)

    Enter Roman numeral: MCMXVII
    1917

The regular expression itself is fairly straightforward. It looks for up to three M’s, followed by zero or one CM’s, followed by zero or one CD’s, etc., and stores each of those in a group. The for loop then reads through those groups and uses a dictionary to add the appropriate values to a running sum.
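The same idea can be wrapped in a function so that it is easy to test. The following is only a sketch: the names roman_to_int, VALUES, and PATTERN are made up for illustration, and it swaps pattern.match for pattern.fullmatch, which is an assumption about what is wanted. Because every part of the pattern is optional, match succeeds on any input and simply ignores what it cannot use, whereas fullmatch requires the whole string to fit.

```python
import re

# Same value table and pattern as in the program above.
VALUES = {'M': 1000, 'CM': 900, 'D': 500, 'CD': 400, 'C': 100, 'XC': 90,
          'L': 50, 'XL': 40, 'X': 10, 'IX': 9, 'V': 5, 'IV': 4, 'I': 1}

PATTERN = re.compile(r"""(?x)
    (M{0,3})(CM)?
    (CD)?(D)?(C{0,3})
    (XC)?(XL)?(L)?(X{0,3})
    (IX)?(IV)?(V)?(I{0,3})""")


def roman_to_int(numeral):
    """Convert a Roman numeral string to an int (hypothetical helper)."""
    numeral = numeral.upper()
    # fullmatch insists that the entire string fits the pattern, so stray
    # characters such as the Z in 'XIZ' are rejected rather than ignored.
    m = PATTERN.fullmatch(numeral)
    if not numeral or m is None:
        raise ValueError('not a Roman numeral: ' + repr(numeral))
    total = 0
    for group in m.groups():
        if not group:            # skips both None and the empty string
            continue
        if group in VALUES:      # a single symbol or a pair such as CM
            total += VALUES[group]
        else:                    # a run of one repeated symbol, such as XXX
            total += VALUES[group[0]] * len(group)
    return total


print(roman_to_int('MCMXVII'))  # 1917
print(roman_to_int('xlii'))     # 42
```

The pattern still accepts some strings that are not conventional numerals (for example, it allows both CM and CD in the same input), so this check is looser than a full validator.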
Dates

Here we use a regular expression to take a date in a verbose format, like February 6, 2011, and convert it to an abbreviated format, mm/dd/yy (with no leading zeroes). Rather than depend on the user to enter the date in exactly the right way, we can use a regular expression to allow for variation and mistakes. For instance, this program will work whether the user spells out the whole month name or abbreviates it (with or without a period). Capitalization does not matter, and it also does not matter if they can even spell the month name correctly. They just have to get the first three letters correct. It also does not matter how much space they use and whether or not they use a comma after the day.

    import re

    d = {'jan':'1', 'feb':'2', 'mar':'3', 'apr':'4',
         'may':'5', 'jun':'6', 'jul':'7', 'aug':'8',
         'sep':'9', 'oct':'10', 'nov':'11', 'dec':'12'}

    date =
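As a rough sketch of the approach described above, and not the book's own listing, the conversion could be written as follows. The pattern, the function name abbreviate_date, and the choice to return None for unrecognized input are all assumptions made for illustration.

```python
import re

MONTHS = {'jan': '1', 'feb': '2', 'mar': '3', 'apr': '4',
          'may': '5', 'jun': '6', 'jul': '7', 'aug': '8',
          'sep': '9', 'oct': '10', 'nov': '11', 'dec': '12'}

# Assumed pattern: only the first three letters of the month are captured,
# so misspellings and abbreviations after that point do not matter.
DATE_PATTERN = re.compile(r"""(?xi)
    \s*([a-z]{3})[a-z]*\.?   # month: three letters, any more letters, optional period
    \s*(\d{1,2})\s*,?        # day, with an optional comma after it
    \s*(\d{4})\s*            # four-digit year
    """)


def abbreviate_date(text):
    """Return the date as m/d/yy, or None if it is not recognized."""
    m = DATE_PATTERN.fullmatch(text)
    if m is None:
        return None
    month, day, year = m.groups()
    month_number = MONTHS.get(month.lower())
    if month_number is None:
        return None
    # int(day) drops a leading zero; the last two digits of the year are kept.
    return month_number + '/' + str(int(day)) + '/' + year[-2:]


print(abbreviate_date('February 6, 2011'))   # 2/6/11
print(abbreviate_date('feb. 6 2011'))        # 2/6/11
print(abbreviate_date('FEBUARY   06,2011'))  # 2/6/11
```

The (?i) flag handles capitalization, the \s* pieces absorb any amount of spacing, and the optional \.? and ,? cover the period and the comma.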