A regular expression is a pattern for matching text, and regular expressions share a common set of features across programming languages. In R, help(regex) opens the documentation on regular expressions. The sections below cover metacharacters, sequences, character classes, POSIX character classes, and quantifiers in turn.

1. Metacharacters

The simplest regular expression matches a single literal character such as a letter, a digit, or a punctuation mark. Some punctuation characters have a special meaning inside a pattern and are called metacharacters. To match one of them literally, escape it with a backslash. Because the backslash is itself an escape character inside R strings, the escape is written as "\\" in R code. The main metacharacters are: . $ * + ? | \ ^ [ ] { } ( )

# a word containing a metacharacter
money = "$money"
# wrong: an unescaped "$" anchors the end of the string, so nothing visible changes
sub(pattern = "$", replacement = "", x = money)
## [1] "$money"
# right: escape the "$"
sub(pattern = "\\$", replacement = "", x = money)
## [1] "money"
# similar cases
sub("\\$", "", "$Peace-Love")
## [1] "Peace-Love"
sub("\\.", "", "Peace.Love")
## [1] "PeaceLove"
sub("\\+", "", "Peace+Love")
## [1] "PeaceLove"
sub("\\^", "", "Peace^Love")
## [1] "PeaceLove"
sub("\\|", "", "Peace|Love")
## [1] "PeaceLove"
sub("\\(", "", "Peace(Love)")
## [1] "PeaceLove)"
sub("\\)", "", "Peace(Love)")
## [1] "Peace(Love"
sub("\\[", "", "Peace[Love]")
## [1] "PeaceLove]"
sub("\\]", "", "Peace[Love]")
## [1] "Peace[Love"
sub("\\{", "", "Peace{Love}")
## [1] "PeaceLove}"
sub("\\}", "", "Peace{Love}")
## [1] "Peace{Love"
sub("\\\\", "", "Peace\\Love")
## [1] "PeaceLove"

2. Sequences

Sequences are shorthand for commonly used sets of characters. The main ones are:

\d   matches a digit character
\D   matches a non-digit character
\s   matches a whitespace character
\S   matches a non-whitespace character
\w   matches a word character
\W   matches a non-word character
\b   matches a word boundary
\B   matches a non-word boundary
\h   matches horizontal whitespace
\H   matches non-horizontal whitespace
\v   matches vertical whitespace
\V   matches non-vertical whitespace

In the examples below, sub() replaces only the first match and gsub() replaces every match.

2.1 Digits and non-digits

# replace digits with '_'
sub("\\d", "_", "the dandelion war 2010")
## [1] "the dandelion war _010"
gsub("\\d", "_", "the dandelion war 2010")
## [1] "the dandelion war ____"
# replace non-digit characters with '_'
sub("\\D", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\D", "_", "the dandelion war 2010")
## [1] "__________________2010"

2.2 Whitespace and non-whitespace

# replace whitespace with '_'
sub("\\s", "_", "the dandelion war 2010")
## [1] "the_dandelion war 2010"
gsub("\\s", "_", "the dandelion war 2010")
## [1] "the_dandelion_war_2010"
# replace non-whitespace characters with '_'
sub("\\S", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\S", "_", "the dandelion war 2010")
## [1] "___ _________ ___ ____"

2.3 Word and non-word characters

# replace word characters with '_'
sub("\\w", "_", "the dandelion war 2010")
## [1] "_he dandelion war 2010"
gsub("\\w", "_", "the dandelion war 2010")
## [1] "___ _________ ___ ____"
# replace non-word characters with '_'
sub("\\W", "_", "the dandelion war 2010")
## [1] "the_dandelion war 2010"
gsub("\\W", "_", "the dandelion war 2010")
## [1] "the_dandelion_war_2010"

2.4 Word boundaries and non-boundaries

# insert '_' at word boundaries
sub("\\b", "_", "the dandelion war 2010")
## [1] "_the dandelion war 2010"
gsub("\\b", "_", "the dandelion war 2010")
## [1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"
# insert '_' at non-boundaries
sub("\\B", "_", "the dandelion war 2010")
## [1] "t_he dandelion war 2010"
gsub("\\B", "_", "the dandelion war 2010")
## [1] "t_he d_an_de_li_on w_ar 2_01_0"

3. Character Classes

A character class (or character set) is a list of characters enclosed in square brackets [ ]. It matches any single character from the list. For example, [aA] matches a lowercase a or an uppercase A, and [0123456789] matches any single digit. A character class should not be confused with a literal sequence of characters: [ab] matches a or b, not the string ab. Some common character classes are:

[aeiou]        matches any lowercase vowel
[AEIOU]        matches any uppercase vowel
[0123456789]   matches any single digit
[0-9]          matches any single digit (same as above)
[a-z]          matches any ASCII lowercase letter
[A-Z]          matches any ASCII uppercase letter
[a-zA-Z0-9]    matches any character from the three classes above
[^aeiou]       matches any character other than a lowercase vowel
[^0-9]         matches any character other than a digit

transport = c("car", "bike", "plane", "boat")
# elements containing 'e' or 'i'
grep(pattern = "[ei]", transport, value = TRUE)
## [1] "bike" "plane"
numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
# elements containing '0' or '1'
grep(pattern = "[01]", numerics, value = TRUE)
## [1] "123" "17-April" "R 3.0.1"
# elements containing any digit
grep(pattern = "[0-9]", numerics, value = TRUE)
## [1] "123" "17-April" "R 3.0.1"
# elements containing at least one non-digit character
grep(pattern = "[^0-9]", numerics, value = TRUE)
## [1] "17-April" "I-II-III" "R 3.0.1"

4. POSIX Character Classes

A POSIX character class is written as [:name:] and is used inside a bracket expression, which gives the double-bracket form [[:name:]]. The common POSIX character classes are:

[[:lower:]]    lowercase letters
[[:upper:]]    uppercase letters
[[:alpha:]]    all letters ([[:lower:]] and [[:upper:]])
[[:digit:]]    digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]]    letters and digits ([[:alpha:]] and [[:digit:]])
[[:blank:]]    blank characters: space and tab
[[:cntrl:]]    control characters
[[:punct:]]    punctuation characters, such as ! " # % & ' ( ) * + , - . / : ;
[[:space:]]    whitespace characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]]   hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]]    printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]]    graphical characters ([[:alnum:]] and [[:punct:]])

# la vie (string)
la_vie = "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
print(la_vie)
## [1] "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
cat(la_vie)
## La vie en #FFC0CB (rose);
## Ces't la vie!   tres jolie
# remove blanks (spaces and tabs)
gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
## [1] "Lavieen#FFC0CB(rose);\nCes'tlavie!tresjolie"
# remove punctuation
gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
## [1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
# remove hexadecimal digits (this also removes the letters a-f and A-F)
gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)
## [1] "L vi n # (ros);\ns't l vi! \ttrs joli"
# remove printable characters
gsub(pattern = "[[:print:]]", replacement = "", la_vie)
## [1] "\n"
# remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", la_vie)
## [1] "La vie en #FFC0CB (rose);Ces't la vie! \ttres jolie"
# remove graphical characters
gsub(pattern = "[[:graph:]]", replacement = "", la_vie)
## [1] "    \n   \t "
# remove non-graphical characters
gsub(pattern = "[^[:graph:]]", replacement = "", la_vie)
## [1] "Lavieen#FFC0CB(rose);Ces'tlavie!tresjolie"

In the two [[:print:]] outputs recorded here, the tab is treated as a printable character. Whether the tab counts as printable can depend on the platform and locale.

5. Quantifiers

Quantifiers specify how many times the preceding item must occur for the pattern to match. The common quantifiers are:

'?'       the preceding item is optional and is matched at most once
'*'       the preceding item is matched zero or more times
'+'       the preceding item is matched one or more times
'{n}'     the preceding item is matched exactly n times
'{n,}'    the preceding item is matched n or more times
'{n,m}'   the preceding item is matched at least n times and at most m times

people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus", "jacob", "youna", "flora", "adi")
# 'm' at most once; zero occurrences also count, so every element matches
grep(pattern = "m?", people, value = TRUE)
## [1] "rori" "emilia" "matteo" "mehmet" "filipe" "anna" "tyler"
## [8] "rasmus" "jacob" "youna" "flora" "adi"
# 'm' exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)
## [1] "emilia" "matteo" "mehmet" "rasmus"
# 'm' zero or more times, followed by 't'
grep(pattern = "m*t", people, value = TRUE)
## [1] "matteo" "mehmet" "tyler"
# 't' zero or more times, followed by 'm'
grep(pattern = "t*m", people, value = TRUE)
## [1] "emilia" "matteo" "mehmet" "rasmus"
# 'm' one or more times
grep(pattern = "m+", people, value = TRUE)
## [1] "emilia" "matteo" "mehmet" "rasmus"
# 'm' one or more times, then any single character, then 't'
grep(pattern = "m+.t", people, value = TRUE)
## [1] "matteo" "mehmet"
# 't' exactly twice in a row
grep(pattern = "t{2}", people, value = TRUE)
## [1] "matteo"

6. Regular Expression Functions

The main base R functions are grep, grepl, regexpr, gregexpr, regexec, sub, gsub, and strsplit.

The stringr package also provides a set of regular expression functions, mainly str_detect, str_extract, str_extract_all, str_match, str_match_all, str_locate, str_locate_all, str_replace, str_replace_all, str_split, and str_split_fixed.

There are also complementary matching functions such as regmatches, match, pmatch, and charmatch.

Finally, some accessory functions work with regular expression patterns, such as apropos, browseEnv, glob2rx, help.search, and list.files.
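The base R functions named in section 6 are listed without a demonstration, so here is a minimal sketch of how a few of them might be combined. The input vector x and the result names (has_digit, first_num, nums, words, parts) are illustrative assumptions and do not come from the original text; only base R functions are used.

```r
# Illustrative input (an assumption for this sketch), reusing the earlier example string
x <- c("the dandelion war 2010", "no digits here")

# grepl(): logical vector, TRUE where the pattern matches
has_digit <- grepl("[0-9]", x)

# regexpr(): start position and length of the first match in each element
# (-1 marks elements without a match)
first_num <- regexpr("[0-9]+", x)

# regmatches(): extract the text located by regexpr();
# elements without a match are dropped from the result
nums <- regmatches(x, first_num)

# gregexpr() + regmatches(): all matches per element, returned as a list
words <- regmatches(x[1], gregexpr("[a-z]+", x[1]))[[1]]

# strsplit(): split on one or more whitespace characters
parts <- strsplit(x[1], "\\s+")[[1]]

print(has_digit)
print(nums)
print(words)
print(parts)
```

grepl is convenient for filtering, while the regexpr or gregexpr plus regmatches pairing is the usual base R way to pull out the matched text itself rather than the whole element.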