Regular Expressions in R (Part 1): R Regular Expression Syntax in Detail

gearss 2018-05-09


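A regular expression is a way of matching patterns in text, and regular expressions share a common set of features across languages. Run help(regex) to read R's help page on regular expressions. The sections below cover metacharacters, quantifiers, sequences, character classes, and POSIX character classes in turn.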

1.Metacharacters

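The simplest regular expression matches a single literal character, such as a letter, a digit, or a punctuation mark. Some punctuation characters have a special meaning inside a pattern and are called metacharacters. To match a metacharacter literally, it must be escaped with a backslash, and because the backslash is itself an escape character in R strings, the pattern is written with a double backslash '\\'. The main metacharacters are: . $ * + ? | \ ^ [ ] { } ( )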

# A word containing a metacharacter
money = "$money"
# Wrong: an unescaped "$" is the end-of-string anchor, so nothing is removed
sub(pattern = "$", replacement = "", x = money)

## [1] "$money"

# Right: escape the "$" so that it is matched literally
sub(pattern = "\\$", replacement = "", x = money)

## [1] "money"

# Similar cases
sub("\\$", "", "$Peace-Love")

## [1] "Peace-Love"

sub("\\.", "", "Peace.Love")

## [1] "PeaceLove"

sub("\\+", "", "Peace+Love")

## [1] "PeaceLove"

sub("\\^", "", "Peace^Love")

## [1] "PeaceLove"

sub("\\|", "", "Peace|Love")

## [1] "PeaceLove"

sub("\\(", "", "Peace(Love)")

## [1] "PeaceLove)"

sub("\\)", "", "Peace(Love)")

## [1] "Peace(Love"

sub("\\[", "", "Peace[Love]")

## [1] "PeaceLove]"

sub("\\]", "", "Peace[Love]")

## [1] "Peace[Love"

sub("\\{", "", "Peace{Love}")

## [1] "PeaceLove}"

sub("\\}", "", "Peace{Love}")

## [1] "Peace{Love"

sub("\\\\", "", "Peace\\Love")

## [1] "PeaceLove"
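
When the pattern is a plain string rather than a real regular expression, escaping every metacharacter can be avoided altogether. The sketch below uses the fixed = TRUE argument of sub(), which tells R to treat the pattern literally; the variable names are made up for illustration.

```r
# fixed = TRUE treats the pattern as a literal string, not a regular expression
dollar_removed <- sub("$", "", "$money", fixed = TRUE)
dot_removed <- sub(".", "", "Peace.Love", fixed = TRUE)
print(dollar_removed)  # "money"
print(dot_removed)     # "PeaceLove"

# Without fixed = TRUE, an unescaped "." matches any character,
# so the first character is removed instead of the dot
first_removed <- sub(".", "", "Peace.Love")
print(first_removed)   # "eace.Love"
```

Escaping is still needed whenever the literal character has to be combined with other regular expression features in the same pattern.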

2.Sequences

Sequences are shorthand escapes that each match a particular kind of character. The main sequences are:

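\d matches a digit character

\D matches a non-digit character

\s matches a whitespace character

\S matches a non-whitespace character

\w matches a word character

\W matches a non-word character

\b matches a word boundary

\B matches a non-boundary

\h matches horizontal whitespace

\H matches anything that is not horizontal whitespace

\v matches vertical whitespace

\V matches anything that is not vertical whitespace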

2.1 Digits and non-digits

# Replace digits with '_'
sub("\\d", "_", "the dandelion war 2010")

## [1] "the dandelion war _010"

gsub("\\d", "_", "the dandelion war 2010")

## [1] "the dandelion war ____"

# Replace non-digit characters with '_'
sub("\\D", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\D", "_", "the dandelion war 2010")

## [1] "__________________2010"

2.2 Whitespace and non-whitespace

# Replace whitespace with '_'
sub("\\s", "_", "the dandelion war 2010")

## [1] "the_dandelion war 2010"

gsub("\\s", "_", "the dandelion war 2010")

## [1] "the_dandelion_war_2010"

# Replace non-whitespace characters with '_'
sub("\\S", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\S", "_", "the dandelion war 2010")

## [1] "___ _________ ___ ____"

2.3 Word and non-word characters

# Replace word characters (letters, digits, underscore) with '_'
sub("\\w", "_", "the dandelion war 2010")

## [1] "_he dandelion war 2010"

gsub("\\w", "_", "the dandelion war 2010")

## [1] "___ _________ ___ ____"

# Replace non-word characters with '_'
sub("\\W", "_", "the dandelion war 2010")

## [1] "the_dandelion war 2010"

gsub("\\W", "_", "the dandelion war 2010")

## [1] "the_dandelion_war_2010"

2.4 Word boundaries and non-boundaries

# Insert '_' at word boundaries
sub("\\b", "_", "the dandelion war 2010")

## [1] "_the dandelion war 2010"

gsub("\\b", "_", "the dandelion war 2010")

## [1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"

# Insert '_' at positions that are not word boundaries
sub("\\B", "_", "the dandelion war 2010")

## [1] "t_he dandelion war 2010"

gsub("\\B", "_", "the dandelion war 2010")

## [1] "t_he d_an_de_li_on w_ar 2_01_0"
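
Sequences are more often used for testing than for replacing. As a small sketch (the phrases vector and variable names are invented for illustration), wrapping a word in \b restricts the match to the whole word, and \d checks for the presence of any digit:

```r
phrases <- c("the dandelion war 2010", "warfare in 2010", "postwar era")

# "\\bwar\\b" matches "war" only when it stands alone as a word
whole_word <- grepl("\\bwar\\b", phrases)
print(whole_word)  # TRUE FALSE FALSE

# "\\d" is TRUE for any string containing at least one digit
any_digit <- grepl("\\d", phrases)
print(any_digit)   # TRUE TRUE FALSE
```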

3.Character Class

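A character class (or character set) is a list of characters enclosed in square brackets "[ ]"; it matches any single one of the listed characters. For example, [aA] matches either a lowercase a or an uppercase A, and [0123456789] matches any single digit. Take care to distinguish a character class from a literal character. Some common character classes are: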

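[aeiou] matches any lowercase vowel

[AEIOU] matches any uppercase vowel

[0123456789] matches any single digit

[0-9] matches any single digit (same as above)

[a-z] matches any ASCII lowercase letter

[A-Z] matches any ASCII uppercase letter

[a-zA-Z0-9] matches any ASCII letter or digit

[^aeiou] matches any character other than a lowercase vowel

[^0-9] matches any character other than a digit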

transport = c("car", "bike", "plane", "boat")
# Match strings containing 'e' or 'i'
grep(pattern = "[ei]", transport, value = TRUE)

## [1] "bike"  "plane"

numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
# Match strings containing '0' or '1'
grep(pattern = "[01]", numerics, value = TRUE)

## [1] "123"      "17-April" "R 3.0.1"

# Match strings containing any digit
grep(pattern = "[0-9]", numerics, value = TRUE)

## [1] "123"      "17-April" "R 3.0.1"

# Match strings containing at least one non-digit character
grep(pattern = "[^0-9]", numerics, value = TRUE)

## [1] "17-April" "I-II-III" "R 3.0.1"
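
Note that a negated class such as [^0-9] only requires one non-digit character somewhere in the string, which is why "17-April" and "R 3.0.1" are returned above. The sketch below contrasts that with selecting strings that contain no digits at all, here done with grep()'s invert argument; the variable names are illustrative only.

```r
numerics <- c("123", "17-April", "I-II-III", "R 3.0.1")

# At least one non-digit character somewhere in the string
has_non_digit <- grep("[^0-9]", numerics, value = TRUE)
print(has_non_digit)  # "17-April" "I-II-III" "R 3.0.1"

# No digit anywhere: match "[0-9]" and invert the selection
no_digits <- grep("[0-9]", numerics, value = TRUE, invert = TRUE)
print(no_digits)      # "I-II-III"

# Keep only ASCII letters and digits by deleting everything else
cleaned <- gsub("[^a-zA-Z0-9]", "", numerics)
print(cleaned)        # "123" "17April" "IIIIII" "R301"
```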

4.POSIX Character Classes

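POSIX character classes are named classes of the form [:name:] that are used inside a bracket expression, which is why they appear wrapped in double brackets "[[ ]]". The common POSIX character classes are: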

[[:lower:]] lowercase letters

[[:upper:]] uppercase letters

[[:alpha:]] all letters ([[:lower:]] and [[:upper:]])

[[:digit:]] digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

[[:alnum:]] letters and digits ([[:alpha:]] and [[:digit:]])

[[:blank:]] blank characters: space and tab

[[:cntrl:]] control characters

[[:punct:]] punctuation: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[[:space:]] whitespace characters: tab, newline, vertical tab, form feed, carriage return, and space

[[:xdigit:]] hexadecimal digits: 0-9 A B C D E F a b c d e f

[[:print:]] printable characters ([[:alnum:]], [[:punct:]] and space)

[[:graph:]] graphical characters ([[:alnum:]] and [[:punct:]])

# la vie (string)
la_vie = "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"
print(la_vie)

## [1] "La vie en #FFC0CB (rose);\nCes't la vie! \ttres jolie"

cat(la_vie)

## La vie en #FFC0CB (rose);
## Ces't la vie!    tres jolie

# Remove blank characters (spaces and tabs)
gsub(pattern = "[[:blank:]]", replacement = "", la_vie)

## [1] "Lavieen#FFC0CB(rose);\nCes'tlavie!tresjolie"

# Remove punctuation
gsub(pattern = "[[:punct:]]", replacement = "", la_vie)

## [1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"

# Remove hexadecimal digits (0-9, A-F, a-f)
gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)

## [1] "L vi n # (ros);\ns't l vi! \ttrs joli"

# Remove printable characters
gsub(pattern = "[[:print:]]", replacement = "", la_vie)

## [1] "\n"

# Remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", la_vie)

## [1] "La vie en #FFC0CB (rose);Ces't la vie! \ttres jolie"

# Remove graphical characters
gsub(pattern = "[[:graph:]]", replacement = "", la_vie)

## [1] "    \n   \t "

# Remove non-graphical characters
gsub(pattern = "[^[:graph:]]", replacement = "", la_vie)

## [1] "Lavieen#FFC0CB(rose);Ces'tlavie!tresjolie"
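
Because a POSIX class is just one item inside a bracket expression, several classes can share the same outer brackets, and the whole expression can be negated with ^. A minimal sketch, using made-up strings and variable names:

```r
messy <- "a b\tc,d!"

# Two POSIX classes in one bracket expression: whitespace OR punctuation
stripped <- gsub("[[:space:][:punct:]]", "", messy)
print(stripped)    # "abcd"

# Negated class: delete everything that is not a letter or a digit
label <- "Item #42 (new)"
alnum_only <- gsub("[^[:alnum:]]", "", label)
print(alnum_only)  # "Item42new"
```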

5.Quantifiers

Quantifiers specify how many times the preceding item must occur for the pattern to match. The common quantifiers are:

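'?' the preceding item is optional and is matched at most once

'*' the preceding item is matched zero or more times

'+' the preceding item is matched one or more times

'{n}' the preceding item is matched exactly n times

'{n,}' the preceding item is matched n or more times

'{n,m}' the preceding item is matched at least n times and at most m times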

people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", 
    "rasmus", "jacob", "youna", "flora", "adi")
# Match 'm' at most once; every name matches because zero occurrences are allowed
grep(pattern = "m?", people, value = TRUE)

##  [1] "rori"   "emilia" "matteo" "mehmet" "filipe" "anna"   "tyler" 
##  [8] "rasmus" "jacob"  "youna"  "flora"  "adi"

# Match 'm' exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)

## [1] "emilia" "matteo" "mehmet" "rasmus"

# Match 'm' zero or more times, followed by 't'
grep(pattern = "m*t", people, value = TRUE)

## [1] "matteo" "mehmet" "tyler"

# Match 't' zero or more times, followed by 'm'
grep(pattern = "t*m", people, value = TRUE)

## [1] "emilia" "matteo" "mehmet" "rasmus"

# Match 'm' one or more times
grep(pattern = "m+", people, value = TRUE)

## [1] "emilia" "matteo" "mehmet" "rasmus"

# Match 'm' one or more times, then any single character, then 't'
grep(pattern = "m+.t", people, value = TRUE)

## [1] "matteo" "mehmet"

# Match 't' twice in a row
grep(pattern = "t{2}", people, value = TRUE)

## [1] "matteo"
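
The range forms {n,} and {n,m} are not exercised above. The following sketch, which reuses the same names but with illustrative variable names of its own, combines them with a character class and the ^ and $ anchors, and shows that a quantifier takes as many repetitions as it is allowed:

```r
people <- c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler",
            "rasmus", "jacob", "youna", "flora", "adi")

# Names made of three or four lowercase letters in total
short_names <- grep("^[a-z]{3,4}$", people, value = TRUE)
print(short_names)   # "rori" "anna" "adi"

# 't' two or more times in a row
double_t <- grep("t{2,}", people, value = TRUE)
print(double_t)      # "matteo"

# t{1,2} takes both 't's when two are available
longest_run <- regmatches("matteo", regexpr("t{1,2}", "matteo"))
print(longest_run)   # "tt"
```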

6.Regular Expression Functions

The main functions are grep, grepl, regexpr, gregexpr, regexec, sub, gsub, and strsplit.

The stringr package also provides a set of regular expression functions, chiefly str_detect, str_extract, str_extract_all, str_match, str_match_all, str_locate, str_locate_all, str_replace, str_replace_all, str_split, and str_split_fixed.

There are also some complementary matching functions, such as regmatches, match, pmatch, and charmatch.

Finally, some helper functions accept regular expression patterns, such as apropos, browseEnv, glob2rx, help.search, and list.files.
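
As a rough orientation, the sketch below runs a few of the base R functions named above on the same input so that their return values can be compared side by side. The fruit vector and the variable names are invented for illustration, and only base R is used.

```r
fruit <- c("apple 12", "banana", "cherry 7")

# grepl: one logical per element
has_digit <- grepl("[0-9]", fruit)
print(has_digit)          # TRUE FALSE TRUE

# grep: indices of the matching elements
digit_index <- grep("[0-9]", fruit)
print(digit_index)        # 1 3

# regexpr: start position of the first match in each element (-1 if none)
first_pos <- regexpr("[0-9]+", fruit)
print(as.integer(first_pos))  # 7 -1 8

# regmatches: extract the matched text; elements without a match are dropped
numbers <- regmatches(fruit, first_pos)
print(numbers)            # "12" "7"

# strsplit: split on any character in the class
parts <- strsplit("a,b;c", "[,;]")[[1]]
print(parts)              # "a" "b" "c"
```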