regex - Truncate words within each element of a character vector in R
I have a data frame in which one column is a character vector, and every element in that vector is the full text of a document. I want to truncate the words in each element to a maximum word length of 5 characters.
For example:
    a <- c(1, 2)
    b <- c("words longer than five characters should be truncated",
           "words shorter than five characters should not be modified")
    df <- data.frame("file" = a, "text" = b, stringsAsFactors = FALSE)
    head(df)
      file                                                      text
    1    1     words longer than five characters should be truncated
    2    2 words shorter than five characters should not be modified
And I'm trying to get:
      file                                           text
    1    1     words longe than five chara shoul be trunc
    2    2 words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on "split vectors of words every n words (vectors in list)"):
    x <- unlist(strsplit(df$text, "\\s+"))
    y <- strtrim(x, 5)
    y
     [1] "words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "words" "short" "than"
    [12] "five"  "chara" "shoul" "not"   "be"    "modif"
But I don't know if that's the right direction, because I need the words to stay in the data frame, associated with the correct row, as shown above.
Is there a way to do this using gsub and a regex?
If you're looking to utilize gsub to perform this task:
    > df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl = TRUE)
    > df
    #   file                                           text
    # 1    1     words longe than five chara shoul be trunc
    # 2    2 words short than five chara shoul not be modif
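To unpack the pattern a little: the lookahead `(?=\\b\\pL{6,})` only succeeds at the start of a run of six or more letters, `.{5}` consumes the first five characters, `\\K` drops them from the reported match, and `\\pL*` matches the remaining letters, which gsub then replaces with an empty string. Below is a minimal sketch of the same idea generalised to any maximum length, next to a row-wise strsplit()/strtrim() version that keeps each document in its own element. The helper names `truncate_words` and `truncate_words_split` and the parameter `n` are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: truncate every run of letters longer than n characters,
# building the same regex as above for an arbitrary n.
truncate_words <- function(x, n = 5) {
  pattern <- sprintf("(?=\\b\\pL{%d,}).{%d}\\K\\pL*", n + 1, n)
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical alternative without a regex replacement: split each element on
# whitespace, trim every token, and paste the tokens back together.
# vapply() returns one string per input element, so rows stay aligned.
truncate_words_split <- function(x, n = 5) {
  vapply(
    strsplit(x, "\\s+"),
    function(words) paste(strtrim(words, n), collapse = " "),
    character(1)
  )
}

text <- c("words longer than five characters should be truncated",
          "words shorter than five characters should not be modified")

by_regex <- truncate_words(text)
by_split <- truncate_words_split(text)
print(by_regex)
print(identical(by_regex, by_split))
```

On this sample both versions should agree, but they are not equivalent in general: the split version trims every whitespace-delimited token (digits and attached punctuation included) and collapses repeated whitespace to a single space, while the regex version only touches runs of letters and leaves everything else in place.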