regex - Truncate words within each element of a character vector in R

- August 15, 2014

I have a data frame in which one column is a character vector, and every element of that vector is the full text of a document. I want to truncate the words in each element so that the maximum word length is 5 characters.

For example:

    a <- c(1, 2)
    b <- c("words longer than five characters should be truncated",
           "words shorter than five characters should not be modified")
    df <- data.frame("file" = a, "text" = b, stringsAsFactors = FALSE)

    head(df)
      file                                                      text
    1    1     words longer than five characters should be truncated
    2    2 words shorter than five characters should not be modified

And this is what I'm trying to get:

      file                                           text
    1    1     words longe than five chara shoul be trunc
    2    2 words short than five chara shoul not be modif

I've tried using strsplit() and strtrim() to modify each word (based in part on the question "split vectors of words every n words (vectors in list)"):

    x <- unlist(strsplit(df$text, "\\s+"))
    y <- strtrim(x, 5)
    y
     [1] "words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "words" "short" "than"
    [12] "five"  "chara" "shoul" "not"   "be"    "modif"

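But I don't know if that's the right direction, because unlist() flattens everything into a single vector, and I ultimately need the words back in the data frame, associated with the correct row, as shown above.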

Is there a way to do this using gsub and a regex?

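If you're looking to use gsub to perform this task, a Perl-style regex works. The lookahead (?=\\b\\pL{6,}) matches only at the start of a word of six or more letters, .{5} consumes its first five characters, \\K drops them from the match, and \\pL* matches the remaining letters, which are replaced with the empty string: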

    > df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=TRUE)
    > df
    #   file                                           text
    # 1    1     words longe than five chara shoul be trunc
    # 2    2 words short than five chara shoul not be modif
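For comparison, the strsplit() and strtrim() attempt from the question can also be made to work row by row, simply by not calling unlist(). The following is only a sketch: truncate_words is a hypothetical helper name that does not appear in the original post, and it assumes words are separated by whitespace.

```r
# Hypothetical helper (not from the original post): truncate every
# whitespace-separated token in each string to at most `n` characters.
truncate_words <- function(text, n = 5) {
  vapply(
    strsplit(text, "\\s+"),  # a list with one vector of words per element
    function(words) paste(strtrim(words, n), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

df <- data.frame(
  file = c(1, 2),
  text = c("words longer than five characters should be truncated",
           "words shorter than five characters should not be modified"),
  stringsAsFactors = FALSE
)

# Each element is processed on its own, so the result lines up with the rows.
df$text <- truncate_words(df$text)
df$text
# [1] "words longe than five chara shoul be trunc"
# [2] "words short than five chara shoul not be modif"
```

The two approaches are not quite equivalent. This version trims every token, including ones that contain digits or punctuation, and it collapses runs of whitespace into single spaces, whereas the gsub pattern only shortens runs of letters and leaves everything else in the string untouched.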
