Match substring of two vectors and create a new vector combining them

Question

Consider two vectors.

a <- c(123, 234, 432, 223)
b <- c(234, 238, 342, 325, 326)

Now, I want to match last two digits of a to first two digits of b and create a new vector pasting first digit of a, the matched part and last digit of b. My expected output is :

[1] 1234 1238 2342 4325 4326 2234 2238

For simplicity purpose consider all the elements would always be of length 3.

I have tried :

sub_a <- substr(a, 2, 3)   #get last two digits of a
sub_b <- substr(b, 1, 2)   #get first two digits of b
common <- intersect(sub_a, sub_b)

common gives me the common elements in both a and b which are :

[1] "23" "34" "32"

and then I use match and paste0 together and I get incomplete output.

paste0(a[match(common, sub_a)], substr(b[match(common, sub_b)], 3, 3))
#[1] "1234" "2342" "4325"

as match matches only with the first occurrences.

How can I achieve my expected output?

Hmm... A not very efficient way can be d1 <- expand.grid(a, b); d2 <- expand.grid(sub_a, sub_b); i1 <- d2$Var1 == d2$Var2; do.call(paste0, d1[i1,]) and then remove duplicates from each string — Sotos, 6 hours ago
I don't think match will suit you there as it stops after first match. Probably more like sub_a %in% sub — Cath, 6 hours ago
@Sotos that works except that common part is repeated twice in the final output. Can we make it more efficient? I am already using so many extra variables (sub_a, sub_b, common), adding two more would be expensive? — Ronak Shah, 6 hours ago

Jaap · Answer 1 · 2017-11-22 14:37:34Z

up vote 10 down vote

A possible solution:

a <- setNames(a, substr(a,2,3))
b <- setNames(b, substr(b,1,2))

df <- merge(stack(a), stack(b), by = 'ind')
paste0(substr(df$values.x,1,1), df$values.y)

which gives:

[1] "1234" "1238" "2234" "2238" "4325" "4326" "2342"

A second alternative:

a1 <- setNames(a1, substr(a1,2,3))
b1 <- setNames(b1, substr(b1,1,2))

l <- lapply(names(a1), function(x) b1[x == names(b1)])
paste0(substr(rep(a1, lengths(l)),1,1), unlist(l))

which gives the same result and is considerably faster (see the benchmark).

edited 2 hours ago

answered 6 hours ago

Jaap

40.1k158491

Very nice @Jaap. Never occurred to me to that it's possible to use stack on a named vector; I'm curious: Is this documented somewhere? I couldn't find anything in ?stack. – Maurits Evers 5 hours ago

@MauritsEvers If it isn't in ?stack then I wouldn't know of another source which documents this :-\ – Jaap 5 hours ago

I see, thanks anyway for the clarification. – Maurits Evers 4 hours ago

add a comment |

Sotos · Answer 2 · 2017-11-22 13:03:54Z

Another way could be to use expand.grid, so picking up at your sub_a and sub_b,

d1 <- expand.grid(a, b, stringsAsFactors = FALSE)
d2 <- expand.grid(sub_a, sub_b, stringsAsFactors = FALSE)
i1 <- d2$Var1 == d2$Var2
d1 <- d1[i1,] 
d1$Var1 <- substr(d1$Var1, 1, 1)

do.call(paste0, d1)
#[1] "1234" "2234" "1238" "2238" "2342" "4325" "4326"

Tensibai · Answer 3 · 2017-11-22 10:26:19Z

Probably a little complex but works:

unlist( sapply( a, function(x) {
  regex <- paste0( substr(x, 2, 3), '(\\d)')
  z <- sub(regex, paste0(x, "\\1"), b)
  z[!b %in% z] 
} ))

which give: [1] "1234" "1238" "2342" "4325" "4326" "2234" "2238"

The main idea is to create a regex for each entry in a, apply this regex to b and replace the values with the current a value and append only the last digit captured (the (\\d) part of the regex, then filter the resulting vector to get back only the modified values.

Out of curiosity, I did a small benchmark (adding sub_a and sub_b creation into Sotos and Heikki answers so everyone start on the same initial vectors a of 400 observations and b of 500 observations):

Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
      Jaap(a, b) 341.0224 342.6853 345.2182 344.3482 347.3161 350.2840     3
     Tensi(a, b) 415.9175 416.2672 421.9148 416.6168 424.9134 433.2100     3
    Heikki(a, b) 126.9859 139.6727 149.3252 152.3594 160.4948 168.6302     3
     Sotos(a, b) 151.1264 164.9869 172.0310 178.8474 182.4833 186.1191     3
 MattWBase(a, b) 286.9651 290.8923 293.3795 294.8195 296.5867 298.3538     3

Loops are not bad in themselves as long as you don't grow a vector within them — Tensibai, 6 hours ago
(here the vector is extended, but not on each iteration, so it's quickly skipping no matches) — Tensibai, 6 hours ago

Heikki · Answer 4 · 2017-11-22 09:58:32Z

Here is an approach where the list is gone through:

result <- c()
for (ai in a) {
  sub_ai <- substr(ai,2,3)
  if (sub_ai %in% sub_a) {
    b_match <- (sub_b == sub_ai)
    result <- c(result,paste0(ai,substr(b[b_match],3,4)))
  }
}
> result
[1] "1234" "1238" "2342" "4325" "4326" "2234" "2238"

Then you may want to unique the results.

zx8754 · Answer 5 · 2017-11-22 10:20:34Z

Using dplyr::inner_join on the middle pieces:

library(dplyr)

a <- c(123, 234, 432, 223)
b <- c(234, 238, 342, 325, 326)

a1 <- data.frame(a)
b1 <- data.frame(b)

a1 <- a1 %>% mutate(first_a = substr(a, 1, 1), last_a = substr(a, 2, 3))
b1 <- b1 %>% mutate(first_b = substr(b, 1, 2), last_b = substr(b, 3, 3))

c1 <- inner_join(a1, b1, by = c("last_a" = "first_b"))

results <- paste0(c1$a, c1$last_b)

Using base::merge:

a1 <- data.frame(a)
b1 <- data.frame(b)

a1$first_a = substr(a1$a, 1, 1)
a1$last_a = substr(a1$a, 2, 3)
b1$first_b = substr(b1$b, 1, 2)
b1$last_b = substr(b1$b, 3, 3)

c1 <- merge(a1, b1, by.x = "last_a", by.y = "first_b")

results <- paste0(c1$a, c1$last_b)

zx8754 · Answer 6 · 2017-11-22 13:51:54Z

How about a bit of maths*:

unlist(sapply(a, function(i)
  i * 10 + (b %% 10)[i %% 100 == b %/% 10]))

*Assumption: all numbers are 3 digits, but this of course can be adjusted within sapply.

Check output, the output will be in different order than other answers, and output is numeric, not character.

identical(sort(as.numeric(docendo(a, b))), sort(zx8754(a, b)))
# [1] TRUE
identical(sort(as.numeric(jaap(a, b))), sort(zx8754(a, b)))
# [1] TRUE

Edit: forloop version seems 3x faster.

zx8754 <- function(a, b) {
  unlist(sapply(a, function(i) i * 10 + (b %% 10)[i %% 100 == b %/% 10]))
}

zx8754_forloop <- function(a, b) {
  res <- integer()
  for(i in a)  res <- c(res, i * 10 + (b %% 10)[i %% 100 == b %/% 10])
  res
}

microbenchmark::microbenchmark(
  zx8754 = zx8754(a, b),
  zx8754_forloop = zx8754_forloop(a, b)
)

# Unit: microseconds
#           expr    min      lq     mean median     uq      max neval
# zx8754         16.535 17.3910 55.05348 17.676 18.246 3672.223   100
# zx8754_forloop  4.562  5.4165 46.74887  5.987  6.272 4080.469   100

#check output
identical(zx8754(a, b), zx8754_forloop(a, b))
# [1] TRUE

docendo discimus · Answer 7 · 2017-11-22 15:06:53Z

Here's another option in base R:

foo <- function(a, b) {
  split_a <- split(a,  substr(a, 2, 3))
  split_b <- split(substr(b, 3, 3), substr(b, 1, 2))
  idx <- intersect(names(split_a), names(split_b))
  stopifnot(length(idx) > 0)
  unlist(Map(function(x,y) outer(x, y, paste0), split_a[idx], split_b[idx]), 
                      use.names = FALSE)
}

foo(a, b)
# [1] "1234" "2234" "1238" "2238" "4325" "4326" "2342"

Update:

I used the function definitions from https://stackoverflow.com/a/47435067/3521006 to make another benchmark with all answers and larger data. The input data and results I got are:

set.seed(123)
a <- sample(100:999, 1e4, TRUE)
b <- sample(100:999, 1e3, TRUE)

library(microbenchmark)
library(dplyr)
res <- microbenchmark(docendo(a, b), 
               Jaap1(a, b), 
               Jaap2(a, b), 
               Sotos(a, b), 
               Tensi(a, b), 
               Heikki(a, b), 
               Matt_base(a, b),
               Matt_dplyr(a, b), 
               zx8754(a, b),
               times = 10, unit = "relative")

Unit: relative
             expr        min         lq       mean     median         uq        max neval
    docendo(a, b)   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    10
      Jaap1(a, b)  14.002977  13.724432  13.347755  13.433175  12.788948  13.301811    10
      Jaap2(a, b)   4.364993   4.936248   5.201879   5.125639   5.060425   7.520069    10
      Sotos(a, b)  22.215750  23.850280  25.743047  25.177676  28.274083  28.288089    10
      Tensi(a, b) 231.230360 234.830000 246.587532 242.345573 260.784725 273.184452    10
     Heikki(a, b) 135.615708 136.900943 144.775845 146.314048 150.546406 156.873954    10
  Matt_base(a, b)  13.274675  12.995334  13.402940  12.723798  12.432802  18.881093    10
 Matt_dplyr(a, b)   1.299223   1.314568   1.420479   1.345850   1.380378   1.807671    10
     zx8754(a, b)   9.607226  10.175381  10.486580  10.136439  10.096818  13.410858    10

Interestingly, when I reproduce the comparison of Frank's answer and mine from the benchmark, I get opposite results:

Frank <- function(a, b) {
  aDT <- as.data.table(tstrsplit(a, ""))
  bDT <- setnames(as.data.table(tstrsplit(b, "")), c("V2", "V3", "V4"))
  merge(aDT, bDT, allow.cartesian = TRUE)[, paste0(V1, V2, V3, V4)]
}

set.seed(1)  # same input size as in the cw benchmark answer
a <- sample(100:999, 8e3, TRUE)
b <- sample(100:999, 1e4, TRUE)

microbenchmark(Frank(a, b), docendo(a, b), unit = "relative", times = 10)

Unit: relative
          expr     min       lq     mean   median       uq      max neval
   Frank(a, b) 1.37435 1.390417 1.500996 1.470548 1.644079 1.616446    10
 docendo(a, b) 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000    10
all.equal(sort(docendo(a, b)), sort(Frank(a, b)))
#[1] TRUE

Nice one and the fastest! See benchmark in the separate cw-answer. — Jaap, 3 hours ago
@Sotos, yeah, I guess you need to use stringsAsFactors=FALSE in there. — docendo discimus, 3 hours ago

4 revs · Answer 8 · 2017-11-22 16:36:09Z

A benchmark (adding sub_a and sub_b creation into Sotos and Heikki answers so everyone start on the same initial vectors a of 800 observations and b of 1000 observations).

Running the benchmark with:

library(dplyr)
library(data.table)
library(microbenchmark)

a <- sample(100:999, 8e3, TRUE)
b <- sample(100:999, 1e4, TRUE)

microbenchmark(Jaap1(a,b), Jaap2(a,b), Tensi(a,b), Heikki(a,b), Sotos(a,b),
               Matt_base(a,b), Matt_dplyr(a,b), Docendo(a,b),
               zx8754(a,b), zx8754for(a,b), Frank(a,b),
               times = 5, unit = 'relative')

gives:

Unit: relative
             expr        min         lq       mean     median         uq        max neval     cld
      Jaap1(a, b)  34.703161  34.555558  32.159832  33.589581  33.379529  26.404050     5    d   
      Jaap2(a, b)   4.632965   4.757991   4.526898   4.703065   4.855419   3.879726     5 a      
      Tensi(a, b) 280.223144 282.671462 262.658752 273.782450 273.316759 217.585132     5       g
     Heikki(a, b)  80.749127  80.611073  79.538881  82.291796  85.075395  71.342298     5      f 
      Sotos(a, b)  26.127738  26.050240  24.299225  25.205191  25.262141  20.166425     5   c    
  Matt_base(a, b)  33.349114  33.438351  31.153113  32.459036  31.932295  26.181568     5    d   
 Matt_dplyr(a, b)   1.546961   1.540109   1.481780   1.527166   1.575283   1.280775     5 a      
    Docendo(a, b)   1.366530   1.399819   1.328550   1.363766   1.348267   1.204104     5 a      
     zx8754(a, b)  15.751249  16.257225  14.970995  15.755592  15.601824  12.322439     5  b     
  zx8754for(a, b)  46.335615  48.606064  45.353230  48.128807  47.942506  37.997405     5     e  
      Frank(a, b)   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000     5 a

The used functions:

Jaap1 <- function(a,b) {
  a <- setNames(a, substr(a,2,3))
  b <- setNames(b, substr(b,1,2))

  df <- merge(stack(a), stack(b), by = 'ind')
  paste0(substr(df$values.x,1,1), df$values.y)
}

Jaap2 <- function(a,b) {
  a <- setNames(a, substr(a,2,3))
  b <- setNames(b, substr(b,1,2))

  l <- lapply(names(a), function(x) b[x == names(b)])
  paste0(substr(rep(a, lengths(l)),1,1), unlist(l))
}

Tensi <- function(a,b) {
  unlist(sapply(a,function(x) {regex <- paste0(substr(x,2,3),'(\\d)'); z <- sub(regex,paste0(x,"\\1"),b); z[!b %in% z] } ))
}

Heikki <- function(a,b) {
  sub_a <- substr(a, 2, 3)
  sub_b <- substr(b, 1, 2)
  result <- c()
  for (ai in a) {
    sub_ai <- substr(ai,2,3)
    if (sub_ai %in% sub_a) {
      b_match <- (sub_b == sub_ai)
      result <- c(result,paste0(ai,substr(b[b_match],3,4)))
    }
  }
  result
}

Sotos <- function(a,b) {
  sub_a <- substr(a, 2, 3)
  sub_b <- substr(b, 1, 2)
  d1 <- expand.grid(a, b, stringsAsFactors = FALSE)
  d2 <- expand.grid(sub_a, sub_b, stringsAsFactors = FALSE)
  i1 <- d2$Var1 == d2$Var2
  d1 <- d1[i1,] 
  d1$Var1 <- substr(d1$Var1, 1, 1)

  do.call(paste0, d1)
}

Matt_base <- function(a,b) {
  a1 <- data.frame(a)
  b1 <- data.frame(b)

  a1$first_a = substr(a1$a, 1, 1)
  a1$last_a = substr(a1$a, 2, 3)
  b1$first_b = substr(b1$b, 1, 2)
  b1$last_b = substr(b1$b, 3, 3)

  c1 <- merge(a1, b1, by.x = "last_a", by.y = "first_b")

  results <- paste0(c1$a, c1$last_b)
}

Matt_dplyr <- function(a,b) {
  a1 <- data.frame(a)
  b1 <- data.frame(b)

  a1 <- a1 %>% mutate(first_a = substr(a, 1, 1), last_a = substr(a, 2, 3))
  b1 <- b1 %>% mutate(first_b = substr(b, 1, 2), last_b = substr(b, 3, 3))

  c1 <- inner_join(a1, b1, by = c("last_a" = "first_b"))

  results <- paste0(c1$a, c1$last_b)
}

Docendo <- function(a, b) {
  split_a <- split(a,  substr(a, 2, 3))
  split_b <- split(substr(b, 3, 3), substr(b, 1, 2))
  idx <- intersect(names(split_a), names(split_b))
  stopifnot(length(idx) > 0)
  unlist(Map(function(x,y) outer(x, y, paste0), split_a[idx], split_b[idx]), 
         use.names = FALSE)
}

zx8754 <- function(a, b) {
  unlist(sapply(a, function(i) i * 10 + (b %% 10)[i %% 100 == b %/% 10]))
}

zx8754for <- function(a, b) {
  res <- integer()
  for(i in a)  res <- c(res, i * 10 + (b %% 10)[i %% 100 == b %/% 10])
  res
}

Frank <- function(a, b) {
  aDT <- as.data.table(tstrsplit(a, ""))
  bDT <- setnames(as.data.table(tstrsplit(b, "")), c("V2", "V3", "V4"))
  merge(aDT, bDT, allow.cartesian = TRUE)[, paste0(V1, V2, V3, V4)]
}

asked	today
viewed	491 times
active	today

Learn, Share, Build

Match substring of two vectors and create a new vector combining them

8 Answers 8

Your Answer

Not the answer you're looking for? Browse other questions tagged r vector or ask your own question.

Visit Chat

Hot Network Questions

Learn, Share, Build

8 Answers 8

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged r vector or ask your own question.

Related