How to count number of spaces just after the date information?

Question

I have unstructured data that look like this:

data <- c("24-March-2017      product 1              color 1",
"March-2017-24              product 2                 color 2",
"2017-24-March  product 3              color 3")

I would like to count the number of spaces between the date and the first character of the product column on each line. As shown in the sample data, the date format can vary. This information will be used to put the data into a structured format.

What is the best way to do this in R? I believe gsub can be used here, but I am not sure how to apply it so that it counts only the first run of spaces on each line.

It's worth noting that you could bypass counting the spaces and just replace any run of more than one space with a comma. That would let you split the string and coerce the result to a data.frame, etc. – zacdav, 10 hours ago
@zacdav, thanks for the comment. Unfortunately, the data I have could contain more than one consecutive space inside a field, so this won't work. I simplified the example above. – Curious, 10 hours ago
It might have been better not to simplify in that case, as there may be easier ways to solve that problem directly. – zacdav, 10 hours ago

sinQueso · Accepted Answer · 2017-04-03 23:02:08Z

One approach is to use regexpr, which returns information about the first match of a given regular expression. In your case, you are looking for the first run of whitespace. The following tells you (1) where in each string the first run of whitespace starts, and (2) in the match.length attribute, how many whitespace characters it contains:

regexpr("\\s+", data)
# [1] 14 14 14
# attr(,"match.length")
# [1]  6 14  2
# attr(,"useBytes")
# [1] TRUE

You can then use attr to extract the match.length attribute:

attr(regexpr("\\s+", data), "match.length")
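Since the question mentions putting the data into a structured format, here is a minimal sketch of how the match position and length could be used to separate the date from the rest of each line. The variable names (m, start, gap, date, rest) are illustrative and not part of the original answer, and the sketch assumes the date is the first non-space token on each line.

```r
data <- c("24-March-2017      product 1              color 1",
          "March-2017-24              product 2                 color 2",
          "2017-24-March  product 3              color 3")

# Locate the first run of whitespace in each string
m <- regexpr("\\s+", data)
start <- as.integer(m)          # position where the run begins
gap <- attr(m, "match.length")  # number of whitespace characters in the run

# Everything before the run is the date
date <- substring(data, 1, start - 1)

# Everything after the run is the remainder of the line
rest <- substring(data, start + gap)
```

The remainder in rest could then be processed the same way to pull out the next field.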

Rich Scriven · Answer 2 · 2017-04-03 23:13:25Z


You can sub out that section, then take the number of characters.

nchar(sub("\\S+(\\s+).*", "\\1", data))
# [1]  6 14  2

Or this one is kinda fun:

nchar(data) - nchar(sub("\\s+", "", data))
# [1]  6 14  2
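The second version works because sub() replaces only the first match, so the difference in length equals the size of the first whitespace run. A small sketch of the contrast with gsub(), which removes every match; the string x and the names first_gap and all_ws are made up for illustration:

```r
# Build a test string with known gaps: 2 spaces after the date, 3 before "color"
x <- paste0("2017-24-March", strrep(" ", 2), "product 3", strrep(" ", 3), "color 3")

# sub() removes only the first run of whitespace (the gap after the date)
first_gap <- nchar(x) - nchar(sub("\\s+", "", x))

# gsub() removes all whitespace, including the single spaces inside fields
all_ws <- nchar(x) - nchar(gsub("\\s+", "", x))

first_gap  # 2
all_ws     # 7
```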


agstudy · Answer 3 · 2017-04-03 23:03:41Z


The same approach as above, but using gregexpr and written in one line:

vapply(gregexpr(" +", data), function(x) attr(x, "match.length")[1], 0)
## [1]  6 14  2

I am assuming that the date always comes at the beginning.
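If that assumption might not hold strictly, for example if a line could start with stray spaces, one possible variation is to skip leading whitespace and the first token explicitly. This is only a sketch; the input x, including its second element with four leading spaces, is hypothetical and not from the question.

```r
# Second element has 4 leading spaces before the date (hypothetical input)
x <- c(paste0("24-March-2017", strrep(" ", 3), "product 1"),
       paste0(strrep(" ", 4), "2017-24-March", strrep(" ", 2), "product 3"))

# First run of spaces anywhere in the string: picks up the leading spaces
naive <- attr(regexpr(" +", x), "match.length")

# Skip optional leading whitespace and the date token, then capture the gap
gap <- nchar(sub("^\\s*\\S+(\\s+).*$", "\\1", x))

naive  # 3 4
gap    # 3 2
```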


Yes, the date always comes at the beginning. Worked perfectly. – Curious 10 hours ago


akrun · Answer 4 · 2017-04-04 03:19:02Z


Here is a stringi approach that gives the same output:

library(stringi)
m1 <- stri_locate(data, regex = "\\s+")
m1[, 2] - m1[, 1] + 1
# [1]  6 14  2
