I have unstructured data that look like this:

data <- c("24-March-2017      product 1              color 1",
"March-2017-24              product 2                 color 2",
"2017-24-March  product 3              color 3")

I would like to count the number of spaces between the date and the first character of the product column for each line. As the sample data shows, the date format can vary. This information will be used to put the data into a structured format.

What is the best way to do this in R? I believe gsub can be used here, but I am not sure how to apply it so that only the first run of spaces in each line (the one right after the date) is counted.

It's worth noting that you could bypass counting the spaces and just replace any run of more than one space with a comma. That would let you split the string and coerce the result to a data.frame, etc. – zacdav 10 hours ago
    
@zacdav: thanks for the comment. Unfortunately, the data I have can contain more than one consecutive space within a field, so this won't work. I simplified the example above. – Curious 10 hours ago
It might have been helpful not to simplify in that case, as there may be easier ways to solve that problem directly. – zacdav 10 hours ago
    
Agreed, that was my thought after posting the question! – Curious 9 hours ago

One approach is to use regexpr, which returns information about the first match of a given regular expression. In your case, you are looking for the first run of whitespace. So the following tells you (1) where in each string the first whitespace starts, and (2) in the match.length attribute, how many whitespace characters that run contains:

regexpr("\\s+", data)
# [1] 14 14 14
# attr(,"match.length")
# [1]  6 14  2
# attr(,"useBytes")
# [1] TRUE

You can then use attr to extract the match.length attribute:

attr(regexpr("\\s+", data), "match.length")
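If the goal is to split each line at that gap, the start position and the match length can be combined with substr. This is only a minimal sketch, assuming the date is always the first field and contains no whitespace. The sample vector is rebuilt with strrep so that the widths of the first gap (6, 14 and 2) are explicit; the widths of the second gap are arbitrary, and the names start, n_spaces, date_part and rest are illustrative only:

```r
# Rebuild sample data with explicit widths for the first gap (6, 14 and 2 spaces).
# The second gap (before "color") is an arbitrary width chosen for illustration.
data <- c(
  paste0("24-March-2017", strrep(" ", 6),  "product 1", strrep(" ", 14), "color 1"),
  paste0("March-2017-24", strrep(" ", 14), "product 2", strrep(" ", 17), "color 2"),
  paste0("2017-24-March", strrep(" ", 2),  "product 3", strrep(" ", 14), "color 3")
)

m        <- regexpr("\\s+", data)        # first run of whitespace in each string
start    <- as.integer(m)                # position where that run begins
n_spaces <- attr(m, "match.length")      # length of that run

# Everything before the gap is taken as the date; everything after it is the rest.
date_part <- substr(data, 1, start - 1)
rest      <- substr(data, start + n_spaces, nchar(data))
```

Here date_part holds the three date strings and rest starts at "product ..." in each line, so the remainder can be handled separately.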

You can use sub to keep only that first run of whitespace, then count its characters with nchar.

nchar(sub("\\S+(\\s+).*", "\\1", data))
# [1]  6 14  2

Or this one is kinda fun:

nchar(data) - nchar(sub("\\s+", "", data))
# [1]  6 14  2

The same idea as above, but using gregexpr and written in one line:

vapply(gregexpr(" +", data), function(x) attr(x, "match.length")[1], 0)
## [1]  6 14  2

I am assuming that the date always comes at the beginning.
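If that assumption might not hold because a line starts with stray whitespace, the first match would be the leading spaces rather than the gap after the date. One possible guard, shown here as a sketch on a hypothetical vector x (not the original data), is to strip leading whitespace with trimws before matching:

```r
# Hypothetical input: the first line has two leading spaces before the date.
x <- c(
  paste0(strrep(" ", 2), "24-March-2017", strrep(" ", 6), "product 1"),
  paste0("March-2017-24", strrep(" ", 2), "product 2")
)

# Without trimming, the leading spaces of the first line are counted instead.
untrimmed <- attr(regexpr("\\s+", x), "match.length")

# After removing leading whitespace, the first run is the gap after the date.
trimmed <- attr(regexpr("\\s+", trimws(x, which = "left")), "match.length")
```

For this x, untrimmed picks up the two leading spaces on the first line, while trimmed reports the gap after the date.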

    
Yes, the date always comes at the beginning. Worked perfectly. – Curious 10 hours ago

Here is a stringi approach that gives the same output:

library(stringi)
m1 <- stri_locate(data, regex = "\\s+")
m1[, 2] - m1[, 1] + 1
#[1]  6 14  2
