Improve performance, drastically reduce misfires for new string(char*) on 64-bit #6125
I'd dispute the fact that I'm 'copying' from glibc. This kind of trick (checking if a word has any zero bytes) is used in other places as well, e.g. FreeBSD, and is mentioned in Bit Twiddling Hacks. It's more of a general technique than any glibc-specific thing.

You stated multiple times that you based this on glibc.

@stephentoub Yep, I guess so. I'll close this for now then, and see if I can come up with some other way to make it faster. Oh well.

@stephentoub I switched to a trimmed-down version of the other algorithm from Bit Twiddling Hacks, so now it doesn't borrow anything from glibc; I'd like to have this re-reviewed, if that's OK with you.
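For reference, here is a minimal sketch (my illustration, not code from the PR) of the difference between the full Bit Twiddling Hacks zero-lane test and a trimmed-down variant, assuming four 16-bit chars packed into a `ulong`:

```csharp
internal static class ZeroCharTest
{
    private const ulong Ones  = 0x0001000100010001; // 1 in each 16-bit lane
    private const ulong Highs = 0x8000800080008000; // high bit of each lane

    // Full "haszero" test from Bit Twiddling Hacks, adapted from 8-bit
    // bytes to 16-bit chars: fires only when some lane of v is zero.
    public static bool HasZeroCharExact(ulong v)
        => ((v - Ones) & ~v & Highs) != 0;

    // Trimmed variant (drops the ~v term): can also fire for lanes whose
    // value is > 0x8000, so a hit must be confirmed char by char.
    public static bool MayHaveZeroChar(ulong v)
        => ((v - Ones) & Highs) != 0;
}
```

Dropping the `~v` term saves work per word at the cost of occasional false positives, which a per-char fallback then resolves.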
```diff
@@ -1695,29 +1695,104 @@ private String CtorCharCount(char c, int count)
 private static unsafe int wcslen(char *ptr)
 {
     char *end = ptr;
+
+    // TODO: Test to see if switching to something like
+    // the 64-bit version would be beneficial on x86 as well
+
+    // First make sure our pointer is aligned on a word boundary
+    int alignment = IntPtr.Size - 1;
+
+    // If ptr is at an odd address, this loop will simply iterate all the way
```
|
stephentoub
This comment doesn't look right. This isn't checking to see if ptr is at an odd (as in not even) address, but rather whether it's at an aligned address. And if it isn't aligned, it's not iterating "all the way" (assuming that means to the end), but rather only until it hits an aligned address, unless it's very short and it gets to the end before it hits an aligned address.
jamesqo
It is actually; chars are 2 bytes in size, so
stephentoub
How does that contradict what I wrote? Let's say this is on 64-bit, the address passed in ends in 0x6, and the string is many chars long. This loop will iterate through 1 char, at which point it's at an aligned address, and then exit, not at the end. (It's also true that this would never hit an aligned address if it started at an odd address, but that'd be a much less common case. It'd be very common to have even but not-word-aligned chars.)
jamesqo
That's what I was referring to; perhaps I should clarify the comment.

The new version looks good to me, once the TODOs are removed (and the old commits are rebased away, as we don't want remnants of those in the repo). But @jkotas should review, too. Do you have updated benchmark results?

The reason I initially kept the existing implementation on 32-bit was out of fear that existing code would regress; e.g. code that was processing data without many of these misfires before would now have to do more bitwise operations for no gain, since a word on 32-bit holds only 2 chars, so we would still only be able to process 2 chars at a time. For context, I made a program that reads its own text/the contents of its …

edit: I'm deciding to just keep it as-is for 32-bit for this PR, and remove the TODOs. Someone else can probably change it in another PR if it proves to be beneficial. From what I've seen so far, I don't think it's worth it.

@stephentoub Here are the updated benchmarks for the new version: https://gist.github.com/jamesqo/3de048580abc570031651a238bf08ec9

FYI ... I updated the contributor guidelines to describe the "rules" in this situation. They didn't do that before. Sorry! They are intended to make these kinds of PRs easier and more predictable, by clarifying the licensing requirements. Updated guidelines for porting: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md#porting-files-from-other-projects
@richlander Thanks for letting me know about the new guidelines. Since @stephentoub mentioned this earlier, I've changed the implementation from one based on glibc to one based on a trick here, which states at the top that all code is in the public domain (unless otherwise noted, which it isn't for that particular algorithm). So I don't think this PR should have any issues with licensing anymore, since it's no longer based on glibc.

@jamesqo yup, this looks good. @stephentoub and I chatted about this PR. That's (obviously) what motivated me to update the guidelines and also what led to Stephen's feedback. It's all good. Thanks for your patience and your collaboration/contribution.
@jkotas Thank you for merging this! I'm working on porting this (and the other PR) to CoreRT right now.
In the `new string(char*)` constructor, we don't know how long the character buffer pointed to by the `char*` is, so we have to roll our own function, `wcslen`, to find the first 0 in the buffer. Instead of naively looping through each char in the buffer, `wcslen` does something different. It first 4-byte aligns the pointer, and then in a loop ANDs the next two chars together. If one of them is equal to zero, the result will be 0 and it will exit the loop. This can be very fast, as it is only 1 instruction (`and`).

Unfortunately, this approach has some downsides. Just because `(a & b) == 0` doesn't necessarily mean that either a or b has to be 0; it just means they don't share any bits (though it may well be that one of them is zero). This means that chars like `@` or the space character, which only have 1 bit set, are very prone to misfires, and we have to fall back to checking each char individually for 0. This is especially true of non-ASCII characters (which have at least one of the high bits, 7..15, set), since there are a lot more of them that don't share bits with ASCII chars. Also, it doesn't take advantage of the fact that 4 chars can fit into a word on x64, which means it's best to process 4 (instead of 2) chars at a time.

I've introduced a new implementation for `wcslen` on 64-bit, based on glibc's strlen implementation (see commentary below; the final version was reworked to use a public-domain Bit Twiddling Hacks technique instead). It's a little different since the size of a `char` in .NET is 2 bytes rather than 1, but the basic concept is the same. It reads a word from the string (1 `long`), and subtracts the magic constant `0x0001000100010001`. This subtracts 1 from every char in the long, so if a zero is present it will turn into `0xffff`. It then ANDs this with `0x8000800080008000` to check if the high bit is set in any of the chars. If it is, it checks each char individually for 0 and then exits the loop.

Note that this approach is also prone to misfires; any char > 0x8000 will still have its high bit set after 1 is subtracted from it, so this is going to be slower if the string contains any of those. However, I'm assuming those aren't going to be as common as ASCII chars, so I optimized for the common case.
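To make the description above concrete, here is a minimal sketch of the 64-bit technique (illustrative only; the names and structure are mine, and the exact code is in the PR diff):

```csharp
internal static class StringSketch
{
    public static unsafe int WcslenSketch(char* ptr)
    {
        char* p = ptr;

        // Align to an 8-byte boundary with a scalar loop first. (As noted
        // in the review discussion, a pointer at an odd byte address never
        // reaches alignment, so this degenerates to a char-by-char scan.)
        while (((ulong)p & 7) != 0)
        {
            if (*p == '\0')
                return (int)(p - ptr);
            p++;
        }

        const ulong Ones  = 0x0001000100010001; // 1 in each 16-bit lane
        const ulong Highs = 0x8000800080008000; // high bit of each lane

        while (true)
        {
            // Read 4 chars at once; subtracting 1 from each lane turns a
            // zero lane into 0xffff, whose high bit survives the AND.
            ulong word = *(ulong*)p;
            if (((word - Ones) & Highs) != 0)
            {
                // Possible zero, or a misfire from a char > 0x8000:
                // confirm by checking the 4 chars individually.
                for (int i = 0; i < 4; i++)
                {
                    if (p[i] == '\0')
                        return (int)((p + i) - ptr);
                }
            }
            p += 4;
        }
    }
}
```

A misfire only costs a redundant check of 4 chars; correctness is preserved because a genuinely zero lane always has its high bit set after the subtraction, even when a borrow propagates in from a lower lane.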
I ran performance tests before posting this: old / new / source code
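As a hypothetical illustration of the worst-case test (not the linked benchmark itself): `'?'` is 0x3F and `'@'` is 0x40, so `'?' & '@' == 0` on every pair, making a `?@?@...` payload misfire constantly under the old AND-based check. A minimal harness might look like:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical benchmark sketch; names and iteration counts are arbitrary.
internal unsafe class WcslenBenchSketch
{
    private static void Main()
    {
        // "?@?@..." payload with an explicit terminator for the scan.
        string payload = string.Concat(Enumerable.Repeat("?@", 1 << 16)) + "\0";

        fixed (char* p = payload)
        {
            var sw = Stopwatch.StartNew();
            int total = 0;
            for (int i = 0; i < 1000; i++)
            {
                // new string(char*) calls wcslen internally to find the
                // first zero char (the '\0' appended above).
                total += new string(p).Length;
            }
            sw.Stop();
            Console.WriteLine($"{sw.Elapsed} (chars seen: {total})");
        }
    }
}
```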
It looks like there is a slight slowdown for very small strings (< 8 in size), which is to be expected since we now 8-align rather than 4-align on 64-bit. However, for larger strings the story is different; even assuming the old algorithm experiences no misfires, it is still ~20% slower than the new one for large strings. I also tested the other extreme: if strings like `?@?@...` are used, which cause the old one to constantly misfire, the new one can be 2x as fast. Also, the new algorithm doesn't seem to slow down too badly; even when fed a constant stream of chars > 0x8000, it's only ~30% slower than the old one with no misfires.

I ran the CoreFX test suite on this, and will be adding more tests shortly to cover strings like `?@` or `\u8000`.

cc @jkotas @bbowyersmyth @mikedn
edit: I added yet another test case to the performance tests that read in the text of the `.cs` source file and tested `wcslen` on it. It looks like the new implementation is 2x-2.5x faster here.