Improve performance, drastically reduce misfires for new string(char*) on 64-bit #6125

Merged
merged 12 commits into from Jul 20, 2016

Projects

None yet

5 participants

@jamesqo
Contributor
jamesqo commented Jul 5, 2016 edited

In the new string(char*) constructor, we don't know how long the character buffer pointed to by the char* is, so we have to roll our own function, wcslen, to find the first 0 in the buffer. Instead of naively looping through each char in the buffer, wcslen does something different. It first 4-byte aligns the pointer, and then in a loop ANDs the next two chars together. If one of them is equal to zero, the result will be 0 and it will exit the loop. This can be very fast, as it is only 1 instruction (and).

Unfortunately, this approach has some downsides. Just because (a & b) == 0 doesn't necessarily mean that either a or b has to be 0, it just means they don't share any bits (though it could be likely that one of them is zero). This means that chars like @ or (space), which only contain 1 bit, are very prone to misfires and we have to fall back to checking each char individually for 0. This is especially true of non-ASCII characters too (which have at least one of bits 1..9 set), since there are a lot more of them that don't share bits with ASCII chars. Also, it doesn't take advantage of the fact that 4 chars can fit into a word on x64, which means it's best to process 4 (instead of 2) chars at a time.

I've introduced a new implementation for wcslen on 64-bit, based on glibc's strlen implementation. It's a little different since the sizeof a char in .NET is 2 rather than 1, but the basic concept is the same. It reads a word from the string (1 long), and subtracts the magic constant 0x0001000100010001. This will subtract 1 from every char in the long, so if a zero is present it will turn to 0xffff. It then ANDs this with 0x8000800080008000 to check if the high bit is set in any of the chars. If it is, it checks each char individually for 0 then exits the loop. (see commentary below)

Note that this approach is also prone to misfires; any char > 0x8000 will still have its high bit set when subtracted 1 from, so this is going to be slower if the string contains any of those. However, I'm assuming those aren't going to be as common as ASCII chars, so I optimized for the common case.

I ran performance tests before posting this: old / new / source code

It looks like there is a slight slowdown for very small strings (< 8 in size), which is to be expected since we now 8-align rather than 4-align on 64-bit. However, for larger strings the story is different; even assuming the old algorithm experiences no misfires, it is still ~20% slower than the new one for large strings. I also tested the other extreme; if strings like ?@?@... are used which causes the old one to constantly misfire, the new one can be 2x as fast. Also, the new algorithm doesn't seem to slow down too badly; even when fed a constant stream of chars > 0x8000, it's only ~30% slower than the old one with no misfires.

I ran the CoreFX test suite on this, and will be adding more tests shortly to cover strings like ?@ or \u8000.

cc @jkotas @bbowyersmyth @mikedn

edit: I added yet another test case to the performance tests that read in the text of the .cs source file and tested wcslen on it. It looks like the new implementation is 2x-2.5x faster here.

jamesqo added some commits Jul 3, 2016
@jamesqo jamesqo Faster performance for string(char*) ctor, based on glibc strlen e9ea9b1
@jamesqo jamesqo Semantics fixes c1a6e52
@jamesqo jamesqo More fixes, eliminate end loop d4a120b
@jamesqo jamesqo Move comment on x86 loop b4f11a7
@jamesqo jamesqo Add perf TODO to the refactored method 80dde0d
@jamesqo jamesqo Add TODO to the x86 implementation 90963a0
@jamesqo jamesqo Remove additional branch 2f90ec1
@jamesqo jamesqo Fix the loop if we misfired
1d36cc5
@stephentoub
Member

based on glibc's strlen implementation

glibc is LGPL. We shouldn't copy such an implementation.

@jamesqo
Contributor
jamesqo commented Jul 5, 2016

We shouldn't copy such an implementation.

I'd dispute the fact that I'm 'copying' from glibc. This kind of trick (checking if a word has any zero bytes) is used in other places as well, e.g. FreeBSD, and is mentioned in Bit Twiddling Hacks. It's more of a general technique, than any glibc-specific thing.

@jamesqo jamesqo Remove mention of glibc
b826f15
@stephentoub
Member

I'd dispute the fact that I'm 'copying' from glibc

You stated multiple times that you based this on glibc. 😉 Did you look at the glibc implementation before / while doing this work?

@jamesqo
Contributor
jamesqo commented Jul 5, 2016

@stephentoub Yep, I guess so. I'll close this for now then, and see if I can come up with some other way to make it faster. Oh well 😢

@jamesqo jamesqo closed this Jul 5, 2016
@jamesqo jamesqo deleted the jamesqo:ctorcharptr-performance branch Jul 5, 2016
@jamesqo jamesqo restored the jamesqo:ctorcharptr-performance branch Jul 6, 2016
@jamesqo jamesqo Switch to trimmed-down algorithm from Bit Twiddling Hacks
5c24c6c
@jamesqo jamesqo reopened this Jul 6, 2016
@jamesqo
Contributor
jamesqo commented Jul 6, 2016 edited

@stephentoub I switched to a trimmed-down version of the other algorithm from Bit Twiddling Hacks, so now it doesn't borrow anything from glibc; I'd like to have this re-reviewed, if that's OK with you.

@stephentoub stephentoub and 1 other commented on an outdated diff Jul 7, 2016
src/mscorlib/src/System/String.cs
@@ -1695,29 +1695,104 @@ private String CtorCharCount(char c, int count)
private static unsafe int wcslen(char *ptr)
{
char *end = ptr;
+
+ // TODO: Test to see if switching to something like
+ // the 64-bit version would be beneficial on x86 as well
+
+ // First make sure our pointer is aligned on a word boundary
+ int alignment = IntPtr.Size - 1;
+
+ // If ptr is at an odd address, this loop will simply iterate all the way
@stephentoub
stephentoub Jul 7, 2016 Member

This comment doesn't look right. This isn't checking to see if ptr is at an odd (as in not even) address, but rather whether it's at an aligned address. And if it isn't aligned, it's not iterating "all the way" (assuming that means to the end), but rather only until it hits an aligned address, unless it it's very short and it gets to the end before it hits an aligned address.

@jamesqo
jamesqo Jul 7, 2016 Contributor

It is actually, chars are 2 bytes in size so end++ is really adding 2 (sizeof(char)) to the value of end.

@stephentoub
stephentoub Jul 8, 2016 edited Member

How does that contradict what I wrote? Let's say this is on 64-bit, the address passed in ends in 0x6, and the string is many chars long.. This loop will iterate through 1 char, at which point it's at an aligned address, and then exit, not at the end. (It's also true that this would never hit an aligned address if it started at an odd address, but that'd be a much less common case. It'd be very common to have even but not-word-aligned chars.)

@jamesqo
jamesqo Jul 8, 2016 Contributor

It's also true that this would never hit an aligned address if it started at an odd address

That's what I was referring to; perhaps I should clarify the comment.

@stephentoub
Member
stephentoub commented Jul 7, 2016 edited

The new version looks good to me, once the TODOs are removed (and the old commits are rebased away, as we don't want remnants of those in the repo). But @jkotas should review, too. Do you have updated benchmark results?

@jamesqo
Contributor
jamesqo commented Jul 8, 2016 edited

once the TODOs are removed

The reason I initially kept the existing implementation on 32-bit was out of fear that existing code would regress; e.g. code that was processing data that didn't have many of these misfires before would now have to do more bitwise operations for no gain, since the sizeof a word on 32-bit is 2 chars so we would still be only able to process 2 chars at a time.

For context, I made a program that reads its own text/the contents of its project.{lock.}json and prints out how many collisions there are that have (c1 & c2) == 0, although neither is 0. Here are the results I got (source code). As you can see the collisions don't make up a very high percentage of the total combination of chars, so I'll have to do a bit more investigating to see if all of those misfires outweigh the overhead of additional bitwise operations.

edit: I'm deciding to just keep it as-is for 32-bit for this PR, and remove the TODOs. Someone else can probably change it in another PR if it proves to be beneficial. From what I've seen so far, I don't think it's worth it.

@jamesqo jamesqo Remove x86-related TODOs
727d1fc
@jamesqo
Contributor
jamesqo commented Jul 8, 2016 edited

@stephentoub Here are the updated benchmarks for the new version: https://gist.github.com/jamesqo/3de048580abc570031651a238bf08ec9

@jamesqo jamesqo Clarify comment on 'odd addresses'
f508b82
@richlander
Member

FYI ... I updated the contributor guidelines to describe the "rules" in this situation. They didn't do that before. Sorry! They are intended to make these kind of PRs easier and more predictable, by clarifying the licensing requirements.

Updated guidlines for porting: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md#porting-files-from-other-projects

@jamesqo
Contributor
jamesqo commented Jul 8, 2016

@richlander Thanks for letting me know about the new guidelines. Since @stephentoub mentioned this earlier I've changed the implementation from one based on glibc to one based on a trick here, which says at the top all code is in the public domain (unless otherwise noted, which it isn't for that particular algorithm). So I don't think this PR should have any issues with licensing anymore, since it's no longer based on glibc.

@richlander
Member

@jamesqo yup. this looks good.

@stephentoub and I chatted about this PR. That's (obviously) what motivated me to update the guidelines and also what led to Stephen's feedback. It's all good.

Thanks for your patience and your collaboration/contribution.

@jkotas
Member
jkotas commented Jul 20, 2016

LGTM. Thanks again!

@jkotas jkotas merged commit 266108a into dotnet:master Jul 20, 2016

9 of 10 checks passed

Linux ARM Emulator Cross Debug Build Build finished.
Details
CentOS7.1 x64 Debug Build and Test Build finished.
Details
FreeBSD x64 Checked Build Build finished.
Details
Linux ARM Emulator Cross Release Build Build finished.
Details
OSX x64 Checked Build and Test Build finished.
Details
Ubuntu x64 Checked Build and Test Build finished.
Details
Windows_NT x64 Debug Build and Test Build finished.
Details
Windows_NT x64 Release Priority 1 Build and Test Build finished.
Details
Windows_NT x86 legacy_backend Checked Build and Test Build finished.
Details
Windows_NT x86 ryujit Checked Build and Test Build finished.
Details
@jamesqo jamesqo deleted the jamesqo:ctorcharptr-performance branch Jul 20, 2016
@jamesqo
Contributor
jamesqo commented Jul 20, 2016

@jkotas Thank you for merging this! I'm working on porting this (and the other PR) to CoreRT right now.

@martinwoodward martinwoodward added a commit to martinwoodward/coreclr that referenced this pull request Jul 23, 2016
@martinwoodward martinwoodward Fix #6125
Add reference for Bit Twiddling Hacks
05fa33b
@jkotas jkotas added a commit that referenced this pull request Jul 23, 2016
@martinwoodward @jkotas martinwoodward + jkotas Fix #6125 (#6432)
Add reference for Bit Twiddling Hacks
59f169e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment