Improve performance, drastically reduce misfires for new string(char*) on 64-bit #6125

jamesqo · Jul 5, 2016

In the new string(char*) constructor, we don't know how long the character buffer pointed to by the char* is, so we have to roll our own function, wcslen, to find the first 0 in the buffer. Instead of naively looping through each char in the buffer, wcslen does something different. It first 4-byte aligns the pointer, and then in a loop ANDs the next two chars together. If one of them is equal to zero, the result will be 0 and it will exit the loop. This can be very fast, as it is only 1 instruction (and).

Unfortunately, this approach has some downsides. Just because (a & b) == 0 doesn't necessarily mean that either a or b has to be 0, it just means they don't share any bits (though it could be likely that one of them is zero). This means that chars like @ or (space), which only contain 1 bit, are very prone to misfires and we have to fall back to checking each char individually for 0. This is especially true of non-ASCII characters too (which have at least one of bits 1..9 set), since there are a lot more of them that don't share bits with ASCII chars. Also, it doesn't take advantage of the fact that 4 chars can fit into a word on x64, which means it's best to process 4 (instead of 2) chars at a time.

I've introduced a new implementation for wcslen on 64-bit, based on glibc's strlen implementation. It's a little different since the sizeof a char in .NET is 2 rather than 1, but the basic concept is the same. It reads a word from the string (1 long), and subtracts the magic constant 0x0001000100010001. This will subtract 1 from every char in the long, so if a zero is present it will turn to 0xffff. It then ANDs this with 0x8000800080008000 to check if the high bit is set in any of the chars. If it is, it checks each char individually for 0 then exits the loop. (see commentary below)

Note that this approach is also prone to misfires; any char > 0x8000 will still have its high bit set when subtracted 1 from, so this is going to be slower if the string contains any of those. However, I'm assuming those aren't going to be as common as ASCII chars, so I optimized for the common case.

I ran performance tests before posting this: old / new / source code

It looks like there is a slight slowdown for very small strings (< 8 in size), which is to be expected since we now 8-align rather than 4-align on 64-bit. However, for larger strings the story is different; even assuming the old algorithm experiences no misfires, it is still ~20% slower than the new one for large strings. I also tested the other extreme; if strings like ?@?@... are used which causes the old one to constantly misfire, the new one can be 2x as fast. Also, the new algorithm doesn't seem to slow down too badly; even when fed a constant stream of chars > 0x8000, it's only ~30% slower than the old one with no misfires.

I ran the CoreFX test suite on this, and will be adding more tests shortly to cover strings like ?@ or \u8000.

cc @jkotas @bbowyersmyth @mikedn

edit: I added yet another test case to the performance tests that read in the text of the .cs source file and tested wcslen on it. It looks like the new implementation is 2x-2.5x faster here.

stephentoub · Jul 5, 2016

based on glibc's strlen implementation

glibc is LGPL. We shouldn't copy such an implementation.

jamesqo · Jul 5, 2016

We shouldn't copy such an implementation.

I'd dispute the fact that I'm 'copying' from glibc. This kind of trick (checking if a word has any zero bytes) is used in other places as well, e.g. FreeBSD, and is mentioned in Bit Twiddling Hacks. It's more of a general technique, than any glibc-specific thing.

stephentoub · Jul 5, 2016

I'd dispute the fact that I'm 'copying' from glibc

You stated multiple times that you based this on glibc. 😉 Did you look at the glibc implementation before / while doing this work?

jamesqo · Jul 5, 2016

@stephentoub Yep, I guess so. I'll close this for now then, and see if I can come up with some other way to make it faster. Oh well 😢

jamesqo · Jul 6, 2016

@stephentoub I switched to a trimmed-down version of the other algorithm from Bit Twiddling Hacks, so now it doesn't borrow anything from glibc; I'd like to have this re-reviewed, if that's OK with you.

This comment doesn't look right. This isn't checking to see if ptr is at an odd (as in not even) address, but rather whether it's at an aligned address. And if it isn't aligned, it's not iterating "all the way" (assuming that means to the end), but rather only until it hits an aligned address, unless it it's very short and it gets to the end before it hits an aligned address.

It is actually, chars are 2 bytes in size so end++ is really adding 2 (sizeof(char)) to the value of end.

How does that contradict what I wrote? Let's say this is on 64-bit, the address passed in ends in 0x6, and the string is many chars long.. This loop will iterate through 1 char, at which point it's at an aligned address, and then exit, not at the end. (It's also true that this would never hit an aligned address if it started at an odd address, but that'd be a much less common case. It'd be very common to have even but not-word-aligned chars.)

It's also true that this would never hit an aligned address if it started at an odd address

That's what I was referring to; perhaps I should clarify the comment.

stephentoub · Jul 7, 2016

The new version looks good to me, once the TODOs are removed (and the old commits are rebased away, as we don't want remnants of those in the repo). But @jkotas should review, too. Do you have updated benchmark results?

jamesqo · Jul 8, 2016

once the TODOs are removed

The reason I initially kept the existing implementation on 32-bit was out of fear that existing code would regress; e.g. code that was processing data that didn't have many of these misfires before would now have to do more bitwise operations for no gain, since the sizeof a word on 32-bit is 2 chars so we would still be only able to process 2 chars at a time.

For context, I made a program that reads its own text/the contents of its project.{lock.}json and prints out how many collisions there are that have (c1 & c2) == 0, although neither is 0. Here are the results I got (source code). As you can see the collisions don't make up a very high percentage of the total combination of chars, so I'll have to do a bit more investigating to see if all of those misfires outweigh the overhead of additional bitwise operations.

edit: I'm deciding to just keep it as-is for 32-bit for this PR, and remove the TODOs. Someone else can probably change it in another PR if it proves to be beneficial. From what I've seen so far, I don't think it's worth it.

jamesqo · Jul 8, 2016

@stephentoub Here are the updated benchmarks for the new version: https://gist.github.com/jamesqo/3de048580abc570031651a238bf08ec9

richlander · Jul 8, 2016

FYI ... I updated the contributor guidelines to describe the "rules" in this situation. They didn't do that before. Sorry! They are intended to make these kind of PRs easier and more predictable, by clarifying the licensing requirements.

Updated guidlines for porting: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md#porting-files-from-other-projects

jamesqo · Jul 8, 2016

@richlander Thanks for letting me know about the new guidelines. Since @stephentoub mentioned this earlier I've changed the implementation from one based on glibc to one based on a trick here, which says at the top all code is in the public domain (unless otherwise noted, which it isn't for that particular algorithm). So I don't think this PR should have any issues with licensing anymore, since it's no longer based on glibc.

richlander · Jul 8, 2016

@jamesqo yup. this looks good.

@stephentoub and I chatted about this PR. That's (obviously) what motivated me to update the guidelines and also what led to Stephen's feedback. It's all good.

Thanks for your patience and your collaboration/contribution.

jkotas · Jul 20, 2016

LGTM. Thanks again!

jamesqo · Jul 20, 2016

@jkotas Thank you for merging this! I'm working on porting this (and the other PR) to CoreRT right now.

dnfclas added the cla-already-signed label Jul 5, 2016

jamesqo referenced this pull request in dotnet/corefx Jul 5, 2016
Merged
Add some more test coverage for the new string(char*) impl #9856

jamesqo closed this Jul 5, 2016

jamesqo deleted the jamesqo:ctorcharptr-performance branch Jul 5, 2016

jamesqo restored the jamesqo:ctorcharptr-performance branch Jul 6, 2016

jamesqo reopened this Jul 6, 2016

dnfclas added the cla-already-signed label Jul 6, 2016

jamesqo referenced this pull request Jul 13, 2016
Merged
Better performance for string.CompareOrdinalHelper #6119

jamesqo deleted the jamesqo:ctorcharptr-performance branch Jul 20, 2016

This was referenced Jul 20, 2016

Merged

Port wcslen optimizations from CoreCLR dotnet/corert#1552

Closed

Make 3rd party notice addition for #6125 #6390

martinwoodward added a commit to martinwoodward/coreclr that referenced this pull request Jul 23, 2016

martinwoodward Fix #6125
Add reference for Bit Twiddling Hacks
05fa33b

martinwoodward referenced this pull request Jul 23, 2016
Merged
Add 3PN reference for Bit Twiddling Hacks #6432

jkotas added a commit that referenced this pull request Jul 23, 2016

martinwoodward + jkotas Fix #6125 (#6432)
Add reference for Bit Twiddling Hacks
59f169e

This was referenced Jul 23, 2016

Closed

Improve performance of string.IndexOf for large strings on 64-bit #6434

Merged

Add more tests to string.IndexOf dotnet/corefx#10266

jamesqo	`Faster performance for string(char*) ctor, based on glibc strlen`		`e9ea9b1`
jamesqo	`Semantics fixes`		`c1a6e52`
jamesqo	`More fixes, eliminate end loop`		`d4a120b`
jamesqo	`Move comment on x86 loop`		`b4f11a7`
jamesqo	`Add perf TODO to the refactored method`		`80dde0d`
jamesqo	`Add TODO to the x86 implementation`		`90963a0`
jamesqo	`Remove additional branch`		`2f90ec1`
jamesqo	`Fix the loop if we misfired`	All checks have passed 10 successful checks CentOS7.1 x64 Debug Build and Test — Build finished. Details FreeBSD x64 Checked Build — Build finished. Details Linux ARM Emulator Cross Debug Build — Build finished. Details Linux ARM Emulator Cross Release Build — Build finished. Details OSX x64 Checked Build and Test — Build finished. Details Ubuntu x64 Checked Build and Test — Build finished. Details Windows_NT x64 Debug Build and Test — Build finished. Details Windows_NT x64 Release Priority 1 Build and Test — Build finished. Details Windows_NT x86 legacy_backend Checked Build and Test — Build finished. Details Windows_NT x86 ryujit Checked Build and Test — Build finished. Details	`1d36cc5`

@@ -1695,29 +1695,104 @@ private String CtorCharCount(char c, int count)
              private static unsafe int wcslen(char *ptr)
              {
                  char *end = ptr;
+    +
+    +            // TODO: Test to see if switching to something like
+    +            // the 64-bit version would be beneficial on x86 as well
+    +
+    +            // First make sure our pointer is aligned on a word boundary
+    +            int alignment = IntPtr.Size - 1;
+    +
+    +            // If ptr is at an odd address, this loop will simply iterate all the way

dotnet/coreclr

Improve performance, drastically reduce misfires for new string(char*) on 64-bit #6125

Reviewers

Assignees

Labels

Projects

Milestone

5 participants

Add some more test coverage for the new string(char*) impl #9856

Better performance for string.CompareOrdinalHelper #6119

9 of 10 checks passed

Port wcslen optimizations from CoreCLR dotnet/corert#1552

Make 3rd party notice addition for #6125 #6390

Add 3PN reference for Bit Twiddling Hacks #6432

Improve performance of string.IndexOf for large strings on 64-bit #6434

Add more tests to string.IndexOf dotnet/corefx#10266

dotnet/coreclr

Improve performance, drastically reduce misfires for new string(char*) on 64-bit #6125

Reviewers

Assignees

Labels

Projects

Milestone

5 participants

All checks have passed

Add some more test coverage for the new string(char*) impl #9856

All checks have passed

All checks have passed

All checks have passed

Some checks were not successful

Better performance for string.CompareOrdinalHelper #6119

9 of 10 checks passed

Port wcslen optimizations from CoreCLR dotnet/corert#1552

Make 3rd party notice addition for #6125 #6390

Add 3PN reference for Bit Twiddling Hacks #6432

Improve performance of string.IndexOf for large strings on 64-bit #6434

Add more tests to string.IndexOf dotnet/corefx#10266