Issue25301
Created on 2015-10-02 14:44 by haypo, last changed 2015-10-05 11:49 by python-dev. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| utf8_decoder.patch | haypo, 2015-10-03 00:01 | |
| bench.py | haypo, 2015-10-04 08:21 | |
| Messages (6) | |||
|---|---|---|---|
| msg252117 - (view) | Author: STINNER Victor (haypo) | Date: 2015-10-02 14:44 | |
Issue #24870 optimized the ASCII decoder with error handlers: New changeset 3c430259873e by Victor Stinner in branch 'default': Issue #24870: Optimize the ASCII decoder for error handlers: surrogateescape, https://hg.python.org/cpython/rev/3c430259873e We should also optimize the UTF-8 decoder with error handlers. I will work on a patch in the next few days. |
|||
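For readers unfamiliar with the error handlers discussed in this issue, here is a short illustration of what each one does when the UTF-8 decoder meets an undecodable byte. The sample input and the helper name `decode_with` are made up for this example; only the standard `bytes.decode` API is used (`backslashreplace` for decoding needs Python 3.5 or later).

```python
# Illustrative input: ASCII text around one byte (0xFF) that is never valid in UTF-8.
data = b"abc\xffdef"


def decode_with(errors):
    """Decode the sample bytes as UTF-8 using the named error handler."""
    return data.decode("utf-8", errors)


# The strict handler (the default) refuses the input.
try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The other handlers recover in different ways.
results = {
    name: decode_with(name)
    for name in ("ignore", "replace", "surrogateescape", "backslashreplace")
}

for name, text in results.items():
    print(f"{name:17} {ascii(text)}")

# surrogateescape is reversible: encoding with the same handler restores the bytes.
roundtrip = results["surrogateescape"].encode("utf-8", "surrogateescape")
```

`ignore` drops the bad byte, `replace` substitutes U+FFFD, `surrogateescape` maps the byte to a lone surrogate (U+DCFF here), and `backslashreplace` inserts the escape text `\xff`.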
| msg252181 - (view) | Author: STINNER Victor (haypo) | Date: 2015-10-03 00:01 | |
Here is a first patch. It is written to preserve the best performance for valid UTF-8 encoded strings, while speeding up the decoding of strings that contain a few undecodable bytes. |
|||
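The patch itself is C code and is not reproduced here. As a rough conceptual sketch only, and on the assumption that the general idea is "decode valid runs at full speed and handle a few well-known error handlers inline", the following pure-Python function mimics that control flow. The function name `decode_utf8_sketch` is hypothetical, and the repeated slicing makes it far slower than the real decoder; it shows the logic, not the performance.

```python
def decode_utf8_sketch(data: bytes, errors: str = "strict") -> str:
    """Decode UTF-8, handling 'ignore', 'replace' and 'surrogateescape' inline.

    Conceptual sketch only: valid runs go through the strict decoder, and each
    undecodable range is handled directly instead of via a handler callback.
    """
    if errors not in ("strict", "ignore", "replace", "surrogateescape"):
        raise ValueError(f"handler not covered by this sketch: {errors!r}")

    parts = []
    pos = 0
    while pos < len(data):
        try:
            # Fast path: the rest of the input is valid UTF-8.
            parts.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            if errors == "strict":
                raise
            # exc.start / exc.end are relative to the slice passed to decode().
            start, end = pos + exc.start, pos + exc.end
            # Everything before the error is known to be valid.
            parts.append(data[pos:start].decode("utf-8"))
            if errors == "replace":
                parts.append("\ufffd")
            elif errors == "surrogateescape":
                parts.extend(chr(0xDC00 + byte) for byte in data[start:end])
            # 'ignore' appends nothing for the bad range.
            pos = end
    return "".join(parts)
```

For the three handlers it covers, the sketch is meant to return the same string as `data.decode("utf-8", errors)`, which is an easy way to sanity-check it.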
| msg252264 - (view) | Author: STINNER Victor (haypo) | Date: 2015-10-04 08:30 | |
Results of the microbenchmark on the UTF-8 decoder. As expected, performance on valid UTF-8 is unchanged, which was an important goal for me. Decoding with the error handlers optimized by the patch is *much* faster. backslashreplace is still slow, because I didn't optimize it.

Common platform:

- Python unicode implementation: PEP 393
- Timer: time.perf_counter
- Platform: Linux-4.1.5-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two
- CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
- Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
- Bits: int=32, long=64, long long=64, size_t=64, void*=64
- CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
- Timer precision: 55 ns

Platform of campaign before:

- SCM: hg revision=f51921883f50 tag=tip branch=default date="2015-10-04 01:19 -0400"
- Python version: 3.6.0a0 (default:f51921883f50, Oct 4 2015, 10:19:37) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
- Date: 2015-10-04 10:19:44

Platform of campaign after:

- SCM: hg revision=f51921883f50+ tag=tip branch=default date="2015-10-04 01:19 -0400"
- Python version: 3.6.0a0 (default:f51921883f50+, Oct 4 2015, 10:14:05) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
- Date: 2015-10-04 10:18:55

| valid UTF-8 (strict) | before | after |
|---|---|---|
| 100 x 10**1 bytes | 297 ns (*) | 297 ns |
| 100 x 10**3 bytes | 7.4 us (*) | 7.44 us |
| 100 x 10**2 bytes | 929 ns (*) | 924 ns |
| 100 x 10**4 bytes | 80.4 us (*) | 80.4 us |
| Total | 89.1 us (*) | 89 us |

| ignore | before | after |
|---|---|---|
| 100 x 10**1 bytes | 6.68 us (*) | 743 ns (-89%) |
| 100 x 10**3 bytes | 561 us (*) | 42.6 us (-92%) |
| 100 x 10**2 bytes | 56.8 us (*) | 4.55 us (-92%) |
| 100 x 10**4 bytes | 6.02 ms (*) | 425 us (-93%) |
| Total | 6.65 ms (*) | 473 us (-93%) |

| replace | before | after |
|---|---|---|
| 100 x 10**1 bytes | 7.61 us (*) | 890 ns (-88%) |
| 100 x 10**3 bytes | 639 us (*) | 50.3 us (-92%) |
| 100 x 10**2 bytes | 64.8 us (*) | 5.37 us (-92%) |
| 100 x 10**4 bytes | 7.09 ms (*) | 505 us (-93%) |
| Total | 7.81 ms (*) | 561 us (-93%) |

| surrogateescape | before | after |
|---|---|---|
| 100 x 10**1 bytes | 7.96 us (*) | 855 ns (-89%) |
| 100 x 10**3 bytes | 674 us (*) | 50.2 us (-93%) |
| 100 x 10**2 bytes | 68.8 us (*) | 5.35 us (-92%) |
| 100 x 10**4 bytes | 7.38 ms (*) | 504 us (-93%) |
| Total | 8.13 ms (*) | 560 us (-93%) |

| backslashreplace | before | after |
|---|---|---|
| 100 x 10**1 bytes | 7.66 us (*) | 7.89 us |
| 100 x 10**3 bytes | 633 us (*) | 633 us |
| 100 x 10**2 bytes | 64.1 us (*) | 64.6 us |
| 100 x 10**4 bytes | 6.9 ms (*) | 6.93 ms |
| Total | 7.61 ms (*) | 7.64 ms |

| Summary | before | after |
|---|---|---|
| valid UTF-8 (strict) | 89.1 us (*) | 89 us |
| ignore | 6.65 ms (*) | 473 us (-93%) |
| replace | 7.81 ms (*) | 561 us (-93%) |
| surrogateescape | 8.13 ms (*) | 560 us (-93%) |
| backslashreplace | 7.61 ms (*) | 7.64 ms |
| Total | 30.3 ms (*) | 9.32 ms (-69%) |

|||
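The attached bench.py is not shown in this thread. For anyone who wants to reproduce a comparison of this kind, a minimal sketch using only the standard `timeit` module could look like the following. The helper names `make_input` and `bench`, the input pattern (one 0xFF byte in every ten bytes) and the repeat counts are all assumptions of this sketch and are not taken from bench.py, so absolute numbers will not match the tables above.

```python
import timeit

HANDLERS = ("strict", "ignore", "replace", "surrogateescape", "backslashreplace")


def make_input(size, valid):
    """Return `size` bytes: pure ASCII if valid, else ASCII with periodic 0xFF bytes."""
    if valid:
        return b"a" * size
    chunk = b"abcdefghi\xff"  # hypothetical pattern: one undecodable byte per 10 bytes
    return (chunk * (size // len(chunk) + 1))[:size]


def bench(errors, size, number=200, repeat=3):
    """Return the best time in seconds for one UTF-8 decode with this error handler."""
    # The strict handler is timed on valid input, since invalid input would raise.
    data = make_input(size, valid=(errors == "strict"))
    timer = timeit.Timer(lambda: data.decode("utf-8", errors))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for errors in HANDLERS:
        for size in (10, 100, 1000, 10_000):
            seconds = bench(errors, size)
            print(f"{errors:17} {size:>6} bytes  {seconds * 1e6:10.3f} us")
```

Running the script once on a build before the change and once on a build after it gives two columns comparable in spirit to the "before" and "after" columns above.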
| msg252319 - (view) | Author: Roundup Robot (python-dev) | Date: 2015-10-05 11:44 | |
New changeset 3152e4038d97 by Victor Stinner in branch 'default': Issue #25301: The UTF-8 decoder is now up to 15 times as fast for error https://hg.python.org/cpython/rev/3152e4038d97 |
|||
| msg252320 - (view) | Author: STINNER Victor (haypo) | Date: 2015-10-05 11:44 | |
I pushed my optimization and am closing the issue. |
|||
| msg252321 - (view) | Author: Roundup Robot (python-dev) | Date: 2015-10-05 11:49 | |
New changeset 5b9ffea7e7c3 by Victor Stinner in branch 'default': Issue #25301: Fix compatibility with ISO C90 https://hg.python.org/cpython/rev/5b9ffea7e7c3 |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2015-10-05 11:49:36 | python-dev | set | messages: + msg252321 |
| 2015-10-05 11:44:37 | haypo | set | status: open -> closed resolution: fixed messages: + msg252320 |
| 2015-10-05 11:44:03 | python-dev | set | nosy: + python-dev messages: + msg252319 |
| 2015-10-04 08:30:32 | haypo | set | messages: + msg252264 |
| 2015-10-04 08:21:20 | haypo | set | files: + bench.py |
| 2015-10-03 00:01:15 | haypo | set | files: + utf8_decoder.patch keywords: + patch messages: + msg252181 |
| 2015-10-02 14:44:42 | haypo | create | |
