Make GitLab.com fast again
- GitHub.com issue https get => 1,198 ms
- GitLab.com issue https get => 8,444 ms earlier, 3,625 ms now
- dev.gitlab.org issue https get => 872 ms
A 7x difference between GitHub.com and GitLab.com!
What causes that? New Relic looks OK. Where is this slowness coming from? The Rails app or something else?
Yorick and @jnijhof are investigating this.
-
New Relic data can't be trusted as it only runs on a single server; our handy InfluxDB setup, on the other hand, is much more reliable. Transaction timings from the past 24 hours:
The 99th percentile response time over the last 24 hours is 6.7 seconds, based on the following query:

```sql
select percentile("duration", 99) / 1000 from rails_transactions where time > now() - 24h;
```

These timings are Rails application timings only; they don't include any time spent in Workhorse, Nginx, etc.
-
@yorickpeterse Although 6.7 second response times are shameful, I'm relieved to see we have accurate data now instead of the New Relic stuff.
Any idea why the Rails times are so slow? Is it server load, database slowness, markdown parsing, something else?
-
@yorickpeterse Why do we only have one server reporting to New Relic? AFAIK you can point many servers at one application in New Relic without incurring additional costs. It's all priced per application.
-
@sytses I'm not sure just yet. I've seen a few bits of code being called in different controllers while being slow in all cases, so I'll look at that in the coming days. Basically my next goal will be finding/resolving bits of badly performing code that are shared between controllers. Fixing those cases should hopefully lead to a lower overall response time.
@dblessing I'm not sure what the exact reasoning is/was, but I'm pretty sure you pay per host/server, unless this changed in the past.
-
@sytses One example of such a case is discussed in gitlab-org/gitlab-ce#10771. I'll be taking a look at this particular case tomorrow.
-
@dblessing New Relic is priced per server, and $150 * 20 servers is too much. I also think it's unlikely that having it on all servers would improve the data; it should already be in the ballpark when running on a single server since traffic is distributed evenly by HAProxy.
-
@yorickpeterse Awesome work, please prioritize finding bits of badly performing code over fixing them since multiple people can help with the fixing.
-
gitlab-org/gitlab-ce#10771 was resolved by MR gitlab-org/gitlab-ce!2498, which will hopefully shave off around 500 ms on most project pages.
In gitlab-org/gitlab-ce!2421 the topic of increasing the Unicorn memory limit was brought up. We discovered the limits shipped with CE and Omnibus differ and are currently testing a slightly more reasonable limit (by just letting it run for an hour or so while collecting data). Ideally this would cut down the number of worker restarts, which in turn could reduce request timings (due to requests not having to wait for workers to become available). Should this help we'll try to also get this into 8.4.
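For context on how such a limit takes effect: as far as I know GitLab uses the unicorn-worker-killer gem, which restarts a worker once its memory use crosses a randomized threshold between a minimum and a maximum. A minimal sketch of that wiring in a `config.ru`, with illustrative numbers rather than the exact CE/Omnibus values:

```ruby
# config.ru -- rough sketch of a Unicorn worker memory limit via the
# unicorn-worker-killer gem. The 300-350 MB range below is illustrative,
# not the exact limit shipped with CE or Omnibus.
require 'unicorn/worker_killer'

min_memory = 300 * 1024**2 # bytes
max_memory = 350 * 1024**2 # bytes

# Each worker picks a random threshold between min and max; once its memory
# usage exceeds that threshold it finishes the current request and restarts.
# Raising these limits means fewer restarts, at the cost of more RAM per worker.
use Unicorn::WorkerKiller::Oom, min_memory, max_memory

require ::File.expand_path('../config/environment', __FILE__)
run Rails.application
```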
-
Improved some code in gitlab_git (gitlab-org/gitlab_git!62) which should make it (and `Project#empty_repo?` in particular) about 4.5x faster (MR for CE: gitlab-org/gitlab-ce!2535). This in turn should (at least for now) solve gitlab-org/gitlab-ce#12418.
-
@yorickpeterse great, keep going!
I've added dev.gitlab.org, which has normal latency, to the description, so the problems are due to the size of GitLab.com in all its aspects (more users, more issues, a larger DB, data spread among multiple machines, higher load, etc.).
-
Interesting GitLab Git slowness issue gitlab-org/gitlab-ce#12443
-
Milestone changed to Scheduled
-
@yorickpeterse Make sure to assign issues to @joshfng for resolution.
-
After having looked into a few issues (e.g. gitlab-org/gitlab-ce#12447 and gitlab-org/gitlab-ce#12444) I'm starting to suspect some of these bits of code may be slow due to either the Azure network or our PostgreSQL server. Starting tomorrow I'll see if I can think of anything that lets us confirm or rule this out.
-
I'll be digging into more low level operations and setting up metrics there. I suspect performance is affected by either the Azure network in general, or our NFS cluster. As a start to clear things up I'll be instrumenting `Rugged` (besides `Gitlab::Git`, which we already instrument) just in case we use it somewhere directly. I'll also make sure any other Git related code is instrumented.
-
@yorickpeterse awesome!
-
Extra Rugged/Gitlab::Git instrumentation will be added in gitlab-org/gitlab-ce!2664.
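For anyone curious what such instrumentation boils down to, here is a minimal sketch of method-level timing. This is not the code in that MR; `report_timing` is a stand-in for whatever ships the measurement to InfluxDB.

```ruby
# Minimal sketch of method-level timing instrumentation (not the actual code
# in gitlab-org/gitlab-ce!2664). MethodTimer.report_timing stands in for the
# code that writes the measurement to InfluxDB.
module MethodTimer
  def self.instrument(klass, method_name)
    original = klass.instance_method(method_name)

    klass.send(:define_method, method_name) do |*args, &block|
      start = Time.now

      begin
        original.bind(self).call(*args, &block)
      ensure
        duration_ms = (Time.now - start) * 1000.0
        MethodTimer.report_timing("#{klass}##{method_name}", duration_ms)
      end
    end
  end

  def self.report_timing(name, duration_ms)
    # Placeholder: in the real setup this would be sent to InfluxDB.
    puts "#{name}: #{duration_ms.round(2)} ms"
  end
end

# Usage example: time every call to Rugged::Repository#branches.
require 'rugged'

MethodTimer.instrument(Rugged::Repository, :branches)
```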
-
In an effort to make local testing of Git operations easier I'll be looking into a way to basically slow down Git/IO operations. Because I use an SSD with fairly few operations going on, GitLab locally performs way better than GitLab.com, making it much harder to determine the impact of certain operations. Basically I just want to add something like `sleep(100 ms)` before an operation to mimic a slow NFS. In the meantime I'll continue digging through the existing metrics/tooling/etc. to see what I can find on GitLab.com.
-
To set up a slow file system I used the following Fish script (after a day of googling/trying things out, etc):
```fish
#!/usr/bin/env fish

sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem delay 2ms

sshfs -o cache=no -f \
  yorickpeterse@localhost:/home/yorickpeterse/Projects/gitlab/repositories \
  /mnt/repositories

sudo tc qdisc del dev lo root
```

This mounts my repositories directory under `/mnt/repositories` and slows down traffic by 2ms.
-
mentioned in issue gitlab-org/gitlab-ce#12794
-
Currently most project specific pages appear to be suffering from the following two problems:
- Various slow Git operations that may not always be needed (e.g. checking if a repository is empty). An example of this is a call to `Project#valid_repo?`, which @joshfng is looking into in gitlab-org/gitlab-ce#12442. Another example was `Project#empty_repo?` (gitlab-org/gitlab-ce#12418), which I improved in MR gitlab-org/gitlab_git!62 and MR gitlab-org/gitlab-ce!2535.
- Certain database queries that run on many pages while not always needed. For example, `BroadcastMessage.current` was called on a lot (if not all) pages and ran DB queries when it could easily be cached. @joshfng fixed this in gitlab-org/gitlab-ce!2633.

Some other examples can be found using the script https://gitlab.com/gitlab-org/inspector-influx/blob/master/lib/common_methods.rb:

| Method | Actions | Average | Minimum | Maximum |
|--------|---------|---------|---------|---------|
| Gitlab::Git::Repository#ref_names | 25 | 26120.35 | 151.03 | 31443.33 |
| Gitlab::Git::Repository#tag_names | 24 | 24994.36 | 180.32 | 30812.19 |
| Gitlab::Git::Commit.find | 74 | 14507.25 | 102.44 | 14743.25 |
| Gitlab::Git::Repository#autocrlf= | 110 | 9806.67 | 137.11 | 10971.25 |
| Gitlab::Git::Repository#rugged | 110 | 9805.97 | 136.99 | 10971.04 |
| Gitlab::Git::Repository#rev_parse_target | 68 | 9171.58 | 102.32 | 14365.43 |
| Gitlab::Git::Blob.find | 36 | 7718.02 | 101.37 | 9403.49 |
| Gitlab::Git::Repository#branch_count | 103 | 4587.89 | 157.91 | 8061.42 |
| Gitlab::Git::Repository#empty? | 35 | 4576.5 | 118.32 | 4890.1 |
| Gitlab::Git::Commit.decorate | 45 | 4178.67 | 100.62 | 4827.85 |
| Gitlab::Git::Blob.find_entry_by_path | 25 | 3869.27 | 131.86 | 4455.82 |
| Project#empty_repo? | 104 | 3388.37 | 158.02 | 6364.22 |
| Gitlab::Git::Repository#branch_names | 85 | 3044.43 | 106.44 | 5610.8 |
| Gitlab::Git::Repository#branches | 86 | 3013.46 | 106.37 | 5610.69 |
| BroadcastMessage.current | 52 | 2879.07 | 105.47 | 4119.16 |
| Gitlab::Git::Repository#root_ref | 76 | 2218.09 | 107.79 | 2559.01 |
| Gitlab::Git::Repository#discover_default_branch | 76 | 2218.05 | 107.76 | 2558.96 |
| Project#default_branch | 28 | 2013.66 | 461.43 | 2047.47 |
| User#can_create_project? | 65 | 1827.47 | 103.62 | 2034.53 |
| User#projects_limit_left | 65 | 1827.36 | 103.49 | 2034.39 |
| User.find_by_authentication_token | 26 | 1425.22 | 103.87 | 1729.6 |
| Project#external_issue_tracker | 53 | 1364.35 | 104.5 | 1907.96 |
| License.block_changes? | 44 | 1261.76 | 100.58 | 1927.74 |
| License.current | 52 | 1180.08 | 100.36 | 1927.11 |
| Project#get_issue | 31 | 1146.23 | 110.02 | 1152.63 |
| Project#owner | 70 | 973.12 | 103.36 | 1642.53 |
| Group#owners | 88 | 827.69 | 102.07 | 1921.78 |
| User#authorized_projects | 29 | 537.54 | 113.93 | 558.11 |
| Project.find_with_namespace | 68 | 439.69 | 104.71 | 461.44 |

Here we can see that a lot of these Git calls are shared by different Rails actions (= controller + method). For this code there's an issue at gitlab-org/gitlab-ce#12443 and a discussion about caching certain operations at gitlab-org/gitlab-ce#12811.
-
Personal TODO for the coming days:
- Measure simple ping timings from a worker to the NFS and database. (ICMP seems to be disabled on either the workers or the NFS, as a ping from worker11 doesn't yield a response.)
- Measure read/write (e.g. just a dumb "Hello world") timings
- Measure simple `Rugged` (as in, without involving `Gitlab::Git`) operations from a worker (e.g. grabbing the latest commit of the master branch)

The idea of the above is to get a better understanding of the baseline latency that's present in GitLab.com. For example, if reading from a text file already takes 50 ms, more complex Git operations will only end up taking longer.
-
mentioned in issue gitlab-org/gitlab-ce#12442
-
To measure simple reads/writes I SSH'd into worker5 and ran the following scripts:
bench_reads.rb
```ruby
require 'benchmark'

path = '/var/opt/gitlab/git-data/repositories/yorickpeterse/cat-pictures.git/HEAD'

timings = []
start = Time.now

while Time.now - start <= 10
  timings << Benchmark.measure { File.read(path) }.real * 1000
  sleep 0.1
end

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This outputs:

```
Minimum: 0.0477 ms | Maximum: 5.569501 ms | Average: 0.15574003 ms
```

bench_writes.rb

```ruby
require 'benchmark'

path = '/var/opt/gitlab/git-data/repositories/yorickpeterse/test.txt'

timings = []
start = Time.now

File.open(path, 'w') do |handle|
  while Time.now - start <= 10
    timings << Benchmark.measure { handle.write('hello') }.real * 1000
    sleep 0.1
  end
end

File.unlink(path)

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This outputs:

```
Minimum: 0.0069 ms | Maximum: 0.0284 ms | Average: 0.009199 ms
```

Timings
- Average read time: 0.15574003 ms
- Average write time: 0.009199 ms
Based on these very basic scripts, it appears that primitive operations don't perform that badly.
-
To benchmark the time it takes to initialize a `Rugged::Repository` (which does some IO calls) I used the following script:

```ruby
require 'benchmark'
require 'rugged'

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'

timings = []
start = Time.now

while Time.now - start <= 10
  timings << Benchmark.measure { Rugged::Repository.new(path) }.real * 1000
  sleep 0.1
end

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

Using worker5 this spits out:

```
Minimum: 0.7886 ms | Maximum: 13.667702 ms | Average: 2.18976976 ms
```

-
Next I decided to put some extra pressure on the NFS while running a benchmark. For this I used the following script:
```ruby
require 'benchmark'
require 'rugged'

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'

timings = []
start = Time.now
stress = true

threads = 5.times.map do
  Thread.new do
    while stress
      Rugged::Repository.new(path)
    end
  end
end

while Time.now - start <= 10
  timings << Benchmark.measure { Rugged::Repository.new(path) }.real * 1000
end

stress = false

threads.each(&:join)

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This basically starts 5 threads that constantly initialize a repository while the main thread does the same and measures the timings. The idea is to see if/how high load influences timings. This script spits out the following:

```
Minimum: 0.4011 ms | Maximum: 1703.291789 ms | Average: 8.02635465 ms
```

-
Next I adjusted the script to dump the timings to a text file so I could plot it. The raw timings can be found at timings.txt.
This was plotted using the following gnuplot script:
set title "Timings" set xlabel "Iteration" set ylabel "Time (ms)" set rmargin 10 set term svg size 1400,500 set grid set style line 1 lc rgb "#0060ad" lt 1 lw 1 set style data lines plot "/tmp/timings.txt" using 1 ls 1 title "Timings"The graph looks as following:
What's interesting here is the recurring spike of ~500 ms. Based on this it looks like the NFS works fine when there's little load but starts to misbehave when load increases.
-
@yorickpeterse great information! If we have any insight into our peak traffic times, we should perform these tests during that window for the most accurate representation of NFS latency. Also, we could see varying latency depending on the hardware provided by Azure and what other types of load are being applied to the host machine.
-
mentioned in issue gitlab-org/gitlab-ce#12619
-
https://gitlab.slack.com/archives/operations/p1454513429000802
SQL stats: Blocks Read: 29.76/s, Fetches: 833614.48/s, Commits: 4169.81/s, Deletes: 0.40/s, Updates: 19.87/s, Inserts: 3.54/s
Fetches => Number of table rows fetched by bitmap scans when argument is a table, or table rows fetched by simple index scans using the index when argument is an index
Almost 1m fetches a second! We should really bring that down.
https://gitlab.slack.com/archives/operations/p1454598062000853 If you look at some pages in GitLab they run up to 100 queries on every load, so multiply this by a couple of thousand users online and it actually does make sense.
-
@yorickpeterse will add a counter that counts how many SQL queries are done per URL
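For reference, a per-URL query counter can be sketched with `ActiveSupport::Notifications` and a small piece of Rack middleware. This is just an illustration of the idea, not necessarily how it will be implemented:

```ruby
# Rough sketch of counting SQL queries per request path using Rack middleware
# and ActiveSupport::Notifications. Assumes single-threaded workers (Unicorn),
# since the subscriber would otherwise also count queries from other threads.
class QueryCounter
  def initialize(app)
    @app = app
  end

  def call(env)
    count = 0

    subscriber = ActiveSupport::Notifications.subscribe('sql.active_record') do |*, payload|
      # Skip ActiveRecord's own bookkeeping queries.
      count += 1 unless payload[:name] == 'SCHEMA'
    end

    @app.call(env)
  ensure
    ActiveSupport::Notifications.unsubscribe(subscriber)
    Rails.logger.info("#{count} SQL queries for #{env['PATH_INFO']}")
  end
end

# Hypothetical wiring in config/application.rb:
#   config.middleware.use QueryCounter
```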
-
mentioned in issue #68
-
@jnijhof will add the script from #42 (comment 3522792) to checkmk in #68
-
@yorickpeterse The goal is to get a time budget for the almost 10 seconds that a GET of gitlab-org/gitlab-ce#1 takes.
-
gitlab-org/gitlab-ce#1 TTFB is 7.76 seconds, and what got my attention is the amount of `mentioned` entries in the issue, which is causing this.
-
mentioned in issue gitlab-org/gitlab-ce#13241
-
@sytses some things that might be worth noting, based on some poking around I've done:
- `autocomplete_source.json` is loaded for every single Issue and Merge Request. For this issue, it took more than a second to load and includes every single issue, user, and merge request in a given project. This can't be good for the site's speed, and is an extra (in this case) 175KB that shouldn't be loaded unless explicitly needed. For the gitlab-ce repository, it's more than 350KB! Unless I'm missing something that necessitates this, it should not be served on every single page load.
- It looks like a file called `notes` is loaded every ~15 seconds which checks for updates to content on the page. This may not cause a huge performance burden, but it's something simple that may be beneficial to change to ~45 second intervals, if it's even necessary at all. I currently have 5 GitLab tabs open, meaning that I'm polling the `notes` endpoint 20 times per minute. These things add up.
- Emoji are being served in the aforementioned `autocomplete_source.json` file as well as in an 813KB PNG sprite file. Which are we using, and why are we loading these twice?
- The award emoji picker is loaded inline rather than dynamically with JavaScript. I would think something like that might be used in - at max - 3% of Issue page loads. Probably less. Could see slightly faster load times if that were taken out.
- Non-Gravatar avatar images (such as my own) are loaded at their full resolution, usually between 50KB and 150KB per avatar with a resolution around 800px - if not more. The Gravatar images are always less than 10KB, served over HTTP2, and are only 48px. Some huge potential gains to be had there.
Have you looked into using a CDN (I'd strongly recommend CloudFlare, I use it for my Rails apps)? Could be a huge performance improvement and take significant pressure off your servers. I realize it's not open source, but seeing as the CI Builds run on Digital Ocean and Azure, that's probably not a huge problem.
Sorry if this was a little long-winded, or if the tone sounds negative in any way. It's not intentional; it's just late at night here and my brain isn't working at full capacity :P
-
Testing issue #1 locally using Sherlock reveals the following:
- Out of the ~21 seconds it takes to load this issue (using my slower file system) only 0.45 seconds is spent running SQL queries.
- Looking at the line profiler output, most time appears to be spent in `app/models/repository.rb` and the Markdown rendering pipeline. Details below.
app/models/repository.rb
- 84% of the time (roughly 15 seconds) is spent in `Repository#empty?` on line 47
- 9% of the time (roughly 1.6 seconds) is spent in `Repository#has_visible_content?` on line 60
- 6% of the time (roughly 1 second) is spent in `Repository#root_ref` on line 483
- The remaining time is spread across a few different lines, with most timings taking only a few dozen milliseconds
lib/banzai/filter/relative_link_filter.rb
This file makes the calls to `Repository#empty?` as well as `Repository#exists?` in `Banzai::Filter::RelativeLinkFilter#linkable_files?` on line 32. All the time spent in this file is spent on this line.

Conclusion
So loading issues seems to be slow because it's doing a bunch of Git operations that don't have to be performed for every request. Rugged internally uses `git_repository_is_empty` to check if a repository is empty (https://libgit2.github.com/libgit2/#HEAD/group/repository/git_repository_is_empty), which as far as I can tell is only true before the first commit is ever pushed. For example, if you follow these steps:

1. Initialize a repo
2. Add a file and commit it
3. Remove said file, commit the changes
4. Run `Rugged::Repository.new('path/to/repo').empty?`

Then step 4 will return `false`. As a result of this I think we can just cache the output of `Repository#empty?` forever. Example shell session:

```
foo $ git init
Initialized empty Git repository in /tmp/foo/.git/
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
true
foo $ echo 'hello' > test.txt
foo $ git add test.txt
foo $ git commit -m 'Initial commit'
[master (root-commit) 134f396] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 test.txt
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
false
foo $ git rm test.txt
rm 'test.txt'
foo $ git commit -m 'Removed test.txt'
[master abeaeb0] Removed test.txt
 1 file changed, 1 deletion(-)
 delete mode 100644 test.txt
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
false
```

Repository#has_visible_content?
The output of `Repository#has_visible_content?` can also be cached, as it simply checks if the number of branches is greater than 0. This cache could be flushed whenever the last branch of a repository is removed (we might need an extra hook of sorts for this).

cc @joshfng
-
Applying the following patch reduces the loading time of #1 locally to around 4 seconds (instead of 21):
```diff
diff --git a/app/models/repository.rb b/app/models/repository.rb
index e813c94..4f7f8d6 100644
--- a/app/models/repository.rb
+++ b/app/models/repository.rb
@@ -44,7 +44,7 @@ class Repository
   end
 
   def empty?
-    raw_repository.empty?
+    cache.fetch(:empty) { raw_repository.empty? }
   end
 
   #
@@ -57,7 +57,9 @@ class Repository
   # This method return true if repository contains some content visible in project page.
   #
   def has_visible_content?
-    raw_repository.branch_count > 0
+    cache.fetch(:has_visible_content) do
+      raw_repository.branch_count > 0
+    end
   end
 
   def commit(id = 'HEAD')
```

-
`Repository#root_ref` can probably also be cached, as it simply returns the name of the default branch. The cache for this can be flushed whenever the user changes the default branch. Caching this further reduces the loading time to around 2.8 seconds. The diff would be:

```diff
diff --git a/app/models/repository.rb b/app/models/repository.rb
index e813c94..086e989 100644
--- a/app/models/repository.rb
+++ b/app/models/repository.rb
@@ -44,7 +44,7 @@ class Repository
   end
 
   def empty?
-    raw_repository.empty?
+    cache.fetch(:empty) { raw_repository.empty? }
   end
 
   #
@@ -57,7 +57,9 @@ class Repository
   # This method return true if repository contains some content visible in project page.
   #
   def has_visible_content?
-    raw_repository.branch_count > 0
+    cache.fetch(:has_visible_content) do
+      raw_repository.branch_count > 0
+    end
   end
 
   def commit(id = 'HEAD')
@@ -480,7 +482,7 @@ class Repository
   end
 
   def root_ref
-    @root_ref ||= raw_repository.root_ref
+    @root_ref ||= cache.fetch(:root_ref) { raw_repository.root_ref }
   end
 
   def commit_dir(user, path, message, branch)
```

-
mentioned in merge request gitlab-org/gitlab-ce!2752
-
WIP merge request to resolve the above: gitlab-org/gitlab-ce!2752
-
mentioned in issue gitlab-com/support-forum#382
-
Created another merge request that makes our cache flushing less aggressive: gitlab-org/gitlab-ce!2769. tl;dr: previously branch statistics were flushed upon every push (for all branches); now we only flush for branches where it's actually needed. A push to the default branch still results in all branch caches being flushed.
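Conceptually the change boils down to scoping cache keys per branch so a push only expires what it touched. A hedged sketch of the idea (not the MR's actual code, and the cache key layout is made up):

```ruby
# Hedged sketch of the "flush only what's needed" idea in !2769, not its
# actual code. Branch-scoped cache keys let a push expire only the caches
# for the branches that were actually updated.
def expire_branch_caches(repository, pushed_branches, default_branch)
  if pushed_branches.include?(default_branch)
    # A push to the default branch can change what most pages display,
    # so in that case every branch cache is still flushed.
    repository.branch_names.each { |name| expire_cache_for(repository, name) }
  else
    pushed_branches.each { |name| expire_cache_for(repository, name) }
  end
end

def expire_cache_for(repository, branch_name)
  # Hypothetical cache key layout; Rails.cache stands in for GitLab's cache.
  Rails.cache.delete([:branch_stats, repository.path_with_namespace, branch_name])
end
```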
-
mentioned in issue gitlab-org/gitlab-ce#13299
-
mentioned in issue gitlab-org/gitlab-ce#13300
-
mentioned in issue gitlab-org/gitlab-ce#13301
-
mentioned in issue gitlab-org/gitlab-ce#13302
-
mentioned in issue #75
-
@connorshea thank you for your suggestions in #42 (comment 3588320), which I split up into:
- autocomplete gitlab-org/gitlab-ce#13299
- note check time gitlab-org/gitlab-ce#13300
- dynamic award emoji picker gitlab-org/gitlab-ce#13301
- load smaller versions of non-gravatar avatars gitlab-org/gitlab-ce#13302
- use a CDN for GitLab.com #75
-
@connorshea Thanks for the comment/info! Sorry I completely overlooked it in my whirlwind of replies above.
-
mentioned in issue gitlab-org/gitlab-development-kit#94
-
@sytses thank you for splitting them up into issues :)
@yorickpeterse no problem! Your empty repository improvement is awesome.
-
@jnijhof What are the values on the Y axes in these graphs, the time it takes to initialize a Git repo?
-
@jnijhof So in this graph:
Does this mean it takes 800 milliseconds to initialize a repository? If not, maybe it's best to run through these graphs and their meanings during the devops meeting; that saves pinging 17 participants in this issue :)
-
To test the impact of 8.5 RC1 so far I ran the following query:
select percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-01-11 14:25:00' and time < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';This spits out the 95th percentile of gitlab-org/gitlab-ce#1 for the past 24 hours leading up to today's deploy. The output is:
name: rails_transactions ------------------------ time duration_95th 1452522300000000001 11.419827557000001In other words, for the 24 hours leading up to the deploy the 95th percentile timing was roughly 11.5 seconds.
Next I ran the following query:
percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';This calculates the 95th percentile since today's 8.5 deploy, the output is as following:
name: rails_transactions ------------------------ time duration_95th 1455200700000000001 3.480613037In other words, we went from a 95th percentile of ~11.5 seconds down to ~3.5 seconds, an improvement of roughly 3.3 times.
-
And for the sake of it, here are the timings when we calculate the mean instead of the 95th percentile:
```
> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and timestamp > '2016-01-11 14:25:00' and timestamp < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time  mean
0     8.047509426767022

> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time                 mean
1455200700000000001  2.933122648561403
```

-
@yorickpeterse what if we run that query for all controller actions?
-
@yorickpeterse forget about my question, I can see that other controllers also improved greatly (from 2 seconds to 1 second): http://yorick-influxdb.gitlap.com/dashboard/db/projectscontroller
-
mentioned in commit gitlab-com/www-gitlab-com@19420897
-
The speed of GitLab.com is not yet where we want it to be, but the worst is over with the 3x speed increase. I've removed the warning on our website (gitlab-com/www-gitlab-com@19420897) as we keep working to increase the speed further.
-
Yorick is working on increasing the speed of the git commands in gitlab-org/gitlab-ce#12443
-
Little factoid about GitLab.com: in the ~15 hours before the 8.5.0 RC1 deploy (the 11th at ~14:00 UTC) our timings were as follows:

- 95th percentile: 2.026 sec
- 99th percentile: 6.169 sec
- mean: 658.6 ms

Since the deploy (until now) the timings are as follows:

- 95th percentile: 1.448 sec (= 1.4x faster)
- 99th percentile: 3.779 sec (= 1.6x faster)
- mean: 462.8 ms (= 1.4x faster)
-
Per this comment in a thread on switching to GitLab, pulling the same repo from GitHub vs. GitLab, GitLab takes significantly longer:
GitHub:
```
❯❯❯ time git pull origin master
From github.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull origin master  0.02s user 0.04s system 4% cpu 1.396 total
```

GitLab:

```
❯❯❯ time git pull gitlab master
From gitlab.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull gitlab master  0.02s user 0.02s system 0% cpu 6.837 total
```

-
```
# GitLab
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://gitlab.com/connorshea/connorshea.github.io.git
Cloning into 'connorshea.github.io'...
remote: Counting objects: 708, done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 708 (delta 370), reused 708 (delta 370)
Receiving objects: 100% (708/708), 21.46 MiB | 10.70 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.216s
user    0m0.631s
sys     0m0.366s

# GitHub
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://github.com/connorshea/connorshea.github.io.git connorshea.github.io2
Cloning into 'connorshea.github.io2'...
remote: Counting objects: 708, done.
remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708
Receiving objects: 100% (708/708), 21.46 MiB | 9.87 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m2.881s
user    0m0.680s
sys     0m0.429s

# GitHub
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://github.com/connorshea/connorshea.github.io.git connorshea.github.io3
Cloning into 'connorshea.github.io3'...
remote: Counting objects: 708, done.
remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708
Receiving objects: 100% (708/708), 21.46 MiB | 8.58 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.095s
user    0m0.693s
sys     0m0.466s

# GitLab
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://gitlab.com/connorshea/connorshea.github.io.git connorshea.github.io4
Cloning into 'connorshea.github.io4'...
remote: Counting objects: 708, done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 708 (delta 370), reused 708 (delta 370)
Receiving objects: 100% (708/708), 21.46 MiB | 10.90 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.337s
user    0m0.626s
sys     0m0.369s
```

Interestingly, when cloning a new repo from scratch (in this case it isn't a very big repo), GitHub leads by only about half a second. Tested from Colorado.

The most notable difference between the two that I can find is that GitHub has `remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708` whereas GitLab has `remote: Total 708 (delta 370), reused 708 (delta 370)`. Not sure what that means exactly, but maybe it makes a difference?
-
@connorshea Thanks! That first example appears to be using Git over SSH, which indeed is quite a bit slower than Git over HTTP at the moment. See #99 for some more information on this.
-
mentioned in issue gitlab-org/gitlab-ce#13889
-
Merge request gitlab-org/gitlab-ce!2987 will make global search around 10x faster (at least). For example, the time for the search query "yorick" went from > 30 seconds to 2-3 seconds. Searching for "git" now takes around 5-6 seconds instead of timing out (meaning it used to take over 2 minutes at least).
Project specific search queries have also been improved. For example, searching for "git" in the GitLab CE repository takes around 2-3 seconds.
This applies to setups using CE's search feature; these changes do not apply to Elasticsearch based searching.
-
Merge request gitlab-org/gitlab-ce!3123 (and gitlab-org/gitlab_git!69) will hopefully make displaying commit logs faster. The final impact of these merge requests on GitLab.com has yet to be tested, though.
-
Merge request gitlab-org/gitlab-ce!3128 adds some methods for counting and caching the number of branches and tags. Using these methods the loading time of a project's branches page can be improved by up to 1.45x; in my local setup the timings went from 6.65 seconds on average to 4.6 seconds.
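For illustration, the caching pattern is roughly the following (not the actual code in !3128; a memory store stands in for GitLab's Redis-backed cache):

```ruby
require 'rugged'
require 'active_support'
require 'active_support/cache'

# Illustration of the branch/tag count caching idea (not the code in !3128).
CACHE = ActiveSupport::Cache::MemoryStore.new

def branch_count(repo_path)
  # Serve the counter from the cache; only hit Rugged/Git on a cache miss.
  CACHE.fetch([:branch_count, repo_path]) do
    Rugged::Repository.new(repo_path).branches.count
  end
end

def tag_count(repo_path)
  CACHE.fetch([:tag_count, repo_path]) do
    Rugged::Repository.new(repo_path).tags.count
  end
end

# To be called from the post-receive hook so counters stay accurate after a push.
def expire_count_caches(repo_path)
  CACHE.delete([:branch_count, repo_path])
  CACHE.delete([:tag_count, repo_path])
end

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'
puts branch_count(path) # slow: hits Rugged
puts branch_count(path) # fast: served from the cache
```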
-
gitlab-org/gitlab-ce!3158 will make retrieving CI services faster, which seems to have quite an impact on the performance of merge request pages (see the graph in gitlab-org/gitlab-ce#14186).
-
mentioned in issue gitlab-org/gitlab-ce#13651
-
I'm currently looking into the following two issues:
The first issue affects viewing individual issues and has a pretty significant impact on performance. The second issue affects basically everything where we render Markdown, though it will probably be quite hard to resolve as the Markdown rendering pipeline is quite complex.
-
mentioned in issue gitlab-org/gitlab-ce#14685
-
People are rightfully complaining that loading a list of issues shouldn't take 3 seconds: https://news.ycombinator.com/item?id=11375663
@yorickpeterse do you think http://stats.pingdom.com/81vpf8jyr1h9/1902794 is taking 4 seconds because of markdown rendering?
-
BTW people are noticing that GitLab is becoming faster in https://news.ycombinator.com/item?id=11375912 and https://news.ycombinator.com/item?id=11375748
-
@sytses Yes, rendering Markdown takes up quite a lot of time especially as the number of notes grows (it seems to grow exponentially even).
-
I'm working on making Markdown rendering faster, the WIP merge request is gitlab-org/gitlab-ce!3389.
-
Wanted to note that we don't have GZIP enabled on GitLab.com, from gitlab-org/gitlab-ce#14344:
> Per Google PageSpeed, we could reduce this page's size by 1.4MB by enabling gzip. That's a huge difference.

PageSpeed gives that issue page a 37/100 on Desktop and a 31/100 on Mobile. We shouldn't base everything off the PageSpeed stats, but we should aim to have at least a 70/100 on both. I don't claim to be a security researcher, but from what I can tell disabling gzip is an unnecessary precaution, and I wanted to make sure it was discussed more. Based on gitlab-org/gitlab-ce#4088, I think this can just as easily be mitigated by removing SPDY support.
-
Merge request gitlab-org/gitlab-ce!3389 was merged and is currently running on GitLab.com along with the other changes that made it into 8.7.0 RC1. I'll continue to monitor performance in the next hour or so to see what the impact of this RC is. After that I'll continue looking into the performance of `Issue#referenced_merge_requests` and associated code, as this appears to currently be the main cause of loading issues being slow.
-
8.7.0 RC1 sadly does not have an impact as drastic as it had locally. Timings for the 24 hours leading up to the deploy (today 12:00 UTC) are as follows:
- Mean: 798.15 ms
- 95th percentile: 2.439 sec
- 99th percentile: 5.939 sec
After the deploy until now the timings are as follows:
- Mean: 769.06 ms
- 95th percentile: 2.462 sec
- 99th percentile: 5.648 sec
Note that these timings are the timings of all requests across all controllers.
In gitlab-org/gitlab-ce#14315 (comment 4596105) and gitlab-org/gitlab-ce#14315 (comment 4596248) I noted that it seems quite a bit of time in the Markdown pipeline is spent fetching data from Redis. Moving the caching to the database (sketched after the list below) would help because:
- The records to render are already fetched, thus the column containing the pre-rendered HTML (if any) is also already there. This removes the need for any extra network calls.
- Data cached in Redis expires after 2 weeks (IIRC); storing this in the DB works around that. The cache expiring means that every now and then we'd have to re-build the HTML for all data involved (e.g. we'd have to rebuild the HTML of all notes in this particular issue)
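A rough sketch of what database-backed caching could look like, assuming a hypothetical `note_html` column on `notes` (this is just the idea, not an actual implementation plan):

```ruby
# Rough sketch: a hypothetical `note_html` column stores the pre-rendered HTML
# next to the raw Markdown, so rendering a note needs no extra Redis round trip.
class AddNoteHtmlToNotes < ActiveRecord::Migration
  def change
    add_column :notes, :note_html, :text
  end
end

class Note < ActiveRecord::Base
  # Re-render whenever the raw Markdown changes.
  before_save :update_note_html, if: :note_changed?

  def rendered_note
    # Fall back to rendering on the fly for rows written before the column existed.
    note_html || render_markdown(note)
  end

  private

  def update_note_html
    self.note_html = render_markdown(note)
  end

  def render_markdown(text)
    # Placeholder for the Banzai/Markdown rendering pipeline.
    Banzai.render(text, project: project)
  end
end
```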
Since this will take a few days (at least) to implement I'll first look into loading the "Related merge requests" section of an issue asynchronously. While this won't make that part faster it will at least cut down the loading time of the document request when loading an issue.
It seems I also broke Markdown rendering in some odd cases (gitlab-org/gitlab-ce#14936) so I'll have to fix that as well.
Git-wise, when viewing individual issues there's still Git code involved that's slow as a snail. For example, `Rugged::BranchCollection#each` seems to be called somewhere and in the worst case takes around 1.6 seconds to complete. There also seems to be revision parsing going on using `Rugged::Repository#rev_parse` (taking up to half a second); this is probably used for rendering commit references. Other than that there are a handful of random Git related methods running, but most only take 20-30 milliseconds with an occasional outlier (again probably used for rendering Git related references).
-
mentioned in issue gitlab-org/gitlab-ce#14949
-
mentioned in issue #182
-
Thanks @yorickpeterse, removing `Rugged::BranchCollection#each` calls in individual issues looks important to me. Good idea to make "Related merge requests" async.

BTW I've made a longer term issue in #182
-
@yorickpeterse I'd be interested in seeing if changes in the next RC make any difference. Are those timings for serverside rendering or clientside? If clientside, I would think the 99th percentile would decrease some with the smaller page size, although I may be wrong.
-
@connorshea All timings are server side. Due to InfluxDB overwriting data I have to re-verify what the actual impact of RC1 was (as I no longer trust the data we had at that time). Our timings did quite significantly drop the moment we fixed InfluxDB (see #175) so maybe RC1 did have the desired impact.
-
@yorickpeterse have you been tracking time to DOMContentLoaded via Piwik at all?
-
@connorshea I'm not sure what exactly we're tracking in Piwik. Maybe @jnijhof knows more about that.
-
Worth mentioning: I managed to more or less confirm that on the page used for displaying single issues (`Projects::IssuesController#show`) Git code is no longer the main offender, at least for 95% of the requests:

Here the little boxes at the top (ignore the colors of the text, I'm still making adjustments to the thresholds) show the 95th percentiles of the various timings. There are still some random spikes (e.g. in one instance `Rugged::Repository.new` took over 7 seconds), but in 95% of the requests we only seem to spend around 85 milliseconds in either `Gitlab::Git` or `Rugged` code.
-
It seems to me there is a negative correlation between the number of requests and the timings, which I find a bit weird and perhaps even suspicious: the peak number of requests occurs between 17:00 and 18:00, which is also a period of low mean timing. Likewise, the spike in requests around 20:30 or so corresponds to a very low mean timing.
I would instinctively expect that the high number of requests would increase the load on the database, and lead to (s)lower performance.
-
@elygre There's no guarantee that more requests equals higher timings as this all depends on what those requests are doing. There may also be external factors (e.g. requests for other controllers) that can impact timings.
-
@yorickpeterse Absolutely, it just seemed systematic, and hence both interesting, weird and suspicious :-)
-
mentioned in merge request gitlab-org/gitlab-ce!3804
-
mentioned in issue gitlab-com/support-forum#453
-
All action is now happening in the issues labeled "performance": https://gitlab.com/gitlab-org/gitlab-ce/issues?scope=all&sort=updated_desc&state=opened&utf8=%E2%9C%93&label_name%5B%5D=performance Please consider contributing code.
-
@sytses do you think it makes sense to keep this one open now that we have all the other ones?
-
Added direction performance labels
-
Milestone changed to Milestone-1
-
mentioned in issue gitlab-org/gitlab-ce#17693
-
Small status update since it's been a while in this issue:
- gitlab-org/gitlab-ce!3969 adds support for batch retrieval of Markdown references and refactors the code used for getting participants of an `Issuable`. This MR as a whole greatly cuts down the number of queries executed when viewing individual issues/merge requests (or any code that happens to get participants of a `Participable`)
- gitlab-org/gitlab-ce!4362 cuts down the number of queries needed to render HTML in `UserReferenceFilter` by not running a SQL query for every username reference (see the sketch after this list)
- I'm working on another set of changes that cut down the number of queries executed in `IssueReferenceFilter` when rendering issue references. This is still a giant hack so there's no MR available for this yet (I hope to create one tomorrow or so)
- gitlab-org/gitlab-ce!4378 and gitlab-org/gitlab-ce!4377 aim to cache some simple counters (e.g. running build counters) to further reduce the number of queries in a bunch of places
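The core trick in these reference MRs is replacing one-query-per-reference lookups with a single batched query plus an in-memory lookup table. A simplified sketch of the pattern (not the actual filter code):

```ruby
# Simplified sketch of the batching pattern used to avoid one SQL query per
# reference; not the actual UserReferenceFilter/IssueReferenceFilter code.

# Before: one query per @username found in the document (N+1 queries).
def users_per_reference_slow(usernames)
  usernames.map { |name| User.find_by(username: name) }
end

# After: collect all usernames first, then resolve them with a single query
# and build a lookup table keyed by username.
def users_per_reference_fast(usernames)
  User.where(username: usernames.uniq).index_by(&:username)
end
```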
-
Merge request for cutting down SQL queries when rendering issue references (still a WIP): gitlab-org/gitlab-ce!4410
-
Major behaviour change with this: #276
-
@pcarranza thanks for linking that comment!
-
My pleasure, @sytses
-
@pcarranza great job!
It's starting to be visible on the graphs under status.gitlab.com.
Fingers crossed for the next performance enhancements!
-
mentioned in issue gitlab-org/gitlab-ce#15607
-
@yorickpeterse have your markdown rendering changes already landed? http://stats.pingdom.com/81vpf8jyr1h9/1902794 is still above 4 seconds
-
@sytses Yes, they landed in RC1. In various cases query counts dropped significantly, though timings were not reduced as much as I hoped. Having said that, we did have the noise of Unicorn workers being killed very quickly, and RC1 introduced some changes that ended up running extra SQL queries for every note rendered.
More information on this (and where the remaining amount of time appears to be spent) can be found in gitlab-org/gitlab-ce#15607 (comment 12372507).
-
@sytses that's the average over 7 days; the current numbers are down to ~3.5 seconds after the memory change that happened last Thursday. So I'm expecting that number to go below 4s, to roughly 3.5s, in the next few days.
-
mentioned in issue gitlab-org/gitlab-ce#18581
-
mentioned in issue gitlab-com/support-forum#576
-
Milestone changed to Milestone-3
-
mentioned in issue gitlab-com/infrastructure#57
-
mentioned in issue gitlab-com/infrastructure#59
-