Make GitLab.com fast again
- GitHub.com issue https get => 1,198 ms
- GitLab.com issue https get => 8,444 ms earlier, 3,625 ms now
- dev.gitlab.org issue https get => 872 ms
A 7x difference between GitHub.com and GitLab.com!
What causes that? New Relic looks OK. Where is this slowness coming from? The Rails app or something else?
Yorick and @jnijhof are investigating this.
-
New Relic data can't be trusted as it only runs on a single server; our handy InfluxDB setup, on the other hand, is much more reliable. Transaction timings from the past 24 hours:
The 99th percentile response time over the last 24 hours is 6.7 seconds, based on the following query:

```sql
select percentile("duration", 99) / 1000 from rails_transactions where time > now() - 24h;
```

These timings are Rails application timings only; they don't include any time spent in Workhorse, Nginx, etc.
-
@yorickpeterse Although 6.7 second response times are shameful, I'm relieved to see we have accurate data now instead of the New Relic stuff.
Any idea why the Rails times are so slow? Is it server load, database slowness, markdown parsing, something else?
-
@yorickpeterse Why do we only have one server reporting to New Relic? AFAIK you can point many servers at one application in New Relic without incurring additional costs. It's all priced per application.
-
@sytses I'm not sure just yet. I've seen a few bits of code being called in different controllers while being slow in all cases, so I'll look at that in the coming days. Basically my next goal will be finding/resolving bits of badly performing code that are shared between controllers. Fixing those cases should hopefully lead to a lower overall response time.
@dblessing I'm not sure what the exact reasoning is/was, but I'm pretty sure you pay per host/server, unless this changed in the past.
-
@sytses One example of such a case is discussed in gitlab-org/gitlab-ce#10771. I'll be taking a look at this particular case tomorrow.
-
@dblessing New Relic is priced per server, and $150 * 20 servers is too much. I also think it's unlikely that having it on all servers would improve the data; it should already be in the ballpark when running on a single server since traffic is distributed evenly by HAProxy.
-
@yorickpeterse Awesome work, please prioritize finding bits of badly performing code over fixing them since multiple people can help with the fixing.
-
gitlab-org/gitlab-ce#10771 was resolved by MR gitlab-org/gitlab-ce!2498, which will hopefully shave off around 500 ms on most project pages.
In gitlab-org/gitlab-ce!2421 the topic of increasing the Unicorn memory limit was brought up. We discovered the limits shipped with CE and Omnibus differ and are currently testing a slightly more reasonable limit (by just letting it run for an hour or so while collecting data). Ideally this would cut down the number of worker restarts, which in turn could reduce request timings (due to requests not having to wait for workers to become available). Should this help we'll try to also get this into 8.4.
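For context on how such a limit takes effect: as far as I know GitLab uses the unicorn-worker-killer gem, which restarts a worker once its memory use crosses a randomized threshold between a minimum and a maximum. A minimal sketch of that wiring in a `config.ru`, with illustrative numbers rather than the exact CE/Omnibus values:

```ruby
# config.ru -- rough sketch of a Unicorn worker memory limit via the
# unicorn-worker-killer gem. The 300-350 MB range below is illustrative,
# not the exact limit shipped with CE or Omnibus.
require 'unicorn/worker_killer'

min_memory = 300 * 1024**2 # bytes
max_memory = 350 * 1024**2 # bytes

# Each worker picks a random threshold between min and max; once its memory
# usage exceeds that threshold it finishes the current request and restarts.
# Raising these limits means fewer restarts, at the cost of more RAM per worker.
use Unicorn::WorkerKiller::Oom, min_memory, max_memory

require ::File.expand_path('../config/environment', __FILE__)
run Rails.application
```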
-
Improved some code in gitlab_git (gitlab-org/gitlab_git!62) which should make it (and `Project#empty_repo?` in particular) about 4.5x faster (MR for CE: gitlab-org/gitlab-ce!2535). This in turn should (at least for now) solve gitlab-org/gitlab-ce#12418.
-
@yorickpeterse great, keep going!
I've added dev.gitlab.org, which has normal latency, to the description, so the problems are due to the size of GitLab.com in all its aspects (more users, more issues, a larger DB, data spread among multiple machines, higher load, etc.).
-
Interesting GitLab Git slowness issue gitlab-org/gitlab-ce#12443
-
Milestone changed to Scheduled
-
@yorickpeterse Make sure to assign issues to @joshfng for resolution.
-
After having looked into a few issues (e.g. gitlab-org/gitlab-ce#12447 and gitlab-org/gitlab-ce#12444) I'm starting to suspect some of these bits of code may be slow due to either the Azure network or our PostgreSQL server. Starting tomorrow I'll see if I can think of anything that lets us confirm or rule this out.
-
I'll be digging into more low level operations and setting up metrics there. I suspect performance is affected by either the Azure network in general, or our NFS cluster. As a start to clear things up I'll be instrumenting `Rugged` (besides `Gitlab::Git`, which we already instrument) just in case we use it somewhere directly. I'll also make sure any other Git related code is instrumented.
-
@yorickpeterse awesome!
-
Extra Rugged/Gitlab::Git instrumentation will be added in gitlab-org/gitlab-ce!2664.
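For anyone curious what such instrumentation boils down to, here is a minimal sketch of method-level timing. This is not the code in that MR; `report_timing` is a stand-in for whatever ships the measurement to InfluxDB.

```ruby
# Minimal sketch of method-level timing instrumentation (not the actual code
# in gitlab-org/gitlab-ce!2664). MethodTimer.report_timing stands in for the
# code that writes the measurement to InfluxDB.
module MethodTimer
  def self.instrument(klass, method_name)
    original = klass.instance_method(method_name)

    klass.send(:define_method, method_name) do |*args, &block|
      start = Time.now

      begin
        original.bind(self).call(*args, &block)
      ensure
        duration_ms = (Time.now - start) * 1000.0
        MethodTimer.report_timing("#{klass}##{method_name}", duration_ms)
      end
    end
  end

  def self.report_timing(name, duration_ms)
    # Placeholder: in the real setup this would be sent to InfluxDB.
    puts "#{name}: #{duration_ms.round(2)} ms"
  end
end

# Usage example: time every call to Rugged::Repository#branches.
require 'rugged'

MethodTimer.instrument(Rugged::Repository, :branches)
```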
-
In an effort to make local testing of Git operations easier I'll be looking into a way to basically slow down Git/IO operations. Because I use an SSD with fairly few operations going on, GitLab locally performs way better than GitLab.com, making it much harder to determine the impact of certain operations. Basically I just want to add something like `sleep(100 ms)` before an operation to mimic a slow NFS. In the meantime I'll continue digging through the existing metrics/tooling/etc. to see what I can find on GitLab.com.
-
To set up a slow file system I used the following Fish script (after a day of googling/trying things out, etc):
```fish
#!/usr/bin/env fish

sudo tc qdisc del dev lo root
sudo tc qdisc add dev lo root netem delay 2ms

sshfs -o cache=no -f \
  yorickpeterse@localhost:/home/yorickpeterse/Projects/gitlab/repositories \
  /mnt/repositories

sudo tc qdisc del dev lo root
```

This mounts my repositories directory under `/mnt/repositories` and slows down traffic by 2ms.
-
mentioned in issue gitlab-org/gitlab-ce#12794
-
Currently most project specific pages appear to be suffering from the following two problems:
- Various slow Git operations that may not always be needed (e.g. checking if a repository is empty). An example of this is a call to `Project#valid_repo?`, which @joshfng is looking into in gitlab-org/gitlab-ce#12442. Another example was `Project#empty_repo?` (gitlab-org/gitlab-ce#12418), which I improved in MR gitlab-org/gitlab_git!62 and MR gitlab-org/gitlab-ce!2535.
- Certain database queries that run on many pages while not always needed. For example, `BroadcastMessage.current` was called on a lot (if not all) pages and ran DB queries when it could easily be cached. @joshfng fixed this in gitlab-org/gitlab-ce!2633.

Some other examples can be found using the script https://gitlab.com/gitlab-org/inspector-influx/blob/master/lib/common_methods.rb:

| Method | Actions | Average | Minimum | Maximum |
|--------|---------|---------|---------|---------|
| Gitlab::Git::Repository#ref_names | 25 | 26120.35 | 151.03 | 31443.33 |
| Gitlab::Git::Repository#tag_names | 24 | 24994.36 | 180.32 | 30812.19 |
| Gitlab::Git::Commit.find | 74 | 14507.25 | 102.44 | 14743.25 |
| Gitlab::Git::Repository#autocrlf= | 110 | 9806.67 | 137.11 | 10971.25 |
| Gitlab::Git::Repository#rugged | 110 | 9805.97 | 136.99 | 10971.04 |
| Gitlab::Git::Repository#rev_parse_target | 68 | 9171.58 | 102.32 | 14365.43 |
| Gitlab::Git::Blob.find | 36 | 7718.02 | 101.37 | 9403.49 |
| Gitlab::Git::Repository#branch_count | 103 | 4587.89 | 157.91 | 8061.42 |
| Gitlab::Git::Repository#empty? | 35 | 4576.5 | 118.32 | 4890.1 |
| Gitlab::Git::Commit.decorate | 45 | 4178.67 | 100.62 | 4827.85 |
| Gitlab::Git::Blob.find_entry_by_path | 25 | 3869.27 | 131.86 | 4455.82 |
| Project#empty_repo? | 104 | 3388.37 | 158.02 | 6364.22 |
| Gitlab::Git::Repository#branch_names | 85 | 3044.43 | 106.44 | 5610.8 |
| Gitlab::Git::Repository#branches | 86 | 3013.46 | 106.37 | 5610.69 |
| BroadcastMessage.current | 52 | 2879.07 | 105.47 | 4119.16 |
| Gitlab::Git::Repository#root_ref | 76 | 2218.09 | 107.79 | 2559.01 |
| Gitlab::Git::Repository#discover_default_branch | 76 | 2218.05 | 107.76 | 2558.96 |
| Project#default_branch | 28 | 2013.66 | 461.43 | 2047.47 |
| User#can_create_project? | 65 | 1827.47 | 103.62 | 2034.53 |
| User#projects_limit_left | 65 | 1827.36 | 103.49 | 2034.39 |
| User.find_by_authentication_token | 26 | 1425.22 | 103.87 | 1729.6 |
| Project#external_issue_tracker | 53 | 1364.35 | 104.5 | 1907.96 |
| License.block_changes? | 44 | 1261.76 | 100.58 | 1927.74 |
| License.current | 52 | 1180.08 | 100.36 | 1927.11 |
| Project#get_issue | 31 | 1146.23 | 110.02 | 1152.63 |
| Project#owner | 70 | 973.12 | 103.36 | 1642.53 |
| Group#owners | 88 | 827.69 | 102.07 | 1921.78 |
| User#authorized_projects | 29 | 537.54 | 113.93 | 558.11 |
| Project.find_with_namespace | 68 | 439.69 | 104.71 | 461.44 |

Here we can see that a lot of these Git calls are shared by different Rails actions (= controller + method). For this code there's an issue at gitlab-org/gitlab-ce#12443 and a discussion about caching certain operations at gitlab-org/gitlab-ce#12811.
-
Personal TODO for the coming days:
- Measure simple ping timings from a worker to the NFS and database. (ICMP seems to be disabled on either the workers or the NFS, as a ping from worker11 doesn't yield a response.)
- Measure read/write (e.g. just a dumb "Hello world") timings
- Measure simple `Rugged` (as in, without involving `Gitlab::Git`) operations from a worker (e.g. grabbing the latest commit of the master branch)

The idea of the above is to get a better understanding of the baseline latency that's present in GitLab.com. For example, if reading from a text file already takes 50 ms, more complex Git operations will only end up taking longer.
-
mentioned in issue gitlab-org/gitlab-ce#12442
-
To measure simple reads/writes I SSH'd into worker5 and ran the following scripts:
bench_reads.rb
```ruby
require 'benchmark'

path = '/var/opt/gitlab/git-data/repositories/yorickpeterse/cat-pictures.git/HEAD'

timings = []
start = Time.now

while Time.now - start <= 10
  timings << Benchmark.measure { File.read(path) }.real * 1000
  sleep 0.1
end

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This outputs:

```
Minimum: 0.0477 ms | Maximum: 5.569501 ms | Average: 0.15574003 ms
```

bench_writes.rb

```ruby
require 'benchmark'

path = '/var/opt/gitlab/git-data/repositories/yorickpeterse/test.txt'

timings = []
start = Time.now

File.open(path, 'w') do |handle|
  while Time.now - start <= 10
    timings << Benchmark.measure { handle.write('hello') }.real * 1000
    sleep 0.1
  end
end

File.unlink(path)

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This outputs:

```
Minimum: 0.0069 ms | Maximum: 0.0284 ms | Average: 0.009199 ms
```

Timings
- Average read time: 0.15574003 ms
- Average write time: 0.009199 ms
Based on these very basic scripts, it appears that primitive operations don't perform that badly.
-
To benchmark the time it takes to initialize a `Rugged::Repository` (which does some IO calls) I used the following script:

```ruby
require 'benchmark'
require 'rugged'

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'

timings = []
start = Time.now

while Time.now - start <= 10
  timings << Benchmark.measure { Rugged::Repository.new(path) }.real * 1000
  sleep 0.1
end

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

Using worker5 this spits out:

```
Minimum: 0.7886 ms | Maximum: 13.667702 ms | Average: 2.18976976 ms
```

-
Next I decided to put some extra pressure on the NFS while running a benchmark. For this I used the following script:
```ruby
require 'benchmark'
require 'rugged'

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'

timings = []
start = Time.now
stress = true

threads = 5.times.map do
  Thread.new do
    while stress
      Rugged::Repository.new(path)
    end
  end
end

while Time.now - start <= 10
  timings << Benchmark.measure { Rugged::Repository.new(path) }.real * 1000
end

stress = false

threads.each(&:join)

timings = timings.sort
avg = timings.inject(:+) / timings.length
min = timings[0]
max = timings[-1]

puts "Minimum: #{min.round(8)} ms | Maximum: #{max.round(8)} ms | Average: #{avg.round(8)} ms"
```

This basically starts 5 threads that constantly initialize a repository while the main thread does the same and measures the timings. The idea is to see if/how high load influences timings. This script spits out the following:

```
Minimum: 0.4011 ms | Maximum: 1703.291789 ms | Average: 8.02635465 ms
```

-
Next I adjusted the script to dump the timings to a text file so I could plot it. The raw timings can be found at timings.txt.
This was plotted using the following gnuplot script:
set title "Timings" set xlabel "Iteration" set ylabel "Time (ms)" set rmargin 10 set term svg size 1400,500 set grid set style line 1 lc rgb "#0060ad" lt 1 lw 1 set style data lines plot "/tmp/timings.txt" using 1 ls 1 title "Timings"The graph looks as following:
What's interesting here is the recurring spike of ~500 ms. Based on this it looks like the NFS works fine when there's little load but starts to misbehave when load increases.
-
@yorickpeterse great information! If we have any insight into our peak traffic times, we should perform these tests during that window for the most accurate representation of NFS latency. Also, we could see varying latency depending on the hardware provided by Azure and what other types of load are being applied to the host machine.
-
mentioned in issue gitlab-org/gitlab-ce#12619
-
https://gitlab.slack.com/archives/operations/p1454513429000802
SQL stats: Blocks Read: 29.76/s, Fetches: 833614.48/s, Commits: 4169.81/s, Deletes: 0.40/s, Updates: 19.87/s, Inserts: 3.54/s
Fetches => Number of table rows fetched by bitmap scans when argument is a table, or table rows fetched by simple index scans using the index when argument is an index
Almost 1m fetches a second! We should really bring that down.
https://gitlab.slack.com/archives/operations/p1454598062000853 If you look at some pages in GitLab they run up to 100 queries on every load, so multiply this by a couple of thousand users online and it actually does make sense.
-
@yorickpeterse will add a counter that counts how many SQL queries are done per URL
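For reference, a per-URL query counter can be sketched with `ActiveSupport::Notifications` and a small piece of Rack middleware. This is just an illustration of the idea, not necessarily how it will be implemented:

```ruby
# Rough sketch of counting SQL queries per request path using Rack middleware
# and ActiveSupport::Notifications. Assumes single-threaded workers (Unicorn),
# since the subscriber would otherwise also count queries from other threads.
class QueryCounter
  def initialize(app)
    @app = app
  end

  def call(env)
    count = 0

    subscriber = ActiveSupport::Notifications.subscribe('sql.active_record') do |*, payload|
      # Skip ActiveRecord's own bookkeeping queries.
      count += 1 unless payload[:name] == 'SCHEMA'
    end

    @app.call(env)
  ensure
    ActiveSupport::Notifications.unsubscribe(subscriber)
    Rails.logger.info("#{count} SQL queries for #{env['PATH_INFO']}")
  end
end

# Hypothetical wiring in config/application.rb:
#   config.middleware.use QueryCounter
```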
-
mentioned in issue #68
-
@jnijhof will add the script from #42 (comment 3522792) to checkmk in #68
-
@yorickpeterse The goal is to get a time budget for the almost 10 seconds that a GET of gitlab-org/gitlab-ce#1 takes.
-
gitlab-org/gitlab-ce#1 TTFB is 7.76 seconds, and what got my attention is the amount of `mentioned` entries in the issue, which is causing this.
-
mentioned in issue gitlab-org/gitlab-ce#13241
-
@sytses some things that might be worth noting, based on some poking around I've done:
- `autocomplete_source.json` is loaded for every single Issue and Merge Request. For this issue, it took more than a second to load and includes every single issue, user, and merge request in a given project. This can't be good for the site's speed, and is an extra (in this case) 175KB that shouldn't be loaded unless explicitly needed. For the gitlab-ce repository, it's more than 350KB! Unless I'm missing something that necessitates this, it should not be served on every single page load.
- It looks like a file called `notes` is loaded every ~15 seconds which checks for updates to content on the page. This may not cause a huge performance burden, but it's something simple that may be beneficial to change to ~45 second intervals, if it's even necessary at all. I currently have 5 GitLab tabs open, meaning that I'm polling the `notes` endpoint 20 times per minute. These things add up.
- Emoji are being served in the aforementioned `autocomplete_source.json` file as well as in an 813KB PNG sprite file. Which are we using, and why are we loading these twice?
- The award emoji picker is loaded inline rather than dynamically with JavaScript. I would think something like that might be used in - at max - 3% of Issue page loads. Probably less. Could see slightly faster load times if that were taken out.
- Non-Gravatar avatar images (such as my own) are loaded at their full resolution, usually between 50KB and 150KB per avatar with a resolution around 800px - if not more. The Gravatar images are always less than 10KB, served over HTTP2, and are only 48px. Some huge potential gains to be had there.
Have you looked into using a CDN (I'd strongly recommend CloudFlare, I use it for my Rails apps)? Could be a huge performance improvement and take significant pressure off your servers. I realize it's not open source, but seeing as the CI Builds run on Digital Ocean and Azure, that's probably not a huge problem.
Sorry if this was a little long-winded, or if the tone sounds negative in any way. It's not intentional; it's just late at night here and my brain isn't working at full capacity :P
-
Testing issue #1 locally using Sherlock reveals the following:
- Out of the ~21 seconds it takes to load this issue (using my slower file system) only 0.45 seconds is spent running SQL queries.
- Looking at the line profiler output, most time appears to be spent in `app/models/repository.rb` and the Markdown rendering pipeline. Details below.
app/models/repository.rb
- 84% of the time (roughly 15 seconds) is spent in `Repository#empty?` on line 47
- 9% of the time (roughly 1.6 seconds) is spent in `Repository#has_visible_content?` on line 60
- 6% of the time (roughly 1 second) is spent in `Repository#root_ref` on line 483
- The remaining time is spread across a few different lines, with most timings taking only a few dozen milliseconds
lib/banzai/filter/relative_link_filter.rb
This file makes the calls to `Repository#empty?` as well as `Repository#exists?` in `Banzai::Filter::RelativeLinkFilter#linkable_files?` on line 32. All the time spent in this file is spent on this line.

Conclusion
So loading issues seems to be slow because it's doing a bunch of Git operations that don't have to be performed for every request. Rugged internally uses `git_repository_is_empty` to check if a repository is empty (https://libgit2.github.com/libgit2/#HEAD/group/repository/git_repository_is_empty), which as far as I can tell is only true before the first commit is ever pushed. For example, if you follow these steps:

1. Initialize a repo
2. Add a file and commit it
3. Remove said file, commit the changes
4. Run `Rugged::Repository.new('path/to/repo').empty?`

Then step 4 will return `false`. As a result of this I think we can just cache the output of `Repository#empty?` forever. Example shell session:

```
foo $ git init
Initialized empty Git repository in /tmp/foo/.git/
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
true
foo $ echo 'hello' > test.txt
foo $ git add test.txt
foo $ git commit -m 'Initial commit'
[master (root-commit) 134f396] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 test.txt
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
false
foo $ git rm test.txt
rm 'test.txt'
foo $ git commit -m 'Removed test.txt'
[master abeaeb0] Removed test.txt
 1 file changed, 1 deletion(-)
 delete mode 100644 test.txt
foo $ ruby -r rugged -e 'p Rugged::Repository.new(".").empty?'
false
```

Repository#has_visible_content?
The output of `Repository#has_visible_content?` can also be cached, as it simply checks if the number of branches is greater than 0. This cache could be flushed whenever the last branch of a repository is removed (we might need an extra hook of sorts for this).

cc @joshfng
-
Applying the following patch reduces the loading time of #1 locally to around 4 seconds (instead of 21):
```diff
diff --git a/app/models/repository.rb b/app/models/repository.rb
index e813c94..4f7f8d6 100644
--- a/app/models/repository.rb
+++ b/app/models/repository.rb
@@ -44,7 +44,7 @@ class Repository
   end
 
   def empty?
-    raw_repository.empty?
+    cache.fetch(:empty) { raw_repository.empty? }
   end
 
   #
@@ -57,7 +57,9 @@ class Repository
   # This method return true if repository contains some content visible in project page.
   #
   def has_visible_content?
-    raw_repository.branch_count > 0
+    cache.fetch(:has_visible_content) do
+      raw_repository.branch_count > 0
+    end
   end
 
   def commit(id = 'HEAD')
```

-
`Repository#root_ref` can probably also be cached, as it simply returns the name of the default branch. The cache for this can be flushed whenever the user changes the default branch. Caching this further reduces the loading time to around 2.8 seconds. The diff would be:

```diff
diff --git a/app/models/repository.rb b/app/models/repository.rb
index e813c94..086e989 100644
--- a/app/models/repository.rb
+++ b/app/models/repository.rb
@@ -44,7 +44,7 @@ class Repository
   end
 
   def empty?
-    raw_repository.empty?
+    cache.fetch(:empty) { raw_repository.empty? }
   end
 
   #
@@ -57,7 +57,9 @@ class Repository
   # This method return true if repository contains some content visible in project page.
   #
   def has_visible_content?
-    raw_repository.branch_count > 0
+    cache.fetch(:has_visible_content) do
+      raw_repository.branch_count > 0
+    end
   end
 
   def commit(id = 'HEAD')
@@ -480,7 +482,7 @@ class Repository
   end
 
   def root_ref
-    @root_ref ||= raw_repository.root_ref
+    @root_ref ||= cache.fetch(:root_ref) { raw_repository.root_ref }
   end
 
   def commit_dir(user, path, message, branch)
```

-
mentioned in merge request gitlab-org/gitlab-ce!2752
-
WIP merge request to resolve the above: gitlab-org/gitlab-ce!2752
-
mentioned in issue gitlab-com/support-forum#382
-
Created another merge request that makes our cache flushing less aggressive: gitlab-org/gitlab-ce!2769. tl;dr: previously branch statistics were flushed upon every push (for all branches); now we only flush for branches where it's actually needed. A push to the default branch still results in all branch caches being flushed.
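Conceptually the change boils down to scoping cache keys per branch so a push only expires what it touched. A hedged sketch of the idea (not the MR's actual code, and the cache key layout is made up):

```ruby
# Hedged sketch of the "flush only what's needed" idea in !2769, not its
# actual code. Branch-scoped cache keys let a push expire only the caches
# for the branches that were actually updated.
def expire_branch_caches(repository, pushed_branches, default_branch)
  if pushed_branches.include?(default_branch)
    # A push to the default branch can change what most pages display,
    # so in that case every branch cache is still flushed.
    repository.branch_names.each { |name| expire_cache_for(repository, name) }
  else
    pushed_branches.each { |name| expire_cache_for(repository, name) }
  end
end

def expire_cache_for(repository, branch_name)
  # Hypothetical cache key layout; Rails.cache stands in for GitLab's cache.
  Rails.cache.delete([:branch_stats, repository.path_with_namespace, branch_name])
end
```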
-
mentioned in issue gitlab-org/gitlab-ce#13299
-
mentioned in issue gitlab-org/gitlab-ce#13300
-
mentioned in issue gitlab-org/gitlab-ce#13301
-
mentioned in issue gitlab-org/gitlab-ce#13302
-
mentioned in issue #75
-
@connorshea thank you for your suggestions in #42 (comment 3588320), which I split up into:
- autocomplete gitlab-org/gitlab-ce#13299
- note check time gitlab-org/gitlab-ce#13300
- dynamic award emoji picker gitlab-org/gitlab-ce#13301
- load smaller versions of non-gravatar avatars gitlab-org/gitlab-ce#13302
- use a CDN for GitLab.com #75
-
@connorshea Thanks for the comment/info! Sorry I completely overlooked it in my whirlwind of replies above.
-
mentioned in issue gitlab-org/gitlab-development-kit#94
-
@sytses thank you for splitting them up into issues :)
@yorickpeterse no problem! Your empty repository improvement is awesome.
-
@jnijhof What are the values on the Y axes in these graphs, the time it takes to initialize a Git repo?
-
@jnijhof So in this graph:
Does this mean it takes 800 milliseconds to initialize a repository? If not, maybe it's best to run through these graphs and their meanings during the devops meeting; that saves pinging 17 participants in this issue :)
-
To test the impact of 8.5 RC1 so far I ran the following query:
select percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-01-11 14:25:00' and time < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';This spits out the 95th percentile of gitlab-org/gitlab-ce#1 for the past 24 hours leading up to today's deploy. The output is:
name: rails_transactions ------------------------ time duration_95th 1452522300000000001 11.419827557000001In other words, for the 24 hours leading up to the deploy the 95th percentile timing was roughly 11.5 seconds.
Next I ran the following query:
percentile("duration", 95) / 1000 as "duration_95th" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';This calculates the 95th percentile since today's 8.5 deploy, the output is as following:
name: rails_transactions ------------------------ time duration_95th 1455200700000000001 3.480613037In other words, we went from a 95th percentile of ~11.5 seconds down to ~3.5 seconds, an improvement of roughly 3.3 times.
-
And for the sake of it, here are the timings when we calculate the mean instead of the 95th percentile:
```
> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and timestamp > '2016-01-11 14:25:00' and timestamp < '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time  mean
0     8.047509426767022

> select mean("duration") / 1000 as "mean" from rails_transactions where action = 'Projects::IssuesController#show' and time > '2016-02-11 14:25:00' and request_uri = '/gitlab-org/gitlab-ce/issues/1';
name: rails_transactions
------------------------
time                 mean
1455200700000000001  2.933122648561403
```

-
@yorickpeterse what if we run that query for all controller actions?
-
@yorickpeterse forget about my question, I can see that other controllers also improved greatly (from 2 seconds to 1 second): http://yorick-influxdb.gitlap.com/dashboard/db/projectscontroller
-
mentioned in commit gitlab-com/www-gitlab-com@19420897
-
The speed of GitLab.com is not yet where we want it to be, but the worst is over with the 3x speed increase. I've removed the warning on our website (gitlab-com/www-gitlab-com@19420897) as we keep working to increase the speed further.
-
Yorick is working on increasing the speed of the git commands in gitlab-org/gitlab-ce#12443
-
Little factoid about GitLab.com: in the ~15 hours before the 8.5.0 RC1 deploy (the 11th at ~14:00 UTC) our timings were as follows:

- 95th percentile: 2.026 sec
- 99th percentile: 6.169 sec
- mean: 658.6 ms

Since the deploy (until now) the timings are as follows:

- 95th percentile: 1.448 sec (= 1.4x faster)
- 99th percentile: 3.779 sec (= 1.6x faster)
- mean: 462.8 ms (= 1.4x faster)
-
Per this comment in a thread on switching to GitLab, pulling the same repo from GitHub vs. GitLab, GitLab takes significantly longer:
GitHub:
```
❯❯❯ time git pull origin master
From github.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull origin master  0.02s user 0.04s system 4% cpu 1.396 total
```

GitLab:

```
❯❯❯ time git pull gitlab master
From gitlab.com:CWSpear/local-persist
 * branch            master     -> FETCH_HEAD
Already up-to-date.
git pull gitlab master  0.02s user 0.02s system 0% cpu 6.837 total
```

-
```
# GitLab
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://gitlab.com/connorshea/connorshea.github.io.git
Cloning into 'connorshea.github.io'...
remote: Counting objects: 708, done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 708 (delta 370), reused 708 (delta 370)
Receiving objects: 100% (708/708), 21.46 MiB | 10.70 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.216s
user    0m0.631s
sys     0m0.366s

# GitHub
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://github.com/connorshea/connorshea.github.io.git connorshea.github.io2
Cloning into 'connorshea.github.io2'...
remote: Counting objects: 708, done.
remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708
Receiving objects: 100% (708/708), 21.46 MiB | 9.87 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m2.881s
user    0m0.680s
sys     0m0.429s

# GitHub
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://github.com/connorshea/connorshea.github.io.git connorshea.github.io3
Cloning into 'connorshea.github.io3'...
remote: Counting objects: 708, done.
remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708
Receiving objects: 100% (708/708), 21.46 MiB | 8.58 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.095s
user    0m0.693s
sys     0m0.466s

# GitLab
Connors-MacBook-Pro:test-gitlab-perf connor$ time git clone https://gitlab.com/connorshea/connorshea.github.io.git connorshea.github.io4
Cloning into 'connorshea.github.io4'...
remote: Counting objects: 708, done.
remote: Compressing objects: 100% (300/300), done.
remote: Total 708 (delta 370), reused 708 (delta 370)
Receiving objects: 100% (708/708), 21.46 MiB | 10.90 MiB/s, done.
Resolving deltas: 100% (370/370), done.
Checking connectivity... done.

real    0m3.337s
user    0m0.626s
sys     0m0.369s
```

Interestingly, when cloning a new repo from scratch (in this case it isn't a very big repo), GitHub leads by only about half a second. Tested from Colorado.

The most notable difference between the two that I can find is that GitHub has `remote: Total 708 (delta 0), reused 0 (delta 0), pack-reused 708` whereas GitLab has `remote: Total 708 (delta 370), reused 708 (delta 370)`. Not sure what that means exactly, but maybe it makes a difference?
-
@connorshea Thanks! That first example appears to be using Git over SSH, which indeed is quite a bit slower than Git over HTTP at the moment. See #99 for some more information on this.
-
mentioned in issue gitlab-org/gitlab-ce#13889
-
Merge request gitlab-org/gitlab-ce!2987 will make global search around 10x faster (at least). For example, the time for the search query "yorick" went from > 30 seconds to 2-3 seconds. Searching for "git" now takes around 5-6 seconds instead of timing out (meaning it used to take over 2 minutes at least).
Project specific search queries have also been improved. For example, searching for "git" in the GitLab CE repository takes around 2-3 seconds.
This applies to setups using CE's search feature; these changes do not apply to Elasticsearch based searching.
-
Merge request gitlab-org/gitlab-ce!3123 (and gitlab-org/gitlab_git!69) will hopefully make displaying commit logs faster. The final impact of these merge requests on GitLab.com has yet to be tested, though.
-
Merge request gitlab-org/gitlab-ce!3128 adds some methods for counting and caching the number of branches and tags. Using these methods the loading time of a project's branches page can be improved by up to 1.45x; in my local setup the timings went from 6.65 seconds on average to 4.6 seconds.
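For illustration, the caching pattern is roughly the following (not the actual code in !3128; a memory store stands in for GitLab's Redis-backed cache):

```ruby
require 'rugged'
require 'active_support'
require 'active_support/cache'

# Illustration of the branch/tag count caching idea (not the code in !3128).
CACHE = ActiveSupport::Cache::MemoryStore.new

def branch_count(repo_path)
  # Serve the counter from the cache; only hit Rugged/Git on a cache miss.
  CACHE.fetch([:branch_count, repo_path]) do
    Rugged::Repository.new(repo_path).branches.count
  end
end

def tag_count(repo_path)
  CACHE.fetch([:tag_count, repo_path]) do
    Rugged::Repository.new(repo_path).tags.count
  end
end

# To be called from the post-receive hook so counters stay accurate after a push.
def expire_count_caches(repo_path)
  CACHE.delete([:branch_count, repo_path])
  CACHE.delete([:tag_count, repo_path])
end

path = '/var/opt/gitlab/git-data/repositories/gitlab-org/gitlab-ce.git'
puts branch_count(path) # slow: hits Rugged
puts branch_count(path) # fast: served from the cache
```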
-
gitlab-org/gitlab-ce!3158 will make retrieving CI services faster, which seems to have quite an impact on the performance of merge request pages (see the graph in gitlab-org/gitlab-ce#14186).
-
mentioned in issue gitlab-org/gitlab-ce#13651
-
I'm currently looking into the following two issues:
The first issue affects viewing individual issues and has a pretty significant impact on performance. The second issue affects basically everything where we render Markdown, though it will probably be quite hard to resolve as the Markdown rendering pipeline is quite complex.
-
mentioned in issue gitlab-org/gitlab-ce#14685
-
People are rightfully complaining that loading a list of issues shouldn't take 3 seconds: https://news.ycombinator.com/item?id=11375663
@yorickpeterse do you think http://stats.pingdom.com/81vpf8jyr1h9/1902794 is taking 4 seconds because of markdown rendering?
-
BTW people are noticing that GitLab is becoming faster in https://news.ycombinator.com/item?id=11375912 and https://news.ycombinator.com/item?id=11375748
-
@sytses Yes, rendering Markdown takes up quite a lot of time especially as the number of notes grows (it seems to grow exponentially even).
-
I'm working on making Markdown rendering faster, the WIP merge request is gitlab-org/gitlab-ce!3389.
-
Wanted to note that we don't have GZIP enabled on GitLab.com, from gitlab-org/gitlab-ce#14344:
> Per Google PageSpeed, we could reduce this page's size by 1.4MB by enabling gzip. That's a huge difference.

PageSpeed gives that issue page a 37/100 on Desktop and a 31/100 on Mobile. We shouldn't base everything off the PageSpeed stats, but we should aim to have at least a 70/100 on both. I don't claim to be a security researcher, but from what I can tell disabling gzip is an unnecessary precaution, and I wanted to make sure it was discussed more. Based on gitlab-org/gitlab-ce#4088, I think this can just as easily be mitigated by removing SPDY support.
-
Merge request gitlab-org/gitlab-ce!3389 was merged and is currently running on GitLab.com along with the other changes that made it into 8.7.0 RC1. I'll continue to monitor performance in the next hour or so to see what the impact of this RC is. After that I'll continue looking into the performance of `Issue#referenced_merge_requests` and associated code, as this appears to currently be the main cause of loading issues being slow.
-
8.7.0 RC1 sadly does not have an impact as drastic as it had locally. Timings for the 24 hours leading up to the deploy (today 12:00 UTC) are as follows:
- Mean: 798.15 ms
- 95th percentile: 2.439 sec
- 99th percentile: 5.939 sec
After the deploy until now the timings are as follows:
- Mean: 769.06 ms
- 95th percentile: 2.462 sec
- 99th percentile: 5.648 sec
Note that these timings are the timings of all requests across all controllers.
In gitlab-org/gitlab-ce#14315 (comment 4596105) and gitlab-org/gitlab-ce#14315 (comment 4596248) I noted that it seems quite a bit of time in the Markdown pipeline is spent fetching data from Redis. Moving the caching to the database (sketched after the list below) would help because:
- The records to render are already fetched, thus the column containing the pre-rendered HTML (if any) is also already there. This removes the need for any extra network calls.
- Data cached in Redis expires after 2 weeks (IIRC); storing this in the DB works around that. The cache expiring means that every now and then we'd have to re-build the HTML for all data involved (e.g. we'd have to rebuild the HTML of all notes in this particular issue)
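A rough sketch of what database-backed caching could look like, assuming a hypothetical `note_html` column on `notes` (this is just the idea, not an actual implementation plan):

```ruby
# Rough sketch: a hypothetical `note_html` column stores the pre-rendered HTML
# next to the raw Markdown, so rendering a note needs no extra Redis round trip.
class AddNoteHtmlToNotes < ActiveRecord::Migration
  def change
    add_column :notes, :note_html, :text
  end
end

class Note < ActiveRecord::Base
  # Re-render whenever the raw Markdown changes.
  before_save :update_note_html, if: :note_changed?

  def rendered_note
    # Fall back to rendering on the fly for rows written before the column existed.
    note_html || render_markdown(note)
  end

  private

  def update_note_html
    self.note_html = render_markdown(note)
  end

  def render_markdown(text)
    # Placeholder for the Banzai/Markdown rendering pipeline.
    Banzai.render(text, project: project)
  end
end
```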
Since this will take a few days (at least) to implement I'll first look into loading the "Related merge requests" section of an issue asynchronously. While this won't make that part faster it will at least cut down the loading time of the document request when loading an issue.
It seems I also broke Markdown rendering in some odd cases (gitlab-org/gitlab-ce#14936) so I'll have to fix that as well.
Git-wise, when viewing individual issues there's still Git code involved that's slow as a snail. For example, `Rugged::BranchCollection#each` seems to be called somewhere and in the worst case takes around 1.6 seconds to complete. There also seems to be revision parsing going on using `Rugged::Repository#rev_parse` (taking up to half a second); this is probably used for rendering commit references. Other than that there are a handful of random Git related methods running, but most only take 20-30 milliseconds with an occasional outlier (again probably used for rendering Git related references).
-
mentioned in issue gitlab-org/gitlab-ce#14949
-
mentioned in issue #182
-
Thanks @yorickpeterse, removing `Rugged::BranchCollection#each` calls in individual issues looks important to me. Good idea to make "Related merge requests" async.

BTW I've made a longer term issue in #182
-
@yorickpeterse I'd be interested in seeing if changes in the next RC make any difference. Are those timings for serverside rendering or clientside? If clientside, I would think the 99th percentile would decrease some with the smaller page size, although I may be wrong.
-
@connorshea All timings are server side. Due to InfluxDB overwriting data I have to re-verify what the actual impact of RC1 was (as I no longer trust the data we had at that time). Our timings did quite significantly drop the moment we fixed InfluxDB (see #175) so maybe RC1 did have the desired impact.
-
@yorickpeterse have you been tracking time to DOMContentLoaded via Piwik at all?
-
@connorshea I'm not sure what exactly we're tracking in Piwik. Maybe @jnijhof knows more about that.
-
Worth mentioning: I managed to more or less confirm that on the page used for displaying single issues (`Projects::IssuesController#show`) Git code is no longer the main offender, at least for 95% of the requests:

Here the little boxes at the top (ignore the colors of the text, I'm still making adjustments to the thresholds) show the 95th percentiles of the various timings. There are still some random spikes (e.g. in one instance `Rugged::Repository.new` took over 7 seconds), but in 95% of the requests we only seem to spend around 85 milliseconds in either `Gitlab::Git` or `Rugged` code.
-
It seems to me there is a negative correlation between the number of requests and the timings, which I find a bit weird and perhaps even suspicious: the peak number of requests occurs between 17:00 and 18:00, which is also a period of low mean timing. Likewise, the spike in requests around 20:30 or so corresponds to a very low mean timing.
I would instinctively expect that the high number of requests would increase the load on the database, and lead to (s)lower performance.
-
@elygre There's no guarantee that more requests equals higher timings as this all depends on what those requests are doing. There may also be external factors (e.g. requests for other controllers) that can impact timings.
-
@yorickpeterse Absolutely, it just seemed systematic, and hence both interesting, weird and suspicious :-)
-
mentioned in merge request gitlab-org/gitlab-ce!3804
-
mentioned in issue gitlab-com/support-forum#453
-
All action is now happening in the issues labeled "performance": https://gitlab.com/gitlab-org/gitlab-ce/issues?scope=all&sort=updated_desc&state=opened&utf8=%E2%9C%93&label_name%5B%5D=performance Please consider contributing code.
-
@sytses do you think it makes sense to keep this one open now that we have all the other ones?
-
Added direction performance labels
-
Milestone changed to Milestone-1
-
mentioned in issue gitlab-org/gitlab-ce#17693
-
Small status update since it's been a while in this issue:
- gitlab-org/gitlab-ce!3969 adds support for batch retrieval of Markdown references and refactors the code used for getting participants of an `Issuable`. This MR as a whole greatly cuts down the number of queries executed when viewing individual issues/merge requests (or any code that happens to get participants of a `Participable`)
- gitlab-org/gitlab-ce!4362 cuts down the number of queries needed to render HTML in `UserReferenceFilter` by not running a SQL query for every username reference (see the sketch after this list)
- I'm working on another set of changes that cut down the number of queries executed in `IssueReferenceFilter` when rendering issue references. This is still a giant hack so there's no MR available for this yet (I hope to create one tomorrow or so)
- gitlab-org/gitlab-ce!4378 and gitlab-org/gitlab-ce!4377 aim to cache some simple counters (e.g. running build counters) to further reduce the number of queries in a bunch of places
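The core trick in these reference MRs is replacing one-query-per-reference lookups with a single batched query plus an in-memory lookup table. A simplified sketch of the pattern (not the actual filter code):

```ruby
# Simplified sketch of the batching pattern used to avoid one SQL query per
# reference; not the actual UserReferenceFilter/IssueReferenceFilter code.

# Before: one query per @username found in the document (N+1 queries).
def users_per_reference_slow(usernames)
  usernames.map { |name| User.find_by(username: name) }
end

# After: collect all usernames first, then resolve them with a single query
# and build a lookup table keyed by username.
def users_per_reference_fast(usernames)
  User.where(username: usernames.uniq).index_by(&:username)
end
```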
-
Merge request for cutting down SQL queries when rendering issue references (still a WIP): gitlab-org/gitlab-ce!4410
-
Major behaviour change with this: #276
-
@pcarranza thanks for linking that comment!
-
My pleasure, @sytses
-
@pcarranza great job!
It's starting to be visible on the graphs under status.gitlab.com.
Fingers crossed for the next performance enhancements!
-
mentioned in issue gitlab-org/gitlab-ce#15607
-
@yorickpeterse have your markdown rendering changes already landed? http://stats.pingdom.com/81vpf8jyr1h9/1902794 is still above 4 seconds
-
@sytses Yes, they landed in RC1. In various cases query counts dropped significantly, though timings were not reduced as much as I hoped. Having said that, we did have the noise of Unicorn workers being killed very quickly, and RC1 introduced some changes that ended up running extra SQL queries for every note rendered.
More information on this (and where the remaining amount of time appears to be spent) can be found in gitlab-org/gitlab-ce#15607 (comment 12372507).
-
@sytses that's the average over 7 days; the current numbers are down to ~3.5 seconds after the memory change that happened last Thursday. So I'm expecting that number to go below 4s, to roughly 3.5s, in the next few days.
-
mentioned in issue gitlab-org/gitlab-ce#18581
-
mentioned in issue gitlab-com/support-forum#576
-
Milestone changed to Milestone-3
-
mentioned in issue gitlab-com/infrastructure#57
-
mentioned in issue gitlab-com/infrastructure#59
-