Plugin / theme checksum verification #6

danielbachhuber · Jan 10, 2017

Just like we have wp core verify-checksums, it would be helpful to have wp (plugin|theme) verify-checksums to verify file checksums for installed plugins and themes.

However, WordPress.org doesn't currently publish plugin and theme checksums. We'd need to generate these and host them at a publicly accessible URL.

https://github.com/eriktorsner/wp-checksum is an existing project that implements this feature.

TheLastCicada · Jan 15, 2017

This would be a great security addition.

danielbachhuber · Aug 22, 2017

@schlessera Some thoughts on this while I'm thinking of it.

First, @eriktorsner has an existing checksum middleware implementation that he's graciously offered to let us crib from. I'll let him weigh in here with more details.

Second, the simplest possible infrastructure to go with would be flat files (no database). I've chatted with the corresponding WordPress.org folks about hosting. If our middleware application can generate flat files served by some API, then it will be fine to sync those flat files to a WordPress.org server (with rsync or similar).

Lastly, the SVN checkouts are going to be hundreds of GB if not TB. DreamHost (via @getsource) has volunteered a server for us to run the checksum generator on.

eriktorsner · Aug 22, 2017

Some thoughts:

First, for reference, my implementation is in two major parts:

1. The Worker
Keeps a local SVN repo in sync with the official repos using svnsync command on a cron schedule. After each sync, checks svn log to figure out what plugins or themes that have modifications. Any changed plugin or theme is checked out and checksums are calculated (md5 and sha256). Data is stored in a fairly simple 3-table structure.

2. The API
A Slim3 REST application with two important endpoints:
(a) /v1/checksum/{type}/{slug}/{version}
(b) /v1/file/{type}/{slug}/{version}

With the suggested approach of storing all checksum data in files rather than in a database, some changes will be needed but I think they are fairly small. From the top of my head, this is what I'd like to do:

Keep the existing worker infrastructure as is. Including the database
Whenever a plugin/theme is changed, pre-generate the resulting JSON document to a file structure
Rsync that file structure on to the target production environment

The Slim3 application would not be needed any longer. However, the worker would still need to be hosted somewhere with a PHP parser and a PDO compliant database (MySQL, haven't tried SQLite, but I'm sure it would work fine).

Currently, 446829 versions are indexed in the database, roughly 10-times the number of individual plugins and themes in the official repos. I trust this would be well within the limits of what a file system handles without performance issues, but please weigh in if you see a problem with this.

Using a file based approach, we'd need to look for another solution if we also want to serve individual files from a plugin (for local diffing). It's entirely possible to request files from the SVN web front end, but I guess we'd have to check with a sysadmin if additional load is welcome.

I currently have experimental mechanisms for getting checksums from a couple of premium plugin vendors (Gravity Forms and Easy Digital Downloads). In the interest of overall security for WordPress users, I'd really like to see a way we can keep this even if this ends up being hosting WordPress.org.

eriktorsner · Aug 22, 2017

Lastly, the SVN checkouts are going to be hundreds of GB if not TB. DreamHost (via @getsource) has volunteered a server for us to run the checksum generator on.

Current sizes (actual SVN repos, not checked out):
plugins: 69 Gb (1.7 million revisions)
themes: 18Gb (80k revisions)

eriktorsner · Aug 29, 2017

We had a Slack discussion about the server CPU intensity of using svnsync. To move things forward I conducted a test using a real world svn repo. During the test, roughly 3k revisions was synced (going from about revision 27k to 30k) at a speed of 10 revisions per second.

Server: Digital Ocean 4Gb Droplet. 2 CPU, Ubuntu 16.04
Client: MacBook Pro i7 2017
Sample frequency: 1 sample every 3 sec
Sample size: 100 = 300 seconds = 5 minutes

Method:

Server side CPU usage was measured using command top -b -p NNN > cpulog.txt where NNN is the pid of the svnserve process
Client side CPU usage was measured using a script from @griffman found here https://gist.github.com/griffman/220745
Synced up the two time series manually by aborting at the same time and selecting the 100 last entries. They are probably 1.5 seconds apart, so don't try draw conclusions about the relative position about where individual peaks occur in the time series, they are a little bit off.

During the 5 minute sample time, the server CPU hovered between 0-4% for most of the time. Average CPU usage was 1.2%, peak CPU usage 8.3%. On the client, the average CPU usage was 17% with a peak at 49%.

Note 1: The CPU usage was measured different tools, it's not obvious that they can be compared like for like.

Note 2: The results are kind of expected. Reading revisions from a repo is mostly disk I/O. Subversion stores the diff between two revisions in a separate file per revision. The svnsync process is really just a long series of requests for individual diffs going from revision n to n+1. So this aligns closely with how Subversion stores things in its revision database. This works well for all non-binary files, but I think the process might be a little bit more CPU intensive for images and other binary files, they probably need to be transformed to a format suitable for transfer. I suspect that the peaks in server side usage above might be from managing binary files in individual revisions.

On the receiving end, each diff is then merged and committed into the revision tree which is much more CPU intensive. As this happens, I'm fairly sure (please correct me if I'm wrong) that Subversion actually needs to take the content of the old file, apply the diff and then store the new file (compressed, in the file rep-cache.db) along with the diff itself in its separate revision file. All in all, much more work for the CPU.

Note 3: From the server's perspective, svnsync is roughly the same as doing svn up but instead of overwriting files in a working directory as the normal svn client, svnsync commits the diffs to a local repo for full history (not unlike git).

Note 4: Committing new revisions takes more time the more revisions the repository has. The speed in this test (10 revisions/sec) is not realistic when we're at revision 1.7M. We're going to see well under 1 revision/sec which will further decrease the average load we put on the server.

Conclusion: Using svnsync to keep a separate repo in sync with the official plugin repo is not very hard on the Subversion server. It's going to spend most of its CPU time committing things from developers, we're just interested in reading individual revisions which is cheap in comparison.

schlessera · Aug 30, 2017

@eriktorsner Wow, great work on that!

With the setup you did there, is it possible to test the bandwidth consumption between an svnsync, and a simple copy of the same data?

schlessera · Aug 30, 2017

...oh, and maybe an rsync might be interesting as well.

eriktorsner · Aug 30, 2017

I did some additional testing to see what kind of network effects svnsync has in real life. This svnsync was done over ssh+svn which uses an ssh tunnel rather than https which is the case for the WordPress repos, but I think we're talking about the same order of magnitude in regards of resources.

The server is the same Digital Ocean server as in my last comment and I usually get 60Mbit download speed from my home office.

First, I've synced a remote copy of a large repository about for about an hour. Total work done was:
Revisions: 10288 (from 33912 to 44200)
Time: 3669 seconds (1h 1m 9s)
Speed: 2.8 revisions/s
Disk size increase: 156Mb (from 537Mb to 693Mb)
Network transferred: 150Mb (measured by nettop on Mac)
Average revision size: 14 Kb (transferred)
Average bandwidth: 335 kbit/s (1228800 bits transferred in 3669 s)

It's easy to see that the download rate isn't limited by bandwidth. At this rate, syncing all 1.7M revisions from the official plugin repo would take 168 hours/1 week, but from experience, I know that the rate goes down as the revision nr goes up, so in the real world, it's more like 2-3 weeks.

Next. I've rsync'ed a 67Gb svn repo with a total of 1.68M revisions (slightly smaller than the current size of the live plugin repo) between two devices on the same server (USB 3.0 flash drive to internal SSD). Even if bandwidth is this case is pretty much unlimited, there are just over 3.3 million files to sync and rsync also does some integrity checking on each file so things still take time. Here are some numbers:

$ find /media/svn2/plugins | wc -l
3376356
$ du -hs /media/svn2/plugins
67G	/media/svn/plugins/

$ time rsync -a /media/svn/plugins ~/src/svn/
real    62m30.717s
user   9m25.896s
sys     8m36.140s

As a next step, I allowed the source repo to svnsync 51 revisions from it's "parent" repo so that it was 51 revisions ahead compared to the copy created above. I then re-ran the rsync operation. In theory, there should be about 103 modified files to handle but I didn't try to count them first.

$ time rsync -a /media/svn/plugins ~/src/svn/
real    0m50.189s
user   0m16.580s
sys    0m17.012s

So once the two repos are fairly in sync, rsync can be used to quickly establish a perfect file by file copy.

The issue with rsync (or copy) is an all or nothing type affair. Subversions have a proprietary storage format that is very easy to mess up. Anything else than a perfect copy of a repository simply won't work. Somewhat simplified, each revision creates two new files, one for the diff and one for revision metadata. In addition to those, there's one large FSFS file (see https://stackoverflow.com/questions/19687614/what-does-fsfs-stand-for-as-related-to-subversion). If we rsync from a live repo that is receiving commits from developers, it will sometimes give us a repo copy that is just a little bit out of sync internally. I'm certain that svnsync offer better transaction integrity, it either gets the diff or not.

Regarding the question from @schlessera about the difference between svnsyncing data vs copying the same data. I guess it's hard to answer because the trick is to figure out what exact data to copy. Rsync will do a good job of finding individual files. Just to give some sort of comparison between svnsyncing 150Mb from the same server down to my laptop, here's a benchmark of the bandwidth + ssh overhead:

$ time scp server01:/root/dummy150M .
dummy                                                                                                                        100%  150MB   6.6MB/s   00:22    

real	0m23.496s
user	0m1.500s
sys	0m1.347s

About 51 Mbit/s

schlessera · Sep 1, 2017

@eriktorsner: Great stuff! This data will help us make the case for using svnsync.
I'll work on a proposal to clarify our plans, lay out the options we've examined, and the arguments behind our decision on one specific option. Once I've started an initial doc, I'll send you a link so we can go through this and discuss.
Thanks for all the effort so far!

Otto42 · Sep 29, 2017

Question: why don't we extend the w.org API to return checksums for plugin and theme files directly?

Otto42 · Sep 29, 2017

Also, svn info for a file returns the sha-1 checksum for it, iirc.

danielbachhuber · Sep 29, 2017

Question: why don't we extend the w.org API to return checksums for plugin and theme files directly?

We certainly can. My thought is to build the underlying infrastructure first and then incorporate it into WordPress.org infrastructure.

schlessera · Sep 29, 2017

@Otto42 Yes, the planned approach was to build the API under a separate URL to be able to quickly iterate on it, and then migrate it over to the w.org API when everything is finalized (similarly to how we work on feature plugins before merging into Core, to keep initial velocity high). Do you think this does not make sense in this specific case?

Re: svn info, this is true, but only for working copy paths. It is not included for remote files.

Otto42 · Sep 30, 2017

No, I meant, why do we need to do a process to sync the repo instead of simply generating the checksums on w.org?

When we build zip files for plugins or themes, we have all the files right there. Adding code to generate checksums would be relatively simple in that process. We store those, make an endpoint to serve them, done.

Doing this all externally seems like adding a ton of load for code that we would never use anyway because it's the wrong way to integrate it to start with.

Otto42 · Sep 30, 2017

My thinking is that for any such process, we only want to do the code to change things when they actually change. We have such a system, where we build zip files when the plugin changes. Add checksums to that, sure them in a new table and voila. Add a simple API call to return json data for, say, a list of packages and there you go.

schlessera · Sep 30, 2017

The reasoning behind building this on a separate server was that we don't want to add additional burden to the current SVN server. If we add this to the current server, and it happens to be more useful and successful than we'd like, it will be difficult to separate it again for scaling.

However, I have to admit that I don't know much about how the current API runs behind the scenes yet, and what resources the corresponding servers have. I'm happy to discuss this in more detail. In general, though, I think this might be something where we "erring on the safe side" might be useful.

Otto42 · Oct 1, 2017

We don't do the ZIP building on the SVN servers, we do it on the normal web servers. Basically, we use cavalcade for job scheduling. When a plugin is committed to, a job is added to there to be run. That job runs on one of the web servers. It builds the ZIP, updates the plugin directory database, etc. Since it's already doing svn export to get the files to build the ZIPs, it can also make checksums at that time. This means that checksums will only update when plugins actually update, which is obviously the right way to scale it.

Themes operate differently, but in a similar fashion with a cron job to update them from time to time.

Your suggested approach of syncing the SVN elsewhere seemingly adds far more load, not less. I'm thinking that you should examine the existing system for a proper integration instead of trying to create something new. Because whatever it is that you're thinking of creating will simply have to be thrown away in the end for something a lot more like what I'm suggesting here anyway.

anantshri referenced this issue Jan 19, 2017
Open
wp sec checks : some no nonsense security checks combined together and reported in one go #21

wp-cli/ideas

Plugin / theme checksum verification #6

danielbachhuber commented Jan 10, 2017

TheLastCicada commented Jan 15, 2017

anantshri referenced this issue Jan 19, 2017

wp sec checks : some no nonsense security checks combined together and reported in one go #21

danielbachhuber commented Aug 22, 2017

eriktorsner commented Aug 22, 2017

eriktorsner commented Aug 22, 2017

eriktorsner commented Aug 29, 2017

schlessera commented Aug 30, 2017

schlessera commented Aug 30, 2017

eriktorsner commented Aug 30, 2017

schlessera commented Sep 1, 2017

Otto42 commented Sep 29, 2017

Otto42 commented Sep 29, 2017

danielbachhuber commented Sep 29, 2017

schlessera commented Sep 29, 2017

Otto42 commented Sep 30, 2017

Otto42 commented Sep 30, 2017

schlessera commented Sep 30, 2017

Otto42 commented Oct 1, 2017

wp-cli/ideas

Join GitHub today

Plugin / theme checksum verification #6

Comments

danielbachhuber commented Jan 10, 2017

TheLastCicada commented Jan 15, 2017

anantshri referenced this issue Jan 19, 2017

wp sec checks : some no nonsense security checks combined together and reported in one go #21

danielbachhuber commented Aug 22, 2017

eriktorsner commented Aug 22, 2017

eriktorsner commented Aug 22, 2017

eriktorsner commented Aug 29, 2017

schlessera commented Aug 30, 2017

schlessera commented Aug 30, 2017

eriktorsner commented Aug 30, 2017

schlessera commented Sep 1, 2017

Otto42 commented Sep 29, 2017

Otto42 commented Sep 29, 2017

danielbachhuber commented Sep 29, 2017

schlessera commented Sep 29, 2017

Otto42 commented Sep 30, 2017

Otto42 commented Sep 30, 2017

schlessera commented Sep 30, 2017

Otto42 commented Oct 1, 2017