Black Duck Open Hub Blog

That was not fun

Peter Degen-Portnoy — Fri, 03 Mar 2017 10:34:21 +0000

The Open Hub is up and running again after a full day of being unavailable. We apologize for any inconvenience this unexpected downtime caused and want to share what we know about what happened.

In brief: while performing a minor version upgrade of our PostgreSQL database from version 9.4 to 9.6, the upgrade process had a catastrophic failure and we lost the entire database.

Fortunately, we had made a backup before starting the process, and were able to restore from it. However, we did lose a few days of data and changes. For that we are truly sorry.

We’ve done these upgrades before. As a general rule, we don’t like to get more than 2 minor revisions behind in anything in our stack. So, we planned for the upgrade, tested it rigorously in our staging environment, and carefully documented each step and command that would need to be executed. Normally we would only do this kind of work on a Sunday morning, when the Open Hub has the least amount of traffic.

The decision to proceed with the upgrade rests entirely with me as team lead.

We expected a 20 minute upgrade process, followed by an Analyze to generate the necessary statistics which could have taken up to an hour. We figured the site would be back up in less than 2 hours.

But very early in the process, one of the first pg_upgrade statements generated an error because the target data directory was erroneously entered as the mount point, owned by root, instead of a subdirectory owned by postgres. This should have simply generated the error; we would have fixed the command and continued on our way.
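
The kind of mistake described above (a mount point owned by root instead of a subdirectory owned by postgres) can in principle be caught before any upgrade command runs. The following is only a minimal pre-flight sketch, not part of our documented procedure: the function name check_target_data_dir and the idea of passing in the expected owner's numeric user id are assumptions made for illustration, and the check says nothing about what pg_upgrade itself does.

```python
import os
import tempfile


def check_target_data_dir(path, expected_uid):
    """Return a list of problems found with a candidate target data directory.

    Hypothetical helper: an empty list means none of these checks failed.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append("not a directory")
        return problems
    # A mount point itself is usually owned by root; the data directory
    # is expected to be a subdirectory beneath it.
    if os.path.ismount(path):
        problems.append("is a mount point, not a subdirectory of one")
    if os.stat(path).st_uid != expected_uid:
        problems.append("not owned by the expected user")
    return problems


if __name__ == "__main__":
    # Demonstration against a throwaway directory owned by the current user.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_target_data_dir(tmp, os.getuid()))
```

A check like this would run as the first step of the upgrade script and abort if the returned list is not empty.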

However, when we checked file systems, it was immediately apparent that the data directory in the original 9.4 location was completely gone, along with all our data. We’ve scoured the history files and the logs to see if there was anything else that could have been a factor, but do not see anything else. We have even read the source code of the pg_upgrade feature (available at https://doxygen.postgresql.org/pg__upgrade_8c.html#a3c04138a5bfe5d72780bb7e82a18e627).

We are now looking over the entire site and re-implementing the updates we know we’ve made since the database backup. Please don’t hesitate to ping us on Twitter at @bdopenhub, or contact us at info@openhub.net with any observations, insults, questions, or comments.

Open Hub in 2017

Peter Degen-Portnoy — Tue, 17 Jan 2017 21:02:01 +0000

Hail Hubbites!

We’d love to share some of the things that have been going on and will be going on here in Open Hub Land. We accomplished some very significant work in 2016 and would like to take a moment to lay it out and then talk about what we’d like to accomplish in 2017.

2016 Review

Please recall from our 2016 Review what we did in 2015: rebuilt the UI, addressed spam account creation, improved back-end performance (5X in some cases), started inventing new security data features. The plan for 2016 was to create a new Project Vulnerability Report and Project Security Pages, run the Spammer Cleanup Program, virtualize the back end (the FISbot project), switch to Ohcount4J, connect to other sites related to OSS. Here’s how we did:

Invented the Project Vulnerability Report algorithms and presentations
Prototyped Project Security Pages with the (now closed) security.openhub.net pages
Deployed FISbots and Ohloh Analysis onto virtual servers (this involved migrating some 10TB of OSS project data from multiple servers to a single SAN)
Started running batches of accounts through the Spammer Cleanup Program. To date, we’ve cleared out some 350,000 spam accounts (YAY!!)
Designed and implemented a Prototype Project Security Page to report known vulnerabilities in OSS projects. Collected user feedback from that experiment
Explored using Ohcount4J instead of Ohcount. Decided to stay with Ohcount.
Added a feature to add an entire GitHub account to a single Open Hub project
Numerous back end improvements and defect resolutions to consistently deliver web pages in under 200 ms (6X faster than 2015 on average)
Defended against a number of malicious attacks against our API service and web site (comes with the territory of running a non-trivial web application, amirite?)

There’s more though!

The FISbot was implemented as a stop gap measure to address issues we had with the back end bare metal crawlers. We were waiting for another project to provide a central set of Fetch, Import, and SLOC services to the Black Duck enterprise. The plan was to shut down the FISbots and use this other service. However, after deploying our FISbots, it was decided that we should expand the FISbot to handle the additional enterprise scenarios. So, completely unplanned at the beginning of the year, we implemented the eFISbot Project, which we also delivered last year.

Last point: as we talked about in the Details on the Infrastructure post, the migration of that 10TB collection of OSS project data onto the production server ran into serious issues that forced us to re-fetch every one of the nearly 600K code locations we monitor. This was a serious multi-month disruption, from which we have almost entirely recovered. We have re-fetched all the repositories, but there are lingering issues in getting all those repositories and corresponding projects refreshed in the 24 – 72 hour window we’ve set for ourselves.

So, in summary, we’ll add to our 2016 Review:

Implemented and delivered eFISbot
Survived the treacherous NFS SNAFU and the Great Code Location ReFetch

I feel it is also important that we mention again the passing of our friend and colleague Pugalraj Inbasekaran in February. I still feel his absence as an ache near my heart and miss him.

2017 Plan

We have a few main focuses for 2017:

Make the back end screamingly fast
Make it wicked easy to add projects from GitHub to the Open Hub and get data from the Open Hub into your GitHub pages
Continue the UI update with wider pages and more responsive layouts
Add new languages to Ohcount

For that back end, we’ve been given permission to obtain a new set of servers. Currently, the Open Hub runs off a single database (we’ve talked about that over and over again). We’ve put in a purchase request for 2 database servers that have over 4X the CPU cores and 9X the RAM. One server will be the master and the other the replica. These servers will support only Fetch, Import, SLOC and Analysis operations (write intensive), so we’re calling this the FISA DB. The current database will remain with the purpose of only presenting generated analysis (read intensive) through the Ohloh-UI application, so that will be the UI DB. We are SO VERY EXCITED!!! SQUEEEEEE!!! Ah. Sorry; sorry. Please excuse the author (but it’s SOO exciting!)
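
To make the planned split concrete, here is a minimal sketch of how operations might be routed between the two databases. It is illustrative only: the connection names "fisa-db" and "ui-db" and the function database_for are made up for this example and do not describe real hosts or our actual code.

```python
# Hypothetical connection names; real host names are not part of this post.
FISA_DB = "fisa-db"
UI_DB = "ui-db"

# The write-intensive operations named above.
WRITE_INTENSIVE = {"fetch", "import", "sloc", "analysis"}


def database_for(operation):
    """Pick a database by workload: FISA operations write, the UI reads."""
    if operation.lower() in WRITE_INTENSIVE:
        return FISA_DB
    return UI_DB


if __name__ == "__main__":
    for op in ("Fetch", "Analysis", "render_project_page"):
        print(op, "->", database_for(op))
```

The point of the split is that heavy back-end writes no longer compete with page rendering for the same database.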

As always, thank you so very much for being part of the Open Source Software community and your continued support of the Open Hub.

It’s Time to Select Our 2016 Open Source Rookies

Peter Degen-Portnoy — Fri, 09 Dec 2016 16:01:50 +0000

Looking forward to this year’s Rookies and looking back at Rookies past

This time of year is one of great anticipation at Black Duck. We are eagerly anticipating a very special delivery. A crew of helpers is busy putting together a list. It will be thoroughly checked and even checked twice. I wouldn’t say any on this list are naughty – in fact, most are pretty good. But we’re looking for the ones that are really really nice.

I’m speaking, of course, about the candidate list for the Black Duck 2016 Open Source Rookies. Each year, we review the open source projects started during the last 12 months and recognize those that stand out because of their mission, community growth, and market impact. A lot of great software is being built using Open Source, as was demonstrated by the 2015 Open Source Rookie Class, and we’re looking forward to our review of this year’s candidates.

You Can’t Win if You Don’t Enter

I’ve previously written about how we select the Open Source Rookies so I won’t go into detail about it here. Suffice to say that it’s a thorough process that starts when we pull data from our open source project database, OpenHub. OpenHub allows open source project contributors and teams to aggregate data about their projects and communities. While this is not the only data source we use, the information in it helps us get a more complete picture of what’s happening with each project.

Here’s where you come in. Remember that Christmas when you didn’t write to Santa and instead of getting that cool new video game you got socks? This is kind of like that. If you participate in or know of a new open source project that deserves a place in the 2016 Open Source Rookies, it will significantly improve the project’s chances of being selected if it has been registered in OpenHub by December 15th.

A Look Back at Prior Rookies

This will be the 9th year for Open Source Rookies and a quick look back shows you just how ambitious open source projects are, and how mainstream they have become. Of course, we’d like to think that these projects were helped, at least a little, by having been recognized as Black Duck Open Source Rookies.

Hashicorp Vault – Class of 2015

Rising Star with Open Source in its DNA

https://www.hashicorp.com/

We recognized Hashicorp last year for the launch of Vault, a framework for securely storing, accessing, and managing secrets across an enterprise, but most people probably know them as the team behind the popular development environment management solution, Vagrant. 2016 has been a good year for Hashicorp, who in September announced a $24 million series B funding round led by GGV Capital and Mayfield. We’ll be watching for more news from them in 2017.

Kubernetes – Class of 2014

Container Orchestration at Scale

http://kubernetes.io/

Google has been using containers for years to develop its current scale of technologies. At the summer 2014 DockerCon, the Internet giant open sourced a container management tool, Kubernetes, that was developed specifically to meet the needs of the exponentially growing Docker ecosystem. Since then, use and development of Kubernetes has flourished and it has become one of the most widely adopted orchestration solutions for management of large scale container-based deployments.

Docker – Class of 2013

Has raised over $180M in venture funding

https://www.docker.com/

Docker was a clear stand-out for us back in 2013. Few projects outside the highly corporate-sponsored arena garner the level of excitement and attention that Docker did. While Docker was started by a small, commercial firm previously known as dotCloud, it quickly caused industry heavy hitters like Red Hat and Google to take notice. Docker has revolutionized the way teams build scalable applications for the cloud. Since launch, Docker has raised an impressive $180M in venture funding. Many expect them to reach unicorn status if they go public.

Ansible – Class of 2012

Acquired by Red Hat in October 2015

https://www.ansible.com/

Managing a large number of servers on site or in the cloud can be a complex, time-consuming task, but Michael DeHaan, founder of Ansible, didn’t think it had to be that way.

“System managers shouldn’t have to worry about lots of complicated syntax,” he said. With a simpler approach to system orchestration, part-time sys-admins can do what they need to do, getting in and out quickly. Apparently Red Hat agreed and acquired Ansible in October of 2015.

Bootstrap – Class of 2011

Ubiquitous toolkit for responsive websites

http://getbootstrap.com/

Do you remember the dark days when most websites were designed and built to look great on a desktop monitor, but many of them were practically unusable if viewed on the small screen of a tablet or mobile phone? Mobile visitors now account for the majority of traffic on many websites so it’s important that your website be “responsive,” adapting to the different screen sizes while remaining usable and engaging. Bootstrap, a toolkit originated by Twitter, has become the foundation of many responsive websites, with base CSS and HTML for typography, forms, buttons, tables, grids, navigation and more.

NuGet – Class of 2011

Universal package manager for .NET development

https://www.nuget.org/

NuGet is a free, open source, developer-focused package management system for the .NET platform, designed to simplify the process of incorporating third party libraries into a .NET application during development. Because it was originally developed by developers from Microsoft and the .NET Foundation, it should come as no surprise that it has become a standard component of the development platform in many Windows-based software development environments. NuGet is now pre-installed as part of current versions of Microsoft Visual Studio.

OpenStack – Class of 2010

Orchestration Framework for the World’s Largest Clouds

https://www.openstack.org/

Originally developed as a collaboration between RackSpace Hosting and NASA, OpenStack is an open source, open standards platform for large scale cloud computing. Since 2010, OpenStack has grown tremendously and gained active support from over 500 companies, including industry giants like Oracle, HP, and Cisco. Many of the world’s largest clouds are built using OpenStack. If you use any cloud-based applications or services, it’s almost certain that some of them are running on OpenStack.

By any measure, that’s a pretty impressive list. Are there any projects launching this year that will have a similar impact on the software development industry? History suggests yes, and maybe it’s a project you are working on? If so, make sure it gets noticed by registering it on OpenHub. Maybe you too can join this illustrious group of Rookies turned All Stars!

Project Security

Lucy Wilcox — Tue, 04 Oct 2016 18:56:48 +0000

Hi Everyone! As we talked about in our post on the Open Hub in 2016, we are adding even more project security information to Open Hub projects. Not only that, but the project pages have also been widened! All new pages added to the Open Hub will take up the entire screen width, and the other pages will be updated over time.

You’ll find all the same content on the project pages, but now there is a project security row for projects that have had vulnerabilities reported against them. Remember, if a project has vulnerabilities, that is not strictly a bad thing; it means that the open source community is doing a good job of finding and fixing security flaws.

In order to help you assess if security vulnerabilities are affecting a version of a project you are using, reported issues in the ten most recent versions are now shown on project pages. To see vulnerabilities in previous versions and information on exactly which vulnerabilities are present, click into the Vulnerabilities per Version or Project Vulnerability Report header. This will take you to a page with more detail on each version with descriptions of each vulnerability and links to the National Vulnerability Database (https://nvd.nist.gov/), where the vulnerabilities we display are publicly available. When an Open Hub project has no security material on the page it means that there have been no vulnerabilities reported against it in the NVD.

Keep in mind that there may be vulnerabilities in projects which have not been found, or have not been reported in the NVD yet. This is especially pertinent for recent versions as contributors are actively in the process of finding and reporting issues. Vulnerabilities can be found at any point and sometimes live within code long before they are found. At Black Duck we collect a comprehensive vulnerability set from several additional data sources; however, only publicly accessible vulnerabilities are posted on the Open Hub. If you want to scan some of your code against Black Duck’s full vulnerability database you can do that through our Security Checker.

We have also revamped the Project Vulnerability Report, more on this here.

Lastly, each of the project pages now has a new Did You Know section that we hope highlights different features on the Open Hub that you might find useful and provides more context for OSS security.

Update: We’re doing it!

Peter Degen-Portnoy — Wed, 21 Sep 2016 17:53:47 +0000

Back-End Background

Here’s a quick summary of the issue about which we will be talking:

In mid-June, we moved off our bare-metal back-end crawlers into a virtualized environment. There were reasonable drivers and pressures pushing us to do this quickly and we found, about two weeks after the irreversible migration, that there were fundamental problems with the SAN storage that were unrepairable. Not only was this SAN unusable, it had caused an irrevocable loss of data quality in all of our repositories.

So we found a back-up for our back-end that would work and started the process of refetching all our repositories.

How Big is Big?

Just what does it really mean, this “refetching all our repositories” thing? The Open Hub is organized around projects. Each project may have zero or more enlistments, each of which is a mapping of a project to a code location. A code location may belong to more than one project. We currently have 675K projects on the Open Hub, of which 495K have enlistments. Those enlistments are comprised of 594K distinct code locations. Each of those code locations is what we mean when we talk about “repositories”: we have to re-fetch nearly 600K repositories from literally hundreds, if not thousands, of different servers.
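
For readers who think in code, the relationship between projects, enlistments, and code locations can be sketched as follows. The project names and URLs are invented for illustration and the function summarize is hypothetical; the real schema is not shown in this post.

```python
from collections import defaultdict

# Each enlistment maps one project to one code location (made-up data).
ENLISTMENTS = [
    ("project-a", "git://example.org/a.git"),
    ("project-a", "git://example.org/shared.git"),
    ("project-b", "git://example.org/shared.git"),
]
ALL_PROJECTS = ["project-a", "project-b", "project-c"]


def summarize(all_projects, enlistments):
    """Count projects, projects with enlistments, and distinct code locations."""
    locations_by_project = defaultdict(set)
    for project, location in enlistments:
        locations_by_project[project].add(location)
    # A code location shared by several projects is only fetched once.
    distinct_locations = {location for _, location in enlistments}
    return {
        "projects": len(all_projects),
        "projects_with_enlistments": len(locations_by_project),
        "distinct_code_locations": len(distinct_locations),
    }


if __name__ == "__main__":
    print(summarize(ALL_PROJECTS, ENLISTMENTS))
```

In this toy data, project-c has no enlistments and the shared location is counted once, which mirrors how 675K projects, 495K with enlistments, and 594K distinct code locations can all be true at the same time.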

We started with the most popular projects, which also tend to be some of the largest and most complex. We had to delete all the old job records, clear out a number of related data elements, and schedule new Complete Jobs (Fetch, Import and SLOC, or Source Lines Of Code counting) for each repository. We scheduled jobs for the first 300K projects in order of decreasing popularity. That generated some 550K jobs, most of which were Complete Jobs, but there were some Fetches as well (the scheduler has logic to determine which is best).

Completed Work

The Great Rescheduling started at the end of July (July 29) and quickly moved through 100K or so jobs. Things were looking good. That “back-up for our back-end” is a system from which another team was migrating. It will have ample storage for us when this other team has cleared off their files and it has sufficient storage now for us to have gotten started. But with two teams performing significantly heavy work on the same SAN device, we’ve loaded this system to its maximum capacity. As a matter of fact, we loaded it so heavily that we went through a few weeks of the server regularly hanging and interrupting both teams’ work (we got the vendor to help us clear out those issues).

Since July 29, we’ve worked through 95K projects, which represents some 128K repositories. Most repositories will use one Complete Job, but some will require three jobs (Fetch, Import and SLOC), and each project will have an additional Analysis Job, so the 100K jobs that were reported can cover far fewer than 100K repositories.
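
A small worked example of the job accounting described above, under the simplifying assumption that a repository needs either exactly one Complete Job or exactly three separate jobs, and that each project gets exactly one Analysis Job. The function name jobs_for_project is hypothetical.

```python
def jobs_for_project(repositories_use_complete_job):
    """Count jobs for one project.

    repositories_use_complete_job: one bool per repository, True when a single
    Complete Job covers it, False when it needs Fetch, Import and SLOC jobs.
    """
    repository_jobs = sum(
        1 if uses_complete else 3 for uses_complete in repositories_use_complete_job
    )
    # One additional Analysis Job per project.
    return repository_jobs + 1


if __name__ == "__main__":
    # A project with two repositories, one of which needs the three separate jobs.
    print(jobs_for_project([True, False]))
```

So a project with two repositories can account for five jobs, which is why job counts run well ahead of repository counts.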

This leaves almost 398K projects in need of an updated analysis and just over 3K new projects that have not yet had a first analysis. (It’s nice to see new projects being added to the Open Hub!) Understanding that there are currently 208K jobs remaining (from the original 550K jobs scheduled just about 8 weeks ago) helps explain why many projects have not had new analysis generated in the past two months. New job creation is blocked by the backlog of currently scheduled jobs. Oh, and we’ve manually scheduled updates for many, many projects when folks ask (we’re doing our best to keep up with requests, please drop us a line if you need something updated!).

You see, when the back-end job scheduler is all caught up, as it was before this upheaval, the majority of repositories would have been checked within the service window and did not need to be processed again. That’s when the job scheduler looks for new work to do — it searches for projects with no analysis or an out-of-date analysis and schedules brand new work for all the enlistments in that project. But since there is such a large backlog of existing jobs, the job scheduler never gets to the point of looking for new projects or stale projects. Nor will it until we can get through the backlog of initially scheduled work.
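
The scheduling behavior described above can be sketched roughly like this. It is a simplified illustration, not our scheduler's actual code: the function pick_next_job, the tuple-based job representation, and the 72-hour staleness threshold used in the demonstration are all assumptions.

```python
from datetime import datetime, timedelta


def pick_next_job(backlog, last_analysis_by_project, now, max_age):
    """Return the next job to run, or None if there is nothing to do.

    Already-scheduled jobs always come first; stale or never-analyzed
    projects are only looked for once the backlog is empty.
    """
    if backlog:
        return backlog.pop(0)
    for project, last_analysis in last_analysis_by_project.items():
        if last_analysis is None or now - last_analysis > max_age:
            return ("complete", project)
    return None


if __name__ == "__main__":
    now = datetime(2016, 9, 21)
    backlog = [("fetch", "popular-project")]
    projects = {"new-project": None}
    # The backlog entry wins; the new project has to wait its turn.
    print(pick_next_job(backlog, projects, now, timedelta(hours=72)))
    print(pick_next_job(backlog, projects, now, timedelta(hours=72)))
```

With hundreds of thousands of entries in the backlog, the second branch is simply never reached, which is the situation described above.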

Remaining Work

Going back to the shared SAN: now that this system is stable, work is being performed, but we can see that the load over the past two months has dramatically impacted the throughput. The graph below shows the count of completed and updated analyses by day in the columns. The trend line is a 7-day moving average. The periods of practically no activity were due to us crashing the server.

On September 8, we completed 3500 analyses. Since then we’ve been averaging about 470 per day. This seems to be due only to the heavy use of the shared SAN device, which forms a bottleneck in the process.

Daily Analyses Updated and 7-Day Trend Line
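
For anyone who wants to reproduce a trend line like this from their own daily counts, here is a minimal sketch of a trailing 7-day moving average. The daily numbers in the demonstration are invented, and the choice to average over a shorter window for the first six days is an assumption of this sketch, not necessarily how the graph was produced.

```python
def trailing_moving_average(values, window=7):
    """Average each value with the values before it, up to `window` values in total."""
    averages = []
    for index in range(len(values)):
        start = max(0, index - window + 1)
        chunk = values[start:index + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    # Invented daily counts, including a day with no activity at the end.
    daily_counts = [700, 700, 700, 700, 700, 700, 700, 0]
    print(trailing_moving_average(daily_counts))
```

A single day of zero activity pulls the average down for a full week, which is why server crashes leave such visible dips in the trend.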

The other team is nearing the end of their work; somewhere in the 2-3 week range is the current best estimate. They have begun clearing out directories that have been confirmed as successfully migrated, which is beginning to alleviate the load on the system, so we remain hopeful that the throughput will begin to rise again. If we process 3000 Analyses per day, that means another 4+ months to get through all the remaining projects before we can start the updates (which go much faster than the initial fetches). That’s considering the average through to September 8. If we can maintain the more optimal 6000+ Analyses per day, then we’re looking at 2-ish months (after the other team is completely off the shared SAN).
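
The estimates above are simple division. As a worked check, taking the roughly 398K projects still needing an updated analysis and assuming 30-day months (an assumption made here for round numbers):

```python
import math


def days_to_clear(remaining_projects, analyses_per_day):
    """Whole days needed to work through the remaining projects at a steady rate."""
    return math.ceil(remaining_projects / analyses_per_day)


def approx_months(days, days_per_month=30):
    """Convert days to months, rounded to one decimal place."""
    return round(days / days_per_month, 1)


if __name__ == "__main__":
    remaining = 398_000
    for rate in (3000, 6000):
        days = days_to_clear(remaining, rate)
        print(rate, "per day:", days, "days, about", approx_months(days), "months")
```

At 3000 per day that works out to 133 days (about 4.4 months), and at 6000 per day to 67 days (about 2.2 months), which is where the "4+ months" and "2-ish months" figures come from.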

Because the bottleneck is the SAN, but other work can be done, we increased the back-end capacity by 50% to help push everything through (yay VMs!).

TL;DR

Total Number of Projects to Update and Analyze: 495K

Total Number of Projects Updated since July 29: 95K (These are the most popular, which tend to be some of the biggest too)

Initial Number of Jobs Scheduled on July 29: 550K

Number of Jobs Remaining: 208K

Projected Duration to Complete Initial Refetch of ALL projects: 2 – 5 months after the other team frees up the shared SAN, which could be in 2-3 weeks.

Why So Slow: Multiple teams making heavy use of a shared SAN resource. The other team is migrating off of it as we are moving on to it. Not ideal, but it was necessary.

We’re doing it!

Peter Degen-Portnoy — Fri, 29 Jul 2016 15:51:36 +0000

It’s happening! We’ve started clean fetches of ALL of our repositories using the new SAN! For background, please see the Details on the Infrastructure blog post.

We currently have 497K projects that have 592K distinct repositories that we are going to reprocess from scratch. To do this, we cleared out all the old jobs that have not completed, connected our FISbots (Fetch, Import, SLOC bots) to the new SAN, and started re-scheduling new Fetch jobs.

We’ve completed nearly 100,000 repository fetches and have some 445,000 scheduled with a few more to schedule. We are also monitoring the failures. Unlike the last set of failures, which could include problems due to the old SAN, these failures should all be actionable. While there will be some repositories that will just be hard to get because they are large or the servers are slow, most of the failures are turning out to be repositories that are no longer present. These types of failures are a real opportunity to look at the projects and determine if we can update the enlistments, or if the project has been abandoned and is no longer available anywhere (in which case, we will remove it from the Open Hub).

So, what does this mean for you, the awesome Open Hub User?

It means that it might take some time to get your project re-fetched and updated. We’ll do our best to respond to requests to get things updated, but please know that there is now a massive backlog that will take some time to process.

And after this is all over, we will have a smaller, leaner set of projects on the Open Hub that fulfill our mandate of monitoring active OSS projects. And that will be better for all of us!

Details on the Infrastructure

Peter Degen-Portnoy — Fri, 15 Jul 2016 20:09:29 +0000

In the blog post, Stepping Forward and Back, we mentioned that “we found additional complications with our new back end infrastructure.” We’d like to give you some more details about these complications.

We are referring to an NFS mounted system with enough storage for all 592,000 distinct repositories we track on the Open Hub. Without naming names, we have three problems with the currently installed system:

It does not support characters that are present in some repositories, thus generating an I/O Error when we try to fetch and update these repositories.
It is case insensitive by default, so files and directories that differ only by capitalization overwrite one another. This impacts a difficult-to-quantify number of repositories. It would be very expensive to try to compare nearly 600K source directories with local copies in order to identify those that are missing files and/or directories. Our current opinion is that nearly every repository is at risk of being impacted by this.
Performance through the NFS mount point can be so poor that updates can time out and the server at the source will terminate the connection.
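
To illustrate the second problem, here is a minimal sketch of how one might look for paths in a single repository that would overwrite one another on a case-insensitive file system. It is an approximation for illustration only: it uses Python's str.casefold as a stand-in for whatever case rules the storage system actually applies, and the function name find_case_collisions is made up.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that differ only by capitalization.

    Returns a sorted list of groups; each group is a sorted list of colliding paths.
    """
    by_folded = defaultdict(set)
    for path in paths:
        by_folded[path.casefold()].add(path)
    return sorted(sorted(group) for group in by_folded.values() if len(group) > 1)


if __name__ == "__main__":
    # Invented file list for a single repository.
    print(find_case_collisions(["src/Makefile", "src/makefile", "README"]))
```

Running a check like this against a known-good file listing is cheap for one repository; the expense mentioned above comes from having to do it for nearly 600K of them.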

There is an alternative solution (which was actually the system that was requested) available from our vendor without the above issues, but that solution has a hard-coded limit of the number of entries that can be in a single directory. We’ve reviewed existing repositories and have found multiple directories with more entries than the limit, which definitively precludes the use of this alternative solution.
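
A review like the one described above can be sketched as a directory walk that reports every directory holding more entries than a given limit. The vendor's actual limit is not stated here, so the limit is a parameter, and the function name directories_over_limit is hypothetical.

```python
import os
import tempfile


def directories_over_limit(root, limit):
    """Return (directory, entry_count) for each directory with more than `limit` entries."""
    over = []
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > limit:
            over.append((dirpath, count))
    return over


if __name__ == "__main__":
    # Demonstration with a throwaway directory and a deliberately tiny limit.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.c", "b.c", "c.c"):
            open(os.path.join(tmp, name), "w").close()
        print(directories_over_limit(tmp, limit=2))
```

Finding even one such directory in an existing repository is enough to rule the alternative out.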

You may be asking yourself why didn’t we detect these problems before committing to this system? I wish we had. We did not because we were not the first team to use this system for this exact purpose and these problems were not detected then. We had used a different system for verifying functionality and performance and were under the impression that the target production system was simply a better system in all regards, unaware that the installed system was not what had been specified. Finally, there were other scheduling pressures that encouraged us to move from our previous 18 bare metal infrastructure to our current VM infrastructure at an accelerated pace.

Here’s what we are doing to fix it: The system upon which we did functional and performance testing is still available and will have more space freed to ensure we will have enough for all our repositories. We are starting the work to relocate storage of new fetches to this new system. Then we will start clean, new fetches of every repository in the Open Hub.

We will keep the existing data until we have had a chance to test every single repository. Right now, we know that there are over 60K repositories impacted by some kind of detectable failure. Most of them are for repositories that have moved and the enlistments on the Open Hub have not yet been updated. We are taking this work as an opportunity to review all those repositories that cannot be cleanly fetched. At the end of this process, we will have clean, local copies of the repositories upon which the Open Hub depends as well as a clear list of repositories that need to be reviewed to see if we can recover the projects that have enlisted them.

Again, we apologize for any inconveniences this may have caused and thank you for your continued support and patience. We are also so grateful that you are a member of the Open Hub and the Open Source Software community.

Stepping Forward and Back

Peter Degen-Portnoy — Thu, 23 Jun 2016 17:03:04 +0000

Today we have some good news and some less good news.

On the positive side, we have pushed a number of fixes and improvements into production recently. One is that we have added a new “Add New Project” button to the Explore Projects page and changed the link at the bottom of the page to a button as well so that it is much easier to see where one can add new projects to the Open Hub. We have also fixed an issue that arose when trying to import really big GitHub repositories.

We made important fixes to the Fetch, Import and SLOC job processes so that the back end more accurately detects when repositories need to be updated and schedules jobs for them. We fixed the internal tracking issue so that the way low-level jobs report their work and update their status is correct and more consistent. We also addressed some small UI issues, including one where the correct and full text was not included in a tool tip when claiming new positions.

On the less positive side, we found additional complications with our new back end infrastructure. In short, there are issues with the file system behind the NFS mount point where we store our repositories that are blocking a number of jobs from being able to run. It looks like many of the major projects (such as Mozilla Firefox, MySQL, and Apache HTTP) cannot be updated. Nor can we update these projects until the issue is resolved. One current plan is to replace the NFS Server and reformat the disk with our repository data. Obviously, that would mean a major loss of data.

However, due to the limitations of the file system, we have an unknown number of repositories that are already generating bad analysis because of data loss. In essence, we’ve already been impacted by that risk and therefore we have to find a way to get a suitable NFS mount point and start refetching all the repositories. We’ll do our best to keep everyone informed.

Hey, Hey, Hey; What’s Happening Today?

Peter Degen-Portnoy — Wed, 08 Jun 2016 11:17:02 +0000

Hail Hubbites!

As we talked about in our Open Hub in 2016 post, we have recently made a major step forward in addressing significant infrastructure concerns. Down in the “More Infrastructure” section, we mentioned, “So we started an effort to virtualize our crawlers and are pilot testing that work now.” The FISbot servers are now out of the pilot test and the old crawlers are being decommissioned and un-racked. That’s not to say that there are no problems, but the problems we have are not worth switching horses back to the old infrastructure. No, we’d rather take care of the horse we’re riding now.

However, the issues we are having are impacting data on the site and, while we are moving quickly to address them, we’d like to share what we know with you so that everyone can be kept up to date:

There is an issue that after a Fetch, Import, SLOC cycle is completed, the follow-on Analysis Job is not always being generated. This is leaving some projects with fresh raw data, but no updated analysis.
There is an issue that the Job Scheduler is not always detecting projects with out of date analysis and scheduling new jobs. This is leaving some projects with no new fresh raw data.
We’ve changed the way we are doing some internal tracking and accounting of when jobs were executed. This switch has resulted in a mismatch between the fields where we are tracking job progress and the data we are presenting on the site so that some projects either show the wrong date the data were collected or do not show that value at all.
There are some new low-level issues with local copies of repositories. Since we’ve switched from 18 crawlers with dedicated local storage to virtual servers with an NFS mount to a SAN, we are seeing new file system level issues. These issues typically cause Fetch jobs to fail.

To address these, we are combing through projects and repositories repeatedly throughout the day and scheduling jobs to try and keep everything up to date. Please let us know if your project has fallen behind so we can address it while we work on the code fixes to bring the new FISbot infrastructure up to snuff.

In other news, the Spammer Cleanup program is also out of the Pilot phase and is chugging through our accounts and inviting account holders to verify their accounts. We are focusing on those accounts that were created and then showed no activity on the Open Hub. If you get one of these re-verification emails, please simply log on to the site and provide one of the requested forms of verification. If you have been an active member of the Open Hub, you should not be part of this email re-verification process. We will, however, still ask for verification when you log in if you’ve not logged in since these new security checks were put in place.

The “Invention Process” for our new security pages has started and is very exciting. We are looking at what we can produce and deploy quickly that will help illustrate the security landscape for OSS projects. After the initial deployment of fact-based data presentation, we will look toward adding elements that provide a broader overview of OSS security. Oh, and look forward to a new Project page layout that will begin rolling out across the site and will take advantage of the larger screen sizes available to modern browsers.

Final point: Such Perform. Wow Speed.

In the post GitHub, Performance, and Crawlers (Oh My!) from October 2015, we talked about the People Index page performance improving from 18-60 seconds to less than 1 second, the Explore Projects page improving from 100 seconds (!) to less than 1 second, and widget performance improving to 1.5 seconds. We were very pleased that we restored the average web server response times to under 1.2 seconds, or 1200 milliseconds.

Ladies and Gentlemen, Boys and Girls, Things and Its: for the past few months, average web server response time has been under 400 milliseconds, a 3X improvement in speed over that 1200 millisecond figure. Since the deployment of FISbot, average web server response time has been around 200 milliseconds, a 6X improvement in speed. Because a number of FIS jobs and Analysis jobs are going unscheduled, the site is carrying less load than it will once those jobs run again, so we expect some impact on site performance when we fix these code defects. Never fear; the next infrastructure project will separate the analysis database from the web application database and result in consistently speedy web application performance.

I know it’s been a tough process and at times the site was nigh unusable. Thanks for hanging in there with us. You guys are the best (I’m getting teary over here). And we’re continuing to work hard to bring you an unparalleled set of freely available analyses of ALL the OSS projects. Thank you so very much for being part of the OSS community and a member of the Open Hub.

Open Hub in 2016

Peter Degen-Portnoy — Fri, 15 Apr 2016 20:10:48 +0000

Hail Hubbites!

There has been a lot of activity behind the scenes at Open Hub Central with a steady stream of improvements rolling into production. We’d like to ~~brag~~ talk about them and also tell you what we have coming up in 2016.

2015 In Quick Review

Project PURR (Platform Upgrade Ruby and Rails) — we wrote a whole new Open Hub UI in the latest tech with 99.5% test coverage (I kid you not!)
Effective Spammer Throttling — using verification tools to ensure a real, verified human behind new accounts. Spammer account creation has dropped from way over 700% to a very manageable 13%
Focus on Infrastructure
- Improved a few critically slow queries that dragged the site down
- Performed the first VACUUM FULL in ages on some critical tables
- Improved average site performance 5X in 2015. Of course, it was pretty bad at times
New Inventions: Security Data. Let’s talk about that; please keep reading.

2016 In Plan

Security Data

We started by adding a new button to project pages. When we have vulnerability data from the National Vulnerability Database and/or VulnDB, we add a “Review Security Info” button in the Quick Reference section. This will take you to a new security feature we’re trying out. We’ll show you a graph of the number of vulnerabilities reported by version for the last 10 releases grouped by category.

We’ve gotten some very nice feedback from this initial feature and have decided to do more.

Project Vulnerability Report

The first thing we’re going to roll out is a new Project Vulnerability Report (PVR) that will show two ways of considering project vulnerability data across a project lifetime. One way will be a weighted absolute score: the Project Security Score, where a lower value will be better. The other will be a scaled scoring of projects based upon the weighted score against time: the Project Vulnerability Score, where a higher value will be better. When we roll this new feature out, we’ll include a blog post that details the ideas behind it.

Project Security Pages

Based upon the interest in and feedback on the security info button, we are going to add some new pages to the set of project pages. These will follow the current focus of the Open Hub: the facts about Open Source Software projects. We’ll show the number of open defects over time, broken down into groups by severity, with trends, scores, and other factual data about vulnerability reports.

More Spammer Cleanup

This has already started and some of you have received email requests during the Pilot run of this program. We are running a long-term email campaign and asking nearly all users to verify their accounts. If you have positions claimed, we do not intend to bother you with these emails. However, you will be required to verify your account when you come back to the Open Hub if you’ve not already done so. Our expectation for the outreach effort is that the vast majority of account holders don’t really exist. Account holders will have a generous period of about half a year, plus a few reminders (not too many!), to verify their accounts before they are flagged as spam accounts and eventually deleted.

More Infrastructure

You may remember when we lost a crawler last year, had no new data for about two weeks, and then took a few months to get caught up again. (I do!) We recognized that our crawler infrastructure had been getting more and more fragile. So we started an effort to virtualize our crawlers and are pilot testing that work now. This will give us greater stability, a simpler code base, cleaner architecture, and horizontal scalability in our back end.

After this new Fetch, Import, and SLOC code (FISbot) is in place and serving the Open Hub and the Black Duck Knowledge Base, we will start work on separating the analytics database from the web application database. This will give each part of the Open Hub, the data collection side and the data presentation side, its own dedicated database that can be optimized for its primary purpose.

We’re also going to switch from using the C-based Ohcount to the Java-based Ohcount4J for line counting so that all Black Duck products are reporting the same project statistics.

More Other Stuff

We would also like to do some updates to our UI and may roll out updated pages incrementally (rather than wait until we can touch the entire site). We’d like to get some connection to GitHub with data on Stars, Watches, and Forks, and maybe to StackOverflow to show the top questions, most recent questions, best answers and answerers on the project pages. It would be pretty cool if we could connect Open Hub accounts to StackExchange accounts and let folks click through to see the answerer’s Open Hub account page with their Open Source resume as well as their answers on StackExchange.

So Far in 2016

So, in addition to the “Review Security Info” button with the security.openhub.net security page, the Project Vulnerability Report, which will be pushed out into production soon, and the significant improvements to our back end that have yielded additional 2X speed improvements on the site, we have also just released a new feature to bulk-add GitHub repositories to your project. The way this works is that when you add a new code location, you can select “GitHub Repositories” as the SCM type and then enter the GitHub account name. We’ll then add all the public repositories in that GitHub account to the project.
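
For the curious, here is a rough sketch of how a bulk add like this could enumerate an account’s repositories. It is not the Open Hub implementation. It assumes the public GitHub REST API endpoint for listing a user’s repositories (`GET /users/{username}/repos`, paged with `per_page` and `page`), and the function names are invented for this example. Unauthenticated requests to that API are rate limited, so a real crawler would need to authenticate.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://api.github.com"


def repos_page_url(account: str, page: int, per_page: int = 100) -> str:
    """Build the URL for one page of an account's repository listing."""
    return f"{API_ROOT}/users/{quote(account)}/repos?per_page={per_page}&page={page}"


def public_clone_urls(repos: list) -> list:
    """Pick the clone URL of every repository not marked private."""
    return [r["clone_url"] for r in repos if not r.get("private", False)]


def list_public_repos(account: str) -> list:
    """Walk the pages until an empty one comes back (makes network calls)."""
    urls, page = [], 1
    while True:
        with urllib.request.urlopen(repos_page_url(account, page)) as resp:
            repos = json.load(resp)
        if not repos:
            return urls
        urls.extend(public_clone_urls(repos))
        page += 1
```

Each URL returned by `list_public_repos` would then become one code location on the project.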

There are other variations that we’re considering:

Bulk create new projects for each GitHub repository
Bulk create new projects from other forges

Also, we’re looking at the possibility of defining a new organization type, Distribution, so that we can identify organizations that package and distribute projects but don’t necessarily own or manage them. Think “Fedora”, “Debian”, etc. This will require some internal changes to allow a project to be included in a distribution even if it is “claimed” by some other organization or is already part of a different distribution. We think this kind of distinction is long overdue and can be very helpful.

And, penultimately, we’ve been working hard on responding to those users who have contacted us via Twitter or email or have posted on the Forums. Thank you so very much for reaching out to us! And thank you for your continued patience as we work to get your issue resolved or question answered.

One more point: It’s time to say goodbye to “code.openhub.net”. In the near future, we will take the site down and replace it with a curtain message. There are several reasons: the Black Duck product underneath this offering has been discontinued; the infrastructure is very expensive to run and maintain; and, most importantly, it seems the most popular use of the Code Search site is to see if one’s own project is there and up to date. We’ve not been able to confirm a significant number of users who actually use the site to search other repositories for code. On the other hand, we’ve not updated that site in a while, so it may also be that users who had been doing that realized the data is out of date and aren’t coming back. If you have an opinion, I’d love to hear it.

Thanks as always for being a member of the Open Source Software community and a member of the Open Hub. I’m always open to your email and tweets and am very interested in your thoughts and opinions.