
Tuesday, November 11, 2014

Genie 2.0: Second Wish Granted!

By Tom Gianos and Amit Sharma @ Big Data Platform Team

A little over a year ago we announced Genie, a distributed job and resource management tool. Since then, Genie has operated in production at Netflix, servicing tens of thousands of ETL and analytics jobs daily. There were two main goals in the original design of Genie:

  • To abstract the execution environment away from Hadoop, Hive and Pig job submissions.
  • To enable horizontal scaling of client resources based on demand.

Since the development of Genie 1.0, much has changed in both the big data ecosystem and here at Netflix. Hadoop 2 was officially released, enabling clusters to use execution engines beyond traditional MapReduce. Newer tools, such as interactive query engines like Presto and Spark, are quickly gaining in popularity. Other emerging technologies like Mesos and Docker are changing how applications are managed and deployed. Some changes to our big data platform in the last year include:

  • Upgrading our Hadoop clusters to Hadoop 2.
  • Moving to Parquet as the primary storage format for our data warehouse.
  • Integrating Presto into our big data platform.
  • Developing, deploying and open sourcing Inviso, to help users and admins gain insights into job and cluster performance.

Amidst all this change, we reevaluated Genie to determine what was needed to meet our evolving needs. Genie 2.0 is the result of this work and provides a more flexible, extensible and feature-rich distributed configuration and job execution engine.

Reevaluating Genie 1.0

Genie 1.0 accomplished its original goals well, but the narrow scope of those goals led to limitations, including:

  • It only worked with Hadoop 1.
  • It had a fixed data model designed for a very specific use case. Code changes were required to accomplish minor changes in behavior.
    • As an example, the s3CoreSiteXml and s3HdfsSiteXml fields of the ClusterConfigElement entity stored the paths to the core-site and hdfs-site XML files of a Hadoop cluster, rather than storing them in a generic collection field.
  • The execution environment selection criteria were very limited. The only way to select a cluster was by setting one of three types of schedules: SLA, ad hoc or bonus.

Genie 1.0 could not continue to meet our needs as the number of desired use cases increased and we continued to adopt new technologies. Therefore, we decided to take this opportunity to redesign Genie.

Designing and Developing Genie 2.0

The goals for Genie 2.0 were relatively straightforward:

  • Develop a generic data model, which would let jobs run on any multi-tenant distributed processing cluster.
  • Implement a flexible cluster and command selection algorithm for running a job.
  • Provide richer API support.
  • Implement a more flexible, extensible and robust codebase.

Each of these goals is explored below.

The Data Model

The new data model consists of the following entities:

Cluster: It stores all the details of an execution cluster including connection information, properties, etc. Some cluster examples are Hadoop 2, Spark, Presto, etc. Every cluster can be linked to a set of commands that it can run.

Command: It encapsulates the configuration details of an executable that is invoked by Genie to submit jobs to the clusters. This includes the path to the executable, the environment variables, configuration files, etc. Some examples are Hive, Pig, Presto and Sqoop. If the executable is already installed on the Genie node, configuring a command is all that is required. If the executable isn’t installed, a command can be linked to an application in order to install it at runtime.

Application: It provides all the components required to install a command executable on Genie instances at runtime. This includes the location of the jars and binaries, additional configuration files, an environment setup file, etc. Internally we have our Presto client binary configured as an application. A more thorough explanation is provided in the “Our Current Deployment” section below.

Job: It contains all the details of a job request and execution including any command line arguments. Based on the request parameters, a cluster and command combination is selected for execution. Job requests can also supply necessary files to Genie either as attachments or via the file dependencies field, if they already exist in an accessible file system. As a job executes, its details are recorded in the job record.

All the above entities support a set of tags that can provide additional metadata. The tags are used for cluster and command resolution as described in the next section.
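To make the model concrete, registering a cluster and a command might look roughly like the sketch below. The host, endpoint paths and payload fields are assumptions drawn from the entities described above rather than the authoritative schema; see the Genie API documentation for the exact contract.

import requests

GENIE = "http://genie.example.com:7001/genie/v2"  # hypothetical Genie endpoint

# Illustrative payloads only: field names follow the data model described
# above, but the exact schema should be taken from the Genie API docs.
cluster = {
    "name": "presto-adhoc",
    "user": "bdp",
    "version": "0.76",
    "status": "UP",
    "tags": ["type:presto", "sched:adhoc"],
    "configs": ["s3://example-bucket/presto/etc/config.properties"],
}
command = {
    "name": "prestocli",
    "user": "bdp",
    "version": "0.76",
    "status": "ACTIVE",
    "executable": "/apps/presto/bin/presto-wrapper",
    "tags": ["type:presto"],
}

cluster_resp = requests.post(GENIE + "/config/clusters", json=cluster)
cluster_resp.raise_for_status()
command_resp = requests.post(GENIE + "/config/commands", json=command)
command_resp.raise_for_status()

# Link the command to the cluster so Genie knows the cluster can run it.
cluster_id = cluster_resp.json()["id"]
command_id = command_resp.json()["id"]
requests.post(GENIE + "/config/clusters/%s/commands" % cluster_id,
              json=[command_id]).raise_for_status()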

Job Execution Environment Selection

Genie now supports a highly flexible method to select the cluster to run a job on and the command to execute, collectively known as the execution environment. A job request specifies two sets of tags to Genie:

  • Command Tags: A set of tags that maps to zero or more commands.
  • Cluster Tags: A priority ordered list of sets of tags that maps to zero or more clusters.

Genie iterates through the cluster tags list, and attempts to use each set of tags in combination with the command tags to find a viable execution environment. The ordered list allows clients to specify fallback options for cluster selection if a given cluster is not available.

At Netflix, nightly ETL jobs leverage this feature. Two sets of cluster tags are specified for these jobs. The first set matches our bonus clusters, which are spun up every night to help with our ETL load. These clusters use some of our excess, pre-reserved capacity that is available during Netflix's lower-traffic hours. The other set of tags matches the production cluster and acts as the fallback option. If the bonus clusters are out of service when the ETL jobs are submitted, Genie routes the jobs to the main production cluster.
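As a sketch, that nightly ETL pattern might translate into a job request along these lines. The payload shape, an ordered clusterCriterias list of tag sets plus a commandCriteria tag set, reflects our reading of the Genie 2 job model and should be verified against the API documentation.

import requests

GENIE = "http://genie.example.com:7001/genie/v2"  # hypothetical Genie endpoint

# Prefer the nightly bonus clusters; fall back to the production SLA cluster.
job_request = {
    "name": "nightly-etl-aggregate",
    "user": "etl",
    "version": "1.0",
    "commandArgs": "-f aggregate_ratings.pig",
    "commandCriteria": ["type:pig"],
    "clusterCriterias": [
        {"tags": ["type:yarn", "sched:bonus"]},  # tried first
        {"tags": ["type:yarn", "sched:sla"]},    # fallback if no bonus cluster is up
    ],
}

resp = requests.post(GENIE + "/jobs", json=job_request)
resp.raise_for_status()
print("Submitted job", resp.json().get("id"))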

Richer API Support

Genie 1.0 exposed a limited set of REST APIs. Any update to the contents of a resource had to be done by sending a request containing the entire object to the Genie service. In contrast, Genie 2.0 supports fine-grained APIs, including the ability to directly manipulate the collections that are part of the entities. For a complete list of available APIs, please see the Genie API documentation.

Code Enhancements

An examination of the Genie 1.0 codebase revealed aspects that needed to be modified in order to provide the flexibility and standards compliance desired going forward.
Some of the goals to improve the Genie codebase were to:

  • Decouple the layers of the application to follow a more traditional three tiered model.
  • Remove unnecessary boilerplate code.
  • Standardize and extend REST APIs.
  • Improve deployment flexibility.
  • Improve test coverage.

Tools such as Spring, JPA 2.0, Jersey, JUnit, Mockito, Swagger, etc. were leveraged to solve most of the known issues and better position the software to handle new ones in the future.

Genie 2.0 was completely rewritten to take advantage of these frameworks and tools. Spring features such as dependency injection, JPA support, transactions, profiles and more are utilized to produce a more dynamic and robust architecture. In particular, dependency injection for various components allows Genie to be more easily modified and deployed both inside and outside Netflix. Swagger-based annotations on top of the REST APIs provide not only improved documentation, but also a mechanism for generating clients in various languages. We used Swagger codegen to generate the core of our Python client, which has been uploaded to PyPI. Almost six hundred tests have also been added to the Genie codebase, making the code more reliable and maintainable.

Our Current Deployment

Genie 2.0 has been deployed at Netflix for a couple of months, and all Genie 1.0 jobs have been migrated over. Genie currently provides access to all the Hadoop and Presto clusters in our production, test and ad hoc environments. In production, Genie currently autoscales between twelve and twenty i2.2xlarge AWS instances, allowing several hundred jobs to run at any given time. This provides horizontal scaling of clients for our clusters with no additional configuration or overhead.

Presto and Sqoop commands are each configured with a corresponding application that points to locations in S3 where all the jars and binaries necessary to execute these commands are located. Every time one of these commands runs, the necessary files are downloaded and installed. This allows us to continuously deploy updates to our Presto and Sqoop clients without redeploying Genie. We’re planning to move our other commands, like Pig and Hive, to this pattern as well.

At Netflix, launching a new cluster is done via a configuration-based launch script. After a cluster is up in AWS, the cluster configuration is registered with Genie. Commands are then linked to the cluster based on predefined configurations. Once it is properly configured in Genie, the cluster is marked as “available”. When we need to take down a cluster, it is marked as “out of service” in Genie so that it can no longer accept new jobs. Once all running jobs are complete, the cluster is marked as “terminated” in Genie and its instances are shut down in AWS.
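In REST terms, that lifecycle is just a sequence of status updates on the cluster resource. The sketch below uses status values mirroring the quoted states above; the exact enum names, and whether a partial update like this is accepted, are assumptions to check against the Genie documentation.

import requests

GENIE = "http://genie.example.com:7001/genie/v2"   # hypothetical Genie endpoint
CLUSTER_ID = "h2prod-20141111"                     # hypothetical cluster id

def set_cluster_status(cluster_id, status):
    # Assumed partial update of just the status field; verify against the API docs.
    resp = requests.put("%s/config/clusters/%s" % (GENIE, cluster_id),
                        json={"status": status})
    resp.raise_for_status()

set_cluster_status(CLUSTER_ID, "UP")               # "available": cluster can take new jobs
set_cluster_status(CLUSTER_ID, "OUT_OF_SERVICE")   # stop accepting new jobs
# ...wait for running jobs to finish...
set_cluster_status(CLUSTER_ID, "TERMINATED")       # recorded before the AWS instances are shut down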

Going live with Genie 2.0 has allowed us to bring together all the new tools and services we’ve added to the big data platform over the last year. We have already seen many benefits: we were able to add Presto support to Genie in a few days and Sqoop support in less than an hour. These changes would have required code modification and redeployment with Genie 1.0, but were merely configuration changes in Genie 2.0.
Below is our new big data platform architecture with Genie at its core.


[Figure: the Netflix big data platform architecture, with Genie 2.0 at its core]

Future Work


There is always more to be done. Some enhancements that can be made going forward include:

  • Improving the job execution and monitoring components for better fault tolerance, efficiency on hosts and more granular status feedback.
  • Abstracting Genie’s use of Netflix OSS components to allow adopters to plug in their own implementations for certain components, easing adoption outside Netflix.
  • Improving the admin UI to expose more data to users, e.g., showing all clusters a given command is registered with.

We’re always looking for feedback and input from the community on how to improve and evolve Genie. If you have questions or want to share your experience with running Genie in your environment, you can join our discussion forum. If you’re interested in helping out, you can visit our Github page to fork the project or request features.

Monday, June 30, 2014

Announcing Security Monkey - AWS Security Configuration Monitoring and Analysis

We are pleased to announce the open source availability of Security Monkey, our solution for monitoring and analyzing the security of our Amazon Web Services configurations.


At Netflix, responsibility for delivering the streaming service is distributed and the environment is constantly changing. Code is deployed thousands of times a day, and cloud configuration parameters are modified just as frequently. To understand and manage the risk associated with this velocity, the security team needs to understand how things are changing and how these changes impact our security posture.
Netflix delivers its service primarily out of Amazon Web Services’ (AWS) public cloud, and while AWS provides excellent visibility of systems and configurations, it has limited capabilities in terms of change tracking and evaluation. To address these limitations, we created Security Monkey - the member of the Simian Army responsible for tracking and evaluating security-related changes and configurations in our AWS environments.

Overview of Security Monkey

We envisioned and built the first version of Security Monkey in 2011. At that time, we used a few different AWS accounts and delivered the service from a single AWS region. We now use several dozen AWS accounts and leverage multiple AWS regions to deliver the Netflix service. Over its lifetime, Security Monkey has ‘evolved’ (no pun intended) to meet our changing and growing requirements.

[Screenshot: Viewing IAM users in Security Monkey; highlighted users have active access keys.]
There are a number of security-relevant AWS components and configuration items - for example, security groups, S3 bucket policies, and IAM users. Changes or misconfigurations in any of these items could create an unnecessary and dangerous security risk. We needed a way to understand how AWS configuration changes impacted our security posture. It was also critical to have access to an authoritative configuration history service for forensic and investigative purposes so that we could know how things have changed over time. We also needed these capabilities at scale across the many accounts we manage and many AWS services we use.
[Screenshot: Security Monkey's filter interface allows you to quickly find the configurations and items you're looking for.]
These needs are at the heart of what Security Monkey is - an AWS security configuration tracker and analyzer that scales for large and globally distributed cloud environments.

Architecture

At a high-level, Security Monkey consists of the following components:
  • Watcher - The component that monitors a given AWS account and technology (e.g. S3, IAM, EC2). The Watcher detects and records changes to configurations. So, if a new IAM user is created or if an S3 bucket policy changes, the Watcher will detect this and store the change in Security Monkey’s database.
  • Notifier - The component that lets a user or group of users know when a particular item has changed. This component also provides notification based on the triggering of audit rules.
  • Auditor - The component that executes a set of business rules against an AWS configuration to determine the level of risk associated with the configuration. For example, a rule may look for a security group with a rule allowing ingress from 0.0.0.0/0 (meaning the security group is open to the Internet). Or, a rule may look for an S3 policy that allows access from an unknown AWS account (meaning you may be unintentionally sharing the data stored in your S3 bucket). Security Monkey includes a number of built-in rules, and users are free to add their own; a simplified sketch of such a rule follows this list.
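Here is that simplified, standalone sketch of an open-security-group check. It is not Security Monkey's actual plugin API; the input is a made-up stand-in for the configuration JSON the Watcher records, but the logic is the kind of rule an Auditor evaluates.

def audit_security_group(security_group):
    """Return (score, issue) findings for one security group configuration.

    `security_group` is a hypothetical dict standing in for the JSON the
    Watcher stores, e.g. {"name": "web", "rules": [{"cidr_ip": "0.0.0.0/0",
    "from_port": 22, "to_port": 22}]}.
    """
    issues = []
    for rule in security_group.get("rules", []):
        if rule.get("cidr_ip") == "0.0.0.0/0":
            issues.append((10, "Security group '%s' allows ingress from the entire "
                               "Internet on ports %s-%s" % (security_group.get("name"),
                                                            rule.get("from_port"),
                                                            rule.get("to_port"))))
    return issues

# One open SSH rule produces one high-score finding.
for score, issue in audit_security_group(
        {"name": "web", "rules": [{"cidr_ip": "0.0.0.0/0", "from_port": 22, "to_port": 22}]}):
    print(score, issue)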

In terms of technical components, we run Security Monkey in AWS on Ubuntu Linux, and storage is provided by a PostgreSQL RDS database. We currently run Security Monkey on a single m3.large instance - this instance type has been able to easily monitor our dozens of accounts and many hundreds of changes per day.

The application itself is written in Python using the Flask framework (including a number of Flask plugins). At Netflix, we use our standard single-sign on (SSO) provider for authentication, but for the OSS version we’ve implemented Flask-Login and Flask-Security for user management. The frontend for Security Monkey’s data presentation is written in Angular Dart, and JSON data is also available via a REST API.

General Features and Operations

Security Monkey is relatively straightforward from an operational perspective. Installation and AWS account setup are covered in the installation document, and Security Monkey does not rely on other Netflix OSS components to operate. Generally, operational use includes:
  • Initial Configuration
    • Setting up one or more Security Monkey users to use/administer the application itself.
    • Setting up one or more AWS accounts for Security Monkey to monitor.
    • Configuring user-specific notification preferences (to determine whether or not a given user should be notified for configuration changes and audit reports).
  • Typical Use Cases
    • Checking historical details for a given configuration item (e.g. the different states a security group has had over time).
    • Viewing reports to check what audit issues exist (e.g. all S3 policies that reference unknown accounts or all IAM users that have active access keys).
    • Justifying audit issues (providing background or context on why a particular issue exists and is acceptable even though it may violate an audit rule).

Note on AWS CloudTrail and AWS Trusted Advisor

CloudTrail is AWS’ service that records and logs API calls. Trusted Advisor is AWS’ premium support service that automatically evaluates your cloud deployment against a set of best practices (including security checks).

Security Monkey predates both of these services and meets some of each service’s goals while having unique value of its own:
  • CloudTrail provides verbose data on API calls, but has no sense of state in terms of how a particular configuration item (e.g. security group) has changed over time. Security Monkey provides exactly this capability.
  • Trusted Advisor has some excellent checks, but it is a paid service and provides no means for the user to add custom security checks. For example, Netflix has a custom check to identify whether a given IAM user matches a Netflix employee user account, something that is impossible to do via Trusted Advisor. Trusted Advisor is also a per-account service, whereas Security Monkey scales to support and monitor an arbitrary number of AWS accounts from a single Security Monkey installation.

Open Items and Future Plans

Security Monkey has been in production use at Netflix since 2011 and we will continue to add additional features. The following list documents some of our planned enhancements.
  • Integration with CloudTrail for change detail (including originating IP, instance, IAM account).
  • Ability to compare different configuration items across regions or accounts.
  • CSRF protections for form POSTs.
  • Content Security Policy headers (currently awaiting a Dart issue to be addressed).
  • Additional AWS technology and configuration tracking.
  • Test integration with moto.
  • SSL certificate expiry monitoring.
  • Simpler installation script and documentation.
  • Roles/authorization capabilities for admin vs. user roles.
  • More refined AWS permissions for Security Monkey operations (the current policy in the install docs is a broader read-only role).
  • Integration with edda, our general purpose AWS change tracker. On a related note, our friends at Prezi have open sourced reddalert, a security change detector that is itself integrated with edda.

Conclusion


Security Monkey has helped the security teams @ Netflix gain better awareness of changes and security risks in our AWS environment. Its approach fits well with the general Simian Army approach of continuously monitoring and detecting potential anomalies and risky configurations, and we look forward to seeing how other AWS users choose to extend and adapt its capabilities. Security Monkey is now available on our GitHub site.

If you’re in the San Francisco Bay Area and would like to hear more about Security Monkey (and see a demo), our August Netflix OSS meetup will be focused specifically on security. It’s scheduled for August 20th and will be held at Netflix HQ in Los Gatos.

-Patrick Kelley, Kevin Glisson, and Jason Chan (Netflix Cloud Security Team)

Monday, March 11, 2013

Python at Netflix

By Roy Rapoport, Brian Moyles, Jim Cistaro, and Corey Bertram

We’ve blogged a lot about how we use Java here at Netflix, but Python’s footprint in our environment continues to increase.  In honor of our sponsorship of PyCon, we wanted to highlight our many uses of Python at Netflix.

Developers at Netflix have the freedom to choose the technologies best suited for the job. More and more, developers turn to Python due to its rich batteries-included standard library, its succinct, clean yet expressive syntax, its large developer community, and the wealth of third party libraries available to solve a given problem. Its dynamic underpinnings enable developers to rapidly iterate and innovate, two very important qualities at Netflix. These features (and more) have led to increasingly pervasive use of Python in everything from small tools to large applications: using boto to talk to AWS, storing information with python-memcached and pycassa, managing processes with Envoy, polling RESTful APIs with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy. To illustrate, here are some current projects taking advantage of Python:

Alerting

The Central Alert Gateway (CAG) is a RESTful web application written in Python to which any process can post an alert, though the vast majority of alerts are triggered by our telemetry system, Atlas (which will be open sourced in the near future).  Based on its configuration, CAG can send these alerts via email to interested parties, dispatch them to our notification system to page on-call engineers, suppress them if we’ve already alerted someone, or perform automated remediation actions (for example, reboot or terminate an EC2 instance that starts appearing unhealthy).  At our scale, we generate hundreds of thousands of alerts every day, and handling as many of these as possible automatically -- while making sure to only notify people of new issues rather than telling them again about something they’re aware of -- is critical to our production efficiency (and quality of life).
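CAG itself is internal, so the endpoint and payload below are hypothetical, but they sketch the interaction the paragraph describes: any process simply POSTs an alert and lets CAG's configuration decide whether to email, page, suppress, or remediate.

import requests

CAG_URL = "http://cag.example.netflix.net/api/v1/alerts"  # hypothetical endpoint

alert = {
    "name": "api-cluster-high-error-rate",
    "severity": "critical",
    "source": "atlas",
    "instance": "i-0abc1234",
    "message": "5xx rate above 5% for 10 minutes",
}

# Fire-and-forget from the caller's perspective; routing, suppression, and
# any automated remediation are driven by CAG's configuration.
requests.post(CAG_URL, json=alert).raise_for_status()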

Chaos Gorilla

We’ve talked before about how we use Chaos Monkey to make sure our services are resilient to the termination of any small number of instances.  As we’ve improved resiliency to instance failures, we’ve been working to set the reliability bar much, much higher.  Chaos Gorilla integrates with Asgard and Edda, and allows us to simulate the loss of an entire availability zone in a given region.  This sort of failure mode -- an AZ either going down or simply becoming inaccessible to other AZs -- happens once in a blue moon, but it’s a big enough problem that simulating it and making sure our entire ecosystem is resilient to that failure is very important to us.


Security Monkey and Howler Monkey

Security Monkey is designed to keep track of configuration history and alert on changes in EC2 security-related policies such as security groups, IAM roles, S3 access control lists, etc.  This makes our Cloud Security team very happy, since without it there’s no way to know when, or how, a change occurred in the environment.  

Howler Monkey is designed to automatically discover and keep track of SSL certificates in our environments and domain names, no matter where they may reside, and to alert us as we get close to an SSL certificate’s expiration date, with flexible and powerful subscription and alerting mechanisms.  Because of it, we went from having an SSL certificate expire unexpectedly, with production impact, about once a quarter to having no production outages due to SSL expirations in the last eighteen months.  It’s a simple tool that makes a huge difference for us and our dozens of SSL certificates.

Chronos

We push hard to always increase our speed of innovation and, at the same time, reduce the cost of making changes in the environment.  In the datacenter days, we forced every production change to be logged in a change control system, because the first question everyone asks when looking at an issue is “What changed recently?”.  We found that a formal change control system didn’t work well with our culture of freedom and responsibility, so for the vast majority of changes we deprecated the formal process in favor of Chronos.  Chronos accepts events via a REST interface and allows humans and machines to ask questions like “what happened in the last hour?” or “what software did we deploy in the last day?”.  It integrates with our monkeys and Asgard, so the vast majority of changes in our environment are automatically reported to it, including event types such as deployments, AB tests, security events, and other automated actions.
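Chronos is likewise internal; the endpoint and field names below are hypothetical, but they sketch the two halves of the pattern: push an event when something changes, query the timeline when something breaks.

import requests

CHRONOS_URL = "http://chronos.example.netflix.net/api/v1/events"  # hypothetical endpoint

# Record a deployment event (in practice Asgard and the monkeys do this automatically).
requests.post(CHRONOS_URL, json={
    "type": "deployment",
    "app": "merchweb",
    "detail": "pushed merchweb-1.4.2 to prod in us-east-1",
    "user": "asgard",
}).raise_for_status()

# "What happened in the last hour?"
for event in requests.get(CHRONOS_URL, params={"since": "1h"}).json():
    print(event["type"], event["app"], event["detail"])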

Aminator


Readers of the blog or those who have seen our engineers present on the Netflix Platform may have seen numerous references to baking -- our name for the process by which we take an application and turn it into a deployable Amazon Machine Image. Aminator is the tool that does the heavy lifting and produces almost every single image that powers Netflix.

Aminator attaches a foundation image to a running EC2 instance, preps the image, installs packages into the image, and turns the resultant image into a complete Netflix application. Simple in concept and execution, but absolutely critical to our success. Pre-staging images and avoiding post-launch configuration really helps when launching hundreds or thousands of instances.


Cass Ops

Netflix Cassandra Operations uses Python for automation and monitoring tools.  We have created many modules for management and maintenance of our Cassandra clusters.  These modules use REST APIs to interface with other Netflix tools to manage our instances within AWS as well as interfacing directly with the Cassandra instances themselves.  These activities include creating clusters using Asgard, tracking our inventory with Edda, monitoring Eureka to make sure clusters are visible to clients, managing Cassandra repairs and compactions, and doing software upgrades.  In addition to our internally developed tools, we take advantage of various Python packages.  We use JenkinsAPI to interface with Jenkins for both job configuration and status information on our monitoring and maintenance jobs.  Pycassa is used to access our operational data stored in Cassandra.  Boto gives us the ability to communicate with various AWS services such as S3 storage.  Paramiko allows us to ssh to instances without needing to create a subprocess.  Use of Python for these tools has allowed us to rapidly develop and enhance our tools as Cassandra has grown at Netflix.
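As a small taste of the glue code involved, the sketch below uses Paramiko to run nodetool on a Cassandra node over SSH; the host name, user, and key path are placeholders.

import paramiko

def nodetool(host, command="status", user="cassandra",
             key_path="/home/cassops/.ssh/id_rsa"):
    """Run a nodetool command on a Cassandra node over SSH and return its output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, key_filename=key_path)
    try:
        _, stdout, _ = client.exec_command("nodetool " + command)
        return stdout.read().decode()
    finally:
        client.close()

# Check ring health on one node of a cluster.
print(nodetool("cass-node-001.example.com"))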

Data Science and Engineering

Our Data Science and Engineering teams rely heavily on Python to help surface insights from the vast quantities of data produced by the organization.  Python is used in tools for monitoring data quality, managing data movement and syncing, expressing business logic inside our ETL workflows, and running various web applications to visualize data.  

One such application is Sting, a lightweight RESTful web service that slices, dices, and produces visualizations of large in-memory datasets.  Our data science teams use Sting to analyze and iterate against the results of Hive queries on our big data platform.  While a Hive query may take hours to complete, once the initial dataset is loaded in Sting, additional iterations using OLAP style operations enjoy sub-second response times.  Datasets can be set to periodically refresh, so results are kept fresh and up to date.  Sting is written entirely in Python, making heavy use of libraries such as pandas and numpy to perform fast filtering and aggregation operations.
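A toy version of the slice-and-dice Sting performs once a Hive result set is in memory might look like this with pandas; the real service adds caching, scheduled refreshes, and a REST and visualization layer on top.

import pandas as pd

# Pretend this DataFrame was loaded once from the output of a long-running Hive query.
viewing = pd.DataFrame({
    "country": ["US", "US", "BR", "BR", "GB"],
    "device":  ["tv", "mobile", "tv", "tv", "mobile"],
    "hours":   [1.50, 0.50, 2.00, 1.00, 0.75],
})

# OLAP-style aggregation served from memory in milliseconds instead of
# re-running the query: total viewing hours by country and device.
summary = (viewing.groupby(["country", "device"])["hours"]
                  .sum()
                  .reset_index()
                  .sort_values("hours", ascending=False))
print(summary)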

General Tooling and the Service Class

Pythonistas at Netflix have been championing the adoption of Python and striving to make its power accessible to everyone within the organization.  To do this, we wrapped libraries for many of the OSS tools now being released by Netflix, as well as a few internal services, in a general-use ‘Service’ class.  With this we have helped our users quickly and easily stand up new services that have access to many common capabilities such as alerting, telemetry, Eureka, and easy AWS API access.  We expect to make many of these libraries available this year and will be around to chat about them at PyCon!
Here is an example of how easily we can stand up a service with Eureka registration, Route 53 registration, a basic status page, and a fully functional Bottle app.
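The sketch below shows only the Bottle and status-page portion of such a service; the internal Service wrapper that performs the Eureka and Route 53 registration is not open source, so those hooks appear here only as hypothetical comments.

from bottle import Bottle, run

app = Bottle()

@app.route("/status")
def status():
    # Basic status page; Bottle serializes the returned dict to JSON automatically.
    return {"status": "UP", "application": "example-python-service"}

@app.route("/hello/<name>")
def hello(name):
    return {"message": "Hello, %s!" % name}

if __name__ == "__main__":
    # In the internal wrapper, startup would also register the instance with
    # Eureka and create its Route 53 record (hypothetical hooks, not shown).
    run(app, host="0.0.0.0", port=7001)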


These systems and applications offer just a glimpse of the overall use and importance of Python at Netflix. They contribute heavily to our overall service quality, allow us to rapidly innovate, and are a whole lot of fun to work on to boot!

We’re sponsoring PyCon this year, and in addition to a slew of Netflixers attending, we’ll also have a booth in the expo area and give a talk expanding on some of the use cases above.  If any of this sounds interesting, come by and chat.  Also, we’re hiring Senior Site Reliability Engineers, Senior DevOps Engineers, and Data Science Platform Engineers.