<?xml version="1.0" encoding="UTF-8"?>
		<rss version="2.0" xmlns:a10="http://www.w3.org/2005/Atom">
      <channel xmlns:sxp="blogs.technet.microsoft.com/cloudplatform/rssfeeds/dataplatform" xmlns:sxpMd="http://sxpdata.microsoft.com/metadata">
        <title>Category Name</title>
        <description>Category Name</description>
        <copyright>Copyright (c) 2017 Microsoft</copyright>
        <link>https://blogs.technet.microsoft.com/cloudplatform</link>
<a10:link href="https://blogs.technet.microsoft.com/cloudplatform/rssfeeds/dataplatform" rel="self" type="application/rss+xml" />

				
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/subway-maps-to-scale.html</guid>
						<pubDate>Fri, 09 Jun 2017 20:59:59 +0000</pubDate>
						<relativeTime>2 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
<title><![CDATA[Because it's Friday: Subway maps to scale]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/subway-maps-to-scale.html</link>
						<description><![CDATA[<p>Many subway maps are masterpieces of information design, but inevitably make compromises in geographic fidelity for clarity. Inspired by a <a href="https://www.reddit.com/r/dataisbeautiful/comments/6baefh/berlin_subway_map_compared_to_its_real_geography/">viral post</a> on Reddit, you can now find a <a href="http://twistedsifter.com/2017/05/subway-maps-compared-to-their-actual-geography/">collection of subway maps from around the world</a>, morphing from their map format to their real-world layout. Here, for example, is London's:&nbsp;</p>
<p></p>
<p>London's tube map &mdash; in map form &mdash; has always been lovely and clear, and you can easily see how much it deviates from geographic truth (and how much better it is for doing so). Compare it with New York's map:</p>
<p></p>
<p>I've always found the NYC subway map one of the most confusing to use as a traveler, and the fact that it hews so closely to geography probably has a lot to do with it.</p>
<p>That's all from us for this week. Have a great weekend, and we'll be back on Monday!&nbsp;</p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/subway-maps-to-scale.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/user2017-schedule.html</guid>
						<pubDate>Fri, 09 Jun 2017 19:17:33 +0000</pubDate>
						<relativeTime>2 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Schedule for useR!2017 now available]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/user2017-schedule.html</link>
<description><![CDATA[<p>The full <a href="https://user2017.sched.com/">schedule of talks</a> for <a href="https://user2017.brussels/">useR!2017</a>, the global R user conference, has now been posted. The conference will feature <a href="https://user2017.sched.com/overview/type/Tutorial">16 tutorials</a>, <a href="https://user2017.sched.com/overview/type/Invited+Talk">6 keynotes</a>, <a href="https://user2017.sched.com/overview/type/Talk">141 full talks</a>, and <a href="https://user2017.sched.com/overview/type/Lightning+Talk">86 lightning talks</a>&nbsp;starting on July 5 in Brussels. That's a lot to fit into 4 days, but I'm especially looking forward to the keynote presentations:</p>
<ul><li><a href="https://user2017.sched.com/event/Ay06/keynote-20-years-of-cran">20 years of CRAN</a> (Uwe Ligges)</li>
<li><a href="https://user2017.sched.com/event/Ay04/keynote-parallel-computation-in-r-what-we-want-and-how-we-might-get-it">Parallel Computation in R: What We Want, and How We (Might) Get It</a> (Norm Matloff)</li>
<li><a href="https://user2017.sched.com/event/Ay00/keynote-structural-equation-modeling-models-software-and-stories">Structural Equation Modeling: models, software and stories</a>&nbsp;(Yves Rosseel)</li>
<li><a href="https://user2017.sched.com/event/Ay01/keynote-teaching-data-science-to-new-users">Teaching data science to new useRs</a> (Mine Cetinkaya-Rundel)</li>
<li><a href="https://user2017.sched.com/event/Ay02/keynote-dose-response-analysis-considering-dose-both-as-qualitative-factor-and-quantitative-covariate-using-r">Dose-response analysis: considering dose both as qualitative factor and quantitative covariate using R</a> (Ludwig Hothorn)</li>
<li><a href="https://user2017.sched.com/event/Ay05/keynote-r-tools-for-the-analysis-of-complex-heterogeneous-data">R tools for the analysis of complex heterogeneous data</a> (Isabella Gollini)</li>
</ul><p>I'm also pleased to be attending with several of my Microsoft colleagues. You can find our talks below.</p>
<ul><li><a href="https://user2017.sched.com/event/Axq3/can-you-keep-a-secret">Can you keep a secret?</a>&nbsp;(Andrie deVries and Gabor Csardi) [A new package <a href="https://user2017.sched.com/event/Axq3/can-you-keep-a-secret">secret</a> allows you to encrypt secrets using public key encryption.]&nbsp;</li>
<li><a href="https://user2017.sched.com/event/AxrU/recommendations-for-coworker-collaborators-using-r-and-the-office-graph-api">Recommendations for coworker collaborators using R and the Office Graph API</a> (David Smith) [Graph theory, R and Power BI]</li>
<li><a href="https://user2017.sched.com/event/Axqb/deep-learning-for-natural-language-processing-in-r">Deep Learning for Natural Language Processing in R</a> (Angus Taylor) [Using the mxnet package to apply deep learning to text]</li>
</ul><p>I hope you can attend too! <a href="https://user2017.brussels/registration/">Registration is still open</a> if you'd like to join in. You can find the complete schedule linked below.</p>
<p>Sched: <a href="https://user2017.sched.com/">useR!2017</a></p>
<p>&nbsp;</p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/user2017-schedule.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/06/09/what-is-plan-regression-in-sql-server/</guid>
						<pubDate>Fri, 09 Jun 2017 13:02:10 +0000</pubDate>
						<relativeTime>2 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[What is plan regression in SQL Server?]]></title>
						<link>https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/06/09/what-is-plan-regression-in-sql-server/</link>
<description><![CDATA[<p>Plan regression happens when SQL Server starts using a sub-optimal plan to execute a T-SQL query. Usually you will see that some T-SQL query executes really fast, but then it gets slower for no obvious reason. In this post you will see how plan regression can happen.</p>
<p><span></span></p>
<h2>Setup</h2>
<p>First I&rsquo;m going to create&nbsp;a simple table that will be used for the test:</p>
<pre>drop table if exists flgp;
create table flgp (
    type int,
    name nvarchar(200),
    index ncci nonclustered columnstore (type),
    index ix_type(type)
);</pre>
<p>This table has two columns, one classic B-tree index, and one columnstore index. The B-tree index is better for queries that need a limited set of rows, while the columnstore index is better for queries that scan most of the rows in the table.</p>
<p>I will populate this table with non-uniformly distributed data:</p>
<ul><li>One row that has type = 1</li>
<li>999,999 rows that have type = 2</li>
</ul><pre>insert into flgp(type, name)
values (1, 'Single');

insert into flgp(type, name)
select TOP 999999 2 as type, o.name
from sys.objects, sys.all_columns o;</pre>
<p>Plan regression will be demonstrated on this data set.</p>
<h2>Why do we get different plans for the same T-SQL query?</h2>
<p>Every T-SQL query can have different plans. Let&rsquo;s look at the following query:</p>
<pre><span>SELECT COUNT(*) FROM flgp WHERE type = @type</span></pre>
<p>If we pass a parameter with the value 1, the query would need to take one row. Therefore, the optimal plan would be a plan that uses B-Tree index that seeks into the table, takes one row using the reference from the index and returns number 1 as a result. B-tree indexes are perfect for very selective seeks/lookups.&nbsp;Let&rsquo;s look at this query:</p>
<pre>ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;
EXECUTE sp_executesql @stmt = N'SELECT COUNT(*) FROM flgp WHERE type = @type', @params = N'@type int', @type = 1</pre>
<p>This query will produce a plan with a B-Tree INDEX SEEK that is optimal for parameter value 1, with the following execution statistics:</p>
<pre>SQL Server Execution Times:
 CPU time = 0 ms, elapsed time = 7 ms.
</pre>
<p><span>If we pass parameter value 2, the query would need to read almost the whole table. The optimal plan would be one that uses the columnstore index to <strong>scan</strong> the table, take all rows from the index, and count them. Let&rsquo;s try to execute this query:</span></p>
<pre>ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;
EXECUTE sp_executesql @stmt = N'SELECT COUNT(*) FROM flgp WHERE type = @type', @params = N'@type int', @type = 2</pre>
<p>Execution statistics for this query might be:</p>
<pre> SQL Server Execution Times:
 CPU time = 312 ms, elapsed time = 400 ms.</pre>
<p>Depending on the parameter value, SQL Server will choose one plan or the other. The second plan is slower, but this is expected because it needs to count a million rows.</p>
<p>In both cases I have cleared the cache before I executed the query using <strong>ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE</strong>, because I want SQL Server to recompile the query and create the new plan.</p>
<h2>How does plan regression happen?</h2>
<p>When SQL Server creates a plan for a query, the plan is cached and reused when the same query comes again. The plan is retained in the cache for some time. Now, let&rsquo;s see what happens if I execute a query that puts a plan with an index seek into the cache, and then execute a second query that reuses that plan:</p>
<pre>ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;
EXECUTE sp_executesql @stmt = N'SELECT COUNT(*) FROM flgp WHERE type = @type', @params = N'@type int', @type = 1
-- this query will reuse plan cached from previous query:
EXECUTE sp_executesql @stmt = N'SELECT COUNT(*) FROM flgp WHERE type = @type', @params = N'@type int', @type = 2</pre>
<p>If we look at the execution statistics for the second query we might see something like:</p>
<pre> SQL Server Execution Times:
 CPU time = 719 ms, elapsed time = 721 ms.</pre>
<p>Both CPU and elapsed time for the second query are roughly doubled (from 300-400 ms to 700 ms) because the second query reused the plan that was optimal for the first query. In this case, the cause of the regression is plan caching.</p>
<p>&nbsp;</p>
<h3>Conclusion</h3>
<p>Plan regression happens when SQL Server starts using a sub-optimal plan, which increases CPU time and duration. One way to mitigate the problem, if you find it, is to recompile the query with OPTION(RECOMPILE). <strong>Do not clear the procedure cache on a production system, because it will affect all queries!</strong></p>
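<p>As a sketch of the OPTION(RECOMPILE) mitigation (illustrative only, reusing the query from earlier in this post), the hint goes into the statement text so this execution gets a freshly compiled plan instead of the cached one:</p>
<pre>-- Recompile just this statement, so the cached plan
-- (optimal for a different parameter value) is not reused.
EXECUTE sp_executesql @stmt = N'SELECT COUNT(*) FROM flgp WHERE type = @type OPTION(RECOMPILE)', @params = N'@type int', @type = 2</pre>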
<p>Another option is to use <a href="https://docs.microsoft.com/en-us/sql/relational-databases/automatic-tuning/automatic-tuning">automatic plan choice correction</a> in SQL Server 2017, which looks at the history of plans and forces SQL Server to use the last known good plan when a plan regression is detected.</p>
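<p>A minimal sketch of enabling that feature on a SQL Server 2017 database (run in the context of the target database):</p>
<pre>-- Detect plan regressions and automatically force the last known good plan.
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING ( FORCE_LAST_GOOD_PLAN = ON );</pre>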
<p>&nbsp;</p>
<p>&nbsp;</p>]]></description>
						<author>Jovan Popovic (MSFT)</author>
						<source url="https://blogs.msdn.microsoft.com/sqlserverstorageengine/feed/">SQL Server Database Engine Blog</source>
						<comments>https://blogs.msdn.microsoft.com/sqlserverstorageengine/2017/06/09/what-is-plan-regression-in-sql-server/feed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/doazureparallel-updated.html</guid>
						<pubDate>Thu, 08 Jun 2017 22:41:11 +0000</pubDate>
						<relativeTime>3 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Run massive parallel R jobs cheaply with updated doAzureParallel package]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/doazureparallel-updated.html</link>
<description><![CDATA[<p>At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (<a href="http://blog.revolutionanalytics.com/downloads/doAzureParallel%20presentation.pdf">PDF slides here</a>) on the <a href="https://github.com/Azure/doAzureParallel">doAzureParallel package</a>. As we've noted&nbsp;<a href="http://blog.revolutionanalytics.com/2017/03/doazureparallel.html">here before</a>, this package allows you to easily&nbsp;distribute parallel R computations to an Azure cluster. The package was&nbsp;<a href="https://azure.microsoft.com/en-us/blog/run-massive-r-jobs-in-azure-directly-from-r-studio-at-a-fraction-of-the-price">recently updated</a> to support using automatically-scaling <a href="https://azure.microsoft.com/en-us/services/batch/">Azure Batch clusters</a>&nbsp;with low-priority nodes, which can be used at a <a href="https://azure.microsoft.com/en-us/pricing/details/batch/">discount of up to 80%</a> compared to the price of regular high-availability VMs.</p>
<blockquote>
<p dir="ltr" lang="en">JS Tan using doAzureParallel <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> package to run simulation on a cluster of 20 low-priority Azure VMs. Total cost: $0.02 <a href="https://twitter.com/hashtag/EARLConf2017?src=hash">#EARLConf2017</a> <a href="https://t.co/Mpl3IUa9zY">pic.twitter.com/Mpl3IUa9zY</a></p>
&mdash; David Smith (@revodavid) <a href="https://twitter.com/revodavid/status/872589986490482688">June 7, 2017</a></blockquote>
<p><a href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d28ace3d970c-pi"><img alt="Azurepool" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d28ace3d970c-200wi" title="Azurepool"></a>Using the doAzureParallel package is simple. First, you need to define the cluster you're going to use <a href="https://github.com/Azure/doAzureParallel/blob/master/README.md#configuration-json-files">as a JSON file</a>. (You can see an example on the right.) Here, you'll specify your Azure credentials, the size of the cluster, and the type of nodes (CPUs and memory) to use in the cluster. You can also specify here R packages (from CRAN and/or Github) to be pre-loaded onto each node, and the maximum number of simultaneous tasks to run on each node (for within-node parallelism).</p>
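<p>The configuration file itself isn't inlined here, but a hypothetical cluster JSON along these lines illustrates the idea (the field names below are assumptions for illustration, not an authoritative schema; see the package documentation linked above for the real format):</p>
<pre>{
  "name": "my-r-pool",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 4,
  "poolSize": {
    "dedicatedNodes":   { "min": 2, "max": 2 },
    "lowPriorityNodes": { "min": 0, "max": 18 }
  },
  "rPackages": {
    "cran": ["dplyr", "caret"]
  }
}</pre>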
<p>New to this update, the <a href="https://github.com/Azure/doAzureParallel/blob/master/docs/11-autoscale.md">poolSize option</a> allows you to specify the number of dedicated (standard) VM nodes to use, in addition to a number of low-priority nodes. Low-priority nodes can be pre-empted by the Azure system at any time, but are much cheaper to use. (Even if a node is pre-empted, your parallel computation will continue; it will just take a little longer with the reduced capacity.) You can even specify a minimum and maximum number of nodes of each class, in which case the cluster will <strong>automatically scale</strong> up and down according to either the workload or the time of day, whichever you choose (e.g. expanding the low-priority part of the cluster only on weekends, when pre-emption is less likely).&nbsp;</p>
<p><a href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d28ac4a6970c-pi"><img alt="Azurebatch" border="0" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d28ac4a6970c-800wi" title="Azurebatch"></a><br>Once you've defined the parameters of your cluster, all you need to do is declare the cluster as a backend for the <a href="https://mran.microsoft.com/package/foreach/">foreach package</a>. The body of the <code>foreach</code> loop runs just like a <code>for</code> loop, except that multiple iterations run in parallel on the remote cluster. Here are the key parts of the <a href="https://github.com/Azure/doAzureParallel/blob/master/samples/montecarlo_pricing_simulation.R">option price simulation</a> example JS presented at the conference.</p>
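<p>As a rough sketch of that workflow (function names follow the package README at the time of writing; treat them as illustrative rather than authoritative):</p>
<pre># Register the Azure Batch cluster as the foreach backend.
library(doAzureParallel)
setCredentials("credentials.json")      # Azure credentials
cluster <- makeCluster("cluster.json")  # cluster definition (JSON)
registerDoAzureParallel(cluster)

# Each iteration runs in parallel on the remote cluster.
results <- foreach(i = 1:100) %dopar% {
  mean(rnorm(1e6))
}

stopCluster(cluster)</pre>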
<p>This same approach can be used for any "embarrassingly parallel" iteration in R, and you can use any R function or package within the body of the loop. For example, you could use a cluster to reduce the time required for&nbsp;<a href="http://github.com/Azure/doAzureParallel/blob/master/samples/caret_example.R" title="">parameter tuning and cross-validation with the caret package,</a>&nbsp;or speed up <a href="https://github.com/Azure/doAzureParallel/blob/master/samples/plyr_example.R">data preparation tasks when using the dplyr package</a>.</p>
<p>In addition to support for auto-scaling clusters, this update to doAzureParallel also includes a few other new features. You'll also find new utility functions for managing multiple&nbsp;<a href="http://github.com/Azure/doAzureParallel#long-running-jobs--job-management" title="">long-running R jobs,</a>&nbsp;functions to read data from and write data to <a href="http://github.com/Azure/doAzureParallel/blob/master/samples/sas_resource_files_example.R" title="">Azure Blob storage</a>, and the ability to <a href="https://github.com/Azure/doAzureParallel/blob/master/docs/21-distributing-data.md#pre-loading-data-into-the-cluster">pre-load data into the cluster</a> by specifying resource files.</p>
<p>The doAzureParallel package is available for download now from Github, under the open-source MIT license. For details on how to use the package, check out the <a href="https://github.com/Azure/doAzureParallel/blob/master/README.md">README</a> and the&nbsp;<a href="https://github.com/Azure/doAzureParallel/tree/master/docs">doAzureParallel guide</a>.</p>
<p>Github (Azure): <a href="https://github.com/azure/doazureparallel">doAzureParallel</a></p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/doazureparallel-updated.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/how-to-create-dot-density-maps-in-r.html</guid>
						<pubDate>Wed, 07 Jun 2017 20:17:44 +0000</pubDate>
						<relativeTime>4 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[How to create dot-density maps in R]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/how-to-create-dot-density-maps-in-r.html</link>
						<description><![CDATA[<p><a href="http://blog.revolutionanalytics.com/2015/08/following-uo-on-the-news-with-choroplether-and-r.html">Choropleths</a> are a common approach to visualizing data on geographic maps. But choropleths &mdash; by design or necessity &mdash; aggregate individual data points into a single geographic region (like a country or census tract), which is all shaded a single colour. This can introduce interpretability issues (are we seeing changes in the variable of interest, or just population density?) and can fail to express the richness of the underlying data.</p>
<p>For an alternative approach, take a look at the recent Culture of Insight blog post which provides a tutorial on <a href="http://blog.cultureofinsight.com/2017/06/building-dot-density-maps-with-uk-census-data-in-r/">creating dot-density maps in R</a>. The chart below is based on UK Census data. Each point represents 10 London residents, with the colour representing one of five ethnic categories. Now, the UK census only reports ethnic ratios on a borough-by-borough basis, so the approach here is to simulate the individual resident data (which is not available) by randomly distributing points across the borough following the reported distribution. In a way, this is suggesting a level of precision which isn't available in the source data, but it does provide a visualization of London's ethnic diversity that isn't confounded with the underlying population distribution.&nbsp;</p>
<p><a href="http://blog.cultureofinsight.com/2017/06/building-dot-density-maps-with-uk-census-data-in-r/"><img alt="London ethnicity" border="0" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c9004a77970b-800wi" title="London ethnicity"></a></p>
<p>Follow the link below to the detailed blog post, which includes R code (in both base and ggplot2 graphics) for creating density dot-charts like these. Also be sure to check out the zoomable version of the chart at the top of the page, which used Microsoft's <a href="https://www.microsoft.com/en-gb/download/details.aspx?id=24819">Deep Zoom Composer</a>&nbsp;in conjunction with <a href="http://openseadragon.github.io/" rel="noopener" target="_blank">OpenSeadragon</a>&nbsp;to provide the zooming capability.</p>
<p>Culture of Insight:&nbsp;<a href="http://blog.cultureofinsight.com/2017/06/building-dot-density-maps-with-uk-census-data-in-r/">Building Dot Density Maps with UK Census Data in R</a></p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/how-to-create-dot-density-maps-in-r.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.technet.microsoft.com/machinelearning/2017/06/07/announcing-microsoft-machine-learning-library-for-apache-spark/</guid>
						<pubDate>Wed, 07 Jun 2017 16:00:54 +0000</pubDate>
						<relativeTime>4 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Announcing Microsoft Machine Learning Library for Apache Spark]]></title>
						<link>https://blogs.technet.microsoft.com/machinelearning/2017/06/07/announcing-microsoft-machine-learning-library-for-apache-spark/</link>
						<description><![CDATA[<p><span><em>This post is authored by Roope Astala, Senior Program Manager, and Sudarshan Raghunathan, Principal Software Engineering Manager, at Microsoft.<br></em></span></p>
<p><span>We&rsquo;re excited to announce the Microsoft Machine Learning library for Apache Spark &ndash; a library designed to make data scientists more productive on Spark, increase the rate of experimentation, and leverage cutting-edge machine learning techniques &ndash; including deep learning &ndash; on very large datasets.<br></span></p>
<h4><span>Simplifying Data Science for Apache Spark<br></span></h4>
<p><span>We&rsquo;ve learned a lot by working with customers using <a href="http://spark.apache.org/docs/latest/ml-guide.html">SparkML</a>, both internal and external to Microsoft. Customers have found Spark to be a powerful platform for building scalable ML models. However, they struggle with low-level APIs, for example to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms. Microsoft Machine Learning for Apache Spark (MMLSpark) simplifies many of these common tasks for building models in PySpark, making you more productive and letting you focus on the data science.<br></span></p>
<p><span>The library provides simplified consistent APIs for handling different types of data such as text or categoricals. Consider, for example, a DataFrame that contains strings and numeric values from the <a href="https://archive.ics.uci.edu/ml/datasets/Adult">Adult Census Income dataset</a>, where &ldquo;income&rdquo; is the prediction target.<br></span></p>
<p><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060617_2243_AnnouncingM1.png"><span><br></span></p>
<p><span>To featurize and train a model for this data using vanilla SparkML, you would have to tokenize strings, convert them into numerical vectors, assemble the numerical vectors together and index the label column. These operations result in substantial amounts of code that is not modular, as it depends on the data layout and the chosen ML algorithm. However, in MMLSpark, you can simply pass the data to the model, while the library takes care of the rest. Furthermore, you can easily change the feature space and algorithm without having to re-code your pipeline.<br></span></p>
<p><span><em>model = mmlspark.TrainClassifier(model=LogisticRegression(), labelCol="income").fit(trainData)<br>
predictions = model.transform(testData)<br></em></span></p>
<p><span>MMLSpark uses DataFrames as its core datatype and integrates with SparkML pipelines for composability and modularity. It is implemented as Python bindings over Scala APIs, ensuring native JVM performance.<br></span></p>
<h4><span>Deep Learning and Computer Vision at Scale<br></span></h4>
<p><span>Deep Neural Networks (DNNs) are a powerful technique and can yield near-human accuracy for tasks such as image classification, speech recognition and more. But building and training DNN models from scratch often requires special expertise, expensive compute resources and access to very large datasets. An additional challenge is that DNN libraries are not easy to integrate with SparkML pipelines. The data types and APIs are not readily compatible, requiring custom UDFs which introduce additional code complexity and data marshalling overheads.<br></span></p>
<p><span>With MMLSpark, we provide easy-to-use Python APIs that operate on Spark DataFrames and are integrated into the SparkML pipeline model. By using these APIs, you can rapidly build image analysis and computer vision pipelines that use the cutting-edge DNN algorithms. The capabilities include:<br></span></p>
<ul><li><span><em>DNN featurization: </em>Using a pre-trained model is a great approach when you&rsquo;re constrained by time or the amount of labeled data. You can use pre-trained state-of-the-art neural networks such as ResNet to extract high-order features from images in a scalable manner, and then pass these features to traditional ML models, such as logistic regression or decision forests.<br></span></li>
<li><span><em>Training on a GPU node: </em>Sometimes, your problem is so domain specific that a pre-trained model is not suitable, and you need to train your own DNN model. You can use Spark worker nodes to pre-process and condense large datasets prior to DNN training, then feed the data to a GPU VM for accelerated DNN training, and finally broadcast the model to worker nodes for scalable scoring.<br></span></li>
<li><span><em>Scalable image processing pipelines:</em> For a complete end-to-end workflow for image processing, DNN integration is not enough. Typically, you have to pre-process your images so they have the correct shape and normalization, before passing them to DNN models. In MMLSpark, you can use OpenCV-based image transformations to read in and prepare your data.<br></span></li>
</ul><p><span>Consider, for example, using a neural network to classify a collection of images. With MMLSpark, you can simply initialize a pre-trained model from <a href="https://www.microsoft.com/en-us/cognitive-toolkit/">Microsoft Cognitive Toolkit (CNTK)</a> and use it to featurize images with just a few lines of code. We perform transfer learning by using a DNN to extract features from images, and then pass them to traditional ML algorithms such as logistic regression.<br></span></p>
<p><span><em>cntkModel = CNTKModel().setInputCol("images").setOutputCol("features").setModelLocation(resnetModel).setOutputNode("z.x")<br>
featurizedImages = cntkModel.transform(imagesWithLabels).select(['labels', 'features'])<br>
model = TrainClassifier(model=LogisticRegression(), labelCol="labels").fit(featurizedImages)<br></em></span></p>
<p><span>Note that the CNTKModel is a SparkML PipelineStage, so you can compose it with any SparkML transformations and estimators in a scalable manner. For more examples, see our <a href="https://github.com/Azure/mmlspark/tree/master/notebooks/samples">Jupyter notebooks for image classification</a>.<br></span></p>
<h4><span>Open Source<br></span></h4>
<p><span>To make Spark better for everyone, we&rsquo;ve released MMLSpark as an Open Source project on GitHub &ndash; and we would welcome your contributions. For example, you can:<br></span></p>
<ul><li><span>Provide feedback as GitHub issues, to request features and report bugs.<br></span></li>
<li><span>Contribute documentation and examples.<br></span></li>
<li><span>Contribute new features and bug fixes, and participate in code reviews.<br></span></li>
</ul><h4><span>Getting Started<br></span></h4>
<p><span>You can quickly get started by installing the library on your local computer as a container from <a href="https://hub.docker.com/r/microsoft/mmlspark/">Docker Hub</a> using this one-line command:<br></span></p>
<p><span><em>docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark<br></em></span></p>
<p><span>Then, when you&rsquo;re ready to take your model to scale, you can install the library on your cluster as a Spark package. The library can be installed on any Spark 2.1 cluster, whether on-premises, on Azure HDInsight, or on Databricks.<br></span></p>
<p><span>Take a look at our <a href="https://github.com/Azure/mmlspark">GitHub repository</a> for installation instructions, links to <a href="http://azuremlbuild.blob.core.windows.net/pysparkapi/index.html">documentation</a>, and <a href="https://github.com/Azure/mmlspark/tree/master/notebooks/samples">example Jupyter Notebooks</a>.<br></span></p>
<p><span>Roope &amp; Sudarshan<br></span></p>
<p><span><em>Apache&reg;, Apache Spark, and Spark&reg; are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.</em><br></span></p>]]></description>
						<author>Cortana Intelligence and ML Blog Team</author>
						<source url="https://blogs.technet.microsoft.com/machinelearning/feed/">Cortana Intelligence and Machine Learning Blog</source>
						<comments>https://blogs.technet.microsoft.com/machinelearning/2017/06/07/announcing-microsoft-machine-learning-library-for-apache-spark/feed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/in-case-you-missed-it-may-2017-roundup.html</guid>
						<pubDate>Tue, 06 Jun 2017 19:33:04 +0000</pubDate>
						<relativeTime>5 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[In case you missed it: May 2017 roundup]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/in-case-you-missed-it-may-2017-roundup.html</link>
						<description><![CDATA[<p>In case you missed them, here are some articles from May of particular interest to <a href="https://mran.microsoft.com/documents/what-is-r/" target="_self">R</a> users.</p>
<p>Many interesting <a href="http://blog.revolutionanalytics.com/2017/05/watch-rfinance-2017.html">presentations recorded at the R/Finance 2017 conference</a> in Chicago are now available to watch.</p>
<p>A review of some of the <a href="http://blog.revolutionanalytics.com/2017/05/runconf17.html">R packages and projects implemented at the 2017 ROpenSci Unconference</a>.</p>
<p>An example of applying <a href="http://blog.revolutionanalytics.com/2017/05/who-is-the-caretaker.html">Bayesian Learning with the "bnlearn" package</a> to challenge stereotypical assumptions.</p>
<p>Data from the Billboard Hot 100 chart used to find the <a href="http://blog.revolutionanalytics.com/2017/05/love-is-all-around.html">most popular words in the titles of pop hits</a>.</p>
<p><a href="http://blog.revolutionanalytics.com/2017/05/microsoft-r-open-340-now-available.html">Microsoft R Open 3.4.0</a> is now available for Windows, Mac and Linux.</p>
<p>How to <a href="http://blog.revolutionanalytics.com/2017/05/tweenr.html">use the "tweenr" package to create smooth transitions</a> in data animations.&nbsp;</p>
<p>A preview of some of the companies and R applications to be presented at the <a href="http://blog.revolutionanalytics.com/2017/05/preview-of-earl-san-francisco.html">EARL conference in San Francisco</a>.</p>
<p>The <a href="http://blog.revolutionanalytics.com/2017/05/azuredsvm-a-new-r-package-for-elastic-use-of-the-azure-data-science-virtual-machine.html">AzureDSVM package</a> makes it easy to spawn and manage clusters of the Azure Data Science Virtual Machine.</p>
<p>An online&nbsp;<a href="http://blog.revolutionanalytics.com/2017/05/cdrc-spatial-course.html">course on spatial data analysis in R</a>, from the Consumer Data Research Centre in the UK.</p>
<p>Video and slides of Lixun Zhang's presentation "<a href="http://blog.revolutionanalytics.com/2017/05/r-in-financial-services-presentation.html">R in Financial Services: Challenges and Opportunity</a>" from the New York R Conference.</p>
<p>Visual Studio 2017 now features built-in support for both <a href="http://blog.revolutionanalytics.com/2017/05/r-and-python-support-now-built-in-to-visual-studio-2017.html">R and Python development</a>.</p>
<p>Quantifying the <a href="http://blog.revolutionanalytics.com/2017/05/analyzing-the-home-advantage-in-english-soccer-with-r.html">home-field advantage in English Premier League football</a>.</p>
<p>Using the new <code>CRAN_package_db</code> function to <a href="http://blog.revolutionanalytics.com/2017/05/analyzing-data-on-cran-packages.html">analyze data about CRAN packages</a>.</p>
<p>Stack Overflow Trends tracks the <a href="http://blog.revolutionanalytics.com/2017/05/stack-overflow-trends.html">trajectory of questions about R and Python</a>.</p>
<p>A recorded webinar on <a href="http://blog.revolutionanalytics.com/2017/05/hospital-length-of-stay.html">using Microsoft R to predict length of stay in hospitals</a>.</p>
<p>The new <a href="http://blog.revolutionanalytics.com/2017/05/real-time-scoring-with-mrs-91.html">"Real-Time Scoring" capability in Microsoft R Server</a> creates a service to generate predictions from certain models in milliseconds.</p>
<p>"<a href="http://blog.revolutionanalytics.com/2017/05/technical-foundations-of-informatics.html">Technical Foundations of Informatics</a>" is an open course guide on data analysis and visualization with R with a modern slant.</p>
<p>The <a href="http://blog.revolutionanalytics.com/2017/05/the-datasaurus-dozen.html">Datasaurus Dozen</a> generalizes Anscombe's Quartet with a process to create datasets of any shape with (nearly) the same summary statistics.</p>
<p>CareerCast ranks <a href="http://blog.revolutionanalytics.com/2017/05/best-job-i2017-statistican.html">Statistician as the best job to have in 2017</a>.</p>
<p>You can now <a href="http://blog.revolutionanalytics.com/2017/05/using-microsoft-r-with-alteryx.html">use Microsoft R within Alteryx Designer</a>.</p>
<p>How to clean messy data in Excel by <a href="http://blog.revolutionanalytics.com/2017/05/clean-messy-data-by-providing-examples-in-excel.html">providing just a few examples of transformations</a>.</p>
<p>And some general interest stories (not necessarily related to R):</p>
<ul><li><a href="http://blog.revolutionanalytics.com/2017/05/because-its-friday-history-of-australia.html">The history of Australia's states</a></li>
<li><a href="http://blog.revolutionanalytics.com/2017/05/the-history-of-the-universe-in-20-minutes.html">An amusingly animated history of the Universe</a></li>
<li><a href="http://blog.revolutionanalytics.com/2017/05/because-its-friday-video-projection.html">A Neural Network predicts a movie from just one frame</a></li>
<li><a href="http://blog.revolutionanalytics.com/2017/05/becasue-its-friday-bayesian-trap.html">The Bayesian Trap, explained</a></li>
</ul><p>As always, thanks for the comments and please send any suggestions to me at <a href="mailto:davidsmi@microsoft.com">davidsmi@microsoft.com</a>. Don't forget you can follow the blog using an RSS reader, via&nbsp;<a href="http://blogtrottr.com/" target="_self">email using blogtrottr</a>, or by following me on Twitter (I'm <a href="http://twitter.com/revodavid">@revodavid</a>). You can find roundups of previous months&nbsp;<a href="http://blog.revolutionanalytics.com/roundups/">here</a>.</p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/in-case-you-missed-it-may-2017-roundup.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.technet.microsoft.com/machinelearning/2017/06/06/introducing-the-new-data-science-virtual-machine-on-windows-server-2016/</guid>
						<pubDate>Tue, 06 Jun 2017 16:00:38 +0000</pubDate>
						<relativeTime>5 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Introducing the new Data Science Virtual Machine on Windows Server 2016]]></title>
						<link>https://blogs.technet.microsoft.com/machinelearning/2017/06/06/introducing-the-new-data-science-virtual-machine-on-windows-server-2016/</link>
						<description><![CDATA[<p><span><em>This post is authored by Udayan Kumar, Software Engineer at Microsoft.<br></em></span></p>
<p><span>We are excited to offer a <a href="https://aka.ms/dsvm/win2016">Windows Server 2016 version</a> of our very popular Microsoft Azure Data Science Virtual Machine (DSVM). This new DSVM version is based on the latest <a href="https://www.microsoft.com/en-us/cloud-platform/windows-server-comparison">Windows Server 2016 Data Center</a> edition. We&rsquo;ve added new tools and upgraded existing tools to the latest versions as part of this release. Highlights of these new additions include:<br></span></p>
<p><span>1. Windows Server 2016 with Docker container support, to design and run Windows containers. You can refer to the getting started guide <a href="https://docs.microsoft.com/en-us/virtualization/windowscontainers/index">here</a>.</span></p>
<p><span>2. Microsoft Office Pro-Plus with shared activation, including Excel, Word, OneNote, and PowerPoint. More information about Microsoft Office Pro-Plus is <a href="https://products.office.com/en-us/business/office-365-proplus-business-software">here</a>. An Office 365 subscription or Office license is required.<br></span></p>
<p><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060617_0018_Introducing1.gif"></p>
<p><span>3. Unified support for Deep Learning on both GPU-based and CPU-only virtual machines. Earlier, Windows DSVM users had to install the GPU-based deep learning capabilities via an extension script on the Windows Server 2012 version of the DSVM. With this release, we are <em>pre-installing</em> the NVIDIA GPU drivers, CUDA toolkit 8.0, and cuDNN library in the image. Along with it, we have also installed the latest GPU versions (these will also work with CPU-only machines) of the following popular deep learning frameworks: <a href="https://www.microsoft.com/en-us/cognitive-toolkit/">Microsoft Cognitive Toolkit (CNTK)</a>, <a href="https://www.tensorflow.org/">TensorFlow</a>, and <a href="http://mxnet.io/">MXNet</a>.<br></span></p>
<div>
<table border="0"><colgroup><col></colgroup><tbody valign="top"><tr><td><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060617_0018_Introducing2.gif"></td>
</tr><tr><td><img width="551" height="245" alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/DSVM-WS2016-v2.png"></td>
</tr></tbody></table></div>
<p><span><em>Note:</em> Compute GPUs based on the NVIDIA Tesla K80 are offered as NC-class virtual machines and are currently available in the South Central US, East US, West US 2, West Europe, and Southeast Asia regions in Azure.</span><span><br></span></p>
<p><span>4. An upgrade to the latest Microsoft R Server 9.1. Major changes in this include cognitive models such as <a href="https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/19/introducing-microsoft-r-server-9-1-release/">sentiment analysis &amp; image featurizers</a> as well as <a href="https://blogs.msdn.microsoft.com/rserver/2017/04/19/whats-new-in-r-server-9-1-operationalization/">enterprise-grade operationalization with real-time scoring and dynamic scaling of VMs</a>.</span></p>
<p><span>A complete list of tools available on the DSVM is <a href="https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-virtual-machine-overview">here</a>.<br></span></p>
<h4><span>Why DSVM<br></span></h4>
<ol><li><span><em>Because time is money. </em>DSVM allows users to get a data science desktop in the cloud without worrying about installation, configuration and maintenance. Since this environment is repeatable, you can destroy the machine as soon as you are done (of course, after saving your data on persistent storage).<br></span></li>
<li><span><em>Scale. </em>Leveraging the benefits of the cloud, DSVM also allows users to scale the machine as needed. For instance, if you&rsquo;re regularly using an 8GB RAM machine and you have a workload that needs more RAM or a GPU, you can elastically resize the machine into the desired configuration. Once again, once you&rsquo;re done, you can scale down or even shutdown the VM, reducing your compute costs.<br></span></li>
<li><span><em>Standardized work environment. </em>With the DSVM, all the users of an organization get the exact same standardized setup. No more &ldquo;but it works on my machine&rdquo; problem to deal with!<br></span></li>
</ol><h4><span>Get Started Today<br></span></h4>
<p><span>We invite you to explore the new <a href="https://aka.ms/dsvm/win2016">Windows Server 2016 based DSVM</a> for your machine learning, deep neural network, and data science projects. It&rsquo;s available on the <a href="https://aka.ms/dsvm/win2016">Azure Marketplace</a> today, to run on either CPU-only or GPU-based VMs. We also offer the DSVM in both <a href="http://aka.ms/dsvm/ubuntu">Ubuntu</a> and <a href="http://aka.ms/dsvm/centos">CentOS-based</a> Linux. <a href="http://azure.com/free">Free Azure credits</a> are available to help get you started.<br></span></p>
<p><span>Udayan<br></span></p>]]></description>
						<author>Cortana Intelligence and ML Blog Team</author>
						<source url="https://blogs.technet.microsoft.com/machinelearning/feed/">Cortana Intelligence and Machine Learning Blog</source>
						<comments>https://blogs.technet.microsoft.com/machinelearning/2017/06/06/introducing-the-new-data-science-virtual-machine-on-windows-server-2016/feed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.technet.microsoft.com/machinelearning/2017/06/05/microsoft-ai-now-serving-critical-care-patients-water-insecure-populations-in-africa-bank-customers-in-new-zealand-many-more/</guid>
						<pubDate>Mon, 05 Jun 2017 16:04:12 +0000</pubDate>
						<relativeTime>6 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Microsoft AI – Now Serving Critical Care Patients, Water-Insecure Populations in Africa, Bank Customers in New Zealand &#038; Many More]]></title>
						<link>https://blogs.technet.microsoft.com/machinelearning/2017/06/05/microsoft-ai-now-serving-critical-care-patients-water-insecure-populations-in-africa-bank-customers-in-new-zealand-many-more/</link>
						<description><![CDATA[<p><span><em>New customers are benefiting from Microsoft&rsquo;s Artificial Intelligence offerings at a record pace &ndash; here&rsquo;s a look at some of the latest.<br></em></span></p>
<h4><span>Saving the Lives of Patients in Critical Care<br></span></h4>
<p><span>A large part of our current healthcare system is geared towards treating patients only after they have fallen quite sick. While it would clearly be very desirable to proactively prevent debilitating medical incidents, doing so has been a formidable challenge.<br></span></p>
<p><span>Consider, for instance, a patient who has recently undergone major surgery, and whose condition might deteriorate rapidly, even fatally, within minutes because of a risk of multi-organ failure following that surgery. Even though the human body signals this type of failure in advance, such signals can be very subtle (e.g. momentary heart fluctuations, short episodes of breathing difficulty, profuse sweating that stops after a few minutes, etc.). The signals are also transient, so nurses and doctors may not catch them during their routine checks on the patient. However, medical sensors can detect and stream such signals in real-time to centralized monitoring systems which can then alert healthcare providers before something bad happens.<br></span></p>
<p><a href="http://kensci.com/"><span>KenSci</span></a><span> began as a collaboration between the University of Washington and Microsoft Research. Their unique team, comprised mostly of doctors and data scientists, share a passion around applying AI to help people live longer, healthier lives. KenSci is pioneering the use of machine learning in predictive healthcare risk management. By identifying which patients will get sick, when, how sick they&rsquo;ll get, and what can be done to help them, their solutions are helping patients, doctors, nurses, and other participants in our healthcare ecosystem.<br></span></p>
<p><span>KenSci uses data from electronic medical records and claims, as well as psychosocial, operational, and patient-generated data to align payers and providers around value-based care initiatives. The KenSci Risk Prediction platform, built on <a href="https://www.microsoft.com/en-us/cloud-platform/cortana-intelligence-suite">Microsoft&rsquo;s cloud and data technologies</a>, including <a href="https://azure.microsoft.com/en-us/services/machine-learning/">Azure Machine Learning</a>, is powered by over 150 prebuilt ML models, and predicts care and cost risks for 17 million patients today. These ML algorithms can identify patterns that indicate serious health deterioration in patients, sometimes as many as eight hours in advance.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/kensci-health-azure"><span>Read more about the KenSci story here</span></a><span> &ndash; it&rsquo;s a testimonial to how AI, IoT and cloud computing are coming together to save the lives of patients in critical care.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/kensci-health-azure"><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060317_0026_MicrosoftAI1.png" border="0"></a><span><br></span></p>
<h4><span>Addressing Water Security in Kenya<br></span></h4>
<p><span>Millions of people around the globe live in a state of &ldquo;water insecurity&rdquo;, in the constant fear of not having enough water on a given day. The time spent finding and carrying water, if local wells are not reliable, steals precious time away from farming, making a living or going to school, and water issues are closely tied to a cycle of poverty.<br></span></p>
<p><span>The <a href="http://reachwater.org.uk/about-reach/">REACH initiative</a> is hoping to break this vicious cycle. Funded by the UK government, REACH is using the expertise of professors and machine learning experts from the University of Oxford and, in partnership with UNICEF and other organizations, working towards the goal of making five million poor people water-secure in Africa and Asia.<br></span></p>
<p><span>The team has taken sensors like the ones that go into our smartphones and fitness bands and put them inside water pump handles on rural wells. These sensors are used to monitor groundwater levels and to improve the pace of repairs needed to fix broken pumps. Their accelerometers and gyroscopes record the motion and vibration of the pump handles and, by applying machine learning techniques to this data, the team can predict whether the water is coming from a deep or a shallow source, and how much remains underground. These models were developed on a desktop computer, but the team was able to accelerate their work by using the Azure cloud, including R and Python in <a href="https://azure.microsoft.com/en-us/services/machine-learning/">Azure Machine Learning</a>, which helped them go straight from lab to production.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/reach"><span>Learn more about the REACH story here</span></a><span>. Their solution is now getting ready for widespread deployment across Kenya, and they hope to bring similar benefits to water-insecure populations elsewhere.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/reach"><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060317_0026_MicrosoftAI2.png" border="0"></a><span><br></span></p>
<h4><span>Providing Best-in-Class Banking to New Zealand Customers (by Switching from SAS to Microsoft)<br></span></h4>
<p><a href="https://www.heartland.co.nz/"><span>Heartland Bank</span></a><span> has served businesses, households, and the rural sector in New Zealand since 1875. To stay competitive with larger banks and newer internet payment services while still staying true to their customer-focused roots, the bank decided to adopt a data-driven approach that would lead to more business agility, innovation and growth.<br></span></p>
<p><span>There are specific product niches where Heartland is strong, each with its own requirements, such as around analyzing risk, evaluating credit lines, understanding cash flows etc., and it was increasingly critical for the bank to be proactive, providing early warnings when things were changing in any one of these areas. Heartland found it very labor-intensive and time-consuming to change financial models on their earlier SAS system. What&rsquo;s more, these analytics tools were expensive, being licensed on a per-user basis, thus limiting their access to only a small group of IT staff.<br></span></p>
<p><span>What Heartland needed was a flexible solution that would evolve and scale with the company. They wanted an analytics platform to support future innovation, with sophisticated predictive analytics capabilities. To support their growth strategy, the bank decided to replace their existing SAS system with a platform based on <a href="https://www.microsoft.com/en-us/cloud-platform/r-server">Microsoft R Server</a> and <a href="https://www.microsoft.com/en-us/sql-server/sql-server-2016">SQL Server 2016</a>. R Server uses the powerful, open source R statistical programming language which would enable Heartland to take advantage of a broad range of analytics, including big data statistics, predictive modeling, and machine learning. The bank also had the confidence that they could tap into a global community of millions of R developers.<br></span></p>
<p><span>The changes they made have sparked a new attitude towards data at Heartland. Business users across the bank can directly work with the new data models, rather than having to rely on IT to produce reports. Employees have direct access to real-time information and no longer need to wait for overnight batch processing. The expanded access to critical information has resulted in employees seeking answers to more complex questions and thus more sophisticated analytics. Bank employees are now able to get answers to questions such as &ldquo;What would happen to our deposits if we increased our six-month deposit rate by 20 basis points over the next three months?&rdquo; Many more exciting opportunities are on the horizon, such as using cognitive services for facial recognition, for instance, to compare the image on photo IDs with loan applicants, as an added layer of security.<br></span></p>
<p><span>You can <a href="http://customers.microsoft.com/en-us/story/heartlandbank">learn about the Heartland Bank story here</a>. Their partnership with Microsoft is helping them provide innovative, best-in-market products to customers across New Zealand.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/heartlandbank"><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060317_0026_MicrosoftAI3.png" border="0"></a><span><br></span></p>
<h4><span>Boosting the ROI on Marketing Campaigns (by Switching from AWS to Azure)<br></span></h4>
<p><span>Silicon Valley startup <a href="http://www.trackrevenue.com/">Track Revenue</a> offers powerful marketing analytics software-as-a-service. Track Revenue helps advertisers optimize their return on investment (ROI) on online ads.<br></span></p>
<p><span>Although Track Revenue initially hosted their service on Amazon Web Services (AWS), when they were looking to launch the next generation of their service, after extensive evaluation, they decided to switch to Azure. Aside from unique technology advantages that the team could find only in Azure, Track Revenue also got strong engineering collaboration and support from Microsoft, as well as support for all their open source technology (which, as they discovered, was as good as or better than AWS, depending on the software). In many cases, they got comparable or better performance than AWS, and for significantly lower cost (for instance, with their use of Mongo virtual machines).<br></span></p>
<p><span>A key technology advantage they found in Azure was <a href="https://www.microsoft.com/en-us/cloud-platform/cortana-intelligence-suite">Cortana Intelligence</a>, a fully managed suite that delivers state-of-the-art big data and AI capabilities on Azure. The suite, which includes <a href="https://azure.microsoft.com/en-us/services/machine-learning/">Azure Machine Learning</a>, helped Track Revenue transform customer data into intelligent action. When a user clicks on an ad, for instance, a set of data points is sent to the Cortana Intelligence Suite, and predictions based on those data points get sent back to Track Revenue. The Track Revenue service then serves the most appropriate landing page or offer to the potential buyer, with the entire process completing in milliseconds. This has helped Track Revenue optimize conversions and revenues in near real-time. Aside from making machine learning very easy, Cortana Intelligence also supported extensive custom configuration options, and proved to be a big win for Track Revenue right out of the gate.<br></span></p>
<p><span>The interoperability between services was another factor that tipped the scales in Azure&rsquo;s favor. For example, Track Revenue is able to take advantage of streaming analytics, Event Hubs, Azure SQL Database, Azure SQL Data Warehouse, Redis Cache, Table Storage and more.<br></span></p>
<p><span>Based on tens of thousands of campaigns that are using their service, Track Revenue on Azure has boosted customers&rsquo; ROI by 15 percent, conversion rates by 12 percent, and earnings-per-click (EPC) by 38 percent (with some customers seeing EPC soar by as much as 47 percent). What&rsquo;s more, Track Revenue shaved months off the development of its new service by taking advantage of Azure rather than their former cloud platform. In the end, Track Revenue wasn&rsquo;t content to move just their flagship service to Azure; they did a complete migration of their entire infrastructure to the Microsoft cloud.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/this-maverick-moved-from-amazon-to-azure-and-boosted-customer-revenue-per-click-by-38-percent"><span>Learn more about Track Revenue here</span></a><span>. By betting on Azure, they&rsquo;ve successfully released a next-generation version of their service, one that extensively optimizes customers&rsquo; marketing campaigns.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/this-maverick-moved-from-amazon-to-azure-and-boosted-customer-revenue-per-click-by-38-percent"><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060317_0026_MicrosoftAI4.png" border="0"></a><span><br></span></p>
<h4><span>Delivering Optimal Airline Seat Pricing, Blazing Fast<br></span></h4>
<p><span>The price of airline seats depends on thousands of factors that change very rapidly because of the dynamics of supply and demand. There can be tens of millions of price inquiries per hour and prices are affected by competitor prices, flight availability, weather changes, and network demand.<br></span></p>
<p><span>When airline companies &ndash; as well as top retailers and manufacturers &ndash; need competitive up-to-the-minute pricing, they turn to <a href="http://www.pros.com/">PROS</a>. PROS runs lightning-fast analytics on billions of records to deliver the right prices at the right time to their customers.<br></span></p>
<p><span>Until a few years ago, PROS software was deployed on-premises, either at the customer sites or in PROS&rsquo; private cloud. This naturally required large capital investments for hardware infrastructure, either on the part of PROS or their customers. If customers were to scale down or end their service subscription, they would be left with tons of unused hardware. Each datacenter also required a lot of manpower attached to it. Furthermore, when PROS acquired a couple of SaaS companies, they further saw how they could benefit by consolidating their infrastructure at a cloud provider.<br></span></p>
<p><span>PROS decided to move their operations to the <a href="https://azure.microsoft.com/en-us/">Azure cloud</a> and get out of the business of buying and managing hardware. In doing so, they found they could substantially lower their infrastructure costs. They also took advantage of <a href="https://www.microsoft.com/en-us/sql-server/sql-server-r-services">SQL Server 2016 with R Services</a>, which allows the same ML used by data scientists in their research to be used in production. Furthermore, R Services in SQL Server also brought scalable, R-based analytics into the very place where PROS&rsquo; data is generated, helping them improve database process performance by a factor of 100. The company turned to <a href="https://azure.microsoft.com/en-us/solutions/data-lake/">Azure Data Lake</a> (ADL) to store unstructured data, running processes in ADL to interpret customer behavior, mine insights and determine the right price for the customer, while also helping their clients gain incremental revenue. With operations in the cloud, scaling is easy, and the company is now able to grow with the accelerating pace of demand.<br></span></p>
<p><span>You can read more about the <a href="http://customers.microsoft.com/en-us/story/pros">PROS case study here</a>. Using the power of the Azure intelligent cloud, PROS is able to deliver state-of-the-art software-as-a-service (SaaS) solutions that run complex calculations on numerous data sources, and establish optimized prices with great speed, precision and consistency &ndash; and at a lower cost.<br></span></p>
<p><a href="http://customers.microsoft.com/en-us/story/pros"><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060317_0026_MicrosoftAI5.png" border="0"></a><span><br></span></p>
<p><span>CIML Blog Team</span></p>]]></description>
						<author>Cortana Intelligence and ML Blog Team</author>
						<source url="https://blogs.technet.microsoft.com/machinelearning/feed/">Cortana Intelligence and Machine Learning Blog</source>
						<comments>https://blogs.technet.microsoft.com/machinelearning/2017/06/05/microsoft-ai-now-serving-critical-care-patients-water-insecure-populations-in-africa-bank-customers-in-new-zealand-many-more/feed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/power-bi-free-e-book.html</guid>
						<pubDate>Mon, 05 Jun 2017 13:51:36 +0000</pubDate>
						<relativeTime>6 days ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Powe[R] BI: Free e-book on using R with Power BI]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/power-bi-free-e-book.html</link>
						<description><![CDATA[<p>A new (and free!) <a href="https://www.blue-granite.com/power-bi-ebook">e-book on extending the capabilities of Power BI with R</a> is now available for download, from analytics consultancy BlueGranite. The introduction to the book explains why R and Power BI are a great match together:&nbsp;</p>
<p><a href="https://www.blue-granite.com/power-bi-ebook"><img alt="BlueGranite cover" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d2899403970c-200wi" title="BlueGranite cover"></a></p>
<blockquote>
<p>As a specialized, open source statistical environment, R represents the primary analysis language for a large number of data scientists and statisticians. In recent years, R has also undergone a significant shift in user base by gaining wider adoption in the business world.</p>
<p>By extending Power BI with R, Microsoft has opened up numerous opportunities to enhance your Business Intelligence solutions. In addition to its versatility for data science, R is a great language and ecosystem for work related to both data visualization and data processing. By incorporating R into its products, Microsoft has signaled a strong commitment not only to data science, but to the R platform in general.</p>
</blockquote>
<p>The book provides a step-by-step guide to using R within Power BI, including:</p>
<ul><li>How to find, install and use pre-built custom visuals based on R within Power BI</li>
<li>How to create your own&nbsp;R Visuals using the R language, and use them in both Power BI Desktop and the cloud-based&nbsp;Power BI Service</li>
<li>How to perform custom data processing with R scripts</li>
</ul><p>The e-book is available for download at the link below (free registration required). For more links to Power BI resources, you might also want to check out the blog post, "<a href="http://blog.revolutionanalytics.com/2016/08/powerbi-and-r.html">R with Power BI</a>".</p>
<p>BlueGranite: <a href="https://www.blue-granite.com/power-bi-ebook">Power[R] BI:&nbsp;Enhance Your Microsoft Power BI Experience</a>&nbsp;(<a href="https://dataveld.wordpress.com/2017/06/02/free-ebook-power-bi-enhance-your-microsoft-power-bi-experience-with-r/">via</a> David Eldersveld)</p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/power-bi-free-e-book.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/because-its-friday-disappearing-dots.html</guid>
						<pubDate>Fri, 02 Jun 2017 18:40:28 +0000</pubDate>
						<relativeTime>1 week ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Because it&#8217;s Friday: Disappearing Dots]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/because-its-friday-disappearing-dots.html</link>
						<description><![CDATA[<p>It's been <a href="http://blog.revolutionanalytics.com/2016/09/peripheral-illusions.html">a while</a> since we posted an optical illusion, and this one (<a href="https://twitter.com/galka_max/status/870371371590791168">via</a> Max Galka) is just too good to pass up. Here are the instructions, from <a href="https://whyevolutionistrue.wordpress.com/2014/09/02/why-do-the-dots-disappear/">the source</a>:</p>
<blockquote>
<div>First, look at any yellow dot as the figure moves. The yellow dot remains present and stationary. If you concentrate on all&nbsp;<em>three</em> yellow dots, they&nbsp;remain there as well.</div>
<div>&nbsp;</div>
<div>But now concentrate on the <em>central green dot</em>. You will see one or more of the yellow dots disappearing and then reappearing sporadically. They are not&mdash;this is an optical illusion. The dots remain and your brain simply <em>doesn&rsquo;t register their presence</em> from time to time. Weird, eh?</div>
</blockquote>
<p><img src="http://blog.revolutionanalytics.com/downloads/dots%20illusion.gif" width="500"></p>
<p>It's a bit of a puzzle why this one works. Why, for instance, is the rotating grid necessary? The mind works in mysterious ways.</p>
<p>That's all for us for this week. We'll be back on Monday, when we'll be checking in from the <a href="https://earlconf.com/sanfrancisco/">EARL Conference in San Francisco</a>. See you then, and have a great weekend!</p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/because-its-friday-disappearing-dots.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/teach-kids-about-r-with-minecraft.html</guid>
						<pubDate>Fri, 02 Jun 2017 15:19:42 +0000</pubDate>
						<relativeTime>1 week ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Teach kids about R with Minecraft]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/teach-kids-about-r-with-minecraft.html</link>
<description><![CDATA[<p>As I <a href="http://blog.revolutionanalytics.com/2017/05/runconf17.html">mentioned</a> earlier this week, I was on a team at the ROpenSci Unconference (with&nbsp;<a href="https://github.com/geanders">Brooke Anderson</a>,&nbsp;<a href="https://github.com/kbroman">Karl Broman</a>,&nbsp;<a href="https://github.com/daroczig">Gergely Dar&oacute;czi</a>, and my Microsoft colleagues&nbsp;<a href="https://github.com/inchiosa">Mario Inchiosa</a>&nbsp;and&nbsp;<a href="https://github.com/akzaidi">Ali Zaidi</a>) to work on a project to interface the <a href="https://mran.microsoft.com/documents/what-is-r/">R language</a> with <a href="https://minecraft.net/en-us/">Minecraft</a>. The resulting R package, <a href="https://github.com/ropenscilabs/miner">miner</a>, is now available to install from Github. The goal of the package is to introduce budding programmers to the R language via their interest in Minecraft, and to that end there's also a book (<a href="https://ropenscilabs.github.io/miner_book/index.html">R Programming with Minecraft</a>) and an associated R package (<a href="https://github.com/ROpenSciLabs/craft">craft</a>) under development to provide lots of fun examples of manipulating the Minecraft world with R.</p>
<div><a href="https://ropenscilabs.github.io/miner_book/rendering-the-r-logo-in-minecraft.html"><img alt="Rlogo_minecraft" src="http://a2.typepad.com/6a0105360ba1c6970c01bb09a19c32970d-450wi" title="Rlogo_minecraft"></a>
<div>Create objects in Minecraft with R functions</div>
</div>
<p>If you're a parent you're probably already aware of the Minecraft phenomenon, but if not: it's kinda like the Lego of the digital generation. Kids (and kids-at-heart) enter a virtual 3-D world composed of cubes representing ground, water, trees and other materials, and use those cubes to build their own structures, which they can then explore with their friends using their in-game avatars. Inspired by the&nbsp;Python-focused book "<a href="https://www.nostarch.com/programwithminecraft">Learn to program with Minecraft</a>", Karl had the brilliant idea of building a similar interface around R.&nbsp;</p>
<p>The miner package provides just a few simple functions to manipulate the game world: find or move a player's position;&nbsp;add or remove blocks in the world; send a message to all players in the world. The functions are deliberately simple, designed to be combined into more complex tasks. For example, you could write a function to detect when a player is standing in a hole. Such a function could use the <code>getPlayerIds</code> and <code>getPlayerPos</code> functions to find the IDs and locations of all players in the world, and the <code>getBlocks</code> function to check whether a player is surrounded by blocks that are not type <code>0</code> (air).</p>
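<p>A hole-detection function might look like the sketch below. This is only a sketch: it assumes <code>getPlayerIds</code>, <code>getPlayerPos</code>, and <code>getBlocks</code> behave as just described, the argument conventions are guesses (check the miner vignette for the real signatures), and it needs a running Minecraft server with the RaspberryJuice plugin.</p>
<pre>
# Sketch only: function signatures are assumptions based on the description
# above; requires a running Minecraft server with the RaspberryJuice plugin.
library(miner)

standing_in_hole <- function(id) {
  p <- getPlayerPos(id)   # the player's (x, y, z) position
  # block types immediately around the player's feet
  around <- c(getBlocks(p[1] + 1, p[2], p[3], p[1] + 1, p[2], p[3]),
              getBlocks(p[1] - 1, p[2], p[3], p[1] - 1, p[2], p[3]),
              getBlocks(p[1], p[2], p[3] + 1, p[1], p[2], p[3] + 1),
              getBlocks(p[1], p[2], p[3] - 1, p[1], p[2], p[3] - 1))
  all(around != 0)        # type 0 is air, so all non-air means a hole
}

# check every player currently in the world
sapply(getPlayerIds(), standing_in_hole)
</pre>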
<p>The&nbsp;package also provides a couple of "listener" functions: you can detect player chat messages, or when players strike a block with their sword. You can then write functions that react to player actions. For example, you can write a <a href="https://ropenscilabs.github.io/miner_book/number-guess-chat-bot-in-minecraft.html">chat-bot to play a game with the player</a>, create an <a href="https://ropenscilabs.github.io/miner_book/random-walks-in-the-minecraft-world.html">AI to solve an in-game maze</a>, or give the player the powers of Elsa from <a href="http://www.imdb.com/title/tt2294629/">Frozen</a> to <a href="https://github.com/ropenscilabs/craft/blob/master/R/elsafy.R">freeze water by walking on it</a>:&nbsp;</p>
<p></p>
<p>To get started with the miner package, you'll need to <a href="https://minecraft.net/en-us/store/?ref=m">purchase a copy of Minecraft</a>&nbsp;for Windows, Mac or Linux if you don't have one already. (Note: the Windows 10 version from the Microsoft Store isn't compatible with the miner package.) This is what you'll use to explore the world managed by the Minecraft server, which you'll also need to install along with <a href="https://dev.bukkit.org/projects/raspberryjuice">RaspberryJuice</a> plugin. You can find setup details in the <a href="https://github.com/ropenscilabs/miner/blob/master/vignettes/miner.Rmd">miner package vignette</a>. We installed the open-source&nbsp;<a href="https://www.spigotmc.org/">Spigot</a>&nbsp;server on a Ubuntu VM running in Azure; you might find this <a href="https://github.com/ropenscilabs/miner/blob/master/extra_vignettes/Dockerfile">Dockerfile</a> helpful if you're trying something similar.</p>
<p>The miner and craft packages are available to install from Github at the links below. The packages are a work-in-progress: comments, suggestions, pull-requests and additional examples for the <a href="https://ropenscilabs.github.io/miner_book/index.html">R Programming with Minecraft book</a>&nbsp;are most welcome!</p>
<p>Github (ROpenSci labs): <a href="https://github.com/ROpenSciLabs/miner">miner</a> and <a href="https://github.com/ropenscilabs/craft">craft</a></p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/teach-kids-about-r-with-minecraft.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.msdn.microsoft.com/sqlcat/2017/06/02/performance-impact-of-memory-grants-on-data-loads-into-columnstore-tables/</guid>
						<pubDate>Fri, 02 Jun 2017 15:10:23 +0000</pubDate>
						<relativeTime>1 week ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Performance impact of memory grants on data loads into Columnstore tables]]></title>
						<link>https://blogs.msdn.microsoft.com/sqlcat/2017/06/02/performance-impact-of-memory-grants-on-data-loads-into-columnstore-tables/</link>
						<description><![CDATA[<p><strong><em>Reviewed by:</em></strong><em> Dimitri Furman, Sanjay Mishra, Mike Weiner, Arvind Shyamsundar, Kun Cheng, Suresh Kandoth</em></p>
<h2>Background</h2>
<p>Some of the best practices when bulk inserting into a clustered Columnstore table are:</p>
<ul><li>Specifying a batch size close to 1048576 rows, or at least greater than 102400 rows, so that they land into compressed row groups directly.</li>
<li>Using concurrent bulk loads if you want to reduce the time to load.</li>
</ul><p>For additional details, see the blog post titled <a href="https://blogs.msdn.microsoft.com/sqlcat/2015/03/11/data-loading-performance-considerations-with-clustered-columnstore-indexes/">Data Loading performance considerations with Clustered Columnstore</a><span> indexes,</span> specifically the Concurrent Loading section.</p>
<h2>Customer Scenario</h2>
<p>I was working on a customer engagement involving SQL Server 2017 CTP 2.0 on Linux with a Columnstore implementation, where the speed of the load process was of critical importance to the customer.</p>
<ul><li>Data was being loaded concurrently from 4 jobs, each one loading a separate table.</li>
<li>Each job spawned 15 threads, so in total there were 60 threads concurrently bulk loading data into the database.</li>
<li>Each thread specified the commit batch size to be 1048576.</li>
</ul><h2>Observations</h2>
<p>When we tested with 1 or 2 jobs, resulting in 15 or 30 concurrent threads loading, performance was great. Using the concurrent approach, we had greatly reduced the load time. However, when we increased the number of jobs to 4 jobs running concurrently, or 60 concurrent threads loading, the overall load time more than doubled.</p>
<h2>Digging into the problem</h2>
<p>Just like in any performance troubleshooting case, we checked physical resources, but found no bottleneck in CPU, Disk IO, or memory at the server level. &nbsp;CPU on the server was hovering around 30% for the 60 concurrent threads, and that was almost the same as with 30 concurrent threads. Mid-way into job execution, we also checked DMVs such as sys.dm_exec_requests and sys.dm_os_wait_stats, and saw that INSERT BULK statements were executing, but there was no predominant wait. Periodically, there was LATCH contention, which made little sense &ndash; given the ~1 million batch sizes, data from each bulk insert session should have landed directly in its own compressed row group.</p>
<p>Then we spot checked the row group physical stats DMVs, and observed that despite the batch size specified, the rows were landing in the delta store, and not in compressed row groups directly, as we expected they would.</p>
<p><em>Below is an example of what we observed from sys.dm_db_column_store_row_group_physical_stats:</em></p>
<pre>
select row_group_id, delta_store_hobt_id,state_desc,total_rows,trim_reason_desc
from sys.dm_db_column_store_row_group_physical_stats
where object_id = object_id('MyTable')
</pre>
<p><a href="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg1.png"><img src="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg1.png" alt="" width="707" height="57"></a></p>
<p>As you may recall from the previously referenced blog, inserting into the delta store, instead of into compressed row groups directly, can significantly impact performance. This also explained the latch contention we saw since we were inserting from many threads into the same btree.&nbsp; At first, we suspected that the code was setting the batch size incorrectly, but then we ran an XEvent session and observed the batch size of 1 million specified as expected, so that wasn&rsquo;t a factor. I didn&rsquo;t know of any factors that caused a bulk insert to revert to delta store when it was supposed to go to compressed row groups. Hence, we collected a full set of diagnostics for a run using <a href="https://github.com/Microsoft/DiagManager/wiki/Running-PSSDiag">PSSDIAG</a><span>,</span> and did some post analysis.</p>
<h2>Getting closer&hellip;</h2>
<p>We found that only at the beginning of the run, there was contention on memory grants (RESOURCE_SEMAPHORE waits), for a short period of time. After that and later into the process, we could see some latch contention on regular data pages, which we didn&rsquo;t expect as each thread was supposed to insert into its own row group. You would also see this same data by querying sys.dm_exec_requests live, if you caught it within the first minute of execution, as displayed below.</p>
<p></p><div><a href="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg2.jpg"><img src="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg2.jpg" alt="" width="1294" height="476"></a><p><strong>&nbsp;Figure 1</strong>: Snapshot of sys.dm_exec_requests</p></div>
<p>Looking at the memory grant DMV <a href="https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-memory-grants-transact-sql">sys.dm_exec_query_memory_grants</a>, we observed that at the beginning of the data load, there was memory grant contention. Also, interestingly, each session had a grant of ~5GB (granted_memory_kb), but was using only ~1GB (used_memory_kb). When loading data from a file, the optimizer doesn&rsquo;t know the number of rows in the file, so the memory grant is estimated based on the schema of the table, taking into account the maximum length of the variable-length columns defined. In this specific case, the server was commodity hardware with 240 GB of memory. Memory grants of 5 GB per thread across 60 threads exceeded the total memory on the box. On a larger machine, this situation would not arise. You can also observe multiple sessions that have requested memory but have not yet been granted it (second and third rows in the snapshot in <strong><em>Figure 2</em></strong>). See additional details on memory grants <a href="https://blogs.msdn.microsoft.com/sqlqueryprocessing/2010/02/16/understanding-sql-server-memory-grant/">here</a>.</p>
<p></p><div><a href="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg3.jpg"><img src="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg3.jpg" alt="" width="1331" height="171"></a><p><strong>Figure 2:</strong> Snapshot of sys.dm_exec_query_memory_grants</p></div>
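<p>To take a similar snapshot on your own system, a query along these lines against the documented columns of that DMV will do (a sketch; filter or order it as suits your workload):</p>
<pre>
select session_id, request_time, grant_time,
       requested_memory_kb, granted_memory_kb, used_memory_kb
from sys.dm_exec_query_memory_grants
order by requested_memory_kb desc;
</pre>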
<h2>Root cause discovered!</h2>
<p>We still didn&rsquo;t know the reason for reverting to the delta store, but armed with the knowledge that there was some kind of memory grant contention, we created an extended event session on the query_memory_grant_wait_begin and query_memory_grant_wait_end events, to see whether memory grant timeouts caused this behavior. This XE session did strike gold: we were able to see several memory grants time out after 25 seconds, and could correlate these session_ids to the same session_ids that were running the INSERT BULK commands.</p>
<p></p><div><a href="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg4.jpg"><img src="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg4.jpg" alt="" width="933" height="455"></a><p><strong>Figure 3:</strong> Output of Extended event collection. Duration of wait is the difference between the query_memory_grant_wait_end and query_memory_grant_wait_begin time for that specific session.</p></div>
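<p>For reference, an extended event session capturing those two events can be sketched as follows (the session name and file target here are placeholders, not the ones we used):</p>
<pre>
CREATE EVENT SESSION MemGrantWaits ON SERVER
ADD EVENT sqlserver.query_memory_grant_wait_begin
    (ACTION (sqlserver.session_id, sqlserver.sql_text)),
ADD EVENT sqlserver.query_memory_grant_wait_end
    (ACTION (sqlserver.session_id, sqlserver.sql_text))
ADD TARGET package0.event_file (SET filename = N'MemGrantWaits.xel');
GO
ALTER EVENT SESSION MemGrantWaits ON SERVER STATE = START;
GO
</pre>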
<p>Collecting a stack on the query_memory_grant_wait_begin extended event and doing some source code analysis, we found the root cause of this behavior. For every bulk insert, we first determine, based on the batch size, whether it can go into a compressed row group directly. If it can, we request a memory grant with a timeout of 25 seconds. If we cannot acquire the memory grant within 25 seconds, that bulk insert reverts to the delta store instead of a compressed row group.</p>
<h2>Working around the issue</h2>
<p>Given our prior dm_exec_query_memory_grants diagnostic data, you could also observe from <strong><em>Figure 2 </em></strong>that we asked for a 5GB grant, but used only 1GB. There was room to reduce the grant size to avoid memory grant contention while still maintaining performance. Therefore, we created a Resource Governor workload group that reduced the grant percent parameter to allow greater concurrency during the data load, and tied this workload group via a classifier function to the login that the data load jobs were executed under.</p>
<p>We first lowered the grant percentage from the default of 25% to 10%, but even at that level we couldn&rsquo;t sustain 60 sessions concurrently bulk loading due to RESOURCE_SEMAPHORE waits, as each memory grant requested was still 5 GB. We iterated on the grant percentage a couple of times, lowering it until we landed at 2% for this specific data load. Setting it to 2% prevents a query from getting a memory grant greater than 2% of the target_memory_kb value in the DMV <a href="https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-query-resource-semaphores-transact-sql">sys.dm_exec_query_resource_semaphores</a>. Binding the login used only for data load jobs to the workload group prevented this configuration from affecting the rest of the workload: only load queries ended up in the workload group with the 2% limit on memory grants, while the rest of the workload used the default workload group configuration. At 2%, the memory grant requested for each thread was around 1GB, which allowed the level of concurrency we were looking for.</p>
<pre>
-- Create a workload group for data loading
CREATE WORKLOAD GROUP DataLoading
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 2);
GO

-- If the login is DataLoad, route the session to workload group DataLoading
DROP FUNCTION IF EXISTS dbo.CLASSIFIER_LOGIN;
GO
CREATE FUNCTION dbo.CLASSIFIER_LOGIN ()
RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
    DECLARE @val sysname = 'default';
    IF 'DataLoad' = SUSER_SNAME()
        SET @val = 'DataLoading';
    RETURN @val;
END
GO
-- Make the function known to the Resource Governor as its classifier
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.CLASSIFIER_LOGIN);
GO
-- Apply the configuration changes
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
</pre>
<p><strong>Note:</strong> Usually with memory grants, you can use the <a href="https://blogs.msdn.microsoft.com/psssql/2016/06/09/new-memory-grant-query-hint-min_grant_percent-came-to-rescue/">query-level hints</a> MAX_GRANT_PERCENT and MIN_GRANT_PERCENT. In this case, because the load was driven by an ETL workflow (for example, an SSIS package), there was no user-defined query to which we could add the hint.</p>
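<p>Where a user-authored query is available, the hint goes in the query's OPTION clause; for example (table names here are hypothetical):</p>
<pre>
INSERT INTO dbo.FactLoad
SELECT * FROM dbo.FactLoad_Staging
OPTION (MAX_GRANT_PERCENT = 2);
</pre>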
<h2>Final Result</h2>
<p>Once we did that, our 4 jobs could execute in parallel (60 threads loading data simultaneously) in roughly the same timeframe as our prior 2 jobs, reducing total data load time significantly. Running 4 jobs in parallel in almost the same interval of time let us load twice the amount of data, doubling our data load throughput.</p>
<table width="979"><tbody><tr><td width="86"><strong>Concurrent Load Jobs</strong></td>
<td width="66"><strong>Tables Loaded</strong></td>
<td width="64"><strong>Threads loading data</strong></td>
<td width="282"><strong>RG Configuration</strong></td>
<td width="126"><strong>Data Load Elapsed Time &nbsp;(sec)</strong></td>
</tr><tr><td width="86">2</td>
<td width="66">2</td>
<td width="64">30</td>
<td width="282">Default</td>
<td width="126">2040</td>
</tr><tr><td width="86">4</td>
<td width="66">4</td>
<td width="64">60</td>
<td width="282">Default</td>
<td width="126">4160</td>
</tr><tr><td width="86">4</td>
<td width="66">4</td>
<td width="64">60</td>
<td width="282">REQUEST_MAX_MEMORY_GRANT_PERCENT = 2</td>
<td width="126">1040</td>
</tr></tbody></table><p>We could drive CPU to almost 100% now, compared to 30% before the Resource Governor changes.</p>
<p></p><div><a href="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg5.jpg"><img src="https://msdnshared.blob.core.windows.net/media/2017/06/memgrantimg5.jpg" alt="" width="500" height="252"></a><p><strong>Figure 4:</strong> CPU Utilization Chart</p></div>
<h2>Conclusion</h2>
<p>Concurrently loading data into clustered Columnstore indexes requires attention to several factors, including memory grants. Use the techniques outlined in this article to identify whether you are running into similar bottlenecks related to memory grants, and if so, use the Resource Governor to adjust the granted memory and allow for higher concurrency. We hope you enjoyed reading this as much as we enjoyed bringing it to you! Feedback in the Comments&nbsp;welcome.</p>
						<author>Denzil Ribeiro</author>
						<source url="https://blogs.msdn.microsoft.com/sqlcat/feed/">SQL Server Customer Advisory Team</source>
						<comments>https://blogs.msdn.microsoft.com/sqlcat/2017/06/02/performance-impact-of-memory-grants-on-data-loads-into-columnstore-tables/feed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">http://blog.revolutionanalytics.com/2017/06/python-and-r-top-2017-kdnuggets-rankings.html</guid>
						<pubDate>Thu, 01 Jun 2017 21:30:03 +0000</pubDate>
						<relativeTime>1 week ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[Python and R top 2017 KDnuggets rankings]]></title>
						<link>http://blog.revolutionanalytics.com/2017/06/python-and-r-top-2017-kdnuggets-rankings.html</link>
						<description><![CDATA[<p>The <a href="http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html">results of KDnuggets' 18th annual poll of data science software usage</a>&nbsp;are in, and for the first time in three years Python has edged out R as the most popular software. While R increased its share of usage from 45.7% in last year's poll to 52.1% this year, Python's usage among data scientists increased even more, from 36.6% of users in 2016 to 52.6% of users this year.</p>
<p><a href="http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html"><img alt="KDnuggets2017" border="0" src="http://a3.typepad.com/6a0105360ba1c6970c01bb09a1646b970d-800wi" title="KDnuggets2017"></a></p>
<p>There were some interesting moves in the long tail, as well. Several tools&nbsp;entered the KDnuggets chart for the first time,&nbsp;including Keras (9.5% of users), PyCharm (9.0%) and Microsoft R Server (4.3%). And several returning tools saw big jumps in usage, including Microsoft Cognitive Toolkit (3.4% of users), TensorFlow (20.2%) and Power BI (10.2%). Microsoft SQL Server increased its share to 11.6% (up from 10.8%), whereas SAS (7.1%) and Matlab (7.4%) saw declines. Julia, somewhat surprisingly, remained flat at 1.1%.</p>
<p>For the complete results and analysis of the 2017 KDnuggets data science software poll, follow the link below.</p>
<p>KDnuggets: <a href="http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html/">New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll</a></p>]]></description>
						<author>David Smith</author>
						<source url="http://blog.revolutionanalytics.com/atom.xml">Revolutions</source>
						<comments>http://blog.revolutionanalytics.com/2017/06/python-and-r-top-2017-kdnuggets-rankings.htmlfeed</comments>
					</item>
        
					<item>
						<guid isPermaLink="true">https://blogs.technet.microsoft.com/machinelearning/2017/06/01/announcing-ga-of-cognitive-toolkit-2-0-microsofts-open-source-enterprise-ready-tensorflow-outperforming-ai-offering/</guid>
						<pubDate>Thu, 01 Jun 2017 18:15:16 +0000</pubDate>
						<relativeTime>1 week ago</relativeTime>
						<channelId>DataPlatform</channelId>
						<title><![CDATA[GA of Cognitive Toolkit 2.0 – Microsoft’s Open Source, Enterprise-Ready, TensorFlow-Outperforming AI Toolkit]]></title>
						<link>https://blogs.technet.microsoft.com/machinelearning/2017/06/01/announcing-ga-of-cognitive-toolkit-2-0-microsofts-open-source-enterprise-ready-tensorflow-outperforming-ai-offering/</link>
						<description><![CDATA[<p><span><em>Re-posted from the Microsoft Next blog and the Cognitive Toolkit blog.<br></em></span></p>
<p><span>We&rsquo;re excited to announce the general availability of <a href="https://github.com/Microsoft/CNTK">Cognitive Toolkit 2.0</a>, Microsoft&rsquo;s open source, enterprise-ready, production-grade AI offering. Cognitive Toolkit allows users to create, train, and evaluate their own neural networks that can then scale efficiently across multiple GPUs and machines on massive data sets. Cognitive Toolkit is being used extensively by companies worldwide with a need to deploy deep learning at scale, and by a wide variety of Microsoft products as well as students and academics worldwide.&nbsp;</span><span>Key upgrades in this new version of the toolkit (which was formerly known as CNTK) include:<br></span></p>
<ul><li><span>A preview of Keras support natively running on Cognitive Toolkit.<br></span></li>
<li><span>Java bindings and Spark support for model evaluation.<br></span></li>
<li><span>Model compression to increase the speed of evaluating a trained model on CPUs.<br></span></li>
<li><span>Hundreds of additional features and fixes since our beta was introduced.<br></span></li>
<li><span>Performance improvements that make it the <strong>fastest deep learning framework</strong>. </span>
<ul><li><span>Cognitive Toolkit ranked #1 in a performance benchmark against other similar platforms, as seen in the results of the independently run performance tests below. </span><span>This particular test was on a single GPU; on multiple GPUs, the performance gets even better with scale. More information is available in the <a href="https://www.microsoft.com/en-us/cognitive-toolkit/blog/2017/06/microsofts-high-performance-open-source-deep-learning-toolkit-now-generally-available/">original post here</a>.<br></span></li>
</ul></li>
</ul><p><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060117_1758_Announcingt1.png"><br><span><em>Source: <a href="http://dlbench.comp.hkbu.edu.hk/">http://dlbench.comp.hkbu.edu.hk/</a>&nbsp;</em></span></p>
<p><span>To cite one customer example, the <a href="http://chesapeakeconservancy.org/">Chesapeake Conservancy</a> is using the toolkit to train a neural network to speed up the creation of one-meter resolution land cover datasets, to be used in prioritizing restoration and protection efforts in the Chesapeake Bay, an area spanning 64,000 square miles in six states and Washington, D.C. Learn more <a href="https://blogs.microsoft.com/next/2017/06/01/microsoft-releases-open-source-toolkit-to-accelerate-deep-learning/">here</a>.<br></span></p>
<p><img alt="" src="https://msdnshared.blob.core.windows.net/media/2017/06/060117_1758_Announcingt2.jpg"><br><span><em>The Chesapeake Conservancy uses Cognitive Toolkit to create land cover datasets that are used to monitor restoration and protection initiatives throughout the Chesapeake Bay. (Photo credit: Chesapeake Conservancy.)</em></span></p>
<p><span>You can learn more about this latest release of Cognitive Toolkit from the <a href="https://blogs.microsoft.com/next/2017/06/01/microsoft-releases-open-source-toolkit-to-accelerate-deep-learning/">original announcement here</a> or this <a href="https://www.microsoft.com/en-us/cognitive-toolkit/blog/2017/06/microsofts-high-performance-open-source-deep-learning-toolkit-now-generally-available/">developer-friendly blog post</a>. </span><span>We&rsquo;ve even compiled a set of compelling reasons <a href="https://github.com/Microsoft/CNTK/wiki/Eight-Reasons-to-Switch-from-TensorFlow-to-CNTK">why data scientists and developers using other frameworks should switch to Cognitive Toolkit today</a>. </span></p>
<p><span>Go ahead and get started on building cool AI apps right away: Cognitive Toolkit 2.0 is available on <a href="https://github.com/Microsoft/CNTK">GitHub</a>.<br></span></p>
<p><span>CIML Blog Team</span></p>]]></description>
						<author>Cortana Intelligence and ML Blog Team</author>
						<source url="https://blogs.technet.microsoft.com/machinelearning/feed/">Cortana Intelligence and Machine Learning Blog</source>
						<comments>https://blogs.technet.microsoft.com/machinelearning/2017/06/01/announcing-ga-of-cognitive-toolkit-2-0-microsofts-open-source-enterprise-ready-tensorflow-outperforming-ai-offering/feed</comments>
					</item>
        			</channel>
		</rss>
  