Revolutions

July 24, 2017

Analyzing GitHub pull requests with Neural Embeddings, in R

At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distributed on Spark, and the resulting neural network trained using the Microsoft Cognitive Toolkit. You can see the slides here, and you can watch the presentation below.

Posted by David Smith at 14:52 in events, Microsoft, R | Permalink | Comments (0)

July 21, 2017

Because it's Friday: How Bitcoin works

Cryptocurrencies have been in the news quite a bit lately. Bitcoin prices have been soaring recently after the community narrowly avoided the need for a fork, while $32M in rival currency Ethereum was recently stolen, thanks to a coding error in wallet application Parity. But what is a cryptocurrency, and what does a "wallet" or a "fork" mean in that context? The video below gives the best explanation I've seen for how cryptocurrencies work. It's 25 minutes long, but it's a complex and surprisingly subtle topic, made easy to understand by math explainer channel 3Blue1Brown.

That's all from the blog for this week. Have a great weekend, and we'll be back on Monday.

Posted by David Smith at 15:10 in random | Permalink | Comments (0)

IEEE Spectrum 2017 Top Programming Languages

IEEE Spectrum has published its fourth annual ranking of top programming languages, and the R language is again featured in the Top 10. This year R ranks at #6, down a spot from its 2016 ranking (its IEEE score, derived from search, social media, and job listing trends, is tied with that of the #5 place-getter, C#). Python has taken the #1 slot from C, jumping from its #3 ranking in 2016.

For R (a domain-specific language for data science) to rank in the top 10, and for Python (a general-purpose language with many data science applications) to take the top spot, may seem like a surprise. I attribute this to continued broad demand for machine intelligence application development, driven by the growth of "big data" initiatives and the strategic imperative to capitalize on these data stores by companies worldwide. Other data-oriented languages appear in the Top 50 rankings, including Matlab (#15), SQL (#23), Julia (#31) and SAS (#37).

For the complete announcement of the 2017 IEEE Spectrum rankings, including additional commentary and analysis of changes, follow the link below.

IEEE Spectrum: The 2017 Top Programming Languages

Posted by David Smith at 11:41 in popularity, python, R | Permalink | Comments (1)

July 20, 2017

Data Analysis for Life Sciences

Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on EdX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The EdX courses, which you can take for free, are linked below. Alternatively, you can simply follow the series of YouTube videos and materials provided in the index.

Data Analysis for the Life Sciences Series

A companion book and associated R Markdown documents are also available for download.

Genomics Data Analysis Series

For links to all of the course components, including videos and supplementary materials, follow the link below.

rafalab: HarvardX Biomedical Data Science Open Online Training

Posted by David Smith at 08:00 in courses, life sciences, R | Permalink | Comments (0)

July 19, 2017

Securely store API keys in R scripts with the "secret" package

If you use an API key to access a secure service, or need to use a password to access a protected database, you'll need to provide these "secrets" in your R code somewhere. That's easy to do if you just include those keys as strings in your code, but it's not very secure: your private keys and passwords are stored in plain text on your hard drive, and if you email your script they're available to anyone who can intercept that email. It's also really easy to inadvertently include those keys in a public repo if you use GitHub or similar code-sharing services.

To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault to store keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust. You can see how it works in the vignette secret: Share Sensitive Information in R Packages, and in this presentation by Andrie de Vries at useR!2017:

To use the secret package, you'll need access to your private key, which you'll also need to store securely. For that, you might also want to take a look at the in-progress keyring package, which allows you to access secrets stored in Keychain on macOS, Credential Store on Windows, and the Secret Service API on Linux.

The secret package is available now on CRAN, and you can also find the latest development version on GitHub.

Posted by David Smith at 05:00 in packages, R | Permalink | Comments (1)

July 18, 2017

Neural Networks from Scratch, in R

By Ilia Karmanov, Data Scientist at Microsoft

This post is for those of you with a statistics/econometrics background but not necessarily a machine-learning one, and for those of you who want some guidance in building a neural network from scratch in R to better understand how everything fits (and how it doesn't).

Andrej Karpathy wrote that when CS231n (Deep Learning at Stanford) was offered:

"we intentionally designed the programming assignments to include explicit calculations involved in backpropagation on the lowest level. The students had to implement the forward and the backward pass of each layer in raw numpy. Inevitably, some students complained on the class message boards".

Why bother with backpropagation when all frameworks do it for you automatically and there are more interesting deep-learning problems to consider?

Nowadays we can literally train a full neural-network (on a GPU) in 5 lines.

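from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=RMSprop())
# x_train and y_train are placeholders for your own training inputs and one-hot labels
model.fit(x_train, y_train)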
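For contrast with the framework calls above, here is a minimal sketch of the kind of by-hand forward and backward pass Karpathy describes. It is not the code from this post (which uses R) or from CS231n (which uses numpy). It is plain Python with no dependencies, and the network shape (two inputs, two sigmoid hidden units, one linear output, squared-error loss), the function names, and the numeric values are all assumptions chosen for illustration. The analytic gradient is compared against a finite-difference estimate, which is a common way to check a hand-written backward pass.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, w2, x):
    """Forward pass: input x -> sigmoid hidden layer h -> linear output y_hat."""
    h = [sigmoid(sum(wkj * xj for wkj, xj in zip(row, x))) for row in W1]
    y_hat = sum(w2k * hk for w2k, hk in zip(w2, h))
    return h, y_hat


def loss(W1, w2, x, y):
    """Squared-error loss for a single example."""
    _, y_hat = forward(W1, w2, x)
    return 0.5 * (y_hat - y) ** 2


def backward(W1, w2, x, y):
    """Backward pass: chain rule from the loss back to every weight."""
    h, y_hat = forward(W1, w2, x)
    d_out = y_hat - y                              # dL/dy_hat
    grad_w2 = [d_out * hk for hk in h]             # dL/dw2[k]
    # dL/dz[k], using sigmoid'(z) = h * (1 - h)
    d_hidden = [d_out * w2k * hk * (1.0 - hk) for w2k, hk in zip(w2, h)]
    grad_W1 = [[dk * xj for xj in x] for dk in d_hidden]  # dL/dW1[k][j]
    return grad_W1, grad_w2


def numeric_grad_W1(W1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for k, row in enumerate(W1):
        grads.append([])
        for j in range(len(row)):
            plus = [list(r) for r in W1]
            minus = [list(r) for r in W1]
            plus[k][j] += eps
            minus[k][j] -= eps
            diff = loss(plus, w2, x, y) - loss(minus, w2, x, y)
            grads[k].append(diff / (2 * eps))
    return grads


# Arbitrary illustrative values (assumptions, not taken from the post)
W1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [0.5, -0.6]
x, y = [1.0, 2.0], 1.0

grad_W1, grad_w2 = backward(W1, w2, x, y)
numeric = numeric_grad_W1(W1, w2, x, y)
max_diff = max(
    abs(a - n) for ra, rn in zip(grad_W1, numeric) for a, n in zip(ra, rn)
)
print("largest analytic-vs-numeric gap:", max_diff)
```

If the backward pass is right, the printed gap should be tiny relative to the gradients themselves. A large gap usually points to a missed term in the chain rule, which is exactly the kind of detail a framework hides.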

Karpathy abstracts away from the "intellectual curiosity" or "you might want to improve on the core algorithm later" argument. His argument is that backpropagation is a leaky abstraction:

“it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will 'magically make them work' on your data”

Hence, my motivation for this post is two-fold:

Continue reading "Neural Networks from Scratch, in R" »

Posted by Guest Blogger at 09:30 | Permalink | Comments (2)

July 17, 2017

Revisiting the useR!2017 conference: Recordings now available

The annual useR!2017 conference took place July 4-7 in Brussels, and in every dimension it was the best yet. It was the largest (with over 1,100 R users from around the world in attendance), and yet still very smoothly run with many amazing talks and lots of fun for everyone. If you weren't able to make it to Brussels, take a look at these recaps from Nick Strayer & Lucy D'Agostino McGowan, Once Upon Data and DataCamp to get a sense of what it was like, or simply take a look at this recap video:

From my personal point of view, if I were to try and capture useR!2017 in just one word, it would be: vibrant. With so many first-time attendees, an atmosphere of excitement was everywhere, and the conference was noticeably much more diverse than in prior years, a really positive development. Kudos to the organizers for their focus on making useR!2017 a welcoming and inclusive conference, and a special shout-out to the R-Ladies community for encouraging and inspiring so many. I especially enjoyed meeting the diversity scholars and being a part of the special beginner's session held before the conference officially began (and so sadly unrecorded). Judging from the 200+ attendees' reactions there, many welcomed getting a jump-start on the R project, its community, and how best to participate and contribute.

The diversity was reflected in the content, too, with a great mix of tutorials, keynotes and talks on R packages, R applications, the R community and ecosystem, and the R project itself. With thanks to Microsoft, all of this material was recorded, and is now available to view on Channel 9:

useR!2017 Recordings: useR! International R User 2017 Conference

All recordings are streamable and downloadable, and are shared under a Creative Commons license. (Note: a few talks are still in the editing room awaiting posting, but all the content should be available at the link above by July 21.) In many cases, you can also find slides in the sessions listed in the useR!2017 schedule.

With around 300 videos it might be tricky to find the one you want, but you can use the Filters button to reveal a search tool, and you can also filter by specific speakers:

Here are a few searches you might find useful:

Next year's useR! conference, useR!2018, will be held July 10-13 in Brisbane, Australia. The organizers have opened a survey on useR!2018 to give the R community an opportunity to make suggestions on the content. If you have ideas for tutorial topics and presenters, keynote speakers, services like child care or sign language interpreters, or how scholarships should be awarded, please do contribute your ideas.

Looking even further out, useR!2019 will be in Toulouse (France), and useR!2020 will be in Boston (USA). That's a lot to be looking forward to, and with useR!2017 setting such a high bar I'm sure these will be outstanding conferences as well. See you there!

Posted by David Smith at 13:12 in events, R | Permalink | Comments (0)

July 14, 2017

Because it's Friday: Hidden Holes

They recently resurfaced the street in front of my house in Chicago. The first step was to grade away the existing layer of bitumen, to level the ground ready for a fresh layer. To my surprise (and I'm sure to the surprise of the engineers — the project was suspended for a couple of weeks), the old bitumen layer was hiding two sinkholes, one easily large enough to swallow a car. It was shocking to think we'd driven over that hole hundreds of times, and the only thing keeping us from falling in was a thin layer of bitumen.

As the video below explains, such sinkholes are usually caused by water erosion — in our case, probably by a broken water main. Check out the demo at the 4:00 mark to see how this can happen.

That's all for this week. Enjoy your weekend, and we'll be back with more on the blog on Monday.

Posted by David Smith at 13:57 in random | Permalink | Comments (0)

R in Minecraft: the lightning talk

My lightning talk at the useR!2017 conference was R in Minecraft: a five-minute tour of the miner and craft packages and associated book designed to teach kids how to use the R language while manipulating the world of Minecraft. You can see my talk below (starting at the 10:40 mark), and the slides (mostly created in Minecraft) are available for download as well.

Check the rest of that video for some other interesting lightning talks from that session:

[00:00] Christian Thiele: The cutpointr package: Improved and tidy estimation of optimal cutpoints
[05:20] Edwin Thoen: Preparing Datetime Data with Padr
[10:40] David Smith: R in Minecraft
[16:00] Munshi Imran Hossain: Digital Signal Processing with R
[21:20] Lars Rönnegård: Data Analysis Using Hierarchical Generalized Linear Models with R
[26:40] Florian Privé: The R package bigstatsr: Memory- and Computation-Efficient Statistical Tools for Big Matrices
[32:00] Malte Grosser: Advanced R Solutions -- A Bookdown Project
[37:20] Eugene Ha: Functional Input Validation with valaddin
[42:40] Florian Schwendinger: ROI - R Optimization Infrastructure
[48:00] Bart Smeets: simmer: Discrete-Event Simulation for R (very interesting if you want to simulate interconnected queues, inventories, or things like that)
[53:20] Edwin de Jonge: Data Error! But where?

Posted by David Smith at 10:32 in events, packages, R | Permalink | Comments (0)

July 13, 2017

20 years of CRAN

The presentations and tutorials are starting to become available at the Channel 9 useR!2017 page. There's a wealth of amazing content to explore there already, and I wanted to call out one presentation in particular: Uwe Ligges' keynote presentation, 20 years of CRAN.

There are many reasons for the success of R: the language itself, the mathematical and statistical libraries, the data visualization tools, the community, and many other factors. But in my opinion, the most important factor that has driven the growth of R over the last 20 years has been CRAN. This repository of R extensions has been marvellously successful at enabling the world-wide community of R developers to extend R, and to make their extensions available to R users everywhere with the utmost convenience and reliability.

In the presentation you can see below, Uwe Ligges (member of R-core and one of the first maintainers of CRAN after it was founded by Friedrich Leisch) describes the evolution of the CRAN system over the past 20 years, and gives some insights into how it may evolve in the future. If you've ever used an R package (and who hasn't!) it's well worth watching to appreciate the achievements and dedication of the volunteer CRAN team behind providing one of R's most important features.

Scroll back to the beginning of the video to see Dirk Eddelbuettel's introduction (and those introduction slides are also available to download at the useR!2017 schedule).

Posted by David Smith at 14:34 in events, packages, R | Permalink | Comments (2)