Planet Python
Last update: April 15, 2016 10:49 AM
April 14, 2016
A. Jesse Jiryu Davis
Code Podcast: Event Loop & Coroutines

It was a treat to talk with Andrey Salomatin for Episode Three of the Code Podcast. We discussed async event loops and coroutines in Python 3.
Andrey doesn't simply broadcast an interview: he carefully edits his episodes to tell a story about a specific subject, setting him apart from all the other podcasts about software. Less like Charlie Rose, more like Radiolab. I'm eager to hear the next one.
Listen: Code Podcast Episode 3, "Concurrency: Event Loop & Coroutines"
Image: Vacuum tube radio receiver, Adams-Morgan Co. 1922
Mike Driscoll
Python 201 – What’s a deque?
According to the Python documentation, deques “are a generalization of stacks and queues”. They are pronounced “deck” which is short for “double-ended queue”. They are a replacement container for the Python list. Deques are thread-safe and support memory efficient appends and pops from either side of the deque. A list is optimized for fast fixed-length operations. You can get all the gory details in the Python documentation. A deque accepts a maxlen argument which sets the bounds for the deque. Otherwise the deque will grow to an arbitrary size. When a bounded deque is full, any new items added will cause the same number of items to be popped off the other end.
As a general rule, if you need fast appends or fast pops, use a deque. If you need fast random access, use a list. Let’s take a few moments to look at how you might create and use a deque.
>>> from collections import deque
>>> import string
>>> d = deque(string.ascii_lowercase)
>>> for letter in d:
...     print(letter)
Here we import the deque from our collections module and we also import the string module. To actually create an instance of a deque, we need to pass it an iterable. In this case, we passed it string.ascii_lowercase, which is a string of all the lowercase letters in the alphabet. Finally, we loop over our deque and print out each item. Now let’s look at a few of the methods that deque possesses.
>>> d.append('bork')
>>> d
deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
>>> d.appendleft('test')
>>> d
deque(['test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
>>> d.rotate(1)
>>> d
deque(['bork', 'test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])
Let’s break this down a bit. First we append a string to the right end of the deque. Then we append another string to the left side of the deque. Lastly, we call rotate on our deque and pass it a one, which causes it to rotate one time to the right. In other words, it causes one item to rotate off the right end and onto the front. You can pass it a negative number to make the deque rotate to the left instead.
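As a quick aside (this snippet isn't from the original article), the maxlen bound mentioned earlier is easy to see in the interpreter; items fall off the opposite end as new ones are appended:

>>> bounded = deque('abc', maxlen=3)
>>> bounded
deque(['a', 'b', 'c'], maxlen=3)
>>> bounded.append('d')      # 'a' is discarded from the left
>>> bounded
deque(['b', 'c', 'd'], maxlen=3)
>>> bounded.appendleft('z')  # 'd' is discarded from the right
>>> bounded
deque(['z', 'b', 'c'], maxlen=3)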
Let’s finish out this section by looking at an example that’s based on something from the Python documentation:
from collections import deque

def get_last(filename, n=5):
    """Returns the last n lines from the file"""
    try:
        with open(filename) as f:
            return deque(f, n)
    except OSError:
        print("Error opening file: {}".format(filename))
        raise
This code works in much the same way as Linux’s tail program does. Here we pass in a filename to our script along with the n number of lines we want returned. The deque is bounded to whatever number we pass in as n. This means that once the deque is full, when new lines are read in and added to the deque, older lines are popped off the other end and discarded. I also wrapped the file-opening with statement in a simple exception handler because it’s really easy to pass in a malformed path. This will catch files that don’t exist, for example.
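For example, with a hypothetical log file on disk, you might call it like this:

last_lines = get_last('example.log', n=3)
for line in last_lines:
    print(line, end='')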
Wrapping Up
Now you know the basics of Python’s deque. It’s yet another handy little tool from the collections module. While I personally have never had a need for this particular collection, it remains a useful structure for others to use. I hope you’ll find some great uses for it in your own code.
Python Does What?!
Base64 vs UTF-8
Often when dealing with binary data in a unicode context (e.g. JSON serialization) the data is first base64 encoded. However, Python unicode objects can also use escape sequences.
What is the size relationship for high-entropy (e.g. compressed) binary data?
>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344
Surprisingly close! UTF-8 has the advantage that many byte values are encoded 1:1; when it does have to escape a byte, though, the expansion is 2:1, versus base64's uniform 4:3. JSON serializing shifts the balance dramatically in favor of base64, however:
>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
For the curious, here is what the encoded bytes look like:
>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'
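For the record, a rough Python 3 translation of the same comparison (this sketch isn't from the original post; it assumes code points 0-255 map straight to characters) gives similar numbers:

>>> import base64, json
>>> every_byte = bytes(range(256))
>>> every_char = ''.join(chr(i) for i in range(256))
>>> len(every_char.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344
>>> len(json.dumps(every_char))
1045
>>> len(json.dumps(base64.b64encode(every_byte).decode('ascii')))
346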
Import Python
ImportPython Issue 68
Worthy Read
- sponsor, flask: Learn how to deploy a simple Flask application with an AngularJS user interface to IBM Bluemix® using the Cloud Foundry command-line tool. For this tutorial, we chose Flask over other frameworks like Django, Pyramid, and web2py because it is very lightweight and therefore easy to understand. For just writing up a REST endpoint it is a perfect fit. In addition, we also show you how a single REST endpoint can be used to multiplex between different functions.
- interview: Looking for a Python job? Chances are you will need to prove that you know how to work with Python. Here are a couple of questions that cover a wide base of skills associated with Python. Focus is placed on the language itself, and not any particular package or framework. Each question will be linked to a suitable tutorial if there is one. Some questions will wrap up multiple topics.
- podcast: Toptal freelance experts Damir Zekic and Amar Sahinovic argue the merits of Ruby versus Python, covering everything from speed to performance. Listen to the podcast and weigh in by voting on the superior language and commenting in the thread below.
- I started Math & Pencil just about two years ago. Before I started the company, I had almost zero web development experience (I'm a data guy) - I started a company learning HTTP, Javascript, AJAX, and Django MVC from scratch. It's been a wild ride, and our technology stack has since matured to using interesting technologies such as D3.js, Backbone.js, Celery, Mongo, Redis, and a bunch of other stuff - but it didn't happen overnight. Looking at the thousands of lines of Django code every day, I thought it would be worth pointing out things I wish I had done differently.
- data science: Slide deck that shows the tools needed to be productive as a data scientist.
- tutorial: Here is what we are going to build: an Ember todo list CRUD app using a JSON API-compliant backend built with Django Rest Framework, secured using token authentication, with the ability to log in and register new users.
- IBM Watson is a technology platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data. We use the Watson Text to Speech API through Python in this article to build an alert and notification system.
- django, azure: You will create an application using the Django web framework (see alternate versions of this tutorial for Flask and Bottle). You will create the web app from the Azure Marketplace, set up Git deployment, and clone the repository locally. Then you will run the application locally, make changes, commit and push them to Azure. The tutorial shows how to do this from Windows or Mac/Linux. Curator Note - Our sponsor Azure is offering a free $200 to use on Azure for Linux projects. Check it out: http://goo.gl/fd17H9
- sponsor: Segment is the customer data platform that developers and analysts love because of its elegant APIs and extensive partner ecosystem.
- code snippet: Provides a very simple decorator (~40 lines of code) that can turn some or all of your default arguments into keyword-only arguments. You select one of your default arguments by name and the decorator turns this argument, along with all default arguments to its right, into keyword-only arguments.
- nginx: Everyone wants their website and application to run faster. Nearly all websites suffer performance problems and downtime, whether traffic volume is growing steadily or they experience sharp spikes in usage, often occurring at the worst - that is, busiest - times. That's where NGINX and NGINX Plus come in. They improve website performance in three different ways. Read to know more.
- In a system constructed in an object-oriented fashion, we usually have two types of objects: Data objects, which store the data, and Service objects, which manipulate the data. For example, if it is a database-backed application, it usually has some object that talks to the database; that is the Service object.
- pycharm: The topics in this just-over-four-minute video: pytest, because PyCharm has run configuration support native to pytest, so we introduce that; multi-Python-version testing with tox, especially since PyCharm recently added native support; testing with doctests, using (again) a native run configuration (did I mention PyCharm has a lot of native run configurations for testing?); ditto for BDD, so we cover test configurations for the behave package; and skipping tests and re-running a test configuration.
- core python: This article explains the new features in Python 3.6, compared to 3.5.
- podcast: Writing tests is important for the stability of our projects and our confidence when making changes. One issue that we must all contend with when crafting these tests is whether or not we are properly exercising all of the edge cases. Property-based testing is a method that attempts to find all of those edge cases by generating randomized inputs to your functions until a failing combination is found. This approach has been popularized by libraries such as QuickCheck in Haskell, but now Python has an offering in this space in the form of Hypothesis.
- In my last article I presented an approach that simplifies computations of very complex probability models. It makes these complex models viable by shrinking the amount of needed memory and improving the speed of computing probabilities. The approach we were exploring is called the Naive Bayes model. The context was the e-commerce feature in which a user is presented with the promotion box. The box shows the product category the user is most likely to buy. Though the results we got were quite good, I promised to present an approach that gives much better ones. While the Naive Bayes approach may not be acceptable in some scenarios due to the gap between approximated and real values, the approach presented in this article will make this distance much, much smaller.
- Butter CMS takes you through exactly how to build a Heroku add-on and shares their experience with the entire process.
- pycon, community: The Young Coders workshop explores Python programming by making games. It starts with learning Python's simple data types, including numbers, letters, strings, and lists. Next come comparisons, ‘if’ statements, and loops. Finally, all of the new knowledge is combined by creating a game using the PyGame library. PyCon is excited to once again offer a free full-day tutorial for kids! We invite children 12 and up to join us for a day of learning how to program using Python.
- This book hopes to rectify that situation. Between these covers is a collection of knowledge and ideas from many sources on dealing with and creating descriptors. And, after going through the things all descriptors have in common, it explores ideas that have multiple ways of being implemented as well as completely new ideas never seen elsewhere before. This truly is a comprehensive guide to creating Python descriptors.

Jobs
- Bangalore, Karnataka, India

Projects
- DynamicMemoryNetworks - 26 Stars, 9 Forks: Python implementation of DMN
- PyCraft - 12 Stars, 4 Forks: A fork of "Minecraft in 500 lines of python" intended to be used as a real engine, instead of as a learning example.
- api-star - 12 Stars, 0 Forks: An API framework for Flask & Falcon.
- django-gunicorn - 12 Stars, 0 Forks: Run Django development server with Gunicorn.
- waybackpack - 10 Stars, 0 Forks: Download the entire Wayback Machine archive for a given URL.
- dodotable - 8 Stars, 1 Fork: HTML table representation for SQLAlchemy.
- unsafe - 7 Stars, 3 Forks: Experiments in execution of untrusted Python code. This is a little experiment to see to what extent, if any, it is possible to run untrusted Python (or at least Python-like) code under Python 3 while successfully preventing it from escaping the sandbox it's put inside.
- quotekey - 7 Stars, 1 Fork: Boutique SSH keypair generator. Plain old randomness just doesn't cut it sometimes. Personalize your SSH keys.
- spammy - 5 Stars, 0 Forks: Spam filtering made easy for you.
- python_console.log - 3 Stars, 1 Fork: It's about time that Python got a console.log. For years, JavaScript developers have had a one-up on Python: they've been able to print whatever they like to the console using the infamous console.log command. It's about time Python had this killer functionality. I've managed to replicate this behavior using state-of-the-art Python class(es).
Vladimir Iakolev
Freeze time in tests even with GAE datastore
It’s not a rare thing to need to freeze time in tests; for that task I’m using freezegun, and it works nicely almost every time:
from datetime import datetime

from freezegun import freeze_time


def test_get_date():
    # get_date() is the application code under test.
    with freeze_time("2016.1.1"):
        assert get_date() == datetime(2016, 1, 1)
But not with GAE datastore. Assume that we have a model Document
with created_at = db.DateTimeProperty(auto_now_add=True), so a test like:
from datetime import datetime

from freezegun import freeze_time


def test_created_at():
    with freeze_time('2016.1.1'):
        doc = Document()
        doc.put()
        assert doc.created_at == datetime(2016, 1, 1)
Will fail with:
BadValueError: Unsupported type for property created_at: <class 'freezegun.api.FakeDatetime'>
But it can be easily fixed if we monkey patch GAE internals:
from contextlib import contextmanager

from google.appengine.api import datastore_types
from mock import patch
from freezegun import freeze_time as _freeze_time
from freezegun.api import FakeDatetime


@contextmanager
def freeze_time(*args, **kwargs):
    with patch('google.appengine.ext.db.DateTimeProperty.data_type',
               new=FakeDatetime):
        datastore_types._VALIDATE_PROPERTY_VALUES[FakeDatetime] = \
            datastore_types.ValidatePropertyNothing
        datastore_types._PACK_PROPERTY_VALUES[FakeDatetime] = \
            datastore_types.PackDatetime
        datastore_types._PROPERTY_MEANINGS[FakeDatetime] = \
            datastore_types.entity_pb.Property.GD_WHEN
        with _freeze_time(*args, **kwargs):
            yield
And now it works!
Tim Golden
Network Zero
There’s a small trend in the UK-based education-related Python development world: creating easy-to-use packages and using the suffix “zero” to indicate that they involve “zero” boilerplate or “zero” learning curve… or something. Daniel Pope started it with his PyGame Zero package; Ben Nuttall carried on with GPIO Zero; and I’ve followed up with Network Zero. FWIW I’m fairly sure the Raspberry Pi Zero was named independently but it fits nicely into the extended Zero family. Nicholas Tollervey was even excited enough to create a Github organisation to collate ideas around the “Zero” family. His own Mu editor [*] although lacking the naming convention is, in spirit, an Editor Zero.
We’ve never really talked it out, but I think the common theme is to produce packages especially suitable for classroom & club use, where:
- The *zero package sits on top of an established package (PyGame, GPIO, 0MQ) which more advanced students can drop into once they’ve reached the bounds of the simplified *zero approach.
- The emphasis is on up-and-running use in a classroom or club rather than clever coding techniques. There’s a slight preference for procedural rather than object-based API (although everything in Python is an object but still…)
- Helpful messages: where it’s feasible, error messages should be made relevant to the immediate *zero package rather than reflecting an error several levels deep. This goes a little against a common Python philosophy of letting exceptions bubble to the top unaltered but is more suitable for less experienced coders
My own Network Zero is definitely a work in progress, but I’m burning through all my commuting hours (and a few more besides) to get to a stable API with useful examples, helpful docs and tests which pass on all current Python versions across all three major platforms. Tom Viner has been incredibly helpful in setting up tox and CI via Travis & Appveyor.
If you feel like contributing, the activity is happening over on Github and the built documentation is on readthedocs. I welcome comments and suggestions, always bearing in mind the Design Guidelines.
Any teachers I know are welcome to comment of course, but I’ll be reaching out to them specifically a little later when the codebase has stabilised and some examples and cookbook recipes are up and documented.
Feel free to raise issues on Github as appropriate. If you have more general questions, ping me on Twitter.
[*] I think Nicholas should have embraced the Unicode and named it μ with the next version called ν (with absolutely no apologies for the cross-language pun)
Talk Python to Me
#54 Enterprise Software with Python
How often have people asked what language / technology you work in and, when you answered Python, they got a little confused and asked, what can you actually build with Python? What type of apps? The implication being that Python is just a notch above Bash scripts; that real things aren't built with Python but rather Java, C#, Objective-C and so on.

Mahmoud Hashemi and I might be able to help you put some real evidence and experience behind your response. On episode 54 of Talk Python To Me, I talk with Mahmoud about his new online course he wrote for O'Reilly entitled Enterprise Software in Python. You'll hear many real-world examples from his experience inside PayPal and more.

Links from the show:
- Enterprise Python Course: http://techbus.safaribooksonline.com/video/programming/python/9781491943755
- Course (at O'Reilly): http://shop.oreilly.com/product/0636920047346.do
- Mahmoud on Talk Python: https://talkpython.fm/episodes/show/4/enterprise-python-and-large-scale-projects
- 10 Myths of Enterprise Python Article: https://www.paypal-engineering.com/2014/12/10/10-myths-of-enterprise-python/
- Mahmoud on Twitter: https://twitter.com/mhashemi (@mhashemi)
- Course github repo: https://github.com/mahmoud/espymetrics
- The bad pull request: https://github.com/mahmoud/espymetrics/pull/2
Matthew Rocklin
Fast Message Serialization
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.
This week I optimized the inter-node communication protocol used by
dask.distributed. It was a fun exercise in optimization that involved
several different and unexpected components. I separately had to deal with
Pickle, NumPy, Tornado, MsgPack, and compression libraries.
This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.
We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.
Identify a Problem
This came about as I was preparing a demo using dask.array on a distributed
cluster for a Continuum webinar. I noticed that my computations were taking
much longer than expected. The
Web UI quickly pointed
me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks
of numpy array data between them. This is very strange because I was on a
100 MB/s network, and so I expected these transfers to happen in more like 0.3s
than 15s.
The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues this obvious. Thanks here go to the Bokeh developers who helped with the development of the Dask real-time Web UI.
Problem 1: Tornado’s sentinels
Dask’s networking is built off of Tornado’s TCP IOStreams.
There are two common ways to delineate messages on a socket: sentinel values
that signal the end of a message, or prefixing each message with its length.
Early on we tried both in Dask but found that prefixing a length before every
message was slow. It turns out that this was because TCP sockets try to batch
small messages to increase bandwidth. Turning this optimization off ended up
being an effective and easy solution; see the TCP_NODELAY parameter.
However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.
Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down.
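For reference, disabling Nagle's algorithm on a plain socket is a one-line option; a minimal standard-library sketch (not the actual Dask/Tornado code):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm so small length-prefixed messages are sent
# immediately instead of being batched by the kernel.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)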
Problem 2: Memory Copies
A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:
socket.send(header + payload)
This code concatenates two bytestrings, header and payload before
sending the result down a socket. If we cared deeply about avoiding memory
copies then we might instead send these two separately:
socket.send(header)
socket.send(payload)
But who cares, right? At 5 GB/s copying memory is cheap!
Unfortunately this breaks down under either of the following conditions
- You are sloppy enough to do this multiple times
- You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.
Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort.
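One way to cut the concatenation copy, for example, is a scatter/gather send; here is a hedged standard-library sketch (Unix-only sendmsg, Python 3.3+, and not necessarily what Dask does internally):

import socket

def send_frames(sock, header, payload):
    # sendmsg() hands both buffers to the kernel in a single call, so we
    # never build the intermediate header + payload bytestring.
    sock.sendmsg([header, payload])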
Problem 3: Unwanted Compression
Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.
Now we compress with the following policy:
- If the message is less than 10kB, don’t bother
- Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
- Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing.
In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
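A rough sketch of such a policy is below; the names and thresholds are illustrative rather than the actual dask.distributed code, and compress() stands in for whichever of LZ4 or Snappy is available:

import random

def maybe_compress(payload, compress, min_size=10000, sample_size=10000, nsamples=5):
    """Compress payload only if it is large and looks compressible."""
    if len(payload) < min_size:
        return payload, False  # tiny message: not worth the overhead
    # Cheap check: compress a few small samples before touching the whole thing.
    samples = []
    for _ in range(nsamples):
        start = random.randint(0, max(0, len(payload) - sample_size))
        samples.append(payload[start:start + sample_size])
    if len(compress(b''.join(samples))) > 0.9 * nsamples * sample_size:
        return payload, False  # samples barely shrank, so skip compression
    compressed = compress(payload)
    if len(compressed) > 0.9 * len(payload):
        return payload, False  # send the original to spare the receiver
    return compressed, True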
Problem 4: Cloudpickle is not as fast as Pickle
This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.
In [1]: import numpy as np
In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)
In [3]: import pickle, cloudpickle
In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161
In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161
But it turns out that cloudpickle is using the Python implementation, while
pickle itself (or cPickle in Python 2) is using the compiled C implementation.
Fortunately this is easy to correct, and a quick typecheck on common large
data formats in Python (NumPy and Pandas) gets us this speed boost.
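The idea is roughly the following (a sketch of the approach, not the exact dask.distributed code):

import pickle

import cloudpickle
import numpy as np
import pandas as pd

def dumps(obj):
    # NumPy/Pandas containers pickle fine with the fast C implementation;
    # cloudpickle is only needed for tricky objects like lambdas and closures.
    if isinstance(obj, (np.ndarray, pd.DataFrame, pd.Series)):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)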
Problem 5: Pickle is still slower than you’d expect
Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there.
See issue 7544
Problem 6: MsgPack is bad at large bytestrings
Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
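Conceptually the framing looks something like the sketch below, with hypothetical names; it illustrates the idea rather than the real dask.distributed protocol:

BIG = 2 ** 20  # strip out bytestrings over ~1 MB

def split_big_bytes(msg):
    # Returns a small dict that is safe to MsgPack, plus the raw frames
    # that are written to the socket directly (no MsgPack, no extra copy).
    frames = []
    small = {}
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > BIG:
            small[key] = {'__frame__': len(frames)}  # placeholder index
            frames.append(value)
        else:
            small[key] = value
    return small, frames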
Problem 7: Tornado makes a copy
Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.
Results
We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use.
Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)
381360 function calls (381323 primitive calls) in 1.451 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.366 0.366 0.366 0.366 {built-in method dumps}
8 0.289 0.036 0.291 0.036 iostream.py:360(write)
15353 0.228 0.000 0.228 0.000 {method 'join' of 'bytes' objects}
15355 0.166 0.000 0.166 0.000 {method 'recv' of '_socket.socket' objects}
15362 0.156 0.000 0.398 0.000 iostream.py:1510(_merge_prefix)
7759 0.101 0.000 0.101 0.000 {method 'send' of '_socket.socket' objects}
17/14 0.026 0.002 0.686 0.049 gen.py:990(run)
15355 0.021 0.000 0.198 0.000 iostream.py:721(_read_to_buffer)
8 0.018 0.002 0.203 0.025 iostream.py:876(_consume)
91 0.017 0.000 0.335 0.004 iostream.py:827(_handle_write)
89 0.015 0.000 0.217 0.002 iostream.py:585(_read_to_buffer_loop)
122567 0.009 0.000 0.009 0.000 {built-in method len}
15355 0.008 0.000 0.173 0.000 iostream.py:1010(read_from_fd)
38369 0.004 0.000 0.004 0.000 {method 'append' of 'list' objects}
7759 0.004 0.000 0.104 0.000 iostream.py:1023(write_to_fd)
1 0.003 0.003 1.451 1.451 ioloop.py:746(start)
Dominant unwanted costs include the following:
- 400ms: Pickling the NumPy array
- 400ms: Bytestring handling within Tornado
After this we’re just bound by pushing bytes down a wire.
Conclusion
Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.
Links
- EC2 slow memory copy StackOverflow question.
- Tornado issue for sending large messages
- Wikipedia page on Nagle’s algorithm for TCP protocol for small packets
- NumPy issue for double memory copy
- Cloudpickle issue for memoryview support
April 13, 2016
Glyph Lefkowitz
I think I’m using GitHub wrong.
I use a hodgepodge of https: and ssh: URL schemes for my local
clones; sometimes I have a remote called “github” and sometimes I have one
called “origin”. Sometimes I clone from a fork I made and sometimes I clone
from the upstream.
I think the right way to use GitHub would instead be to always fork first, make my remote always be “origin”, and consistently name the upstream remote “upstream”. The problem with this, though, is that forks rapidly fall out of date, and I often want to automatically synchronize all the upstream branches.
Is there a script or a github option or something to synchronize a fork with upstream automatically, including all its branches and tags? I know there’s no comment field, but you can email me or reply on twitter.
PyCharm
PyCharm Migration Tutorial for Text Editors
If you’re a Python developer who uses a text editor such as Vim, Emacs, or Sublime Text, you might wonder what it takes to switch to PyCharm as an IDE for your development. We’ve written a helpful Migrating from Text Editors tutorial for just this topic.
The tutorial starts with the basic question of “What is an IDE?” The line between text editor and IDE can be blurry. PyCharm views the distinction as: a project-level view of your code and coding activities, with project-wide features such as coding assistance and refactoring.
This document then goes over some of the important points when migrating: the project-oriented UI, working with projects instead of files, Vim/Emacs specifics, keyboard shortcuts, customizing, and a discussion of facilities important to text editor users (multiple cursors, split windows, etc.) It then closes by discussing areas the IDE can really help, for example, the managed running and debugging of your code.
Of course, this document is just an overview. Vim and Emacs in particular have decades of development and features, and PyCharm itself is now very mature with many features of its own, so a complete comparison would break the Internet. If you have a specific question, feel free to comment, and we hope you find the tutorial helpful.
Kushal Das
Quick way to get throw away VMs using Tunir
The latest Tunir package has a --debug option which can help us get some quick VMs up, where we can do some destructive work, and then just remove them. Below is an example of firing up two VMs from a Fedora Cloud base image, using a quickjob.cfg file.
[general]
cpu = 1
ram = 1024
[vm1]
user = fedora
image = /home/Fedora-Cloud-Base-20141203-21.x86_64.qcow2
[vm2]
user = fedora
image = /home/Fedora-Cloud-Base-20141203-21.x86_64.qcow2
In the quickjob.txt file we just keep one command to check sanity :)
vm1 free -m
After we execute Tunir, we will see something like the following output.
# tunir --multi quickjob
... lots of output ...
Non gating tests status:
Total:0
Passed:0
Failed:0
DEBUG MODE ON. Destroy from /tmp/tmpiNumV2/destroy.sh
The above-mentioned directory also contains the temporary private key used to log in to the instances. The output also contains the IP addresses of the VM(s). We can log in like this:
# ssh fedora@<vm-ip> -i /tmp/tmpiNumV2/private.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
The last two parts of the ssh command will make sure that we do not store the signature for the throwaway guests in the known_hosts file. To clean up afterwards we can do the following.
# sh /tmp/tmpiNumV2/destroy.sh
Vasudev Ram
A quick console ruler in Python
By Vasudev Ram
I've done this ruler program a few times before, in various languages.
Here is an earlier version: Rule the command-line with ruler.py!
This one is a simplified and also slightly enhanced version of the one above.
It generates a simple text-based ruler on the console.
Can be useful for data processing tasks related to fixed-length or variable-length records, CSV files, etc.
With REPS set to 8, it works just right for a console of 80 columns.
Here is the code:
# ruler.py
"""
Program to display a ruler on the console.
Author: Vasudev Ram
Copyright 2016 Vasudev Ram - http://jugad2.blogspot.com
The ruler is made up of copies of the string 0123456789, concatenated.
Purpose: By running this program, you can use its output as a ruler,
to find the position of your own program's output on the line, or to
find the positions and lengths of fields in fixed- or variable-length
records in a text file, fields in CSV files, etc.
"""
REPS = 8
def ruler(sep=' ', reps=REPS):
    for i in range(reps):
        print str(i) + ' ' * 4 + sep + ' ' * 3,
    print '0123456789' * reps

def main():
    # Without divider.
    ruler()
    # With various dividers.
    for sep in '|+!':
        ruler(sep)

if __name__ == '__main__':
    main()
And the output:
$ python ruler.py
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 +
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 ! 1 ! 2 ! 3 ! 4 ! 5 ! 6 ! 7 !
01234567890123456789012345678901234567890123456789012345678901234567890123456789
You can also import it as a module in your own program:

# test_ruler.py
from ruler import ruler
ruler()
# Code that outputs the data you want to measure
# lengths or positions of, goes here ...
print 'NAME AGE CITY'
ruler()
# ... or here.
print 'SOME ONE 20 LON '
print 'ANOTHER 30 NYC '
$ python test_ruler.py
Output:
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
NAME AGE CITY
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
SOME ONE 20 LON
ANOTHER 30 NYC
- Enjoy.
- Vasudev Ram - Online Python training and consulting
April 12, 2016
Continuum Analytics News
Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster
Developer Blog
Overview
Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues, which is a problem that becomes more painful as the number of nodes in your cluster increases.
Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing.
In a previous post, we’ve demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster.
In this post, we’ll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We’ll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.
In the remainder of this post, we'll:
- Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.
- Load the text/language data into HDFS on the cluster.
- Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.
- Perform distributed natural language processing on the data with the NLTK library from Anaconda.
- Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.
Provisioning Anaconda on a cluster
Because we’re installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:
$ acluster create cluster-hadoop --profile cluster-hadoop
After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.
Installing Anaconda packages on the cluster
Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:
$ acluster conda install nltk pandas bokeh
We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:
$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'
Loading the data into HDFS
In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.
Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.
We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:
$ acluster ssh
The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions:
$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu
$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:
$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/
Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.
Configuring the spark-submit command with your Hadoop Cluster
To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:
$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py
If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):
$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py
Installing and Configuring the Notebook with your Hadoop Cluster
Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.
Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.
Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:
$ acluster install notebook
$ acluster open notebook
Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.
Using the Anaconda Parcel with Cloudera CDH
If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Cloudera CDH
If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Hortonworks HDP
If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Initializing the SparkContext
After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:
>>> from pyspark import SparkConf
>>> from pyspark import SparkContext
>>> conf = SparkConf()
>>> conf.setMaster('yarn-client')
>>> conf.setAppName('anaconda-pyspark-language')
>>> sc = SparkContext(conf=conf)
Loading the data into memory
Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:
>>> lines = sc.textFile("/user/ubuntu/*.json")
Next, we decode the JSON data and decide that we want to filter comments from the movies subreddit:
>>> import json
>>> data = lines.map(json.loads)
>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')
We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:
>>> movies.persist()
We can count the total number of comments in the movies subreddit (about 2.9 million comments):
>>> movies.count()
2905085
We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:
>>> movies.take(1)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 113 ms
[{u'archived': False,
u'author': u'kylionsfan',
u'author_flair_css_class': None,
u'author_flair_text': None,
u'body': u'Goonies',
u'controversiality': 0,
u'created_utc': u'1420070402',
u'distinguished': None,
u'downs': 0,
u'edited': False,
u'gilded': 0,
u'id': u'cnas90u',
u'link_id': u't3_2qyjda',
u'name': u't1_cnas90u',
u'parent_id': u't3_2qyjda',
u'retrieved_on': 1425124282,
u'score': 1,
u'score_hidden': False,
u'subreddit': u'movies',
u'subreddit_id': u't5_2qh3s',
u'ups': 1}]
Distributed Natural Language Processing
Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.
First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:
>>> def parse(record):
... import nltk
... tokens = nltk.word_tokenize(record["body"])
... record["n_words"] = len(tokens)
... record["pos"] = nltk.pos_tag(tokens)
... return record
>>> movies2 = movies.map(parse)
Let’s take a look at the body of one of the comments:
>>> movies2.take(10)[6]['body']
u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'
And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):
>>> movies2.take(10)[6]['pos']
[(u'Dawn', 'NN'),
(u'of', 'IN'),
(u'the', 'DT'),
(u'Apes', 'NNP'),
(u'was', 'VBD'),
(u'such', 'JJ'),
(u'an', 'DT'),
(u'incredible', 'JJ'),
(u'movie', 'NN'),
(u',', ','),
(u'it', 'PRP'),
(u'should', 'MD'),
(u'be', 'VB'),
(u'up', 'RP'),
(u'there', 'RB'),
(u'in', 'IN'),
(u'my', 'PRP$'),
(u'opinion', 'NN'),
(u'.', '.')]
We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:
>>> def get_NN(record):
... import re
... from nltk.corpus import stopwords
... all_pos = record["pos"]
... ret = []
... for pos in all_pos:
... if pos[1] == "NN" \
... and pos[0] not in stopwords.words('english') \
... and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:
... ret.append(pos[0])
... return ret
>>> nouns = movies2.flatMap(get_NN)
We can then generate word counts for the nouns that we extracted from the dataset:
>>> counts = nouns.map(lambda word: (word, 1))
After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.
>>> top_nouns = counts.countByKey()
>>> top_nouns = dict(top_nouns)
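As an aside (not part of the original workflow), the same reduction could also be performed on the cluster with reduceByKey before collecting, which keeps less data flowing back to the driver; a hedged sketch:

>>> word_counts = nouns.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> top_nouns = dict(word_counts.collect())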
In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.
Local analysis with Pandas and Bokeh
Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:
>>> import pandas as pd
>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])
Let’s sort the resulting word counts, and view the top 10 nouns by frequency:
>>> df = df.sort_values('Count', ascending=False)
>>> df_top_10 = df.head(10)
>>> df_top_10
Noun       Count
movie      539698
film       220366
time       157595
way        112752
gt         105313
http       92619
something  87835
lot        85573
scene      82229
thing      82101
Let’s generate a bar chart of the top 10 nouns using Pandas:
>>> %matplotlib inline
>>> df_top_10.plot(kind='bar', x=df_top_10['Noun'])
Finally, we can use Bokeh to generate an interactive plot of the data:
>>> from bokeh.charts import Bar, show
>>> from bokeh.io import output_notebook
>>> from bokeh.charts.attributes import cat
>>> output_notebook()
>>> p = Bar(df_top_10,
... label=cat(columns='Noun', sort=False),
... values='Count',
... title='Top N nouns in r/movies subreddit')
>>> show(p)
Conclusion
In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.
You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:
$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at [email protected]. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.
If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.
PyCon
Registration is open for our Young Coders tutorial!
PyCon is excited to once again offer a free full-day tutorial for kids! We invite children 12 and up to join us for a day of learning how to program using Python. The class is running twice, on each of the two final sprint days:
- Option 1. Saturday, June 4, 2016 from 9:00 AM to 4:30 PM.
- Option 2. Sunday, June 5, 2016 from 9:00 AM to 4:30 PM.
The sign-up page is here:
https://www.eventbrite.com/e/pycon-2016-young-coders-tickets-24319019843
The Young Coders tutorial was first offered at PyCon 2013 in Santa Clara. It was an immediate hit, and has been an important part of every PyCon since — including a French edition for the two years that PyCon was held in Montréal! Whether you and your family are local to Portland, or you are traveling to PyCon and bringing your family along, this class is a great way to expose kids to programming.
The Young Coders workshop explores Python programming by making games. It starts with learning Python's simple data types, including numbers, letters, strings, and lists. Next come comparisons, ‘if’ statements, and loops. Finally, all of the new knowledge is combined by creating a game using the PyGame library.
Registration is limited — sign up soon if you know kids who will be interested!
Python Anywhere
System upgrade, 2016-04-12: Python 3.5
We upgraded PythonAnywhere today. The big story for this release is that we now support Python 3.5.1 everywhere :-) We've put it through extensive testing, but of course it's possible that glitches remain -- please do let us know in the forums or by email if you find any.
There were a few other minor changes -- basically, a bunch of system package installs and upgrades:
- mysqlclient for Python 3.x (so now Django should work out of the box with Python 3)
- pyodbc and its lower-level dependencies, so you should be able to connect to Microsoft SQL Servers elsewhere on the Internet.
- pdftk
- basemap for Python 3.x.
- pint
- uncertainties
- flask-openid
- And finally, we've upgraded Twilio so that it works properly from free accounts.
Montreal Python User Group
MTL NewTech Startup Demos & Networking + PyCon Contest
PyCon is partnering again with MTL NewTech and Montréal-Python this year to bring one lucky Montreal startup to PyCon in Portland, Oregon, to present alongside Google, Facebook, Stripe, Heroku, Microsoft, Mozilla and many other technology companies.
If you are a startup that meets the requirements below, apply now by filling out this form: http://goo.gl/forms/zf9jO8n8vR with the following information: a) size of the team, b) age of the startup, c) your use of Python.
Deadline for applications: April 19th, 23h59. Announcement of the selected startups: starting on April 21st. MTL NewTech Demo & announcement of the winner: April 26th. Feel free to invite fellow startups.
For more details about PyCon and the startup row in general, please head to the PyCon website at https://us.pycon.org/2016/events/startup_row/
==============
Eligible companies must meet the following criteria:
- Fewer than 15 employees, including founders
- Less than two years old
- Use Python somewhere in your startup: backend, front-end, testing, wherever
- If selected, please confirm that you will staff your booth in the Expo Hall on your appointed day. We will try to accommodate your preferences: Monday or Tuesday
- No repeats. If you were on startup row in a previous year, please give another startup a chance this year.
==============
April 11, 2016
Continuum Analytics News
Data Science with Python at ODSC East
Company Blog
By: Sheamus McGovern, Open Data Science Conference Chair
At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science. As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries.
Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python and the Topik library provides a high-level interface to these and other, natural language libraries, adding a new layer of usability. The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis.
The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses that serve the purpose of conveying and sharing analyses and their results to colleagues and to stakeholders. Python is also host to a number of rich web-development frameworks that are used not only for building data science dash boards, but also for full-scale data science powered web-apps. Flask and Django lead the way in terms of the Python web-app development landscape, but Bottle and Pyramid are also quite popular.
With Numba or Cython, code can approach speeds akin to that of C or C++, and new developments, like the Dask package, make computing on larger-than-memory datasets very easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere.
Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments.
Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking at ODSC East. Peter is the co-founder and CTO of Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library, the Blaze ecosystem (which simplifies the analysis of Big Data with Python) and Anaconda. At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences focused on Open Data Science, Disruptive Data Science, Big Data Science, Data Visualization, Data Science for Good, Open Data, and Careers and Training. See below for a very small sampling of the powerful Python workshops and speakers we will have at ODSC East.
●Bayesian Statistics Made Simple - Allen Downey, Think Python
●Intro to Scikit learn for Machine Learning - Andreas Mueller, NYU Center for Data Science
●Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics
●Interactive Viz of a Billion Points with Bokeh Datashader – Peter Wang, Continuum Analytics
Mike Driscoll
Pre-Order Python 201 Paperback
I have decided to offer a pre-order of the paperback version of my next book. You will be able to pre-order a signed copy of the book which will ship in September, 2016. I am limiting the number of pre-orders to 100. If you’re interested in getting the book, you can do so here

Python Engineering at Microsoft
How to deal with the pain of “unable to find vcvarsall.bat”
Python’s packaging ecosystem is one of its biggest strengths, but Windows users are often frustrated by packages that do not install properly. One of the most common errors you’ll see, when installing a package such as lxml from source, is “unable to find vcvarsall.bat”.
As far as errors go, “unable to find vcvarsall.bat” is not the most helpful. What is this mythical batch file? Why do I need it? Where can I get it? How do I help Python find it? When will we be freed from this pain? Let’s look at some answers to these questions.
What is vcvarsall.bat, and why do I need it?
To explain why we need this tool, we need to look at a common pattern in Python packages. One of the benefits of installing a separate package is the ability to do something that you couldn’t normally do – in many cases, something that would be completely impossible otherwise, like image processing with Pillow, high-performance machine learning with scikit-learn, or micro-threading with greenlet. But how can these packages do things that aren’t possible in regular Python?
The answer is that they include extension modules, sometimes called native modules. Unlike Python modules, these are not .py files containing Python source code – they are .pyd files that contain native, platform-specific code, typically written in C. In many cases the extension module is an internal detail; all the classes and functions you’re actually using have been written in Python, but the tricky parts or the high-performance parts are in the extension module.
When you see “unable to find vcvarsall.bat”, it means you’re installing a package that has an extension module, but only the source code. “vcvarsall.bat” is part of the compiler in Visual Studio that is necessary to compile the module.
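To make this concrete, here is a hypothetical, minimal setup.py for a package with a single extension module (the names are placeholders, not from any real project). Installing such a package from source is what triggers the compiler search – and the vcvarsall.bat error when no compiler is found:
# setup.py – a placeholder example of a package with one extension module
from setuptools import setup, Extension

setup(
    name="spam",                                      # made-up package name
    version="1.0",
    ext_modules=[
        Extension("spam", sources=["spammodule.c"]),  # C source compiled at install time
    ],
)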
As a Windows user, you’re probably used to downloading programs that are ready to run. This is largely due to the very impressive compatibility that Windows provides – you can take a program that was compiled twenty years ago and run it on versions of Windows that nobody had imagined at that time. However, Python comes from a very different world where every single machine can be different and incompatible. This makes it impossible to precompile programs and only distribute the build outputs, because many users will not be able to use them. So the culture is one where only source code is distributed, and every machine is set up with a compiler and the tools necessary to build extension modules on install. Because Windows has a different culture, most people do not have (or need) a compiler.
The good news is that the culture is changing. For Windows platforms, a package developer can upload wheels of their packages as well as the source code. Extension modules included in wheels have already been compiled, so you do not need a compiler on the machine you are installing onto.
When you use pip to install your package, if a wheel is available for your version of Python, it will be downloaded and extracted. For example, running pip install numpy will download their wheel on Python 3.5, 3.4 and 2.7 – no compilers needed!
I need a package that has no wheel – what can I do?
Firstly, this is becoming a more and more rare occurrence. The pythonwheels.com site tracks the 360 most popular packages, showing which ones have made wheels available (nearly 60% when this blog post was written). But from time to time you will encounter a package whose developer has not produced wheels.
The first thing you should do is report an issue on the project’s issue tracker, requesting (politely) that they include wheels with their releases. If the project supports Windows at all, they ought to be testing on Windows, which means they have already handled the compiler setup. (And if a project is not testing on Windows, and you care a lot about that project, maybe you should volunteer to help them out? Most projects do not have paid staff, and volunteers are always appreciated.)
If a project is not willing or able to produce wheels themselves, you can look elsewhere. For many people, using a distribution such as Anaconda or Python(x,y) is an easy way to get access to a lot of packages.
However, if you just need to get one package, it’s worth seeing if it is available on Christoph Gohlke’s Python Extension Packages for Windows page. On this page there are unofficial wheels (that is, the original projects do not necessarily endorse them) for hundreds of packages. You can download any of them and then use pip install (full path to the .whl file) to install it.
If none of these options is available, you will need to consider building the extension yourself. In many cases this is not difficult, though it does require setting up a build environment. (These instructions are adapted from Build Environment.)
First you’ll need to install the compiler toolset. Depending on which version of Python you care about, you will need to choose a different download, but all of them are freely available. The table below lists the downloads for versions of Python as far back as 2.6.
| Python Version | You will need |
|---|---|
| 3.5 and later | Visual C++ Build Tools 2015 or Visual Studio 2015 |
| 3.3 and 3.4 | Windows SDK for Windows 7 and .NET 4.0 (Alternatively, Visual Studio 2010 if you have access to it) |
| 2.6 to 3.2 | Microsoft Visual C++ Compiler for Python 2.7 |
After installing the compiler tools, you should ensure that your version of setuptools is up-to-date.
For Python 3.5 and later, installing Visual Studio 2015 is sufficient and you can now try to pip install the package again. Python 3.5 resolves a significant compatibility issue on Windows that will make it possible to upgrade the compilers used for extensions, so when a new version of Visual Studio is released, you will be able to use that instead of the current one.
For Python 2.6 through 3.2, you likewise don’t need to do anything else. The compiler package (though labelled for “Python 2.7”, it works for all of these versions) is detected by setuptools, and so pip install will use it when needed.
However, if you are targeting Python 3.3 and 3.4 (and did not have access to Visual Studio 2010), building is slightly more complicated. You will need to open a Visual Studio Command Prompt (selecting the x64 version if using 64-bit Python) and run set DISTUTILS_USE_SDK=1 before calling pip install.
If you have to install these packages on a lot of machines, I’d strongly suggest installing the wheel package first and using pip wheel (package name) to create your own wheels. Then you can install those on other machines without having to install the compilers.
And while this sounds simple, there is a downside. Many, many packages that need a compiler also need other dependencies. For example, the lxml example we started with also requires copies of libxml2 and libxslt – more libraries that you will need to find, download, install, build, test and verify. Just because you have a compiler installed does not mean the pain ends.
When will the pain end?
The issues surrounding Python packaging are some of the most complex in our industry right now. Versioning is difficult, dependency resolution is difficult, ABI compatibility is difficult, secure hosting is difficult, and software trust is difficult. But just because these problems are difficult does not mean that they are impossible to solve, that we cannot have a viable ecosystem despite them, or that people are not actively working on better solutions.
For example, wheels are a great distribution solution for Windows and Mac OS X, but not so great on Linux due to the range of differences between installs. However, there are people actively working on making it possible to publicly distribute wheels that will work with most versions of Linux, such that soon all platforms will benefit from faster installation and no longer require a compiler for extension modules.
Most of the work solving these issues for Python goes on at the distutils-sig mailing list, and you can read the current recommendations at packaging.python.org. We are all volunteers, and so over time the discussion moves from topic to topic as people develop an interest and have time available to work on various problems. More contributors are always welcome.
But even if you don’t want to solve the really big problems, there are ways you can help. Report an issue to package maintainers who do not yet have wheels. If they don’t currently support Windows, offer to help them with testing, building, and documentation. Consider donating to projects that accept donations – these are often used to fund the software and hardware (or online services such as Appveyor) needed to support other platforms.
And always thank project maintainers who actively support Windows, Mac OS X and Linux. It is not an easy task to build, test, debug and maintain code that runs on such a diverse set of platforms. Those who take on the burden deserve our encouragement.
PyCharm
In-Depth Screencast on Testing
Earlier this year we rolled out a Getting Started Series of screencast videos on the basics of using PyCharm: setup, the UI, running Python code, debugging, etc. We knew at the time that some topics would need more treatment than a quick screencast, so we planned a follow-on series of “in-depth” screencasts, each on a single topic.
Here’s the first: In-Depth: Testing covers more topics than we went over in the Getting Started: Testing screencast, which makes sense, as PyCharm has such a wealth of support in each of its testing features:
Here are the topics in this just-over-four-minute video:
- pytest, because, you know…it’s pytest! PyCharm has run configuration support native to pytest, so we introduce that.
- Multi-Python-version testing with tox, especially since PyCharm recently added native support.
- Testing with doctests, using (again) a native run configuration (a tiny example of this style of test follows the list)
- Did I mention PyCharm has a lot of native run configurations for testing? Ditto for BDD, so we cover test configurations for the behave package.
- Skipping tests and re-running a test configuration
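Not from the screencast itself, but as a small illustration of the pytest-and-doctest style of test it covers (the function and names are made up):
def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    """
    return a + b

def test_add():
    # pytest collects any function named test_* in a test module
    assert add(2, 3) == 5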
We hope you enjoy this first in the series of In-Depth screencasts. We have more planned, such as version control. And please, if you have any topics that you’d like to see get expanded screencast attention, let us know.
Doug Hellmann
bz2 — bzip2 Compression — PyMOTW 3
The bz2 module is an interface for the bzip2 library, used to compress data for storage or transmission. Read more… This post is part of the Python Module of the Week series for Python 3. See PyMOTW.com for more articles from the series.
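As a quick taste of the API (the data and file name below are only illustrative):
import bz2

data = b"Python Module of the Week" * 100
compressed = bz2.compress(data)              # one-shot, in-memory compression
assert bz2.decompress(compressed) == data    # round-trips back to the original bytes

# bz2.open reads and writes .bz2 files directly (Python 3.3+)
with bz2.open("example.bz2", "wb") as f:
    f.write(data)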
Mike Driscoll
PyDev of the Week: John Cook
This week we welcome John Cook as our PyDev of the Week! John has a fun Python blog that I read from time to time and he graciously accepted my offer to interview him this week. Let’s take a few moments to get to know him better.
Can you tell us a little about yourself (hobbies, education, etc):
I’m a consultant working in the overlap of math, data analysis, and software development. Most projects I do have two of these elements if not all three. I had a variety of jobs before starting my own company, and most of them involved some combination of math and software development.
Why did you start using Python?
What other programming languages do you know and which is your favorite?
I’ve written a lot of C++. When Python isn’t fast enough, I turn to C++, though I don’t do that often. In the last few years I’ve used R, C#, and Haskell on different projects.
I really like the consistency and predictability of Mathematica, though I haven’t used it in a while. I now use Python for the kinds of work I used to do in Mathematica. Even though some things are easier to do in Mathematica, it’s worth some extra effort to keep from having to switch contexts and use two languages and environments. And of course Mathematica is expensive. Even if I decide the price of a Mathematica license is worth it for my own use, I can’t ask clients to buy Mathematica licenses.
What projects are you working on now?
Which Python libraries are your favorite (core or 3rd party)?
I use SciPy daily. It’s my favorite in the sense that I depend on it and I’m grateful for the tremendous effort that has gone into it. I can’t say it’s my favorite in terms of API design; I wish it were more consistent and predictable.
I wish I knew pandas and SymPy better. I use them occasionally, but not often enough to keep their syntax in my head.
Conda is a sort of meta library rather than a library per se, but I really appreciate conda. It’s made it so much easier to install packages. I go back and forth between Windows and Linux, and it’s so nice to be able to count on the same libraries in both environments. Before, some packages would install smoothly on one OS but not the other.
Where do you see Python going as a programming language?
Caktus Consulting Group
Adopting Scrum in a Client-services, Multi-project Organization
Caktus began the process of adopting Scrum mid-November 2015 with two days of onsite Scrum training and fully transitioned to a Scrum environment in January 2016. From our original epiphany of “Yes! We want Scrum!” to the beginning of our first sprint, it took us six weeks to design and execute a process and transition plan. This is how we did it:
Step 1: Form a committee
Caktus is a fairly flat organization and we prefer to involve as many people as possible in decisions that affect the whole team. We formed a committee that included our founders, senior developers, and project managers to think through this change. In order for us to proceed with any of the following steps, all committee members had to be in agreement. When we encountered disagreement, we continued communicating in order to identify and resolve points of contention.
Step 2: Identify an approach
Originally we planned to adopt Scrum on a per-project basis. After all, most of the literature on Scrum is geared towards projects. Once we started planning this approach, however, we realized the overhead and duplication of effort required to adopt Scrum on even four concurrent projects (e.g. requiring team members to attend four discrete sets of sprint activities) was not feasible or realistic. Since Caktus works on more than four projects at a time, we needed another approach.
It was then that our CEO Tobias McNulty flipped the original concept, asking “What if instead of focusing our Scrum process around projects, we focused around teams?” After some initial head-scratching, some frantic searches in our Scrum books, and questions to our Scrum trainers, our committee agreed that the Scrum team approach was worth looking into.
Step 3: Identify cross-functional teams with feasible project assignments
Our approach to Scrum generated a lot of questions, including:
- How many teams can we have?
- Who is on which team?
- What projects would be assigned to which teams?
We broke out into several small groups and brainstormed team ideas, then met back together and presented our options to each other. There was a lot of discussion and moving around of sticky notes. We ended up leaving all the options on one of our whiteboards for several days. During this time, you’d frequently find Caktus team members gazing at the whiteboard or pensively moving sticky notes into new configurations. Eventually, we settled on a team/project configuration that required the least amount of transitions for all stakeholders (developers, clients, project managers), retained the most institutional knowledge, and demonstrated cross-functional skillsets.
Step 4: Role-to-title breakdown
Scrum specifies three roles: Development team member, Scrum Master, and Product Owner. Most organizations, including Caktus, specify job titles instead: Backend developer, UI developer, Project Manager, etc. Once we had our teams, we had to map our team members to Scrum roles.
At first, this seemed fairly straightforward. Clearly Development team member = any developers, Scrum Master = Project Manager, and Product Owner = Product Manager. Yet the more we delved into Scrum, the more it became obvious that roles ≠ titles. We stopped focusing on titles and instead focused on responsibilities, skill sets, and attributes. Once we did so, it became obvious that our Project Managers were better suited to be Product Owners.
This realization allowed us to make smarter long-term decisions when assigning members to our teams.
Step 5: Create a transition plan
The change from a client-services, multi-project organization to a client-services, multi-project organization divided into Scrum teams was not insignificant. In order to transition to our Scrum teams, we needed to orient developers to new projects, switch out some client contacts, and physically rearrange our office so that we were seated roughly with our teams. We created a plan to make the necessary changes over time so that we were prepared to start our first sprints in January 2016.
We identified which developers would need to be onboarded onto which projects, and the key points of knowledge transfer that needed to happen in order for teams to successfully support projects. We started these transitions when it made sense to do so per project per team, e.g., after the call with the client in which the client was introduced to the new developer(s), and before the holder of the institutional knowledge went on holiday vacation.
Step 6: Obtain buy-in from the team
We wanted the whole of Caktus to be on board with the change prior to January. Once we had a plan, we hosted a Q&A lunch with the team in which we introduced the new Scrum teams, sprint activity schedules, and project assignments. We answered the questions we could and wrote down the ones we couldn’t for further consideration.
After this initial launch, we had several other team announcements as the process became more defined, as well as kick-off meetings with each team in which everyone had an opportunity to choose team names, provide feedback on schedules, and share any concerns with their new Scrum team. Team name direction was “A type of cactus”, and we landed on Team Robust Hedgehog, Team Discocactus, and Team Scarlet Crown. Concerns were addressed by the teams first, and if necessary, escalated to the Product Owners for further discussion and resolution.
On January 4, 2016, Caktus started its first Scrum sprints. After three months, our teams are reliably and successfully completing sprints, and working together to support our varied clients.
What we’ve learned by adopting Scrum is that Scrum is not a silver bullet. What Scrum doesn’t cover is a much larger list than what it does. The Caktus team has earnestly identified, confronted, and worked together to resolve issues and questions exposed by our adoption of Scrum, including (but not limited to):
- How best to communicate our Scrum process to our clients, so they can understand how it affects their projects?
- How does the Product Strategist title fit into Scrum?
- How can we transition from scheduling projects in hours to relative sizing by sprint in story points, while still estimating incoming projects in hours?
- How do sales efforts get appointed to teams, scheduled into sprints, and still get completed in a satisfactory manner?
- What parts of Scrum are useful for other, non-development efforts at Caktus (retrospectives, daily check-ins, backlogs, etc)?
- Is it possible for someone to perform the Scrum Master role on one team and the Product Owner role on a different team?
Scrum provides the framework that highlights these issues but intentionally does not offer solutions to all the problems. (In fact, in the Certified ScrumMaster exam, “This is outside the scope of Scrum” is the correct answer to some of the more difficult questions.) Adopting Scrum provides teams with the opportunity to solve these problems together and design a customized process that works for them.
Scrum isn’t for every organization or every situation, but it’s working for Caktus. We look forward to seeing how it continues to evolve to help us grow sharper web apps.
Doing Math with Python
SymPy 1.0 and Anaconda 4.0 releases
SymPy 1.0 was released recently and Anaconda 4.0 was just released. I tried all the sample solutions and everything works as expected. The chapter programs should keep working as well.
You can get both updates when you install Anaconda 4.0 or update your existing Anaconda installation:
$ conda update conda
$ conda update anaconda
I have so far verified both on Mac OS X and Linux. If you find any issues on Windows, please email me at doingmathwithpython@gmail.com or post your query/tip to any of the following community forums:
April 10, 2016
Chris Hager
Python Thread Pool
A thread pool is a group of pre-instantiated, idle threads which stand ready to be given work. These are often preferred over instantiating new threads for each task when there is a large number of (short) tasks to be done rather than a small number of long ones.
Suppose you want to download thousands of documents from the internet, but only have the resources to download 50 at a time. The solution is to use a thread pool: spawn a fixed number of threads which download all the URLs from a queue, 50 at a time.
In order to use thread pools, Python 3.x includes the ThreadPoolExecutor class in concurrent.futures, and both Python 2.x and 3.x have multiprocessing.dummy.ThreadPool. multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
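As a minimal sketch of the Python 3 standard-library approach (the fetch function and URLs below are placeholders, not real downloads):
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real download function
    return url

urls = ["http://example.com/doc/%d" % i for i in range(200)]

# At most 50 downloads run concurrently; the with-block waits for all of them.
with ThreadPoolExecutor(max_workers=50) as executor:
    for result in executor.map(fetch, urls):
        pass  # process each downloaded document here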
The downside of multiprocessing.dummy.ThreadPool is that in Python 2.x it is not possible to exit the program with, e.g., a KeyboardInterrupt before all tasks in the queue have been finished by the threads.
In order to achieve an interruptible thread queue in Python 2.x and 3.x (for use in PDFx), I’ve built this code, inspired by stackoverflow.com/a/7257510. It implements a thread pool which works with Python 2.x and 3.x:
import sys

IS_PY2 = sys.version_info < (3, 0)

if IS_PY2:
    from Queue import Queue
else:
    from queue import Queue

from threading import Thread


class Worker(Thread):
    """ Thread executing tasks from a given tasks queue """
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                func(*args, **kargs)
            except Exception as e:
                # An exception happened in this thread
                print(e)
            finally:
                # Mark this task as done, whether an exception happened or not
                self.tasks.task_done()


class ThreadPool:
    """ Pool of threads consuming tasks from a queue """
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        """ Add a task to the queue """
        self.tasks.put((func, args, kargs))

    def map(self, func, args_list):
        """ Add a list of tasks to the queue """
        for args in args_list:
            self.add_task(func, args)

    def wait_completion(self):
        """ Wait for completion of all the tasks in the queue """
        self.tasks.join()


if __name__ == "__main__":
    from random import randrange
    from time import sleep

    # Function to be executed in a thread
    def wait_delay(d):
        print("sleeping for (%d)sec" % d)
        sleep(d)

    # Generate random delays
    delays = [randrange(3, 7) for i in range(50)]

    # Instantiate a thread pool with 5 worker threads
    pool = ThreadPool(5)

    # Add the jobs in bulk to the thread pool. Alternatively you could use
    # `pool.add_task` to add single jobs. The code will block here, which
    # makes it possible to cancel the thread pool with an exception when
    # the currently running batch of workers is finished.
    pool.map(wait_delay, delays)
    pool.wait_completion()
The queue size is the same as the number of threads (see self.tasks = Queue(num_threads)); therefore adding tasks with pool.map(..) and pool.add_task(..) blocks until a new slot in the Queue is available.
When you issue a KeyboardInterrupt by pressing Ctrl+C, the current batch of workers will finish and the program quits with the exception at the pool.map(..) step.
If you have suggestions or feedback, let me know via @metachris