Planet Python
Last update: January 18, 2017 10:46 AM
January 18, 2017
Django Weblog
Django 1.11 alpha 1 released
Django 1.11 alpha 1 is now available. It represents the first stage in the 1.11 release cycle and is an opportunity for you to try out the changes coming in Django 1.11.
Django 1.11 has a medley of new features which you can read about in the in-development 1.11 release notes.
This alpha milestone marks a complete feature freeze. The current release schedule calls for a beta release in about a month and a release candidate about a month after that. We'll only be able to keep this schedule if we get early and frequent testing from the community. Updates on the release schedule are available on the django-developers mailing list.
As with all alpha and beta packages, this is not for production use. But if you'd like to take some of the new features for a spin, or to help find and fix bugs (which should be reported to the issue tracker), you can grab a copy of the alpha package from our downloads page or on PyPI.
The PGP key ID used for this release is Tim Graham: 1E8ABDC773EDE252.
Daniel Bader
Assert Statements in Python
How to use assertions to help automatically detect errors in your Python programs in order to make them more reliable and easier to debug.
What Are Assertions & What Are They Good For?
Python’s assert statement is a debugging aid that tests a condition. If the condition is true, it does nothing and your program just continues to execute. But if the assert condition evaluates to false, it raises an AssertionError exception with an optional error message.
The proper use of assertions is to inform developers about unrecoverable errors in a program. They’re not intended to signal expected error conditions, like “file not found”, where a user can take corrective action or just try again.
Another way to look at it is to say that assertions are internal self-checks for your program. They work by declaring some conditions as impossible in your code. If these conditions don't hold, that means there's a bug in the program.
If your program is bug-free, these conditions will never occur. But if they do occur the program will crash with an assertion error telling you exactly which “impossible” condition was triggered. This makes it much easier to track down and fix bugs in your programs.
To summarize: Python’s assert statement is a debugging aid, not a mechanism for handling run-time errors. The goal of using assertions is to let developers find the likely root cause of a bug more quickly. An assertion error should never be raised unless there’s a bug in your program.
Assert in Python — An Example
Here’s a simple example so you can see where assertions might come in handy. I tried to give this some semblance of a real world problem you might actually encounter in one of your programs.
Suppose you were building an online store with Python. You’re working to add a discount coupon functionality to the system and eventually write the following apply_discount function:
def apply_discount(product, discount):
    price = int(product['price'] * (1.0 - discount))
    assert 0 <= price <= product['price']
    return price
Notice the assert statement in there? It will guarantee that, no matter what, discounted prices cannot be lower than $0 and they cannot be higher than the original price of the product.
Let’s make sure this actually works as intended if we call this function to apply a valid discount:
#
# Our example product: Nice shoes for $149.00
#
>>> shoes = {'name': 'Fancy Shoes', 'price': 14900}

#
# 25% off -> $111.75
#
>>> apply_discount(shoes, 0.25)
11175
Alright, this worked nicely. Now, let’s try to apply some invalid discounts:
# # A "200% off" discount: # >>> apply_discount(shoes, 2.0) Traceback (most recent call last): File "<input>", line 1, in <module> apply_discount(prod, 2.0) File "<input>", line 4, in apply_discount assert 0 <= price <= product['price'] AssertionError # # A "-30% off" discount: # >>> apply_discount(shoes, -0.3) Traceback (most recent call last): File "<input>", line 1, in <module> apply_discount(prod, -0.3) File "<input>", line 4, in apply_discount assert 0 <= price <= product['price'] AssertionError
As you can see, trying to apply an invalid discount raises an AssertionError exception that points out the line with the violated assertion condition. If we ever encounter one of these errors while testing our online store it will be easy to find out what happened by looking at the traceback.
This is the power of assertions, in a nutshell.
Python’s Assert Syntax
It’s always a good idea to study up on how a language feature is actually implemented in Python before you start using it. So let’s take a quick look at the syntax for the assert statement according to the Python docs:
assert_stmt ::= "assert" expression1 ["," expression2]
In this case expression1 is the condition we test, and the optional expression2 is an error message that’s displayed if the assertion fails.
At execution time, the Python interpreter transforms each assert statement into roughly the following:
if __debug__:
    if not expression1:
        raise AssertionError(expression2)
You can use expression2 to pass an optional error message that will be displayed with the AssertionError in the traceback. This can simplify debugging even further—for example, I’ve seen code like this:
if cond == 'x':
    do_x()
elif cond == 'y':
    do_y()
else:
    assert False, (
        "This should never happen, but it does occasionally. "
        "We're currently trying to figure out why. "
        "Email dbader if you encounter this in the wild.")
Is this ugly? Well, yes. But it’s definitely a valid and helpful technique if you’re faced with a heisenbug-type issue in one of your applications. 😉
Common Pitfalls With Using Asserts in Python
Before you move on, there are two important caveats with using assertions in Python that I’d like to call out.
The first one has to do with introducing security risks and bugs into your applications, and the second one is about a syntax quirk that makes it easy to write useless assertions.
This sounds (and potentially is) pretty horrible, so you might at least want to skim these two caveats or read their summaries below.
Caveat #1 – Don’t Use Asserts for Data Validation
Asserts can be turned off globally in the Python interpreter. Don’t rely on assert expressions to be executed for data validation or data processing.
The biggest caveat with using asserts in Python is that assertions can be globally disabled with the -O and -OO command line switches, as well as the PYTHONOPTIMIZE environment variable in CPython.
This turns any assert statement into a null-operation: the assertions simply get compiled away and won’t be evaluated, which means that none of the conditional expressions will be executed.
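You can see this for yourself with a minimal, hypothetical two-line script (the filename check_asserts.py is made up for this demo):

# check_asserts.py -- hypothetical demo of assertions being compiled away
assert False, 'assert statements are being evaluated'
print('all assert statements were compiled away')

Running python check_asserts.py crashes immediately with an AssertionError, while python -O check_asserts.py happily prints the message, because the assert statement was stripped out.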
This is an intentional design decision used similarly by many other programming languages. As a side-effect it becomes extremely dangerous to use assert statements as a quick and easy way to validate input data.
Let me explain—if your program uses asserts to check whether a function argument contains a “wrong” or unexpected value, this can backfire quickly and lead to bugs or security holes.
Let’s take a look at a simple example. Imagine you’re building an online store application with Python. Somewhere in your application code there’s a function to delete a product as per a user’s request:
def delete_product(product_id, user):
    assert user.is_admin(), 'Must have admin privileges to delete'
    assert store.product_exists(product_id), 'Unknown product id'
    store.find_product(product_id).delete()
Take a close look at this function. What happens if assertions are disabled?
There are two serious issues in this three-line function example, caused by the incorrect use of assert statements:
- Checking for admin privileges with an assert statement is dangerous. If assertions are disabled in the Python interpreter, this turns into a null-op. Therefore any user can now delete products. The privileges check doesn’t even run. This likely introduces a security problem and opens the door for attackers to destroy or severely damage the data in your customer’s or company’s online store. Not good.
- The product_exists() check is skipped when assertions are disabled. This means find_product() can now be called with invalid product ids—which could lead to more severe bugs depending on how our program is written. In the worst case this could be an avenue for someone to launch denial-of-service attacks against our store: if the store app crashes when we attempt to delete an unknown product, it might be possible for an attacker to bombard it with invalid delete requests and cause an outage.
How might we avoid these problems? The answer is to not use assertions to do data validation. Instead we could do our validation with regular if-statements and raise validation exceptions if necessary. Like so:
def delete_product(product_id, user):
    if not user.is_admin():
        raise AuthError('Must have admin privileges to delete')
    if not store.product_exists(product_id):
        raise ValueError('Unknown product id')
    store.find_product(product_id).delete()
This updated example also has the benefit that instead of raising unspecific AssertionError exceptions, it now raises semantically correct exceptions like ValueError or AuthError (which we’d have to define ourselves).
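For completeness, here's one minimal way such a custom exception might be defined (just a sketch; all it needs is to subclass Exception):

class AuthError(Exception):
    """Raised when a user lacks the privileges required for an action."""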
Caveat #2 – Asserts That Never Fail
It’s easy to accidentally write Python assert statements that always evaluate to true. I’ve been bitten by this myself in the past. I wrote a longer article about this specific issue you can check out by clicking here.
Alternatively, here’s the executive summary:
When you pass a tuple as the first argument in an assert statement, the assertion always evaluates as true and therefore never fails.
For example, this assertion will never fail:
assert(1 == 2, 'This should fail')
This has to do with non-empty tuples always being truthy in Python. If you pass a tuple to an assert statement, the assert condition is always true—which in turn makes the above assert statement useless because it can never fail and trigger an exception.
It’s relatively easy to accidentally write bad multi-line asserts due to this unintuitive behavior. This quickly leads to broken test cases that give a false sense of security in our test code. Imagine you had this assertion somewhere in your unit test suite:
assert (
    counter == 10,
    'It should have counted all the items'
)
Upon first inspection this test case looks completely fine. However, it would never catch an incorrect result: the assertion always evaluates to True, regardless of the state of the counter variable.
Like I said, it’s rather easy to shoot yourself in the foot with this (at least if you’re like me). Luckily, there are some countermeasures you can apply to prevent this syntax quirk from causing trouble:
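The most direct one is a sketch of the fix itself: drop the wrapping parentheses so the condition and the message are two separate expressions again, using a backslash if you need the line break:

counter = 10  # whatever the code under test computed
assert counter == 10, \
    'It should have counted all the items'

Written this way, the assertion actually fails whenever counter is anything other than 10.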
>> Read the full article on bogus assertions to get the dirty details.
Python Assertions — Summary
Despite these caveats I believe that Python’s assertions are a powerful debugging tool that’s frequently underused by Python developers.
Understanding how assertions work and when to apply them can help you write more maintainable and easier to debug Python programs. It’s a great skill to learn that will help bring your Python to the next level and make you a more well-rounded Pythonista.
Matthew Rocklin
Dask Development Log
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2017-01-01 and 2017-01-17. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Themes of the last couple of weeks:
- Stability enhancements for the distributed scheduler and micro-release
- NASA Grant writing
- Dask-EC2 script
- Dataframe categorical flexibility (work in progress)
- Communication refactor (work in progress)
Stability enhancements and micro-release
We’ve released dask.distributed version 1.15.1, which includes important bugfixes after the recent 1.15.0 release. There were a number of small issues that conspired to remove tasks erroneously. This was generally OK because the Dask scheduler was able to heal the missing pieces (using the same machinery that makes Dask resilient) and so we didn’t notice the flaw until the system was deployed in some of the more serious Dask deployments in the wild. PR dask/distributed #804 contains a full writeup in case anyone is interested. The writeup ends with the following line:
This was a nice exercise in how coupling mostly-working components can easily yield a faulty system.
The release also includes other fixes, such as one for a compatibility issue with the new Bokeh 0.12.4 release.
NASA Grant Writing
I’ve been writing a proposal to NASA to help fund distributed Dask+XArray work for atmospheric and oceanographic science at the 100TB scale. Many thanks to our scientific collaborators who are offering support here.
Dask-EC2 startup
The Dask-EC2 project deploys Anaconda, a Dask cluster, and Jupyter notebooks on Amazon’s Elastic Compute Cloud (EC2) with a small command line interface:
pip install dask-ec2 --upgrade
dask-ec2 up --keyname KEYNAME \
--keypair /path/to/ssh-key \
--type m4.2xlarge \
--count 8
This project can be either very useful for people just getting started and for Dask developers when we run benchmarks, or it can be horribly broken if AWS or Dask interfaces change and we don’t keep this project maintained. Thanks to a great effort from Ben Zaitlen, dask-ec2 is again in a very useful state, where I’m hoping it will stay for some time.
If you’ve always wanted to try Dask on a real cluster and if you already have AWS credentials then this is probably the easiest way.
This already seems to be paying dividends. There have been a few unrelated pull requests from new developers this week.
Dataframe Categorical Flexibility
Categoricals can significantly improve performance on text-based data. Currently Dask’s dataframes support categoricals, but they expect to know all of the categories up-front. This is easy if this set is small, like the ["Healthy", "Sick"] categories that might arise in medical research, but requires a full dataset read if the categories are not known ahead of time, like the names of all of the patients.
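As a small illustration of the up-front case (in plain pandas, with made-up data):

import pandas as pd

# A small category set that is known ahead of time, as in the medical example.
status = pd.Categorical(['Healthy', 'Sick', 'Healthy'],
                        categories=['Healthy', 'Sick'])
s = pd.Series(status)
print(s.cat.categories)  # Index(['Healthy', 'Sick'], dtype='object')

Declaring the categories is trivial here; the hard case is when the full set of category values can only be discovered by reading the entire dataset.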
Jim Crist is changing this so that Dask can operate on categorical columns with unknown categories at dask/dask #1877. The constituent pandas dataframes all have possibly different categories that are merged as necessary. This distinction may seem small, but it limits performance in a surprising number of real-world use cases.
Communication Refactor
Since the recent worker refactor and optimizations it has become clear that inter-worker communication has become a dominant bottleneck in some intensive applications. Antoine Pitrou is currently refactoring Dask’s network communication layer, making room for more communication options in the future. This is an ambitious project. I for one am very happy to have someone like Antoine looking into this.
January 17, 2017
Brian Curtin
Easy deprecations in Python with @deprecated
Tim Peters once wrote, "[t]here should be one—and preferably only one—obvious way to do it." Sometimes we don't do it right the first time, or we later decide something shouldn't be done at all. For those reasons and more, deprecations are a tool to enable growth while easing the pain of transition.
Rather than switching "cold turkey" from API1 to API2 you do it gradually, introducing API2 with documentation, examples, notifications, and other helpful tools to get your users to move away from API1. Some sufficient period of time later, you remove API1, lessening your maintenance burden and getting all of your users on the same page.
One of the biggest issues I've seen is that last part, the removal. More often than not, it's a manual step. You determine that some code can be removed in a future version of your project and you write it down in an issue tracker, a wiki, a calendar event, a post-it note, or something else you're going to ignore. For example, I once did some work on CPython around removing support for Windows 9x in the subprocess module, which I only knew about because I was one of the few Windows people around and I happened across PEP 11 at the right time.
Automate It!
Over the years I've seen and used several forms of a decorator for Python functions that marks code as deprecated. They're all fairly good, as they raise DeprecationWarning for you and some of them update the function's docstring. However, as Python 2.7 began ignoring DeprecationWarning [1], they require some extra steps to become entirely useful, otherwise they're yelling into the void. Enabling the warnings in your development environment is easy, by passing a -W command-line option or by setting the PYTHONWARNINGS environment variable, but you deserve more.
import deprecation
If you pip install libdeprecation [2], you get a few things:
- If you decorate a function with deprecation.deprecated, your now deprecated code raises DeprecationWarning. Rather, it raises deprecation.DeprecatedWarning, but that's a subclass, as is deprecation.UnsupportedWarning. You'll see why it's useful in a second.
- Your docstrings are updated with deprecation details. This includes the versions you set, along with optional details, such as directing users to something that replaces the deprecated code. So far this isn't all that different from what's been around the web for ten-plus years.
- If you pass deprecation.deprecated enough information and then use deprecation.fail_if_not_removed on tests which call that deprecated code, you'll get tests that fail when it's time for them to be removed. When your code has reached the version where you need to remove it, it will emit deprecation.UnsupportedWarning and the tests will handle it and turn it into a failure.
@deprecation.deprecated(deprecated_in="1.0", removed_in="2.0",
current_version=__version__,
details="Use the ``one`` function instead")
def won():
"""This function returns 1"""
# Oops, it's one, not won. Let's deprecate this and get it right.
return 1
...
@deprecation.fail_if_not_removed
def test_won(self):
self.assertEqual(1, won())
All in all, the process of documenting, notifying, and eventually moving on is handled for you. When __version__ = "2.0", that test will fail and you'll be able to catch it before releasing it.
Full documentation and more examples are available at deprecation.readthedocs.io, and the source can be found on GitHub at briancurtin/deprecation.
Happy deprecating!
[1] Exposing application users to DeprecationWarnings that are emitted by lower-level code needlessly involves end-users in "how things are done." It often leads to users raising issues about warnings they're presented, which on one hand is done rightfully so, as it's been presented to them as some sort of issue to resolve. However, at the same time, the warning could be well known and planned for. From either side, loud DeprecationWarnings can be seen as noise that isn't necessary outside of development.
[2] The deprecation name on PyPI is currently being squatted on, so I've reached out to the current holder to see if I can use it. Only the PyPI package name is libdeprecation; none of the project's API uses that name. I hope to eventually deprecate libdeprecation to change names, which I think is self-deprecating?
Chris Moffitt
Data Science Challenge - Predicting Baseball Fanduel Points
Introduction
Several months ago, I participated in my first crowd-sourced Data Science competition in the Twin Cities run by Analyze This!. In my previous post, I described the benefits of working through the competition and how much I enjoyed the process. I just completed the second challenge and had another great experience that I wanted to share and (hopefully) encourage others to try these types of practical challenges to build their Data Science/Analytics skills.
In this second challenge, I felt much more comfortable with the actual process of cleaning the data, exploring it and building and testing models. I found that the python tools continue to serve me well. However, I also identified a lot of things that I need to do better in future challenges or projects in order to be more systematic about my process. I am curious if the broader community has tips or tricks they can share related to some of the items I will cover below. I will also highlight a few of the useful python tools I used throughout the process. This post does not include any code but is focused more on the process and python tools for Data Science.
Background
As mentioned in my previous post, Analyze This! is an organization dedicated to raising awareness of the power of Data Science and increasing visibility in the local business community of the capabilities that Data Science can bring to their organizations. In order to accomplish this mission, Analyze This! hosts friendly competitions and monthly educational sessions on various Data Science topics.
This specific competition focused on predicting 2015 Major League Baseball Fanduel points. A local company provided ~36,000 rows of data to be used in the analysis. The objective was to use the 116 measures to build a model to predict the actual points a hitter would get in a Fanduel fantasy game. Approximately 10 teams of 3-5 people each participated in the challenge and the top 4 presented at SportCon. I was very proud to be a member of the team that made the final 4 cut and presented at SportCon.
Observations
As I went into the challenge, I wanted to leverage the experience from the last challenge and focus on a few skills to build in this event. I specifically wanted to spend more time on the exploratory analysis in order to more thoughtfully construct my models. In addition, I wanted to actually build out and try the models on my own. My past experience was very ad-hoc. I wanted this process to be a little more methodical and logical.
Leverage Standards
About a year ago, I took an introductory Business Analytics class which used the book Data Science for Business (Amazon Referral) by Foster Provost and Tom Fawcett as one of the primary textbooks for the course. As I have spent more time working on simple Data Science projects, I have really come to appreciate the insights and perspectives from this book.
In the future, I would like to do a more in-depth review of this book but for the purposes of this article, I used it as a reference to inform the basic process I wanted to follow for the project. Not surprisingly, this book mentions that there is an established methodology for Data Mining/Analysis called the “Cross Industry Standard Process for Data Mining” aka CRISP-DM. Here is a simple graphic showing the various phases:
credit: Kenneth Jensen
This process matched what my experience had been in the past in that it is very iterative as you explore the potential solutions. I plan to continue to use this as a model for approaching data analysis problems.
Business and Data Understanding
For this particular challenge, there were a lot of interesting aspects to the “business” and “data” understanding. From a personal perspective, I was familiar with baseball as a casual fan but did not have any in-depth experience with Fanduel so one of the first things I had to do was learn more about how scores were generated for a given game.
In addition to the basic understanding of the problem, it was a bit of a challenge to interpret some of the various measures: understanding how they were calculated and figuring out what they actually represented. It was clear as we went through the final presentations that some groups understood the intricacies of the data in much more detail than others. It was also interesting that in-depth understanding of each data element was not required to actually “win” the competition.
Finally, this phase of the process would typically involve more thought around what data elements to capture. The structure of this specific challenge made that a non-issue since all data was provided and we were not allowed to augment it with other data sources.
Data Preparation
For this particular problem, the data was relatively clean and easily read in via Excel or csv. However there were three components to the data cleaning that impacted the final model:
- Handling missing data
- Encoding categorical data
- Scaling data
As I worked through the problem, it was clear that managing these three factors required quite a bit of intuition and trial and error to figure out the best approach.
I am generally aware of the options for handling missing data but I did not have a good intuition for when to apply the various approaches (each option is sketched in pandas after the list below):
- When is it better to replace a missing value with a numerical substitute like mean, median or mode?
- When should a dummy value like NaN or -1 be used?
- When should the data just be dropped?
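Here is a minimal pandas sketch of those three options on a made-up column; each call returns a new object rather than modifying df in place:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

df['x'].fillna(df['x'].mean())   # numerical substitute (mean; median/mode work similarly)
df['x'].fillna(-1)               # dummy value
df.dropna(subset=['x'])          # drop the incomplete rows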
Categorical data proved to have somewhat similar challenges. There were approximately 16 categorical variables that could be encoded in several ways (the first three options are sketched after this list):
- Binary (Day/Night)
- Numerical range (H-M-L converted to 3-2-1)
- One hot encoding (each value in a column)
- Excluded from the model
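In pandas, those encodings are essentially one-liners (the column names here are made up):

import pandas as pd

df = pd.DataFrame({'time': ['Day', 'Night'], 'effort': ['H', 'M']})

df['time'].map({'Day': 1, 'Night': 0})        # binary encoding
df['effort'].map({'H': 3, 'M': 2, 'L': 1})    # numerical range (H-M-L -> 3-2-1)
pd.get_dummies(df['time'])                    # one hot encoding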
Finally, the data included many measures with values < 1 as well as measures > 1000. Depending on the model, these scales could over-emphasize some results at the expense of others. Fortunately scikit-learn has options for mitigating this, but how do you know when to use which option? In my case, I stuck with using RobustScaler as my go-to function. This may or may not be the right approach.
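For reference, a minimal sketch of how RobustScaler is typically used (the feature matrix is made up):

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[0.5, 2000.0],
              [0.1, 9000.0],
              [0.9, 1500.0]])               # made-up features on wildly different scales

X_scaled = RobustScaler().fit_transform(X)  # centers on the median, scales by the IQR
print(X_scaled)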
The challenge with all these options is that I could not figure out a good systematic way to evaluate each of these data preparation steps and how they impacted the model. The entire process felt like a lot of trial and error.
Ultimately, I believe this is just part of the process but I am interested in understanding how to systematically approach these types of data preparation steps in a methodical manner.
Modeling and Evaluation
For modeling, I used the standard scikit-learn tools augmented with TPOT, and ultimately used XGBoost as the model of choice.
In a similar vein to the challenges with data prep, I struggled to figure out how to choose which model worked best. The data set was not tremendously large but some of the modeling approaches could take several minutes to run. By the time I factored in all of the possible options of data prep + model selection + parameter tuning, it was very easy to get lost in the process.
Scikit-learn has capabilities to tune hyper-parameters, which is helpful. Additionally, TPOT can be a great tool to try a bunch of different approaches too. However, these tools don’t always help with the further upstream processes related to data prep and feature engineering. I plan to investigate more options in this area in future challenges.
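For example, here's a minimal sketch of scikit-learn's grid search over hyper-parameters (the model, grid, and data are all illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # stand-in data

param_grid = {'n_estimators': [100, 300], 'max_depth': [3, 6]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)

This automates the model-selection loop, but notice that nothing in it touches the data preparation choices discussed above: those still have to be evaluated separately, or folded into a scikit-learn Pipeline.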
Tool Sets
In this particular challenge, most groups used either R or python for their solution. I found it interesting that python seemed to be the dominant tool and that most people used the standard python Data Science stack. However, even though everyone used similar tools and processes, we did come up with different approaches to the solutions.
I used Jupyter Notebooks quite extensively for my analysis but realized that I need to re-think how to organize them. As I iterated through the various solutions, I started to spend more time struggling to find which notebook contained a certain piece of code I needed. Sorting and searching through the various notebooks is very limited since the notebook name is all that is displayed on the notebook index.
One of my biggest complaints with Jupyter notebooks is that they don’t lend themselves to standard version control like a standalone python script. Obviously, storing a notebook in git or mercurial is possible but it is not very friendly for diff viewing. I recently learned about the nbdime project, which looks very interesting and which I may check out next time.
Speaking of Notebooks, I found a lot of useful examples for python code in the Allstate Kaggle Competition. That competition’s data set lent itself to data analysis approaches that worked well for the Baseball data too. I used a lot of code snippets and ideas from these kernels. I encourage people to check out all of the kernels that are available on Kaggle. They do a nice job of showing how to approach problems from multiple different perspectives.
Another project I will likely use going forward is the Cookiecutter template for Data Science. The basic structure may be a little overkill for a small project but I like the idea of enforcing some consistency in the process. As I looked through this template and the basic thought process for its development, it makes a lot of sense and I look forward to trying it in the future.
Another tool that I used in the project was mlxtend which contains a set of tools that are useful for “day-to-day data science tasks.” I particularly liked the ease of creating a visual plot of a confusion matrix. There are several other useful functions in this package that work quite well with scikit-learn. It’s well worth investigating all the functionality.
Finally, this dataset did have a lot of missing data. I enjoyed using the missingno tool to get a quick visualization of where the missing data was and how prevalent the missing values were. This is a very powerful library for visualizing missing data in a pandas DataFrame.
Conclusion
I have found that the real life process of analyzing and working through a Data Science challenge is one of the best ways to build up my skills and experience. There are many resources on the web that explain how to use tools like pandas, scikit-learn, XGBoost, etc. but using the tools is just one piece of the puzzle. The real value is knowing how to smartly apply these tools and intuitively understanding how different choices will impact the rest of the downstream processes. This knowledge can only be gained by doing something over and over. Data Science challenges that focus on real-world issues are tremendously useful opportunities to learn and build skills.
Thanks again to all the people that make Analyze This! possible. I feel very fortunate that this type of event is available in my home town and hopefully others can replicate it in their own geographies.
PyTennessee
PyTN Profiles: Calvin Hendryx-Parker and Juice Analytics

Speaker Profile: Calvin Hendryx-Parker (@calvinhp)
Six Feet Up, Inc. co-founder Calvin Hendryx-Parker has over 18 years of experience in the development and hosting of applications using Python and web frameworks including Django, Pyramid and Flask.
As Chief Technology Officer for Six Feet Up, Calvin is responsible for researching cutting-edge advances that could become part of the company’s technology road map. Calvin provides both the company and its clients with recommendations on tools and technologies, systems architecture and solutions that address specific information-sharing needs. Calvin is an advocate of open source and is a frequent speaker at Python conferences on multisite content management, integration, and web app development. Calvin is also a founder and organizer of the IndyPy meetup group and Pythology training series in Indianapolis.
Outside of work Calvin spends time tinkering with new devices like the Fitbit, Pebble and Raspberry Pi. Calvin is an avid distance runner and ran the 2014 NYC Marathon to support the Innocence Project. Every year he and his family enjoy an extended trip to France, where his wife Gabrielle, the CEO of Six Feet Up, is from. Calvin holds a Bachelor of Science from Purdue University.
Calvin will be presenting “Open Source Deployment Automation and Orchestration with SaltStack” at 3:00PM Saturday (2/4) in Room 200. Salt is way more than a configuration management tool. It supports many other types of activities such as remote execution and full-blown system orchestration. It can be used as a replacement for remote task tools such as Fabric or Paver.

Sponsor Profile: Juice Analytics (@juiceanalytics)
At Juice, we’re building Juicebox, a cloud platform to allow anyone to build and share stories with data. Juicebox is built on AWS, Python, Backbone.js and D3. We’re looking for a frontend dev with a love of teamwork, a passion for pixels, and a devotion to data. Love of Oxford commas also required.
PyTN Profiles: Jared M. Smith and SimplyAgree

Speaker Profile: Jared M. Smith (@jaredthecoder)
I’m a Research Scientist at Oak Ridge National Laboratory, where I engage in computer security research with the Cyber Warfare Research Team. I am also pursuing my PhD in Computer Science at the University of Tennessee, Knoxville. I founded VolHacks, our university’s hackathon, and HackUTK, our university’s computer security club. I used to work at Cisco Systems as a Software Security Engineer working on pentesting engagements and security tooling.
Back at home, I helped start the Knoxville Python meetup. I also serve on the Knoxville Technology Council, volunteer at the Knoxville Entrepreneur Center, do consulting for VC-backed startups, compete in hackathons and pitch competitions, and hike in the Great Smoky Mountains.
Jared will be presenting “Big Data Analysis in Python with Apache Spark, Pandas, and Matplotlib” at 3:00PM Saturday (2/4) in Room 100. Big data processing is finally approachable for the modern Pythonista. Using Apache Spark and other data analysis tools, we can process, analyze, and visualize more data than ever before using Pythonic APIs and a language you already know, without having to learn Java, C++, or even Fortran. Come hang out and dive into the essentials of big data analysis with Python.

Sponsor Profile: SimplyAgree (@simplyagree)
SimplyAgree is an electronic signature and closing management tool for complex corporate transactions. The app is built on Python, Django and Django REST Framework.
Our growing team is based in East Nashville, TN.
DataCamp
NumPy Cheat Sheet: Data Analysis in Python
Given the fact that it's one of the fundamental packages for scientific computing, NumPy is one of the packages that you must be able to use and know if you want to do data science with Python. It offers a great alternative to Python lists, as NumPy arrays are more compact, allow faster access in reading and writing items, and are more convenient and more efficient overall.
In addition, it's (partly) the foundation of other important packages that are used for data manipulation and machine learning which you might already know, namely, Pandas, Scikit-Learn and SciPy:
- The Pandas data manipulation library builds on NumPy, but instead of the arrays, it makes use of two other fundamental data structures: Series and DataFrames,
- SciPy builds on Numpy to provide a large number of functions that operate on NumPy arrays, and
- The machine learning library Scikit-Learn builds not only on NumPy, but also on SciPy and Matplotlib.
You see, this Python library is a must-know: if you know how to work with it, you'll also gain a better understanding of the other Python data science tools that you'll undoubtedly be using.
It's a win-win situation, right?
Nevertheless, just like any other library, NumPy can come off as quite overwhelming at the start. What are the very basics that you need to know in order to get started with this data analysis library?
This cheat sheet is meant to give you a good overview of the possibilities that this library has to offer.
Go and check it out for yourself!
You'll see that this cheat sheet covers the basics of NumPy that you need to get started: it provides a brief explanation of what the Python library has to offer and what the array data structure looks like, and goes on to summarize topics such as array creation, I/O, array examination, array mathematics, copying and sorting arrays, selection of array elements and shape manipulation.
NumPy arrays are often preferred over Python lists, and you'll see that selecting elements from arrays is very similar to selecting elements from lists.
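As a tiny illustration of that similarity (and of one key difference):

import numpy as np

a = np.array([10, 20, 30, 40])

print(a[0])     # indexing, just like a list
print(a[1:3])   # slicing, just like a list
print(a * 2)    # unlike lists, arithmetic is elementwise: [20 40 60 80]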
Do you want to know more? Check out DataCamp's Python list tutorial.
PS. Don't miss our other Python cheat sheets for data science that cover Scikit-Learn, Bokeh, Pandas and the Python basics.
S. Lott
Irrelevant Feature Comparison
A Real Email.
So, please consider creating a blog post w/ a title something like "Solving the Fred Flintstone Problem using Monads in Python and Haskell"
First. I can't improve on what's been presented.
Second. I don't see any problems that are solved well by monads in Python. In a lazy, optimized, functional language, monads can be used to bind operations into ordered sequences. This is why file parsing and file writing examples of monads abound. They can also be used to bind a number of types so that operator overloading in the presence of strict type checking can be implemented. None of this seems helpful in Python.
Perhaps monads will be helpful with Python type hints. I'll wait and see if a monad definition shows up in the typing module. There, it may be a useful tool for handling dynamic type bindings.
Third. This request is perilously close to a "head-to-head" comparison between languages. The question says "problem", but it is similar to asking to see the exact same algorithm implemented in two different languages. It makes as much sense as comparing Python's built-in complex type with Java's built-in complex type (which Java doesn't have.)
Here's the issue. I replace Fred Flintstone with "Parse JSON Notation". This is a cool application of monads to recognize the various sub-classes of JSON syntax and emit the correctly-structured document. See http://fssnip.net/bq/title/JSON-parsing-with-monads. In Python, this is import json. This isn't informative about the language. If we look at the Python code, we see some operations that might be considered as eligible for a rewrite using monads. But Python isn't compiled and doesn't have the same type-checking issues. The point is that Python has alternatives to monads.
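For scale, the entire Python "solution" to that JSON problem is a sketch like this:

import json

doc = json.loads('{"name": "Fred", "points": [1, 2, 3]}')
print(doc['name'])   # correctly-structured Python objects, no monads required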
def goto(destination):
global next
next = destination
def min_none(sequence):
try:
return min(sequence)
except ValueError:
return None
def execute(program, debug=False, stmt=None):
global next, context
if stmt is None:
stmt = min(program.keys())
context = {'goto': goto}
while stmt is not None:
next = min_none(list(filter(lambda x: x>stmt, program.keys())))
if debug:
print(">>>", program[stmt])
exec(program[stmt], globals(), context)
stmt = next
example = {
100: "a = 10",
200: "if a == 0: goto(500)",
250: "print(a)",
300: "a = a - 1",
400: "goto(200)",
500: "print('done'()",
}
execute(example)
Given this, we can now compare the GOTO between Python, BASIC, and Haskell. Or maybe we can look at Monads in BASIC vs. Haskell.
Python Insider
Python 3.5.3 and 3.4.6 are now available
Python 3.5.3 and Python 3.4.6 are now available for download.
You can download Python 3.5.3 here, and you can download Python 3.4.6 here.
Wingware Blog
Remote Development with Wing Pro 6
Wing Pro 6 introduces easy to configure and use remote development, where the IDE can edit, test, debug, search, and manage files as if they were stored on the same machine as the IDE.
Ned Batchelder
Coverage.py 4.3.2, 4.3.3, and 4.3.4
A handful of fixes for Coverage.py today: v4.3.2. Having active contributors certainly makes it easier to move code more quickly.
...and then it turns out, 4.3.2 wouldn't run on Python 2.6. So quick like a bunny, here comes Coverage.py version 4.3.3.
...and then that fix broke other situations on all sorts of Python versions, so Coverage.py version 4.3.4.
Matthew Rocklin
Distributed NumPy on a Cluster with Dask Arrays
This work is supported by Continuum Analytics, the XDATA Program, and the Data Driven Discovery Initiative from the Moore Foundation.
This page includes embedded large profiles. It may look better on the actual site TODO: link to live site (rather than through syndicated pages like planet.python) and it may take a while to load on non-broadband connections (total size is around 20MB)
Summary
We analyze a stack of images in parallel with NumPy arrays distributed across a cluster of machines on Amazon’s EC2 with Dask array. This is a model application shared among many image analysis groups ranging from satellite imagery to bio-medical applications. We go through a series of common operations:
- Inspect a sample of images locally with Scikit Image
- Construct a distributed Dask.array around all of our images
- Process and re-center images with Numba
- Transpose data to get a time-series for every pixel, compute FFTs
This last step is quite fun. Even if you skim through the rest of this article I recommend checking out the last section.
Inspect Dataset
I asked a colleague at the US National Institutes of Health (NIH) for a biggish imaging dataset. He came back with the following message:
Electron microscopy may be generating the biggest ndarray datasets in the field - terabytes regularly. Neuroscience needs EM to see connections between neurons, because the critical features of neural synapses (connections) are below the diffraction limit of light microscopes. This type of research has been called “connectomics”. Many groups are looking at machine vision approaches to follow small neuron parts from one slice to the next.
This data is from drosophila: http://emdata.janelia.org/. Here is an example 2d slice of the data http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000.
import skimage.io
import matplotlib.pyplot as plt
sample = skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_5000')
skimage.io.imshow(sample)
The last number in the URL is an index into a large stack of about 10000 images. We can change that number to get different slices through our 3D dataset.
samples = [skimage.io.imread('http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i)
for i in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]]
fig, axarr = plt.subplots(1, 9, sharex=True, sharey=True, figsize=(24, 2.5))
for i, sample in enumerate(samples):
axarr[i].imshow(sample, cmap='gray')
We see that our field of interest wanders across the frame over time and drops off in the beginning and at the end.
Create a Distributed Array
Even though our data is spread across many files, we still want to think of it as a single logical 3D array. We know how to get any particular 2D slice of that array using Scikit-image. Now we’re going to use Dask.array to stitch all of those Scikit-image calls into a single distributed array.
import dask.array as da
from dask import delayed
imread = delayed(skimage.io.imread, pure=True) # Lazy version of imread
urls = ['http://emdata.janelia.org/api/node/bf1/grayscale/raw/xy/2000_2000/1800_2300_%d' % i
for i in range(10000)] # A list of our URLs
lazy_values = [imread(url) for url in urls] # Lazily evaluate imread on each url
arrays = [da.from_delayed(lazy_value, # Construct a small Dask array
dtype=sample.dtype, # for every lazy value
shape=sample.shape)
for lazy_value in lazy_values]
stack = da.stack(arrays, axis=0) # Stack all small Dask arrays into one
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(1, 2000, 2000)>
>>> stack = stack.rechunk((20, 2000, 2000)) # combine chunks to reduce overhead
>>> stack
dask.array<shape=(10000, 2000, 2000), dtype=uint8, chunksize=(20, 2000, 2000)>
So here we’ve constructed a lazy Dask.array from 10 000 delayed calls to
skimage.io.imread. We haven’t done any actual work yet, we’ve just
constructed a parallel array that knows how to get any particular slice of data
by downloading the right image if necessary. This gives us a full NumPy-like
abstraction on top of all of these remote images. For example we can now
download a particular image just by slicing our Dask array.
>>> stack[5000, :, :].compute()
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
>>> stack[5000, :, :].mean().compute()
11.49902425
However we probably don’t want to operate too much further without connecting
to a cluster. That way we can just download all of the images once into
distributed RAM and start doing some real computations. I happen to have ten
m4.2xlarges on Amazon’s EC2 (8 cores, 30GB RAM each) running Dask workers.
So we’ll connect to those.
from dask.distributed import Client, progress
client = Client('scheduler-address:8786')
>>> client
<Client: scheduler="scheduler-address:8786" processes=10 cores=80>
I’ve replaced the actual address of my scheduler (something like
54.183.180.153) with scheduler-address. Let’s go ahead and bring in all of
our images, persisting the array into concrete data in memory.
stack = client.persist(stack)
This starts downloads of our 10 000 images across our 10 workers. When this completes we have 10 000 NumPy arrays spread around on our cluster, coordinated by our single logical Dask array. This takes a while, about five minutes. We’re mostly network bound here (Janelia’s servers are not co-located with our compute nodes). Here is a parallel profile of the computation as an interactive Bokeh plot.
There will be a few of these profile plots throughout the blogpost, so you
might want to familiarize yourself with them now. Every horizontal rectangle in
this plot corresponds to a single Python function running somewhere in our
cluster over time. Because we called skimage.io.imread 10 000 times there
are 10 000 purple rectangles. Their position along the y-axis denotes which of
the 80 cores in our cluster that they ran on and their position along the
x-axis denotes their start and stop times. You can hover over each rectangle
(function) for more information on what kind of task it was, how long it took,
etc. In the image below, purple rectangles are skimage.io.imread calls and
red rectangles are data transfer between workers in our cluster. Click the
magnifying glass icons in the upper right of the image to enable zooming tools.
Now that we have persisted our Dask array in memory our data is based on hundreds of concrete in-memory NumPy arrays across the cluster, rather than based on hundreds of lazy scikit-image calls. Now we can do all sorts of fun distributed array computations more quickly.
For example we can easily see our field of interest move across the frame by averaging across time:
skimage.io.imshow(stack.mean(axis=0).compute())
Or we can see when the field of interest is actually present within the frame by averaging across x and y
plt.plot(stack.mean(axis=[1, 2]).compute())
By looking at the profile plots for each case we can see that averaging over time involves much more inter-node communication, which can be quite expensive in this case.
Recenter Images with Numba
In order to remove the spatial offset across time we’re going to compute a centroid for each slice and then crop the image around that center. I looked up centroids in the Scikit-Image docs and came across a function that did way more than what I was looking for, so I just quickly coded up a solution in Pure Python and then JIT-ed it with Numba (which makes this run at C-speeds).
from numba import jit
@jit(nogil=True)
def centroid(im):
n, m = im.shape
total_x = 0
total_y = 0
total = 0
for i in range(n):
for j in range(m):
total += im[i, j]
total_x += i * im[i, j]
total_y += j * im[i, j]
if total > 0:
total_x /= total
total_y /= total
return total_x, total_y
>>> centroid(sample) # this takes around 9ms
(748.7325324581344, 802.4893005160851)
def recenter(im):
x, y = centroid(im.squeeze())
x, y = int(x), int(y)
if x < 500:
x = 500
if y < 500:
y = 500
if x > 1500:
x = 1500
if y > 1500:
y = 1500
return im[..., x-500:x+500, y-500:y+500]
plt.figure(figsize=(8, 8))
skimage.io.imshow(recenter(sample))
Now we map this function across our distributed array.
import numpy as np
def recenter_block(block):
""" Recenter a short stack of images """
return np.stack([recenter(block[i]) for i in range(block.shape[0])])
recentered = stack.map_blocks(recenter_block,
                              chunks=(20, 1000, 1000),  # chunk size changes
                              dtype=stack.dtype)
recentered = client.persist(recentered)
This profile provides a good opportunity to talk about a scheduling failure; things went a bit wrong here. Towards the beginning we quickly recenter several images (Numba is fast), taking around 300-400ms for each block of twenty images. However as some workers finish all of their allotted tasks, the scheduler erroneously starts to load balance, moving images from busy workers to idle workers. Unfortunately the network at this time appeared to be much slower than expected and so the move + compute elsewhere strategy ended up being much slower than just letting the busy workers finish their work. The scheduler keeps track of expected compute times and transfer times precisely to avoid mistakes like this one. These sorts of issues are rare, but do occur on occasion.
We check our work by averaging our re-centered images across time and displaying that to the screen. We see that our images are better centered with each other as expected.
skimage.io.imshow(recentered.mean(axis=0))
This shows how easy it is to create fast in-memory code with Numba and then scale it out with Dask.array. The two projects complement each other nicely, giving us near-optimal performance with intuitive code across a cluster.
Rechunk to Time Series by Pixel
We’re now going to rearrange our data from being partitioned by time slice, to being partitioned by pixel. This will allow us to run computations like Fast Fourier Transforms (FFTs) on each time series efficiently. Switching the chunk pattern back and forth like this is generally a very difficult operation for distributed arrays because every slice of the array contributes to every time-series. We have N-squared communication.
This analysis may not be appropriate for this data (we won’t learn any useful science from doing this), but it represents a very frequently asked question, so I wanted to include it.
Currently our Dask array has chunkshape (20, 1000, 1000), meaning that our data is collected into 500 NumPy arrays across the cluster, each of size (20, 1000, 1000).
>>> recentered
dask.array<shape=(10000, 1000, 1000), dtype=uint8, chunksize=(20, 1000, 1000)>
But we want to change this shape so that the chunks cover the entire first axis. We want all data for any particular pixel to be in the same NumPy array, not spread across hundreds of different NumPy arrays. We could solve this by rechunking so that each pixel is its own block like the following:
>>> rechunked = recentered.rechunk((10000, 1, 1))
However this would result in one million chunks (there are one million pixels)
which will result in a bit of scheduling overhead. Instead we’ll collect our
time-series into 10 x 10 groups of one hundred pixels. This will help us to
reduce overhead.
>>> # rechunked = recentered.rechunk((10000, 1, 1)) # Too many chunks
>>> rechunked = recentered.rechunk((10000, 10, 10)) # Use larger chunks
Now we compute the FFT of each pixel, take the absolute value and square to get the power spectrum. Finally to conserve space we’ll down-grade the dtype to float32 (our original data is only 8-bit anyway).
x = da.fft.fft(rechunked, axis=0)
power = abs(x ** 2).astype('float32')
power = client.persist(power, optimize_graph=False)
This is a fun profile to inspect; it includes both the rechunking and the subsequent FFTs. We’ve included a real-time trace during execution, the full profile, as well as some diagnostics plots from a single worker. These plots total up to around 20MB. I sincerely apologize to those without broadband access.
Here is a real time plot of the computation finishing over time:
And here is a single interactive plot of the entire computation after it completes. Zoom with the tools in the upper right. Hover over rectangles to get more information. Remember that red is communication.
Screenshots of the diagnostic dashboard of a single worker during this computation.
This computation starts with a lot of communication while we rechunk and realign our data (recent optimizations here by Antoine Pitrou in dask #417). Then we transition into doing thousands of small FFTs and other arithmetic operations. All of the plots above show a nice transition from heavy communication to heavy processing with some overlap each way (once some complex blocks are available we get to start overlapping communication and computation). Inter-worker communication was around 100-300 MB/s (typical for Amazon’s EC2) and CPU load remained high. We’re using our hardware.
Finally we can inspect the results. We see that the power spectrum is very boring in the corner, and has typical activity towards the center of the image.
plt.semilogy(1 + power[:, 0, 0].compute())
plt.semilogy(1 + power[:, 500, 500].compute())
Final Thoughts
This blogpost showed a non-trivial image processing workflow, emphasizing the following points:
- Construct a Dask array from lazy SKImage calls.
- Use NumPy syntax with Dask.array to aggregate distributed data across a cluster.
- Build a centroid function with Numba. Use Numba and Dask together to clean up an image stack.
- Rechunk to facilitate time-series operations. Perform FFTs.
Hopefully this example has components that look similar to what you want to do with your data on your hardware. We would love to see more applications like this out there in the wild.
What we could have done better
As always with all computationally focused blogposts we’ll include a section on what went wrong and what we could have done better with more time.
- Communication is too expensive: Interworker communications that should be taking 200ms are taking up to 10 or 20 seconds. We need to take a closer look at our communications pipeline (which normally performs just fine on other computations) to see if something is acting up. Discussion here dask/distributed #776 and early work here dask/distributed #810.
- Faulty Load balancing: We discovered a case where our load-balancing heuristics misbehaved, incorrectly moving data between workers when it would have been better to let everything alone. This is likely due to the oddly low bandwidth issues observed above.
- Loading from disk blocks network I/O: While doing this we discovered an issue where loading large amounts of data from disk can block workers from responding to network requests (dask/distributed #774)
- Larger datasets: It would be fun to try this on a much larger dataset to see how the solutions here scale.
January 16, 2017
A. Jesse Jiryu Davis
Python Async Coroutines: A Video Walkthrough

If your New Year’s resolution is to become an expert in Python coroutines, I have good news.
Back in July, the book “500 Lines or Less: Experienced Programmers Solve Interesting Problems” was published, including the chapter I co-wrote with Guido van Rossum. Our chapter explains async networking. We show how non-blocking sockets work and how Python 3’s coroutines improve asynchronous network programs. I’m very proud of the chapter, but it’s a steeper climb than I’d like.
Now, a top-notch teacher has made the climb much easier. Philip Guo is an Assistant Professor of Cognitive Science at the University of California. He’s broken the chapter into eight small sections, and he explains each one carefully in a video, either talking through our code examples or actually demonstrating them running and dissecting them in the code editor. It’s a fine piece of work and a great way to approach the material. Go watch it:
Mike Driscoll
Python 101 is now a Course on Educative
My first book, Python 101, has been made into an online course on the Educative website. Educative is kind of like Codecademy in that you can run the code snippets from the book to see what kind of output they produce. You cannot edit the examples, though. You can get 50% off of the course by using the following coupon code: au-pythonlibrary50 (note: this coupon is only good for one week).
Python 101 is primarily aimed at people who have an understanding of programming concepts or who have programmed with another language already. That said, a lot of readers who are completely new to programming have enjoyed the book too. The book itself is split into 5 distinct parts:
Part one covers the basics of Python. Part two moves into learning a little of Python’s standard library. In this section, I cover the libraries that I find myself using the most on a day-to-day basis. Part three moves into intermediate level territory and covers various topics such as decorators, debugging, code profiling and testing your code. Part four introduces the reader to installing 3rd party libraries and briefly demonstrates some of the popular ones, such as lxml, requests, SQLAlchemy and virtualenv. The last section is all about distributing your code. Here you will learn how to add your code to Python Package Index as well as create Windows executables.
For a full table of contents, you can visit the book's web page here. Educative has a really good contents page for the online course as well.
PyTennessee
PyTN Profiles: Brian Pitts and Avrio Analytics

Speaker Profile: Brian Pitts (@sciurus)
Brian studied political science in college, but when his thesis contained more Python code than prose, it was clear where his true loyalties lay. He's worked in operations roles for the past eight years and currently carries the pager for eventbrite.com.
He lives in East Nashville with his wife, son, two cats, and collection of 1990s Unix workstations.
Brian will be presenting “Stability and Capacity Patterns” at 3:00PM Saturday (2/4) in the Auditorium. At Eventbrite, engineers are tasked with building systems that can withstand dramatic spikes in load when popular events go on sale. There are patterns that help us do this. Come learn about these patterns, how Eventbrite has adopted them, and how to implement them within your own code and infrastructure.

Sponsor Profile: Avrio Analytics (@avrioanalytics)
Avrio is the complete customer life cycle SaaS tool. From finding your ideal customer, to closing the deal and keeping them interested, let our cognitive AI engine add more power to your marketing and sales team. Tomorrow is here.
Weekly Python Chat
Teaching Python & Python Tutor
This week we'll be chatting with special guest Philip Guo about pythontutor.com, a web application he made for visualizing Python code. We'll talk about how PythonTutor can be used for teaching and learning Python.
Doug Hellmann
shelve — Persistent Storage of Objects — PyMOTW 3
The shelve module can be used as a simple persistent storage option for Python objects when a relational database is not required. The shelf is accessed by keys, just as with a dictionary. The values are pickled and written to a database created and managed by dbm. Read more…
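A minimal sketch of the idea (the shelf file name here is illustrative):
import shelve

# Open (or create) a shelf; it behaves like a dictionary whose
# values are pickled to disk automatically.
with shelve.open('inventory') as shelf:
    shelf['widgets'] = {'count': 10, 'price': 2.5}

# Reopen later; the data persists between runs.
with shelve.open('inventory') as shelf:
    print(shelf['widgets'])  # {'count': 10, 'price': 2.5}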
Mike Driscoll
PyDev of the Week: Cameron Simpson
This week we welcome Cameron Simpson as our PyDev of the Week. Cameron is the co-author of PEP 418 – Add monotonic time, performance counter, and process time functions and the author of PEP 499 – python -m foo should bind sys.modules[‘foo’] in addition to sys.modules[‘__main__’]. He is also a core Python developer and enthusiast. You can check out some of his projects on bitbucket. Let’s take a few moments to get to know him better!
Can you tell us a little about yourself (hobbies, education, etc):
I’ve been a programming nerd since I was about 15, and I’ve got a BSc in Computer Science. I’m a somewhat lapsed climber and biker; I still have a motorcycle and try to use it but circumstances interfere; I’m trying to resume some indoor climbing too. I’m spending a fair amount of time on a small farm, and teleworking from there part of the time; I’ve been fortunate to find work where that is possible.
Why did you start using Python?
In 2004 I’d been intending to learn Python for a while, maybe over a year, but kept putting it off because of my huge personal Perl library that caused me to resist change – a personal library means you’ve got this kit of things that make the specific things you tend to do easier.
I think in 2004 I decided to dig myself out of this hole by reimplementing core stuff from my Perl lib in Python, so that I could write Python scripts to work with my existing data. Also that year a friend was dealing with a stripped-down Jython platform in WebSphere, so there was some Pythoning there too, implementing some stuff.
From that point on things snowballed as my core working Python stuff grew and Python was clearly a better language.
What other programming languages do you know and which is your favorite?
Python is my present favourite, and has been for some years. It has a nice balance of power and clarity, and I continue to like the design decisions that go into it.
The languages I use regularly and fluently include Python, the Bourne shell, sed and awk and their friends. I like to use the “smallest” tools for a task that succinctly express a solution, so I tend to switch up from things like sed to shell to Python, or a mix.
I’m competent in various other languages; I’m also using Go and PHP and a little JavaScript in my current job. I started in BASIC, know (or knew) a few assembly languages, C, Java, SQL, Prolog, Pascal, SR, Modal, and have used several others at various times.
What projects are you working on now?
Several, mostly to scratch personal itches; you can see almost all of it at bitbucket; it looks like all Python, but almost all of my bin directory scripts are there too. In terms of "projects": myke, my make replacement; vt, a content addressed storage system (after the style of https://en.wikipedia.org/wiki/Venti but not the same) with a FUSE filesystem layer on top and ways to connect storage pools together; yet another MP4 parser, partly to aid inspecting and controlling media metadata and partly intended as a proof of concept for the format-specific blockifier side of the storage system; tools for accessing my PVR's recordings; mailfiler, my mail filing tool; maildb and nodedb, ad hoc node/attribute storage, usually based off shared CSV files; the scripts and library files underpinning these; later, which I use instead of the futures library. I'm trying to get more of the library modules onto PyPI.
Which Python libraries are your favorite (core or 3rd party)?
I think BeautifulSoup is an amazing tool. I love the threading facilities in Python’s stdlib. When I use it, I think SQLAlchemy is a lovely way to interact with SQL (I avoid the ORM side).
What is your take on the current market for Python programmers?
Regrettably I don't have strong opinions here. I'm enough of a generalist to not need to seek specific "Python programming" niches (though last time around, that is somewhat what I was seeking). Of course the flip side to being a generalist is that I can apply Python to many needs, giving me an excuse to use it!
Is there anything else you’d like to say?
Like others, I think the Python community is great. I mostly interact with it via python-list, and I’m constantly heartened by the effort people put into staying constructive instead of descending into conflict as is too easy. In few other fora are people as willing to apologise for ill thought opinions or behaviour, as earnest in their attempts to justify their positions with evidence, reason and experience, and as prepared to accept that others’ experiences and needs are different from their own.
Thanks for doing the interview!
PythonClub - A Brazilian collaborative blog about Python
List and Dictionary Comprehensions
Using lists in Python is trivial. The convenience the language provides, combined with the simplicity of the list data structure, makes lists, together with dictionaries (dict), one of the most heavily used data structures in Python. In this tutorial I will share something I learned while working with lists and dictionaries in Python, specifically regarding list (and dictionary) comprehensions.
List comprehensions
List comprehension is the term used to describe the compact syntax Python offers for creating one list based on another list. Sounds confusing? OK, on to the examples!
Example 1
Suppose we have the following list of values:
valores = [1, 2, 3, 4, 5]
We want to generate another list containing the double of each of these numbers, that is,
[2, 4, 6, 8, 10]
Initially, we could write the following code as a solution:
# Holds our result
valores_dobro = []
for val in valores:
    valores_dobro.append(val * 2)
print(valores_dobro)
>>>
[2, 4, 6, 8, 10]
The solution above is simple and solves our problem, but for something so simple we need 4 lines of code. This is a situation where a list comprehension can be useful. We can compact the creation of the valores_dobro list as follows:
valores_dobro = [valor*2 for valor in valores]
Neat, isn't it? In the next example we can build on the one above.
Example 2
Suppose we want to create a list where only the even values (remainder of division by 2 is zero) are multiplied by 2. Here is our list of values:
valores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
As in the previous example, we can solve this with a basic algorithm.
# List that will receive our result
valores_dobro = []
for valor in valores:
    if valor % 2 == 0:
        valores_dobro.append(valor * 2)
print(valores_dobro)
>>>
[4, 8, 12, 16, 20]
We can also solve the same problem using the built-in map and filter functions:
valores_dobro = map(lambda valor: valor * 2, filter(lambda valor: valor % 2 == 0, valores))
Much more complicated, isn't it? Although it solves our problem, expressions like the one above are hard to read and even to write. In cases like this, we can once again compact our algorithm using a list comprehension.
valores_dobro = [valor * 2 for valor in valores if valor % 2 == 0]
Much simpler, isn't it? On to the next example.
Example 3
In much the same way as with lists, we can also apply comprehensions to dictionaries. Here is an example with the following dictionary:
dicionario = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
Let's create a second dictionary containing only the keys that are consonants, that is, b, c, d and f, where the value for each of these keys is double the value stored under the corresponding key in the original dictionary. Complicated? In other words, the new dictionary should look like this:
novo_dicionario = {'b': 4, 'c': 6, 'd': 8, 'f': 12}
Using a generic algorithm, we can solve the problem as follows:
novo_dicionario = {}
for chave, valor in dicionario.items():
    if chave in ['b', 'c', 'd', 'f']:
        novo_dicionario[chave] = 2 * valor
print(novo_dicionario)
>>>
{'b': 4, 'c': 6, 'd': 8, 'f': 12}
Now applying the comprehension, we can compact the code above nicely:
novo_dicionario = {chave: 2 * valor for chave, valor in dicionario.items() if chave in ['b', 'c', 'd', 'f']}
Conclusion
We've reached the end of another tutorial! We should always keep in mind that just as important as writing code that works is keeping it maintainable (whether by you or by another programmer). In this regard, comprehensions for lists (and other data structures) help us write clear, easy-to-maintain code.
See you in the next tutorial, folks!
Originally published at: Abrangencia de listas e dicionários com Python
eGenix.com
Python Meeting Düsseldorf - 2017-01-18
This posting announces a regional user group meeting in Düsseldorf, Germany; the original announcement is in German.
Announcement
The next Python Meeting Düsseldorf will take place on:
18.01.2017, 6:00 PM
Room 1, 2nd floor, Bürgerhaus Stadtteilzentrum Bilk
Düsseldorfer Arcaden, Bachstr. 145, 40217 Düsseldorf
News
Talks already scheduled
Charlie Clark
"A short introduction to openpyxl and Pandas"
Jochen Wersdörfer
"CookieCutter"
Marc-Andre Lemburg
"Optimization in Python with PuLP"
Time and location
We meet at 6:00 PM at the Bürgerhaus in the Düsseldorfer Arcaden.
The Bürgerhaus shares its entrance with the swimming pool and is located next to the entrance of the underground car park of the Düsseldorfer Arcaden. Above the entrance there is a large "Schwimm' in Bilk" logo. Behind the door, turn immediately left to the two elevators, then ride up to the 2nd floor. The entrance to Room 1 is directly on the left as you step out of the elevator.
>>> Entrance in Google Street View
Introduction
The Python Meeting Düsseldorf is a regular event in Düsseldorf aimed at Python enthusiasts from the region.
Our PyDDF YouTube channel gives a good overview of the talks; we publish videos of the talks there after the meetings. The meeting is organized by eGenix.com GmbH, Langenfeld, in cooperation with Clark Consulting & Research, Düsseldorf:
Program
The Python Meeting Düsseldorf uses a mix of (lightning) talks and open discussion.
Talks can be registered in advance or brought in spontaneously during the meeting. A projector with XGA resolution is available. To register a (lightning) talk, just send an informal email to [email protected]
Costs
The Python Meeting Düsseldorf is organized by Python users for Python users.
Since the meeting room, projector, internet and drinks incur costs, we ask attendees for a contribution of EUR 10.00 incl. 19% VAT. Pupils and students pay EUR 5.00 incl. 19% VAT.
We kindly ask all attendees to bring the amount in cash.
Registration
Since we only have seats for about 20 people, we kindly ask you to register by email. Registering does not create any obligation; it simply makes planning easier for us.
To register for the meeting, just send an informal email to [email protected]
More information
You can find more information on the meeting's website:
http://pyddf.de/
Have fun!
Marc-Andre Lemburg, eGenix.com
Vasudev Ram
Classifying letters and counting their frequencies
By Vasudev Ram
Here is a program that takes a string as input and classifies the characters in it into vowels and consonants. It also counts the frequencies of each vowel separately and the frequency of all consonants together - it is a contrived problem, of course, for teaching purposes.
I gave it as an example / exercise for a Python class recently, then modified / enhanced it slightly for this post.
It is fairly simple, but happens, partly by chance, to illustrate the use of multiple Python language features in about 35 lines of code.
I say "partly by chance" because, after writing it initially and then noticing that it used multiple language features, I thought of, and added a few more (for the same program functionality as earlier), trying not to be too artificial or contrived about it :)
(The statement that creates the input string s, is of course contrived, but it does manage to illustrate the ''.join(lis) idiom, and string 'multiplication' - and also use of backslash to continue lines, if you can call that a language feature. Also, the string multiplication (as used), though contrived, does allow us to quickly find the frequencies of either vowels or consonants, so it has a use.)
Some of the Python language features used are:
- dictionary comprehensions (dict comps)
- continue statement (beginners sometimes ask what it is used for)
- the .get() method of dicts - the 2-argument version, that allows you to avoid an if/else when counting frequencies
- returning multiple values from a function
- tuple unpacking (of the multiple values returned as a tuple)
- the ''.join(lis) idiom to join the characters (or strings) in a list, into a single string
- string 'multiplication' by an integer (a shorthand for repeating the string n times)
- assert statements for checking post-conditions
Here is the program, classify_letters1.py:
# classify_letters1.py
# Classify input characters as vowels or consonants.
# Count frequencies of each vowel.
# Count total frequency of all consonants together.
# Author: Vasudev Ram
# Copyright 2017 Vasudev Ram
# Web site: https://vasudevram.github.io
# Blog: https://jugad2.blogspot.com
# Product store: https://gumroad.com/vasudevram
import string
VOWELS = 'aeiou'
def classify_letters(input):
    # Start each vowel's count at 0 (a dict comprehension).
    vowel_freqs = {vowel: 0 for vowel in VOWELS}
    consonants = 0
    for c in input:
        # Ignore anything that is not a lowercase ASCII letter.
        if not (c in string.ascii_lowercase):
            continue
        if c in VOWELS:
            vowel_freqs[c] = vowel_freqs.get(c, 0) + 1
        else:
            consonants += 1
    return vowel_freqs, consonants
s = ''.join(['a' * 1, 'b' * 1, 'c' * 2, 'd' * 3, 'e' * 2, 'f' * 4, \
'g' * 5, 'h' * 6, 'i' * 3, 'j' * 7, 'k' * 8, 'l' * 9, 'm' * 10, \
'n' * 11, 'o' * 4, 'p' * 12, 'q' * 13, 'r' * 14, 's' * 15, \
't' * 16, 'u' * 5, 'w' * 17, 'y' * 18, 'z' * 19])
print "Classifying letters in string:", s
print '-' * 70
vowel_freqs, consonants = classify_letters(s)
print 'vowel freqs:', vowel_freqs
print 'consonants total freq:', consonants
print '-' * 70
print 'Checking results:'
assert len(s) == sum(vowel_freqs.values()) + consonants
print 'OK'
And here is the output on running it:
$ py -2 classify_letters1.py
Classifying letters in string: abccdddeeffffggggghhhhhhiiijjjjjjjkkkkkkkklllllll
llmmmmmmmmmmnnnnnnnnnnnooooppppppppppppqqqqqqqqqqqqqrrrrrrrrrrrrrrssssssssssssss
sttttttttttttttttuuuuuwwwwwwwwwwwwwwwwwyyyyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzzzzz
----------------------------------------------------------------------
vowel freqs: {'a': 1, 'i': 3, 'e': 2, 'u': 5, 'o': 4}
consonants total freq: 190
----------------------------------------------------------------------
Checking results:
OK
I used an assert to do a sanity check, comparing the computed counts against the number of letters in the original string. Putting the print 'OK' after the assert has a nice side effect: if the assert does not trigger, the program prints OK; if it does trigger, the program prints an error message instead of OK.
January 15, 2017
Evennia
Evennia in pictures
This article describes the MU* development system Evennia using pictures!
Import Python
ImportPython Weekly 107 - Pandas DataFrames, Type annotations, sanic and more
Worthy Read
Data exploration, manipulation, and visualization start with loading data, be it from files or from a URL. Pandas has become the go-to library for all things data analysis in Python, but if your intention is to jump straight into data exploration and manipulation, the Canopy Data Import Tool can help you get started without first learning the details of programming with the Pandas library.
pandas
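For comparison, loading and inspecting data with Pandas itself takes only a few lines; a small sketch (the file name is illustrative):
import pandas as pd

# Load a CSV from disk (read_csv also accepts a URL).
df = pd.read_csv('sales.csv')
print(df.head())       # first five rows
print(df.describe())   # summary statistics for numeric columns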
Companies like Airbnb, Udacity, and Thumbtack trust Toptal to match them with top senior-level developers. Get started today and hire like the best.
Sponsor
Dropbox has several million lines of production code written in Python 2.7. As a first step towards migrating to Python 3, as well as to generally make our code more navigable, we are annotating our code with type annotations using the PEP 484 standard and type-checking the annotated code with mypy. In this talk we will discuss lessons learned and show how you too can start type-checking your legacy Python 2.7 code, one file at a time. We will also describe some of the many improvements we’ve made to mypy in the process, as well as some other tools that come in handy.
video ,
Guido
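For Python 2.7 code, PEP 484 allows the annotations to live in comments, so the file stays valid 2.7 syntax. A small sketch of that style (the function is made up; the typing module is available for 2.7 as a pip-installable backport):
from typing import Dict, Optional

def get_quota(user_id, plans):
    # type: (int, Dict[int, int]) -> Optional[int]
    """Return the storage quota for a user, or None if unknown."""
    return plans.get(user_id)

Running mypy in its Python 2 mode (the --py2 flag) then type-checks the file without affecting how it runs in production.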
Sanic is a Python 3 framework built on the relatively new coroutine support, harnessing uvloop, with an API modeled on Flask. However, it had an issue preventing it from utilizing multiple processes correctly.
sanic
Objects in S3 contain metadata that identifies those objects along with their properties. When the number of objects is large, this metadata can be the magnet that allows you to find what you’re looking for. Although you can’t search this metadata directly, you can employ Amazon Elasticsearch Service to store and search all of your S3 metadata. This blog post gives step-by-step instructions about how to store the metadata in Amazon Elasticsearch Service (Amazon ES) using Python and AWS Lambda.
aws ,
elasticsearch ,
lambda ,
s3
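A minimal sketch of the approach (the bucket name, index name, and endpoint are illustrative; real Amazon ES endpoints also require signed requests, omitted here):
import boto3
from elasticsearch import Elasticsearch

s3 = boto3.client('s3')
es = Elasticsearch(['http://localhost:9200'])
bucket = 'my-bucket'

# Walk the bucket and index each object's metadata.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        head = s3.head_object(Bucket=bucket, Key=obj['Key'])
        es.index(index='s3-metadata', doc_type='object', id=obj['Key'],
                 body={'key': obj['Key'],
                       'size': obj['Size'],
                       'last_modified': obj['LastModified'].isoformat(),
                       'user_metadata': head.get('Metadata', {})})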
Communication with external services is an integral part of any modern system. Whether it's a payment service, authentication, analytics or an internal service, systems need to talk to each other. In this short article we are going to implement a module for communicating with a made-up payment gateway, step by step.
tutorial
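The skeleton of such a module might look like this (the gateway URL and response fields are made up, just as the gateway in the article is):
import requests

BASE_URL = 'https://api.acmepay.example/v1'  # hypothetical gateway

class PaymentError(Exception):
    """Raised when the gateway rejects or fails a request."""

def charge(card_token, amount_cents, timeout=5):
    """Charge a card; return the gateway's transaction id."""
    try:
        resp = requests.post(BASE_URL + '/charges',
                             json={'token': card_token, 'amount': amount_cents},
                             timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise PaymentError(str(exc))
    return resp.json()['transaction_id']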
When you're beginning, it's pretty easy to set up your Python environment on Unix. But over time things can get messy due to multiple versions, interpreters, utilities and projects.
environment
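One common way to keep things tidy is a separate virtual environment per project, for example with the venv module that ships with Python 3.3+ (paths here are illustrative):
# Create and activate an isolated environment for one project.
$ python3 -m venv ~/venvs/myproject
$ source ~/venvs/myproject/bin/activate
(myproject) $ pip install requests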
Iterables, iterators and generators are a key subject for effective Python usage, especially when processing large-scale data sets. Do you know why zip(*M) allows efficient traversal of a matrix M by columns? From the elegant for statement through list/set/dict comprehensions and generator functions, this talk shows how the Iterator pattern is so deeply embedded in the syntax of Python, and so widely supported by its libraries, that some of its most powerful applications can be overlooked by programmers coming from other languages.
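The zip(*M) trick works because each row of M is unpacked as a separate argument to zip, which then yields one tuple per column:
M = [[1, 2, 3],
     [4, 5, 6]]

# zip(*M) is zip([1, 2, 3], [4, 5, 6]): one tuple per column.
for col in zip(*M):
    print(col)
# (1, 4)
# (2, 5)
# (3, 6)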
Avoid Failure by Developing a Toolchain That Enables DevOps.
Sponsor
One of the downsides is that, despite the Python community's attempts to make it an accessible tool for everyone, a lot of folks find the installation process daunting or confusing, myself included. Once I'd learned enough Python to tinker around, I didn't know where to "go" on my computer to write it or what to do next. Today I'll cover three ways to install Python on your Windows computer, step by step.
windows ,
installation
It is a simple wrapper for tabula-java that enables you to extract tables into a DataFrame or JSON with Python. You can also extract tables from a PDF into CSV, TSV or JSON files.
pandas ,
pdf ,
data frame
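A quick sketch of the usage (the file name is illustrative; the library is tabula-py, and it needs Java installed since it drives tabula-java underneath):
import tabula

# Read the table(s) in a PDF into a pandas DataFrame.
df = tabula.read_pdf('report.pdf')
print(df.head())

# Or convert the tables straight into a CSV file.
tabula.convert_into('report.pdf', 'report.csv', output_format='csv')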
Almost two years ago Telegram let developers create bots quite painlessly. You can read an introduction on the Telegram website. In this article we will create a simple bot in Python, hosted in Azure and built with the Bottle framework. The bot will not do anything fancy; consider it a template for your Python-based bots.
bot ,
telegram
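The heart of such a bot is a webhook route that receives updates and answers via the Bot API; a minimal sketch with Bottle (the token is a placeholder):
import requests
from bottle import Bottle, request

TOKEN = '123456:ABC-REPLACE-ME'  # placeholder bot token
API = 'https://api.telegram.org/bot{}/'.format(TOKEN)

app = Bottle()

@app.post('/webhook')
def webhook():
    update = request.json or {}
    message = update.get('message', {})
    chat_id = message.get('chat', {}).get('id')
    if chat_id:
        # Echo the incoming text back to the sender.
        requests.post(API + 'sendMessage',
                      json={'chat_id': chat_id,
                            'text': message.get('text', 'hi')})
    return 'ok'

app.run(host='0.0.0.0', port=8080)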
Today I'll try to explain how to hack TensorFlow to solve a simple regression problem.
statistics ,
Bayesian ,
tensorflow
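For a sense of what that looks like in the TensorFlow 1.x style current at the time, here is a tiny linear regression on synthetic data:
import numpy as np
import tensorflow as tf

# Synthetic data: y = 3x + 2 plus a little noise.
x_data = np.random.rand(100).astype(np.float32)
y_data = 3.0 * x_data + 2.0 + np.random.normal(0, 0.1, 100).astype(np.float32)

w = tf.Variable(0.0)
b = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(w * x_data + b - y_data))
train = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train)
    print(sess.run([w, b]))  # should end up near [3.0, 2.0]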
Have you ever wondered how the CPython interpreter handles attribute access on a class or an instance of a class?
cpython
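You can watch that machinery at work from Python itself by overriding __getattribute__, which CPython calls for every attribute lookup on an instance:
class Traced(object):
    def __getattribute__(self, name):
        # Called for every attribute lookup on an instance.
        print('looking up {!r}'.format(name))
        return object.__getattribute__(self, name)

t = Traced()
t.x = 42
print(t.x)  # prints "looking up 'x'", then 42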
Jobs
Europe (remote)
A Software Developer (Team Lead) based anywhere in Europe is sought by DigitalMR to work on the full life-cycle development of web applications for digital market research.
Europe (remote)
A Junior Python Developer based anywhere in Europe is sought by DigitalMR to work on the development of Django based web applications for digital market research.
United Kingdom
Patch are hiring a CTO / Lead Developer. We are expanding our tech team as part of scaling the company. This is an opportunity to make a big impact on our e-commerce platform and help shape the new services we're creating.
Projects
sublime-prettier - 10 Stars, 1 Fork
A Sublime Text 3 plugin for Prettier
smsReceiver - 5 Stars, 1 Fork
A wrapper for sites that provide online phone numbers for receiving SMS.
wagtail-sharing - 5 Stars, 0 Forks
Easier sharing of Wagtail drafts
Language-Modeling-GatedCNN - 5 Stars, 0 Forks
Tensorflow implementation of "Language Modeling with Gated Convolutional Networks"
tfchain - 4 Stars, 0 Forks
Run a static part of the computational graph written in Chainer with Tensorflow
flask-http2-push - 3 Stars, 0 Forks
A Flask extension to add HTTP/2 server push to your application.
Kushal Das
Setting up a retro gaming console at home
The Commodore 64 was the first computer I ever saw, back in 1989. Twice a year I used to visit my grandparents' house in Kolkata, where I would get one or two hours to play with it. I remember how, a few years later, I tried to read a book on BASIC with the help of an English-to-Bengali dictionary. In 1993, my mother went away for a year-long course for her job. That same year I somehow managed to convince my father to buy me an Indian NES clone (the Little Master). That was also a life event for me. I had only one game cartridge; the Chinese NES clones only entered our village market after 1996.
Bringing back the fun
During 2014, I noticed how people were using Raspberry Pi(s) as NES consoles. I decided to configure my own on a Pi2. Last night, I re-installed the system.
Introducing RetroPie

RetroPie turns your Raspberry Pi into a retro-gaming console. You can either download the pre-installed image from the site, or you can install it on top of Raspbian Lite. I followed the latter path.
As a first step I downloaded Raspbian Lite. It was around 200MB in size.
# dcfldd bs=4M if=2017-01-11-raspbian-jessie-lite.img of=/dev/mmcblk0
I used the dcfldd command; you can use the dd command too. Detailed instructions are here.
After booting up the newly installed Raspberry Pi, I just followed the manual installation instructions from the RetroPie wiki. I chose the basic install option at the top of the main installation screen. Note that the screenshot in the wiki is old. It took a few hours for the installation to finish. I have USB gamepads bought from Amazon, which were configured on the first boot screen. For the full instruction set, read the wiki page.

Happy retro gaming everyone :)