Planet Python
Last update: April 15, 2016 10:49 AM
April 14, 2016
A. Jesse Jiryu Davis
Code Podcast: Event Loop & Coroutines

It was a treat to talk with Andrey Salomatin for Episode Three of the Code Podcast. We discussed async event loops and coroutines in Python 3.
Andrey doesn't simply broadcast an interview: he carefully edits his episodes to tell a story about a specific subject, setting him apart from all the other podcasts about software. Less like Charlie Rose, more like Radiolab. I'm eager to hear the next one.
Listen: Code Podcast Episode 3, "Concurrency: Event Loop & Coroutines"
Image: Vacuum tube radio receiver, Adams-Morgan Co. 1922
Mike Driscoll
Python 201 – What’s a deque?
According to the Python documentation, deques “are a generalization of stacks and queues”. They are pronounced “deck” which is short for “double-ended queue”. They are a replacement container for the Python list. Deques are thread-safe and support memory efficient appends and pops from either side of the deque. A list is optimized for fast fixed-length operations. You can get all the gory details in the Python documentation. A deque accepts a maxlen argument which sets the bounds for the deque. Otherwise the deque will grow to an arbitrary size. When a bounded deque is full, any new items added will cause the same number of items to be popped off the other end.
As a general rule, if you need fast appends or fast pops, use a deque. If you need fast random access, use a list. Let’s take a few moments to look at how you might create and use a deque.
>>> from collections import deque
>>> import string
>>> d = deque(string.ascii_lowercase)
>>> for letter in d:
...     print(letter)
Here we import the deque from our collections module and we also import the string module. To actually create an instance of a deque, we need to pass it an iterable. In this case, we passed it string.ascii_lowercase, which is a string of all the lowercase letters in the alphabet. Finally, we loop over our deque and print out each item. Now let’s look at a few of the methods that deque possesses.
>>> d.append('bork')
>>> d
deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
>>> d.appendleft('test')
>>> d
deque(['test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
>>> d.rotate(1)
>>> d
deque(['bork', 'test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])
Let’s break this down a bit. First we append a string to the right end of the deque. Then we append another string to the left side of the deque. Lastly, we call rotate on our deque and pass it a one, which causes it to rotate one time to the right. In other words, it causes one item to rotate off the right end and onto the front. You can pass it a negative number to make the deque rotate to the left instead.
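As a quick aside (this snippet isn't from the original article), the maxlen bound mentioned earlier is easy to see in the interpreter; items fall off the opposite end as new ones are appended:

>>> bounded = deque('abc', maxlen=3)
>>> bounded
deque(['a', 'b', 'c'], maxlen=3)
>>> bounded.append('d')      # 'a' is discarded from the left
>>> bounded
deque(['b', 'c', 'd'], maxlen=3)
>>> bounded.appendleft('z')  # 'd' is discarded from the right
>>> bounded
deque(['z', 'b', 'c'], maxlen=3)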
Let’s finish out this section by looking at an example that’s based on something from the Python documentation:
from collections import deque

def get_last(filename, n=5):
    """Returns the last n lines from the file"""
    try:
        with open(filename) as f:
            return deque(f, n)
    except OSError:
        print("Error opening file: {}".format(filename))
        raise
This code works in much the same way as Linux’s tail program does. Here we pass in a filename to our script along with the n number of lines we want returned. The deque is bounded to whatever number we pass in as n. This means that once the deque is full, when new lines are read in and added to the deque, older lines are popped off the other end and discarded. I also wrapped the file-opening with statement in a simple exception handler because it’s really easy to pass in a malformed path. This will catch files that don’t exist, for example.
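For example, with a hypothetical log file on disk, you might call it like this:

last_lines = get_last('example.log', n=3)
for line in last_lines:
    print(line, end='')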
Wrapping Up
Now you know the basics of Python’s deque. It’s yet another handy little tool from the collections module. While I personally have never had a need for this particular collection, it remains a useful structure for others to use. I hope you’ll find some great uses for it in your own code.
Python Does What?!
Base64 vs UTF-8
Often when dealing with binary data in a unicode context (e.g. JSON serialization) the data is first base64 encoded. However, Python unicode objects can also use escape sequences.
What is the size relationship for high-entropy (e.g. compressed) binary data?
>>> every_byte = ''.join([chr(i) for i in range(256)])
>>> every_unichr = u''.join([unichr(i) for i in range(256)])
>>> import base64
>>> len(every_unichr.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344
Surprisingly close! UTF-8 has the advantage that many byte values are encoded 1:1; when it does have to escape a byte, though, the expansion is 2:1, versus base64's uniform 4:3. JSON serializing shifts the balance dramatically in favor of base64, however:
>>> import json
>>> len(json.dumps(every_unichr))
1045
>>> len(json.dumps(base64.b64encode(every_byte)))
346
For the curious, here is what the encoded bytes look like:
>>> every_unichr.encode('utf-8')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xc2\x80\xc2\x81\xc2\x82\xc2\x83\xc2\x84\xc2\x85\xc2\x86\xc2\x87\xc2\x88\xc2\x89\xc2\x8a\xc2\x8b\xc2\x8c\xc2\x8d\xc2\x8e\xc2\x8f\xc2\x90\xc2\x91\xc2\x92\xc2\x93\xc2\x94\xc2\x95\xc2\x96\xc2\x97\xc2\x98\xc2\x99\xc2\x9a\xc2\x9b\xc2\x9c\xc2\x9d\xc2\x9e\xc2\x9f\xc2\xa0\xc2\xa1\xc2\xa2\xc2\xa3\xc2\xa4\xc2\xa5\xc2\xa6\xc2\xa7\xc2\xa8\xc2\xa9\xc2\xaa\xc2\xab\xc2\xac\xc2\xad\xc2\xae\xc2\xaf\xc2\xb0\xc2\xb1\xc2\xb2\xc2\xb3\xc2\xb4\xc2\xb5\xc2\xb6\xc2\xb7\xc2\xb8\xc2\xb9\xc2\xba\xc2\xbb\xc2\xbc\xc2\xbd\xc2\xbe\xc2\xbf\xc3\x80\xc3\x81\xc3\x82\xc3\x83\xc3\x84\xc3\x85\xc3\x86\xc3\x87\xc3\x88\xc3\x89\xc3\x8a\xc3\x8b\xc3\x8c\xc3\x8d\xc3\x8e\xc3\x8f\xc3\x90\xc3\x91\xc3\x92\xc3\x93\xc3\x94\xc3\x95\xc3\x96\xc3\x97\xc3\x98\xc3\x99\xc3\x9a\xc3\x9b\xc3\x9c\xc3\x9d\xc3\x9e\xc3\x9f\xc3\xa0\xc3\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\xc3\xa6\xc3\xa7\xc3\xa8\xc3\xa9\xc3\xaa\xc3\xab\xc3\xac\xc3\xad\xc3\xae\xc3\xaf\xc3\xb0\xc3\xb1\xc3\xb2\xc3\xb3\xc3\xb4\xc3\xb5\xc3\xb6\xc3\xb7\xc3\xb8\xc3\xb9\xc3\xba\xc3\xbb\xc3\xbc\xc3\xbd\xc3\xbe\xc3\xbf'
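For the record, a rough Python 3 translation of the same comparison (this sketch isn't from the original post; it assumes code points 0-255 map straight to characters) gives similar numbers:

>>> import base64, json
>>> every_byte = bytes(range(256))
>>> every_char = ''.join(chr(i) for i in range(256))
>>> len(every_char.encode('utf-8'))
384
>>> len(base64.b64encode(every_byte))
344
>>> len(json.dumps(every_char))
1045
>>> len(json.dumps(base64.b64encode(every_byte).decode('ascii')))
346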
Import Python
ImportPython Issue 68
Worthy Read
- sponsor, flask: Learn how to deploy a simple Flask application with an AngularJS user interface to IBM Bluemix® using the Cloud Foundry command-line tool. For this tutorial, we chose Flask over other frameworks like Django, Pyramid, and web2py because it is very lightweight and therefore easy to understand. For just writing up a REST endpoint it is a perfect fit. In addition, we also show you how a single REST endpoint can be used to multiplex between different functions.
- interview: Looking for a Python job? Chances are you will need to prove that you know how to work with Python. Here are a couple of questions that cover a wide base of skills associated with Python. Focus is placed on the language itself, and not any particular package or framework. Each question will be linked to a suitable tutorial if there is one. Some questions will wrap up multiple topics.
- podcast: Toptal freelance experts Damir Zekic and Amar Sahinovic argue the merits of Ruby versus Python, covering everything from speed to performance. Listen to the podcast and weigh in by voting on the superior language and commenting in the thread below.
- I started Math & Pencil just about two years ago. Before I started the company, I had almost zero web development experience (I'm a data guy) - I started a company learning HTTP, Javascript, AJAX, and Django MVC from scratch. It's been a wild ride, and our technology stack has since matured to using interesting technologies such as D3.js, Backbone.js, Celery, Mongo, Redis, and a bunch of other stuff - but it didn't happen overnight. Looking at the thousands of lines of Django code every day, I thought it would be worth pointing out things I wish I had done differently.
- data science: Slide deck that shows the tools needed to be productive as a data scientist.
- tutorial: Here is what we are going to build: an Ember todo list CRUD app using a JSON API-compliant backend built with Django Rest Framework, secured using token authentication, with the ability to log in and register new users.
- IBM Watson is a technology platform that uses natural language processing and machine learning to reveal insights from large amounts of unstructured data. We use the Watson Text to Speech API through Python in this article to build an alert and notification system.
- django, azure: You will create an application using the Django web framework (see alternate versions of this tutorial for Flask and Bottle). You will create the web app from the Azure Marketplace, set up Git deployment, and clone the repository locally. Then you will run the application locally, make changes, commit and push them to Azure. The tutorial shows how to do this from Windows or Mac/Linux. Curator Note - Our sponsor Azure is offering a free $200 to use on Azure for Linux projects. Check it out: http://goo.gl/fd17H9
- sponsor: Segment is the customer data platform that developers and analysts love because of its elegant APIs and extensive partner ecosystem.
- code snippet: Provides a very simple decorator (~40 lines of code) that can turn some or all of your default arguments into keyword-only arguments. You select one of your default arguments by name and the decorator turns this argument, along with all default arguments to its right, into keyword-only arguments.
- nginx: Everyone wants their website and application to run faster. Nearly all websites suffer performance problems and downtime, whether traffic volume is growing steadily or they experience sharp spikes in usage, often occurring at the worst - that is, busiest - times. That's where NGINX and NGINX Plus come in. They improve website performance in three different ways. Read to know more.
- In a system constructed in an object-oriented fashion, we usually have two types of objects: Data objects, which store the data, and Service objects, which manipulate the data. For example, if it is a database-backed application, it usually has some object that talks to the database; that is the Service object.
- pycharm: The topics in this just-over-four-minute video: pytest, because PyCharm has run configuration support native to pytest, so we introduce that; multi-Python-version testing with tox, especially since PyCharm recently added native support; testing with doctests, using (again) a native run configuration (did I mention PyCharm has a lot of native run configurations for testing?); ditto for BDD, so we cover test configurations for the behave package; and skipping tests and re-running a test configuration.
- core python: This article explains the new features in Python 3.6, compared to 3.5.
- podcast: Writing tests is important for the stability of our projects and our confidence when making changes. One issue that we must all contend with when crafting these tests is whether or not we are properly exercising all of the edge cases. Property-based testing is a method that attempts to find all of those edge cases by generating randomized inputs to your functions until a failing combination is found. This approach has been popularized by libraries such as QuickCheck in Haskell, but now Python has an offering in this space in the form of Hypothesis.
- In my last article I presented an approach that simplifies computations of very complex probability models. It makes these complex models viable by shrinking the amount of needed memory and improving the speed of computing probabilities. The approach we were exploring is called the Naive Bayes model. The context was the e-commerce feature in which a user is presented with the promotion box. The box shows the product category the user is most likely to buy. Though the results we got were quite good, I promised to present an approach that gives much better ones. While the Naive Bayes approach may not be acceptable in some scenarios due to the gap between approximated and real values, the approach presented in this article will make this distance much, much smaller.
- Butter CMS takes you through exactly how to build a Heroku add-on and shares their experience with the entire process.
- pycon, community: The Young Coders workshop explores Python programming by making games. It starts with learning Python's simple data types, including numbers, letters, strings, and lists. Next come comparisons, ‘if’ statements, and loops. Finally, all of the new knowledge is combined by creating a game using the PyGame library. PyCon is excited to once again offer a free full-day tutorial for kids! We invite children 12 and up to join us for a day of learning how to program using Python.
- This book hopes to rectify that situation. Between these covers is a collection of knowledge and ideas from many sources on dealing with and creating descriptors. And, after going through the things all descriptors have in common, it explores ideas that have multiple ways of being implemented as well as completely new ideas never seen elsewhere before. This truly is a comprehensive guide to creating Python descriptors.

Jobs
- Bangalore, Karnataka, India

Projects
- DynamicMemoryNetworks - 26 Stars, 9 Forks: Python implementation of DMN
- PyCraft - 12 Stars, 4 Forks: A fork of "Minecraft in 500 lines of python" intended to be used as a real engine, instead of as a learning example.
- api-star - 12 Stars, 0 Forks: An API framework for Flask & Falcon.
- django-gunicorn - 12 Stars, 0 Forks: Run Django development server with Gunicorn.
- waybackpack - 10 Stars, 0 Forks: Download the entire Wayback Machine archive for a given URL.
- dodotable - 8 Stars, 1 Fork: HTML table representation for SQLAlchemy.
- unsafe - 7 Stars, 3 Forks: Experiments in execution of untrusted Python code. This is a little experiment to see to what extent, if any, it is possible to run untrusted Python (or at least Python-like) code under Python 3 while successfully preventing it from escaping the sandbox it's put inside.
- quotekey - 7 Stars, 1 Fork: Boutique SSH keypair generator. Plain old randomness just doesn't cut it sometimes. Personalize your SSH keys.
- spammy - 5 Stars, 0 Forks: Spam filtering made easy for you.
- python_console.log - 3 Stars, 1 Fork: It's about time that Python got a console.log. For years, JavaScript developers have had a one-up on Python: they've been able to print whatever they like to the console using the infamous console.log command. It's about time Python had this killer functionality. I've managed to replicate this behavior using state-of-the-art Python class(es).
Vladimir Iakolev
Freeze time in tests even with GAE datastore
It’s not a rare thing to need to freeze time in tests; for that task I’m using freezegun, and it works nicely almost every time:
from datetime import datetime

from freezegun import freeze_time


def test_get_date():
    # get_date() is the application code under test.
    with freeze_time("2016.1.1"):
        assert get_date() == datetime(2016, 1, 1)
But not with GAE datastore. Assume that we have a model Document
with created_at = db.DateTimeProperty(auto_now_add=True), so a test like:
from datetime import datetime

from freezegun import freeze_time


def test_created_at():
    with freeze_time('2016.1.1'):
        doc = Document()
        doc.put()
        assert doc.created_at == datetime(2016, 1, 1)
Will fail with:
BadValueError: Unsupported type for property created_at: <class 'freezegun.api.FakeDatetime'>
But it can be easily fixed if we monkey patch GAE internals:
from contextlib import contextmanager

from google.appengine.api import datastore_types
from mock import patch
from freezegun import freeze_time as _freeze_time
from freezegun.api import FakeDatetime


@contextmanager
def freeze_time(*args, **kwargs):
    with patch('google.appengine.ext.db.DateTimeProperty.data_type',
               new=FakeDatetime):
        datastore_types._VALIDATE_PROPERTY_VALUES[FakeDatetime] = \
            datastore_types.ValidatePropertyNothing
        datastore_types._PACK_PROPERTY_VALUES[FakeDatetime] = \
            datastore_types.PackDatetime
        datastore_types._PROPERTY_MEANINGS[FakeDatetime] = \
            datastore_types.entity_pb.Property.GD_WHEN
        with _freeze_time(*args, **kwargs):
            yield
And now it works!
Tim Golden
Network Zero
There’s a small trend in the UK-based education-related Python development world: creating easy-to-use packages and using the suffix “zero” to indicate that they involve “zero” boilerplate or “zero” learning curve… or something. Daniel Pope started it with his PyGame Zero package; Ben Nuttall carried on with GPIO Zero; and I’ve followed up with Network Zero. FWIW I’m fairly sure the Raspberry Pi Zero was named independently but it fits nicely into the extended Zero family. Nicholas Tollervey was even excited enough to create a Github organisation to collate ideas around the “Zero” family. His own Mu editor [*] although lacking the naming convention is, in spirit, an Editor Zero.
We’ve never really talked it out, but I think the common theme is to produce packages especially suitable for classroom & club use, where:
- The *zero package sits on top of an established package (PyGame, GPIO, 0MQ) which more advanced students can drop into once they’ve reached the bounds of the simplified *zero approach.
- The emphasis is on up-and-running use in a classroom or club rather than clever coding techniques. There’s a slight preference for procedural rather than object-based API (although everything in Python is an object but still…)
- Helpful messages: where it’s feasible, error messages should be made relevant to the immediate *zero package rather than reflecting an error several levels deep. This goes a little against a common Python philosophy of letting exceptions bubble to the top unaltered but is more suitable for less experienced coders
My own Network Zero is definitely a work in progress, but I’m burning through all my commuting hours (and a few more besides) to get to a stable API with useful examples, helpful docs and tests which pass on all current Python versions across all three major platforms. Tom Viner has been incredibly helpful in setting up tox and CI via Travis & Appveyor.
If you feel like contributing, the activity is happening over on Github and the built documentation is on readthedocs. I welcome comments and suggestions, always bearing in mind the Design Guidelines.
Any teachers I know are welcome to comment of course, but I’ll be reaching out to them specifically a little later when the codebase has stabilised and some examples and cookbook recipes are up and documented.
Feel free to raise issues on Github as appropriate. If you have more general questions, ping me on Twitter.
[*] I think Nicholas should have embraced the Unicode and named it μ with the next version called ν (with absolutely no apologies for the cross-language pun)
Talk Python to Me
#54 Enterprise Software with Python
How often have people asked what language / technology you work in and, when you answered Python, they got a little confused and asked, what can you actually build with Python? What type of apps? The implication being that Python is just a notch above Bash scripts; that real things aren't built with Python but rather Java, C#, Objective-C and so on.

Mahmoud Hashemi and I might be able to help you put some real evidence and experience behind your response. On episode 54 of Talk Python To Me, I talk with Mahmoud about his new online course he wrote for O'Reilly entitled Enterprise Software in Python. You'll hear many real-world examples from his experience inside PayPal and more.

Links from the show:
- Enterprise Python Course: http://techbus.safaribooksonline.com/video/programming/python/9781491943755
- Course (at O'Reilly): http://shop.oreilly.com/product/0636920047346.do
- Mahmoud on Talk Python: https://talkpython.fm/episodes/show/4/enterprise-python-and-large-scale-projects
- 10 Myths of Enterprise Python Article: https://www.paypal-engineering.com/2014/12/10/10-myths-of-enterprise-python/
- Mahmoud on Twitter: https://twitter.com/mhashemi (@mhashemi)
- Course github repo: https://github.com/mahmoud/espymetrics
- The bad pull request: https://github.com/mahmoud/espymetrics/pull/2
Matthew Rocklin
Fast Message Serialization
This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project
Very high performance isn’t about doing one thing well, it’s about doing nothing poorly.
This week I optimized the inter-node communication protocol used by
dask.distributed. It was a fun exercise in optimization that involved
several different and unexpected components. I separately had to deal with
Pickle, NumPy, Tornado, MsgPack, and compression libraries.
This blogpost is not advertising any particular functionality, rather it’s a story of the problems I ran into when designing and optimizing a protocol to quickly send both very small and very large numeric data between machines on the Python stack.
We care very strongly about both the many small messages case (thousands of 100 byte messages per second) and the very large messages case (100-1000 MB). This spans an interesting range of performance space. We end up with a protocol that costs around 5 microseconds in the small case and operates at 1-1.5 GB/s in the large case.
Identify a Problem
This came about as I was preparing a demo using dask.array on a distributed
cluster for a Continuum webinar. I noticed that my computations were taking
much longer than expected. The
Web UI quickly pointed
me to the fact that my machines were spending 10-20 seconds moving 30 MB chunks
of numpy array data between them. This is very strange because I was on a
100 MB/s network, and so I expected these transfers to happen in more like 0.3s
than 15s.
The Web UI made this glaringly apparent, so my first lesson was how valuable visual profiling tools can be when they make performance issues this obvious. Thanks here go to the Bokeh developers who helped with the development of the Dask real-time Web UI.
Problem 1: Tornado’s sentinels
Dask’s networking is built off of Tornado’s TCP IOStreams.
There are two common ways to delineate messages on a socket: sentinel values
that signal the end of a message, or prefixing each message with its length.
Early on we tried both in Dask but found that prefixing a length before every
message was slow. It turns out that this was because TCP sockets try to batch
small messages to increase bandwidth. Turning this optimization off ended up
being an effective and easy solution; see the TCP_NODELAY parameter.
However, before we figured that out we used sentinels for a long time. Unfortunately Tornado does not handle sentinels well for large messages. At the receipt of every new message it reads through all buffered data to see if it can find the sentinel. This makes lots and lots of copies and reads through lots and lots of bytes. This isn’t a problem if your messages are a few kilobytes, as is common in web development, but it’s terrible if your messages are millions or billions of bytes long.
Switching back to prefixing messages with lengths and turning off the no-delay optimization moved our bandwidth up from 3MB/s to 20MB/s per node. Thanks goes to Ben Darnell (main Tornado developer) for helping us to track this down.
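For reference, disabling Nagle's algorithm on a plain socket is a one-line option; a minimal standard-library sketch (not the actual Dask/Tornado code):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm so small length-prefixed messages are sent
# immediately instead of being batched by the kernel.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)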
Problem 2: Memory Copies
A nice machine can copy memory at 5 GB/s. If your network is only 100 MB/s then you can easily suffer several memory copies in your system without caring. This leads to code that looks like the following:
socket.send(header + payload)
This code concatenates two bytestrings, header and payload before
sending the result down a socket. If we cared deeply about avoiding memory
copies then we might instead send these two separately:
socket.send(header)
socket.send(payload)
But who cares, right? At 5 GB/s copying memory is cheap!
Unfortunately this breaks down under either of the following conditions
- You are sloppy enough to do this multiple times
- You find yourself on a machine with surprisingly low memory bandwidth, like 10 times slower, as is the case on some EC2 machines.
Both of these were true for me but fortunately it’s usually straightforward to reduce the number of copies down to a small number (we got down to three), with moderate effort.
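One way to cut the concatenation copy, for example, is a scatter/gather send; here is a hedged standard-library sketch (Unix-only sendmsg, Python 3.3+, and not necessarily what Dask does internally):

import socket

def send_frames(sock, header, payload):
    # sendmsg() hands both buffers to the kernel in a single call, so we
    # never build the intermediate header + payload bytestring.
    sock.sendmsg([header, payload])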
Problem 3: Unwanted Compression
Dask compresses all large messages with LZ4 or Snappy if they’re available. Unfortunately, if your data isn’t very compressible then this is mostly lost time. Doubly unfortunate is that you also have to decompress the data on the recipient side. Decompressing not-very-compressible data was surprisingly slow.
Now we compress with the following policy:
- If the message is less than 10kB, don’t bother
- Pick out five 10kB samples of the data and compress those. If the result isn’t well compressed then don’t bother compressing the full payload.
- Compress the full payload, if it doesn’t compress well then just send along the original to spare the receiver’s side from compressing.
In this case we use cheap checks to guard against unwanted compression. We also avoid any cost at all for small messages, which we care about deeply.
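A rough sketch of such a policy is below; the names and thresholds are illustrative rather than the actual dask.distributed code, and compress() stands in for whichever of LZ4 or Snappy is available:

import random

def maybe_compress(payload, compress, min_size=10000, sample_size=10000, nsamples=5):
    """Compress payload only if it is large and looks compressible."""
    if len(payload) < min_size:
        return payload, False  # tiny message: not worth the overhead
    # Cheap check: compress a few small samples before touching the whole thing.
    samples = []
    for _ in range(nsamples):
        start = random.randint(0, max(0, len(payload) - sample_size))
        samples.append(payload[start:start + sample_size])
    if len(compress(b''.join(samples))) > 0.9 * nsamples * sample_size:
        return payload, False  # samples barely shrank, so skip compression
    compressed = compress(payload)
    if len(compressed) > 0.9 * len(payload):
        return payload, False  # send the original to spare the receiver
    return compressed, True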
Problem 4: Cloudpickle is not as fast as Pickle
This was surprising, because cloudpickle mostly defers to Pickle for the easy stuff, like NumPy arrays.
In [1]: import numpy as np
In [2]: data = np.random.randint(0, 255, dtype='u1', size=10000000)
In [3]: import pickle, cloudpickle
In [4]: %time len(pickle.dumps(data, protocol=-1))
CPU times: user 8.65 ms, sys: 8.42 ms, total: 17.1 ms
Wall time: 16.9 ms
Out[4]: 10000161
In [5]: %time len(cloudpickle.dumps(data, protocol=-1))
CPU times: user 20.6 ms, sys: 24.5 ms, total: 45.1 ms
Wall time: 44.4 ms
Out[5]: 10000161
But it turns out that cloudpickle is using the Python implementation, while
pickle itself (or cPickle in Python 2) is using the compiled C implementation.
Fortunately this is easy to correct, and a quick typecheck on common large
data formats in Python (NumPy and Pandas) gets us this speed boost.
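The idea is roughly the following (a sketch of the approach, not the exact dask.distributed code):

import pickle

import cloudpickle
import numpy as np
import pandas as pd

def dumps(obj):
    # NumPy/Pandas containers pickle fine with the fast C implementation;
    # cloudpickle is only needed for tricky objects like lambdas and closures.
    if isinstance(obj, (np.ndarray, pd.DataFrame, pd.Series)):
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return cloudpickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)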
Problem 5: Pickle is still slower than you’d expect
Pickle runs at about half the speed of memcopy, which is what you’d expect from a protocol that is mostly just “serialize the dtype, strides, then tack on the data bytes”. There must be an extraneous memory copy in there.
See issue 7544
Problem 6: MsgPack is bad at large bytestrings
Dask serializes most messages with MsgPack, which is ordinarily very fast. Unfortunately the MsgPack spec doesn’t support bytestrings greater than 4GB (which do come up for us) and the Python implementations don’t pass through large bytestrings very efficiently. So we had to handle large bytestrings separately. Any message that contains bytestrings over 1MB in size will have them stripped out and sent along in a separate frame. This both avoids the MsgPack overhead and avoids a memory copy (we can send the bytes directly to the socket).
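Conceptually the framing looks something like the sketch below, with hypothetical names; it illustrates the idea rather than the real dask.distributed protocol:

BIG = 2 ** 20  # strip out bytestrings over ~1 MB

def split_big_bytes(msg):
    # Returns a small dict that is safe to MsgPack, plus the raw frames
    # that are written to the socket directly (no MsgPack, no extra copy).
    frames = []
    small = {}
    for key, value in msg.items():
        if isinstance(value, bytes) and len(value) > BIG:
            small[key] = {'__frame__': len(frames)}  # placeholder index
            frames.append(value)
        else:
            small[key] = value
    return small, frames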
Problem 7: Tornado makes a copy
Sockets on Windows don’t accept payloads greater than 128kB in size. As a result Tornado chops up large messages into many small ones. On Linux this memory copy is extraneous. It can be removed with a bit of logic within Tornado. I might do this in the moderate future.
Results
We serialize small messages in about 5 microseconds (thanks msgpack!) and move large bytes around in the cost of three memory copies (about 1-1.5 GB/s) which is generally faster than most networks in use.
Here is a profile of sending and receiving a gigabyte-sized NumPy array of random values through to the same process over localhost (500 MB/s on my machine.)
381360 function calls (381323 primitive calls) in 1.451 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.366 0.366 0.366 0.366 {built-in method dumps}
8 0.289 0.036 0.291 0.036 iostream.py:360(write)
15353 0.228 0.000 0.228 0.000 {method 'join' of 'bytes' objects}
15355 0.166 0.000 0.166 0.000 {method 'recv' of '_socket.socket' objects}
15362 0.156 0.000 0.398 0.000 iostream.py:1510(_merge_prefix)
7759 0.101 0.000 0.101 0.000 {method 'send' of '_socket.socket' objects}
17/14 0.026 0.002 0.686 0.049 gen.py:990(run)
15355 0.021 0.000 0.198 0.000 iostream.py:721(_read_to_buffer)
8 0.018 0.002 0.203 0.025 iostream.py:876(_consume)
91 0.017 0.000 0.335 0.004 iostream.py:827(_handle_write)
89 0.015 0.000 0.217 0.002 iostream.py:585(_read_to_buffer_loop)
122567 0.009 0.000 0.009 0.000 {built-in method len}
15355 0.008 0.000 0.173 0.000 iostream.py:1010(read_from_fd)
38369 0.004 0.000 0.004 0.000 {method 'append' of 'list' objects}
7759 0.004 0.000 0.104 0.000 iostream.py:1023(write_to_fd)
1 0.003 0.003 1.451 1.451 ioloop.py:746(start)
Dominant unwanted costs include the following:
- 400ms: Pickling the NumPy array
- 400ms: Bytestring handling within Tornado
After this we’re just bound by pushing bytes down a wire.
Conclusion
Writing fast code isn’t about writing any one thing particularly well, it’s about mitigating everything that can get in your way. As you approach peak performance, previously minor flaws suddenly become your dominant bottleneck. Success here depends on frequent profiling and keeping your mind open to unexpected and surprising costs.
Links
- EC2 slow memory copy StackOverflow question.
- Tornado issue for sending large messages
- Wikipedia page on Nagle’s algorithm for TCP protocol for small packets
- NumPy issue for double memory copy
- Cloudpickle issue for memoryview support
April 13, 2016
Glyph Lefkowitz
I think I’m using GitHub wrong.
I use a hodgepodge of https: and ssh: URL schemes for my local
clones; sometimes I have a remote called “github” and sometimes I have one
called “origin”. Sometimes I clone from a fork I made and sometimes I clone
from the upstream.
I think the right way to use GitHub would instead be to always fork first, make my remote always be “origin”, and consistently name the upstream remote “upstream”. The problem with this, though, is that forks rapidly fall out of date, and I often want to automatically synchronize all the upstream branches.
Is there a script or a github option or something to synchronize a fork with upstream automatically, including all its branches and tags? I know there’s no comment field, but you can email me or reply on twitter.
PyCharm
PyCharm Migration Tutorial for Text Editors
If you’re a Python developer who uses a text editor such as Vim, Emacs, or Sublime Text, you might wonder what it takes to switch to PyCharm as an IDE for your development. We’ve written a helpful Migrating from Text Editors tutorial for just this topic.
The tutorial starts with the basic question of “What is an IDE?” The line between text editor and IDE can be blurry. PyCharm views the distinction as: a project-level view of your code and coding activities, with project-wide features such as coding assistance and refactoring.
This document then goes over some of the important points when migrating: the project-oriented UI, working with projects instead of files, Vim/Emacs specifics, keyboard shortcuts, customizing, and a discussion of facilities important to text editor users (multiple cursors, split windows, etc.) It then closes by discussing areas the IDE can really help, for example, the managed running and debugging of your code.
Of course, this document is just an overview. Vim and Emacs in particular have decades of development and features, and PyCharm itself is now very mature with many features of its own, so a complete comparison would break the Internet. If you have a specific question, feel free to comment, and we hope you find the tutorial helpful.
Kushal Das
Quick way to get throw away VMs using Tunir
The latest Tunir package has a --debug option which can help us get some quick VMs up, where we can do some destructive work, and then just remove them. Below is an example of firing up two VMs from a Fedora Cloud base image, using a quickjob.cfg file.
[general]
cpu = 1
ram = 1024
[vm1]
user = fedora
image = /home/Fedora-Cloud-Base-20141203-21.x86_64.qcow2
[vm2]
user = fedora
image = /home/Fedora-Cloud-Base-20141203-21.x86_64.qcow2
In the quickjob.txt file we just keep one command to check sanity :)
vm1 free -m
After we execute Tunir, we will see something like the following output.
# tunir --multi quickjob
... lots of output ...
Non gating tests status:
Total:0
Passed:0
Failed:0
DEBUG MODE ON. Destroy from /tmp/tmpiNumV2/destroy.sh
The above-mentioned directory also contains the temporary private key used to log in to the instances. The output also contains the IP addresses of the VM(s). We can log in like this:
# ssh fedora@<vm-ip> -i /tmp/tmpiNumV2/private.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
The last two parts of the ssh command will make sure that we do not store the signature for the throwaway guests in the known_hosts file. To clean up afterwards we can do the following.
# sh /tmp/tmpiNumV2/destroy.sh
Vasudev Ram
A quick console ruler in Python
By Vasudev Ram
I've done this ruler program a few times before, in various languages.
Here is an earlier version: Rule the command-line with ruler.py!
This one is a simplified and also slightly enhanced version of the one above.
It generates a simple text-based ruler on the console.
Can be useful for data processing tasks related to fixed-length or variable-length records, CSV files, etc.
With REPS set to 8, it works just right for a console of 80 columns.
Here is the code:
# ruler.py
"""
Program to display a ruler on the console.
Author: Vasudev Ram
Copyright 2016 Vasudev Ram - http://jugad2.blogspot.com
The ruler is made up of copies of the string 0123456789, concatenated.
Purpose: By running this program, you can use its output as a ruler,
to find the position of your own program's output on the line, or to
find the positions and lengths of fields in fixed- or variable-length
records in a text file, fields in CSV files, etc.
"""
REPS = 8
def ruler(sep=' ', reps=REPS):
    for i in range(reps):
        print str(i) + ' ' * 4 + sep + ' ' * 3,
    print '0123456789' * reps

def main():
    # Without divider.
    ruler()
    # With various dividers.
    for sep in '|+!':
        ruler(sep)

if __name__ == '__main__':
    main()
And the output:
$ python ruler.py
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 +
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0 ! 1 ! 2 ! 3 ! 4 ! 5 ! 6 ! 7 !
01234567890123456789012345678901234567890123456789012345678901234567890123456789
You can also import it as a module in your own program:

# test_ruler.py
from ruler import ruler
ruler()
# Code that outputs the data you want to measure
# lengths or positions of, goes here ...
print 'NAME AGE CITY'
ruler()
# ... or here.
print 'SOME ONE 20 LON '
print 'ANOTHER 30 NYC '
$ python test_ruler.py
Output:
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
NAME AGE CITY
0 1 2 3 4 5 6 7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
SOME ONE 20 LON
ANOTHER 30 NYC
- Enjoy.
- Vasudev Ram - Online Python training and consulting
April 12, 2016
Continuum Analytics News
Using Anaconda with PySpark for Distributed Language Processing on a Hadoop Cluster
Developer Blog
Overview
Working with your favorite Python packages along with distributed PySpark jobs across a Hadoop cluster can be difficult due to tedious manual setup and configuration issues, which is a problem that becomes more painful as the number of nodes in your cluster increases.
Anaconda makes it easy to manage packages (including Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including data processing, machine learning, image processing and natural language processing.
In a previous post, we’ve demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster.
In this post, we’ll use Anaconda to perform distributed natural language processing with PySpark using a subset of the same data set. We’ll configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, to work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.
In the remainder of this post, we'll:
- Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.
- Load the text/language data into HDFS on the cluster.
- Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.
- Perform distributed natural language processing on the data with the NLTK library from Anaconda.
- Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.
Provisioning Anaconda on a cluster
Because we’re installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:
$ acluster create cluster-hadoop --profile cluster-hadoop
After a few minutes, we’ll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.
Installing Anaconda packages on the cluster
Once we’ve provisioned conda on the cluster, we can install the packages from Anaconda that we’ll need for this example to perform language processing, data analysis and visualization:
$ acluster conda install nltk pandas bokeh
We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:
$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'
Loading the data into HDFS
In this post, we'll use a subset of the data set that contains comments from the reddit website from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data set is in JSON format (one comment per line) and consists of the comment body, author, subreddit, timestamp of creation and other fields.
Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.
We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:
$ acluster ssh
The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions:
$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu
$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:
$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/
Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.
Configuring the spark-submit command with your Hadoop Cluster
To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you’re using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:
$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py
If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):
$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py
Installing and Configuring the Notebook with your Hadoop Cluster
Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.
Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.
Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:
$ acluster install notebook
$ acluster open notebook
Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.
Using the Anaconda Parcel with Cloudera CDH
If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Cloudera CDH
If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Hortonworks HDP
If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Initializing the SparkContext
After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:
>>> from pyspark import SparkConf
>>> from pyspark import SparkContext
>>> conf = SparkConf()
>>> conf.setMaster('yarn-client')
>>> conf.setAppName('anaconda-pyspark-language')
>>> sc = SparkContext(conf=conf)
Loading the data into memory
Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:
>>> lines = sc.textFile("/user/ubuntu/*.json")
Next, we decode the JSON data and decide that we want to filter comments from the movies subreddit:
>>> import json
>>> data = lines.map(json.loads)
>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')
We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:
>>> movies.persist()
We can count the total number of comments in the movies subreddit (about 2.9 million comments):
>>> movies.count()
2905085
We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:
>>> movies.take(1)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 113 ms
[{u'archived': False,
u'author': u'kylionsfan',
u'author_flair_css_class': None,
u'author_flair_text': None,
u'body': u'Goonies',
u'controversiality': 0,
u'created_utc': u'1420070402',
u'distinguished': None,
u'downs': 0,
u'edited': False,
u'gilded': 0,
u'id': u'cnas90u',
u'link_id': u't3_2qyjda',
u'name': u't1_cnas90u',
u'parent_id': u't3_2qyjda',
u'retrieved_on': 1425124282,
u'score': 1,
u'score_hidden': False,
u'subreddit': u'movies',
u'subreddit_id': u't5_2qh3s',
u'ups': 1}]
Distributed Natural Language Processing
Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.
First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:
>>> def parse(record):
... import nltk
... tokens = nltk.word_tokenize(record["body"])
... record["n_words"] = len(tokens)
... record["pos"] = nltk.pos_tag(tokens)
... return record
>>> movies2 = movies.map(parse)
Let’s take a look at the body of one of the comments:
>>> movies2.take(10)[6]['body']
u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'
And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):
>>> movies2.take(10)[6]['pos']
[(u'Dawn', 'NN'),
(u'of', 'IN'),
(u'the', 'DT'),
(u'Apes', 'NNP'),
(u'was', 'VBD'),
(u'such', 'JJ'),
(u'an', 'DT'),
(u'incredible', 'JJ'),
(u'movie', 'NN'),
(u',', ','),
(u'it', 'PRP'),
(u'should', 'MD'),
(u'be', 'VB'),
(u'up', 'RP'),
(u'there', 'RB'),
(u'in', 'IN'),
(u'my', 'PRP$'),
(u'opinion', 'NN'),
(u'.', '.')]
We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:
>>> def get_NN(record):
... import re
... from nltk.corpus import stopwords
... all_pos = record["pos"]
... ret = []
... for pos in all_pos:
... if pos[1] == "NN" \
... and pos[0] not in stopwords.words('english') \
... and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:
... ret.append(pos[0])
... return ret
>>> nouns = movies2.flatMap(get_NN)
We can then generate word counts for the nouns that we extracted from the dataset:
>>> counts = nouns.map(lambda word: (word, 1))
After we’ve done the heavy lifting, processing, filtering and cleaning on the text data using Anaconda and PySpark, we can collect the reduced word count results onto the head node.
>>> top_nouns = counts.countByKey()
>>> top_nouns = dict(top_nouns)
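As an aside (not part of the original workflow), the same reduction could also be performed on the cluster with reduceByKey before collecting, which keeps less data flowing back to the driver; a hedged sketch:

>>> word_counts = nouns.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
>>> top_nouns = dict(word_counts.collect())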
In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.
Local analysis with Pandas and Bokeh
Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:
>>> import pandas as pd
>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])
Let’s sort the resulting word counts, and view the top 10 nouns by frequency:
>>> df = df.sort_values('Count', ascending=False)
>>> df_top_10 = df.head(10)
>>> df_top_10
Noun       Count
movie      539698
film       220366
time       157595
way        112752
gt         105313
http       92619
something  87835
lot        85573
scene      82229
thing      82101
Let’s generate a bar chart of the top 10 nouns using Pandas:
>>> %matplotlib inline
>>> df_top_10.plot(kind='bar', x=df_top_10['Noun'])
Finally, we can use Bokeh to generate an interactive plot of the data:
>>> from bokeh.charts import Bar, show
>>> from bokeh.io import output_notebook
>>> from bokeh.charts.attributes import cat
>>> output_notebook()
>>> p = Bar(df_top_10,
... label=cat(columns='Noun', sort=False),
... values='Count',
... title='Top N nouns in r/movies subreddit')
>>> show(p)
Conclusion
In this post, we used Anaconda with PySpark to perform distributed natural language processing and computations on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This made it convenient to work with Anaconda for the distributed processing with PySpark, while reducing the data to a size that we could work with on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.
You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:
$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at [email protected]. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.
If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.
PyCon
Registration is open for our Young Coders tutorial!
PyCon is excited to once again offer a free full-day tutorial for kids! We invite children 12 and up to join us for a day of learning how to program using Python. The class is running twice, on each of the two final sprint days:
- Option 1. Saturday, June 4, 2016 from 9:00 AM to 4:30 PM.
- Option 2. Sunday, June 5, 2016 from 9:00 AM to 4:30 PM.
The sign-up page is here:
https://www.eventbrite.com/e/pycon-2016-young-coders-tickets-24319019843
The Young Coders tutorial was first offered at PyCon 2013 in Santa Clara. It was an immediate hit, and has been an important part of every PyCon since — including a French edition for the two years that PyCon was held in Montréal! Whether you and your family are local to Portland, or you are traveling to PyCon and bringing your family along, this class is a great way to expose kids to programming.
The Young Coders workshop explores Python programming by making games. It starts with learning Python's simple data types, including numbers, letters, strings, and lists. Next come comparisons, ‘if’ statements, and loops. Finally, all of the new knowledge is combined by creating a game using the PyGame library.
Registration is limited — sign up soon if you know kids who will be interested!
Python Anywhere
System upgrade, 2016-04-12: Python 3.5
We upgraded PythonAnywhere today. The big story for this release is that we now support Python 3.5.1 everywhere :-) We've put it through extensive testing, but of course it's possible that glitches remain -- please do let us know in the forums or by email if you find any.
There were a few other minor changes -- basically, a bunch of system package installs and upgrades:
- mysqlclient for Python 3.x (so now Django should work out of the box with Python 3)
- pyodbc and its lower-level dependencies, so you should be able to connect to Microsoft SQL Servers elsewhere on the Internet.
- pdftk
- basemap for Python 3.x.
- pint
- uncertainties
- flask-openid
- And finally, we've upgraded Twilio so that it works properly from free accounts.
Montreal Python User Group
MTL NewTech Startup Demos & Networking + PyCon Contest
PyCon is partnering again with MTL NewTech and Montréal-Python this year to bring one lucky Montreal startup to PyCon in Portland, Oregon, to present alongside Google, Facebook, Stripe, Heroku, Microsoft, Mozilla and many other technology companies.
If you are a startup that meets the requirements below, apply now by filling out this form: http://goo.gl/forms/zf9jO8n8vR with the following information: a) size of the team, b) age of the startup, c) your use of Python.
Deadline for applications: April 19th, 23h59. Announcement of the selected startups: starting on April 21st. MTL NewTech Demo & announcement of the winner: April 26th. Feel free to invite fellow startups.
For more details about PyCon and the startup row in general, please head to the PyCon website at https://us.pycon.org/2016/events/startup_row/
==============
Eligible companies must meet the following criteria:
- Fewer than 15 employees, including founders
- Less than two years old
- Use Python somewhere in your startup: backend, front-end, testing, wherever
- If selected, please confirm that you will staff your booth in the Expo Hall on your appointed day. We will try to accommodate your preferences: Monday or Tuesday
- No repeats. If you were on startup row in a previous year, please give another startup a chance this year.
==============
April 11, 2016
Continuum Analytics News
Data Science with Python at ODSC East
Company Blog
By: Sheamus McGovern, Open Data Science Conference Chair
At ODSC East, the most influential minds and institutions in data science will convene at the Boston Convention & Exhibition Center from May 20th to the 22nd to discuss and teach the newest and most exciting developments in data science. As you know, the Python ecosystem is now one of the most important data science development environments available today. This is due, in large part, to the existence of a rich suite of user-facing data analysis libraries.
Powerful Python machine learning libraries like Scikit-learn, XGBoost and others bring sophisticated predictive analytics to the masses. The NLTK and Gensim libraries enable deep analysis of textual information in Python and the Topik library provides a high-level interface to these and other, natural language libraries, adding a new layer of usability. The Pandas library has brought data analysis in Python to a new level by providing expressive data structures for quick and intuitive data manipulation and analysis.
The notebook ecosystem in Python has also flourished with the development of the Jupyter, Rodeo and Beaker notebooks. The notebook interface is an increasingly popular way for data scientists to perform complex analyses that serve the purpose of conveying and sharing analyses and their results to colleagues and to stakeholders. Python is also host to a number of rich web-development frameworks that are used not only for building data science dash boards, but also for full-scale data science powered web-apps. Flask and Django lead the way in terms of the Python web-app development landscape, but Bottle and Pyramid are also quite popular.
With Numba or Cython, code can approach speeds akin to that of C or C++, and new developments, like the Dask package, make computing on larger-than-memory datasets very easy. Visualization libraries, like Plot.ly and Bokeh, have brought rich, interactive and impactful data visualization tools to the fingertips of data analysts everywhere.
Anaconda has streamlined the use of many of these wildly popular open source data science packages by providing an easy way to install, manage and use Python libraries. With Anaconda, users no longer need to worry about tedious incompatibilities and library management across their development environments.
Several of the most influential Python developers and data scientists will be talking and teaching at ODSC East. Indeed, Peter Wang will be speaking at ODSC East. Peter is the co-founder and CTO of Continuum Analytics, as well as the mastermind behind the popular Bokeh visualization library, the Blaze ecosystem (which simplifies the analysis of Big Data with Python) and Anaconda. At ODSC East, there will be over 100 speakers, 20 workshops and 10 training sessions spanning seven conferences focused on Open Data Science, Disruptive Data Science, Big Data Science, Data Visualization, Data Science for Good, Open Data, and Careers and Training. See below for a very small sampling of the powerful Python workshops and speakers we will have at ODSC East.
●Bayesian Statistics Made Simple - Allen Downey, Think Python
●Intro to Scikit learn for Machine Learning - Andreas Mueller, NYU Center for Data Science
●Parallelizing Data Science in Python with Dask - Matthew Rocklin, Continuum Analytics
●Interactive Viz of a Billion Points with Bokeh Datashader – Peter Wang, Continuum Analytics
Mike Driscoll
Pre-Order Python 201 Paperback
I have decided to offer a pre-order of the paperback version of my next book. You will be able to pre-order a signed copy of the book which will ship in September, 2016. I am limiting the number of pre-orders to 100. If you’re interested in getting the book, you can do so here

Python Engineering at Microsoft
How to deal with the pain of “unable to find vcvarsall.bat”
Python’s packaging ecosystem is one of its biggest strengths, but Windows users are often frustrated by packages that do not install properly. One of the most common errors you’ll see, when installing a package such as lxml from source, is “unable to find vcvarsall.bat”.
As far as errors go, “unable to find vcvarsall.bat” is not the most helpful. What is this mythical batch file? Why do I need it? Where can I get it? How do I help Python find it? When will we be freed from this pain? Let’s look at some answers to these questions.
What is vcvarsall.bat, and why do I need it?
To explain why we need this tool, we need to look at a common pattern in Python packages. One of the benefits of installing a separate package is the ability to do something that you couldn’t normally do – in many cases, something that would be completely impossible otherwise, like image processing with Pillow, high-performance machine learning with scikit-learn, or micro-threading with greenlet. But how can these packages do things that aren’t possible in regular Python?
The answer is that they include extension modules, sometimes called native modules. Unlike Python modules, these are not .py files containing Python source code – they are .pyd files that contain native, platform-specific code, typically written in C. In many cases the extension module is an internal detail; all the classes and functions you’re actually using have been written in Python, but the tricky parts or the high-performance parts are in the extension module.
When you see “unable to find vcvarsall.bat”, it means you’re installing a package that has an extension module, but only the source code. “vcvarsall.bat” is part of the compiler in Visual Studio that is necessary to compile the module.
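To make this concrete, here is a hypothetical, minimal setup.py for a package with a single extension module (the names are placeholders, not from any real project). Installing such a package from source is what triggers the compiler search – and the vcvarsall.bat error when no compiler is found:
# setup.py – a placeholder example of a package with one extension module
from setuptools import setup, Extension

setup(
    name="spam",                                      # made-up package name
    version="1.0",
    ext_modules=[
        Extension("spam", sources=["spammodule.c"]),  # C source compiled at install time
    ],
)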
As a Windows user, you’re probably used to downloading programs that are ready to run. This is largely due to the very impressive compatibility that Windows provides – you can take a program that was compiled twenty years ago and run it on versions of Windows that nobody had imagined at that time. However, Python comes from a very different world where every single machine can be different and incompatible. This makes it impossible to precompile programs and only distribute the build outputs, because many users will not be able to use them. So the culture is one where only source code is distributed, and every machine is set up with a compiler and the tools necessary to build extension modules on install. Because Windows has a different culture, most people do not have (or need) a compiler.
The good news is that the culture is changing. For Windows platforms, a package developer can upload wheels of their packages as well as the source code. Extension modules included in wheels have already been compiled, so you do not need a compiler on the machine you are installing onto.
When you use pip to install your package, if a wheel is available for your version of Python, it will be downloaded and extracted. For example, running pip install numpy will download their wheel on Python 3.5, 3.4 and 2.7 – no compilers needed!
I need a package that has no wheel – what can I do?
Firstly, this is becoming a more and more rare occurrence. The pythonwheels.com site tracks the 360 most popular packages, showing which ones have made wheels available (nearly 60% when this blog post was written). But from time to time you will encounter a package whose developer has not produced wheels.
The first thing you should do is report an issue on the project’s issue tracker, requesting (politely) that they include wheels with their releases. If the project supports Windows at all, they ought to be testing on Windows, which means they have already handled the compiler setup. (And if a project is not testing on Windows, and you care a lot about that project, maybe you should volunteer to help them out? Most projects do not have paid staff, and volunteers are always appreciated.)
If a project is not willing or able to produce wheels themselves, you can look elsewhere. For many people, using a distribution such as Anaconda or Python(x,y) is an easy way to get access to a lot of packages.
However, if you just need to get one package, it’s worth seeing if it is available on Christoph Gohlke’s Python Extension Packages for Windows page. On this page there are unofficial wheels (that is, the original projects do not necessarily endorse them) for hundreds of packages. You can download any of them and then use pip install (full path to the .whl file) to install it.
If none of these options is available, you will need to consider building the extension yourself. In many cases this is not difficult, though it does require setting up a build environment. (These instructions are adapted from Build Environment.)
First you’ll need to install the compiler toolset. Depending on which version of Python you care about, you will need to choose a different download, but all of them are freely available. The table below lists the downloads for versions of Python as far back as 2.6.
| Python Version | You will need |
|---|---|
| 3.5 and later | Visual C++ Build Tools 2015 or Visual Studio 2015 |
| 3.3 and 3.4 | Windows SDK for Windows 7 and .NET 4.0 (Alternatively, Visual Studio 2010 if you have access to it) |
| 2.6 to 3.2 | Microsoft Visual C++ Compiler for Python 2.7 |
After installing the compiler tools, you should ensure that your version of setuptools is up-to-date.
For Python 3.5 and later, installing Visual Studio 2015 is sufficient and you can now try to pip install the package again. Python 3.5 resolves a significant compatibility issue on Windows that will make it possible to upgrade the compilers used for extensions, so when a new version of Visual Studio is released, you will be able to use that instead of the current one.
For Python 2.6 through 3.2, you likewise don’t need to do anything else. The compiler package (though labelled for “Python 2.7”, it works for all of these versions) is detected by setuptools, and so pip install will use it when needed.
However, if you are targeting Python 3.3 and 3.4 (and did not have access to Visual Studio 2010), building is slightly more complicated. You will need to open a Visual Studio Command Prompt (selecting the x64 version if using 64-bit Python) and run set DISTUTILS_USE_SDK=1 before calling pip install.
If you have to install these packages on a lot of machines, I’d strongly suggest installing the wheel package first and using pip wheel (package name) to create your own wheels. Then you can install those on other machines without having to install the compilers.
And while this sounds simple, there is a downside. Many, many packages that need a compiler also need other dependencies. For example, the lxml example we started with also requires copies of libxml2 and libxslt – more libraries that you will need to find, download, install, build, test and verify. Just because you have a compiler installed does not mean the pain ends.
When will the pain end?
The issues surrounding Python packaging are some of the most complex in our industry right now. Versioning is difficult, dependency resolution is difficult, ABI compatibility is difficult, secure hosting is difficult, and software trust is difficult. But just because these problems are difficult does not mean that they are impossible to solve, that we cannot have a viable ecosystem despite them, or that people are not actively working on better solutions.
For example, wheels are a great distribution solution for Windows and Mac OS X, but not so great on Linux due to the range of differences between installs. However, there are people actively working on making it possible to publicly distribute wheels that will work with most versions of Linux, such that soon all platforms will benefit from faster installation and no longer require a compiler for extension modules.
Most of the work solving these issues for Python goes on at the distutils-sig mailing list, and you can read the current recommendations at packaging.python.org. We are all volunteers, and so over time the discussion moves from topic to topic as people develop an interest and have time available to work on various problems. More contributors are always welcome.
But even if you don’t want to solve the really big problems, there are ways you can help. Report an issue to package maintainers who do not yet have wheels. If they don’t currently support Windows, offer to help them with testing, building, and documentation. Consider donating to projects that accept donations – these are often used to fund the software and hardware (or online services such as Appveyor) needed to support other platforms.
And always thank project maintainers who actively support Windows, Mac OS X and Linux. It is not an easy task to build, test, debug and maintain code that runs on such a diverse set of platforms. Those who take on the burden deserve our encouragement.
PyCharm
In-Depth Screencast on Testing
Earlier this year we rolled out a Getting Started Series of screencast videos on the basics of using PyCharm: setup, the UI, running Python code, debugging, etc. We knew at the time that some topics would need more treatment than a quick screencast, so we planned a follow-on series of “in-depth” screencasts, each on a single topic.
Here’s the first: In-Depth: Testing covers more topics than we went over in the Getting Started: Testing screencast, which makes sense, as PyCharm has such a wealth of support in each of its testing features:
Here are the topics in this just-over-four-minute video:
- pytest, because, you know…it’s pytest! PyCharm has run configuration support native to pytest, so we introduce that.
- Multi-Python-version testing with tox, especially since PyCharm recently added native support.
- Testing with doctests, using (again) a native run configuration (a tiny example of this style of test follows the list)
- Did I mention PyCharm has a lot of native run configurations for testing? Ditto for BDD, so we cover test configurations for the behave package.
- Skipping tests and re-running a test configuration
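Not from the screencast itself, but as a small illustration of the pytest-and-doctest style of test it covers (the function and names are made up):
def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    """
    return a + b

def test_add():
    # pytest collects any function named test_* in a test module
    assert add(2, 3) == 5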
We hope you enjoy this first in the series of In-Depth screencasts. We have more planned, such as version control. And please, if you have any topics that you’d like to see get expanded screencast attention, let us know.
Doug Hellmann
bz2 — bzip2 Compression — PyMOTW 3
The bz2 module is an interface for the bzip2 library, used to compress data for storage or transmission. Read more… This post is part of the Python Module of the Week series for Python 3. See PyMOTW.com for more articles from the series.
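As a quick taste of the API (the data and file name below are only illustrative):
import bz2

data = b"Python Module of the Week" * 100
compressed = bz2.compress(data)              # one-shot, in-memory compression
assert bz2.decompress(compressed) == data    # round-trips back to the original bytes

# bz2.open reads and writes .bz2 files directly (Python 3.3+)
with bz2.open("example.bz2", "wb") as f:
    f.write(data)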
Mike Driscoll
PyDev of the Week: John Cook
This week we welcome John Cook as our PyDev of the Week! John has a fun Python blog that I read from time to time and he graciously accepted my offer to interview him this week. Let’s take a few moments to get to know him better.
Can you tell us a little about yourself (hobbies, education, etc):
I’m a consultant working in the overlap of math, data analysis, and software development. Most projects I do have two of these elements if not all three. I had a variety of jobs before starting my own company, and most of them involved some combination of math and software development.
Why did you start using Python?
What other programming languages do you know and which is your favorite?
I’ve written a lot of C++. When Python isn’t fast enough, I turn to C++, though I don’t do that often. In the last few years I’ve used R, C#, and Haskell on different projects.
I really like the consistency and predictability of Mathematica, though I haven’t used it in a while. I now use Python for the kinds of work I used to do in Mathematica. Even though some things are easier to do in Mathematica, it’s worth some extra effort to keep from having to switch contexts and use two languages and environments. And of course Mathematica is expensive. Even if I decide the price of a Mathematica license is worth it for my own use, I can’t ask clients to buy Mathematica licenses.
What projects are you working on now?
Which Python libraries are your favorite (core or 3rd party)?
I use SciPy daily. It’s my favorite in the sense that I depend on it and I’m grateful for the tremendous effort that has gone into it. I can’t say it’s my favorite in terms of API design; I wish it were more consistent and predictable.
I wish I knew pandas and SymPy better. I use them occasionally, but not often enough to keep their syntax in my head.
Conda is a sort of meta library rather than a library per se, but I really appreciate conda. It’s made it so much easier to install packages. I go back and forth between Windows and Linux, and it’s so nice to be able to count on the same libraries in both environments. Before, some packages would install smoothly on one OS but not the other.
Where do you see Python going as a programming language?
Caktus Consulting Group
Adopting Scrum in a Client-services, Multi-project Organization
Caktus began the process of adopting Scrum mid-November 2015 with two days of onsite Scrum training and fully transitioned to a Scrum environment in January 2016. From our original epiphany of “Yes! We want Scrum!” to the beginning of our first sprint, it took us six weeks to design and execute a process and transition plan. This is how we did it:
Step 1: Form a committee
Caktus is a fairly flat organization and we prefer to involve as many people as possible in decisions that affect the whole team. We formed a committee that included our founders, senior developers, and project managers to think through this change. In order for us to proceed with any of the following steps, all committee members had to be in agreement. When we encountered disagreement, we continued communicating in order to identify and resolve points of contention.
Step 2: Identify an approach
Originally we planned to adopt Scrum on a per-project basis. After all, most of the literature on Scrum is geared towards projects. Once we started planning this approach, however, we realized the overhead and duplication of effort required to adopt Scrum on even four concurrent projects (e.g. requiring team members to attend four discrete sets of sprint activities) was not feasible or realistic. Since Caktus works on more than four projects at a time, we needed another approach.
It was then that our CEO Tobias McNulty flipped the original concept, asking “What if instead of focusing our Scrum process around projects, we focused around teams?” After some initial head-scratching, some frantic searches in our Scrum books, and questions to our Scrum trainers, our committee agreed that the Scrum team approach was worth looking into.
Step 3: Identify cross-functional teams with feasible project assignments
Our approach to Scrum generated a lot of questions, including:
- How many teams can we have?
- Who is on which team?
- What projects would be assigned to which teams?
We broke out into several small groups and brainstormed team ideas, then met back together and presented our options to each other. There was a lot of discussion and moving around of sticky notes. We ended up leaving all the options on one of our whiteboards for several days. During this time, you’d frequently find Caktus team members gazing at the whiteboard or pensively moving sticky notes into new configurations. Eventually, we settled on a team/project configuration that required the least amount of transitions for all stakeholders (developers, clients, project managers), retained the most institutional knowledge, and demonstrated cross-functional skillsets.
Step 4: Role-to-title breakdown
Scrum specifies three roles: Development team member, Scrum Master, and Product Owner. Most organizations, including Caktus, specify job titles instead: Backend developer, UI developer, Project Manager, etc. Once we had our teams, we had to map our team members to Scrum roles.
At first, this seemed fairly straightforward. Clearly Development team member = any developers, Scrum Master = Project Manager, and Product Owner = Product Manager. Yet the more we delved into Scrum, the more it became obvious that roles ≠ titles. We stopped focusing on titles and instead focused on responsibilities, skill sets, and attributes. Once we did so, it became obvious that our Project Managers were better suited to be Product Owners.
This realization allowed us to make smarter long-term decisions when assigning members to our teams.
Step 5: Create a transition plan
The change from a client-services, multi-project organization to a client-services, multi-project organization divided into Scrum teams was not insignificant. In order to transition to our Scrum teams, we needed to orient developers to new projects, switch out some client contacts, and physically rearrange our office so that we were seated roughly with our teams. We created a plan to make the necessary changes over time so that we were prepared to start our first sprints in January 2016.
We identified which developers would need to be onboarded onto which projects, and the key points of knowledge transfer that needed to happen in order for teams to successfully support projects. We started these transitions when it made sense to do so per project per team, e.g., after the call with the client in which the client was introduced to the new developer(s), and before the holder of the institutional knowledge went on holiday vacation.
Step 6: Obtain buy-in from the team
We wanted the whole of Caktus to be on board with the change prior to January. Once we had a plan, we hosted a Q&A lunch with the team in which we introduced the new Scrum teams, sprint activity schedules, and project assignments. We answered the questions we could and wrote down the ones we couldn’t for further consideration.
After this initial launch, we had several other team announcements as the process became more defined, as well as kick-off meetings with each team in which everyone had an opportunity to choose team names, provide feedback on schedules, and share any concerns with their new Scrum team. Team name direction was “A type of cactus”, and we landed on Team Robust Hedgehog, Team Discocactus, and Team Scarlet Crown. Concerns were addressed by the teams first, and if necessary, escalated to the Product Owners for further discussion and resolution.
On January 4, 2016, Caktus started its first Scrum sprints. After three months, our teams are reliably and successfully completing sprints, and working together to support our varied clients.
What we’ve learned by adopting Scrum is that Scrum is not a silver bullet. What Scrum doesn’t cover is a much larger list than what it does. The Caktus team has earnestly identified, confronted, and worked together to resolve issues and questions exposed by our adoption of Scrum, including (but not limited to):
- How best to communicate our Scrum process to our clients, so they can understand how it affects their projects?
- How does the Product Strategist title fit into Scrum?
- How can we transition from scheduling projects in hours to relative sizing by sprint in story points, while still estimating incoming projects in hours?
- How do sales efforts get appointed to teams, scheduled into sprints, and still get completed in a satisfactory manner?
- What parts of Scrum are useful for other, non-development efforts at Caktus (retrospectives, daily check-ins, backlogs, etc)?
- Is it possible for someone to perform the Scrum Master role on one team and the Product Owner role on a different team?
Scrum provides the framework that highlights these issues but intentionally does not offer solutions to all the problems. (In fact, in the Certified ScrumMaster exam, “This is outside the scope of Scrum” is the correct answer to some of the more difficult questions.) Adopting Scrum provides teams with the opportunity to solve these problems together and design a customized process that works for them.
Scrum isn’t for every organization or every situation, but it’s working for Caktus. We look forward to seeing how it continues to evolve to help us grow sharper web apps.
Doing Math with Python
SymPy 1.0 and Anaconda 4.0 releases
SymPy 1.0 was released recently and Anaconda 4.0 was just released. I tried all the sample solutions and everything works as expected. The chapter programs should keep working as well.
You can get both updates when you install Anaconda 4.0 or update your existing Anaconda installation:
$ conda update conda
$ conda update anaconda
I have so far verified both on Mac OS X and Linux. If you find any issues on Windows, please email me at doingmathwithpython@gmail.com or post your query/tip to any of the following community forums:
April 10, 2016
Chris Hager
Python Thread Pool
A thread pool is a group of pre-instantiated, idle threads which stand ready to be given work. These are often preferred over instantiating new threads for each task when there is a large number of (short) tasks to be done rather than a small number of long ones.
Suppose you want to download thousands of documents from the internet, but only have the resources to download 50 at a time. The solution is to use a thread pool: spawn a fixed number of threads which download all the URLs from a queue, 50 at a time.
In order to use thread pools, Python 3.x includes the ThreadPoolExecutor class in concurrent.futures, and both Python 2.x and 3.x have multiprocessing.dummy.ThreadPool. multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
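As a minimal sketch of the Python 3 standard-library approach (the fetch function and URLs below are placeholders, not real downloads):
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real download function
    return url

urls = ["http://example.com/doc/%d" % i for i in range(200)]

# At most 50 downloads run concurrently; the with-block waits for all of them.
with ThreadPoolExecutor(max_workers=50) as executor:
    for result in executor.map(fetch, urls):
        pass  # process each downloaded document here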
The downside of multiprocessing.dummy.ThreadPool is that in Python 2.x it is not possible to exit the program with, e.g., a KeyboardInterrupt before all tasks in the queue have been finished by the threads.
In order to achieve an interruptible thread queue in Python 2.x and 3.x (for use in PDFx), I’ve built this code, inspired by stackoverflow.com/a/7257510. It implements a thread pool which works with Python 2.x and 3.x:
import sys

IS_PY2 = sys.version_info < (3, 0)

if IS_PY2:
    from Queue import Queue
else:
    from queue import Queue

from threading import Thread


class Worker(Thread):
    """ Thread executing tasks from a given tasks queue """
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            func, args, kargs = self.tasks.get()
            try:
                func(*args, **kargs)
            except Exception as e:
                # An exception happened in this thread
                print(e)
            finally:
                # Mark this task as done, whether an exception happened or not
                self.tasks.task_done()


class ThreadPool:
    """ Pool of threads consuming tasks from a queue """
    def __init__(self, num_threads):
        self.tasks = Queue(num_threads)
        for _ in range(num_threads):
            Worker(self.tasks)

    def add_task(self, func, *args, **kargs):
        """ Add a task to the queue """
        self.tasks.put((func, args, kargs))

    def map(self, func, args_list):
        """ Add a list of tasks to the queue """
        for args in args_list:
            self.add_task(func, args)

    def wait_completion(self):
        """ Wait for completion of all the tasks in the queue """
        self.tasks.join()


if __name__ == "__main__":
    from random import randrange
    from time import sleep

    # Function to be executed in a thread
    def wait_delay(d):
        print("sleeping for (%d)sec" % d)
        sleep(d)

    # Generate random delays
    delays = [randrange(3, 7) for i in range(50)]

    # Instantiate a thread pool with 5 worker threads
    pool = ThreadPool(5)

    # Add the jobs in bulk to the thread pool. Alternatively you could use
    # `pool.add_task` to add single jobs. The code will block here, which
    # makes it possible to cancel the thread pool with an exception when
    # the currently running batch of workers is finished.
    pool.map(wait_delay, delays)
    pool.wait_completion()
The queue size is the same as the number of threads (see self.tasks = Queue(num_threads)); therefore adding tasks with pool.map(..) and pool.add_task(..) blocks until a new slot in the Queue is available.
When you issue a KeyboardInterrupt by pressing Ctrl+C, the current batch of workers will finish and the program quits with the exception at the pool.map(..) step.
If you have suggestions or feedback, let me know via @metachris