Planet Python

Last update: January 26, 2024 10:43 AM UTC

January 26, 2024

death and gravity

This is not interview advice: a priority-expiry LRU cache without heaps or trees in Python

It's not your fault I got nerdsniped, but that doesn't matter.

Hi, I'm Adrian, and today we're implementing a least recently used cache with priorities and expiry, using only the Python standard library.

This is a bIG TEch CoDINg InTerVIEW problem, so we'll work hard to stay away from the correct™ data structures – no heaps, no binary search trees – but end up with a decent solution anyway!

Requirements #

So you're at an interview and have to implement a priority-expiry LRU cache.

Maybe you get more details, but maybe the problem is deliberately vague; either way, we can reconstruct the requirements from the name alone.

A cache is something that stores data for future reuse, usually the result of a slow computation or retrieval. Each piece of data has an associated key used to retrieve it. Most caches are bounded in some way, for example by limiting the number of items.

The other words have to do with eviction – how and when items are removed.

Each item has a maximum age – items that go past that are expired, so we don't return them. It stands to reason this does not depend on their priority or how full the cache is.
Each item has a priority – when the cache fills up, we evict items with lower priority before those with higher priority.
All other things being equal, we evict items least recently used relative to others.

The problem may specify an API; we can reconstruct that from first principles too. Since the cache is basically a key-value store, we can get away with two methods:

set(key, value, maxage, priority)
get(key) -> value or None

The problem may also suggest:

delete(key) – allows users to invalidate items for reasons external to the cache; not strictly necessary, but we'll end up with it as part of refactoring
evict(now) – not strictly necessary either, but hints eviction is a separate bit of logic, and may come in handy for testing

Types deserve their own discussion:

key – usually, the key is a string, but we can relax this to any hashable value
value – for an in-memory cache, any kind of object is fine
maxage and priority – a number should do for these; a float is more general, but an integer may allow a simpler solution; limits on these are important too, as we'll see soon enough

Tip

Your interviewer may be expecting you to uncover some of these details through clarifying questions. Be sure to think out loud and state your assumptions.

A minimal plausible solution #

I'm sure there are a lot of smart people out there that can Think Really Hard™ and just come up with a solution, but I'm not one of them, so we'll take an iterative, problem-solution approach to this.

Since right now we don't have any solution, we start with the simplest thing that could possibly work¹ – a basic cache with no fancy eviction and no priorities; we can then write some tests against that, to know if we break anything going forward.

Tip

If during an interview you don't know what to do and choose to work up from the naive solution, make it very clear that's what you're doing. Your interviewer may give you hints to help you skip that.

A class holds all the things we need and gives us something to stick the API on:

import time

class Cache:

    def __init__(self, maxsize, time=time.monotonic):
        self.maxsize = maxsize
        self.time = time

        self.cache = {}

And this is our first algorithmic choice: a dict (backed by a hash table) provides average O(1) search / insert / delete.

Tip

Given the problem we're solving, and the context we're solving it in, we have to talk about time complexity. Ned Batchelder's Big-O: How Code Slows as Data Grows provides an excellent introduction (text and video available).

set() leads to more choices:

    def set(self, key, value, *, maxage=10, priority=0):
        now = self.time()

        if key in self.cache:
            self.cache.pop(key)
        elif len(self.cache) >= self.maxsize:
            self.evict(now)

        expires = now + maxage
        item = Item(key, value, expires, priority)

        self.cache[key] = item

First, we evict items only if there's no more room left. (There are other ways of doing this; for example, evicting expired items periodically minimizes memory usage.)

Second, if the key is already in the cache, we remove and insert it again, instead of updating things in place. This way, there's only one code path for setting items, which will make it a lot easier to keep multiple data structures in sync later on.

We use a named tuple to store the parameters associated with a key:

from typing import NamedTuple

class Item(NamedTuple):
    key: object
    value: object
    expires: int
    priority: int

For now, we just evict an arbitrary item; in a happy little coincidence, dicts preserve insertion order, so when iterating over the cache, the oldest key is first.

    def evict(self, now):
        if not self.cache:
            return

        key = next(iter(self.cache))
        del self.cache[key]
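As a quick illustration of the insertion-order behavior evict() relies on (guaranteed for dict since Python 3.7), here is a small sketch; the keys and values are made up for the example.

```python
# dicts iterate in insertion order, so the first key is the oldest one.
cache = {}
cache['a'] = 'A'
cache['b'] = 'B'
cache['c'] = 'C'

oldest = next(iter(cache))  # 'a'

# Overwriting an existing key in place keeps its original position...
cache['a'] = 'X'
still_oldest = next(iter(cache))  # 'a'

# ...but removing and re-inserting it moves it to the end,
# which is what set() does for already-cached keys.
cache['a'] = cache.pop('a')
new_oldest = next(iter(cache))  # 'b'
```

This is also why the remove-and-insert code path in set() matters: an in-place update would leave a freshly set key first in line for eviction.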

Finally, get() is trivial:

    def get(self, key):
        item = self.cache.get(key)
        if not item:
            return None

        if self.time() >= item.expires:
            return None

        return item.value

With everything in place, here's the first test:

def test_basic():
    cache = Cache(2, FakeTime())

    assert cache.get('a') == None

    cache.set('a', 'A')
    assert cache.get('a') == 'A'

    cache.set('b', 'B')
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

To make things predictable, we inject a fake time implementation:

class FakeTime:
    def __init__(self, now=0):
        self.now = now
    def __call__(self):
        return self.now

Problem: expired items should go first... #

Following from the requirements, there's an order in which items get kicked out: first expired (lowest expiry time), then lowest priority, and only then least recently used. So, we need a data structure that can efficiently remove the smallest element.

Turns out, there's an abstract data type for that called a priority queue;² for now, we'll honor its abstract nature and not bother with an implementation.

        self.cache = {}
        self.expires = PriorityQueue()

Since we need the item with the lowest expiry time, we need a way to get back to the item; an (expires, key) tuple should do fine – since tuples compare lexicographically, it'll be like comparing by expires alone, but with key along for the ride; in set(), we add:

        self.cache[key] = item
        self.expires.push((expires, key))
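To see what "compare lexicographically" means in practice, here is a small sketch with made-up entries. One caveat it surfaces, which is an observation of mine rather than something the queue code guards against: when two expiry times are equal, the keys themselves get compared, so they need to be mutually comparable.

```python
# Tuples compare element by element, left to right.
entries = [(20, 'b'), (10, 'z'), (10, 'a')]

smallest = min(entries)    # (10, 'a'): lowest expiry time, ties broken by key
ordered = sorted(entries)  # [(10, 'a'), (10, 'z'), (20, 'b')]

# Different expiry times: the keys are never looked at.
different_times_ok = (10, 'a') < (20, 1)

# Equal expiry times: the keys are compared, and str vs. int fails.
try:
    (10, 'a') < (10, 1)
    mixed_keys_ok = True
except TypeError:
    mixed_keys_ok = False
```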

You may be tempted (like I was) to say "hey, the item's already a tuple, if we make expires the first field, we can use the item itself", but let's delay optimizations until we have and understand a full solution – make it work, make it right, make it fast.

Still in set(), if the key is already in the cache, we also remove it from the expires queue, so it gets added back with the new expiry time.

        if key in self.cache:
            item = self.cache.pop(key)
            self.expires.remove((item.expires, key))
            del item
        elif len(self.cache) >= self.maxsize:
            self.evict(now)

Moving on to evicting things; for this, we need two operations: first peek at the item that expires next to see if it's expired, then, if it is, pop it from the queue. (Another choice: we only have to evict one item, but evict all expired ones.)

    def evict(self, now):
        if not self.cache:
            return

        initial_size = len(self.cache)

        while self.cache:
            expires, key = self.expires.peek()
            if expires > now:
                break
            self.expires.pop()
            del self.cache[key]

        if len(self.cache) == initial_size:
            _, key = self.expires.pop()
            del self.cache[key]

If there are no expired items, we still have to make room for the one item; since we're not handling priorities yet, we'll evict the item that expires next a little early.
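For reference, the full eviction order from the start of this section (expired first, then lowest priority, then least recently used) can be written as a brute-force O(n) function. This is only a sketch to pin down the intended behavior, for example as a test oracle; Entry, last_used and pick_victim() are hypothetical names that do not appear in the cache itself.

```python
from typing import NamedTuple

class Entry(NamedTuple):  # hypothetical, for illustration only
    key: object
    expires: float
    priority: int
    last_used: float

def pick_victim(entries, now):
    """Return the key to evict next, or None if there are no entries."""
    if not entries:
        return None
    # Expired entries go first, regardless of priority.
    expired = [e for e in entries if e.expires <= now]
    if expired:
        return min(expired, key=lambda e: e.expires).key
    # Otherwise: lowest priority, ties broken by least recent use.
    return min(entries, key=lambda e: (e.priority, e.last_used)).key

entries = [
    Entry('a', expires=10, priority=1, last_used=3),
    Entry('b', expires=20, priority=0, last_used=2),
    Entry('c', expires=30, priority=0, last_used=1),
]

victim_when_a_expired = pick_victim(entries, now=15)   # 'a'
victim_when_none_expired = pick_victim(entries, now=5)  # 'c'
```

Everything that follows is about getting the same answers without scanning all the entries.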

Problem: name PriorityQueue is not defined #

OK, to get the code working again, we need a PriorityQueue class. It doesn't need to be fast, we can deal with that after we finish everything else; for now, let's just keep our elements in a plain list.

class PriorityQueue:

    def __init__(self):
        self.data = []

The easiest way to get the smallest value is to keep the list sorted; the downside is that push() is now O(n log n) – although, because the list is always sorted, it can be as good as O(n) depending on the implementation.

    def push(self, item):
        self.data.append(item)
        self.data.sort()
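The "as good as O(n)" part can be checked empirically: CPython's list.sort() (Timsort) detects runs that are already sorted, so appending one element to a sorted list and sorting again needs roughly n comparisons. The sketch below counts comparisons with a hypothetical Counted wrapper; the exact counts depend on the interpreter, so none are shown.

```python
import random

class Counted:
    """Wraps a value and counts how many times values get compared."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):  # list.sort() only uses <
        Counted.comparisons += 1
        return self.value < other.value

def count_sort_comparisons(values):
    Counted.comparisons = 0
    data = [Counted(v) for v in values]
    data.sort()
    return Counted.comparisons

n = 1000
# An already sorted list with one new element appended (what push() does).
nearly_sorted = list(range(0, 2 * n, 2)) + [n + 1]
# The same values in random order, for comparison.
shuffled = nearly_sorted[:]
random.Random(0).shuffle(shuffled)

nearly_sorted_count = count_sort_comparisons(nearly_sorted)
shuffled_count = count_sort_comparisons(shuffled)
```

On a typical run, nearly_sorted_count should land close to n, while shuffled_count is several times larger.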

This makes peek() and pop() trivial; still, pop() is O(n), because it shifts all the items left by one position.

    def peek(self):
        return self.data[0]

    def pop(self):
        rv = self.data[0]
        self.data[:1] = []
        return rv

remove() is just as simple, and just as O(n), because it first needs to find the item, and then shift the ones after it to cover the gap.

    def remove(self, item):
        self.data.remove(item)

We didn't use the is empty operation, but it should be O(1) regardless of implementation, so let's throw it in anyway:

    def __bool__(self):
        return bool(self.data)

OK, let's wrap up with a quick test:

def test_priority_queue():
    pq = PriorityQueue()
    pq.push(1)
    pq.push(3)
    pq.push(2)

    assert pq
    assert pq.peek() == 1
    assert pq.pop() == 1
    assert pq.peek() == 2

    assert pq.remove(3) is None

    assert pq
    assert pq.peek() == 2
    with pytest.raises(ValueError):
        pq.remove(3)

    assert pq.pop() == 2

    assert not pq
    with pytest.raises(IndexError):
        pq.peek()
    with pytest.raises(IndexError):
        pq.pop()

Now the existing tests pass, and we can add more – first, that expired items are evicted (note how we're moving the time forward):

def test_expires():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=20)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.time.now = 15
    assert cache.get('a') == None
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'B'
    assert cache.get('c') == 'C'

Second, that setting an existing item changes its expire time:

def test_update_expires():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', maxage=10)
    cache.set('b', 'B', maxage=10)

    cache.time.now = 5
    cache.set('a', 'X', maxage=4)
    cache.set('b', 'Y', maxage=6)
    assert cache.get('a') == 'X'
    assert cache.get('b') == 'Y'

    cache.time.now = 10
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'

Problem: ...low priority items second #

Next up, kick out items by priority – shouldn't be too hard, right?

In __init__(), add another priority queue for priorities:

        self.cache = {}
        self.expires = PriorityQueue()
        self.priorities = PriorityQueue()

In set(), add new items to the priorities queue:

        self.cache[key] = item
        self.expires.push((expires, key))
        self.priorities.push((priority, key))

...and remove already-cached items:

        if key in self.cache:
            item = self.cache.pop(key)
            self.expires.remove((item.expires, key))
            self.priorities.remove((item.priority, key))
            del item

In evict(), remove expired items from the priorities queue:

        while self.cache:
            expires, key = self.expires.peek()
            if expires > now:
                break
            self.expires.pop()
            item = self.cache.pop(key)
            self.priorities.remove((item.priority, key))

...and finally, if none are expired, remove the one with the lowest priority:

        if len(self.cache) == initial_size:
            _, key = self.priorities.pop()
            item = self.cache.pop(key)
            self.expires.remove((item.expires, key))

Add one test for eviction by priority:

def test_priorities():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    assert cache.get('a') == 'A'
    assert cache.get('b') == 'B'

    cache.set('c', 'C')
    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

...and one for updating the priority of existing items:

def test_update_priorities():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A', priority=1)
    cache.set('b', 'B', priority=0)
    cache.set('b', 'Y', priority=2)

    cache.set('c', 'C')
    assert cache.get('a') == None
    assert cache.get('b') == 'Y'
    assert cache.get('c') == 'C'

Problem: we're deleting items in three places #

I said we'll postpone performance optimizations until we have a complete solution, but I have a different kind of optimization in mind – for readability.

We're deleting items in three slightly different ways, careful to keep three data structures in sync each time; it would be nice to do it only once. While a bit premature, through the magic of having written the article already, I'm sure it will pay off.

    def delete(self, key):
        *_, expires, priority = self.cache.pop(key)
        self.expires.remove((expires, key))
        self.priorities.remove((priority, key))

Sure, eviction is twice as slow, but the complexity stays the same – the constant factor in O(2n) gets removed, leaving us with O(n). If needed, we can go back to the unpacked version once we have a reasonably efficient implementation (that's what tests are for).

Deleting already-cached items is shortened to:

        if key in self.cache:
            self.delete(key)

Idem for the core eviction logic:

        while self.cache:
            expires, key = self.expires.peek()
            if expires > now:
                break
            self.delete(key)

        if len(self.cache) == initial_size:
            _, key = self.priorities.peek()
            self.delete(key)

Neat!

Problem: ...least recently used items last #

So, how does one implement a least recently used cache?

We could google it... or, we could look at an existing implementation.

functools.lru_cache() #

Standard library functools.lru_cache() comes to mind first; let's have a look.

Tip

You can read the code of stdlib modules by following the Source code: link at the top of each documentation page.

lru_cache() delegates to _lru_cache_wrapper(), which sets up a bunch of variables to be used by nested functions.³ Among the variables is a cache dict and a doubly linked list where nodes are [prev, next, key, value] lists.⁴

And that's the answer – a doubly linked list allows tracking item use in O(1): each time a node is used, remove it from its current position and plop it at the "recently used" end; whatever's at the other end will be the least recently used item.
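Here is a minimal sketch of that idea, loosely following the list-based nodes described above but dropping the value for brevity. LRUList, use() and pop_least_recent() are names made up for this example; this is not the lru_cache() code.

```python
PREV, NEXT, KEY = 0, 1, 2  # indexes into a node, which is a plain list

class LRUList:
    """Tracks key usage order with a circular doubly linked list."""

    def __init__(self):
        root = []                     # sentinel node
        root[:] = [root, root, None]  # an empty list points at itself
        self.root = root
        self.nodes = {}               # key -> node, for O(1) lookup

    def _unlink(self, node):
        prev, next_ = node[PREV], node[NEXT]
        prev[NEXT] = next_
        next_[PREV] = prev

    def use(self, key):
        """Mark key as most recently used, in O(1)."""
        node = self.nodes.get(key)
        if node is None:
            node = self.nodes[key] = [None, None, key]
        else:
            self._unlink(node)
        # Link the node in just before the sentinel ("recently used" end).
        last = self.root[PREV]
        node[PREV], node[NEXT] = last, self.root
        last[NEXT] = self.root[PREV] = node

    def pop_least_recent(self):
        """Remove and return the least recently used key, in O(1)."""
        node = self.root[NEXT]
        if node is self.root:
            raise KeyError('empty')
        self._unlink(node)
        del self.nodes[node[KEY]]
        return node[KEY]

lru = LRUList()
for key in 'abc':
    lru.use(key)
lru.use('a')  # 'a' becomes the most recently used

first = lru.pop_least_recent()   # 'b'
second = lru.pop_least_recent()  # 'c'
third = lru.pop_least_recent()   # 'a'
```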

Note that, unlike lru_cache(), we need one doubly linked list for each priority.

But, before making Item mutable and giving it prev/next links, let's dive deeper.

OrderedDict #

If you search the docs for "LRU", the next result after lru_cache() is OrderedDict.

Some differences from dict still remain: [...] The OrderedDict algorithm can handle frequent reordering operations better than dict. [...] this makes it suitable for implementing various kinds of LRU caches.

Specifically:

OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint.

Since dicts preserve insertion order, you can use d[k] = d.pop(k) to move items to the end... What makes move_to_end() better, then? This comment may shed some light:

    # The internal self.__map dict maps keys to links in a doubly linked list.

Indeed, move_to_end() does exactly what we described above – this is good news, it means we don't have to do it ourselves!
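A short sketch of the OrderedDict operations we'll rely on, with made-up keys; the values are None because we only care about the order.

```python
from collections import OrderedDict

bucket = OrderedDict.fromkeys(['a', 'b', 'c'])

# Using 'a' moves it to the "recently used" end.
bucket.move_to_end('a')
order = list(bucket)  # ['b', 'c', 'a']

# The least recently used key is now first.
least_recent = next(iter(bucket))  # 'b'

# popitem(last=False) removes from the "least recently used" end.
evicted, _ = bucket.popitem(last=False)  # 'b'
```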

So, we need one OrderedDict (read: doubly linked list) for each priority, but still need to keep track of the lowest priority:

        self.cache = {}
        self.expires = PriorityQueue()
        self.priority_buckets = {}
        self.priority_order = PriorityQueue()

Handling priorities in set() gets a bit more complicated:

        priority_bucket = self.priority_buckets.get(priority)
        if not priority_bucket:
            priority_bucket = self.priority_buckets[priority] = OrderedDict()
            self.priority_order.push(priority)
        priority_bucket[key] = None

But now we can finally evict the least recently used item:

        if len(self.cache) == initial_size:
            priority = self.priority_order.peek()
            priority_bucket = self.priority_buckets.get(priority)
            key = next(iter(priority_bucket))
            self.delete(key)

In delete(), we're careful to get rid of empty buckets:⁵

        priority_bucket = self.priority_buckets[priority]
        del priority_bucket[key]
        if not priority_bucket:
            del self.priority_buckets[priority]
            self.priority_order.remove(priority)

Existing tests pass again, and we can add a new (still failing) one:

def test_lru():
    cache = Cache(2, FakeTime())

    cache.set('a', 'A')
    cache.set('b', 'B')
    assert cache.get('a') == 'A'

    cache.set('c', 'C')

    assert cache.get('a') == 'A'
    assert cache.get('b') == None
    assert cache.get('c') == 'C'

All that's needed to make it pass is to call move_to_end() in get():

        self.priority_buckets[item.priority].move_to_end(key)
        return item.value

Problem: our priority queue is slow #

OK, we have a complete solution, it's time to deal with the priority queue implementation. Let's do a quick recap of the methods we need and why:

push() – to add items
peek() – to get items / buckets with the lowest expiry time / priority
remove() – to delete items
pop() – not used, but would be without the delete() refactoring

We make two related observations: first, there's no remove operation on the priority queue Wikipedia page; second, even if we unpack delete(), we only get to pop() an item/bucket from one of the queues, and still have to remove() it from the other.

And this is what makes the problem tricky – we need to maintain not one, but two independent priority queues.

Note

We'll now go through a few data structures in quick succession, which may be a bit overwhelming without preparation. Keep in mind we don't care how they work (not yet, at least), we're just shopping around based on specs.

heapq #

If you search the docs for "priority queue", you'll find heapq, which:

[...] provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.

Reading on, we find extensive notes on implementing priority queues; particularly interesting are using (priority, item) tuples (already doing this!) and removing entries:

Removing the entry or changing its priority is more difficult because it would break the heap structure invariants. So, a possible solution is to mark the entry as removed and add a new entry with the revised priority.

This workaround is needed because while removing the i-th element can be done in O(log n), finding its index is O(n). To summarize, we have:

	sort	heapq
push()	O(n)	O(log n)
peek()	O(1)	O(1)
pop()	O(n)	O(log n)
remove()	O(n)	O(n)

Still, with a few mitigating assumptions, it could work:

we can assume priorities are static, so buckets never get removed
to-be-removed expiry times will get popped sooner or later anyway; we can assume that most evictions are due to expired items, and that items being evicted due to low priority (i.e. when the cache is full) and item updates are rare (both cause to-be-removed entries to accumulate in the expiry queue)
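To make the "mark the entry as removed" workaround concrete, here is one possible sketch. It is a variation on the idea rather than the recipe from the heapq docs: removals are only recorded in a dict and the stale entries are skipped once they reach the top of the heap. LazyHeapQueue is a hypothetical name, items must be hashable, and remove() assumes the item was pushed earlier (it does not check).

```python
import heapq

class LazyHeapQueue:
    """Priority queue with O(1) remove() via lazy deletion."""

    def __init__(self):
        self.heap = []
        self.removed = {}  # item -> number of stale copies still in heap
        self.size = 0      # number of live items

    def push(self, item):
        heapq.heappush(self.heap, item)
        self.size += 1

    def remove(self, item):
        # Only record the removal; the entry stays in the heap for now.
        self.removed[item] = self.removed.get(item, 0) + 1
        self.size -= 1

    def _skip_removed(self):
        # Drop stale entries that have bubbled up to the top.
        while self.heap and self.removed.get(self.heap[0]):
            item = heapq.heappop(self.heap)
            self.removed[item] -= 1
            if not self.removed[item]:
                del self.removed[item]

    def peek(self):
        self._skip_removed()
        return self.heap[0]

    def pop(self):
        self._skip_removed()
        item = heapq.heappop(self.heap)
        self.size -= 1
        return item

    def __bool__(self):
        return self.size > 0

pq = LazyHeapQueue()
for n in (1, 3, 2):
    pq.push(n)
pq.remove(3)

smallest = pq.pop()  # 1
next_one = pq.pop()  # 2
stale_entries = len(pq.heap)  # 1: the removed 3 is still in the heap
```

The last line shows the cost: the queue is logically empty, but the removed entry lingers until it reaches the top, which is exactly the accumulation the assumptions above are trying to bound.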

bisect #

One way of finding an element in better than O(n) is bisect, which:

[...] provides support for maintaining a list in sorted order without having to sort the list after each insertion.

This may provide an improvement to our naive implementation; sadly, reading further to Performance Notes we find that:

The insort() functions are O(n) because the logarithmic search step is dominated by the linear time insertion step.

While in general that's better than just about any sort, we happen to be hitting the best case of our sort implementation, which has the same complexity. (Nevertheless, shifting elements is likely cheaper than the same number of comparisons.)

	sort	heapq	bisect
push()	O(n)	O(log n)	O(n)
peek()	O(1)	O(1)	O(1)
pop()	O(n)	O(log n)	O(n)
remove()	O(n)	O(n)	O(n)

Further down in the docs, there's a see also box:

Sorted Collections is a high performance module that uses bisect to managed sorted collections of data.

Not in stdlib, moving along... ¯\_(ツ)_/¯

pop() optimization #

There's an unrelated improvement that applies to both the naive solution and bisect. With a sorted list, pop() is O(n) because it shifts all elements after the first; if the order was reversed, we'd pop() from the end, which is O(1). So:

	sort	heapq	bisect
push()	O(n)	O(log n)	O(n)
peek()	O(1)	O(1)	O(1)
pop()	O(1)*	O(log n)	O(1)*
remove()	O(n)	O(n)	O(n)

Binary search trees #

OK, I'm out of ideas, and there's nothing else in stdlib that can help.

We can restate the problem as follows: we need a sorted data structure that can do better than O(n) for push() / remove().

We've already peeked at the Wikipedia priority queue page, so let's keep reading – skipping past the naive implementations, to the usual implementation, we find that:

To improve performance, priority queues are typically based on a heap, [...]

Looked into that, didn't work; next:

Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time [...]

There, that's what we're looking for! (And likely what your interviewer is, too.)

	sort	heapq	bisect	BST
push()	O(n)	O(log n)	O(n)	O(log n)
peek()	O(1)	O(1)	O(1)	O(log n)
pop()	O(1)*	O(log n)	O(1)*	O(log n)
remove()	O(n)	O(n)	O(n)	O(log n)

But there's no self-balancing BST in the standard library, and I sure as hell am not implementing one right now – I still have flashbacks from when I tried to do a red-black tree and two hours later it still had bugs (I mean, look at the length of this explanation!).

After a bit of googling we find, among others, bintrees, a mature library that provides all sorts of binary search trees... except:

Bintrees Development Stopped

Use sortedcontainers instead: https://pypi.python.org/pypi/sortedcontainers

Sounds familiar, doesn't it?

Sorted Containers #

Let's go back to that Sorted Collections library bisect was recommending:

Depends on the Sorted Containers module.

(￢‿￢ )

I remember, I remember now... I'm not salty because the red-black tree took two hours to implement. I'm salty because after all that time, I found Sorted Containers, a pure Python library that is faster in practice than fancy self-balancing binary search trees implemented in C!

It has extensive benchmarks to prove it, and simulated workload benchmarks for our own use case, priority queues – so yeah, while the interview answer is "self-balancing BSTs", the actual answer is Sorted Containers.

How does it work? There's an extensive explanation too:⁶

The Sorted Containers internal implementation is based on a couple observations. The first is that Python’s list is fast, really fast. [...] The second is that bisect.insort⁷ is fast. This is somewhat counter-intuitive since it involves shifting a series of items in a list. But modern processors do this really well. A lot of time has been spent optimizing mem-copy/mem-move-like operations both in hardware and software.

But using only one list and bisect.insort would produce sluggish behavior for lengths exceeding ten thousand. So the implementation of Sorted List uses a list of lists to store elements. [...]

There's also a comparison with trees, which I'll summarize: fewer memory allocations, better cache locality, lower memory overhead, and faster iteration.

I think that gives you a decent idea of how and why it works, enough that with a bit of tinkering you might be able to implement it yourself.⁸
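As a rough idea of what such tinkering could look like, here is a toy list-of-lists sorted collection. It only borrows the general idea from the quote above and is not how Sorted Containers is actually implemented; ChunkedSortedList, chunk_size and the maxes index are all assumptions made for this sketch.

```python
import bisect

class ChunkedSortedList:
    """Sorted collection stored as a list of small sorted lists."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.chunks = []  # non-empty sorted lists, in order
        self.maxes = []   # maxes[i] == chunks[i][-1]

    def add(self, item):
        if not self.chunks:
            self.chunks.append([item])
            self.maxes.append(item)
            return
        # First chunk whose max is >= item, else the last chunk.
        i = min(bisect.bisect_left(self.maxes, item), len(self.chunks) - 1)
        chunk = self.chunks[i]
        bisect.insort(chunk, item)  # shifts at most chunk_size items
        if len(chunk) > self.chunk_size:
            # Split an oversized chunk in two to keep shifts short.
            half = len(chunk) // 2
            self.chunks[i:i + 1] = [chunk[:half], chunk[half:]]
            self.maxes[i:i + 1] = [chunk[half - 1], chunk[-1]]
        else:
            self.maxes[i] = chunk[-1]

    def remove(self, item):
        i = bisect.bisect_left(self.maxes, item)
        if i < len(self.chunks):
            chunk = self.chunks[i]
            j = bisect.bisect_left(chunk, item)
            if j < len(chunk) and chunk[j] == item:
                del chunk[j]
                if chunk:
                    self.maxes[i] = chunk[-1]
                else:
                    del self.chunks[i], self.maxes[i]
                return
        raise ValueError(f'{item!r} not in list')

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk

sl = ChunkedSortedList(chunk_size=2)
for n in [5, 1, 4, 2, 3]:
    sl.add(n)
sl.remove(4)
remaining = list(sl)  # [1, 2, 3, 5]
```

Both add() and remove() do two binary searches plus a shift bounded by chunk_size, which is where the "bisect, but only on short lists" intuition comes from.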

Problem: Sorted Containers is not in stdlib #

But Sorted Containers is not in the standard library either, and we don't want to implement it ourselves. We did learn something from it, though:

	sort	heapq	bisect	bisect <10k	BST
push()	O(n)	O(log n)	O(n)	O(log n)	O(log n)
peek()	O(1)	O(1)	O(1)	O(1)	O(log n)
pop()	O(1)*	O(log n)	O(1)*	O(1)*	O(log n)
remove()	O(n)	O(n)	O(n)	O(log n)	O(log n)

We still need to make some assumptions, though:

Do we really need more than 10k priorities? Likely no, let's just cap them at 10k.
Do we really need more than 10k expiry times? Maybe? – with 1 second granularity we can represent only up to 2.7 hours; 10 seconds takes us up to 27 hours, which may just work.

OK, one more and we're done. The other issue, aside from the maximum time, is that the granularity is too low, especially for short times – rounding 1 to 10 seconds is much worse than rounding 2001 to 2010 seconds. Which begs the question –

Does it really matter if items expire in 2010 seconds instead of 2001? Likely no, but we need a way to round small values with higher granularity than big ones.

Logarithmic time #

How about 1, 2, 4, 8, ...? Rounding up to powers of 2 gets us decreasing granularity, but time doesn't actually start at zero. We fix this by rounding up to multiples of powers of 2 instead; let's get an intuition of how it works:

ceil((2000 +  1) /  1) *  1 = 2001
ceil((2000 +  2) /  2) *  2 = 2002
ceil((2000 +  3) /  4) *  4 = 2004
ceil((2000 +  4) /  4) *  4 = 2004
ceil((2000 + 15) / 16) * 16 = 2016
ceil((2000 + 16) / 16) * 16 = 2016
ceil((2000 + 17) / 32) * 32 = 2048

So far so good, how about after some time has passed?

ceil((2013 +  1) /  1) *  1 = 2014
ceil((2013 +  2) /  2) *  2 = 2016
ceil((2013 +  3) /  4) *  4 = 2016
ceil((2013 +  4) /  4) *  4 = 2020
ceil((2013 + 15) / 16) * 16 = 2032
ceil((2013 + 16) / 16) * 16 = 2032
ceil((2013 + 17) / 32) * 32 = 2048

The beauty of aligned powers is that for a relatively constant number of expiry times, the number of buckets remains roughly the same over time – as closely packed buckets are removed from the beginning, new ones fill the gaps between the sparser ones towards the end.

OK, let's put it into code:

import math

def log_bucket(now, maxage):
    next_power = 2 ** math.ceil(math.log2(maxage))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

>>> [log_bucket(0, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[1, 2, 4, 4, 16, 16, 32]
>>> [log_bucket(2000, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2001, 2002, 2004, 2004, 2016, 2016, 2048]
>>> [log_bucket(2013, i) for i in [1, 2, 3, 4, 15, 16, 17]]
[2014, 2016, 2016, 2020, 2032, 2032, 2048]

Looking good!

There are two sources of error – first from rounding maxage, worst when it's one more than a power of 2, and second from rounding the expiry time, also worst when it's one more than a power of two. Together, they approach 200% of maxage:

>>> log_bucket(0, 17)  # (32 - 17) / 17 ~= 88%
32
>>> log_bucket(0, 33)  # (64 - 33) / 33 ~= 94%
64
>>> log_bucket(16, 17)  # (64 - 33) / 17 ~= 182%
64
>>> log_bucket(32, 33)  # (128 - 65) / 33 ~= 191%
128

200% error is quite a lot; before we set to fix it, let's confirm our reasoning.

def error(now, maxage, *args):
    """log_bucket() error."""
    bucket = log_bucket(now, maxage, *args)
    return (bucket - now) / maxage - 1

def max_error(now, max_maxage, *args):
    """Worst log_bucket() error for all maxages up to max_maxage."""
    return max(
        error(now, maxage, *args)
        for maxage in range(1, max_maxage)
    )

def max_error_random(n, *args):
    """Worst log_bucket() error for random inputs, out of n tries."""
    max_now = int(time.time()) * 2
    max_maxage = 3600 * 24 * 31
    rand = functools.partial(random.randint, 1)
    return max(
        error(rand(max_now), rand(max_maxage), *args)
        for _ in range(n)
    )

>>> max_error(0, 10_000)
0.9997558891736849
>>> max_error(2000, 10_000)
1.9527896995708156
>>> max_error_random(10_000_000)
1.9995498725910554

Looks confirmed enough to me.

So, how do we make the error smaller? Instead of rounding to the next power of 2, we round to the next half of a power of 2, or next quarter, or next eighth...

def log_bucket(now, maxage, shift=0):
    next_power = 2 ** max(0, math.ceil(math.log2(maxage) - shift))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

It seems to be working:

>>> for s in range(5):
...     print([log_bucket(0, i, s) for i in [1, 2, 3, 4, 15, 16, 17]])
...
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 4, 4, 16, 16, 32]
[1, 2, 3, 4, 16, 16, 24]
[1, 2, 3, 4, 16, 16, 20]
[1, 2, 3, 4, 15, 16, 18]

>>> for s in range(10):
...     e = max_error_random(1_000_000, s)
...     print(f'{s} {e:6.1%}')
...
0 199.8%
1  99.9%
2  50.0%
3  25.0%
4  12.5%
5   6.2%
6   3.1%
7   1.6%
8   0.8%
9   0.4%

With shift=7, the error is less than two percent; I wonder how many buckets that is...

def max_buckets(max_maxage, *args):
    """Number of buckets to cover all maxages up to max_maxage."""
    now = time.time()
    buckets = {
        log_bucket(now, maxage, *args)
        for maxage in range(1, max_maxage)
    }
    return len(buckets)

>>> max_buckets(3600 * 24, 7)
729
>>> max_buckets(3600 * 24 * 31, 7)
1047
>>> max_buckets(3600 * 24 * 365, 7)
1279

A bit over a thousand buckets for the whole year, not bad!

Before we can use any of that, we need to convert expiry times to buckets; that looks a lot like the priority buckets code, the only notable part being eviction.

__init__():

        self.cache = {}
        self.expires_buckets = {}
        self.expires_order = PriorityQueue()
        self.priority_buckets = {}
        self.priority_order = PriorityQueue()

set():

        expires_bucket = self.expires_buckets.get(expires)
        if not expires_bucket:
            expires_bucket = self.expires_buckets[expires] = set()
            self.expires_order.push(expires)
        expires_bucket.add(key)

delete():

        expires_bucket = self.expires_buckets[expires]
        expires_bucket.remove(key)
        if not expires_bucket:
            del self.expires_buckets[expires]
            self.expires_order.remove(expires)

evict():

        while self.cache:
            expires = self.expires_order.peek()
            if expires > now:
                break
            expires_bucket = self.expires_buckets[expires]
            for key in list(self.expires_buckets[expires]):
                self.delete(key)

And now we use log_bucket(). Since we're at it, why not have unlimited priorities too? A hammer is a hammer and everything is a nail, after all.

        expires = log_bucket(now, maxage, shift=7)
        priority = log_bucket(0, priority+1, shift=7)
        item = Item(key, value, expires, priority)

bisect, redux #

Time to fix that priority queue.

We use insort() to add priorities and operator.neg() to keep the list reversed:⁹

    def push(self, item):
        bisect.insort(self.data, item, key=operator.neg)

We update peek() and pop() to handle the reverse order:

    def peek(self):
        return self.data[-1]

    def pop(self):
        return self.data.pop()

Finally, for remove() we adapt the index() recipe from Searching Sorted Lists:

    def remove(self, item):
        i = bisect.bisect_left(self.data, -item, key=operator.neg)
        if i != len(self.data) and self.data[i] == item:
            del self.data[i]
            return
        raise ValueError

And that's it, we're done!

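Here's the cache in its full glory:

import bisect
import math
import operator
import time
from collections import OrderedDict
from typing import NamedTuple

class Cache: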

    def __init__(self, maxsize, time=time.monotonic):
        self.maxsize = maxsize
        self.time = time

        self.cache = {}
        self.expires_buckets = {}
        self.expires_order = PriorityQueue()
        self.priority_buckets = {}
        self.priority_order = PriorityQueue()

    def get(self, key):
        item = self.cache.get(key)
        if not item:
            return None

        if self.time() >= item.expires:
            return None

        self.priority_buckets[item.priority].move_to_end(key)
        return item.value

    def set(self, key, value, *, maxage=10, priority=0):
        now = self.time()

        if key in self.cache:
            self.delete(key)
        elif len(self.cache) >= self.maxsize:
            self.evict(now)

        expires = log_bucket(now, maxage, shift=7)
        priority = log_bucket(0, priority+1, shift=7)
        item = Item(key, value, expires, priority)

        self.cache[key] = item

        expires_bucket = self.expires_buckets.get(expires)
        if not expires_bucket:
            expires_bucket = self.expires_buckets[expires] = set()
            self.expires_order.push(expires)
        expires_bucket.add(key)

        priority_bucket = self.priority_buckets.get(priority)
        if not priority_bucket:
            priority_bucket = self.priority_buckets[priority] = OrderedDict()
            self.priority_order.push(priority)
        priority_bucket[key] = None

    def evict(self, now):
        if not self.cache:
            return

        initial_size = len(self.cache)

        while self.cache:
            expires = self.expires_order.peek()
            if expires > now:
                break
            expires_bucket = self.expires_buckets[expires]
            for key in list(expires_bucket):
                self.delete(key)

        if len(self.cache) == initial_size:
            priority = self.priority_order.peek()
            priority_bucket = self.priority_buckets.get(priority)
            key = next(iter(priority_bucket))
            self.delete(key)

    def delete(self, key):
        *_, expires, priority = self.cache.pop(key)

        expires_bucket = self.expires_buckets[expires]
        expires_bucket.remove(key)
        if not expires_bucket:
            del self.expires_buckets[expires]
            self.expires_order.remove(expires)

        priority_bucket = self.priority_buckets[priority]
        del priority_bucket[key]
        if not priority_bucket:
            del self.priority_buckets[priority]
            self.priority_order.remove(priority)

class Item(NamedTuple):
    key: object
    value: object
    expires: int
    priority: int

class PriorityQueue:

    def __init__(self):
        self.data = []

    def push(self, item):
        bisect.insort(self.data, item, key=operator.neg)

    def peek(self):
        return self.data[-1]

    def pop(self):
        return self.data.pop()

    def remove(self, item):
        i = bisect.bisect_left(self.data, -item, key=operator.neg)
        if i != len(self.data) and self.data[i] == item:
            del self.data[i]
            return
        raise ValueError

    def __bool__(self):
        return bool(self.data)

def log_bucket(now, maxage, shift=0):
    next_power = 2 ** max(0, math.ceil(math.log2(maxage) - shift))
    expires = now + maxage
    bucket = math.ceil(expires / next_power) * next_power
    return bucket

(The entire file, with tests and everything.)

Conclusion #

Anyone expecting you to implement this in under an hour is delusional. Explaining what you would use and why should be enough for reasonable interviewers, although that may prove difficult if you haven't solved this kind of problem before.

Bullshit interviews aside, it is useful to have basic knowledge of time complexity. Again, can't recommend Big-O: How Code Slows as Data Grows enough.

But, what big O notation says and what happens in practice can differ quite a lot. Be sure to measure, and be sure to think of limits – sometimes, the n in O(n) is or can be made small enough you don't have to do the theoretically correct thing.

You don't need to know how to implement all the data structures, that's what (software) libraries and Wikipedia are for (and for that matter, book libraries too). However, it is useful to have an idea of what's available and when to use it.

Good libraries educate – the Python standard library docs already cover a lot of the practical knowledge we needed, and so did Sorted Containers. But that won't show up in the API reference you see in your IDE; you have to read the actual documentation.

¹ Note the subtitle: if you're not sure what to do yet.
² This early on, the name doesn't really matter, but we'll go with the correct, descriptive one; in the first draft of the code, it was called MagicDS. ✨
³ You have to admit this is at least a bit weird; what you're looking at is an object in a trench coat, at least if you think closures and objects are equivalent.
⁴ Another way of getting an "object" on the cheap.
⁵ If we assume a relatively small number of buckets that will be reused soon enough, this isn't strictly necessary. I'm partly doing it to release the memory held by the dict, since dicts are resized only when items are added.
⁶ There's even a PyCon talk with the same explanation, if you prefer that.
⁷ bisect itself has a fast C implementation, so I guess technically it's not pure Python. But given that stdlib is already there, does that count?
⁸ If the implementation is easy to explain, it may be a good idea.
⁹ This limits priorities to values that can be negated, so tuples won't work anymore. We could use a "reversed view" wrapper if we really cared about that.

January 26, 2024 10:00 AM UTC

January 25, 2024

Stack Abuse

Guide to Strings in Python

Introduction

We've come far in discovering the basics of computer science in the world of Python, and now is the time to start learning about strings. Strings are a fundamental data type that any aspiring developer must become familiar with. They are used extensively in almost every Python application, making understanding them crucial for effective programming.

A string in Python is a sequence of characters. These characters can be letters, numbers, symbols, or whitespace, and they are enclosed within quotes. Python supports both single (' ') and double (" ") quotes to define a string, providing flexibility based on the coder's preference or specific requirements of the application.

More specifically, strings in Python 3 are immutable sequences of Unicode code points.

Creating a string is pretty straightforward. You can assign a sequence of characters to a variable, and Python treats it as a string. For example:

my_string = "Hello, World!"

This creates a new string containing "Hello, World!". Once a string is created, you can access its elements using indexing (same as accessing elements of a list) and perform various operations like concatenation (joining two strings) and replication (repeating a string a certain number of times).

However, it's important to remember that strings in Python are immutable. This immutability means that once you create a string, you cannot change its content. Attempting to alter an individual character in a string will result in an error. While this might seem like a limitation at first, it has several benefits, including improved performance and reliability in Python applications. To modify a string, you would typically create a new string based on modifications of the original.

Python provides a wealth of methods to work with strings, making string manipulation one of the language's strong suits. These built-in methods allow you to perform common tasks like changing the case of a string, stripping whitespace, checking for substrings, and much more, all with simple, easy-to-understand syntax, which we'll discuss later in this article.

As you dive deeper into Python, you'll encounter more advanced string techniques. These include formatting strings for output, working with substrings, and handling special characters. Python's string formatting capabilities, especially with the introduction of f-Strings in Python 3.6, allow for cleaner and more readable code. Substring operations, including slicing and finding, are essential for text analysis and manipulation.

Moreover, strings play nicely with other data types in Python, such as lists. You can convert a string into a list of characters, split a string based on a specific delimiter, or join a collection of strings into a single string. These operations are particularly useful when dealing with data input and output or when parsing text files.

In this article, we'll explore these aspects of strings in Python, providing practical examples to illustrate how to effectively work with strings. By the end, you'll have a solid foundation in string manipulation, setting you up for more advanced Python programming tasks.

Basic String Operators

Strings are one of the most commonly used data types in Python, employed in diverse scenarios from user input processing to data manipulation. This section will explore the fundamental operations you can perform with strings in Python.

Creating Strings

In Python, you can create strings by enclosing a sequence of characters within single, double, or even triple quotes (for multiline strings). For example, simple_string = 'Hello' and another_string = "World" are both valid string declarations. Triple quotes, using ''' or """, allow strings to span multiple lines, which is particularly useful for complex strings or documentation.

The simplest way to create a string in Python is by enclosing characters in single (') or double (") quotes.

Note: Python treats single and double quotes identically.

This method is straightforward and is commonly used for creating short, uncomplicated strings:

# Using single quotes
greeting = 'Hello, world!'

# Using double quotes
title = "Python Programming"

For strings that span multiple lines, triple quotes (''' or """) are the perfect tool. They allow the string to extend over several lines, preserving line breaks and white spaces:

# Using triple quotes
multi_line_string = """This is a
multi-line string
in Python."""

Sometimes, you might need to include special characters in your strings, like newlines (\n), tabs (\t), or even a quote character. This is where escape characters come into play, allowing you to include these special characters in your strings:

# String with escape characters
escaped_string = "He said, \"Python is amazing!\"\nAnd I couldn't agree more."

Printing the escaped_string will give you:

He said, "Python is amazing!"
And I couldn't agree more.

Accessing and Indexing Strings

Once a string is created, Python allows you to access its individual characters using indexing. Each character in a string has an index, starting from 0 for the first character.

For instance, in the string s = "Python", the character at index 0 is 'P'. Python also supports negative indexing, where -1 refers to the last character, -2 to the second-last, and so on. This feature makes it easy to access the string from the end.

Note: Python does not have a character data type. Instead, a single character is simply a string with a length of one.

Accessing Characters Using Indexing

As we stated above, the indexing starts at 0 for the first character. You can access individual characters in a string by using square brackets [] along with the index:

# Example string
string = "Stack Abuse"

# Accessing the first character
first_char = string[0]  # 'S'

# Accessing the third character
third_char = string[2]  # 'a'

Negative Indexing

Python also supports negative indexing. In this scheme, -1 refers to the last character, -2 to the second last, and so on. This is useful for accessing characters from the end of the string:

# Accessing the last character
last_char = string[-1]  # 'e'

# Accessing the second last character
second_last_char = string[-2]  # 's'

String Concatenation and Replication

Concatenation is the process of joining two or more strings together. In Python, this is most commonly done using the + operator. When you use + between strings, Python returns a new string that is a combination of the operands:

# Example of string concatenation
first_name = "John"
last_name = "Doe"
full_name = first_name + " " + last_name  # 'John Doe'

Note: The + operator can only be used with other strings. Attempting to concatenate a string with a non-string type (like an integer or a list) will result in a TypeError.
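As a minimal sketch of that note (the variable names here are only illustrative), the snippet below triggers the TypeError and then avoids it by converting the number with str() first:

```python
count = 3

# Concatenating a str and an int directly raises TypeError.
try:
    message = "Items: " + count
except TypeError as error:
    message = None
    print("TypeError:", error)

# Converting the number with str() first makes the concatenation valid.
message = "Items: " + str(count)
print(message)  # Items: 3
```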

For a more flexible approach, you can use the str.join() method to combine an iterable of strings, or formatted string literals (f-strings), which convert non-string values such as numbers for you:

# Using join() method
words = ["Hello", "world"]
sentence = " ".join(words)  # 'Hello world'

# Using an f-string
age = 30
greeting = f"I am {age} years old."  # 'I am 30 years old.'

Note: We'll discuss these methods in more detail later in this article.

Replication, on the other hand, is another useful operation in Python. It allows you to repeat a string a specified number of times. This is achieved using the * operator. The operand on the left is the string to be repeated, and the operand on the right is the number of times it should be repeated:

# Replicating a string
laugh = "ha"
repeated_laugh = laugh * 3  # 'hahaha'

String replication is particularly useful when you need to create a string with a repeating pattern. It’s a concise way to produce long strings without having to type them out manually.

Note: While concatenating or replicating strings with operators like + and * is convenient for small-scale operations, it’s important to be aware of performance implications.

For concatenating a large number of strings, using join() is generally more efficient as it allocates memory for the new string only once.
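To illustrate, here is a small sketch with an illustrative list of words. Both approaches give the same result, but the loop may create a new intermediate string on each pass, while join() builds the result in one call:

```python
words = ["alpha", "beta", "gamma", "delta"]

# Repeated += may build a new intermediate string on each iteration.
combined = ""
for word in words:
    combined += word + ","
combined = combined.rstrip(",")  # drop the trailing comma

# join() produces the same result in a single call.
joined = ",".join(words)

print(joined)              # alpha,beta,gamma,delta
print(combined == joined)  # True
```

For a handful of strings the difference is negligible; it tends to matter only when many pieces are combined, so it is worth measuring in your own code.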

Slicing Strings

Slicing is a powerful feature in Python that allows you to extract a part of a string, enabling you to obtain substrings. This section will guide you through the basics of slicing strings in Python, including its syntax and some practical examples.

The slicing syntax in Python can be summarized as [start:stop:step], where:

start is the index where the slice begins (inclusive).
stop is the index where the slice ends (exclusive).
step is the number of indices to move forward after each iteration. If omitted, the default value is 1.

Note: Using slicing with indices out of the string's range is safe since Python will handle it gracefully without throwing an error.
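A short sketch of that behavior, using an example string chosen just for illustration, shows how out-of-range slices are clipped while out-of-range indexing is not tolerated:

```python
text = "Python"

# A slice that runs past the end is clipped to the string's length.
clipped = text[2:100]
print(clipped)  # thon

# A slice that starts past the end is simply empty.
empty = text[10:20]
print(empty == "")  # True

# Plain indexing is stricter and raises IndexError.
try:
    text[10]
    index_failed = False
except IndexError as error:
    index_failed = True
    print("IndexError:", error)
```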

To put that into practice, let's take a look at an example. To slice the string "Hello, Stack Abuse!", you specify the start and stop indices within square brackets following the string or variable name. For example, you can extract the first 5 characters by passing 0 as a start and 5 as a stop:

text = "Hello, Stack Abuse!"

# Extracting 'Hello'
greeting = text[0:5]  # 'Hello'

Note: Remember that Python strings are immutable, so slicing a string creates a new string.

If you omit the start index, Python will start the slice from the beginning of the string. Similarly, omitting the stop index will slice all the way to the end:

# From the beginning up to (but not including) index 7
up_to_stack = text[:7]  # 'Hello, '

# From index 7 to the end
from_stack = text[7:]  # 'Stack Abuse!'

You can also use negative indexing here. This is particularly useful for slicing from the end of a string:

# Slicing the last 6 characters
slice_from_end = text[-6:]  # 'Abuse!'

The step parameter allows you to include characters within the slice at regular intervals. This can be used for various creative purposes like string reversal:

# Every second character in the string
every_second = text[::2]  # 'Hlo tc bs!'

# Reversing a string using slicing
reversed_text = text[::-1]  # '!esubA kcatS ,olleH'

String Immutability

String immutability is a fundamental concept in Python, one that has significant implications for how strings are handled and manipulated within the language.

What is String Immutability?

In Python, strings are immutable, meaning once a string is created, it cannot be altered. This might seem counterintuitive, especially for those coming from languages where string modification is common. In Python, when we think we are modifying a string, what we are actually doing is creating a new string.

For example, consider the following scenario:

s = "Hello"
s[0] = "Y"

Attempting to execute this code will result in a TypeError because it tries to change an element of the string, which is not allowed due to immutability.

Why are Strings Immutable?

The immutability of strings in Python offers several advantages:

Security: Since strings cannot be changed, they are safe from being altered through unintended side-effects, which is crucial when strings are used to handle things like database queries or system commands.
Performance: Immutability allows Python to make optimizations under-the-hood. Since a string cannot change, Python can allocate memory more efficiently and perform optimizations related to memory management.
Hashing: Strings are often used as keys in dictionaries. Immutability makes strings hashable, maintaining the integrity of the hash value. If strings were mutable, their hash value could change, leading to incorrect behavior in data structures that rely on hashing, like dictionaries and sets.

How to "Modify" a String in Python?

Since strings cannot be altered in place, "modifying" a string usually involves creating a new string that reflects the desired changes. Here are common ways to achieve this:

Concatenation: Using + to create a new string with additional characters.
Slicing and Rebuilding: Extract parts of the original string and combine them with other strings.
String Methods: Many built-in string methods return new strings with the changes applied, such as .replace(), .upper(), and .lower().

For example:

s = "Hello"
new_s = s[1:]  # new_s is now 'ello'

Here, new_s is a new string created from a substring of s, while the original string s remains unchanged.
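The approaches listed above can be sketched together as follows; the target strings are arbitrary examples. The rejected assignment is caught so the script keeps running, and each "modification" produces a new string:

```python
s = "Hello"

# In-place assignment is rejected because strings are immutable.
try:
    s[0] = "Y"
except TypeError as error:
    print("TypeError:", error)

# Slicing and rebuilding: a new first letter plus the rest of s.
yello = "Y" + s[1:]

# String method: replace() returns a new string.
jello = s.replace("H", "J")

print(s, yello, jello)  # Hello Yello Jello
```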

Common String Methods

Python's string type is equipped with a multitude of useful methods that make string manipulation effortless and intuitive. Being familiar with these methods is essential for efficient and elegant string handling. Let's take a look at a comprehensive overview of common string methods in Python:

upper() and lower() Methods

These methods convert all cased characters in a string to uppercase or lowercase, respectively.

Note: These methods are particularly useful where case uniformity is required, such as when normalizing data or handling case-insensitive user input, and for comparisons, such as search functionality where the case of the input should not affect the outcome.

For example, say you need to convert the user's input to upper case:

user_input = "Hello!"
uppercase_input = user_input.upper()
print(uppercase_input)  # Output: HELLO!

In this example, upper() is called on the string user_input, converting all lowercase letters to uppercase, resulting in HELLO!.

Contrasting upper(), the lower() method transforms all uppercase characters in a string to lowercase. Like upper(), it takes no parameters and returns a new string with all uppercase characters converted to lowercase. For example:

user_input = "HeLLo!"
lowercase_input = user_input.lower()
print(lowercase_input)  # Output: hello!

Here, lower() converts all uppercase letters in user_input to lowercase, resulting in hello!.

capitalize() and title() Methods

The capitalize() method is used to convert the first character of a string to uppercase while making all other characters in the string lowercase. This method is particularly useful in standardizing the format of user-generated input, such as names or titles, ensuring that they follow a consistent capitalization pattern:

text = "python programming"
capitalized_text = text.capitalize()
print(capitalized_text)  # Output: Python programming

In this example, capitalize() is applied to the string text. It converts the first character p to uppercase and all other characters to lowercase, resulting in Python programming.

While capitalize() focuses on the first character of the entire string, title() takes it a step further by capitalizing the first letter of every word in the string. This method is particularly useful in formatting titles, headings, or any text where each word needs to start with an uppercase letter:

text = "python programming basics"
title_text = text.title()
print(title_text)  # Output: Python Programming Basics

Here, title() is used to convert the first character of each word in text to uppercase, resulting in Python Programming Basics.

Note: The title() method capitalizes the first letter of all words in a sentence. Trying to capitalize the sentence "he's the best programmer" will result in "He'S The Best Programmer", which is probably not what you'd want.

To properly convert a sentence to some standardized title case, you'd need to create a custom function!
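One possible sketch of such a function is shown below. The name title_case is hypothetical, and the rule it applies (uppercase only the first character of each whitespace-separated word) is just one simple interpretation of title case. The standard library's string.capwords() behaves similarly, but also lowercases the rest of each word:

```python
import string

def title_case(sentence):
    """Uppercase the first character of each whitespace-separated word."""
    words = sentence.split()
    return " ".join(word[:1].upper() + word[1:] for word in words)

sentence = "he's the best programmer"

print(sentence.title())           # He'S The Best Programmer
print(title_case(sentence))       # He's The Best Programmer
print(string.capwords(sentence))  # He's The Best Programmer
```

Note that both alternatives split on whitespace, so runs of spaces in the input are collapsed into single spaces.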

strip(), rstrip(), and lstrip() Methods

The strip() method is used to remove leading and trailing whitespace from a string. This includes spaces, tabs, newlines, or any combination thereof:

text = "   Hello World!   "
stripped_text = text.strip()
print(stripped_text)  # Output: Hello World!

While strip() removes whitespace from both ends, rstrip() specifically targets the trailing end (right side) of the string:

text = "Hello World!   \n"
rstrip_text = text.rstrip()
print(rstrip_text)  # Output: Hello World!

Here, rstrip() is used to remove the trailing spaces and the newline character from text, leaving Hello World!.

Conversely, lstrip() focuses on the leading end (left side) of the string:

text = "   Hello World!"
lstrip_text = text.lstrip()
print(lstrip_text)  # Output: Hello World!

All-in-all, strip(), rstrip(), and lstrip() are powerful methods for whitespace management in Python strings. Their ability to clean and format strings by removing unwanted spaces makes them indispensable in a wide range of applications, from data cleaning to user interface design.
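These methods are not limited to whitespace: all three accept an optional chars argument naming the characters to remove. The sketch below uses made-up sample strings; keep in mind that chars is treated as a set of characters, not as a prefix or suffix:

```python
raw = "***Hello World!***"

print(raw.strip("*"))   # Hello World!
print(raw.lstrip("*"))  # Hello World!***
print(raw.rstrip("*"))  # ***Hello World!

# Every character listed in chars is stripped from both ends,
# regardless of the order in which it appears.
domain = "www.example.com".strip("w.com")
print(domain)  # example
```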

The split() Method

The split() method breaks up a string at each occurrence of a specified separator and returns a list of the substrings. The separator can be any string, and if it's not specified, the method defaults to splitting at whitespace.

First of all, let's take a look at its syntax:

string.split(sep=None, maxsplit=-1)

Here, sep is the string at which the splits are to be made. If omitted or None, the method splits on runs of whitespace. maxsplit is an optional parameter specifying the maximum number of splits; the default value of -1 means no limit.

For example, let's simply split a sentence into its words:

text = "Computer science is fun"
split_text = text.split()
print(split_text)  # Output: ['Computer', 'science', 'is', 'fun']

As we stated before, you can specify a custom separator to tailor the splitting process to your specific needs. This feature is particularly useful when dealing with structured text data, like CSV files or log entries:

text = "Python,Java,C++"
split_text = text.split(',')
print(split_text)  # Output: ['Python', 'Java', 'C++']

Here, split() uses a comma , as the separator to split the string into different programming languages.

Controlling the Number of Splits

The maxsplit parameter allows you to control the number of splits performed on the string. This can be useful when you only need to split a part of the string and want to keep the rest intact:

text = "one two three four"
split_text = text.split(' ', maxsplit=2)
print(split_text)  # Output: ['one', 'two', 'three four']

In this case, split() only performs two splits at the first two spaces, resulting in a list with three elements.
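One detail worth checking in your own code is that calling split() with no separator is not the same as calling split(' '). A minimal sketch, using a sample string with repeated spaces:

```python
text = "one  two   three"  # two spaces, then three spaces

# No separator: runs of whitespace count as a single delimiter.
default_split = text.split()
print(default_split)  # ['one', 'two', 'three']

# Explicit ' ': every single space is a delimiter,
# so consecutive spaces produce empty strings.
space_split = text.split(" ")
print(space_split)  # ['one', '', 'two', '', '', 'three']
```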

The join() Method

So far, we've seen a lot of Python's extensive string manipulation capabilities. Among these, the join() method stands out as a particularly powerful tool for constructing strings from iterables like lists or tuples.

The join() method is the inverse of the split() method, enabling the concatenation of a sequence of strings into a single string, with a specified separator.

The join() method takes an iterable (like a list or tuple) as a parameter and concatenates its elements into a single string, separated by the string on which join() is called. It has a fairly simple syntax:

separator.join(iterable)

The separator is the string that is placed between each element of the iterable during concatenation and the iterable is the collection of strings to be joined.

For example, let's reconstruct the sentence we split in the previous section using the split() method:

split_text = ['Computer', 'science', 'is', 'fun']
text = ' '.join(split_text)
print(text)  # Output: Computer science is fun

In this example, the join() method is used with a space ' ' as the separator to concatenate the list of words into a sentence.

The flexibility of choosing any string as a separator makes join() incredibly versatile. It can be used to construct strings with specific formatting, like CSV lines, or to add specific separators, like newlines or commas:

languages = ["Python", "Java", "C++"]
csv_line = ','.join(languages)
print(csv_line)  # Output: Python,Java,C++

Here, join() is used with a comma , to create a string that resembles a line in a CSV file.
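Keep in mind that join() expects every element of the iterable to be a string. As a small sketch with an illustrative list of numbers, the direct call fails, while converting each element with str() first works:

```python
numbers = [1, 2, 3]

# join() raises TypeError when an element is not a string.
try:
    ",".join(numbers)
    join_failed = False
except TypeError as error:
    join_failed = True
    print("TypeError:", error)

# Converting each element to a string first solves the problem.
csv_line = ",".join(str(n) for n in numbers)
print(csv_line)  # 1,2,3
```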

Efficiency of join()

One of the key advantages of join() is its efficiency, especially when compared to string concatenation using the + operator. When dealing with large numbers of strings, join() is significantly more performant and is the preferred method in Python for concatenating multiple strings.

The replace() Method

The replace() method replaces occurrences of a specified substring (old) with another substring (new). It can be used to replace all occurrences or a specified number of occurrences, making it highly adaptable for various text manipulation needs.

Take a look at its syntax:

string.replace(old, new[, count])

old is the substring that needs to be replaced.
new is the substring that will replace the old substring.
count is an optional parameter specifying the number of replacements to be made. If omitted, all occurrences of the old substring are replaced.

For example, let's change the word "World" to "Stack Abuse" in the string "Hello, World":

text = "Hello, World"
replaced_text = text.replace("World", "Stack Abuse")
print(replaced_text)  # Output: Hello, Stack Abuse

The previously mentioned count parameter allows for more controlled replacements. It limits the number of times the old substring is replaced by the new substring:

text = "cats and dogs and birds and fish"
replaced_text = text.replace("and", "&", 2)
print(replaced_text)  # Output: cats & dogs & birds and fish

Here, replace() is used to replace the first two occurrences of "and" with "&", leaving the third occurrence unchanged.

find() and rfind() Methods

find() returns the lowest index in the string where the substring sub is found, while rfind() returns the highest such index, that is, the start of the last occurrence. Both return -1 if the substring is not found.

Note: These methods are particularly useful when the presence of the substring is uncertain, and you wish to avoid handling exceptions. Also, the return value of -1 can be used in conditional statements to execute different code paths based on the presence or absence of a substring.

Python's string manipulation suite includes the find() and rfind() methods, which are crucial for locating substrings within a string. Similar to index() and rindex(), these methods search for a substring but differ in their response when the substring is not found. Understanding these methods is essential for tasks like text analysis, data extraction, and general string processing.

The find() Method

The find() method returns the lowest index of the substring if it is found in the string. Unlike index(), it returns -1 if the substring is not found, making it a safer option for situations where the substring might not be present.

It follows a simple syntax with one mandatory and two optional parameters:

string.find(sub[, start[, end]])

sub is the substring to be searched within the string.
start and end are optional parameters specifying the range within the string where the search should occur.

For example, let's take a look at a string that contains multiple instances of the substring "is":

text = "Python is fun, just as JavaScript is"

Now, let's locate the first occurrence of the substring "is" in the text:

find_position = text.find("is")
print(find_position)  # Output: 7

In this example, find() locates the substring "is" in text and returns the starting index of the first occurrence, which is 7.

While find() searches from the beginning of the string, rfind() searches from the end. It returns the highest index where the specified substring is found or -1 if the substring is not found:

text = "Python is fun, just as JavaScript is"
rfind_position = text.rfind("is")
print(rfind_position)  # Output: 34

Here, rfind() locates the last occurrence of "is" in text and returns its starting index, which is 34.
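As a brief sketch of the points above, the snippet below uses the -1 return value in a condition and passes the optional start parameter to skip the first match; the searched substrings are arbitrary examples:

```python
text = "Python is fun, just as JavaScript is"

# -1 signals "not found", so it can drive a simple condition.
for word in ("Java", "Ruby"):
    position = text.find(word)
    if position != -1:
        print(word, "found at index", position)
    else:
        print(word, "not found")

# Starting the search at index 8 skips the first "is" at index 7.
next_is = text.find("is", 8)
print(next_is)  # 34
```

Be careful not to write `if text.find(word):` here, since a match at index 0 is falsy and -1 is truthy.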

index() and rindex() Methods

The index() method is used to find the first occurrence of a specified value within a string. It's a straightforward way to locate a substring in a larger string. It has pretty much the same syntax as the find() method we discussed earlier:

string.index(sub[, start[, end]])

sub is the substring to search for in the string. start is an optional parameter giving the index where the search begins, and end is another optional parameter giving the index where the search ends.

Let's take a look at the example we used to illustrate the find() method:

text = "Python is fun, just as JavaScript is"
result = text.index("is")
print("Substring found at index:", result)

As you can see, the output will be the same as when using the find():

Substring found at index: 7

Note: The key difference between find()/rfind() and index()/rindex() lies in their handling of substrings that are not found. While index() and rindex() raise a ValueError, find() and rfind() return -1, which can be more convenient in scenarios where the absence of a substring is a common and non-exceptional case.

While index() searches from the beginning of the string, rindex() serves a similar purpose but starts the search from the end of the string (similar to rfind()). It finds the last occurrence of the specified substring:

text = "Python is fun, just as JavaScript is"
result = text.rindex("is")
print("Last occurrence of 'is' is at index:", result)

This will give you:

Last occurrence of 'is' is at index: 34
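To make the difference in error handling concrete, here is a minimal sketch that searches for a substring that is not present ("Ruby" is just an example). index() has to be wrapped in try/except, while find() reports the miss through its return value:

```python
text = "Python is fun, just as JavaScript is"

# index() raises ValueError when the substring is missing.
try:
    position = text.index("Ruby")
except ValueError:
    position = None
print(position)  # None

# find() returns -1 instead of raising.
found = text.find("Ruby")
print(found)  # -1
```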

startswith() and endswith() Methods

These methods return True if the string starts or ends with the specified prefix or suffix, respectively, and False otherwise.

The startswith() method is used to check if a string starts with a specified substring. It's a straightforward and efficient way to perform this check. As usual, let's first check out the syntax before we illustrate the usage of the method in a practical example:

str.startswith(prefix[, start[, end]])

prefix: The substring that you want to check for at the beginning of the string.
start (optional): The starting index within the string where the check begins.
end (optional): The ending index within the string where the check ends.

For example, let's check if the file name starts with the word example:

filename = "example-file.txt"
if filename.startswith("example"):
    print("The filename starts with 'example'.")

Here, since the filename starts with the word example, you'll get the message printed out:

The filename starts with 'example'.

On the other hand, the endswith() method checks if a string ends with a specified substring:

filename = "example-report.pdf"
if filename.endswith(".pdf"):
    print("The file is a PDF document.")

Since the filename does, indeed, end with .pdf, you'll get the following output:

The file is a PDF document.

Note: Here, it's important to note that both methods are case-sensitive. For case-insensitive checks, the string should first be converted to a common case (either lower or upper) using lower() or upper() methods.
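As a quick sketch of the case-insensitive check described in the note (the file name is made up for illustration), you can normalize the string with lower() before calling endswith():

```python
filename = "Annual-Report.PDF"

# A direct check fails because the case differs
case_sensitive = filename.endswith(".pdf")

# Normalizing the string first makes the check case-insensitive
case_insensitive = filename.lower().endswith(".pdf")

print(case_sensitive, case_insensitive)
```

The same approach works for startswith(); just make sure the prefix or suffix you pass in uses the same case you normalized to.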

As you saw in the previous examples, both startswith() and endswith() are commonly used in conditional statements to guide the flow of a program based on the presence or absence of specific prefixes or suffixes in strings.

The count() Method

The count() method is used to count the number of occurrences of a substring in a given string. The syntax of the count() method is:

str.count(sub[, start[, end]])

Where:

sub is the substring for which the count is required.
start (optional) is the starting index from where the count begins.
end (optional) is the ending index where the count ends.

The return value is the number of occurrences of sub in the range start to end.

For example, consider a simple scenario where you need to count the occurrences of a word in a sentence:

text = "Python is amazing. Python is simple. Python is powerful."
count = text.count("Python")
print("Python appears", count, "times")

This will confirm that the word "Python" appears 3 times in the string text:

Python appears 3 times

Note: Like most string methods in Python, count() is case-sensitive. For case-insensitive counts, convert the string and the substring to a common case using lower() or upper().

If you don't need to search an entire string, the start and end parameters are useful for narrowing down the search within a specific part:

quote = "To be, or not to be, that is the question."
# Count occurrences of 'be' in the substring from index 10 to 30
count = quote.count("be", 10, 30)
print("'be' appears", count, "times between index 10 and 30")

Note: The method counts non-overlapping occurrences. This means that in the string "ababa", the count for the substring "aba" will be 1, not 2.
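If you do need overlapping matches, count() alone won't do. The following is one possible sketch, not the only way to do it; count_overlapping is a hypothetical helper that repeatedly calls find() and resumes one character after each match:

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps (hypothetical helper)."""
    if not sub:
        raise ValueError("sub must not be empty")
    total = 0
    start = text.find(sub)
    while start != -1:
        total += 1
        # Resume one character later so overlapping matches are seen
        start = text.find(sub, start + 1)
    return total


non_overlapping = "ababa".count("aba")
overlapping = count_overlapping("ababa", "aba")
print(non_overlapping, overlapping)
```

Here count() reports 1 match, while the helper reports 2 because the second "aba" starts inside the first one.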

isalpha(), isdigit(), isnumeric(), and isalnum() Methods

Python string methods offer a variety of ways to inspect and categorize string content. Among these, the isalpha(), isdigit(), isnumeric(), and isalnum() methods are commonly used for checking the character composition of strings.

First of all, let's discuss the isalpha() method. You can use it to check whether all characters in a string are alphabetic (i.e., letters of the alphabet):

word = "Python"
if word.isalpha():
    print("The string contains only letters.")

This method returns True if all characters in the string are alphabetic and there is at least one character. Otherwise, it returns False.

The second method to discuss is the isdigit() method. It checks if all characters in the string are digits:

number = "12345"
if number.isdigit():
    print("The string contains only digits.")

The isnumeric() method is similar to isdigit(), but it also accepts characters that have a numeric value without being digits in the strict sense, such as vulgar fractions (½), Roman numerals, and numerals from other writing systems. Superscript digits such as ² are accepted by both methods:

num = "Ⅴ"  # Roman numeral for 5
if num.isnumeric():
    print("The string contains numeric characters.")
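To see where the two methods agree and where they differ, here is a small sketch that runs both on a handful of sample characters (the samples are chosen only for illustration):

```python
samples = ["5", "²", "½", "Ⅴ"]

# Map each character to its (isdigit, isnumeric) result
results = {ch: (ch.isdigit(), ch.isnumeric()) for ch in samples}

for ch, (is_digit, is_numeric) in results.items():
    print(f"{ch!r}: isdigit={is_digit}, isnumeric={is_numeric}")
```

The plain digit and the superscript digit pass both checks, while the fraction and the Roman numeral pass only isnumeric().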

Last, but not least, the isalnum() method checks if the string consists only of alphanumeric characters (i.e., letters and digits):

string = "Python3"
if string.isalnum():
    print("The string is alphanumeric.")

Note: The isalnum() method returns False if the string contains any special characters or whitespace.

The isspace() Method

The isspace() method is designed to check whether a string consists only of whitespace characters. It returns True if all characters in the string are whitespace characters and there is at least one character. If the string is empty or contains any non-whitespace characters, it returns False.

Note: Whitespace characters include spaces ( ), tabs (\t), newlines (\n), and similar space-like characters that are often used to format text.

The syntax of the isspace() method is pretty straightforward:

str.isspace()

To illustrate the usage of the isspace() method, consider an example where you might need to check if a string is purely whitespace:

text = "   \t\n  "
if text.isspace():
    print("The string contains only whitespace characters.")

When validating user inputs in forms or command-line interfaces, checking for strings that contain only whitespace helps in ensuring meaningful input is provided.

Remember: The isspace() returns False for empty strings. If your application requires checking for both empty strings and strings with only whitespace, you'll need to combine checks.
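One way to combine the two checks is sketched below. The is_blank name is a hypothetical helper, not something built into Python:

```python
def is_blank(value):
    """Return True for empty or whitespace-only strings (hypothetical helper)."""
    # `not value` covers the empty string, isspace() covers whitespace-only input
    return not value or value.isspace()


print(is_blank(""))        # empty string
print(is_blank("  \t\n"))  # whitespace only
print(is_blank(" hi "))    # contains real content
```

An equivalent and slightly shorter test is `not value.strip()`, at the cost of building a stripped copy of the string.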

The format() Method

The format() method, introduced in Python 2.6 and available in every version of Python 3, provides a versatile approach to string formatting. It allows for the insertion of variables into string placeholders, offering more readability and flexibility compared to the older % formatting. In this section, we'll take a brief look at the method, and we'll discuss it in more detail in later sections.

The format() method works by replacing curly-brace {} placeholders within the string with parameters provided to the method:

"string with {} placeholders".format(values)

For example, assume you need to insert username and age into a preformatted string. The format() method comes in handy:

name = "Alice"
age = 30
greeting = "Hello, my name is {} and I am {} years old.".format(name, age)
print(greeting)

This will give you:

Hello, my name is Alice and I am 30 years old.

The format() method supports a variety of advanced features, such as named parameters, formatting numbers, aligning text, and so on, but we'll discuss them later in the "Formatting Strings in Python" section.

The format() method is ideal for creating strings with dynamic content, such as user input, results from computations, or data from databases. It can also help you internationalize your application since it separates the template from the data.

center(), ljust(), and rjust() Methods

Python's string methods include various functions for aligning text. The center(), ljust(), and rjust() methods are particularly useful for formatting strings in a fixed width field. These methods are commonly used in creating text-based user interfaces, reports, and for ensuring uniformity in the visual presentation of strings.

The center() method centers a string in a field of a specified width:

str.center(width[, fillchar])

Here, the width parameter represents the total width of the resulting string, including the original string. The optional fillchar parameter is the character used to fill the remaining space (it defaults to a space if not provided).

Note: Ensure the width specified is greater than the length of the original string to see the effect of these methods.

For example, simply printing text using print("Sample text") will result in:

Sample text

But if you wanted to center the text over the field of, say, 20 characters, you'd have to use the center() method:

title = "Sample text"
centered_title = title.center(20, '-')
print(centered_title)

This will result in:

----Sample text-----

Similarly, the ljust() and rjust() methods will align text to the left and right, padding it with a specified character (or space by default) on the right or left, respectively:

# ljust()
name = "Alice"
left_aligned = name.ljust(10, '*')
print(left_aligned)

# rjust()
amount = "100"
right_aligned = amount.rjust(10, '0')
print(right_aligned)

This will give you:

Alice*****
0000000100

The first line is the result of ljust(), and the second is the result of rjust().

Using these methods can help you align text in columns when displaying data in tabular format. They are also useful in text-based user interfaces, where they help maintain a structured and visually appealing layout.
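As a rough sketch of the tabular use case, the snippet below combines ljust() and rjust() to line up two columns. The data and column widths are made up for illustration:

```python
rows = [("Apples", 3), ("Kiwis", 12), ("Bananas", 150)]

# Left-align names in a 10-character field, right-align quantities in 5
lines = [name.ljust(10, ".") + str(qty).rjust(5) for name, qty in rows]

for line in lines:
    print(line)
```

Every line ends up 15 characters wide, so the quantities share the same right edge regardless of the length of the name.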

The zfill() Method

The zfill() method adds zeros (0) at the beginning of the string, until it reaches the specified length. If the original string is already equal to or longer than the specified length, zfill() returns the original string.

The basic syntax of the zfill() method is:

str.zfill(width)

Where the width is the desired length of the string after padding with zeros.

Note: Choose a width that accommodates the longest anticipated string to avoid unexpected results.

Here’s how you can use the zfill() method:

number = "50"
formatted_number = number.zfill(5)
print(formatted_number)

This will output 00050, padding the original string "50" with three zeros to achieve a length of 5.

Since zfill() is a string method, numbers need to be converted to strings before applying it. For example, use str(42).zfill(5). The method also works on non-numeric strings, though its primary use case is padding numbers.

Note: If the string starts with a sign prefix (+ or -), the zeros are added after the sign. For example, "-42".zfill(5) results in "-0042".
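The following short sketch puts these rules side by side, covering an unsigned value, both sign prefixes, and a string that is already longer than the requested width:

```python
values = ["42", "-42", "+42", "123456"]

# Pad every value to a width of 5; the sign, if present, stays in front
padded = [value.zfill(5) for value in values]

for original, result in zip(values, padded):
    print(f"{original!r} -> {result!r}")
```

Note that the sign counts toward the width, which is why "-42" gets only two zeros.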

The swapcase() Method

The swapcase() method iterates through each character in the string, changing each uppercase character to lowercase and each lowercase character to uppercase.

It leaves characters that are neither (like digits or symbols) unchanged.

Take a quick look at an example to demonstrate the swapcase() method:

text = "Python is FUN!"
swapped_text = text.swapcase()
print(swapped_text)

This will output "pYTHON IS fun!", with all uppercase letters converted to lowercase and vice versa.

Warning: In some languages, the concept of case may not apply as it does in English, or the rules might be different. Be cautious when using swapcase() with internationalized text.

The partition() and rpartition() Methods

The partition() and rpartition() methods split a string into three parts: the part before the separator, the separator itself, and the part after the separator. The partition() searches a string from the beginning, and the rpartition() starts searching from the end of the string:

# Syntax of the partition() and rpartition() methods
str.partition(separator)
str.rpartition(separator)

Here, the separator parameter is the string at which the split will occur.

Both methods are handy when you need to check if a separator exists in a string and then process the parts accordingly.

To illustrate the difference between these two methods, let's take a look at the following string and how these methods process it:

text = "Python:Programming:Language"

First, let's take a look at the partition() method:

part = text.partition(":")
print(part)

This will output ('Python', ':', 'Programming:Language').

Now, notice how the output differs when we're using the rpartition():

r_part = text.rpartition(":")
print(r_part)

This will output ('Python:Programming', ':', 'Language').

No Separator Found: If the separator is not found, partition() returns the original string as the first element of the tuple, followed by two empty strings, while rpartition() returns two empty strings followed by the original string.
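Here is a small sketch of the not-found case, along with one way to use the middle element of the tuple to detect whether the separator was present. The split_key_value function is a hypothetical helper written for this example:

```python
line = "no-separator-here"

head = line.partition(":")   # original string comes first
tail = line.rpartition(":")  # original string comes last
print(head)
print(tail)


def split_key_value(entry):
    """Split 'key=value' pairs; return None if there is no '=' (hypothetical helper)."""
    key, sep, value = entry.partition("=")
    # An empty separator element means '=' was not found
    return (key, value) if sep else None


print(split_key_value("lang=Python"))
print(split_key_value("lang"))
```

Because the tuple always has three elements, unpacking it never fails, which makes partition() convenient for this kind of parsing.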

The encode() Method

Dealing with different character encodings is a common requirement, especially when working with text data from various sources or interacting with external systems. The encode() method is designed to help you out in these scenarios. It converts a string into a bytes object using a specified encoding, such as UTF-8, which is essential for data storage, transmission, and processing in different formats.

The encode() method encodes the string using the specified encoding scheme. The most common encoding is UTF-8, but Python supports many others, like ASCII, Latin-1, and so on.

The encode() simply accepts two parameters, encoding and errors:

str.encode(encoding="utf-8", errors="strict")

encoding specifies the encoding to be used for encoding the string and errors determines the response when the encoding conversion fails.

Note: Common values for the errors parameter are 'strict', 'ignore', and 'replace'.

Here's an example of converting a string to bytes using UTF-8 encoding:

text = "Python Programming"
encoded_text = text.encode()  # Default is UTF-8
print(encoded_text)

This will output something like b'Python Programming', representing the byte representation of the string.

Note: In Python, byte strings (b-strings) are sequences of bytes. Unlike regular strings, which are used to represent text and consist of characters, byte strings are raw data represented in bytes.

Error Handling

The errors parameter defines how to handle errors during encoding:

'strict': Raises a UnicodeEncodeError on failure (default behavior).
'ignore': Ignores characters that cannot be encoded.
'replace': Replaces unencodable characters with a replacement marker, such as ?.

Choose an error handling strategy that suits your application. In most cases, 'strict' is preferable to avoid data loss or corruption.
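To illustrate the three strategies, here is a minimal sketch that encodes a string containing a non-ASCII character to ASCII. The sample text is arbitrary:

```python
text = "café"

# 'ignore' silently drops the character that ASCII cannot represent
ignored = text.encode("ascii", errors="ignore")

# 'replace' substitutes a question mark for it
replaced = text.encode("ascii", errors="replace")

# 'strict' (the default) raises UnicodeEncodeError instead
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored, replaced, strict_failed)
```

Both 'ignore' and 'replace' lose information, so the original string cannot be recovered from the resulting bytes.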

The expandtabs() Method

This method is often overlooked but can be incredibly useful when dealing with strings containing tab characters (\t).

The expandtabs() method is used to replace tab characters (\t) in a string with the appropriate number of spaces. This is especially useful in formatting output in a readable way, particularly when dealing with strings that come from or are intended for output in a console or a text file.

Let's take a quick look at its syntax:

str.expandtabs(tabsize=8)

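Here, tabsize is an optional argument. If it's not specified, Python defaults to a tab size of 8. Note that a tab is not replaced by a fixed number of spaces: each tab character is replaced by as many spaces as are needed to reach the next tab stop, that is, the next column that is a multiple of tabsize. You can customize tabsize to any value that fits your needs.

For example, say you want tab stops every 4 columns:

text = "Name\tAge\tCity"
print(text.expandtabs(4))

This will give you:

Name    Age City

Since "Name" already fills 4 columns, the first tab expands to 4 spaces, while the second tab needs only 1 space to move from column 11 to column 12.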
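Keep in mind that expandtabs() pads each tab up to the next tab stop rather than inserting a fixed number of spaces. The sketch below uses a made-up string whose fields have different lengths to show this:

```python
row = "a\tbb\tccc"

# With tabsize=4, tab stops sit at columns 4, 8, 12, ...
expanded = row.expandtabs(4)
print(repr(expanded))

# The same rule applied to the header example
header = "Name\tAge\tCity".expandtabs(4)
print(repr(header))
```

In the first string, the tab after "a" becomes 3 spaces and the tab after "bb" becomes 2, so both fields end at a multiple of 4.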

islower(), isupper(), and istitle() Methods

These methods check if the string is in lowercase, uppercase, or title case, respectively.

islower() is a string method used to check if all cased characters in the string are lowercase. It returns True if all cased characters are lowercase and there is at least one cased character, otherwise, it returns False:

a = "hello world"
b = "Hello World"
c = "hello World!"

print(a.islower())  # Output: True
print(b.islower())  # Output: False
print(c.islower())  # Output: False

In contrast, isupper() checks if all cased characters in a string are uppercase. It returns True if all cased characters are uppercase and there is at least one cased character, otherwise, False:

a = "HELLO WORLD"
b = "Hello World"
c = "HELLO world!"

print(a.isupper())  # Output: True
print(b.isupper())  # Output: False
print(c.isupper())  # Output: False

Finally, the istitle() method checks if the string is titlecased. A string is considered titlecased if all words in the string start with an uppercase character and the rest of the characters in the word are lowercase:

a = "Hello World"
b = "Hello world"
c = "HELLO WORLD"

print(a.istitle())  # Output: True
print(b.istitle())  # Output: False
print(c.istitle())  # Output: False

The casefold() Method

The casefold() method is used for case-insensitive string matching. It is similar to the lower() method but more aggressive. The casefold() method removes all case distinctions present in a string. It is used for caseless matching, meaning it effectively ignores cases when comparing two strings.

A classic example where casefold() matches two strings while lower() doesn't involves characters from languages that have more complex case rules than English. One such scenario is with the German letter "ß", which is a lowercase letter. Its uppercase equivalent is "SS".

To illustrate this, consider two strings, one containing "ß" and the other containing "SS":

str1 = "straße"
str2 = "STRASSE"

Now, let's apply both lower() and casefold() methods and compare the results:

# Using `lower()`:
print(str1.lower() == str2.lower())  # Output: False

In this case, lower() simply converts all characters in str2 to lowercase, resulting in "strasse". However, "strasse" is not equal to "straße", so the comparison yields False.

Now, let's compare that to how the casefold() method handles this scenario:

# Using `casefold()`:
print(str1.casefold() == str2.casefold())  # Output: True

Here, casefold() converts "ß" in str1 to "ss", making it "strasse". This matches with str2 after casefold(), which also results in "strasse". Therefore, the comparison yields True.
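If you need this kind of comparison in several places, you could wrap it in a small function. The equals_ignore_case name below is a hypothetical helper, shown only as a sketch:

```python
def equals_ignore_case(left, right):
    """Compare two strings while ignoring case (hypothetical helper)."""
    return left.casefold() == right.casefold()


print(equals_ignore_case("straße", "STRASSE"))
print(equals_ignore_case("Python", "PYTHON"))
print(equals_ignore_case("Python", "Java"))
```

Note that the casefolded strings are meant for comparison only, not for display.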

Formatting Strings in Python

String formatting is an essential aspect of programming in Python, offering a powerful way to create and manipulate strings dynamically. It's a technique used to construct strings by dynamically inserting variables or expressions into placeholders within a string template.

String formatting in Python has evolved significantly over time, providing developers with more intuitive and efficient ways to handle strings. The oldest method of string formatting in Python, borrowed from C, is the % operator (printf-style string formatting). It uses the % operator to replace placeholders with values. While this method is still in use, it is less preferred because it gets verbose and hard to read when handling complex formats.

The first advancement was introduced in Python 2.6 in the form of str.format() method. This method offered a more powerful and flexible way of formatting strings. It uses curly braces {} as placeholders which can include detailed formatting instructions. It also introduced the support for positional and keyword arguments, making the string formatting more readable and maintainable.

Finally, Python 3.6 introduced a more concise and readable way to format strings in the form of formatted string literals, or f-strings for short. They allow for inline expressions, which are evaluated at runtime.

With f-strings, the syntax is more straightforward, and the code is generally faster than the other methods.

Basic String Formatting Techniques

Now that you understand the evolution of the string formatting techniques in Python, let's dive deeper into each of them. In this section, we'll quickly go over the % operator and the str.format() method, and, in the end, we'll dive into the f-strings.

The `%` Operator

The % operator, often referred to as the printf-style string formatting, is one of the oldest string formatting techniques in Python. It's inspired by the C programming language:

name = "John"
age = 36
print("Name: %s, Age: %d" % (name, age))

This will give you:

Name: John, Age: 36

As in C, %s is used for strings, %d or %i for integers, and %f for floating-point numbers.

This string formatting method can be less intuitive and harder to read. It's also less flexible than the newer methods.
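For completeness, here is a brief sketch of a few more % conversions, including precision and width. The values are arbitrary:

```python
# Limit a float to two decimal places
formatted_float = "%.2f" % 3.14159

# Right-align an integer in a field of width 5
padded_int = "%5d" % 42

# A single value can be passed as a one-element tuple
greeting = "Hello, %s!" % ("World",)

print(formatted_float, repr(padded_int), greeting)
```

Wrapping a single value in a tuple is optional in most cases, but it avoids surprises when the value itself happens to be a tuple.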

The `str.format()` Method

As we said in the previous sections, at its core, str.format() is designed to inject values into string placeholders, defined by curly braces {}. The method takes any number of parameters and positions them into the placeholders in the order they are given. Here's a basic example:

name = "Bob"
age = 25
print("Name: {}, Age: {}".format(name, age))

This code will output: Name: Bob, Age: 25

str.format() becomes more powerful with positional and keyword arguments. Positional arguments are placed in order according to their position (starting from 0, sure thing):

template = "{1} is a {0}."
print(template.format("programming language", "Python"))

Since the "Python" is the second argument of the format() method, it replaces the {1} and the first argument replaces the {0}:

Python is a programming language.

Keyword arguments, on the other hand, add a layer of readability by allowing you to assign values to named placeholders:

template = "{language} is a {description}."
print(template.format(language="Python", description="programming language"))

This will also output: Python is a programming language.

One of the most compelling features of str.format() is its formatting capabilities. You can control number formatting, alignment, width, and more. First, let's format a decimal number so it has only two decimal places:

# Formatting numbers
num = 123.456793
print("Formatted number: {:.2f}".format(num))

Here, the format() formats the number with six decimal places down to two:

Formatted number: 123.46

Now, let's take a look at how to align text using the format() method:

# Aligning text
text = "Align me"
print("Left: {:<10} | Right: {:>10} | Center: {:^10}".format(text, text, text))

Using the curly braces syntax of the format() method, we aligned text in fields of length 10. We used :< to align left, :> to align right, and :^ to center text:

Left: Align me   | Right:   Align me | Center:  Align me

For more complex formatting needs, str.format() can handle nested fields, object attributes, and even dictionary keys:

# Nested fields
point = (2, 8)
print("X: {0[0]} | Y: {0[1]}".format(point))
# > Output: 'X: 2 | Y: 8'

# Object attributes
class Dog:
    breed = "Beagle"
    name = "Buddy"

dog = Dog()
print("Meet {0.name}, the {0.breed}.".format(dog))
# > Output: 'Meet Buddy, the Beagle.'

# Dictionary keys
info = {'name': 'Alice', 'age': 30}
print("Name: {name} | Age: {age}".format(**info))
# > Output: 'Name: Alice | Age: 30'
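Note that the dictionary example above unpacks the dictionary into keyword arguments. As an alternative sketch, str.format() can also look up keys inside the placeholder itself, using the same square-bracket syntax as the tuple example (the key is written without quotes):

```python
info = {"name": "Alice", "age": 30}

# Look up dictionary keys directly inside the replacement fields
by_key = "Name: {0[name]} | Age: {0[age]}".format(info)

# Equivalent result using keyword unpacking
by_unpacking = "Name: {name} | Age: {age}".format(**info)

print(by_key)
```

Both approaches produce the same string; key lookup can be handy when you pass several dictionaries to a single format() call.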

Introduction to f-strings

To create an f-string, prefix your string literal with f or F before the opening quote. This signals Python to parse any {} curly braces and the expressions they contain:

name = "Charlie"
greeting = f"Hello, {name}!"
print(greeting)

Output: Hello, Charlie!

One of the key strengths of f-strings is their ability to evaluate expressions inline. This can include arithmetic operations, method calls, and more:

age = 25
age_message = f"In 5 years, you will be {age + 5} years old."
print(age_message)

Output: In 5 years, you will be 30 years old.

Like str.format(), f-strings provide powerful formatting options. You can format numbers, align text, and control precision all within the curly braces:

price = 49.99
print(f"Price: {price:.2f} USD")

score = 85.333
print(f"Score: {score:.1f}%")

Output:

Price: 49.99 USD
Score: 85.3%
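Alignment works the same way as in str.format(). As a minimal sketch with made-up values, the alignment, width, and precision all go after the colon inside the braces:

```python
item = "Coffee"
price = 3.5

# Left-align the item in 10 columns, right-align the price in 8 with
# two decimals, and center a short status label in 6 columns
row = f"|{item:<10}|{price:>8.2f}|{'ok':^6}|"
print(row)
```

The vertical bars are there only to make the field boundaries visible in the output.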

Advanced String Formatting with f-strings

In the previous section, we touched on some of these concepts, but, here, we'll dive deeper and explain them in more detail.

Multi-line f-strings

A less commonly discussed, but incredibly useful feature of f-strings is their ability to span multiple lines. This capability makes them ideal for constructing longer and more complex strings. Let's dive into how multi-line f-strings work and explore their practical applications.

A multi-line f-string allows you to spread a string over several lines, maintaining readability and organization in your code. Here’s how you can create a multi-line f-string:

name = "Brian"
profession = "Developer"
location = "New York"

bio = (f"Name: {name}\n"
       f"Profession: {profession}\n"
       f"Location: {location}")

print(bio)

Running this will result in:

Name: Brian
Profession: Developer
Location: New York

Why Use Multi-line f-strings? Multi-line f-strings are particularly useful in scenarios where you need to format long strings or when dealing with strings that naturally span multiple lines, like addresses, detailed reports, or complex messages. They help in keeping your code clean and readable.

Alternatively, you could use string concatenation with the + operator to create multi-line strings, but multi-line f-strings are usually more efficient and readable. Python joins adjacent string literals inside the parentheses into a single string when the code is compiled, whereas concatenation with + creates intermediate string objects at runtime.

Indentation and Whitespace

In multi-line f-strings, you need to be mindful of whitespace: any spaces inside the quotes are preserved in the output, while the indentation of the source lines outside the quotes is not:

name = "Alice"

message = (
    f"Dear {name},\n"
    f"    Thank you for your interest in our product. "
    f"We look forward to serving you.\n"
    f"Best Regards,\n"
    f"    The Team"
)

print(message)

This will give you:

Dear Alice,
    Thank you for your interest in our product. We look forward to serving you.
Best Regards,
    The Team
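Whitespace matters even more with triple-quoted f-strings, where the indentation of the source code becomes part of the string. The sketch below shows one way to deal with that using textwrap.dedent() from the standard library; build_note is a hypothetical function used only for this example:

```python
import textwrap


def build_note(name):
    """Build a short note from a triple-quoted f-string (hypothetical helper)."""
    # Everything between the triple quotes, including indentation, is kept
    raw = f"""
        Dear {name},
            Thank you for your interest.
        The Team
    """
    # dedent() removes the common leading whitespace,
    # strip() drops the blank lines at both ends
    return textwrap.dedent(raw).strip()


note = build_note("Alice")
print(note)
```

Only the indentation shared by all lines is removed, so the extra indentation of the second line survives.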

Complex Expressions Inside f-strings

Python's f-strings not only simplify the task of string formatting but also introduce an elegant way to embed complex expressions directly within string literals. This powerful feature enhances code readability and efficiency, particularly when dealing with intricate operations.

Embedding Expressions

An f-string can incorporate any valid Python expression within its curly braces. This includes arithmetic operations, method calls, and more:

import math

radius = 7
area = f"The area of the circle is: {math.pi * radius ** 2:.2f}"
print(area)

This will calculate the area of a circle with a radius of 7:

The area of the circle is: 153.94

Calling Functions and Methods

F-strings become particularly powerful when you embed function calls directly into them. This can streamline your code and enhance readability:

def get_temperature():
    return 22.5

weather_report = f"The current temperature is {get_temperature()}°C."
print(weather_report)

This will give you:

The current temperature is 22.5°C.

Inline Conditional Logic

You can even use conditional expressions within f-strings, allowing for dynamic string content based on certain conditions:

score = 85
grade = f"You {'passed' if score >= 60 else 'failed'} the exam."
print(grade)

Since the score is greater than 60, this will output: You passed the exam.

List Comprehensions

F-strings can also incorporate list comprehensions, making it possible to generate dynamic lists and include them in your strings:

numbers = [1, 2, 3, 4, 5]
squared = f"Squared numbers: {[x**2 for x in numbers]}"
print(squared)

This will yield:

Squared numbers: [1, 4, 9, 16, 25]

Nested f-strings

For more advanced formatting needs, you can nest f-strings within each other. This is particularly useful when you need to format a part of the string differently:

name = "Bob"
age = 30
profile = f"Name: {name}, Age: {f'{age} years old' if age else 'Age not provided'}"
print(profile)

Here, we independently formatted how the Age section will be displayed: Name: Bob, Age: 30 years old

Handling Exceptions

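An f-string can't catch an exception by itself, but you can use a conditional expression inside it to avoid an operation that would raise one. Since only the selected branch is evaluated, the division below is never attempted when y is 0. Use this cautiously to maintain code clarity: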

x = 5
y = 0
result = f"Division result: {x / y if y != 0 else 'Error: Division by zero'}"
print(result)
# Output: 'Division result: Error: Division by zero'
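When the failure can't be ruled out with a simple condition, a regular try/except block outside the f-string is usually clearer. Here is a minimal sketch, where describe_division is a hypothetical helper:

```python
def describe_division(x, y):
    """Format the result of x / y, or an error message (hypothetical helper)."""
    try:
        outcome = x / y
    except ZeroDivisionError:
        outcome = "Error: Division by zero"
    # The f-string only formats the value computed above
    return f"Division result: {outcome}"


print(describe_division(5, 0))
print(describe_division(6, 3))
```

This keeps the error handling separate from the formatting, which tends to be easier to read and test.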

Conditional Logic and Ternary Operations in Python f-strings

We briefly touched on this topic in the previous section, but, here, we'll go into more detail. This functionality is particularly useful when you need to dynamically change the content of a string based on certain conditions.

As we previously discussed, the ternary operator in Python, which follows the format x if condition else y, can be seamlessly integrated into f-strings. This allows for inline conditional checks and dynamic string content:

age = 20
age_group = f"{'Adult' if age >= 18 else 'Minor'}"
print(f"Age Group: {age_group}")
# Output: 'Age Group: Adult'

You can also use ternary operations within f-strings for conditional formatting. This is particularly useful for changing the format of the string based on certain conditions:

score = 75
result = f"Score: {score} ({'Pass' if score >= 50 else 'Fail'})"
print(result)
# Output: 'Score: 75 (Pass)'

Besides handling basic conditions, ternary operations inside f-strings can also handle more complex conditions, allowing for intricate logical operations:

hours_worked = 41
pay_rate = 20
overtime_rate = 1.5
total_pay = f"Total Pay: ${(40 * pay_rate) + ((hours_worked - 40) * pay_rate * overtime_rate) if hours_worked > 40 else hours_worked * pay_rate}"
print(total_pay)

Here, we calculated the total pay using an inline ternary operator, paying the first 40 hours at the regular rate and the remaining hour at the overtime rate: Total Pay: $830.0

Combining multiple conditions within f-strings is something that can be easily achieved:

temperature = 75
weather = "sunny"
activity = f"Activity: {'Swimming' if weather == 'sunny' and temperature > 70 else 'Reading indoors'}"
print(activity)
# Output: 'Activity: Swimming'

Ternary operations in f-strings can also be used for dynamic formatting, such as adding a sign or a color label based on a condition:

profit = -20
profit_message = f"Profit: {'+' if profit >= 0 else ''}{profit} {'(green)' if profit >= 0 else '(red)'}"
print(profit_message)
# Output: 'Profit: -20 (red)'

Formatting Dates and Times with Python f-strings

One of the many strengths of Python's f-strings is their ability to elegantly handle date and time formatting. In this section, we'll explore how to use f-strings to format dates and times, showcasing various formatting options to suit different requirements.

To format a datetime object using an f-string, you can simply include the desired format specifiers inside the curly braces:

from datetime import datetime

current_time = datetime.now()
formatted_time = f"Current time: {current_time:%Y-%m-%d %H:%M:%S}"
print(formatted_time)

This will give you the current time in the format you specified:

Current time: [current date and time in YYYY-MM-DD HH:MM:SS format]

Note: Here, you can also use any of the other datetime specifiers, such as %B, %A, and so on.

If you're working with timezone-aware datetime objects, f-strings can provide you with the time zone information using the %Z specifier (the time zone name) or the %z specifier (the UTC offset):

from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc)
formatted_timestamp = f"UTC Time: {timestamp:%Y-%m-%d %H:%M:%S %Z}"
print(formatted_timestamp)

This will give you: UTC Time: [current UTC date and time] UTC
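To show what the two time zone specifiers produce, here is a small sketch that uses a fixed, arbitrarily chosen moment instead of the current time, so the results are reproducible:

```python
from datetime import datetime, timezone, timedelta

# A fixed, timezone-aware moment in UTC
utc_moment = datetime(2024, 1, 26, 10, 43, tzinfo=timezone.utc)

# The same moment expressed in a zone two hours ahead of UTC
offset_moment = utc_moment.astimezone(timezone(timedelta(hours=2)))

utc_text = f"{utc_moment:%H:%M %Z (%z)}"
offset_text = f"{offset_moment:%H:%M %Z (%z)}"

# For a naive datetime, both specifiers expand to an empty string
naive_text = f"{datetime(2024, 1, 26, 10, 43):%H:%M %Z%z}"

print(utc_text)
print(offset_text)
print(repr(naive_text))
```

%Z gives the name of the time zone, while %z gives its offset from UTC in the +HHMM or -HHMM form.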

F-strings can be particularly handy for creating custom date and time formats, tailored for display in user interfaces or reports:

event_date = datetime(2023, 12, 31)
event_time = f"Event Date: {event_date:%d-%m-%Y | %I:%M%p}"
print(event_time)

Output: Event Date: 31-12-2023 | 12:00AM

You can also combine f-strings with timedelta objects to display relative times:

from datetime import datetime, timedelta

current_time = datetime.now()
hours_passed = timedelta(hours=6)
future_time = current_time + hours_passed
relative_time = f"Time after 6 hours: {future_time:%H:%M}"
print(relative_time)

# Output: 'Time after 6 hours: [time 6 hours from now in HH:MM format]'

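All in all, you can build almost any datetime format by combining the available specifiers within an f-string. Keep in mind that the # flag mentioned in the table is specific to Windows; on Linux and macOS, the - flag (for example, %-d) typically plays the same role: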

Specifier	Usage
%a	Abbreviated weekday name.
%A	Full weekday name.
%b	Abbreviated month name.
%B	Full month name.
%c	Date and time representation appropriate for locale. If the # flag (`%#c`) precedes the specifier, long date and time representation is used.
%d	Day of month as a decimal number (01 – 31). If the # flag (`%#d`) precedes the specifier, the leading zeros are removed from the number.
%H	Hour in 24-hour format (00 – 23). If the # flag (`%#H`) precedes the specifier, the leading zeros are removed from the number.
%I	Hour in 12-hour format (01 – 12). If the # flag (`%#I`) precedes the specifier, the leading zeros are removed from the number.
%j	Day of year as decimal number (001 – 366). If the # flag (`%#j`) precedes the specifier, the leading zeros are removed from the number.
%m	Month as decimal number (01 – 12). If the # flag (`%#m`) precedes the specifier, the leading zeros are removed from the number.
%M	Minute as decimal number (00 – 59). If the # flag (`%#M`) precedes the specifier, the leading zeros are removed from the number.
%p	Current locale's A.M./P.M. indicator for 12-hour clock.
%S	Second as decimal number (00 – 59). If the # flag (`%#S`) precedes the specifier, the leading zeros are removed from the number.
%U	Week of year as decimal number, with Sunday as first day of week (00 – 53). If the # flag (`%#U`) precedes the specifier, the leading zeros are removed from the number.
%w	Weekday as decimal number (0 – 6; Sunday is 0). If the # flag (`%#w`) precedes the specifier, the leading zeros are removed from the number.
%W	Week of year as decimal number, with Monday as first day of week (00 – 53). If the # flag (`%#W`) precedes the specifier, the leading zeros are removed from the number.
%x	Date representation for current locale. If the # flag (`%#x`) precedes the specifier, long date representation is enabled.
%X	Time representation for current locale.
%y	Year without century, as decimal number (00 – 99). If the # flag (`%#y`) precedes the specifier, the leading zeros are removed from the number.
%Y	Year with century, as decimal number. If the # flag (`%#Y`) precedes the specifier, the leading zeros are removed from the number.
%z	UTC offset in the form ±HHMM[SS[.ffffff]]; empty string if the object is naive.
%Z	Time zone name; empty string if the object is naive.
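
As a quick illustration of the table, here is a minimal sketch that formats a fixed, timezone-aware date so the result is reproducible. The variable name moment and the chosen date are made up for this example. It sticks to specifiers that behave the same on every platform, and formats the integer attributes directly when leading zeros should be dropped:

```python
from datetime import datetime, timezone

# A fixed, timezone-aware moment (hypothetical example value)
moment = datetime(2023, 12, 31, 15, 30, tzinfo=timezone.utc)

iso_like = f"{moment:%Y-%m-%d %H:%M}"
day_of_year = f"{moment:%j}"
utc_offset = f"{moment:%z}"
# Portable way to drop leading zeros: format the integer attributes directly
no_padding = f"{moment.day}/{moment.month}/{moment.year}"

print(iso_like)     # 2023-12-31 15:30
print(day_of_year)  # 365
print(utc_offset)   # +0000
print(no_padding)   # 31/12/2023
```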

Advanced Number Formatting with Python f-strings

Python's f-strings are not only useful for embedding expressions and creating dynamic strings, but they also excel in formatting numbers for various contexts. They can be helpful when dealing with financial data, scientific calculations, or statistical information, since they offer a wealth of options for presenting numbers in a clear, precise, and readable format. In this section, we'll dive into the advanced aspects of number formatting using f-strings in Python.

Before exploring advanced techniques, let's start with basic number formatting:

number = 123456.789
formatted_number = f"Basic formatting: {number:,}"
print(formatted_number)
# Output: 'Basic formatting: 123,456.789'

Here, the , format specifier prints the number with commas as thousands separators, while the full stop remains the decimal separator.

F-strings allow you to control the precision of floating-point numbers, which is crucial in fields like finance and engineering:

pi = 3.141592653589793
formatted_pi = f"Pi rounded to 3 decimal places: {pi:.3f}"
print(formatted_pi)

Here, we rounded Pi to 3 decimal places: Pi rounded to 3 decimal places: 3.142

For displaying percentages, f-strings can convert decimal numbers to percentage format:

completion_ratio = 0.756
formatted_percentage = f"Completion: {completion_ratio:.2%}"
print(formatted_percentage)

This will give you: Completion: 75.60%

Another useful feature is that f-strings support exponential notation:

avogadro_number = 6.02214076e23
formatted_avogadro = f"Avogadro's number: {avogadro_number:.2e}"
print(formatted_avogadro)

This will convert Avogadro's number from the usual decimal notation to the exponential notation: Avogadro's number: 6.02e+23

Besides this, f-strings can also format numbers in hexadecimal, binary, or octal representation:

number = 255
hex_format = f"Hexadecimal: {number:#x}"
binary_format = f"Binary: {number:#b}"
octal_format = f"Octal: {number:#o}"

print(hex_format)
print(binary_format)
print(octal_format)

This will transform the number 255 to each of the supported number representations:

Hexadecimal: 0xff
Binary: 0b11111111
Octal: 0o377
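
The individual options shown in this section can also be combined in a single format specifier. The following is a small sketch of that idea; the variable names and sample values are chosen only for illustration:

```python
amount = 1234567.891
ratio = 0.5
flags = 5

money = f"{amount:,.2f}"       # thousands separator plus 2 decimal places
aligned = f"{amount:>15,.1f}"  # right-aligned in a 15-character field
signed = f"{ratio:+.1%}"       # explicit sign on a percentage
padded_bin = f"{flags:08b}"    # zero-padded to 8 binary digits
padded_hex = f"{255:#06x}"     # the width of 6 includes the 0x prefix

print(money)          # 1,234,567.89
print(repr(aligned))  # '    1,234,567.9'
print(signed)         # +50.0%
print(padded_bin)     # 00000101
print(padded_hex)     # 0x00ff
```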

Lambdas and Inline Functions in Python f-strings

Python's f-strings are not only efficient for embedding expressions and formatting strings but also offer the flexibility to include lambda functions and other inline functions.

This feature opens up plenty of possibilities for on-the-fly computations and dynamic string generation.

Lambda functions, also known as anonymous functions in Python, can be used within f-strings for inline calculations:

area = lambda r: 3.14 * r ** 2
radius = 5
formatted_area = f"The area of the circle with radius {radius} is: {area(radius)}"
print(formatted_area)

# Output: 'The area of the circle with radius 5 is: 78.5'

As we briefly discussed before, you can also call functions directly within an f-string, making your code more concise and readable:

def square(n):
    return n * n

num = 4
formatted_square = f"The square of {num} is: {square(num)}"
print(formatted_square)

# Output: 'The square of 4 is: 16'

Lambdas in f-strings can help you implement more complex expressions within f-strings, enabling sophisticated inline computations:

import math

hypotenuse = lambda a, b: math.sqrt(a**2 + b**2)
side1, side2 = 3, 4
formatted_hypotenuse = f"The hypotenuse of a triangle with sides {side1} and {side2} is: {hypotenuse(side1, side2)}"
print(formatted_hypotenuse)

# Output: 'The hypotenuse of a triangle with sides 3 and 4 is: 5.0'

You can also combine multiple functions within a single f-string for complex formatting needs:

def double(n):
    return n * 2

def format_as_percentage(n):
    return f"{n:.2%}"

num = 0.25
formatted_result = f"Double of {num} as percentage: {format_as_percentage(double(num))}"
print(formatted_result)

This will give you:

Double of 0.25 as percentage: 50.00%
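
The lambdas in this section are all defined before the f-string and then called inside it. A lambda can also be written directly inside the replacement field, but it has to be wrapped in parentheses. A minimal sketch, with illustrative names and values:

```python
radius = 5

# The parentheses are required: without them, the colon after "lambda r"
# would be read as the start of a format specifier.
inline_area = f"Area: {(lambda r: 3.14 * r ** 2)(radius)}"

# A format specifier can still follow the parenthesized call
rounded_area = f"Area: {(lambda r: 3.14159 * r ** 2)(radius):.1f}"

print(inline_area)   # Area: 78.5
print(rounded_area)  # Area: 78.5
```

In practice, a named function is usually easier to read than an inline lambda like this.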

Debugging with f-strings in Python 3.8+

Python 3.8 introduced a subtle yet impactful feature in f-strings: the ability to self-document expressions. This feature, often heralded as a boon for debugging, enhances f-strings beyond simple formatting tasks, making them a powerful tool for diagnosing and understanding code.

The key addition in Python 3.8 is the = specifier in f-strings. It allows you to print both the expression and its value, which is particularly useful for debugging:

x = 14
y = 3
print(f"{x=}, {y=}")

# Output: 'x=14, y=3'

This feature shines when used with more complex expressions, providing insight into the values of variables at specific points in your code:

name = "Alice"
age = 30
print(f"{name.upper()=}, {age * 2=}")

This will print out both the expressions you're looking at and their values:

name.upper()='ALICE', age * 2=60

The = specifier is also handy for debugging within loops, where you can track the change of variables in each iteration:

for i in range(3):
    print(f"Loop {i=}")

Output:

Loop i=0
Loop i=1
Loop i=2

Additionally, you can debug function return values and argument values directly within f-strings:

def square(n):
    return n * n

num = 4
print(f"{square(num)=}")

# Output: 'square(num)=16'

Note: While this feature is incredibly useful for debugging, it's important to use it judiciously. The output can become cluttered in complex expressions, so it's best suited for quick and simple debugging scenarios.

Remember to remove these debugging statements from production code for clarity and performance.
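
The = specifier can be combined with a conversion or a format specifier, which helps keep debug output compact. A short sketch, using made-up variable names:

```python
price = 3.14159
name = "Alice"

with_spec = f"{price=:.2f}"          # a format specifier after the =
with_str = f"{name=!s}"              # !s uses str() instead of the default repr()
with_spaces = f"{price * 2 = :.1f}"  # whitespace around the = is preserved

print(with_spec)    # price=3.14
print(with_str)     # name=Alice
print(with_spaces)  # price * 2 = 6.3
```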

Performance of F-strings

F-strings are often lauded for their readability and ease of use, but how do they stack up in terms of performance? Here, we'll dive into the performance aspects of f-strings, comparing them with traditional string formatting methods, and provide insights on optimizing string formatting in Python:

f-strings vs. Concatenation: f-strings generally offer better performance than string concatenation, especially in cases with multiple dynamic values. Concatenation can lead to the creation of numerous intermediate string objects, whereas an f-string is compiled into an efficient format.
f-strings vs. % Formatting: The older % formatting is generally slower than f-strings, since its format string typically has to be parsed at runtime each time the expression is evaluated.
f-strings vs. str.format(): f-strings are typically faster than the str.format() method. This is because the literal part of an f-string is parsed at compile time, which avoids the overhead of parsing the template and calling a method at runtime. The embedded expressions themselves are still evaluated at runtime.
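
Claims like these are best checked on your own interpreter, since results vary between Python versions and machines. The sketch below uses the standard timeit module to compare the four approaches; the function names and the sample values are hypothetical, and no particular timings are implied:

```python
import timeit

name, count = "Alice", 3

def with_fstring():
    return f"{name} has {count} items"

def with_format():
    return "{} has {} items".format(name, count)

def with_percent():
    return "%s has %d items" % (name, count)

def with_concat():
    return name + " has " + str(count) + " items"

candidates = [with_fstring, with_format, with_percent, with_concat]

# All four approaches must build the same string for the comparison to be fair
results = {func.__name__: func() for func in candidates}

for func in candidates:
    seconds = timeit.timeit(func, number=100_000)
    print(f"{func.__name__:<14} {seconds:.4f}s")
```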

Considerations for Optimizing String Formatting

Use f-strings for Simplicity and Speed: Given their performance benefits, use f-strings for most string formatting needs, unless working with a Python version earlier than 3.6.
Complex Expressions: For complex expressions within f-strings, be aware that they are evaluated at runtime. If the expression is particularly heavy, it can offset the performance benefits of f-strings.
Memory Usage: In scenarios with extremely large strings or in memory-constrained environments, consider other approaches like string builders or generators.
Readability vs. Performance: While f-strings provide a performance advantage, always balance this with code readability and maintainability.

In summary, f-strings not only enhance the readability of string formatting in Python but also offer performance benefits over traditional methods like concatenation, % formatting, and str.format(). They are a robust choice for efficient string handling in Python, provided they are used judiciously, keeping in mind the complexity of embedded expressions and overall code clarity.

Formatting and Internationalization

When your app is targeting a global audience, it's crucial to consider internationalization and localization. Python provides robust tools and methods to handle formatting that respects different cultural norms, such as date formats, currency, and number representations. Let's explore how Python deals with these challenges.

Dealing with Locale-Specific Formatting

When developing applications for an international audience, you need to format data in a way that is familiar to each user's locale. This includes differences in numeric formats, currencies, date and time conventions, and more.

The locale Module:
- Python's locale module allows you to set and get the locale information and provides functionality for locale-sensitive formatting.
- You can use locale.setlocale() to set the locale based on the user’s environment.

Number Formatting:

Using the locale module, you can format numbers according to the user's locale, which includes appropriate grouping of digits and decimal point symbols.

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
formatted_number = locale.format_string("%d", 1234567, grouping=True)
print(formatted_number)  # 1,234,567 in US locale

Currency Formatting:

The locale module also provides a way to format currency values.

# grouping=True is needed for the thousands separator; it is off by default
formatted_currency = locale.currency(1234.56, grouping=True)
print(formatted_currency)  # $1,234.56 in US locale

Date and Time Formatting for Internationalization

Date and time representations vary significantly across cultures. Python's datetime module, combined with the locale module, can be used to display date and time in a locale-appropriate format.

Example:

import locale
from datetime import datetime

locale.setlocale(locale.LC_ALL, 'de_DE')
now = datetime.now()
print(now.strftime('%c'))  # Locale-specific full date and time representation

Best Practices for Internationalization:

Consistent Use of Locale Settings:
- Always set the locale at the start of your application and use it consistently throughout.
- Remember to handle cases where the locale setting might not be available or supported.
Be Cautious with Locale Settings:
- Setting a locale is a global operation in Python: it affects the entire process, including other threads and any libraries your program uses.
Test with Different Locales:
- Be sure to test your application with different locale settings to verify that formats are displayed correctly.
Handling Different Character Sets and Encodings:
- Be aware of the encoding issues that might arise with different languages, especially when dealing with non-Latin character sets.
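
The advice about unavailable locales can be sketched as follows. The helper name try_set_locale and the list of candidate locale names are assumptions made for this example; which names are actually installed depends on the system, so the "C" locale is included as a fallback that is always available:

```python
import locale

def try_set_locale(names):
    """Try each locale name in turn; return the first one that works, or None."""
    for name in names:
        try:
            locale.setlocale(locale.LC_ALL, name)
            return name
        except locale.Error:
            continue
    return None

# Remember the current setting so it can be restored afterwards
previous = locale.setlocale(locale.LC_ALL)

chosen = try_set_locale(["de_DE.UTF-8", "de_DE", "C"])
print(f"Using locale: {chosen}")
print(locale.format_string("%.2f", 1234567.891, grouping=True))

# Restore the original setting, since the locale is process-wide
locale.setlocale(locale.LC_ALL, previous)
```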

Working with Substrings

Working with substrings is a common task in Python programming, involving extracting, searching, and manipulating parts of strings. Python offers several methods to handle substrings efficiently and intuitively. Understanding these methods is crucial for text processing, data manipulation, and various other applications.

Extracting Substrings

Slicing is one of the primary ways to extract a substring from a string. It involves specifying a start and end index, and optionally a step, to slice out a portion of the string.

Note: We discussed the notion of slicing in more detail in the "Basic String Operations" section.

For example, say you'd like to extract the word "World" from the sentence "Hello, World!":

text = "Hello, World!"
# Extract 'World' from text
substring = text[7:12]

Here, the value of substring would be "World". Python also supports negative indexing (counting from the end), and omitting start or end indices to slice from the beginning or to the end of the string, respectively.
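
The negative indexing and omitted indices mentioned above can be illustrated with the same string. The variable names are chosen only for this sketch:

```python
text = "Hello, World!"

from_end = text[-6:-1]   # negative indices count from the end
first_word = text[:5]    # omitted start: slice from the beginning
tail = text[7:]          # omitted end: slice to the end
every_other = text[::2]  # a step of 2 takes every second character

print(from_end)     # World
print(first_word)   # Hello
print(tail)         # World!
print(every_other)  # Hlo ol!
```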

Finding Substrings

As we discussed in the "Common String Methods" section, Python provides methods like find(), index(), rfind(), and rindex() to search for the position of a substring within a string.

find() and rfind() return the lowest and the highest index where the substring is found, respectively. They return -1 if the substring is not found.
index() and rindex() are similar to find() and rfind(), but raise a ValueError if the substring is not found.

For example, the position of the word "World" in the string "Hello, World!" would be 7:

text = "Hello, World!"
position = text.find("World")

print(position)
# Output: 7
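
To make the difference between the two families of methods concrete, here is a short sketch showing what happens when the substring is missing. The variable names are illustrative:

```python
text = "Hello, World!"

first_o = text.find("o")       # lowest index of "o"
last_o = text.rfind("o")       # highest index of "o"
missing = text.find("Python")  # find() signals "not found" with -1

index_error = None
try:
    text.index("Python")       # index() raises instead of returning -1
except ValueError as error:
    index_error = str(error)

print(first_o)  # 4
print(last_o)   # 8
print(missing)  # -1
print(index_error)
```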

Replacing Substrings

The replace() method is used to replace occurrences of a specified substring with another substring:

text = "Hello, World!"
new_text = text.replace("World", "Python")

The word "World" will be replaced with the word "Python", therefore, new_text would be "Hello, Python!".

Checking for Substrings

Methods like startswith() and endswith() are used to check if a string starts or ends with a specified substring, respectively:

text = "Hello, World!"
if text.startswith("Hello"):
    print("The string starts with 'Hello'")

Splitting Strings

The split() method breaks a string into a list of substrings based on a specified delimiter:

text = "one,two,three"
items = text.split(",")

Here, items would be ['one', 'two', 'three'].
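
Two variations of split() are worth knowing: calling it without a delimiter splits on runs of whitespace, and an optional second argument limits the number of splits. A small sketch with made-up sample strings:

```python
line = "  alpha   beta  gamma "

# No delimiter: split on any run of whitespace and ignore the ends
on_whitespace = line.split()

# Explicit delimiter: every single space counts, so empty strings appear
on_single_space = line.strip().split(" ")

# maxsplit=1: split only at the first delimiter
key_value = "key=value=more".split("=", 1)

print(on_whitespace)    # ['alpha', 'beta', 'gamma']
print(on_single_space)  # ['alpha', '', '', 'beta', '', 'gamma']
print(key_value)        # ['key', 'value=more']
```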

Joining Strings

The join() method is used to concatenate a list of strings into a single string, with a specified separator:

words = ['Python', 'is', 'fun']
sentence = ' '.join(words)

In this example, sentence would be "Python is fun".

Advanced String Techniques

Besides simple string manipulation techniques, Python offers more sophisticated methods of manipulating and handling strings, which are essential for complex text processing, encoding, and pattern matching.

In this section, we'll take a look at an overview of some advanced string techniques in Python.

Unicode and Byte Strings

Understanding the distinction between Unicode strings and byte strings in Python is quite important when you're dealing with text and binary data. This differentiation is a core aspect of Python's design and plays a significant role in how the language handles string and binary data.

Since the introduction of Python 3, the default string type is Unicode. This means whenever you create a string using str, like when you write s = "hello", you are actually working with a Unicode string.

Unicode strings are designed to store text data. One of their key strengths is the ability to represent characters from a wide range of languages, including various symbols and special characters. Internally, Python uses Unicode to represent these strings, making them extremely versatile for text processing and manipulation. Whether you're simply working with plain English text or dealing with multiple languages and complex symbols, Unicode ensures that your text data is consistently represented and manipulated within Python.

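Note: Since Python 3.3 (PEP 393), CPython stores each string internally using 1, 2, or 4 bytes per character, depending on the widest character the string contains, rather than a fixed UTF-16 or UTF-32 encoding chosen at build time.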

On the other hand, byte strings are used in Python for handling raw binary data. When you face situations that require working directly with bytes - like dealing with binary files, network communication, or any form of low-level data manipulation - byte strings come into play. You can create a byte string by prefixing the string literal with b, as in b = b"bytes".

Unlike Unicode strings, byte strings are essentially sequences of bytes - integers in the range of 0-255 - and they don't inherently carry information about text encoding. They are the go-to solution when you need to work with data at the byte level, without the overhead or complexity of text encoding.

Conversion between Unicode and byte strings is a common requirement, and Python handles this through explicit encoding and decoding. When you need to convert a Unicode string into a byte string, you use the .encode() method along with specifying the encoding, like UTF-8. Conversely, turning a byte string into a Unicode string requires the .decode() method.

Let's consider a practical example where we need to use both Unicode strings and byte strings in Python.

Imagine we have a simple text message in English that we want to send over a network. This message is initially in the form of a Unicode string, which is the default string type in Python 3.

First, we create our Unicode string:

message = "Hello, World!"

This message is a Unicode string, perfect for representing text data in Python. However, to send this message over a network, we often need to convert it to bytes, as network protocols typically work with byte streams.

We can convert our Unicode string to a byte string using the .encode() method. Here, we'll use UTF-8 encoding, which is a common character encoding for Unicode text:

encoded_message = message.encode('utf-8')

Now, encoded_message is a byte string. It's no longer in a format that is directly readable as text, but rather in a format suitable for transmission over a network or for writing to a binary file.

Let's say the message reaches its destination, and we need to convert it back to a Unicode string for reading. We can accomplish this by using the .decode() method:

decoded_message = encoded_message.decode('utf-8')

With decoded_message, we're back to a readable Unicode string, "Hello, World!".

This process of encoding and decoding is essential when dealing with data transmission or storage in Python, where the distinction between text (Unicode strings) and binary data (byte strings) is crucial. By converting our text data to bytes before transmission, and then back to text after receiving it, we ensure that our data remains consistent and uncorrupted across different systems and processing stages.
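
The example above uses only ASCII characters, where text and bytes look almost identical. The difference becomes visible with non-ASCII text. The sketch below uses an illustrative word and also shows what happens when bytes are decoded with the wrong codec:

```python
word = "café"
utf8_bytes = word.encode("utf-8")

print(len(word))        # 4 characters
print(len(utf8_bytes))  # 5 bytes, because "é" takes two bytes in UTF-8
print(utf8_bytes)       # b'caf\xc3\xa9'

# Decoding with the wrong codec fails by default...
strict_failed = False
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

# ...unless an error handler is given, which substitutes U+FFFD
# for each byte that cannot be decoded
lenient = utf8_bytes.decode("ascii", errors="replace")

print(strict_failed)  # True
print(lenient == "caf\ufffd\ufffd")  # True
```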

Raw Strings

Raw strings are a form of string literal that is particularly useful for strings containing many backslashes, such as Windows file paths or regular expressions. Unlike normal strings, raw strings treat backslashes (\) as literal characters, not as escape characters. This makes them handy when you don't want Python to handle backslashes in any special way.

In a standard Python string, a backslash signals the start of an escape sequence, which Python interprets in a specific way. For example, \n is interpreted as a newline, and \t as a tab. This is useful in many contexts but can become problematic when your string contains many backslashes and you want them to remain as literal backslashes.

A raw string is created by prefixing the string literal with an 'r' or 'R'. This tells Python to ignore all escape sequences and treat backslashes as regular characters. For example, consider a scenario where you need to define a file path in Windows, which uses backslashes in its paths:

path = r"C:\Users\YourName\Documents\File.txt"

Here, using a raw string prevents Python from interpreting \U, \Y, \D, and \F as escape sequences. If you used a normal string (without the 'r' prefix), \U would be read as the start of a Unicode escape and cause a SyntaxError, while the unrecognized sequences \Y, \D, and \F would trigger a warning about invalid escape sequences in recent Python versions.

Another common use case for raw strings is in regular expressions. Regular expressions use backslashes for special characters, and using raw strings here can make your regex patterns much more readable and maintainable:

import re

pattern = r"\b[A-Z]+\b"
text = "HELLO, how ARE you?"
matches = re.findall(pattern, text)

print(matches)  # Output: ['HELLO', 'ARE']

The raw string r"\b[A-Z]+\b" represents a regular expression that looks for whole words composed of uppercase letters. Without the raw string notation, you would have to escape each backslash with another backslash (\\b[A-Z]+\\b), which is less readable.
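
A quick way to see the difference is to compare the lengths of a normal and a raw literal. The sketch below also shows one limitation: a raw string literal cannot end with a single backslash. The variable names and the path are illustrative only:

```python
normal = "\n"  # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n

print(len(normal))    # 1
print(len(raw))       # 2
print(raw == "\\n")   # True

# In a normal string, "\b" is a backspace character, not a regex word boundary
backspace = "\b"
print(backspace == "\x08")  # True

# A raw string literal cannot end with a single backslash,
# so a trailing backslash has to be added separately
folder = r"C:\Users\YourName" + "\\"
print(folder)  # C:\Users\YourName\
```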

Multiline Strings

Multiline strings in Python are a convenient way to handle text data that spans several lines. These strings are enclosed within triple quotes, either triple single quotes (''') or triple double quotes (""").

This approach is often used for creating long strings, docstrings, or even for formatting purposes within the code.

Unlike single or double-quoted strings, which end at the first line break, multiline strings allow the text to continue over several lines, preserving the line breaks and white spaces within the quotes.

Let's consider a practical example to illustrate the use of multiline strings. Suppose you are writing a program that requires a long text message or a formatted output, like a paragraph or a poem. Here's how you might use a multiline string for this purpose:

long_text = """
This is a multiline string in Python.
It spans several lines, maintaining the line breaks
and spaces just as they are within the triple quotes.

    You can also create indented lines within it,
like this one!
"""

print(long_text)

When you run this code, Python will output the entire block of text exactly as it's formatted within the triple quotes, including all the line breaks and spaces. This makes multiline strings particularly useful for writing text that needs to maintain its format, such as when generating formatted emails, long messages, or even code documentation.
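
Because every space inside the triple quotes is preserved, a multiline string written inside an indented block picks up that indentation as well. One common way to deal with this is textwrap.dedent() from the standard library. A minimal sketch, where the function name build_message and the message text are made up for illustration:

```python
import textwrap

def build_message():
    # The backslash after the opening quotes suppresses the first newline
    message = """\
        Dear user,
          your report is ready.
        Regards
        """
    # dedent() removes the whitespace common to all lines,
    # keeping the relative indentation of the second line
    return textwrap.dedent(message)

message = build_message()
print(message)
```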

In Python, multiline strings are also commonly used for docstrings. Docstrings provide a convenient way to document your Python classes, functions, modules, and methods. They are written immediately after the definition of a function, class, or a method and are enclosed in triple quotes:

def my_function():
    """
    This is a docstring for the my_function.
    It can provide an explanation of what the function does,
    its parameters, return values, and more.
    """
    pass

When you use the built-in help() function on my_function, Python will display the text in the docstring as the documentation for that function.

Regular Expressions

Regular expressions in Python, facilitated by the re module, are a powerful tool for pattern matching and manipulation of strings. They provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.

Regular expressions are used for a wide range of tasks including validation, parsing, and string manipulation.

At the core of regular expressions are patterns that are matched against strings. These patterns are expressed in a specialized syntax that allows you to define what you're looking for in a string. Python's re module supports a set of functions and syntax that adhere to regular expression rules.

Advice: If you want to have more comprehensive insight into regular expressions in Python, you should definitely read our "Introduction to Regular Expressions in Python" article.

Some of the key functions in the re module include:

re.match(): Determines if the regular expression matches at the beginning of the string.
re.search(): Scans through the string and returns a Match object if the pattern is found anywhere in the string.
re.findall(): Finds all occurrences of the pattern in the string and returns them as a list.
re.finditer(): Similar to re.findall(), but returns an iterator yielding Match objects instead of the strings.
re.sub(): Replaces occurrences of the pattern in the string with a replacement string.
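
The functions listed above can be compared side by side on a single string. This is a small sketch with illustrative variable names:

```python
import re

text = "The rain in Spain"

at_start = re.match(r"The", text)         # matches only at the beginning
not_at_start = re.match(r"rain", text)    # None: "rain" is not at position 0
first_hit = re.search(r"ain", text)       # first match anywhere in the string
all_hits = re.findall(r"\w*ain\b", text)  # list of matching strings
spans = [m.span() for m in re.finditer(r"ain", text)]  # Match objects give positions
replaced = re.sub(r"ain", "AIN", text)

print(at_start is not None)  # True
print(not_at_start)          # None
print(first_hit.start())     # 5
print(all_hits)              # ['rain', 'Spain']
print(spans)                 # [(5, 8), (14, 17)]
print(replaced)              # The rAIN in SpAIN
```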

To use regular expressions in Python, you typically follow these steps:

Import the re module.
Define the regular expression pattern as a string.
Use one of the re module's functions to search or manipulate the string using the pattern.

Here's a practical example to demonstrate these steps:

import re

# Sample text
text = "The rain in Spain falls mainly in the plain."

# Regular expression pattern to find all words that start with 'S' or 's'
pattern = r"\bs\w*"  # The r before the string makes it a raw string

# Using re.findall() to find all occurrences
found_words = re.findall(pattern, text, re.IGNORECASE)

print(found_words)  # Output: ['Spain']

In this example:

r"\bs\w*" is the regular expression pattern. \b indicates a word boundary, s is the literal character 's', and \w* matches any word character (letters, digits, or underscores) zero or more times.
re.IGNORECASE is a flag that makes the search case-insensitive.
re.findall() searches the string text for all occurrences that match the pattern.

Regular expressions are extremely versatile but can be complex for intricate patterns. It's important to carefully craft your regular expression for accuracy and efficiency, especially for complex string processing tasks.

Advice: One of the interesting use cases for regular expressions is matching phone numbers. You can read more about that in our "Python Regular Expressions - Validate Phone Numbers" article.

Strings and Collections

In Python, strings and collections (like lists, tuples, and dictionaries) often interact, either through conversion of one type to another or by manipulating strings using methods influenced by collection operations. Understanding how to efficiently work with strings and collections is crucial for tasks like data parsing, text processing, and more.

Splitting Strings into Lists

The split() method is used to divide a string into a list of substrings. It's particularly useful for parsing CSV files or user input:

text = "apple,banana,cherry"
fruits = text.split(',')
# fruits is now ['apple', 'banana', 'cherry']

Joining List Elements into a String

Conversely, the join() method combines a list of strings into a single string, with a specified separator:

fruits = ['apple', 'banana', 'cherry']
text = ', '.join(fruits)
# text is now 'apple, banana, cherry'

String and Dictionary Interactions

Strings commonly serve as dictionary keys, and dictionary values can in turn fill the named placeholders of a format string:

info = {"name": "Alice", "age": 30}
text = "Name: {name}, Age: {age}".format(**info)
# text is now 'Name: Alice, Age: 30'

List Comprehensions with Strings

List comprehensions can include string operations, allowing for concise manipulation of strings within collections:

words = ["Hello", "world", "python"]
upper_words = [word.upper() for word in words]
# upper_words is now ['HELLO', 'WORLD', 'PYTHON']

Mapping and Filtering Strings in Collections

Using functions like map() and filter(), you can apply string methods or custom functions to collections:

words = ["Hello", "world", "python"]
lengths = map(len, words)
# lengths is now an iterator that yields 5, 5, 6; use list(lengths) to get [5, 5, 6]
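
Since the text mentions filter() as well, here is a short sketch that uses both functions with string methods. It also shows that the iterator returned by map() can only be consumed once. The variable names are illustrative:

```python
words = ["Hello", "world", "python"]

lengths = list(map(len, words))
lowercase_only = list(filter(str.islower, words))
capitalized = list(map(str.capitalize, words))

print(lengths)         # [5, 5, 6]
print(lowercase_only)  # ['world', 'python']
print(capitalized)     # ['Hello', 'World', 'Python']

# A map object is a one-shot iterator: the second pass yields nothing
lazy_lengths = map(len, words)
first_pass = list(lazy_lengths)
second_pass = list(lazy_lengths)
print(first_pass)   # [5, 5, 6]
print(second_pass)  # []
```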

Slicing and Indexing Strings in Collections

You can slice and index strings in collections in a similar way to how you do with individual strings:

word_list = ["apple", "banana", "cherry"]
first_letters = [word[0] for word in word_list]
# first_letters is now ['a', 'b', 'c']

Using Tuples as String Format Specifiers

Tuples can be used to supply the values for the format specifiers in %-style string formatting:

values = ("Alice", 30)
text = "Name: %s, Age: %d" % values
# text is now 'Name: Alice, Age: 30'

String Performance Considerations

When working with strings in Python, it's important to consider their performance implications, especially in large-scale applications, data processing tasks, or situations where efficiency is critical. In this section, we'll take a look at some key performance considerations and best practices for handling strings in Python.

Immutability of Strings

Since strings are immutable in Python, each time you modify a string, a new string is created. This can lead to significant memory usage and reduced performance in scenarios involving extensive string manipulation.

To mitigate this, when dealing with large amounts of string concatenations, it's often more efficient to use list comprehension or the join() method instead of repeatedly using + or +=.

For example, it would be more efficient to join a large list of strings instead of concatenating it using the += operator:

# Sample data for illustration
large_list_of_strings = ["spam", "eggs"] * 1000

# Inefficient
result = ""
for s in large_list_of_strings:
    result += s

# More efficient
result = "".join(large_list_of_strings)

Generally speaking, concatenating strings using the + operator in a loop is inefficient, especially for large datasets. Each concatenation creates a new string and thus, requires more memory and time.
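
To measure the difference on your own machine, a sketch along these lines can be used. The function names and the sample data are hypothetical, and the actual numbers will vary. Note that CPython can sometimes optimize += on strings in place, so the gap may be smaller than expected:

```python
import timeit

pieces = ["chunk"] * 10_000  # hypothetical sample data

def concat_loop():
    result = ""
    for piece in pieces:
        result += piece  # may create a new string on each iteration
    return result

def concat_join():
    return "".join(pieces)  # builds the result in a single pass

# Both approaches must produce the same string
same_result = concat_loop() == concat_join()
print(same_result)  # True

loop_time = timeit.timeit(concat_loop, number=100)
join_time = timeit.timeit(concat_join, number=100)
print(f"+= in a loop: {loop_time:.4f}s")
print(f"str.join():   {join_time:.4f}s")
```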

Use f-Strings for Formatting

Python 3.6 introduced f-Strings, which are not only more readable but also faster at runtime compared to other string formatting methods like % formatting or str.format().

Avoid Unnecessary String Operations

Operations like strip(), replace(), or upper()/lower() create new string objects. It's advisable to avoid these operations in critical performance paths unless necessary.

When processing large text data, consider whether you can operate on larger chunks of data at once, rather than processing the string one character or line at a time.

String Interning

CPython automatically interns some strings (usually those that look like identifiers) to save memory and improve performance. This means that identical strings may be stored in memory only once. This is an implementation detail, so your code should not rely on it.

Explicit interning of strings (sys.intern()) can sometimes be beneficial in memory-sensitive applications where many identical string instances are used.
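
Explicit interning can be sketched like this. The two strings are built at runtime so that they start out as separate objects; the variable names and the sample value are illustrative:

```python
import sys

# Build two equal strings at runtime so they start out as separate objects
first = "".join(["user", "-", "id"])
second = "".join(["user", "-", "id"])

# sys.intern() returns the single shared copy for a given string value
interned_first = sys.intern(first)
interned_second = sys.intern(second)

print(first == second)                    # True: same content
print(interned_first is interned_second)  # True: same object after interning
```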

Use Built-in Functions and Libraries

Leverage Python’s built-in functions and libraries for string processing, as they are generally optimized for performance.
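For complex string operations, especially those involving pattern matching, consider using the re module (regular expressions), which is usually faster and less error-prone than hand-written character-by-character loops. For simple checks on fixed substrings, built-in string methods such as in, startswith(), or find() are typically the faster choice.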

Conclusion

This concludes our journey through the world of strings in Python, which has hopefully been both extensive and illuminating. We began by understanding the basics of creating and manipulating strings, exploring how they are indexed, concatenated, and how their immutable nature influences operations in Python. This immutability, a core characteristic of Python strings, ensures security and efficiency in Python's design.

Diving into the array of built-in string methods, we uncovered the versatility of Python in handling common tasks such as case conversion, trimming, searching, and sophisticated formatting. We also examined the various ways Python allows for string formatting, from the traditional % operator to the more modern str.format() method, and the concise and powerful f-Strings introduced in Python 3.6.

Our exploration then took us to substrings, where slicing and manipulating parts of strings revealed Python's flexibility and power in handling string data. We further ventured into advanced string techniques, discussing the handling of Unicode, the utility of raw strings, and the powerful capabilities of regular expressions for complex string manipulations.

The interaction between strings and collections such as lists, tuples, and dictionaries showcased the dynamic ways in which strings can be converted and manipulated within these structures. This interaction is pivotal in tasks ranging from parsing and formatting data to complex data transformations.

Lastly, we peeked into the critical aspect of string performance considerations. We discussed the importance of understanding and applying efficient string handling techniques, emphasizing practices that enhance performance, reduce memory usage, and ensure the scalability of Python applications.

Overall, this comprehensive overview underscores that strings, as a fundamental data type, are integral to programming in Python. They are involved in almost every aspect of programming, from simple text manipulation to complex data processing. With the insights and techniques discussed, you are now better equipped to tackle a wide range of programming challenges, making informed choices about how to effectively and efficiently handle strings in Python.

January 25, 2024 07:10 PM UTC

TechBeamers Python

Pandas GroupBy() and Count() Explained With Examples

Pandas GroupBy and Count work in combination and are valuable in various data analysis scenarios. The groupby function is used to group a data frame by one or more columns, and the count function is used to count the occurrences of each group. When combined, they can provide a convenient way to perform group-wise counting […]

The post Pandas GroupBy() and Count() Explained With Examples appeared first on TechBeamers.

January 25, 2024 03:04 PM UTC

Top Important Terms in Python Programming With Examples

In this tutorial, we have captured the important terms used in Python programming. If you are learning Python, it is good to be aware of different programming concepts and slang related to Python. Please note that these terms form the foundation of Python programming, and a solid understanding of them is essential for effective development […]

The post Top Important Terms in Python Programming With Examples appeared first on TechBeamers.

January 25, 2024 08:40 AM UTC

Glyph Lefkowitz

The Macintosh

A 4k ultrawide classic MacOS desktop screenshot featuring various Finder windows and MPW Workshop

Today is the 40th anniversary of the announcement of the Macintosh. Others have articulated compelling emotional narratives that easily eclipse my own similar childhood memories of the Macintosh family of computers. So instead, I will ask a question:

What is the Macintosh?

As this is the anniversary of the beginning, that is where I will begin. The original Macintosh, the classic MacOS, the original “System Software” are a shining example of “fake it till you make it”. The original Mac operating system was fake.

Don’t get me wrong, it was an impressive technical achievement to fake something like this, but what Steve Jobs did was to see a demo of a Smalltalk-76 system, an object-oriented programming environment with 1-to-1 correspondences between graphical objects on screen and runtime-introspectable data structures, a self-hosting high level programming language, memory safety, message passing, garbage collection, and many other advanced facilities that would not be popularized for decades, and make a fake version of it which ran on hardware that consumers could actually afford, by throwing out most of what made the programming environment interesting and replacing it with a much more memory-efficient illusion implemented in 68000 assembler and Pascal.

The machine’s RAM didn’t have room for a kernel. Whatever application was running was in control of the whole system. No protected memory, no preemptive multitasking. It was a house of cards that was destined to collapse. And collapse it did, both in the short term and the long. In the short term, the system was buggy and unstable, and application crashes resulted in system halts and reboots.

In the longer term, the company based on the Macintosh effectively went out of business and was reverse-acquired by NeXT, but they kept the better-known branding of the older company. The old operating system was gradually disposed of, quickly replaced at its core with a significantly more mature generation of operating system technology based on BSD UNIX and Mach. With the removal of Carbon compatibility 4 years ago, the last vestigial traces of it were removed. But even as early as 2004 the Mac was no longer really the Macintosh.

What NeXT had built was much closer to the Smalltalk system that Jobs was originally attempting to emulate. Its programming language, “Objective C” explicitly called back to Smalltalk’s message-passing, right down to the syntax. Objects on the screen now did correspond to “objects” you could send messages to. The development environment understood this too; that was a major selling point.

The NeXTSTEP operating system and Objective C runtime did not have garbage collection, but they provided a similar developer experience through reference-counting throughout the object model. The original vision was finally achieved, for real, and that’s what we have on our desks and in our backpacks today (and in our pockets, in the form of the iPhone, which is in some sense a tiny next-generation NeXT computer itself).

The one detail I will relate from my own childhood is this: my first computer was not a Mac. My first computer, as a child, was an Amiga. When I was 5, I had a computer with 4096 colors, real multitasking, 3D graphics, and a paint program that could draw hard real-time animations with palette tricks. Then the writing was on the wall for Commodore and I got a computer which had 256 colors, a bunch of old software that was still black and white, an operating system that would freeze if you held down the mouse button on the menu bar and couldn’t even play animations smoothly. Many will relay their first encounter with the Mac as a kind of magic, but mine was a feeling of loss and disappointment. Unlike almost everyone at the time, I knew what a computer really could be, and despite many pleasant and formative experiences with the Macintosh in the meanwhile, it would be a decade before I saw a real one again.

But this is not to deride the faking. The faking was necessary. Xerox was not going to put an Alto running Smalltalk on anyone’s desk. People have always grumbled that Apple products are expensive, but in 2024 dollars, one of these Xerox computers cost roughly $55,000.

The Amiga was, in its own way, a similar sort of fake. It managed its own miracles by putting performance-critical functions into dedicated hardware which rapidly became obsolete as software technology evolved much more rapidly.

Jobs is celebrated as a genius of product design, and he certainly wasn’t bad at it, but I had the rare privilege of seeing the homework he was cribbing from in that subject, and in my estimation he was a B student at best. Where he got an A was bringing a vision to life by creating an organization, both inside and outside of his companies.

If you want a culture-defining technological artifact, everybody in the culture has to be able to get their hands on one. This doesn’t just mean that the builder has to be able to build it. The buyer also has to be able to afford it, obviously. Developers have to be able to develop for it. The buyer has to actually want it; the much-derided “marketing” is a necessary part of the process of making a product what it is. Everyone needs to be able to move together in the direction of the same technological future.

This is why it was so fitting that Tim Cook was made Jobs's successor. The supply chain was the hard part.

The crowning, final achievement of Jobs’s career was the fact that not only did he fake it — the fakes were flying fast and thick at that time in history, even if they mostly weren’t as good — it was that he faked it and then he built the real version and then he bridged the transitions to get to the real thing.

I began here by saying that the Mac isn’t really the Mac, and speaking in terms of a point in time analysis that is true. Its technology today has practically nothing in common with its technology in 1984. This is not merely an artifact of the length of time here: the technology at the core of various UNIXes in 1984 bears a lot of resemblance to UNIX-like operating systems today¹. But looking across its whole history from 1984 to 2024, there is undeniably a continuity to the conceptual “Macintosh”.

Not just as a user, but as a developer moving through time rather than looking at just a few points: the “Macintosh”, such as it is, has transitioned from the Motorola 68000 to the PowerPC to Intel 32-bit to Intel 64-bit to ARM. From obscurely proprietary to enthusiastically embracing open source and then, sadly, much of the way back again. It moved from black and white to color, from desktop to laptop, from Carbon to Cocoa, from Display PostScript to Display PDF, all the while preserving instantly recognizable iconic features like the apple menu and the cursor pointer, while providing developers documentation and SDKs and training sessions that helped them transition their apps through multiple near-complete rewrites as a result of all of these changes.

To paraphrase Abigail Thorne’s first video about Identity, identity is what survives. The Macintosh is an interesting case study in the survival of the idea of a platform, as distinct from the platform itself. It is the Computer of Theseus, a thought experiment successfully brought to life and sustained over time.

If there is a personal lesson to be learned here, I’d say it’s that one’s own efforts need not be perfect. In fact, a significantly flawed vision that you can achieve right now is often much, much better than a perfect version that might take just a little bit longer, if you don’t have the resources to actually sustain going that much longer². You have to be bad at things before you can be good at them. Real artists, as Jobs famously put it, ship.

So my contribution to the 40th anniversary reflections is to say: the Macintosh is dead. Long live the Mac.

Acknowledgments

¹ including, ironically, the modern macOS.
² And that is why I am posting this right now, rather than proofreading it further.

January 25, 2024 06:31 AM UTC

Unsigned Commits

I am going to tell you why I don’t think you should sign your Git commits, even though doing so with SSH keys is now easier than ever. But first, to contextualize my objection, I have a brief hypothetical for you, and then a bit of history from the evolution of security on the web.

paper reading “Sign Here:” with a pen poised over it

It seems like these days, everybody’s signing all different kinds of papers.

Bank forms, permission slips, power of attorney; it seems like if you want to securely validate a document, you’ve gotta sign it.

So I have invented a machine that automatically signs every document on your desk, just in case it needs your signature. Signing is good for security, so you should probably get one, and turn it on, just in case something needs your signature on it.

We also want to make sure that verifying your signature is easy, so we will have them all notarized and duplicates stored permanently and publicly for future reference.

No? Not interested?

Hopefully, that sounded like a silly idea to you.

Most adults in modern civilization have learned that signing your name to a document has an effect. It is not merely decorative; the words in the document being signed have some specific meaning and can be enforced against you.

In some ways the metaphor of “signing” in cryptography is bad. One does not “sign” things with “keys” in real life. But here, it is spot on: a cryptographic signature can have an effect.

It should be an input to some software, one that is acted upon. Software does a thing differently depending on the presence or absence of a signature. If it doesn’t, the signature probably shouldn’t be there.

Consider the most venerable example of encryption and signing that we all deal with every day: HTTPS. Many years ago, browsers would happily display unencrypted web pages. The browser would also encrypt the connection, if the server operator had paid for an expensive certificate and correctly configured their server. If that operator messed up the encryption, it would pop up a helpful dialog box that would tell the user “This website did something wrong that you cannot possibly understand. Would you like to ignore this and keep working?” with buttons that said “Yes” and “No”.

Of course, these are not the precise words that were written. The words, as written, said things about “information you exchange” and “security certificate” and “certifying authorities” but “Yes” and “No” were the words that most users read. Predictably, most users just clicked “Yes”.

In the usual case, where users ignored these warnings, it meant that no user ever got meaningful security from HTTPS. It was a component of the web stack that did nothing but funnel money into the pockets of certificate authorities and occasionally present annoying interruptions to users.

In the case where the user carefully read and honored these warnings in the spirit they were intended, adding any sort of transport security to your website was a potential liability. If you got everything perfectly correct, nothing happened except the browser would display a picture of a small green purse. If you made any small mistake, it would scare users off and thereby directly harm your business. You would only want to do it if you were doing something that put a big enough target on your site that you became unusually interesting to attackers, or were required to do so by some contractual obligation like credit card companies.

Keep in mind that the second case here is the best case.

In 2016, the browser makers noticed this problem and started taking some pretty aggressive steps towards actually enforcing the security that HTTPS was supposed to provide, by fixing the user interface to do the right thing. If your site didn’t have security, it would be shown as “Not Secure”, a subtle warning that would gradually escalate in intensity as time went on, correctly incentivizing site operators to adopt transport security certificates. On the user interface side, certificate errors would be significantly harder to disregard, making it so that users who didn’t understand what they were seeing would actually be stopped from doing the dangerous thing.

Nothing fundamental¹ changed about the technical aspects of the cryptographic primitives or constructions being used by HTTPS in this time period, but socially, the meaning of an HTTP server signing and encrypting its requests changed a lot.

Now, let’s consider signing Git commits.

You may have heard that in some abstract sense you “should” be signing your commits. GitHub puts a little green “verified” badge next to commits that are signed, which is neat, I guess. They provide “security”. 1Password provides a nice UI for setting it up. If you’re not a 1Password user, GitHub itself recommends you put in just a few lines of configuration to do it with either a GPG, SSH, or even an S/MIME key.

But while GitHub’s documentation quite lucidly tells you how to sign your commits, its explanation of why is somewhat less clear. Their purse is the word “Verified”; it’s still green. If you enable “vigilant mode”, you can make the blank “no verification status” option say “Unverified”, but not much else changes.

This is like the old-style HTTPS verification “Yes”/“No” dialog, except that there is not even an interruption to your workflow. They might put the “Unverified” status on there, but they’ve gone ahead and clicked “Yes” for you.

It is tempting to think that the “HTTPS” metaphor will map neatly onto Git commit signatures. It was bad when the web wasn’t using HTTPS, and the next step in that process was for Let’s Encrypt to come along and for the browsers to fix their implementations. Getting your certificates properly set up in the meanwhile and becoming familiar with the tools for properly doing HTTPS was unambiguously a good thing for an engineer to do. I did, and I’m quite glad I did so!

However, there is a significant difference: signing and encrypting an HTTPS request is ephemeral; signing a Git commit is functionally permanent.

This ephemeral nature meant that errors in the early HTTPS landscape were easily fixable. Earlier I mentioned that there was a time where you might not want to set up HTTPS on your production web servers, because any small screw-up would break your site and thereby your business. But if you were really skilled and you could see the future coming, you could set up monitoring, avoid these mistakes, and rapidly recover. These mistakes didn’t need to badly break your site.

We can extend the analogy to HTTPS, but we have to take a detour into one of the more unpleasant mistakes in HTTPS’s history: HTTP Public Key Pinning, or “HPKP”. The idea with HPKP was that you could publish a record in an HTTP header where your site commits² to using certain certificate authorities for a period of time, where that period of time could be “forever”. Attackers gonna attack, and attack they did. Even without getting attacked, a site could easily commit “HPKP Suicide” where they would pin the wrong certificate authority with a long timeline, and their site was effectively gone for every browser that had ever seen those pins. As a result, after a few years, HPKP was completely removed from all browsers.
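
To make the failure mode concrete, here is a rough sketch of how pinning worked, assuming the header format from RFC 7469 (a `Public-Key-Pins` header carrying `pin-sha256` values and a `max-age`). The function names and the placeholder key bytes are hypothetical, and real browsers did considerably more validation than this.

```python
import base64
import hashlib


def pin_sha256(spki_der: bytes) -> str:
    """Return the base64-encoded SHA-256 digest of a public key (SPKI bytes)."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def hpkp_header(pins, max_age):
    """Build a Public-Key-Pins header value from pins and a lifetime in seconds."""
    parts = [f'pin-sha256="{p}"' for p in pins]
    parts.append(f"max-age={max_age}")
    return "; ".join(parts)


def browser_accepts(remembered_pins, served_chain_pins):
    """A browser holding unexpired pins accepts the site only if some key in
    the served certificate chain matches one of the remembered pins."""
    return bool(set(remembered_pins) & set(served_chain_pins))


# Placeholder byte strings standing in for real DER-encoded public keys.
old_ca = pin_sha256(b"hypothetical old CA key")
backup = pin_sha256(b"hypothetical backup key")
new_ca = pin_sha256(b"hypothetical new CA key")

# The site commits to two keys for a year.
header = hpkp_header([old_ca, backup], max_age=60 * 60 * 24 * 365)
print(header)

# Later the site is served with a key it never pinned: every browser that
# remembers the old pins refuses to connect until max-age runs out.
print(browser_accepts([old_ca, backup], [new_ca]))
print(browser_accepts([old_ca, backup], [old_ca]))
```

The important property is that the check runs in the visitor's browser against pins it stored earlier, so the site operator has no way to retract a bad pin once it has been served.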

Git commit signing is even worse. With HPKP, you could easily make terrible mistakes with permanent consequences even though you knew the exact meaning of the data you were putting into the system at the time you were doing it. With signed commits, you are saying something permanently, but you don’t really know what it is that you’re saying.

Today, what is the benefit of signing a Git commit? GitHub might present it as “Verified”. It’s worth noting that only GitHub will do this, since they are the root of trust for this signing scheme. So, by signing commits and registering your keys with GitHub, you are, at best, helping to lock in GitHub as a permanent piece of infrastructure that is even harder to dislodge because they are not only where your code is stored, but also the arbiters of whether or not it is trustworthy.

In the future, what is the possible security benefit? If we all collectively decide we want Git to be more secure, then we will need to meaningfully treat signed commits differently from unsigned ones.

There’s a long tail of unsigned commits several billion entries long. And those are in the permanent record as much as the signed ones are, so future tooling will have to be able to deal with them. If, as stewards of Git, we wish to move towards a more secure Git, as the stewards of the web moved towards a more secure web, we do not have the option that the web did. In the browser, the meaning of a plain-text HTTP or incorrectly-signed HTTPS site changed, in order to encourage the site’s operator to change the site to be HTTPS.

In contrast, the meaning of an unsigned commit cannot change, because there are zillions of unsigned commits lying around in critical infrastructure and we need them to remain there. Commits cannot meaningfully be changed to become signed retroactively. Unlike an online website, they are part of a historical record, not an operating program. So we cannot establish the difference in treatment by changing how unsigned commits are treated.

That means that tooling maintainers will need to provide some difference in behavior that provides some incentive. With HTTPS, the binary choice was clear: don’t present sites with incorrect, potentially compromised configurations to users. The question was just how to achieve that. With Git commits, the difference in treatment of a “trusted” commit is far less clear.

If you will forgive me a slight straw-man here, one possible naive interpretation of a “trusted” signed commit is that it’s OK to run in CI. Conveniently, it’s not simply “trusted” in a general sense. If you signed it, it’s trusted to be from you, specifically. Surely it’s fine if we bill the CI costs for validating the PR that includes that signed commit to your GitHub account?

Now, someone can piggy-back off a 1-line typo fix that you made on top of an unsigned commit to some large repo, making you implicitly responsible for transitively signing all unsigned parent commits, even though you haven’t looked at any of the code.

Remember, also, that the only central authority that is practically trustable at this point is your GitHub account. That means that if you are using a third-party CI system, even if you’re using a third-party Git host, you can only run “trusted” code if GitHub is online and responding to requests for its “get me the trusted signing keys for this user” API. This also adds a lot of value to a GitHub credential breach, strongly motivating attackers to sneakily attach their own keys to your account so that their commits in unrelated repos can be “Verified” by you.

Let’s review the pros and cons of turning on commit signing now, before you know what it is going to be used for:

Pro	Con
Green “Verified” badge	Unknown, possibly unlimited future liability for the consequences of running code in a commit you signed
	Further implicitly cementing GitHub as a centralized trust authority in the open source world
	Introducing unknown reliability problems into infrastructure that relies on commit signatures
	A temporary breach of your GitHub credentials now leads to potentially permanent consequences if someone can smuggle a new trusted key in there
	New kinds of ongoing process overhead as commit-signing keys become new permanent load-bearing infrastructure, like “what do I do with expired keys”, “how often should I rotate these”, and so on

I feel like the “Con” column is coming out ahead.

That probably seemed like increasingly unhinged hyperbole, and it was.

In reality, the consequences are unlikely to be nearly so dramatic. The status quo has a very high amount of inertia, and probably the “Verified” badge will remain the only visible difference, except for a few repo-specific esoteric workflows, like pushing trust verification into offline or sandboxed build systems. I do still think that there is some potential for nefariousness around the “unknown and unlimited” dimension of any future plans that might rely on verifying signed commits, but any flaws are likely to be subtle attack chains and not anything flashy and obvious.

But I think that one of the biggest problems in information security is a lack of threat modeling. We encrypt things, we sign things, we institute rotation policies and elaborate useless rules for passwords, because we are looking for a “best practice” that is going to save us from having to think about what our actual security problems are.

I think the actual harm of signing git commits is to perpetuate an engineering culture of unquestioningly cargo-culting sophisticated and complex tools like cryptographic signatures into new contexts where they have no use.

Just from a baseline utilitarian philosophical perspective, for a given action A, all else being equal, it’s always better not to do A, because taking an action always has some non-zero opportunity cost even if it is just the time taken to do it. Epsilon cost and zero benefit is still a net harm. This is even more true in the context of a complex system. Any action taken in response to a rule in a system is going to interact with all the other rules in that system. You have to pay complexity-rent on every new rule. So an apparently-useless embellishment like signing commits can have potentially far-reaching consequences in the future.

Git commit signing itself is not particularly consequential. I have probably spent more time writing this blog post than the sum total of all the time wasted by all programmers configuring their git clients to add useless signatures; even the relatively modest readership of this blog will likely transfer more data reading this post than all those signatures will take to transmit to the various git clients that will read them. If I just convince you not to sign your commits, I don’t think I’m coming out ahead in the felicific calculus here.

What I am actually trying to point out here is that it is useful to carefully consider how to avoid adding junk complexity to your systems. One area where junk tends to leak into designs and cultures particularly easily is in intimidating subjects like trust and safety, where it is easy to get anxious and convince ourselves that piling on more stuff is safer than leaving things simple.

If I can help you avoid adding even a little bit of unnecessary complexity, I think it will have been well worth the cost of the writing, and the reading.

Acknowledgments

Thank you to my patrons who are supporting my writing on this blog. If you like what you’ve read here and you’d like to read more of it, or you’d like to support my various open-source endeavors, you can support me on Patreon as well! I am also available for consulting work if you think your organization could benefit from expertise on topics such as “What else should I not apply a cryptographic signature to?”.

¹ Yes yes I know about heartbleed and Bleichenbacher attacks and adoption of forward-secret ciphers and CRIME and BREACH and none of that is relevant here, okay? Jeez.
² Do you see what I did there.

January 25, 2024 12:29 AM UTC

Bruno Ponne / Coding The Past

Explore art with SQL and pd.read_sql_query

Greetings, humanists, social and data scientists!

Have you ever tried to load a large file in Python or R? When file sizes are on the order of gigabytes, you may run into performance problems, with your program taking an unusually long time to load the data. SQL, or Structured Query Language, is used to query large datasets stored in relational databases and is widely used in industry and in research. Apart from being more efficient for preparing data, SQL is sometimes the main form of access to the data sources you will encounter in your journey.

In this lesson you will learn how to use SQL in Python to retrieve data from a relational database of the National Gallery of Art (US). You will also learn how to use a relational database management system (RDBMS) and pd.read_sql_query to extract data from it in Python.

1. Data source

The database used in this lesson is made available by the National Gallery of Art (US) under a Creative Commons Zero license. The dataset contains data about more than 130,000 artworks and their artists, from the Middle Ages to the present day.

It is a wonderful resource to study history and art. Variables available include the title of the artwork, dimensions, author, description, location, country where it was produced, the year the artist started the work and the year he or she finished it. These variables are only some examples, but there is much more to explore.

2. Download and install PostgreSQL and pgAdmin

PostgreSQL is a free and very popular relational database management system. It stores and manages the tables contained in a database. Please consult this guide to install it on your computer.

After you install PostgreSQL, you will need to connect to the PostgreSQL database server. In this tutorial, we will be using the pgAdmin application to establish this connection. It is a visual and intuitive interface and makes many operations easier to execute. The guide above will also walk you through the process of connecting to your local database. In the next steps, after being connected to your local database server, we will learn how to create a database that will store the National Gallery Dataset.

3. Creating the database and its tables

After you are connected to the server, click “Databases” with the right mouse button and choose “Create” and “Database…” as shown in the image below.

How to create a database with pgAdmin

Next, give a title to your database as shown in the figure below. In our case, it will be called “art_db”. Click “Save” and it is all set!

Naming your database in pgAdmin

With the database ‘art_db’ selected, click the ‘Query Tool’ as shown below.

Where to find the query tool in pgAdmin

This will open a field where you can type SQL code. Our objective is to create the first table of our database, which will contain the content of ‘objects.csv’ available in the GitHub account of the National Gallery of Art, provided in the Data section above.

To create a table, we must specify the name and the variable type for each variable in the table. The SQL command to create a table is quite intuitive: CREATE TABLE name_of_your_table. Copy the code below and paste it in the window opened by the ‘Query Tool’. The code specifies each variable of the objects table. This table contains information on each artwork available in the collection.

CREATE TABLE objects (
    objectid                      integer NOT NULL,
    accessioned                   CHARACTER VARYING(32),
    accessionnum                  CHARACTER VARYING(32),
    locationid                    CHARACTER VARYING(32),
    title                         CHARACTER VARYING(2048),
    displaydate                   CHARACTER VARYING(256),
    beginyear                     integer,
    endyear                       integer,
    visualbrowsertimespan         CHARACTER VARYING(32),
    medium                        CHARACTER VARYING(2048),
    dimensions                    CHARACTER VARYING(2048),
    inscription                   CHARACTER VARYING,
    markings                      CHARACTER VARYING,
    attributioninverted           CHARACTER VARYING(1024),
    attribution                   CHARACTER VARYING(1024),
    provenancetext                CHARACTER VARYING,
    creditline                    CHARACTER VARYING(2048),
    classification                CHARACTER VARYING(64),
    subclassification             CHARACTER VARYING(64),
    visualbrowserclassification   CHARACTER VARYING(32),
    parentid                      CHARACTER VARYING(32),
    isvirtual                     CHARACTER VARYING(32),
    departmentabbr                CHARACTER VARYING(32),
    portfolio                     CHARACTER VARYING(2048),
    series                        CHARACTER VARYING(850),
    volume                        CHARACTER VARYING(850),
    watermarks                    CHARACTER VARYING(512),
    lastdetectedmodification      CHARACTER VARYING(64),
    wikidataid                    CHARACTER VARYING(64),
    customprinturl                CHARACTER VARYING(512)
);

The last step is to load the data from the csv file into this table. This can be done through the ‘COPY’ command as shown below.

COPY objects (objectid, accessioned, accessionnum, locationid, title, displaydate, beginyear, endyear, visualbrowsertimespan, medium, dimensions, inscription, markings, attributioninverted, attribution, provenancetext, creditline, classification, subclassification, visualbrowserclassification, parentid, isvirtual, departmentabbr, portfolio, series, volume, watermarks, lastdetectedmodification, wikidataid, customprinturl) 
FROM 'C:/temp/objects.csv' 
DELIMITER ',' 
CSV HEADER;

Download the "objects.csv" file and save it in the desired folder. Note however, that sometimes your system might block access to this file via pgAdmin. Therefore I saved it in the "temp" folder. In any case, change the path in the code above to match where you saved the "objects.csv" file.

Great! Now you should have your first table loaded to your database. The complete database includes more than 15 tables. However, we will only use two of them for this example, as shown in the schema below. Note that the two tables relate to each other through the key variable objectid.

Database scheme and relations

To load the “objects_terms” table, please repeat the same procedure with the code below.

CREATE TABLE objects_terms (
    termid             INTEGER,
    objectid           INTEGER,
    termtype           VARCHAR(64),
    term               VARCHAR(256),
    visualbrowsertheme VARCHAR(32),
    visualbrowserstyle VARCHAR(64)
);


COPY objects_terms (termid, objectid, termtype, term, visualbrowsertheme, visualbrowserstyle)
FROM 'C:/temp/objects_terms.csv' 
DELIMITER ',' 
CSV HEADER;

4. Exploring the data with SQL commands

Click the ‘Query Tool’ to start exploring the data. First, select which variables you would like to include in your analysis. Second, tell SQL in which table these variables are. The code below selects the variables title and attribution from the objects table. It also limits the result to 5 observations.

SELECT title, attribution
FROM objects
LIMIT 5

Now, we would like to know which kinds of classification exist in this dataset. To achieve that, we select the classification variable but keep only its distinct values.

SELECT DISTINCT(classification)
FROM objects

The result tells us that there are 11 classifications: “Decorative Art”, “Drawing”, “Index of American Design”, “Painting”, “Photograph”, “Portfolio”, “Print”, “Sculpture”, “Technical Material”, “Time-Based Media Art” and “Volume”.

Finally, let us group the artworks by classification and count the number of objects in each category. COUNT(*) will count the total of items in the groups defined by GROUP BY. When you select a variable you can give it a new name with AS. Finally, the command ORDER BY orders the classification by number of items in a descending order (DESC).

SELECT classification, COUNT(*) as n_items
FROM objects
GROUP BY classification
ORDER BY n_items DESC

Note that prints is the largest classification, followed by photographs.
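
If you want to try the GROUP BY / ORDER BY pattern without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The rows are made up for illustration and are not the real dataset; only the table and column names follow the example above.

```python
import sqlite3

# Made-up rows standing in for the "objects" table (not the real data).
rows = [
    (1, "Print"), (2, "Print"), (3, "Print"),
    (4, "Photograph"), (5, "Photograph"),
    (6, "Painting"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (objectid INTEGER, classification TEXT)")
conn.executemany("INSERT INTO objects VALUES (?, ?)", rows)

# Same shape as the query above: count items per group, largest group first.
query = """
    SELECT classification, COUNT(*) AS n_items
    FROM objects
    GROUP BY classification
    ORDER BY n_items DESC
"""
counts = conn.execute(query).fetchall()
conn.close()

for classification, n_items in counts:
    print(classification, n_items)
```

With these toy rows, "Print" comes first with 3 items, followed by "Photograph" with 2 and "Painting" with 1.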

5. Using pd.read_sql_query to access data

Now that you have your SQL database working, it is time to access it with Python. Before using Pandas, we have to connect Python to our SQL database. We will do that with psycopg2, a very popular PostgreSQL adapter for Python. Please, install it with pip install psycopg2.

We use the connect method of psycopg2 to establish the connection. It takes 4 main arguments:

host: in our case, the database is hosted locally, so we will pass localhost to this parameter. Note, however, that we could specify an IP if the server was external;
database: the name given to your SQL database, art_db;
user: user name required to authenticate;
password: your database password.

import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="localhost",
    database="art_db",
    user="postgres",
    password="*******")

The next step is to store our SQL query in a Python string variable. The query below performs a LEFT JOIN with the two tables in our database. The operation uses the variable objectid to join the two tables. In practice we are selecting the titles, authors (attribution), classification (we keep only “Painting” with a WHERE clause), and term (we keep only terms that specify the “Style” of the painting).

command = ''' SELECT o.title, o.attribution, o.classification, ot.term
                FROM objects as o
                LEFT JOIN objects_terms as ot ON o.objectid = ot.objectid
                WHERE classification = 'Painting' AND termtype = 'Style' '''
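
To see what this query returns on a small scale, here is a hedged sketch that runs the same statement against an in-memory SQLite database via the standard sqlite3 module. All rows are hypothetical, and the tables carry only the columns the query needs. One thing it makes visible: because the WHERE clause filters on termtype, a column from the right-hand table, paintings without any matching term are dropped even though the join is a LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (objectid INTEGER, title TEXT, attribution TEXT, classification TEXT);
    CREATE TABLE objects_terms (termid INTEGER, objectid INTEGER, termtype TEXT, term TEXT);

    -- Hypothetical rows, invented for this sketch.
    INSERT INTO objects VALUES
        (1, 'Painting A', 'Artist A', 'Painting'),
        (2, 'Painting B', 'Artist B', 'Painting'),
        (3, 'Print C',    'Artist C', 'Print');
    INSERT INTO objects_terms VALUES
        (10, 1, 'Style',  'Baroque'),
        (11, 1, 'School', 'Dutch'),
        (12, 3, 'Style',  'Realist');
""")

command = """
    SELECT o.title, o.attribution, o.classification, ot.term
    FROM objects AS o
    LEFT JOIN objects_terms AS ot ON o.objectid = ot.objectid
    WHERE o.classification = 'Painting' AND ot.termtype = 'Style'
"""
result = conn.execute(command).fetchall()
conn.close()

# 'Painting B' has no terms at all, so its joined row has termtype NULL
# and does not pass the WHERE filter.
print(result)
```

Only 'Painting A' with the term 'Baroque' survives: the print is excluded by classification, the 'School' term by termtype, and 'Painting B' because it has no style term.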

Finally, we can extract the data. Use the cursor() method of conn to be able to “type” your SQL query. Pass the command variable and the connection object to pd.read_sql_query and it will return a Pandas dataframe with the data we selected. Next, commit and close the cursor and the connection.

# open cursor to insert our query
cur = conn.cursor()

# use pd.read_sql_query to query our database and get the result in a pandas dataframe
paintings = pd.read_sql_query(command, conn)

# save any changes to the database
conn.commit()

# close cursor and connection
cur.close()
conn.close()

6. Visualizing the most popular styles

From the data we gathered from our database, we would like to check which are the 10 most popular art styles in our data, by number of paintings. We can use the value_counts() method of the column term to count how many paintings are classified in each style.

The result is a Pandas Series where the index contains the styles and the values contain the number of paintings in each style. The remaining code produces a horizontal bar plot showing the top 10 styles by number of paintings. If you would like to learn more about data visualization with matplotlib, please consult the lesson Storytelling with Matplotlib - Visualizing historical data.

import matplotlib.pyplot as plt

top_10_styles = paintings['term'].value_counts().head(10)

fig, ax = plt.subplots()

ax.barh(top_10_styles.index, top_10_styles.values, 
        color = "#f0027f", 
        edgecolor = "#f0027f")

ax.set_title("The Most Popular Styles")

# inverts y axis
ax.invert_yaxis()

# eliminates grids
ax.grid(False)

# set ticks' colors to white    
ax.tick_params(axis='x', colors='white')    
ax.tick_params(axis='y', colors='white')

# set the axes background color and the title color
ax.set_facecolor('#2E3031')
ax.title.set_color('white')   

# eliminates top, left and right borders and sets the bottom border color to white
ax.spines["top"].set_visible(False)         
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_color("white")

# fig background color:
fig.patch.set_facecolor('#2E3031')

Note that Realist, Baroque and Renaissance are the most popular art styles in our dataset.

The Top 10 Art Styles

Please feel free to share your thoughts and questions below!

7. Conclusions

It is possible to create a SQL database from csv files and access it with Python;
psycopg2 enables connection between Python and your SQL database;
pd.read_sql_query can be used to extract data into a Pandas dataframe.

January 25, 2024 12:00 AM UTC

January 24, 2024

TechBeamers Python

How Do I Install Pip in Python?

In this tutorial, we’ll provide all the necessary steps for you to install Pip in Python on both Windows and Linux platforms. If you’re using a recent version of Python (Python 3.4 and above), pip is likely already installed. To check if pip is installed, open a command prompt or terminal and run: If it’s […]

The post How Do I Install Pip in Python? appeared first on TechBeamers.

January 24, 2024 05:40 PM UTC

How Do You Filter a List in Python?

In this tutorial, we’ll explain different methods to filter a list in Python with the help of multiple examples. You’ll learn to use the Python filter() function, list comprehension, and also use Python for loop to select elements from the list. Filter a List in Python With the Help of Examples As we know there […]

The post How Do You Filter a List in Python? appeared first on TechBeamers.

January 24, 2024 02:33 PM UTC

Real Python

What Are Python Raw Strings?

If you’ve ever come across a standard string literal prefixed with either the lowercase letter r or the uppercase letter R, then you’ve encountered a Python raw string:

>>> r"This is a raw string"
'This is a raw string'

Although a raw string looks and behaves mostly the same as a normal string literal, there’s an important difference in how Python interprets some of its characters, which you’ll explore in this tutorial.

Notice that there’s nothing special about the resulting string object. Whether you declare your literal value using a prefix or not, you’ll always end up with a regular Python str object.

Other prefixes available at your fingertips, which you can use and sometimes even mix together in your Python string literals, include:

b: Bytes literal
f: Formatted string literal
u: Legacy Unicode string literal (PEP 414)

Out of those, you might be most familiar with f-strings, which let you evaluate expressions inside string literals. Raw strings aren’t as popular as f-strings, but they do have their own uses that can improve your code’s readability.

Creating a string of characters is often one of the first skills that you learn when studying a new programming language. The Python Basics book and learning path cover this topic right at the beginning. With Python, you can define string literals in your source code by delimiting the text with either single quotes (') or double quotes ("):

>>> david = 'She said "I love you" to me.'
>>> alice = "Oh, that's wonderful to hear!"

Having such a choice can help you avoid a syntax error when your text includes one of those delimiting characters (' or "). For example, if you need to represent an apostrophe in a string, then you can enclose your text in double quotes. Alternatively, you can use multiline strings to mix both types of delimiters in the text.

You may use triple quotes (''' or """) to declare a multiline string literal that can accommodate a longer piece of text, such as an excerpt from the Zen of Python:

>>> poem = """
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... """

Multiline string literals can optionally act as docstrings, a useful form of code documentation in Python. Docstrings can include bare-bones test cases known as doctests, as well.

Regardless of the delimiter type of your choice, you can always prepend a prefix to your string literal. Just make sure there’s no space between the prefix letters and the opening quote.

When you use the letter r as the prefix, you’ll turn the corresponding string literal into a raw string counterpart. So, what are Python raw strings exactly?

In Short: Python Raw Strings Ignore Escape Character Sequences

In some cases, defining a string through the raw string literal will produce precisely the same result as using the standard string literal in Python:

>>> r"I love you" == "I love you"
True

Here, both literals represent string objects that share a common value: the text I love you. Even though the first literal comes with a prefix, it has no effect on the outcome, so both strings compare as equal.

To observe the real difference between raw and standard string literals in Python, consider a different example depicting a date formatted as a string:

>>> r"10\25\1991" == "10\25\1991"
False

This time, the comparison turns out to be false even though the two string literals look visually similar. Unlike before, the resulting string objects no longer contain the same sequence of characters. The raw string’s prefix (r) changes the meaning of special character sequences that begin with a backslash (\) inside the literal.

Note: To understand how Python interprets the above string, head over to the final section of this tutorial, where you’ll cover the most common types of escape sequences in Python.
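
As a quick way to see the difference for yourself, the sketch below compares the two literals character by character. In the standard literal, Python reads \25 and \1 as octal escape sequences (the digit 9 is not an octal digit, so the second escape stops after the 1), while the raw literal keeps every backslash as an ordinary character. The variable names are only for illustration.

```python
raw = r"10\25\1991"
standard = "10\25\1991"

# The raw string keeps both backslashes, so it has 10 characters.
print(len(raw))

# In the standard string, "\25" becomes the single character chr(0o25)
# and "\1" becomes chr(0o1), leaving 7 characters in total.
print(len(standard))
print(standard == "10" + chr(0o25) + chr(0o1) + "991")

# repr() makes the two control characters visible as \x15 and \x01.
print(repr(standard))
```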

Read the full article at https://realpython.com/python-raw-strings/ »

January 24, 2024 02:00 PM UTC

Ned Batchelder

You (probably) don’t need to learn C

On Mastodon I wrote that I was tired of people saying, “you should learn C so you can understand how a computer really works.” I got a lot of replies which did not change my mind, but helped me understand more how abstractions are inescapable in computers.

People made a number of claims. C was important because syscalls are defined in terms of C semantics (they are not). They said it was good for exploring limited-resource computers like Arduinos, but most people don’t program for those. They said it was important because C is more performant, but Python programs often offload the compute-intensive work to libraries other people have written, and these days that work is often on a GPU. Someone said you need it to debug with strace, then someone said they use strace all the time and don’t know C. Someone even said C was good because it explains why NUL isn’t allowed in filenames, but who tries to do that, and why learn a language just for that trivia?

I’m all for learning C if it will be useful for the job at hand, but you can write lots of great software without knowing C.

A few people repeated the idea that C teaches you how code “really” executes. But C is an abstract model of a computer, and modern CPUs do all kinds of things that C doesn’t show you or explain. Pipelining, cache misses, branch prediction, speculative execution, multiple cores, even virtual memory are all completely invisible to C programs.

C is an abstraction of how a computer works, and chip makers work hard to implement that abstraction, but they do it on top of much more complicated machinery.

C is far removed from modern computer architectures: there have been 50 years of innovation since it was created in the 1970s. The gap between C’s model and modern hardware is the root cause of famous vulnerabilities like Meltdown and Spectre, as explained in C is Not a Low-level Language.

C can teach you useful things, like how memory is a huge array of bytes, but you can also learn that without writing C programs. People say, C teaches you about memory allocation. Yes it does, but you can learn what that means as a concept without learning a programming language. And besides, what will Python or Ruby developers do with that knowledge other than appreciate that their languages do that work for them and they no longer have to think about it?

Pointers came up a lot in the Mastodon replies. Pointers underpin concepts in higher-level languages, but you can explain those concepts as references instead, and skip pointer arithmetic, aliasing, and null pointers completely.

A question I asked a number of people: “what mistakes are JavaScript/Ruby/Python developers making if they don’t know these things (C, syscalls, pointers)?” I didn’t get strong answers.

We work in an enormous tower of abstractions. I write programs in Python, which provides me abstractions that C (its underlying implementation language) does not. C provides an abstract model of memory and CPU execution which the computer implements on top of other mechanisms (microcode and virtual memory). When I made a wire-wrapped computer, I could pretend the signal travelled through wires instantaneously. For other hardware designers, that abstraction breaks down and they need to consider the speed electricity travels. Sometimes you need to go one level deeper in the abstraction stack to understand what’s going on. Everyone has to find the right layer to work at.

Andy Gocke said it well:

When you no longer have problems at that layer, that’s when you can stop caring about that layer. I don’t think there’s a universal level of knowledge that people need or is sufficient.

“like jam or bootlaces” made another excellent point:

There’s a big difference between “everyone should know this” and “someone should know this” that seems to get glossed over in these kinds of discussions.

C can teach you many useful and interesting things. It will make you a better programmer, just as learning any new-to-you language will because it broadens your perspective. Some kinds of programming need C, though other languages like Rust are ably filling that role now too. C doesn’t teach you how a computer really works. It teaches you a common abstraction of how computers work.

Find a level of abstraction that works for what you need to do. When you have trouble there, look beneath that abstraction. You won’t be seeing how things really work, you’ll be seeing a lower-level abstraction that could be helpful. Sometimes what you need will be an abstraction one level up. Is your Python loop too slow? Perhaps you need a C loop. Or perhaps you need numpy array operations.

You (probably) don’t need to learn C.

January 24, 2024 11:38 AM UTC

IslandT

How to search multiple lines with Python?

Often you will want to search for words or phrases across an entire paragraph, and here is the Python regular expression code that will do that.

import re

pattern = re.compile(r'^\w+ (\w+) (\w+)', re.M)

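We use the re.M (multiline) flag so that ^ matches at the start of every line, not only at the start of the whole string; this lets the pattern find matches on each line of the paragraph.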

Now let us try out the program above…

gad = pattern.findall("hello mr Islandt\nhello mr gadgets")
print(gad)

…which will then display the following outcome

[('mr', 'Islandt'), ('mr', 'gadgets')]

Explanation :

The pattern matches three words at the start of a line and captures the second and third words in a tuple. Because of the re.M flag, once the program meets the newline character it continues the search on the second line and returns another tuple; findall collects both tuples in a list. With the re.M flag, the search goes on line by line for as long as there are more matches!
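
To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without re.M on the same text. The variable names are chosen for this example only.

```python
import re

text = "hello mr Islandt\nhello mr gadgets"
regex = r'^\w+ (\w+) (\w+)'

# Without re.M, ^ only matches at the very start of the string,
# so only the first line can produce a match.
without_flag = re.findall(regex, text)

# With re.M, ^ also matches right after each newline.
with_flag = re.findall(regex, text, re.M)

print(without_flag)  # [('mr', 'Islandt')]
print(with_flag)     # [('mr', 'Islandt'), ('mr', 'gadgets')]
```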

January 24, 2024 09:34 AM UTC

PyBites

Exploring the Role of Static Methods in Python: A Functional Perspective

Introduction

Python’s versatility in supporting different programming paradigms, including procedural, object-oriented, and functional programming, opens up a rich landscape for software design and development.

Among these paradigms, the use of static methods in Python, particularly in an object-oriented context, has been a topic of debate.

This article delves into the role and implications of static methods in Python, weighing them against a more functional approach that leverages modules and functional programming principles.

The Nature of Static Methods in Python

Definition and Usage:

Static methods in Python are defined within a class using the @staticmethod decorator.

Unlike regular methods, they do not require an instance (self) or class (cls) reference.

They are typically used for utility functions that logically belong to a class but are independent of class instances.

Example in Practice:

Consider this code example from Django:

# django/db/backends/oracle/operations.py
class DatabaseOperations(BaseDatabaseOperations):

  # ... other methods and attributes ...

  @staticmethod
  def convert_empty_string(value, expression, connection):
    return "" if value is None else value

  @staticmethod
  def convert_empty_bytes(value, expression, connection):
    return b"" if value is None else value

Here, convert_empty_string and convert_empty_bytes are static due to their utility nature and specific association with the DatabaseOperations class.

The Case for Modules and Functional Programming

Embracing Python’s Module System:

Python’s module system allows for effective namespace management and code organization.

Namespaces are one honking great idea — let’s do more of those!
The Zen of Python, by Tim Peters

Functions, including those that could be static methods, can be organized in modules, making them reusable and easily accessible.
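
As a rough sketch of that idea, compare a static method with an equivalent plain function. The class name Converters and the module name converters.py are hypothetical, chosen only for this illustration; they are not Django's actual layout, and both definitions sit in one file here just so the example runs on its own.

```python
# Option 1: a static method, namespaced by a (hypothetical) class.
class Converters:
    @staticmethod
    def convert_empty_string(value):
        return "" if value is None else value


# Option 2: the same behavior as a plain function. In a real project it
# would live in its own module, e.g. a hypothetical converters.py, and the
# module would provide the namespace: converters.convert_empty_string(...).
def convert_empty_string(value):
    return "" if value is None else value


# Both are called without creating an instance and return the same results.
print(Converters.convert_empty_string(None) == convert_empty_string(None))
print(Converters.convert_empty_string("abc") == convert_empty_string("abc"))
```

Neither version touches instance or class state, which is why the choice between them is mostly about where the name should live.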

Functional Programming Advantages:

Quick Development: Functional programming emphasizes simplicity and stateless operations, leading to concise and readable code.
Code Resilience: Pure functions (functions that do not alter external state) enhance predictability and testability. Related: 10 Tips to Write Better Functions in Python
Separation of Concerns: Using functions and modules promotes a clean separation of data representation (classes) and behavior (functions).

Combining Object-Oriented and Functional Approaches

Hybrid Strategy:

Abstraction with Classes: Use classes for data representation, encapsulating state and behavior that are closely related. See also our When to Use Classes article.
Functional Constructs: Utilize functional concepts like higher-order functions, immutability, and pure functions for business logic and data manipulation.
Factories and Observers: Implement design patterns like factory and observer for creating objects and managing state changes, respectively (shout-out to Brandon Rhodes’ great design patterns guide!)

Conclusion: Striking the Right Balance

The decision to use static methods, standalone functions, or a functional programming approach in Python depends on several factors:

Relevance: Is the function logically part of a class’s responsibilities?
Reusability: Would the function be more versatile as a standalone module function?
Simplicity: Can the use of regular functions simplify the class structure and align with the Single Responsibility Principle? Related article: Tips for clean code in Python.

Ultimately, the choice lies in finding the right balance that aligns with the application’s architecture, maintainability, and the development team’s expertise.

Python, with its multi-paradigm capabilities, offers the flexibility to adopt a style that best suits the project’s needs.

Fun Fact: Static Methods Were an Accident

Guido added static methods by accident! He originally meant to add class methods instead.

I think the reason is that a module at best acts as a class where every method is a *static* method, but implicitly so. And we all know how limited static methods are. (They’re basically an accident — back in the Python 2.2 days when I was inventing new-style classes and descriptors, I meant to implement class methods but at first I didn’t understand them and accidentally implemented static methods first. Then it was too late to remove them and only provide class methods.)
Guido van Rossum, see the discussion thread here, and thanks Will for pointing me to this.

Call to Action

What’s your approach to using static methods in Python?

Do you favor a more functional style, or do you find static methods indispensable in certain scenarios?

Share your thoughts and experiences in our community …

January 24, 2024 09:21 AM UTC

eGenix.com

eGenix Antispam Bot for Telegram 0.6.0 GA

Introduction

eGenix has long been running a local user group meeting in Düsseldorf called Python Meeting Düsseldorf and we are using a Telegram group for most of our communication.

In the early days, the group worked well and we only had a few spammers joining it, which we could easily handle manually.

More recently, this has changed dramatically. We are seeing between 2 and 5 spam signups per day, often at night. Furthermore, the signup accounts are not always easy to spot as spammers, since they often come with profile images, descriptions, etc.

With the bot, we now have a more flexible way of dealing with the problem.

Please see our project page for details and download links.

Features

Low impact mode of operation: the bot tries to keep noise in the group to a minimum
Several challenge mechanisms to choose from, more can be added as needed
Flexible and easy to use configuration
Only needs a few MB of RAM, so can easily be put into a container or run on a Raspberry Pi
Can handle quite a bit of load due to the async implementation
Works with Python 3.9+
MIT open source licensed

News

The 0.6.0 release fixes a few bugs and adds more features:

Upgraded to pyrogram 2.0.106, which fixes a weird error we have been getting recently with the old version 1.4.16 (see pyrogram/pyrogram#1347)
Catch weird error from Telegram when deleting conversations; this seems to sometimes fail, probably due to a glitch on their side
Made the math and char entry challenges a little harder
Added new DictItemChallenge

It has been battle-tested in production for several years already and is proving to be a really useful tool to help with Telegram group administration.

More Information

For more information on the eGenix.com Python products, licensing and download instructions, please write to [email protected].

Enjoy!

Marc-Andre Lemburg, eGenix.com

January 24, 2024 08:00 AM UTC

Wing Tips

AI Assisted Development in Wing Pro

This Wing Tip introduces Wing Pro's AI assisted software development capabilities. Starting with Wing Pro version 10, you can use generative AI to write new code at the current editor insertion point, or you can use the AI tool to refactor, redesign, or extend existing code.

Generative AI is astonishingly capable as a programmer's assistant. As long as you provide it with sufficient context and clear instructions, it can cleanly and correctly execute a wide variety of programming tasks.

AI Code Suggestion

Here is an example where Wing Pro's AI code suggestion capability is used to write a missing method for an existing class. The AI knows what to add because it can see what precedes and follows the insertion point in the editor. It infers from that context what code you would like it to produce:

Shown above: Typing 'def get_full_name' followed by Ctrl-? to initiate AI suggestion mode. The suggested code is accepted by pressing Enter.

AI Refactoring

AI refactoring is even more powerful. You can request changes to existing code according to written instructions. For example, you might ask it to "convert this threaded implementation to run asynchronously instead":

Shown above: Running the highlighted request in the AI tool to convert multithreaded code to run asynchronously instead.

Description-Driven Development

Wing Pro's AI refactoring tool can also be used to write new code at the current insertion point, according to written instructions. For example, you might ask it to "add client and server classes that expose all the public methods of FileManager to a client process using sockets and JSON":

Writing new code with AI refactoring in Wing Pro

Shown above: Using the AI tool to request implementation of client/server classes for remote access to an existing class.

Simpler and perhaps more common requests like "write documentation strings for these methods" and "create unit tests for class Person" of course also work. In general, Wing Pro's AI assistant can do any reasonably sized chunk of work for which you can clearly state instructions.

Used correctly, this capability will have a significant impact on your productivity as a programmer. Instead of typing out code manually, your role changes to one of directing an intelligent assistant capable of completing a wide range of programming tasks very quickly. You will still need to review and accept or reject the AI's work. Generative AI can't replace you, but it allows you to concentrate much more on higher-level design and much less on implementation details.

Getting Started

Wing Pro uses OpenAI as its AI provider, and you will need to create and pay for your own OpenAI account before you can use this feature. You may need to pay up to US $50 up front to be given computational rate limits that are high enough to use AI for your software development. However, individual requests often cost less than US $0.01. More complex tasks may cost up to 30 cents, if you provide a lot of context with them. This is still far less than the paid programmer time the AI is replacing.

To use AI assisted development effectively, you will need to learn how to create well-designed requests that provide the AI with both the necessary relevant context and clear and specific instructions. Please read all of the AI Assisted Development documentation for details on setup, framing requests, and monitoring costs. It takes a bit of time to get started, but it is well worth the effort to incorporate generative AI into your tool chain.

That's it for now! We'll be back soon with more Wing Tips for Wing Python IDE.

As always, please don't hesitate to email support@wingware.com if you run into problems or have any questions.

January 24, 2024 01:00 AM UTC

Seth Michael Larson

Releases on the Python Package Index are never “done”

This critical role would not be possible without funding from the OpenSSF Alpha-Omega project. Massive thank-you to Alpha-Omega for investing in the security of the Python ecosystem!

PEP 740 and open-ended PyPI releases

PEP 740 is a proposal to add support for digital attestations to PyPI artifacts, for example publish provenance attestations, which can be verified and used by tooling.

William Woodruff has been working on PEP 740, which is in draft on GitHub, and addressed my feedback this week. During this work, the open-endedness of PyPI releases came up in our discussion, specifically how it is a common gotcha for folks designing tools and policy for multiple software ecosystems.

What does it mean for PyPI releases to be open-ended? It means that you can always upload new files to an existing release on PyPI, even if the release was created years ago. This is because a PyPI “release” is only a thin layer aggregating a bunch of files on PyPI that happen to share the same version.

This discussion between us was opened up as a wider discussion on discuss.python.org about this property. Summarizing this discussion:

New Python releases mean new wheels need to be built for non-ABI3 compatible projects. IMO this is the most compelling reason to keep this property.
Draft releases seem semi-related, being able to put artifacts into a "queue" before making them public.
Ordering of which wheel gets evaluated as an installation candidate isn't defined well. Up to installers, tends to be more specific -> less specific.
PyPI doesn't allow single files to be yanked even though PEP 592 allows for yanking at the file level instead of only the release level.
The "attack" vector is fairly small, this property would mostly only provide additional secrecy for attackers by blending into existing releases.

CPython Software Bill-of-Materials update

CPython 3.13.0a3 was released. This is the very first CPython release that contains any SBOM metadata at all, and thus we can create an initial draft SBOM document.

Much of the work on CPython's SBOMs was done to fix issues related to pip's vendored dependencies and issues found by downstream distributors of CPython builds like Red Hat. The issues were as follows:

Don't require internet access to run the SBOM script. We use internet access to automatically generate metadata for pip, but if the internet isn't available we should continue using the metadata that we already have (assuming the file hasn't changed) and then rely on CI which should always have internet access (the script fails in CI) to verify the values.
If pip wheel is removed, don't raise an unskippable error. Redistributors will typically remove the wheel in favor of their own distribution of pip for ensurepip.
Enumerate pip's vendored dependencies in the SBOM. This requires parsing the vendor.txt script inside of pip's vendor directory.

All of these issues are mostly related and touch the same place in the codebase, so they resulted in a medium-sized pull request to fix them all together.
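
To give a feel for the kind of parsing involved in enumerating vendored dependencies, here is a minimal sketch. It assumes the file holds one pinned requirement per line in the form name==version, possibly with comments; that format, the function name parse_pinned_requirements, and the sample contents are all assumptions for illustration. This is not the actual CPython SBOM script.

```python
def parse_pinned_requirements(text):
    """Return {name: version} for lines shaped like 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not a pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


# Made-up contents, not pip's real vendor.txt.
sample = """\
# a comment line
examplepkg==1.2.3
otherpkg==0.9  # trailing comment
    indentedpkg==2.0.0
"""

pins = parse_pinned_requirements(sample)
print(pins)
```

Each resulting name and version pair could then become one component entry in an SBOM document.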

On the release side, I've addressed feedback from the first round of reviews for generating SBOMs for source code artifacts and uploading them during the release. Once those SBOMs start being generated they'll automatically begin being added to python.org/downloads.

Other items

Two new Developer-in-Residence roles have been filled at the Python Software Foundation. Welcome, Petr Viktorin as the Deputy Developer-in-Residence and Serhiy Storchaka as the Supporting Developer-in-Residence. We've already gotten a chance to collaborate and I look forward to even more.
scikit-learn is considering build reproducibility.
Wrote my piece for the Python Software Foundation Annual Impact report.
Submitted to the OpenSSF SOSS Community Day Call for Proposals (see you in Washington!)
Reviewed a fix by Erlend Aasland for the SBOM generation script.
I published a blog post which provides guidance on how to remove a maintainer from an open source project to reduce the project's attack surface.

That's all for this week! 👋 If you're interested in more you can read last week's report.

Thanks for reading! ♡ Did you find this article helpful and want more content like it? Get notified of new posts by subscribing to the RSS feed or the email newsletter.

This work is licensed under CC BY-SA 4.0

January 24, 2024 12:00 AM UTC

January 23, 2024

Kay Hayen

Nuitka Package Configuration Part 3

This is the third part of a post series under the tag package_config that explains the Nuitka package configuration in more detail. To recap, Nuitka package configuration is the way Nuitka learns about hidden dependencies, needed DLLs, data files, and just generally avoids bloat in the compilation. The details are on a dedicated page on the web site, Nuitka Package Configuration, but reading on will be just fine.

Problem Package

Each post will feature one package that caused a particular problem. In this case, we are talking about the package toga.

Problems like the one with this package are typically encountered in standalone mode only, but they also affect accelerated mode, since it doesn’t compile everything desired in that case. Some packages, this one included, look at what OS they are running on, environment variables, etc., and then, in a relatively static fashion that Nuitka nonetheless cannot see through, load what they call a “backend” module.

We are going to look at that in some detail, and will see a workaround applied with the anti-bloat engine, which modifies code on the fly so that the choice is determined at compile time and becomes visible to Nuitka that way.

Initial Symptom

The initial symptom reported was that toga suffered from broken version lookups and therefore did not work. We actually encountered two things that prevented it from working; one was about the version number. It was trying to call int after resolving the version of toga by itself to None.

Traceback (most recent call last):
  File "C:\py\dist\toga1.py", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\toga\__init__.py", line 1, in <module toga>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\toga\app.py", line 20, in <module toga.app>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\toga\widgets\base.py", line 7, in <module toga.widgets.base>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\travertino\__init__.py", line 4, in <module travertino>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\setuptools_scm\__init__.py", line 7, in <module setuptools_scm>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\setuptools_scm\_config.py", line 15, in <module setuptools_scm._config>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\setuptools_scm\_integration\pyproject_reading.py", line 8, in <module setuptools_scm._integration.pyproject_reading>
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "C:\py\dist\setuptools_scm\_integration\setuptools.py", line 62, in <module setuptools_scm._integration.setuptools>
  File "C:\py\dist\setuptools_scm\_integration\setuptools.py", line 29, in _warn_on_old_setuptools
ValueError: invalid literal for int() with base 10: 'unknown'

So, this is clearly something that we consider bloat in the first place: looking up your own version number at run time. Using setuptools_scm implies using setuptools, for which the version cannot be determined, and that’s what is crashing.
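The failure at the bottom of the traceback is easy to reproduce in isolation. This is only a sketch: `major_version` is a hypothetical helper, not the code in setuptools_scm, and it assumes the version string falls back to 'unknown' when no metadata can be found.

```python
def major_version(version: str) -> int:
    """Parse the leading component of a dotted version string."""
    return int(version.split(".")[0])


print(major_version("69.0.3"))

try:
    major_version("unknown")  # what a failed metadata lookup might yield
except ValueError as exc:
    message = str(exc)
    print(message)
```

The second call raises the same ValueError text that ends the traceback above.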

Step 1 - Analysis of initial crashing

So the first thing we did was to repair setuptools so that it knows its version. It does this a bit differently, because it cannot use itself. Our compile time optimization failed there, but would also be overkill. We never came across this, since we normally avoid setuptools very hard, but it’s not good to be incompatible.

- module-name: 'setuptools.version'
  anti-bloat:
    - description: 'workaround for metadata version of setuptools'
      replacements:
        "pkg_resources.get_distribution('setuptools').version": "repr(__import__('setuptools.version').version.__version__)"

We do not have to include all metadata for setuptools here, just to get that one item, so we chose to make a simple string replacement here, that just looks the value up at compile time and puts it into the source code automatically. That removes the pkg_resources.get_distribution() call entirely.
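Conceptually, such a replacement is a source-to-source substitution in which the new text is itself computed at compile time. The sketch below imitates that with plain string handling; `apply_replacement`, the sample source line, and the version value are hypothetical and do not reflect how Nuitka implements the feature.

```python
def apply_replacement(source: str, old: str, value_expression: str, context: dict) -> str:
    """Evaluate value_expression now and splice its result into the source."""
    new_text = eval(value_expression, {}, context)
    return source.replace(old, new_text)


# Stand-in for the module being compiled.
source = "__version__ = pkg_resources.get_distribution('setuptools').version\n"

patched = apply_replacement(
    source,
    old="pkg_resources.get_distribution('setuptools').version",
    # The expression must yield source text, hence the repr().
    value_expression="repr(known_version)",
    context={"known_version": "69.0.3"},  # assumed value for illustration
)
print(patched)
```

After the substitution the module contains a plain string literal, so nothing has to be looked up at run time.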

With that, setuptools_scm was not crashing anymore. That’s good. But we don’t really want it to be included: it’s good for dynamically detecting the version from git and whatnot, but including the framework for building C extensions is not a good idea in the general case. Nuitka therefore said this:

Nuitka-Plugins:WARNING: anti-bloat: Undesirable import of 'setuptools_scm' (intending to
Nuitka-Plugins:WARNING: avoid 'setuptools') in 'toga' (at
Nuitka-Plugins:WARNING: 'c:\3\Lib\site-packages\toga\__init__.py:99') encountered. It may
Nuitka-Plugins:WARNING: slow down compilation.
Nuitka-Plugins:WARNING:     Complex topic! More information can be found at
Nuitka-Plugins:WARNING: https://nuitka.net/info/unwanted-module.html

So that’s informing the user to take action. And in the case of optional imports, i.e. ones where the using code will handle the ImportError just fine and work without it, we can do this.

- module-name: 'toga'
  anti-bloat:
    - description: 'remove setuptools usage'
      no-auto-follow:
        'setuptools_scm': ''
      when: 'not use_setuptools'

Here we say: do not automatically follow setuptools_scm imports, unless there is other code that still does. In that way, the import still happens if some other part of the code imports the module, but only then. We no longer enforce the non-usage of a module here, we just make that decision based on other uses being present.
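An optional import, in the sense used above, is one that the importing code guards so that it keeps working when the module is absent. A minimal sketch of that pattern follows; the `detect_version` function and its fallback value are assumptions for illustration, not toga's actual code.

```python
import importlib


def detect_version(scm_module: str = "setuptools_scm", fallback: str = "0.0.0") -> str:
    """Use the SCM helper when it is importable, else return a fallback."""
    try:
        scm = importlib.import_module(scm_module)
    except ImportError:
        # Module left out of the build: carry on without it.
        return fallback
    try:
        return scm.get_version()
    except Exception:
        # Not inside a source checkout, or the helper failed for another reason.
        return fallback


print(detect_version("module_that_is_not_installed"))
```

Code written this way keeps running when the compiler leaves the module out, which is what makes it safe not to follow the import.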

With this, the bloat warning and the inclusion of setuptools_scm in the compilation are removed. You always want to make the result as small as possible and remove those packages that do not contribute anything but overhead, aka bloat.

The next thing discovered was that toga needs the toga-core distribution for its version check. For that, we use the common solution and state that we want to include its metadata whenever toga is part of a compilation.

- module-name: 'toga'
  data-files:
    include-metadata:
      - 'toga-core'

So that resolved the entire issue of version lookups.
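Runtime version checks of this kind commonly go through `importlib.metadata` (assumed here), which is why the distribution's metadata has to ship alongside the compiled code. A small sketch of a lookup by distribution name; `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None when metadata is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version("toga-core"))  # None unless toga-core is installed
print(installed_version("no-such-distribution-for-demo"))
```

Without the included metadata, a lookup like this would take the `PackageNotFoundError` branch inside the compiled program.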

Step 2 - Dynamic Backend dependency

Now on to the backend issue. What remained was the need to include the platform specific backend, one that can even be overridden by an environment variable. For full compatibility, we invented something new. Typically, what we would have done is create a toga plugin; instead, we wrote the following snippet.

- module-name: 'toga.platform'
  variables:
    setup_code: 'import toga.platform'
    declarations:
      'toga_backend_module_name': 'toga.platform.get_platform_factory().__name__'
  anti-bloat:
    - change_function:
        'get_platform_factory': "'importlib.import_module(%r)' % get_variable('toga_backend_module_name')"

There is a whole new thing here, a new feature that was added specifically for this to be easy to do. And with the backend selection being complex and partially dynamic code, we didn’t want to hard code that. So we added support for variables and their use in Nuitka Package Configuration.

The first block, variables, defines a mapping of expressions in declarations that will be evaluated at compile time after running the setup code given under setup_code.

This then allows us to have a variable with the name of the backend that toga decides to use. We then change the very complex function get_platform_factory, for compilation, into a replacement that Nuitka is able to statically optimize, so that it sees the backend as a dependency and uses it directly at run time, which is what we want.
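The difference between the dynamic lookup and its compile-time replacement can be sketched in a few lines. Everything here is a stand-in: the `BACKENDS` mapping, the `DEMO_BACKEND` environment variable name, and the use of the stdlib modules `ntpath` and `posixpath` as "backends" are assumptions for illustration, not toga's real logic.

```python
import importlib

# Hypothetical mapping; stdlib modules stand in for real backends.
BACKENDS = {"win32": "ntpath", "linux": "posixpath", "darwin": "posixpath"}


def backend_name(platform: str, environ: dict) -> str:
    """Pick a backend module name the way a dynamic factory might."""
    # An environment variable (name assumed) can override the platform choice.
    override = environ.get("DEMO_BACKEND")
    return override or BACKENDS.get(platform, "posixpath")


def get_backend_dynamic(platform: str, environ: dict):
    # The module name only exists at run time, so a compiler cannot follow it.
    return importlib.import_module(backend_name(platform, environ))


def get_backend_static():
    # Once the decision is made at compile time, the name is a literal.
    return importlib.import_module("posixpath")


print(backend_name("win32", {}))
print(get_backend_dynamic("linux", {}).__name__)
print(get_backend_static().__name__)
```

The static variant is the shape the rewritten function takes: a constant module name that a compiler can see and include as a dependency.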

Final remarks

I am hoping you will find this very helpful information and will join the effort to make packaging for Python work out of the box. Adding support for toga was a bit more complex, but with the new tool, once a problem is identified as this kind of backend issue, it should become a lot easier.

Lessons learned: we should cover packages that we routinely remove from compilation, like setuptools, but e.g. also IPython. This will have to be added, such that setuptools_scm cannot cloud the view of actual issues.

January 23, 2024 11:00 PM UTC

Quansight Labs Blog

Captioning: A Newcomer’s Guide

What are those words on the bottom of your video screen and where do they come from? Captioning’s normalization in the past several decades may seem like it would render those questions moot, but understanding more about captions means making more informed decisions about when, how, and why we make sure information is accessible.

January 23, 2024 09:41 PM UTC

PyCoder’s Weekly

Issue #613 (Jan. 23, 2024)

#613 – JANUARY 23, 2024

Python Packaging, One Year Later: A Look Back at 2023

This is a follow-on post to Chris’s article from last year called Fourteen tools at least twelve too many. “Are there still fourteen tools, or are there even more? Has Python packaging improved in a year?”
CHRIS WARRICK

Running Python on Air-Gapped Systems

This post describes running Python code on a “soft” air-gapped system, one without direct internet access. Installing packages in a clean environment and moving them to the air-gapped machine has challenges. Read Ibrahim’s take on how he solved the problem.
IBRAHIM AHMED

Elevate Your Web Development with MongoDB’s Full Stack FastAPI App Generator

Get ready to elevate your web development process with the newly released Full Stack FastAPI App Generator by MongoDB, offering a simplified setup process for building modern full-stack web applications with FastAPI and MongoDB →
MONGODB sponsor

Add Logging and Notification Messages to Flask Web Projects

After you implement the main functionality of a web project, it’s good to understand how your users interact with your app and where they may run into errors. In this tutorial, you’ll enhance your Flask project by creating error pages and logging messages.
REAL PYTHON

Discussions

PEP 736: Shorthand Syntax for Keyword Arguments

PYTHON.ORG

Python Jobs

Python Tutorial Editor (Anywhere)

Real Python


Articles & Tutorials

Bias, Toxicity, and Truthfulness in LLMs With Python

How can you measure the quality of a large language model? What tools can measure bias, toxicity, and truthfulness levels in a model using Python? This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, returns to discuss techniques and tools for evaluating LLMs With Python.
REAL PYTHON podcast

Postgres vs. DynamoDB: Which Database to Choose

This article presents various aspects you need to consider when choosing a database for your project - querying, performance, ORMs, migrations, etc. It shows how things are approached differently for Postgres vs. DynamoDB and includes examples in Python.
JAN GIACOMELLI • Shared by Jan Giacomelli

Building with Temporal Cloud Webinar Series

Hear from our technical team on how we’ve built Temporal Cloud to deliver world-class latency, performance, and availability for the smallest and largest workloads. Whether you’re using Temporal Cloud or self-hosting, this series will be full of insights into how to optimize your Temporal Service →
TEMPORAL sponsor

Python App Development: In-Depth Guide for Product Owners

“As with every technology stack, Python has its advantages and limitations. The key to success is to use Python at the right time and in the right place.” This guide talks about what a product owner needs to know to take on a Python project.
PAVLO PYLYPENKO • Shared by Alina

HTTP Requests With Python’s `urllib.request`

In this video course, you’ll explore how to make HTTP requests using Python’s handy built-in module, urllib.request. You’ll try out examples and go over common errors, all while learning more about HTTP requests and Python in general.
REAL PYTHON course

Beware of Misleading GPU vs CPU Benchmarks

Nvidia has created GPU-based replacements for NumPy and other tools and promises significant speed-ups, but the comparison may not be accurate. Read on to learn if GPU replacements for CPU-based libraries are really that much faster.
ITAMAR TURNER-TRAURING

Django Migration Files: Automatic Clean-Up

Your Django migrations are piling up in your repo? You want to clean them up without a hassle? Check out this new package django-migration-zero that helps make migration management a piece of cake!
RONNY VEDRILLA • Shared by Sarah Boyce

Understanding NumPy’s `ndarray`

To understand NumPy, you need to understand the ndarray type. This article starts with Python’s native lists and shows you when you need to move to NumPy’s ndarray data type.
STEPHEN GRUPPETTA • Shared by Stephen Gruppetta

Type Information for Faster Python C Extensions

PyPy is an alternative implementation of Python, and its C API compatibility layer has some performance issues. This article describes on-going work to improve its performance.
MAX BERNSTEIN

Fastest Way to Read Excel in Python

It’s not uncommon to find yourself reading Excel in Python. This article compares several ways to read Excel from Python and how they perform.
HAKI BENITA

How Are Requests Processed in Flask?

This article provides an in-depth walkthrough of how requests are processed in a Flask application.
TESTDRIVEN.IO • Shared by Michael Herman

Projects & Code

harlequin: The SQL IDE for Your Terminal

GITHUB.COM/TCONBEER

AnyText: Multilingual Visual Text Generation and Editing

GITHUB.COM/TYXSSPA

Websocket CLI Testing Interface

GITHUB.COM/LEWOUDAR • Shared by Kevin Tewouda

Autometrics-py: Metrics to Debug in Production

GITHUB.COM/AUTOMETRICS-DEV • Shared by Adelaide Telezhnikova

django-cte: Common Table Expressions (CTE) for Django

GITHUB.COM/DIMAGI

Events

Weekly Real Python Office Hours Q&A (Virtual)

January 24, 2024
REALPYTHON.COM

SPb Python Drinkup

January 25, 2024
MEETUP.COM

PyLadies Amsterdam: An Introduction to Conformal Prediction

January 25, 2024
MEETUP.COM

PyDelhi User Group Meetup

January 27, 2024
MEETUP.COM

PythOnRio Meetup

January 27, 2024
PYTHON.ORG.BR

Happy Pythoning!
This was PyCoder’s Weekly Issue #613.


January 23, 2024 07:30 PM UTC

TechBeamers Python

Python Map vs List Comprehension – The Difference Between the Two

In this tutorial, we’ll explain the difference between Python map vs list comprehension. Both map and list comprehensions are powerful tools in Python for applying functions to each element of a sequence. However, they have different strengths and weaknesses, making them suitable for different situations. Here’s a breakdown: What is the Difference Between the Python […]

The post Python Map vs List Comprehension – The Difference Between the Two appeared first on TechBeamers.
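To make the comparison concrete, here is a small sketch (not taken from the TechBeamers tutorial) that computes the same result both ways; the variable names are arbitrary.

```python
numbers = [1, 2, 3, 4]

# map() applies a function lazily and returns an iterator.
squares_map = list(map(lambda n: n * n, numbers))

# A list comprehension builds the list directly and can filter inline.
squares_comp = [n * n for n in numbers]
even_squares = [n * n for n in numbers if n % 2 == 0]

print(squares_map, squares_comp, even_squares)
```

Both forms yield the same squares; the comprehension additionally accepts an inline condition, whereas `map()` would need a separate `filter()` call for that.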

January 23, 2024 06:04 PM UTC

Real Python

Python Basics: Lists and Tuples

Python lists are similar to real-life lists. You can use them to store and organize a collection of objects, which can be of any data type. Instead of just storing one item, a list can hold multiple items while allowing manipulation and retrieval of those items. Because lists are mutable, you can think of them as being written in pencil. In other words, you can make changes.

Tuples, on the other hand, are written in ink. They’re similar to lists in that they can hold multiple items, but unlike lists, tuples are immutable, meaning you can’t modify them after you’ve created them.
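The pencil and ink analogy can be checked directly. This is a minimal sketch with made-up data, not an excerpt from the course:

```python
shopping = ["eggs", "milk"]
shopping[0] = "bread"      # lists can be changed in place
shopping.append("tea")

point = (3, 4)
try:
    point[0] = 10          # tuples reject item assignment
except TypeError as exc:
    error = str(exc)

print(shopping)
print(error)
```

The list ends up modified, while the attempted tuple assignment raises a TypeError and leaves the tuple unchanged.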

In this video course, you’ll learn:

What lists and tuples are and how they’re structured
How lists and tuples differ from other data structures
How to define and manipulate lists and tuples in your Python code

By the end of this course, you’ll have a solid understanding of Python lists and tuples, and you’ll be able to use them effectively in your own programming projects.

This video course is part of the Python Basics series, which accompanies Python Basics: A Practical Introduction to Python 3. You can also check out the other Python Basics courses.

Note that you’ll be using IDLE to interact with Python throughout this course.


January 23, 2024 02:00 PM UTC

Python Bytes

#368 That episode where we just ship open source

Topics covered in this episode:

Syntax Error #11: Debugging Python (https://www.syntaxerror.tech/syntax-error-11-debugging-python/)
umami (https://umami.is) and umami-analytics (https://pypi.org/project/umami-analytics/)
pytest-suite-timeout (https://github.com/okken/pytest-suite-timeout)
Listmonk (https://listmonk.app) and (py) listmonk (https://pypi.org/project/listmonk/)
Extras
Joke

Watch on YouTube: https://www.youtube.com/watch?v=Tac5MS__IBA

About the show

Sponsored by us! Support our work through:

Our courses at Talk Python Training (https://training.talkpython.fm/)
The Complete pytest Course (https://courses.pythontest.com/p/the-complete-pytest-course)
Patreon Supporters (https://www.patreon.com/pythonbytes)

Connect with the hosts

Michael: @mkennedy@fosstodon.org
Brian: @brianokken@fosstodon.org
Show: @pythonbytes@fosstodon.org

Join us on YouTube at pythonbytes.fm/live (https://pythonbytes.fm/stream/live) to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too.

Brian #1: Syntax Error #11: Debugging Python

Juhis
Issue 11 of a fun debugging newsletter from Juhis
Debugging advice
  mindset
    take a break
    adopt a process
    talk to a duck
  tools & techniques
    print
    snoop
    debuggers
    Django debug toolbar & Kolo for VS Code

Michael #2: umami and umami-analytics

Umami makes it easy to collect, analyze, and understand your web data while maintaining visitor privacy and data ownership.
umami-analytics is a client for the privacy-preserving, open source Umami analytics platform (https://umami.is/), based on httpx and pydantic.
Core features
➕ Add a custom event to your Umami analytics dashboard.
🌐 List all websites with details that you have registered at Umami.
🔀 Both sync and async programming models.
⚒️ Structured data with Pydantic models for API responses.
👩‍💻 Login / authenticate for either a self-hosted or SaaS hosted instance of Umami.
🥇 Set a default website for a simplified API going forward.

Brian #3: pytest-suite-timeout

While recording Python Test 213 : Repeating Tests (https://podcast.pythontest.com/episodes/213-repeating-tests)
  I noted that pytest-repeat doesn’t have a timeout, but pytest-flakefinder does.
  And perhaps I should add a timeout to pytest-repeat
But also, maybe there are other places I’d like a timeout, not just with repeat, but often with other parametrizations and even parametrize matrices.
So, pytest-suite-timeout is born
But why not pytest-timeout? asks Mike Fiedler (https://hachyderm.io/@miketheman/111799555975904630)
  timeout only applies timeouts per test, and it isn’t always graceful
  suite-timeout is for the full suite, and only times out between tests.
  so, you could use both

Michael #4: Listmonk and (py) listmonk

Listmonk (https://listmonk.app)
  Self-hosted newsletter and mailing list manager (think mailchimp)
  Built on Go and Vue
  Backed by a company charging for this service as SaaS
  Still requires a mail infrastructure backend (I’m using Sendgrid, https://sendgrid.com)
listmonk (on PyPI)
  API Client for Python
  Created by Yours Truly
  I tried 4 other options first, they were all bad in their own way.
  Features:
  ➕ Add a subscriber to your subscribed users.
  🙎 Get subscriber details by email, ID, UUID, and more.
  📝 Modify subscriber details (including custom attribute collection).
  🔍 Search your users based on app and custom attributes.
  🏥 Check the health and connectivity of your instance.
  👥 Retrieve your segmentation lists, list details, and subscribers.
  🙅 Unsubscribe and block users who don't want to be contacted further.
  💥 Completely delete a subscriber from your instance.
  📧 Send transactional email with template data (e.g. password reset emails).
These pair well in my new docker (https://www.docker.com) cluster infrastructure
  Calls to the API from a client app (e.g. Talk Python Training, https://training.talkpython.fm) are basically loopback on the local docker bridge network.

Extras

Michael:

Every github repo that has “releases” has a releases RSS feed, e.g. Umami (https://github.com/umami-software/umami/releases.atom)
Kolo Django + VS Code (https://kolo.app)
Warp Terminal on linux (https://www.warp.dev/linux-terminal)
bpytop and btop - live server monitoring (https://fosstodon.org/@mkennedy/111787125592445700)

Joke: The cloud, visualized (https://infosec.exchange/@jbhall56/111178034352233910)

January 23, 2024 08:00 AM UTC

Glyph Lefkowitz

Your Text Editor (Probably) Isn’t Malware Any More

In 2015, I wrote one of my more popular blog posts, “Your Text Editor Is Malware”, about the sorry state of security in text editors in general, but particularly in Emacs and Vim.

It’s nearly been a decade now, so I thought I’d take a moment to survey the world of editor plugins and see where we are today. Mostly, this is to allay fears, since (in today’s landscape) that post is unreasonably alarmist and inaccurate, but people are still reading it.

Problem	Is It Fixed?
`vim.org` is not available via `https`	Yep! `http://www.vim.org/` redirects to `https://www.vim.org/` now.
Emacs's HTTP client doesn't verify certificates by default	Mostly! The documentation is incorrect and there are some UI problems¹, but it doesn’t blindly connect insecurely.
ELPA and MELPA supply plaintext-HTTP package sources	Kinda. MELPA correctly responds to HTTP only with redirects to HTTPS, and ELPA at least offers HTTPS and uses HTTPS URLs exclusively in the default configuration.
You have to ship your own trust roots for Emacs.	Fixed! The default installation of Emacs on every platform I tried (including Windows) seems to be providing trust roots.
MELPA offers to install code off of a wiki.	Yes. Wiki packages were disabled entirely in 2018.

The big takeaway here is that the main issue of there being no security whatsoever on Emacs and Vim package installation and update has been fully corrected.

Where To Go Next?

Since I believe that post was fairly influential, in particular in getting MELPA to tighten up its security, let me take another big swing at a call to action here.

More modern editors have made greater strides towards security. VSCode, for example, has enabled the Chromium sandbox and added some level of process separation. Emacs has not done much here yet, but over the years it has consistently surprised me with its ability to catch up to its more modern competitors, so I hope it will surprise me here as well.

Even for VSCode, though, this sandbox still seems pretty permissive — plugins still seem to execute with the full trust of the editor itself — but it's a big step in the right direction. This is a much bigger task than just turning on HTTPS, but I really hope that editors start taking the threat of rogue editor packages seriously before attackers do, and finding ways to sandbox and limit the potential damage from third-party plugins, maybe taking a cue from other tools.

Acknowledgments

¹ The documentation still says “gnutls-verify-error” defaults to nil, which means no certificate verification, and maybe that is what happens if you are using raw TLS connections, but in practice, url-retrieve-synchronously does appear to present an interactive warning before proceeding if the certificate is invalid or expired. It still has yet to catch up with web browsers from 2016, in that it just asks you “do you want to do this horribly dangerous thing? y/n”, but that is a million times better than proceeding without user interaction.

January 23, 2024 02:05 AM UTC

Seth Michael Larson

Removing maintainers from open source projects

Here's a tough but common situation for open source maintainers:

You want a project you co-maintain to be more secure by reducing the attack surface.
There are one or more folks in privileged roles who previously were active contributors, but now aren't active.
You don't want to take away from or upset the folks who have contributed to the project before you.

These three points feel like they're in contention. This article is here to help resolve this contention and potentially spur some thinking about succession for open source projects.

Why do people do open source?

Most rewards that come from contributing to open source are either intrinsic (helping others, learning new skills, interest in a topic, improve the world) or for recognition (better access to jobs, proof of a skill-set, “fame” from a popular project). Most folks don't get paid to work on open source for their first project, so it's unlikely to be their initial motivation.

Recognition is typically what feels “at stake” when removing a previous maintainer from operational roles on an open source project.

Let's split recognition into another two categories: operational and celebratory. Operational recognition is the category of recognition that has security implications, like access to sensitive information or publishing rights. Celebratory recognition has no security implications; it's there because we want to thank contributors for the work they've done for the project. Here are some examples of the two categories:

Operational:

Additional access on source control like GitHub (“commit bit”)
Additional access on package repository like PyPI
Listing email addresses for security contacts

Celebratory:

Author and maintainer annotation in package metadata
Elevating contributors into a triager role
Maintainer names listed in the README
Thanking contributors in release notes
Guest blog posts about the project

You'll notice that the celebratory recognition might be a good candidate for offsetting the removal of incidental operational recognition (like your account being listed on PyPI).

Suggestions for removing maintainers with empathy

Ensure the removal of operational recognition is supplanted by deliberate celebratory recognition. Consider thanking the removed individual publicly in a blog post, release notes, or social media for their contributions and accomplishments. If there isn't already a permanent place to celebrate past maintainers consider adding a section to the documentation or README.

Don't take action until you've reached out to the individual. Having your access removed without any acknowledgement feels bad and there's no way around that fact. Even if you don't receive a reply, sending a message and waiting some time should be a bare minimum.

Practice regular deliberate celebratory recognition. Thank folks for their contributions, call them out by name in release notes, list active and historical maintainers in the documentation. This fulfills folks that are motivated by recognition and might inspire them to contribute again.

Think more actively about succession. In one of the many potential positive outcomes for an open source project, you will be succeeded by other maintainers and someone else may one day be in the position that you are in today.

How can you prepare that individual to have a better experience than you are having right now? I highly recommend Sumana Harihareswara's writing on this topic. There are tips like:

Actively recruit maintainers by growing and promoting contributors.
Talk about succession openly while you are still active on the project.
Give privileges or responsibility to folks that repeatedly contribute positively, starting from triaging or reviewing code.
Recognize when you are drifting away from a project and make it known to others, even if you intend to contribute in the future.


This work is licensed under CC BY-SA 4.0

January 23, 2024 12:00 AM UTC

January 22, 2024

Python Morsels

None in Python

Python's None value is used to represent emptiness. None is the default function return value.

Table of contents

Python's `None` value

Python has a special object that's typically used for representing emptiness. It's called None.

If we look at None from the Python REPL, we'll see nothing at all:

>>> name = None
>>> name
>>>

Though if we print it, we'll see None:

>>> name = None
>>> name
>>> print(name)
None

When checking for None values, you'll usually see Python's is operator used (for identity) instead of the equality operator (==):

>>> name is None
True
>>> name == None
True

Why is that?

Well, None has its own special type, the NoneType, and it's the only object of that type:

>>> type(None)
<class 'NoneType'>

In fact, if we get a reference to that NoneType class and then call that class to make a new instance of it, we'll actually get back the same exact instance every time we call it:

>>> NoneType = type(None)
>>> NoneType() is None
True

The NoneType class is a singleton class. So comparing to None with is works because there's only one None value. No object should compare as equal to None unless it is None.
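The reason for the word "should" is that equality can be overridden while identity cannot. The sketch below uses a deliberately misbehaving class, `AlwaysEqual`, which is a made-up example and not part of the article:

```python
class AlwaysEqual:
    """A badly behaved class whose instances claim to equal everything."""

    def __eq__(self, other):
        return True


thing = AlwaysEqual()

print(thing == None)  # equality can be overridden, so this is True
print(thing is None)  # identity cannot be fooled, so this is False
```

Because `is` compares object identity, it gives the right answer even for objects with an unusual `__eq__` method.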

`None` is falsey

We often rely on the …

Read the full article: https://www.pythonmorsels.com/none/

January 22, 2024 11:00 PM UTC


Advanced Number Formatting with Python f-strings

Lambdas and Inline Functions in Python f-strings

Debugging with f-strings in Python 3.8+

The `find()` Method

The `%` Operator

The `str.format()` Method