PostgreSQL
The world's most advanced open source database
Posted by Denish Patel on 2016-12-26 at 18:49
Merry Christmas!! If you are using SQL Server or Oracle databases, I’m giving away one hour of my time for the rest of this week, free of cost, to discuss how you can save $$$$ by migrating them to PostgreSQL while keeping the same features and achieving better performance. You can shoot an email to [email protected] Happy … Continue reading Christmas Gift!

It is a bit too late for Christmas stories, so here is a Postgres story instead, about the following commit of Postgres 10:

commit: 6ef2eba3f57f17960b7cd4958e18aa79e357de2f
author: Andres Freund <[email protected]>
date: Thu, 22 Dec 2016 11:31:50 -0800
Skip checkpoints, archiving on idle systems.

Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/[email protected]
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: 

Postgres 9.0 was the first release to introduce the parameter wal_level, at the time with three different values (a quick way to check the level in use is shown after the list):

  • “minimal”, to get just enough information WAL-logged to recover from a crash.
  • “archive”, to be able to recover from a base backup and WAL archives.
  • “hot_standby”, to be able to run a standby node, which results in information about exclusive locks and currently running transactions being WAL-logged, the so-called standby snapshots.
  • “logical”, introduced in 9.4, to work with logical decoding.
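For reference, the level a server is running with can be checked from any session, and changing it requires a restart; a minimal sketch, not specific to the commit above:

-- See which WAL level the server currently runs with.
SHOW wal_level;

-- On 9.6 or later, raise it for archiving/standbys (takes effect only after a restart).
ALTER SYSTEM SET wal_level = 'replica';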

In 9.6, “archive” and “hot_standby” were merged into “replica”, as both levels have no difference in

[...]
Posted by Pavel Stehule on 2016-12-23 at 19:27
I had a customer with strange issues with OpenERP - the main problem was the long life of OpenERP/PostgreSQL sessions. The fix was not hard - use pgBouncer in transaction mode with a short server lifetime (ten minutes).
Posted by Scott Mead in OpenSCG on 2016-12-22 at 22:17

I spend a lot of time trying to learn what’s already been implemented; as DBAs, we tend to live in that world. It’s important that you have tools that allow you to quickly get the visual so you can get on with your job. One of the biggest ‘reverse engineering’ tasks that DBAs have to do is in the area of data modeling. There are a lot of tools that do this, but my favorite of late is SchemaSpy (schemapy.sf.net). I’ve found that the output is complete and the ERDs are easy to read.

java -jar schemaSpy_5.0.0.jar -t pgsql -host localhost:5432 -db postgres \
-dp ~/software/pg/drivers/jdbc/postgresql-9.4.1208.jre6.jar \
-o ERD -u postgres -s myschema

SchemaSpy is nice because it can talk to many different types of databases… a side effect of this is a fairly complex set of commandline switches.

-t specifies the type, -t pgsql specifies postgres
-o specifies a directory to create the output in

All you need in order to run this is the SchemaSpy jar and a PostgreSQL JDBC driver. The output lands in the directory given with -o, ‘ERD’ in this case. Pulling up ERD/index.html gives you a nice page with a table list and some basics.

SchemaSpyDash

As a DBA, I really love the ‘Insertion Order’ and ‘Deletion Order’ links here. SchemaSpy reverse engineers the referential integrity chain for me! Clicking either one gives me a simple page, top to bottom with the right order!

InsertionOrder

Now for the real reason that I super-love SchemaSpy: the ‘Relationships’ tab. I’ve loaded up a pgbench schema. pgbench doesn’t actually create any real keys, but column names are consistent. SchemaSpy notices this and ‘infers’ relationships for me! This is huge: even without explicit keys, I can begin to infer what the developer intended (the estimated rowcount is helpful too 🙂).

Relationships

I won’t force you to follow me through all of the tabs. If you’re looking for a schema visualization tool, give SchemaSpy a try.

Happy PostgreSQL-ing!

In this blog I’ll show you how to deploy a PostgreSQL cluster using the latest Docker 1.12 technology. Updates to Docker in their 1.12 release greatly simplify deploying a PostgreSQL cluster. IP addresses and hostnames used in this blog are just examples and not mandated.

Relational databases are the first choice of data store for many applications due to their enormous flexibility and reliability. Historically, the one knock against relational databases has been that they can only run on a single machine, which creates inherent limitations when data storage needs outpace server improvements. The solution to rapidly scaling databases is to distribute them, but this creates a performance problem of its own: relational operations such as joins then need to cross the network boundary. Co-location is the practice of dividing data tactically, keeping related information on the same machines to enable efficient relational operations, while taking advantage of horizontal scalability for the whole dataset.

The principle of data co-location is that all tables in the database have a common distribution column and are sharded across machines in the same way such that rows with the same distribution column value are always on the same machine, even across different tables. As long as the distribution column provides a meaningful grouping of data, relational operations can be performed within the groups.
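Concretely, co-locating two tables in Citus looks something like the sketch below. It assumes the create_distributed_table UDF that ships with recent Citus versions; the table and column names are invented for illustration.

-- Shard both tables on the same column; rows sharing a company_id
-- end up on the same worker node, across both tables.
SELECT create_distributed_table('companies', 'company_id');
SELECT create_distributed_table('campaigns', 'company_id');

-- A join on the distribution column can then be executed shard-by-shard,
-- without shuffling rows across the network.
SELECT c.name, count(*)
  FROM companies c
  JOIN campaigns a USING (company_id)
 GROUP BY c.name;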

In a previous article we described how to shard a multi-tenant application with co-location as the underlying technique. In this article, we zoom into how data co-location works in Citus and the benefits it provides.

Data co-location in Citus for hash-distributed tables

The Citus extension for PostgreSQL is unique in being able to form a distributed database of databases. Every node in a Citus cluster is a fully functional PostgreSQL database and Citus adds the experience of a single homogenous database on top. While it does not provide the full functionality of PostgreSQL in a distributed way, in many cases it can take full advantage of features offered by PostgreSQL on a single machine through co-location, including full SQL support, transactions and foreign keys.

In Citus a row is stored in a shard if the hash of the value in the distribution column falls within the shard’s hash range. T

[...]
Posted by Oskari Saarenmaa on 2016-12-21 at 10:42
We're happy to announce that Aiven PostgreSQL now has support for connection pools. Connection pooling allows you to maintain very large numbers of connections to a database while keeping the server resource usage low.

Aiven PostgreSQL connection pooling utilizes PGBouncer for managing the database connections, and each pool can handle up to 5000 database client connections. Unlike when connecting directly to the PostgreSQL server, each client connection does not require a separate backend process on the server. PGBouncer automatically interleaves the client queries and only uses a limited number of actual backend connections, leading to lower resource usage on the server and better total performance.

Why connection pooling?


Eventually a high number of backend connections becomes a problem with PostgreSQL as the resource cost per connection is quite high due to the way PostgreSQL manages client connections. PostgreSQL creates a separate backend process for each connection and the unnecessary memory usage caused by the processes will start hurting the total throughput of the system at some point. Also, if each connection is very active, the performance can be affected by the high number of parallel executing tasks.

It makes sense to have enough connections so that each CPU core on the server has something to do (each connection can only utilize a single CPU core [1]), but a hundred connections per CPU core may be too much. All this is workload specific, but often a good number of connections to have is in the ballpark of 3-5 times the CPU core count.

[1] PostgreSQL 9.6 introduced limited parallelization support for running queries in parallel on multiple CPU cores.
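To get a feeling for how many backends a server is currently holding, and how many of them are actually doing work, a quick look at pg_stat_activity helps (a minimal sketch):

-- Count backends per state: active, idle, idle in transaction, ...
SELECT state, count(*)
  FROM pg_stat_activity
 GROUP BY state
 ORDER BY count(*) DESC;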

Without a connection pooler the database connections are handled directly by PostgreSQL backend processes, one process per connection:





Adding a PGBouncer pooler that utilizes fewer backend connections frees up server resources for more important uses, such as disk caching:




Many frameworks and libraries (ORMs, Django, Rails, etc.) support c[...]

Do you know situations with unstable runtimes? You have a query which is sometimes slow and sometimes just lightning fast. The question is usually: Why? Maybe this little post can be inspiring and shed some light on some of your performance issues, which might bug you in your daily life. Reasons for unstable runtimes There […]

The post Detecting unstable runtimes in PostgreSQL appeared first on Cybertec - The PostgreSQL Database Company.

In 9.5 pg_rewind was introduced; with this feature it is possible to make a server that is no longer a master follow a promoted standby that has a new timeline.
There are at least a few use cases that could benefit from this feature:

Testing failovers: It's pretty common for one of our customers to ask for a failover simulation just to make sure that everything works. By using pg_rewind this is much easier and much faster, because we don't have to re-sync the whole database, which is sometimes multi-TB in size.

Minor version upgrades: during an upgrade you have to put the new binaries in place and restart the database. If the application is flexible enough, you could upgrade the slave, switch over, upgrade the master and pg_rewind the old master to follow the new slave, minimising the downtime even further.

Requirements.

pg_rewind has a couple of requirements (a quick way to verify them follows the list):

  1. data page checksums enabled (initdb -k), or
  2. the parameter wal_log_hints enabled
  3. the old master (the one we want to re-sync) shut down cleanly, otherwise you'll get:

target server must be shut down cleanly
Failure, exiting
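Points 1 and 2 can be verified from psql before attempting a rewind; a small sketch using stock parameters only:

-- At least one of the two must report "on" on the old master:
SHOW data_checksums;   -- on if the cluster was initialised with initdb -k
SHOW wal_log_hints;    -- on if enabled in postgresql.conf (needs a restart)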

How does it work?

pg_rewind scans the old master’s data directory, finds the data blocks changed during the switchover and then copies only the necessary blocks from the promoted slave. Keep in mind that the configuration files will also be copied from the new master, so check for misconfigurations before you start it! You also have to have all the WAL files since the last checkpoint from the old master. Changes are identified by comparing the state of the data blocks present in the data directory with the changes logged in the WAL files. Once the differing blocks are identified, the WALs are replayed.
For more info refer to the documentation here

How to do it

Recently I've posted about how to use Docker to set up a lab environment; the link can be found here. I'm going to use two containers for this blog post, using the image created from the Dockerfile that I have.
[...]

Postgres 10 has support for an additional feature related to replication slots, a facility that retains WAL data in pg_xlog depending on what the clients connected to a slot have consumed. The feature spoken about here has been implemented by the following commit:

Add support for temporary replication slots

This allows creating temporary replication slots that are removed
automatically at the end of the session or on error.

From: Petr Jelinek <[email protected]>

As the commit log already mentions, replication slots now have the possibility to be made temporary. When created, they are associated with the connection that created them, and once the connection session finishes, the slots are automatically dropped. Note as well that these slots keep no persistent data on disk, so on a crash they are of course dropped too (it’s not like Postgres crashes a lot anyway, per its reputation for stability).
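At the SQL level this translates into an extra flag on the slot creation functions; a hedged sketch, assuming the signatures as they appear in the released Postgres 10:

-- The third argument marks the physical slot as temporary.
SELECT pg_create_physical_replication_slot('tmp_slot', true, true);

-- The slot is flagged as such in the catalog view...
SELECT slot_name, temporary FROM pg_replication_slots;

-- ...and is dropped automatically once this session disconnects.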

One direct application of this feature is the case where a client needs WAL segments to be retained, absolutely needs that data to be consistent with what it has done, and only needs the slot for a one-shot operation. In existing Postgres versions, any application can create a slot, be it via SQL or the replication protocol, but any failure leaves cleanup work behind, and if that cleanup does not happen WAL is retained indefinitely; in the end that would bring the server down once the partition holding pg_xlog gets full.

Once this commit was done, a patch was sent to make pg_basebackup use temporary slots. This is a perfect match for temporary replication slots, as many users already use the stream mode of pg_basebackup to fetch the WAL segments over a secondary replication connection, to be sure that the segments needed for a full, consistent backup are included and that the backup can be used as-is when restoring an instance.

The main risk of using pg_basebackup without replication slots is to not be able to take a consistent backup as WAL segments may have been alr

[...]

Postgres is great, but it can’t run itself in all cases. Things come up. Queries go awry. Hardware fails, and users leave transactions open for interminable lengths of time. What happens if one of these things occurs while the DBA themselves has a hardware fault? While they’re down for maintenance, someone still has to keep an eye on things. For the last PG Phriday of the year completely unrelated to my upcoming surgery, let’s talk about what happens when your DBA becomes inoperative due to medical complications.

This is Fake Postgres DBA 101!

Getting Around

When in doubt, SSH is the name of the game. Database accounts are either locked-down or lack sufficient permissions to do everything, but usually the postgres user itself has total access. Invoke your favorite SSH client to connect to the host running the database in question, and use sudo to become the postgres user:

ssh my-db-host
sudo su -l postgres

Afterwards, it’s a good idea to add our public key to the postgres user’s .ssh/authorized_keys file so we can log in as the postgres user without the intermediate sudo. Here’s a good guide for doing that. If there’s configuration management like salt or Puppet involved, that file is probably part of the stack and needs to be modified there or it’ll be overwritten.

Either way, we’re in. If we’re lucky enough to be on a Debian derivative like Ubuntu or Mint, we can use the pg_lsclusters command to see which database instances are running on this server. Here’s what that looks like on a sample VM:

pg_lsclusters 
 
Ver Cluster Port Status Owner    Data directory          Log file
9.5 main    5432 online postgres /data/pgsql/main/data   /var/log/postgresql/postgresql-9.5-main.log
9.6 96test  5440 online postgres /data/pgsql/96test/data /var/log/postgresql/postgresql-9.6-96test.log

The next thing we need is the psql command-line client—the nexus of Postgres interoperability. Postgres instances monitor the listed port for connection attempts, so if we wanted to target a specific instance, we’d want to

[...]

If you are using transaction log shipping (or Point-In-Time-Recovery) in PostgreSQL to keep your data safe, you will surely already know about pg_basebackup. The idea behind pg_basebackup is to allow users to create a binary copy of the data, which can serve as the basis for Point-In-Time-Recovery.  However, recently we have seen some support cases, […]

The post pg_basebackup: Creating self-sufficient backups appeared first on Cybertec - The PostgreSQL Database Company.

There is a nice lie out there. A lot of people want to believe it. They think by believing this lie it will somehow increase something for them. In some ways that is true. If you want what you are doing to be about you. If you are a believer in Open Source, it isn't about you. It is about the community and bettering that community as a whole.

If we provide the videos of our sessions for free, you won't attend the conference.

The PgConf US conference grows every year and guess what, they provide their videos for free.

If you pay for the conference, we will provide the videos for free.

LinuxFest Northwest, which is larger than PgConf US, PgConf EU, Postgres Vision and Char(16) combined, also grows every year and provides its videos for free. It is also free to attend.

The next time someone says, "You won't show if we give you videos" or "Maybe if you paid to come", I recommend you attend an Open Source conference that follows Open Source values. 

That isn't to say that it isn't a conference's right to not give free videos. Of course it is. It is also your right to only attend conferences about Open Source that embrace what Open Source is about. There is a reason I only speak at non-profit or community run conferences. I want to contribute to the greater good. That is also the reason why Command Prompt only sponsors community run or non-profit conferences.

That is also not to say that for-profit conferences about Open Source are bad. They aren't and more power to you if you can find a way to make money off a conference about Open Source. It is to say that Open Source conferences should follow Open Source values. If you are going to record and you are an Open Source conference, give the recording away.  

Merry Christmas & Tally Ho!

Citus is a distributed database that extends (not forks) PostgreSQL. Citus does this by transparently sharding database tables across the cluster and replicating those shards.

After open sourcing Citus, one question that we frequently heard from users related to how Citus replicated data and automated node failovers. In this blog post, we intend to cover the two replication models available in Citus: statement-based and streaming replication. We also plan to describe how these models evolved over time for different use cases.

Going back to Citus’ first version, one of the early use cases that we looked at was analyzing large volumes of event data in real-time. This data has the nice properties that it is append-only and has a natural time dimension. Users could therefore batch those events together and load them into the distributed cluster.

These properties of event data made parallel loads relatively easy, without sacrificing consistency semantics. A coordinator node would keep metadata related to shards and shard placements (replicas) in the cluster. Clients would then talk to the coordinator node and exchange metadata on which shards to append events data to. Once a client appended related events data to the related shards, the client would conclude the operation by updating the shard metadata on the coordinator node.

Citus cluster ingesting append-only events data from files

The simplified diagram above shows an example data load. The client tells the coordinator node that it wants to append events data to an append-distributed table. The coordinator node grants the client information about shard 6’s placements. The client then copies these events to the shard’s placements and updates the coordinator with related metadata. If the client fails to copy events to one of the nodes, it can either mark the related shard placement as invalid or abort the copy operation.

There are three important points to highlight in this example.

  1. The method above is known as statement-based replication and it provides automated failover in a relatively simple way
[...]
I recently came across a discussion in the #postgresql channel about how to add a new column that will represent record insert timestamps to a table, with minimal locking.

Normally you would do something like this:

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone default now();
ALTER TABLE
Time: 13643.983 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1616.108 ms

This, though, would require an AccessExclusiveLock on the table, and the application would have to wait for the duration of the ALTER.
What you could do instead is:

monkey=# alter table alter_test add column insert_timestamp timestamp without time zone;
ALTER TABLE
Time: 1.135 ms
monkey=# alter table alter_test alter column insert_timestamp set default now();
ALTER TABLE
Time: 1.238 ms
monkey=# create index concurrently insert_timestamp_idx on alter_test (insert_timestamp) where insert_timestamp is not null;
CREATE INDEX
Time: 1618.788 ms

As you can see the timings are completely different: in the first case the database has to change the data in all pre-existing rows, which isn't really needed since the backfilled value is a dummy anyway (it just defaults to the transaction time of the ALTER).
The second way also requires an AccessExclusiveLock, but it won't have to change any data (the old rows are simply left NULL), so the lock should be almost instant (assuming no blockers exist).
In the next step you alter the table to add the default value, and from then on the new column will get the default. If you don't want to have NULLs, or you want to add a NOT NULL constraint to the column, you can update the old (NULL) rows in a more progressive and non-intrusive way later on, as sketched below.
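For that later backfill, a batched update keeps each lock short; a rough sketch, assuming the table has a primary key column id (hypothetical here) and using now() as a placeholder value:

-- Repeat until UPDATE 0 is reported, committing between batches,
-- so only a small number of rows is locked at any time.
UPDATE alter_test
   SET insert_timestamp = now()
 WHERE id IN (SELECT id
                FROM alter_test
               WHERE insert_timestamp IS NULL
               LIMIT 10000);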

Thanks for reading
- Vasilis Ventirozos
Posted by Simon Riggs in 2ndQuadrant on 2016-12-14 at 10:38

CHAR(16) Conference was a live conference in NYC held on 6 Dec 2016, focusing on Scalability for Business.

The Speaker Panel was recorded, so you can find out what happened here: https://youtu.be/eLjBoZr4dTc (1 hour)

Eric David Benari hosted an hour long “chat show” format, featuring

  • Mason Sharp
  • Petr Jelinek
  • Peter Eisentraut
  • Simon Riggs
  • Jonathan Katz

Other videos? No, sorry. Obviously, if you have to pay to attend but we post the videos for free, then fewer people will attend. So next time please come along to the conference and find out what’s happening, ask questions and enjoy chatting with the attendees.

Hope to see you next year.

When I talk to various people about PostgreSQL, based on where they are in the process of analyzing it, I get asked many similar questions. One of the most frequently asked questions is: What is the PostgreSQL licensing model? Here is my attempt to address this question from the perspective of what the PostgreSQL license means for your business.

Go Ahead, Read It – It’s Really Not That Scary

Let’s be honest with ourselves, we are really not in the habit of reading license agreements – they are best left to the legal department to decipher. And seriously, licenses have become so complex and so detailed that even if I try really really hard to read just a few clauses, my head starts to hurt.

The PostgreSQL License is different though.

I have reproduced the license in full at the end of this blog. Take a look. I guarantee it will take less than 5 minutes and you will understand every word of what it says. Straightforward, transparent, simple to understand – and this is just the beginning of how PostgreSQL makes your life easier.

Free To Use – Forever

Yup, you read that right. PostgreSQL is absolutely free to use, forever. All you need to do is download the database, install it, and start using it. You don’t need to get into any contract negotiations, you don’t need to worry about asking for permission from anyone, you don’t need to haggle over pricing.

By now you are probably thinking, “Yeah it’s free but that’s probably the limited edition. The features my organization needs are likely in the enterprise edition, and I would need to sell my kidney in order to get it.” You’d be wrong. There are no ‘editions’ of PostgreSQL. All features are available for you to use in the community version, which is free, forever.

Unlimited Usage – Forever

Now you may be thinking – “Ok, so the community probably imposes various different limits to force organizations to pay before they can really scale”. Wrong again!

The PostgreSQL community does not impose any limits to the number of servers you install the database on, the amount of

[...]
Posted by Keith Fiske in OmniTI on 2016-12-12 at 16:48

Since I have a passing interest in partitioning in PostgreSQL, I figured I’d check out a recent commit to the development branch of PostgreSQL 10:

Implement table partitioning – https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f0e44751d7175fa3394da2c8f85e3ceb3cdbfe63

Table partitioning is like table inheritance and reuses much of the
existing infrastructure, but there are some important differences.
The parent is called a partitioned table and is always empty; it may
not have indexes or non-inherited constraints, since those make no
sense for a relation with no data of its own.  The children are called
partitions and contain all of the actual data.  Each partition has an
implicit partitioning constraint.  Multiple inheritance is not
allowed, and partitioning and inheritance can't be mixed.  Partitions
can't have extra columns and may not allow nulls unless the parent
does.  Tuples inserted into the parent are automatically routed to the
correct partition, so tuple-routing ON INSERT triggers are not needed.
Tuple routing isn't yet supported for partitions which are foreign
tables, and it doesn't handle updates that cross partition boundaries.

Currently, tables can be range-partitioned or list-partitioned.  List
partitioning is limited to a single column, but range partitioning can
involve multiple columns.  A partitioning "column" can be an
expression.

Because table partitioning is less general than table inheritance, it
is hoped that it will be easier to reason about properties of
partitions, and therefore that this will serve as a better foundation
for a variety of possible optimizations, including query planner
optimizations.  The tuple routing based which this patch does based on
the implicit partitioning constraints is an example of this, but it
seems likely that many other useful optimizations are also possible.

Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat,
Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova,
Rushabh Lathia, Erik Rijkers, among others.  
[...]
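To give an idea of what the commit above enables, the new DDL looks roughly like this; a minimal sketch based on the Postgres 10 syntax, with invented table names:

-- The parent is a partitioned table and never stores rows itself.
CREATE TABLE measurement (
    city_id   int  NOT NULL,
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- Partitions hold the data; inserts into the parent are routed automatically.
CREATE TABLE measurement_2016_12 PARTITION OF measurement
    FOR VALUES FROM ('2016-12-01') TO ('2017-01-01');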
Posted by Kaarel Moppel in Cybertec on 2016-12-12 at 09:27

First – if the word “Bloom” doesn’t yet resonate with you in a Postgres context, I’d recommend reading my previous introductory post on the topic here. This one is a follow-up, with an aim to set up a simple demo use case to exemplify “fields of application” for Bloom and to also gauge pros/cons/performance […]
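As a teaser, getting a Bloom index in place only takes the contrib extension plus a CREATE INDEX; a minimal sketch with an invented table:

-- bloom ships as a contrib extension since 9.6.
CREATE EXTENSION IF NOT EXISTS bloom;
CREATE TABLE bloom_demo (a int, b int, c int);
CREATE INDEX bloom_demo_idx ON bloom_demo USING bloom (a, b, c);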

The post Trying out Postgres Bloom indexes appeared first on Cybertec - The PostgreSQL Database Company.

Posted by Paul Ramsey in PostGIS on 2016-12-11 at 16:09

PostgreSQL "Compatible" Aurora

While I know full well that Amazon’s marketing department doesn’t need my help, I cannot resist flagging this new development from the elves in Santa’s AWS workshop:

Today we are launching a preview of Amazon Aurora PostgreSQL-Compatible Edition. It offers … high durability, high availability, and the ability to quickly create and deploy read replicas. Here are some of the things you will love about it:

Performance – Aurora delivers up to 2x the performance of PostgreSQL running in traditional environments.

Compatibility – Aurora is fully compatible with the open source version of PostgreSQL (version 9.6.1). On the stored procedure side, we are planning to support Perl, pgSQL, Tcl, and JavaScript (via the V8 JavaScript engine). We are also planning to support all of the PostgreSQL features and extensions that are supported in Amazon RDS for PostgreSQL.

Cloud Native – Aurora takes full advantage of the fact that it is running within AWS.

The language Amazon uses around Aurora is really weird – they talk about “MySQL compatibility” and “PostgreSQL compatibility”. At an extreme, one might interpret that to mean that Aurora is a net-new database providing wire- and function-level compatibility with the target databases. However, in the PostgreSQL case, the fact that they are additionally supporting PostGIS, the server-side languages, really the whole database environment, hints strongly that most of the code is actually PostgreSQL code.

There is not a lot of reference material about what’s going on behind the scenes, but this talk from re:Invent shows that most of the action is in the storage layer. For MySQL, since storage back-ends are pluggable, it’s possible that AWS has added their own back-end. Alternately, they may be running a hacked up version of the InnoDB engine.

For PostgreSQL, with only one storage back-end, it’s pretty much a foregone conclusion that AWS have taken a fork and added some secret sauce to it. However, the fact that they are tracking the community version almost exactly (they curren

[...]
Posted by Chris Travers on 2016-12-11 at 14:51
Having been using various vcs solutions for a while, I think it is worth noting that my least favorite from a user experience perspective is git.  To be sure, git has better handling of whitespace-only merge conflicts and a few other major features than any other vcs that I have used.

And the data structure and model are solid.

But even understanding what is going on behind the scenes, I find git to be an unintuitive mess at the CLI.

Let me start by saying things that git got right:


  1. The dag structure is nice
  2. Recursive merges are good
  3. The data models and what goes on behind the scenes is solid.
  4. It is the most full-featured vcs I have worked with
However, I still have real complaints with the software.  These include fundamentally different concepts merged into the same label, and the fact that commands may do many different things depending on how you call them.  The fact that the concepts are not clear means that it is worse than a learning curve issue.  One cannot have a good grasp of what git is doing behind the scenes because this is not always clear.  In particular:

  1. In what world is a fast-forward a kind of merge?
  2. Is there any command you can explain (besides clone) in three sentences or less?
  3. What does git checkout do?  Why does it depend on what you check out?   Can you expect an intermediate user to understand what to expect if you have staged changes on a file, when you try to check out a copy of that file from another revision?
  4. What does git reset do from a user perspective?  Is there any way a beginner can understand that from reading the documentation?
  5. In other words, git commands try to do too much at once and this is often very confusing.
  6. Submodules are an afterthought and not well supported across the entire tool chain (why does git-archive not have an option to recurse into submodules?)
These specific complaints come from what I see as a lack of clarity regarding what concepts mean and they indicate that those who wrote the tools did so at a time when the concepts were still not enti
[...]
Posted by Joel Jacobson on 2016-12-10 at 19:58

I couldn’t find any graph showing all the relations between all the pg_catalog tables,
so just for fun I wrote a little script to parse the SGML and generate a graph using GraphViz.

#!/usr/bin/perl
use strict;
use warnings;

use File::Slurp qw(slurp);
use Data::Dumper;

open my $fh, "<:encoding(utf8)", './doc/src/sgml/catalogs.sgml' or die "$!";

my $table;
my $column;
my $references;
my $pg_catalog_fk_map = {};
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m!^\s+<title><structname>([^<>]+)</> Columns</title>$!) {
        $table = $1;
    } elsif ($line =~ m!^\s+<entry><structfield>([^<>]+)</structfield></entry>$!) {
        $column = $1;
    } elsif ($line =~ m!^\s+<entry><type>(oid|regproc)</type></entry>$!) {
    } elsif (defined $column && $line =~ m!^\s+<entry><literal><link\ linkend="[^"]+"><structname>([^<>]+)</structname></link>\.oid</literal></entry>$!) {
        $references = $1;
        if (!defined($pg_catalog_fk_map->{$table}->{$column}))
        {
            $pg_catalog_fk_map->{$table}->{$column} = $references;
        } elsif ($pg_catalog_fk_map->{$table}->{$column} ne $references) {
            die "Inconsistent data, $table.$column references both $pg_catalog_fk_map->{$table}->{$column} and $references";
        }
        $pg_catalog_fk_map->{$table}->{$column} = $references;
        if (!defined($pg_catalog_fk_map->{$references})) {
            $pg_catalog_fk_map->{$references} = {};
        }
    } else {
        undef($column);
        undef($references);
    }
}

my $dot = qq!
    digraph g {
        graph [
            rankdir = "LR"
        ];
        node [
            fontsize = "16"
            shape = "ellipse"
        ];
        edge [
        ];
!;
foreach my $table (sort keys %{$pg_catalog_fk_map}) {
    $dot .= qq!
        "$table" [
            headlabel = "$table"
            label = "$table | <oid> oid|!;
    foreach my $column (sort keys %{$pg_catalog_fk_map->{$table}}) {
        my $references = $pg_catalog_fk_map->{$table}->{$column};
        $dot .= "<$colum
[...]
Posted by Shaun M. Thomas on 2016-12-09 at 18:07

Recently a coworker asked me this question:

Should I expect variance between minutes and hours for the same query?

And I was forced to give him this answer:

Potentially, but not commonly. Query planning is an inexact science, and regardless of the query being the “same query,” the data is not the “same data.” This isn’t generally the case, but on occasion, changes in data can affect the query execution path. Usually this is a good thing, as the database accounts for new value distributions.

For example, if there are a million distinct values in one column, but 90% of them are the same, certain values should trigger an index scan instead of a sequential scan. Those values will change over time, and if the stats don’t account for that, queries will have non-dependable performance. Of course, this introduces potential correlation assumptions that aren’t correct in some cases, and that also causes unreliable query performance. I guess the question is: which would you rather have?

That answer, despite being “right”, didn’t sit well with me. While we did eventually sit down and optimize the query in question so it was less likely to mislead Postgres, there’s no reason to assume an end-user is going to readily accept unreliable query performance. Nor should they.

But we need to perform some analysis to determine how things got to this point. Let’s start with distribution. We have a table that contains about two weeks worth of data, represented by 66M rows over 100GB of space (135GB including indexes). We’re not scheduled to upgrade to 9.6 until early 2017, so the instance is running Postgres 9.5. It isn’t bleeding edge, but this is hardly an ancient installation.
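Before looking at plans, it is worth peeking at what the planner currently believes about the column driving these queries; a small sketch against the pg_stats view, using the table and column names from the plans below (adjust for the real relation name):

SELECT n_distinct, null_frac,
       most_common_vals, histogram_bounds
  FROM pg_stats
 WHERE tablename = 'new'
   AND attname = 'create_time';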

Consider the query plan for a standard retrieval of the last day worth of data:

EXPLAIN
SELECT *
  FROM NEW
 WHERE create_time >= '2016-12-08';
 
                               QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan ON NEW
     (cost=137498.78..8770565.51 ROWS=5819512 width=1032)
   Re
[...]

libpq is getting some improvements in its connection strings to define some properties that are expected from the backend server, thanks to the following commit, which will be part of Postgres 10:

commit: 721f7bd3cbccaf8c07cad2707826b83f84694832
author: Robert Haas <[email protected]>
date: Tue, 29 Nov 2016 12:18:31 -0500
libpq: Add target_session_attrs parameter.

Commit 274bb2b3857cc987cfa21d14775cae9b0dababa5 made it possible to
specify multiple IPs in a connection string, but that's not good
enough for the case where you have a read-write master and a bunch of
read-only standbys and want to connect to whichever server is the
master at the current time.  This commit allows that, by making it
possible to specify target_session_attrs=read-write as a connection
parameter.

[...]

Victor Wagner and Mithun Cy.  Design review by Álvaro Herrera, Catalin
Iacob, Takayuki Tsunakawa, and Craig Ringer; code review by me.  I
changed Mithun's patch to skip all remaining IPs for a host if we
reject a connection based on this new parameter, rewrote the
documentation, and did some other cosmetic cleanup.

In short, a new parameter called target_session_attrs is added, and it can use the following values:

  • “any”, meaning that any kind of server can be accepted. This is also the default value.
  • “read-write”, to disallow connections to read-only servers, hot standbys for example.

The strings using this parameter can have the following formats; both normal connection strings and URIs are of course supported:

host=host1 target_session_attrs=any
host=host1,host2 port=port1,port2 target_session_attrs=any
postgresql://host1:port2,host2:port2/?target_session_attrs=read-write

When attempting for example to connect to a standby, libpq would complain as follows:

$ psql -d "postgresql://localhost:5433/?target_session_attrs=read-write"
psql: could not make a writable connection to server "localhost:5433"

This feature finds its strength with for example a cluster in the shape of a primary and one or more standbys. If a failo

[...]
While trying to work with an AWS Postgres instance, it was interesting to see pg_dumpall failing to dump databases from my own Postgres instance! Delving further, for obvious reasons a managed service like this hides some global information, which pg_dump(all) needs ... and no, this post is NOT about circumventing that. This branch gives a 'near' workaround for those who are okay with being
Posted by Simon Riggs in 2ndQuadrant on 2016-12-08 at 22:57

CHAR(16) Conference on Scalability for Business went very well in the heart of New York City.

Hotel Pennsylvania was a great venue; convenient for travel and a great conference venue with coffee, food and evening drinks.

Jonathan Katz was great as MC for the conference, while Eric David Benari did a great job of hosting the Speaker Panel and keeping everybody on their toes with surprising and revealing questions. We’ll be posting the video for that sometime soon.

All the presentations were well received and the room was full all day long. Corridor feedback was that the speakers were good and the content was “dense” and very much in line with the specialised nature of the conference. Many good detailed questions after each talk, always a good sign people are tuned in.

Thanks to everybody that contributed, sponsored or attended. See you again soon. Go Postgres!

Anytime you run a query in Postgres, it needs to compile your SQL into a lower-level plan explaining how exactly to retrieve the data. The more it knows about the tables involved, the smarter the planner can be. To get that information, it gathers statistics about the tables and stores them, predictably enough, in the system table known as pg_statistic. The SQL command ANALYZE is responsible for populating that table. It can be done per-cluster, per-database, per-table, or even per-column. One major pain about analyze is that every table *must* be analyzed after a major upgrade. Whether you upgrade via pg_dump, pg_upgrade, Bucardo, or some other means, the pg_statistic table is not copied over and the database starts as a clean slate. Running ANALYZE is thus the first important post-upgrade step.

Unfortunately, analyze can be painfully slow. Slow enough that the default analyze methods sometimes take longer than the entire rest of the upgrade! Although this article will focus on the pg_upgrade program in its examples, the lessons may be applied to any upgrade method. The short version of the lessons is: run vacuumdb in parallel, control the stages yourself, and make sure you handle any custom per-column statistics.
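On that last point, the columns whose statistics target was changed away from the default can be listed with a catalog query; a small sketch, nothing upgrade-specific about it:

-- attstattarget is -1 when the default target applies; anything else was set by hand.
SELECT attrelid::regclass AS table_name,
       attname            AS column_name,
       attstattarget
  FROM pg_attribute
 WHERE attnum > 0
   AND NOT attisdropped
   AND attstattarget >= 0;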

Before digging into the solution in more detail, let's see why all of this is needed. Doesn't pg_upgrade allow for super-fast Postgres major version upgrades, including the system catalogs? It does, with the notable exception of the pg_statistic table. The nominal reason for not copying the data is that the table format may change from version to version. The real reason is that nobody has bothered to write the conversion logic yet, for pg_upgrade could certainly copy the pg_statistic information: the table has not changed for many years.

At some point, a DBA will wonder if it is possible to simply copy the pg_statistic table from one database to another manually. Alas, it contains columns of the type "anyarray", which means it cannot be dumped and restored:

$ pg_dump
[...]

If you are a DBA and ever have had to manually call pg_terminate_backend() to kill some misbehaving backend process, then this tool might be of interest.

Long running queries are not a problem, as long as they don’t force other important processes to wait.

Another classic is a human user forgetting to COMMIT.
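For reference, the manual version of what such a tool automates looks something like this; a hedged sketch, with an arbitrary threshold to adjust for your environment:

-- Terminate sessions that have been sitting idle in an open transaction for too long.
SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE state = 'idle in transaction'
   AND now() - state_change > interval '30 minutes';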

Please see the Github page for more info:

https://github.com/trustly/pgterminator



Lately, I wanted to make a new test / lab environment that would be dispensable, fast to deploy and easily customisable. VMs are all nice for lab use and they have served me well over the years, but they tend to be slow(er) to deploy, heavy on resources and, let's face it,



So I've decided to play around with Docker. I've played with Docker in the past but I haven't done anything too serious; I just took the official postgres Dockerfile and ran it to see how it works. This is how I started my whole Docker exploration: I took the official postgres Dockerfile that can be found here and made an image. Immediately I saw that it wasn't exactly what I wanted; I wanted something less generic. The things that I would like are:

  • latest postgres version, compiled from source
  • easily customisable pgdata and prefix
  • custom postgresql.conf settings (mostly having it replication ready)
  • custom shell environment
  • some extensions
  • data page checksums
  • easy replica deployment
So I started creating my own Dockerfile that would fill my needs and at the same time teach me some more about Docker; two birds, one stone kind of thing. After several hours and some testing I came up with something that was working (link at the bottom). It still needs some work, but I plan on maintaining it for my own use, so feel free to use it if you like what I've done there.

Let me explain a bit how it works.
With :
ENV PGBINDIR=/home/postgres/pgsql
ENV PGDATADIR=/home/postgres/pgdata
I can customise the installation and the pgdata directories. I install all the packages I need (note that comments on joe will be deleted!). Then I add the postgres user to sudoers (this Dockerfile is not meant to be secure; ease of use is what I am looking for when it comes to a lab postgres instance).
I -always- get the latest postgres version sources, compile and install them (including extensions, documentation etc.), set some environment variables, configure postgres (postgresql.conf, pg_hba.conf), put a sample recovery.done there so slaves can get it from pg_basebackup,
[...]