
Querying Postgres Tables Directly from DuckDB

2022-09-30 Hannes Mühleisen

TLDR: DuckDB can now directly query data stored in PostgreSQL and speed up complex analytical queries without duplicating data.

DuckDB goes Postgres

Introduction

PostgreSQL is the world's most advanced open source database (self-proclaimed). From its interesting beginnings as an academic DBMS, it has evolved over the past 30 years into a fundamental workhorse of our digital environment.

PostgreSQL is designed for traditional transactional use cases, "OLTP", where rows in tables are created, updated and removed concurrently, and it excels at them. But this design decision makes PostgreSQL far less suitable for analytical use cases, "OLAP", where large chunks of tables are read to create summaries of the stored data. Yet there are many use cases where both transactional and analytical workloads are important, for example when trying to gain the latest business intelligence insights into transactional data.

There have been some attempts to build database management systems that perform well on both workloads, "HTAP", but in general many design decisions between OLTP and OLAP systems are hard trade-offs, making this endeavour difficult. Accepting that one size does not fit all after all, systems are usually separated, with the transactional application data living in a purpose-built system like PostgreSQL, and a copy of the data being stored in an entirely different DBMS. Using a purpose-built analytical system speeds up analytical queries by several orders of magnitude.

Unfortunately, maintaining a copy of the data for analytical purposes can be problematic: the copy will immediately become outdated as new transactions are processed, requiring a complex and non-trivial synchronization setup. Storing two copies of the database also requires twice the storage space. For example, OLTP systems like PostgreSQL traditionally use a row-based data representation, while OLAP systems tend to favor a chunked-columnar data representation. You cannot have both without maintaining a copy of the data, with all the issues that brings with it. Furthermore, the SQL syntax of whatever OLAP system you use may differ quite a bit from that of Postgres.

But the design space is not as black and white as it seems. For example, the OLAP performance in systems like DuckDB does not come only from a chunked-columnar on-disk data representation. Much of DuckDB's performance comes from its vectorized query processing engine that is custom-tuned for analytical queries. What if DuckDB were able to somehow read data stored in PostgreSQL? While it seems daunting, we have embarked on a quest to make exactly this possible.

To allow for fast and consistent analytical reads of Postgres databases, we designed and implemented the "Postgres Scanner". This scanner leverages the binary transfer mode of the Postgres client-server protocol (see the Implementation section below for more details), allowing us to efficiently transform and use the data directly in DuckDB.

Among other things, DuckDB's design differs from conventional data management systems because DuckDB's query processing engine can run on nearly arbitrary data sources without needing to copy the data into its own storage format. For example, DuckDB can currently run queries directly on Parquet files, CSV files, SQLite files, Pandas, R and Julia data frames as well as Apache Arrow sources. This new extension adds the capability to directly query PostgreSQL tables from DuckDB.
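As a quick illustration of that capability, queries like the following work directly in DuckDB without any import step (the file names here are hypothetical):

SELECT count(*) FROM 'measurements.parquet';
SELECT * FROM read_csv_auto('measurements.csv');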

Usage

The Postgres Scanner DuckDB extension source code is available on GitHub, but it is directly installable through DuckDB's new binary extension installation mechanism. To install, just run the following SQL query once:

INSTALL postgres_scanner;

Then, whenever you want to use the extension, you first need to load it:

LOAD postgres_scanner;

To make a Postgres database accessible to DuckDB, use the POSTGRES_ATTACH command:

CALL postgres_attach('dbname=myshinydb');

postgres_attach takes a single required string parameter, which is the libpq connection string. For example, you can pass 'dbname=myshinydb' to select a different database name. In the simplest case, the parameter is just ''. There are three additional named parameters to the function (see the example after the list below):

  • source_schema: the name of a non-standard schema in Postgres to get tables from. Default is public.
  • overwrite: whether existing views in the target schema should be overwritten. Default is false.
  • filter_pushdown: whether filter predicates that DuckDB derives from the query should be forwarded to Postgres. Defaults to false. See below for a discussion of what this parameter controls.
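For example, to attach only the tables from a hypothetical accounting schema and enable filter pushdown, the call might look like this (a sketch, assuming DuckDB's name=value syntax for named parameters):

CALL postgres_attach('dbname=myshinydb', source_schema='accounting', filter_pushdown=true);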

The tables in the database are registered as views in DuckDB; you can list them with

PRAGMA show_tables;

Then you can query those views normally using SQL. Again, no data is being copied; this is just a virtual view on the tables in your Postgres database.
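For example, assuming the attached database contains a table called lineitem, it can now be queried as if it were a local DuckDB table:

SELECT count(*) FROM lineitem;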

If you prefer not to attach all tables, but just query a single table, you can use the POSTGRES_SCAN and POSTGRES_SCAN_PUSHDOWN table-producing functions directly, e.g.

SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'mytable');
SELECT * FROM postgres_scan_pushdown('dbname=myshinydb', 'public', 'mytable');

Both functions take three unnamed string parameters: the libpq connection string (see above), a Postgres schema name and a table name. The schema name is often public. As the name suggests, the variant with "pushdown" in its name will perform selection pushdown as described below.

The Postgres scanner can only read actual tables; views are not supported. However, you can of course recreate such views within DuckDB; the syntax should be very similar!
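For instance, a Postgres view that filters a table could be recreated in DuckDB along these lines (a sketch; the view, table and column names are hypothetical):

CREATE VIEW recent_orders AS
SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'orders')
WHERE o_orderdate >= DATE '1998-01-01';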

Implementation

From an architectural perspective, the Postgres Scanner is implemented as a plug-in extension for DuckDB that provides a so-called table scan function (postgres_scan) in DuckDB. There are many such functions in DuckDB and its extensions, such as the Parquet and CSV readers, Arrow readers and so on.

The Postgres Scanner uses the standard libpq library, which it statically links in. Somewhat ironically, this makes the Postgres Scanner easier to install than other Postgres clients. However, Postgres' normal client-server protocol is quite slow, so we spent quite some time optimizing this. As a note, DuckDB's SQLite Scanner does not face this issue, as SQLite is also an in-process database.

We actually implemented a prototype direct reader for Postgres' database files, but while its performance was great, there is the issue that committed but not yet checkpointed data would not be stored in the heap files yet. In addition, if a checkpoint was currently running, our reader would sometimes overtake the checkpointer, causing additional inconsistencies. We abandoned that approach since we want to be able to query an actively used Postgres database and consider consistency essential. Another architectural option would have been to implement a DuckDB Foreign Data Wrapper (FDW) for Postgres, similar to duckdb_fdw, but while this would address the protocol issue, deploying a Postgres extension on production servers is quite risky, so we expect few people would be able to do so.

Instead, we use the rarely-used binary transfer mode of the Postgres client-server protocol. This format is quite similar to the on-disk representation of Postgres data files and avoids some of the otherwise expensive to-string and from-string conversions. For example, to read a normal int32 from the protocol message, all we need to do is swap the byte order (ntohl).

The Postgres scanner connects to PostgreSQL and issues a query to read a particular table using the binary protocol. In the simplest case (see the optimizations below), to read a table called lineitem, we internally run the query:

COPY (SELECT * FROM lineitem) TO STDOUT (FORMAT binary);

This query starts reading the contents of lineitem and writes them directly to the protocol stream in binary format.

Parallelization

DuckDB supports automatic intra-query parallelization through pipeline parallelism, so we also want to parallelize scans on Postgres tables: our scan operator opens multiple connections to Postgres and reads subsets of the table from each. To split up reading the table efficiently, we use Postgres' rather obscure TID Scan (Tuple ID) operator, which allows a query to surgically read a specified range of tuple IDs from a table. Tuple IDs have the form (page, tuple). We parallelize our scan of a Postgres table based on database page ranges expressed in TIDs. Each scan task currently reads 1000 pages. For example, to read a table consisting of 2500 pages, we would start three scan tasks with the TID ranges [(0,0),(999,0)], [(1000,0),(1999,0)] and [(2000,0),(UINT32_MAX,0)]. Having an open bound for the last range is necessary because the number of pages (relpages) in a table listed in the pg_class table is merely an estimate. For a given page range (P_MIN, P_MAX), our query from above is thus extended to look like this:

COPY (
  SELECT
    *
  FROM lineitem
  WHERE
    ctid BETWEEN '(P_MIN,0)'::tid AND '(P_MAX,0)'::tid
  ) TO STDOUT (FORMAT binary);

This way, we can efficiently scan the table in parallel while not relying on the schema in any way. Because the page size is fixed in Postgres, this also has the added bonus of equalizing the effort of reading a subset of the table independently of the number of columns in each row.

"But wait!", you might say, according to the documentation the tuple ID is not stable and may be changed by operations such as VACUUM FULL. How can you use it for synchronizing parallel scans? This is true, and could be problematic, but we found a solution:

Transactional Synchronization

Of course, a transactional database such as Postgres is expected to run transactions while we run our table scans for analytical purposes. We therefore need to handle concurrent modifications to the table we are scanning in parallel. We solve this by first creating a new read-only transaction in DuckDB's bind phase, where query planning happens. We leave this transaction running until we are completely done reading the table. We use yet another little-known Postgres feature, pg_export_snapshot(), which allows us to get the current transaction context in one connection and then import it into our parallel read connections using SET TRANSACTION SNAPSHOT .... This way, all connections belonging to a single table scan will see the table state exactly as it appeared at the very beginning of our scan throughout the potentially lengthy read process.
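Conceptually, this corresponds to statements like the following on the Postgres side (a sketch of the mechanism, not the literal statements the scanner issues; the exported snapshot identifier is made up):

-- connection 1: open a read-only transaction and export its snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY;
SELECT pg_export_snapshot();  -- returns e.g. '00000004-0000002A-1'

-- connections 2..n: adopt the exported snapshot before scanning their page ranges
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ, READ ONLY;
SET TRANSACTION SNAPSHOT '00000004-0000002A-1';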

Projection and Selection Push-Down

DuckDB's query optimizer moves selections (filters on rows) and projections (removal of unused columns) as low as possible in the query plan (push down), and even instructs the lowermost scan operators to perform those operations if they support them. For the Postgres scanner, we have implemented both push down variants. Projections are rather straightforward – we can immediately instruct Postgres to only retrieve the columns the query is using. This of course also reduces the number of bytes that need to be transferred, which speeds up queries. For selections, we construct a SQL filter expression from the pushed down filters. For example, if we run a query like SELECT l_returnflag, l_linestatus FROM lineitem WHERE l_shipdate < '1998-09-02' through the Postgres scanner, it would run the following queries:

COPY (
  SELECT
    "l_returnflag",
    "l_linestatus"
  FROM "public"."lineitem"
  WHERE
    ctid BETWEEN '(0,0)'::tid AND '(1000,0)'::tid AND
    ("l_shipdate" < '1998-09-02' AND "l_shipdate" IS NOT NULL)
  ) TO STDOUT (FORMAT binary);

As you can see, the projection and selection pushdown has expanded the queries run against Postgres accordingly. Using the selection push-down is optional. There may be cases where running a filter in Postgres is actually slower than transferring the data and running the filter in DuckDB, for example when filters are not very selective (many rows match).

Performance

To investigate the performance of the Postgres Scanner, we ran the well-known TPC-H benchmark on DuckDB using its internal storage format, on Postgres also using its internal format, and on DuckDB reading from Postgres using the new Postgres Scanner. We used DuckDB 0.5.1 and Postgres 14.5; all experiments were run on a MacBook Pro with an M1 Max CPU. The experiment script is available. We ran "scale factor" 1 of TPC-H, creating a dataset of roughly 1 GB with ca. 6 M rows in the biggest table, lineitem. Each of the 22 TPC-H benchmark queries was run 5 times, and we report the median run time in seconds. The time breakdown is given in the following table.

query duckdb (s) duckdb/postgres (s) postgres (s)
1 0.03 0.74 1.12
2 0.01 0.20 0.18
3 0.02 0.55 0.21
4 0.03 0.52 0.11
5 0.02 0.70 0.13
6 0.01 0.24 0.21
7 0.04 0.56 0.20
8 0.02 0.74 0.18
9 0.05 1.34 0.61
10 0.04 0.41 0.35
11 0.01 0.15 0.07
12 0.01 0.27 0.36
13 0.04 0.18 0.32
14 0.01 0.19 0.21
15 0.03 0.36 0.46
16 0.03 0.09 0.12
17 0.05 0.75 > 60.00
18 0.08 0.97 1.05
19 0.03 0.32 0.31
20 0.05 0.37 > 60.00
21 0.09 1.53 0.35
22 0.03 0.15 0.15

Stock Postgres is not able to finish queries 17 and 20 within a one-minute timeout because of correlated subqueries containing a query on the lineitem table. For the other queries, we can see that DuckDB with the Postgres Scanner not only finished all queries, it was also faster than stock Postgres on roughly half of them, which is remarkable given that DuckDB has to read its input data from Postgres through the client/server protocol as described above. Of course, stock DuckDB is still 10x faster with its own storage, but as discussed at the very beginning of this post this requires the data to be imported there first.

Totally different Employ Cases

The Postgres Scanner can also be used to combine live Postgres data with pre-cached data in creative ways. This is especially effective when dealing with an append-only table, but can also be used if a modified-date column is present. Consider the following SQL template:

INSERT INTO my_table_duckdb_cache
SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'my_table')
WHERE incrementing_id_column > (SELECT MAX(incrementing_id_column) FROM my_table_duckdb_cache);

SELECT * FROM my_table_duckdb_cache;

This provides faster query performance with fully up-to-date query results, at the price of data duplication. It also avoids complex data replication technologies.

DuckDB has built-in support for writing query results to Parquet files. The Postgres scanner provides a rather straightforward way to write Postgres tables to Parquet files; it can even write directly to S3 if desired. For example,

COPY (SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'lineitem')) TO 'lineitem.parquet' (FORMAT PARQUET);
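To write to S3 instead, the COPY target can be an s3:// URL once DuckDB's httpfs extension is loaded (a sketch; the bucket name and credentials are placeholders, and the exact configuration settings may vary between DuckDB versions):

INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='<access key>';
SET s3_secret_access_key='<secret key>';
COPY (SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'lineitem'))
  TO 's3://my-bucket/lineitem.parquet' (FORMAT PARQUET);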

Conclusion

DuckDB's new Postgres Scanner extension can read PostgreSQL's tables while PostgreSQL is running and compute the answers to complex OLAP SQL queries, often faster than PostgreSQL itself can, without the need to copy data. The Postgres Scanner is currently in preview and we are eager to hear what you think. If you encounter any issues with the Postgres Scanner, please report them.


Tags: Querying, Postgres