Redshift Research Project

White Papers

Introduction

Every white paper is accompanied by the Python script which generates the evidence from which the white paper draws its conclusions.

The scripts require boto3 version 1.17.34 or later.

The scripts start a new cluster, run the tests/investigations/benchmarks, emit the data in raw format (pprint) and as markdown, shut down the cluster, and quit. Every white paper contains the markdown output from its script run and an appendix with the raw data, so you can check that the evidence in the white paper matches what you yourself currently find.

White Papers

Late-Binding Views 23 pages
There are in Redshift two types of view, normal and late-binding. When the leader node creates a normal view, it checks at the time of creation pg_class for the tables and views being used by the view, and will not create the view if it refers to non-existent tables or views. Accordingly, external tables cannot be used with normal views, because external tables are not present in pg_class and so are perceived as being non-existent. To work around this, late-binding views were created, which perform no dependency checking - they do not inspect pg_class. This approach has two major and three minor issues. First, the lack of dependency checking means blunders when creating or modifying views lead to faults which can become apparent only at a very different time and in a very different place to a very different person, who has no idea what caused the fault. Second, as the number and complexity of late-binding views increase, there are critical performance problems when enumerating the columns in late-binding views, such that it can become impossible to enumerate late-binding view columns at all. Third, a system table function is used to enumerate late-binding view columns, and access privileges for this function are incorrectly implemented, such that syslog unrestricted and access system table do not work. Fourth, the lack of dependency information means no records in pg_depend, so it is impossible to perform data lineage or inspect dependencies. Finally, late-binding views pollute pg_class, as they look like normal views (except for a single undocumented difference). Late-binding views should, due to their flaws, be used for and only for their one essential purpose: when an external table must be in a view.
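
As a minimal sketch of this difference in dependency checking (the connection parameters and object names below are hypothetical):

    # Sketch: normal views are dependency-checked at creation, late-binding
    # views are not.  Connection parameters and object names are hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="secret")
    conn.autocommit = True
    cur = conn.cursor()

    # A normal view over a non-existent table is rejected at CREATE time,
    # because the leader node checks pg_class for the referenced relations.
    try:
        cur.execute("CREATE VIEW v_normal AS SELECT col1 FROM no_such_table;")
    except psycopg2.Error as error:
        print("normal view rejected:", error)

    # The late-binding equivalent is created without complaint; any error
    # surfaces only when the view is eventually queried.
    cur.execute("CREATE VIEW v_late AS SELECT col1 FROM no_such_table "
                "WITH NO SCHEMA BINDING;")
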
Amazon Redshift Serverless 113 pages
Redshift Serverless is not serverless. A workgroup is a normal, ordinary Redshift cluster. All workgroups are initially created as a 16 node cluster with 8 slices per node, which is the default 128 RPU workgroup, and then elastic resized to the size specified by the user. This is why the original RPU range is 32 to 512 in units of 8 and the default is 128 RPU; the default is the mid-point of a 4x elastic resize range, and a single node, the smallest possible change in cluster size, is 8 RPU/slices. 1 RPU is 1 slice.

With elastic rather than classic resize, the greater the change in either direction from the size of the original cluster, the more inefficiency is introduced into the cluster. As a cluster becomes increasingly larger, it becomes increasingly computationally inefficient (the largest workgroup has 128 normal data slices, but 384 of the lesser compute slices), and increasingly under-utilizes disk parallelism. As a cluster becomes increasingly smaller, it becomes increasingly computationally inefficient (each data slice must process multiple slices’ worth of data), and incurs increasingly more disk use overhead with tables. The more recently introduced smaller workgroups, 8 to 24 RPU (inclusive both ends), use a 4 slice node and have two nodes for every 8 RPU. In this case, the 8 RPU workgroup is initially a 16 node cluster with 8 slices per node, which is resized to a 2 node cluster with 4 slices per node - a staggering 16x elastic resize; the largest resize permitted to normal users is 4x. An 8 RPU workgroup, with small tables, uses 256mb per column rather than the 16mb per column of a native two node cluster.

Workgroups have a fixed number of RPU and require a resize to change this; workgroups do not dynamically auto-scale RPUs. I was unable to prove it, because AutoWLM is in the way, but I am categorically of the view that the claims made for Serverless for dynamic auto-scaling are made on the basis of the well-known and long-established mechanisms of AutoWLM and Concurrency Scaling Clusters (“CSC”).

Finally, it is possible to confidently extrapolate, from the ra3.4xlarge and ra3.16xlarge node types, the price the 8 slice node type would have in a provisioned cluster: 6.52 USD per hour. Both Provisioned and Serverless clusters charge per node-second, but Serverless goes to zero cost with zero use. With the default Serverless workgroup of 128 RPU/16 nodes (avoiding the need to account for the inefficiencies introduced by elastic resize), one hour of constant use (avoiding the need to account for the Serverless minimum query charge of 60 seconds of run-time), without CSC (avoiding the question of how AutoWLM will behave), costs 46.08 USD. A Provisioned cluster composed of the same nodes costs 104.32 USD; about twice as much. Here we have to take into consideration the inefficiencies introduced by elastic resize, which become more severe the more the cluster deviates from 16 NRPU, that Serverless uses AutoWLM, with all its drawbacks and which is a black box controlling the use of CSC, with each CSC being billed at the price of the workgroup, and the 60 second minimum charge. All Serverless costs (including the charge for Redshift Spectrum S3 access) have been rolled into a single AWS charge for Serverless, so it is not possible to know what is costing money.

It would have been much better if AWS had simply introduced zero-zero billing on Provisioned clusters. This would avoid many of the weaknesses and drawbacks of Serverless Redshift, which wholly unnecessarily devalue the Serverless product, as well as avoiding the duplicity, considerable added complexity, end-user confusion, cost in developer time and induced cluster inefficiency involved in the pretence that Serverless is serverless.
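
The cost comparison above can be reproduced with simple arithmetic, using only the figures quoted in this abstract:

    # Working through the pricing figures quoted above (all prices in USD).
    node_price_per_hour = 6.52      # extrapolated provisioned price, 8 slice node
    nodes = 16                      # the default 128 RPU workgroup is 16 such nodes

    provisioned_per_hour = node_price_per_hour * nodes
    print(provisioned_per_hour)     # 104.32

    serverless_per_hour = 46.08     # one hour of constant use at 128 RPU, no CSC
    print(serverless_per_hour / 128)                    # 0.36 USD per RPU-hour
    print(provisioned_per_hour / serverless_per_hour)   # ~2.26, about twice the price
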
MERGE 41 pages
The MERGE command is implemented as a wrapper around existing SQL commands, calling CREATE TEMP TABLE, UPDATE, and INSERT or DELETE. The SQL standard specifies the MERGE command will either match (and so update) or not match (and so insert) each source row. When manually issuing an UPDATE followed by an INSERT, it is possible for a row to both match and not match, when the change performed by the update causes a row which did match to no longer match. This problem is solved by using a temp table to store all rows which are to be inserted, prior to running the update. However, in the case where the data is such that this problem does not occur, the temp table is not necessary, and is pure overhead. As such, MERGE is a kind of worst-case implementation; it has to always work correctly, so it always uses a temp table, even when it is not necessary. Finally, MERGE mandates the use of and only of a table for merge source rows, and I can see no reason for this, as none of the wrapped commands require it, and all accept a sub-query or view.
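
A sketch of the manual upsert pattern which MERGE wraps, as described above; the temp table is populated before the update runs, so a row changed by the update cannot subsequently be treated as unmatched (connection parameters and table/column names are hypothetical):

    # Manual merge: capture rows to insert *before* the update, then update,
    # then insert.  Table and column names are hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="secret")
    cur = conn.cursor()

    cur.execute("""
        CREATE TEMP TABLE staging_new_rows AS
        SELECT s.*
        FROM   source_table s
        LEFT JOIN target_table t ON t.id = s.id
        WHERE  t.id IS NULL;
    """)
    cur.execute("""
        UPDATE target_table
        SET    value = s.value
        FROM   source_table s
        WHERE  target_table.id = s.id;
    """)
    cur.execute("INSERT INTO target_table SELECT * FROM staging_new_rows;")
    conn.commit()
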
Introduction to the Fundamentals of Amazon Redshift 31 pages
This is a short, punchy document written to present the critical, central issues involved in Redshift use to a weakly technical (or better) reader. It is not lengthy, highly structured, completely thorough or detailed, but rather directly conveys the key issues as concisely and approachably as possible.
Users, Groups, Roles and Privileges 22 pages
In Postgres, roles replaced users and groups (both become roles). Roles in Redshift do not replace users or groups, but exist alongside them, as a third, separate, first-class type. Roles, like groups, can be granted privileges, but roles, unlike groups, can be granted to roles. Granting a role is in effect the same as adding a user to a group. Along with roles come a new set of Redshift-specific privileges, which can be granted to and only to roles (roles can also be granted the existing Postgres-style privileges). These new privileges are unlike the existing privileges (which are per-user, per-object) as they are global; they apply to everything, everywhere, always, in all databases. They are essentially fragments of the super user. Finally, note the documentation for roles is particularly egregious, and there are a few of the new privileges which are properly tantamount to granting super user, as they allow a user to elevate themselves to super user.
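
A brief sketch of the grant mechanics described above; the object names are hypothetical and the statements are illustrative rather than an authoritative syntax reference:

    # Roles can be granted privileges, granted to other roles, and granted to
    # users (the last being in effect group membership).  Names are hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="secret")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE ROLE read_only;")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO ROLE read_only;")
    cur.execute("CREATE ROLE analyst;")
    cur.execute("GRANT ROLE read_only TO ROLE analyst;")  # role granted to a role
    cur.execute("GRANT ROLE analyst TO some_user;")       # like adding a user to a group
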
Robust ETL Design and Implementation for Amazon Redshift 18 pages
Usually ETL systems are provided with only minimal information and so are capable only of minimal actions. This leads to simple ETL systems, which for example check for new data files in some location, load them to a table, then move those files out of the way, and have no other capabilities. Such ETL systems are not robust, as they require arbitrary, unplanned human intervention to fix and to recover from bugs or data errors. The key to robustness in ETL systems is the provision of additional information such that the ETL system can undertake a wider range of actions, reducing the need for human intervention. A simple but profound enabler is the provision to the ETL system of the information of the set of available data files, and the set of data files which have been loaded into the database. This allows the automation of a wide range of behaviour, including the recovery process once human intervention has fixed bugs or data errors, and so provides a robust ETL system.
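
A minimal sketch of the enabling idea, assuming the ETL system is given both the set of available data files and the set of files recorded as loaded (the file names are hypothetical):

    # The work to do, and any loads whose source file has vanished, fall out
    # as simple set differences.  File names are hypothetical.
    available_files = {"sales_2024_01.csv", "sales_2024_02.csv", "sales_2024_03.csv"}
    loaded_files    = {"sales_2024_01.csv", "sales_2024_02.csv"}

    to_load  = sorted(available_files - loaded_files)  # files the ETL system should load
    orphaned = sorted(loaded_files - available_files)  # loads whose source file is gone

    print("to load :", to_load)
    print("orphaned:", orphaned)
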
AQUA 43 pages
AQUA is a solution to the problem that min-max culling (aka the Zone Map) is wholly ineffective when searching for strings with wildcard patterns, forcing full column scans; as such, all scan steps, and only scan steps, which use the LIKE and/or SIMILAR TO operators are routed to AQUA, as these are the two operators offering search in strings with wildcards. AQUA is implemented as a multi-node distributed set of independent data processors, separate from RMS and from Redshift, and as such the basic design is the same as Redshift Spectrum, with each processor scanning a portion of the data in a table and returning results to the Redshift cluster. AQUA appears to implement most if not all of the SQL functionality possible with this design, which is to say, that functionality which can be performed by examining a single row at a time in a single table, and by maintaining aggregate information from those rows; so there exists functionality which AQUA cannot implement, such as distinct. When AQUA is presented with a scan step with work AQUA cannot perform, necessarily some, most or all of the rows in the table must be returned to the Redshift cluster, such that that work can be performed there. As such, queries must be written correctly for AQUA, just as with Spectrum. AWS do not provide the information to do this, and so guesswork and experimentation must be used. When a scan step is routed to AQUA, AQUA downloads the table being scanned from RMS to itself, not to the Redshift cluster local SSD cache. With the 12,988 block test table, on an idle two node ra3.xlplus cluster, this took 41 seconds. After this, AQUA performs more quickly than the cluster, with the time taken by the test queries, for the part of the work routed to AQUA, going from about 9.4 seconds to about 1.7 seconds. It is necessary then to issue a number of LIKE or SIMILAR TO queries on a table to reclaim the large initial cost of the first query, otherwise AQUA will actually reduce performance. There are no findings about how long must pass before a table needs to be re-read by AQUA. In the course of investigating AQUA, it was discovered that RMS is a block-based store, not a table-based store; Redshift brings into local SSD cache only those blocks which are needed to service queries. With an ra3.xlplus node, bandwidth to RMS is about 160mb/second, and bandwidth to the local SSD cache is about 720mb/second.
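
Taking the timings in this abstract at face value, the break-even point for AQUA on the test table can be estimated directly:

    # One-off cost of the first AQUA-routed query against the test table,
    # versus the per-query saving on the AQUA-routed part of the work.
    initial_cost = 41.0            # seconds to pull the 12,988 block table into AQUA
    per_query_saving = 9.4 - 1.7   # seconds saved per subsequent routed query

    print(initial_cost / per_query_saving)   # ~5.3: roughly six queries to break even
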
AZ64 Encoding 35 pages
The AZ64 encoding is the Gorilla compression method from Facebook (2015), but with the 1-bit encoding for repeating values removed and replaced by no less than three distinct runlength-like compression methods, all of which seem completely crazy. With repeating values, when the number of repetitions is 2 to 63, no runlength encoding occurs and the full Gorilla encoding is used for each row, a lost compression opportunity; when the number of repeating values is a positive integer power of 2 between (inclusive both ends) 64 and 16384, each group of repeating values is encoded into a “storage unit” which is between 5 and 33 (inclusive both ends) bytes depending on data type which stores a copy of the value and a 15 bit counter, allowing for billions of rows per block, where I think (but have not yet investigated) materialization is block based and so needing to materialize any row in that block means all rows must be materialized; and, finally, when the number of repetitions is 65 or more, but not a positive integer power of 2, the number of rows stored per block varies depending on the value being encoded, the number of repetitions and the bit-pattern of the number of repetitions (!), as Gorilla-like encoding is being used on that bit-pattern, such that increasing the number of repetitions often decreases the number of rows stored in a block (as the bit-pattern of the number of repetitions has become less favourable to Gorilla encoding).
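
A small classifier for the three repetition-handling cases described above; this reflects only the ranges stated in this abstract and is in no way AZ64 source code:

    # Classify a repetition count into the behaviour described above.
    def az64_repetition_case(n):
        is_power_of_two = n > 0 and (n & (n - 1)) == 0
        if 2 <= n <= 63:
            return "no runlength: full Gorilla encoding for each row"
        if 64 <= n <= 16384 and is_power_of_two:
            return "storage unit: copy of the value plus a 15 bit counter"
        if n >= 65 and not is_power_of_two:
            return "Gorilla-like encoding of the repetition count bit-pattern"
        return "not covered by the ranges given above"

    for n in (3, 64, 100, 128, 16384, 20000):
        print(n, az64_repetition_case(n))
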
Query Compilation 307 pages
Redshift converts SQL queries to C++, which is compiled into multiple binaries (known as segments) which are then distributed to, and executed by, the worker nodes. Historically, and still in some regions, this work is performed on the leader node, and compiling a segment on an unloaded cluster then typically takes about 4 to 8 seconds, where a complex query might have ten segments. More recently, a compiled segment cache was introduced, which appears by its performance and rapid expiry of segments to be on the leader node; if a segment is in this cache, it is not compiled but is served, extremely quickly (thousandths of a second), from the cache. More recently still, an ancillary system was introduced, the compile off-load system, which off-loads compilation work from the leader node. This system appears to have its own cache, and when a query is compiled, if one or more segments are not in the cache, then there is a slow first segment (typically the usual 4 to 8 seconds), during which all segments are compiled in parallel, and after this the leader node obtains segments from the compile off-load system at the rate of about a tenth of a second each. However, if all segments are in the compile off-load segment cache, then the slow first segment does not occur. In all cases, increasing the number of concurrently compiled queries increases compile time, but in a somewhat better than linear fashion. Finally, a serious vulnerability was discovered, whereby a normal unprivileged user can instantly crash Redshift. This is achieved by creating an empty table with a sufficiently large number of columns, and then issuing a query which has one window function per column; the core issue, I suspect, is actually simply the issuing of a query with between 980 and 1664 segments (inclusive both ends).
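
A sketch of the kind of timing measurement involved in observing the compile caches; the connection parameters and query are hypothetical, and the session result cache is disabled so that the second run reflects segment caching rather than result caching:

    # Time the same query twice; the first run pays any compilation cost, the
    # second should be served from the compiled segment cache.
    import time
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="secret")
    cur = conn.cursor()
    cur.execute("SET enable_result_cache_for_session TO off;")

    query = "SELECT count(*) FROM some_table WHERE some_column > 10;"

    for attempt in ("first run", "second run"):
        start = time.monotonic()
        cur.execute(query)
        cur.fetchall()
        print(attempt, round(time.monotonic() - start, 3), "seconds")
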
Cross-Region Benchmarks 100 pages
Within the resolving power of the benchmark suite used, Redshift node types are identical in performance across regions, with the single exception of the dc2.large node type, which is much slower in five regions; af-south-1, ap-east-1, ap-southeast-3, eu-south-1, and me-south-1.
Multi-Version Concurrency Control and Serialization Isolation Failure 53 pages
This white paper describes and explains Multi-Version Concurrency Control (MVCC for short), which inherently entails describing and explaining transactions and table locks, as these are part of MVCC. Aside from generally explaining what’s going on inside Redshift, the particular goal is to explain serialization isolation failures, as these originate from MVCC: to understand their origin, the manifold ways by which they occur, how to write code such that isolation failures do not occur in the first place, and how to fix them when you are working with code where they do.
Bytedict Encoding 11 pages
The bytedict dictionary contains up to 256 entries, but the final entry is always used as an end-of-dictionary marker and so does not hold a value from the block. The dictionary holds a copy of each value placed in it, and so its size varies according to the size of the column data type; varchars are stored at their full DDL length, regardless of the actual string length. The dictionary is stored in the block, so a large dictionary means less storage space for values. The dictionary is populated by up to the first 255 distinct values in the block (and so is affected by sorting), but is only as large as it needs to be, rather than always being the size of 256 entries. A value which is present in the dictionary is replaced by a one byte index into the dictionary. A value which is not present in the dictionary is written out in full, plus a one byte overhead. A nullable (as opposed to NOT NULL) bytedict column consumes an additional 1 bit per value, even for varchar, which normally consumes an additional 1 byte per value.
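
A simplified size model for a single bytedict-encoded block, following the description above; it ignores the end-of-dictionary marker, block headers and NULL overhead, and the DDL length shown is hypothetical:

    # Estimate the encoded size of one block's values under bytedict.
    def bytedict_block_size(values, ddl_len):
        dictionary = []                 # the first 255 distinct values in the block
        for v in values:
            if v not in dictionary and len(dictionary) < 255:
                dictionary.append(v)
        in_dict = set(dictionary)
        size = len(dictionary) * ddl_len        # varchar entries stored at full DDL length
        for v in values:
            if v in in_dict:
                size += 1                       # one byte index into the dictionary
            else:
                size += len(v) + 1              # value written in full, plus one byte
        return size

    # 3,000 values drawn from three distinct strings, in a varchar(32) column
    print(bytedict_block_size(["red", "green", "blue"] * 1000, ddl_len=32))
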
Pseudo-Random Number Generation 35 pages
Redshift provides two PRNGs, one on the leader node and one on the worker nodes. The leader node PRNG is correctly implemented. The worker node PRNG is fundamentally flawed. Each worker slice produces a linear, minutely incrementing number sequence (the “non-PRNG number sequence”) which cycles between 0.0 and 1.0, where when a query is issued, each slice consumes the current number from that sequence and uses it as the seed for a PRNG, which in turn uses it to generate a random number sequence for and only for that query. The first random number emitted by a query on each slice is the number from the non-PRNG number sequence on that slice. The following numbers are from the PRNG. Low-entropy non-PRNG number sequence numbers, occurring when the numbers contain many zero bits, lead to correlation between the initial numbers produced by a PRNG for a query, and also between the initial values produced between queries.
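
A toy model of the worker-node behaviour described above; it models only the described mechanism and is not Redshift's actual implementation:

    # Each slice holds a slowly incrementing "non-PRNG" value in [0.0, 1.0);
    # a query consumes that value as its PRNG seed, and the seed itself is
    # the first "random" number the query sees on that slice.
    import random

    class SliceRandom:
        def __init__(self, start=0.0, increment=1e-6):
            self.current = start
            self.increment = increment

        def run_query(self, count):
            seed = self.current
            self.current = (self.current + self.increment) % 1.0
            rng = random.Random(seed)
            return [seed] + [rng.random() for _ in range(count - 1)]

    s = SliceRandom()
    print(s.run_query(3))   # the first value is the predictable seed
    print(s.run_query(3))   # the next query's first value differs only minutely
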
Maximum Number of Concurrent Queries 50 pages
There is a limit of 30 concurrent queries per queue, regardless of the number of slots configured for a queue. When any queue has reached the limit of 30 concurrent queries, the superuser queue is no longer able to execute queries, but other user queues are unaffected.
Query Memory Allocation Behaviour with Strings 8 pages
An open question has been whether, with varchar and char, Redshift when moving rows to memory, either during data load (COPY) or query processing, allocates the full length of the string as specified in the DDL, or only the actual length of the string. Four query steps were investigated (aggr, hash, sort and unique); for all of them, varchar does not allocate its full DDL length, but rather only the length of the actual string, whereas char does allocate the full DDL length, regardless of the length of the actual string. It seems not entirely unreasonable, although certainly it is not proven, to imagine these findings may be generally true for query processing. No findings were made for behaviour during data load.
Auto-Vacuum 340 pages
Auto-vacuum is a black box and so it is impossible to devise a test method which is known to be fair. Given this limitation, the test method used found that extremely light server loads, one trivial query every thirty seconds on an otherwise idle cluster, appeared enough to completely forestall auto-vacuum. This implies clusters with 24/7 load never experience auto-vacuum. Longer intervals, of 40, 50 and 60 seconds, allowed auto-vacuum to run five times during 15 minutes of test runs. Of the auto-vacuums which were seen, only delete vacuums were observed; sort vacuums were never observed.
Minimum and Maximum Values by Data Type 27 pages
The official Redshift documentation indicates minimum and maximum values for data types. These values contain factual errors, typographic errors, and errors of omission (some data types have no values given). The actual minimum and maximum values are presented. During investigation, it also became clear that the mechanisms used to connect to Redshift - psycopg2, psql, ODBC, etc - all seem to be performing significant data processing, and at times get it wrong, leading to behaviour such as the silent modification of inserted values and incorrect values being presented from SELECT. Finally, there appears to be a bug in Redshift’s handling of very small and very large float8 values, such that rather than INSERT failing, the values are inserted but are converted to -/+ infinity.
Materialized Views 45 pages
Materialized views are implemented as a normal table, a normal view and a procedure, all created by the CREATE MATERIALIZED VIEW command, where the procedure is called by the REFRESH MATERIALIZED VIEW command and performs the refresh. The table is created by CREATE TABLE AS, which is why column encodings cannot be specified. The encoding choices made by Redshift are extremely poor. A full refresh makes a new table, populates it, and uses table rename to replace the existing table. An incremental refresh uses an insert followed by a delete, using the system columns deletexid and insertxid to keep track of which rows have changed, and as such runs a full refresh when any of the tables used by the materialized view have been vacuumed, either manually or automatically, as vacuum resets the values in the deletexid and insertxid columns and so invalidates the book-keeping information held by the materialized view, this information being stored in extra columns in the materialized view, one, plus one for every table used in the materialized view SQL. The table underlying the materialized view is never vacuumed, except by auto-vacuum, which I suspect runs so infrequently as to be inconsequential. Auto-refresh is an undocumented black box, likely subject to ongoing unannounced change, and its behaviour is unpredictable. On a small single column table on an idle cluster, refresh occurred normally after about 55 seconds; with one row inserted per second, refresh occurred after between 54 and 1295 seconds (about twenty-one minutes).
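
A sketch of the commands involved, as described above (connection parameters and object names are hypothetical; auto refresh is requested at creation time):

    # Create a materialized view (behind the scenes: a table, a view and a
    # refresh procedure), then refresh it manually.  Names are hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="secret")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("""
        CREATE MATERIALIZED VIEW mv_daily_sales
        AUTO REFRESH YES
        AS SELECT sale_date, sum(amount) AS total_amount
           FROM   sales
           GROUP  BY sale_date;
    """)

    # Whether this is an incremental or a full refresh depends on the
    # book-keeping described above (e.g. whether sales has been vacuumed).
    cur.execute("REFRESH MATERIALIZED VIEW mv_daily_sales;")
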
The Commit Queue 536 pages
There is a single commit queue, which executes a single commit at a time. With the small clusters tested, the time taken for a commit is in linear proportion to the number of blocks being committed. A commit runs in parallel over nodes and slices. The largest commit tested, with 200 blocks per slice, took on the slowest node type about 7.5 seconds. The number of blocks already present in a table makes no difference to the time needed to commit. Node types vary greatly in their commit performance, but the number of nodes in a cluster, over the range of small clusters tested, has no impact on commit performance. Unlike other node types, the commit performance of ds2.xlarge nodes looks to vary greatly by individual node in the cluster. Commits run as slowly as the slowest node in a cluster.
Cross-Database Queries 17 pages
Cross-database queries are not implemented by improvements in the use of the system tables such that Redshift can now access tables across databases, but rather by bringing over, from the remote database to the local database, a copy of the table being queried, leading to a duplicate of the table in every database issuing cross-database queries to that table. Where the remote table is not on local disk, the first query is slow, as it must wait for a local copy of the table to be made. Similarly, when updates are made to the remote table, the next query after the updates is slow, as it must bring the updates over to the local copy of the table.
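
A sketch of the three-part notation involved; per the findings above, the first such query against a table not yet copied locally will be slow (connection parameters and names are hypothetical):

    # Query a table in another database on the same cluster using
    # database.schema.table notation.  Names are hypothetical.
    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="local_db", user="admin", password="secret")
    cur = conn.cursor()

    cur.execute("SELECT count(*) FROM remote_db.public.sales;")
    print(cur.fetchone())
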
Effect of Unused Columns on Query Performance 18 pages
Redshift is column-store and as such, in principle, queries are unaffected by the unused columns in the tables being queried. In practice, accessing any single column is unaffected by the number of columns but, tentatively, it looks like accessing multiple columns shows that the more columns are present between a column and the final column in the table, the slower it is to access that column, and so the more columns are present in a table, the slower access becomes for all columns. The slowdown, even when there are 1600 columns, is very small for dc2.large and ra3.xlplus, but is much larger for ds2.xlarge. However, for normally sized tables, up to say 100 columns, even on ds2.xlarge the slowdown is very small.
Zone Map Sorting By Data Type 49 pages
Redshift provides functionality to sort rows in a table according to the values in the table columns. Sorting is however not according to the entire values in the columns, but according only to an eight byte signed integer, known as the sorting value, which is derived from each value, where the method used to derive the sorting value varies by data type. This document describes in detail the methods used to derive sorting values. About half of the data types sort as would be intuitively expected, but half do not, and system designs which assume all data types to sort as intuitively expected are not functioning as their designers expect.

