-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DISTKEY not supported #99
Comments
If it helps, the reason we haven't run into this ourselves is because we use sqlalchemy-redshift + alembic + sqlalchemy tables/models to model our databases. Alembic (with sqlalchemy-redshift) will produce the distkey/sortkeys if you put them in sqlalchemy's table_args (on a model), but pytest-mock-resources will directly use the models to create the tables and ignores them. If that's not an acceptable workflow, the problem with this request is that you're executing raw SQL. While I'm not sure that i.e. a With raw SQL, my impression is that sqlalchemy just sends your text verbatim, so there's much less we can do in a 100% correct way. We could, attempt somewhat hacky regex checks through the queries to parse out distkey/sortkey and either translate or remove them, but i wouldn't expect it to be particularly robust (for example if DISTKEY resided in some string) |
The performance is not gained on the CREATE TABLE statement, but can be huge on a big cluster if it ensures that subsequent JOINs never leaves a particular node. My reason for trying this library was that it seems that no one are able to mock redshift :) The project Im working on actually still uses Pentaho PDI with JDBC to drive most of the ETL, even if almost all functional code has left PDI, so unfortunately I can't use anything else than raw SQL and wanted to avoid the hacky regex solution. I didnt look into how you had implemented this before I tried to use it. So, thanks for a quick reply, you can choose to close this as out of scope or similar. Maybe add a comment in the documentation that you dont support raw sql DISTKEY and SORTKEY where it is stated that you do support UNLOAD. |
To be clear, I'm not necessarily against limited parsing the SQL statement, especially because it could likely be scoped to i.e. CREATE TABLE statements. And fyi, this is how we support COPY/UNLOAD (through a combination of specific sqlalchemy-redshift handling and manual parsing of COPY/UNLOAD statements and altering the query at runtime). So ultimately, any support we add for raw sql statements for this is going to be doing the "hacky regex solution" (to essentially just remove them from the query string), which is honestly probably fine for most cases. But as you've already noticed we don't support it today. If you were so inclined, I'd be happy to direct you to the relevant locations for adding support yourself, else we could certainly add support, but it wouldn't be as quick as my issue responses 😛. |
Describe the bug
One of the main changes in redshift vs postgres is how index/keys are used. More specifically, one can specify a DISTKEY and SORTKEY for a table like
CREATE TABLE myschema.mytable (
col1 INT
,col2 VARCHAR(20)
)
DISTKEY(col1)
SORTKEY(col1, col2);
This can improve performance by helping redshift to keep joins on a single node, skip partitions etc
For my tests, I don't care about the performance, but I need the production SQL to look like this, and I want my tests to test actual code, not a similar but not the same variation.
It would make this mock library more complete and should probably be handled like UNLOAD etc. NB: I am a newbie using this library, so have haven't checked how this is handled.
Environment
Host OS
Using Windows, setting PYTEST_MOCK_RESOURCES_HOST=localhost (somehow my networking failed for host.docker.internal, but it could be Docker Desktop or Corporate Security)
Docker image if applicable
Postgres with default params as provided by pytest version 1.4.1
Python Version
Python 3.8.5 64-bit
Virtualenv/Pyenv etc.. if applicable
virtualenv, installing pytest-mock-resources with postgres binaries and redshift
(Also in requirements.txt before trying pytest-mock-resources
SQLAlchemy
psycopg2
sqlalchemy-redshift
pandas
To Reproduce
Steps to reproduce the behavior:
redshift = create_redshift_fixture()
def test_redshift_with_distkey(redshift):
with redshift.connect() as conn:
conn.execute('create table table1 ( col1 int, col2 varchar(20)) distkey(col1) sortkey(col1, col2);')
Error is logged and test is aborted
(psycopg2.errors.SyntaxError) syntax error at or near "DISTKEY"
Expected behavior
Since DISTKEY and SORTKEY probably arent relevant in test cases, ignore / remove this part of SQL before executing in docker postgres
Actual Behavior
Described above - can't disclose much more anyways.
Additional context
Running simpler SQL works in my current setup
Seems related to #82
The text was updated successfully, but these errors were encountered: