
Feat(experimental): DBT project conversion #4495

Merged
erindru merged 1 commit into main from erin/dbt-convert on Jun 16, 2025

Conversation

@erindru (Collaborator) commented May 22, 2025

This PR contains an initial implementation of a command that can take a DBT project, read it into memory, and write the result out as a native SQLMesh project.

To invoke it, use:

$ sqlmesh dbt convert -i <input_path> -o <output_path>

The way this works is that the project is first loaded into memory using our existing DbtLoader.

  • This means that the existing SQLMesh mappings from DBT model types -> SQLMesh model types are utilized
  • It also means that our existing DBT shims in the Jinja context can be utilized

The resulting models and macros are extracted from the context and their Jinja is parsed into an AST (using the Jinja library). A bunch of AST transforms are run to replace DBT-isms with SQLMesh native concepts as much as possible:

  • {{ ref() }} and {{ source() }} calls are replaced with the actual model names they reference (where possible)
  • {% is_incremental() %} blocks are removed
  • ...etc
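As a minimal sketch of what the first transform operates on, here is how `{{ ref() }}` calls can be located in a parsed template using jinja2's public AST API (collection only; resolving them to actual SQLMesh model names, as the converter does, is omitted):

```python
# Hedged sketch: find every {{ ref('...') }} call in a dbt model's Jinja.
# The real converter goes further and splices the resolved model name
# back into the template; here we only collect the references.
from jinja2 import Environment, nodes

ast = Environment().parse("select * from {{ ref('stg_orders') }}")

refs = [
    call.args[0].value
    for call in ast.find_all(nodes.Call)
    if isinstance(call.node, nodes.Name) and call.node.name == "ref"
]
print(refs)  # ['stg_orders']
```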

Jinja has no way of turning a Jinja AST back into a Jinja template string (since its goal is to generate Python code, not more Jinja), so I wrote a JinjaGenerator class to go from AST back to str.
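To illustrate the idea (a simplified sketch, not the actual implementation — node coverage and dispatch details here are minimal):

```python
# Illustrative sketch only — the PR's JinjaGenerator covers far more node
# types. jinja2 can parse a template into an AST but has no built-in way
# to turn that AST back into template source, so generation is hand-rolled.
from jinja2 import Environment

class MiniJinjaGenerator:
    def generate(self, node) -> str:
        # Dispatch on node type name, mirroring the getattr-based approach
        # described in the review discussion.
        fn = getattr(self, f"_generate_{type(node).__name__.lower()}", None)
        if fn is None:
            raise NotImplementedError(f"Unhandled node: {type(node).__name__}")
        return fn(node)

    def _generate_template(self, node) -> str:
        return "".join(self.generate(child) for child in node.body)

    def _generate_output(self, node) -> str:
        return "".join(self.generate(child) for child in node.nodes)

    def _generate_templatedata(self, node) -> str:
        return node.data  # literal text between Jinja tags

    def _generate_name(self, node) -> str:
        return "{{ " + node.name + " }}"  # re-wrap a bare variable reference

ast = Environment().parse("select * from {{ table_name }}")
print(MiniJinjaGenerator().generate(ast))  # select * from {{ table_name }}
```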

If, after applying all the Jinja AST transforms, there is still some Jinja left in the result, it is written to the SQLMesh model file as a Jinja model wrapped in JINJA_QUERY_BEGIN; / JINJA_END; blocks.

If there is no Jinja left after applying the transforms, it is written directly as a native SQL model with no Jinja wrapping.
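For example, a model that still contains Jinja after the transforms might be emitted like this (illustrative only; model and column names are hypothetical):

```sql
MODEL (
  name db.customers,
  kind FULL
);

JINJA_QUERY_BEGIN;
SELECT * FROM db.stg_customers
{% if some_remaining_flag %}WHERE active = 1{% endif %}
JINJA_END;
```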

Macros that call into DBT packages are also handled. The dependency tree is migrated and put in the target folder under macros/__dbt_packages__ so that the macro hierarchy is still available when the macros are called on the SQLMesh side.
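The migrated layout might look something like this (package names hypothetical):

```
project/
  macros/
    __dbt_packages__/
      dbt_utils/
        ...
      my_dbt_package/
        ...
```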

A migrated project isn't truly native until all the dbt-isms have been removed. If the native loader detects it is loading a migrated project, it injects DBT shims into the Jinja context to make the migrated macros still work. The bulk of the DBT shim code is re-used from our existing DBT loader.

Known limitations:

  • The source DBT project must be loadable by the SQLMesh DBT loader. It doesn't need to be runnable, just loadable.
  • Currently this only works for BigQuery and DuckDB, as these were the initial focus. Other DB types can be added by implementing from_sqlmesh() in the relevant dbt TargetConfig class.
  • Jinja handles whitespace stripping at parse time, so the AST has no idea if e.g. a {%- or {% block was used. This can lead to less-than-ideal formatting once the AST transforms are run and the resulting AST is turned back into a Jinja string.
  • Only the Jinja constructs that were present in the test projects are handled by the Jinja generator; some more esoteric AST nodes are currently not handled.
  • Audits are not generalized by our DBT loader, so each model currently gets its own version as an inline audit.
  • {{ source() }} calls on the DBT side that used dynamic inputs and were aliased in the DBT config are not correctly migrated.
  • Pre/post hooks and statements are not currently handled.

@erindru erindru force-pushed the erin/dbt-convert branch from 0f4ddc6 to 738b2c7 Compare May 22, 2025 01:39
@erindru erindru marked this pull request as ready for review May 22, 2025 02:46
Comment thread sqlmesh/cli/main.py Outdated
# extract {{ var() }} references used in all jinja macro dependencies to check for any variables specific
# to a migrated DBT package and resolve them accordingly
# vars are added into __sqlmesh_vars__ in the Python env so that the native SQLMesh var() function can resolve them
if migrated_dbt_project_name:
Collaborator:

Can this be encapsulated into its own function?

Collaborator Author:

I've moved the logic to _extract_migrated_dbt_variable_references

Comment thread sqlmesh/core/model/kind.py Outdated

return v.transform(d.replace_merge_table_aliases)

@field_validator("batch_concurrency", mode="before")
Collaborator:

Why is this needed? There's already a validator for this field

Collaborator Author:

Oh, are you able to point me to the existing validator? I'm having trouble locating it

I added this because if you called model.render_definition() and it wrote something like:

        MODEL (
            name db.table,
            kind INCREMENTAL_BY_UNIQUE_KEY (
                unique_key a,
                batch_concurrency 1
            )
        );

And then you tried to read it back, you get a Pydantic error:

E           pydantic_core._pydantic_core.ValidationError: 1 validation error for IncrementalByUniqueKeyKind
E           batch_concurrency
E             Input should be 1 [type=literal_error, input_value=Literal(this=1, is_string=False), input_type=Literal]

I added a test in test_model.py to show this behaviour. It fails because it wants an int but it gets given an exp.Literal instead

Collaborator (@izeigerman, Jun 11, 2025):

Collaborator Author:

Ah, that's true for other model kinds but not IncrementalByUniqueKeyKind where it was defined as:

batch_concurrency: t.Literal[1] = 1

which doesn't invoke the validator.

However it looks like changing it to the following does the trick:

batch_concurrency: t.Annotated[t.Literal[1], BeforeValidator(positive_int_validator)] = 1
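The pattern above can be sketched as follows (a hedged stand-in: the real positive_int_validator and exp.Literal come from SQLMesh and sqlglot, so a fake literal class is used here for illustration):

```python
# Sketch: a BeforeValidator attached via t.Annotated runs even for a
# Literal[1] field, unwrapping a parsed SQL literal (simulated with a
# stand-in class) into a plain int before the Literal check fires.
import typing as t
from pydantic import BaseModel, BeforeValidator, ValidationError

class FakeSqlLiteral:
    # Stand-in for sqlglot's exp.Literal, which stores its value in `.this`
    def __init__(self, this):
        self.this = this

def positive_int_validator(v: t.Any) -> int:
    if isinstance(v, FakeSqlLiteral):
        v = v.this
    return int(v)

class IncrementalByUniqueKeyKind(BaseModel):
    batch_concurrency: t.Annotated[
        t.Literal[1], BeforeValidator(positive_int_validator)
    ] = 1

# The Literal check now sees 1 instead of a literal-expression object:
print(IncrementalByUniqueKeyKind(batch_concurrency=FakeSqlLiteral("1")).batch_concurrency)  # 1
```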

Comment thread sqlmesh/dbt/converter/console.py Outdated
)


class DbtConversionConsole(TerminalConsole):
Collaborator:

Does this need to inherit TerminalConsole?

Collaborator Author:

I was using it for things like _print and _confirm. How would you prefer this was structured?

Collaborator:

Those are just one-liners, and I’m not sure they’re worth coupling over

Collaborator Author:

No problem, I've made this class standalone now

yield prev, curr


class JinjaGenerator:
Collaborator:

Just curious: any reason to have this class? It doesn't look like the methods benefit from the shared self instance in any way. Should these just be top-level functions?

Collaborator Author:

I did think about that but having the class made it easy to do the following:

generator_fn_name = f"_generate_{node_type.__name__.lower()}"

if generator_fn := getattr(self, generator_fn_name, None):
    generator_fn(...)

i.e. it gives a self handle to call getattr on and limits the lookup to just the functions available on this class

Collaborator:

Just curious: can't you replace self with __module__? No objections otherwise

Collaborator Author:

Probably, but it does mean that it's easier to accidentally call something unrelated that is defined in the module but not part of the generator.

I'm just thinking in terms of self-containment. I could probably add a generate() top-level module function and make the JinjaGenerator class private to make this API more ergonomic, but I do see it evolving a bit as this feature evolves

Comment thread sqlmesh/core/loader.py Outdated

@property
def migrated_dbt_project_name(self) -> t.Optional[str]:
return self.config.variables.get(c.MIGRATED_DBT_PROJECT_NAME)
Collaborator:

Since we know from the config that this a migrated project, can we put all of the logic below into a separate loader that inherits from SqlMeshLoader?

Collaborator Author:

I've moved the logic to a MigratedDbtProjectLoader.

I couldn't see a good way to hook into the _load_scripts lifecycle of the parent SqlMeshLoader to change the builtins module and infer / set the dbt package name for macros under __dbt_packages__/, so I ended up duplicating a bunch of the existing code.

Not super happy with this, but modifying the parent SqlMeshLoader to take callbacks to set these things would be just as invasive as the original changes

Comment thread sqlmesh/core/loader.py Outdated
if self.is_migrated_dbt_project:
from sqlmesh.dbt.target import TARGET_TYPE_TO_CONFIG_CLASS

connection_config = self.context._connection_config
Collaborator:

Why not use the equivalent public method?

Collaborator Author:

I've updated this

Comment thread sqlmesh/core/loader.py Outdated
)
except NotImplementedError:
raise ConfigError(
f"No DBT 'Target Type' mapping for connection type: {connection_config.type_}"
Collaborator:

Suggested change
f"No DBT 'Target Type' mapping for connection type: {connection_config.type_}"
f"Unsupported dbt target type: {connection_config.type_}"

Comment thread sqlmesh/core/model/definition.py Outdated
if migrated_dbt_project_name:
# note: JinjaMacroRegistry is trimmed here so "all_macros" should just be all the macros used by this model
for _, _, jinja_macro in jinja_macros.all_macros:
_, extracted_variable_names = extract_macro_references_and_variables(
Collaborator:

Previously we were doing this unconditionally, but now it appears that we only extract if this is a migrated dbt project? Why is that?

Collaborator Author:

It's an oversight, although surprisingly no tests failed. I'll revert it

Collaborator Author:

I've added it to the else condition

for prefix in ("start", "end", "execution"):
yield f"{prefix}_{suffix}"

for item in ("runtime_stage", "gateway", "this_model", "this_env", "model_kind_name"):
Collaborator:

There's no way we'll remember to update this once we add more macro variables

Collaborator Author:

Yeah I added this when I was in "hack whatever it takes to get the thing to work" mode.

Is there somewhere that can be referenced to get a definitive list?

Collaborator:

Good question. I don't think we have any centralized place atm.

Comment thread sqlmesh/core/loader.py Outdated
if self.is_migrated_dbt_project:
if path.is_relative_to(migrated_dbt_package_base_path):
package = str(
path.relative_to(migrated_dbt_package_base_path).parents[0]
Collaborator:

Don't we have infer_dbt_package_from_path for this?

Collaborator Author:

We sure do, and it was written after this. I'll update it
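A minimal reconstruction of what such a helper could look like (hypothetical signature; the real infer_dbt_package_from_path lives in the converter and may differ):

```python
# Hypothetical sketch: macros under macros/__dbt_packages__/<package>/...
# belong to <package>; anything else is a plain project macro.
import typing as t
from pathlib import Path

MIGRATED_PACKAGE_BASE = Path("macros/__dbt_packages__")

def infer_dbt_package_from_path(path: Path) -> t.Optional[str]:
    if path.is_relative_to(MIGRATED_PACKAGE_BASE):
        return path.relative_to(MIGRATED_PACKAGE_BASE).parts[0]
    return None

print(infer_dbt_package_from_path(Path("macros/__dbt_packages__/dbt_utils/surrogate_key.sql")))
# dbt_utils
```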

Comment thread sqlmesh/dbt/converter/convert.py Outdated

ctx, dbt_project = _load_project(src)
dbt_load_context = dbt_project.context
dbt_project.load
Collaborator:

What does this do?

Collaborator Author:

Absolutely nothing, I'll remove it

Comment thread sqlmesh/dbt/converter/convert.py Outdated
ctx.create_external_models()
else:
console.log_warning(
"More than 1 config detected; not sure how to handle this. External models file not created"
Collaborator:

Not sure this is a helpful message. What can user do to mitigate this?

Collaborator Author:

It's more to trigger a bug report. I didn't have an example of how someone could get into this situation but the data structure makes it possible.

If someone did hit this then it would be helpful to understand how so we can add a test

Collaborator Author:

I've removed the external model creation entirely

Comment thread sqlmesh/dbt/converter/convert.py Outdated

if output_external_models:
# External models
if len(ctx.configs) == 1:
Collaborator:

How can it be more than 1?

Collaborator Author:

ctx.configs is defined as t.Dict[Path, Config] so based on that there can be any number.

However, having said that, I don't think this check matters; I'll remove it

Comment thread sqlmesh/dbt/converter/convert.py Outdated
Comment thread sqlmesh/dbt/converter/convert.py Outdated
test_rendered = _render(model.query)
if is_self_referencing(model, test_rendered):
report.self_referencing_models.append((model_output_path, model))
# if it's self referencing, we need to output the columns or we get an error like: Error: Self-referencing models require inferrable column types.
Collaborator:

Can we still convert them as INCREMENTAL_UNMANAGED and keep the is_incremental() macro?

Collaborator Author (@erindru, Jun 4, 2025):

Some are legitimately self-referencing; is_self_referencing also returns True if model.depends_on_self.

For the others that only become self-referencing after removing is_incremental(), I guess we can test for that and not apply the transform, which will result in them being output with the is_incremental() checks intact

Collaborator Author:

I've updated this to test-load the converted model.

If it throws the "Self-referencing models require inferrable column types" error, I re-run the conversion but exclude the transforms that remove is_incremental().

This does work, although on the 1000+ model test project it has the downside that context.upsert_model() busts the fingerprint cache, so it spends a bunch of extra time re-rendering queries in the dependency tree

Comment thread sqlmesh/dbt/converter/convert.py Outdated

orig_depends_on = META_FIELD_CONVERTER["depends_on_"]

# The 'depends_on' key always gets rendered but the value is missing if it's an empty list
Collaborator:

I'm confused why we need this. And even if we do why can't we update the original converter

Collaborator Author (@erindru, Jun 4, 2025):

We need this because if we don't have it, the following can occur when calling model.render_definition():

MODEL (
   ...,
   depends_on ,
   ...
)

Which is a syntax error when you try to read it back.

Initially I was trying to wrap things rather than change them. I'll update the original converter instead

Collaborator Author:

I've updated the original converter

Comment thread sqlmesh/dbt/converter/convert.py
Comment thread sqlmesh/dbt/converter/convert.py Outdated
Comment thread sqlmesh/dbt/converter/jinja_builtins.py Outdated
Comment thread sqlmesh/dbt/converter/jinja_builtins.py Outdated
Comment thread sqlmesh/dbt/converter/sqlglot_transforms.py Outdated
@erindru erindru force-pushed the erin/dbt-convert branch from 738b2c7 to f59c316 Compare June 5, 2025 04:09
@erindru erindru requested a review from izeigerman June 5, 2025 04:30
@erindru erindru force-pushed the erin/dbt-convert branch from f59c316 to 4db71b5 Compare June 12, 2025 06:02
@erindru erindru merged commit 976ffee into main Jun 16, 2025
26 checks passed
@erindru erindru deleted the erin/dbt-convert branch June 16, 2025 21:46
erindru added a commit that referenced this pull request Aug 28, 2025