Feat(experimental): DBT project conversion #4495
Conversation
# extract {{ var() }} references used in all jinja macro dependencies to check for any variables specific
# to a migrated DBT package and resolve them accordingly
# vars are added into __sqlmesh_vars__ in the Python env so that the native SQLMesh var() function can resolve them
if migrated_dbt_project_name:
Can this be encapsulated into its own function?
I've moved the logic to _extract_migrated_dbt_variable_references
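For illustration, here's a minimal sketch of the general technique (a hypothetical helper, not the PR's actual `_extract_migrated_dbt_variable_references`): parse the macro body with Jinja and collect the names passed to `var()` calls.

```python
# A minimal sketch (hypothetical helper, not the PR's implementation) of collecting
# the variable names referenced via {{ var(...) }} in a Jinja macro body.
import typing as t

from jinja2 import Environment, nodes


def extract_var_names(macro_source: str) -> t.Set[str]:
    tree = Environment().parse(macro_source)
    names: t.Set[str] = set()
    for call in tree.find_all(nodes.Call):
        # Only direct calls to a bare `var` name with a constant first argument.
        if (
            isinstance(call.node, nodes.Name)
            and call.node.name == "var"
            and call.args
            and isinstance(call.args[0], nodes.Const)
        ):
            names.add(call.args[0].value)
    return names


print(extract_var_names("{{ var('start_date') }} {{ var('env', 'dev') }}"))
# -> {'start_date', 'env'} (order may vary)
```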
return v.transform(d.replace_merge_table_aliases)

@field_validator("batch_concurrency", mode="before")
Why is this needed? There's already a validator for this field
Oh, are you able to point me to the existing validator? I'm having trouble locating it
I added this because if you called model.render_definition() and it wrote something like:
MODEL (
  name db.table,
  kind INCREMENTAL_BY_UNIQUE_KEY (
    unique_key a,
    batch_concurrency 1
  )
);
And then you tried to read it back, you get a Pydantic error:
E pydantic_core._pydantic_core.ValidationError: 1 validation error for IncrementalByUniqueKeyKind
E batch_concurrency
E Input should be 1 [type=literal_error, input_value=Literal(this=1, is_string=False), input_type=Literal]
I added a test in test_model.py to show this behaviour. It fails because the validator wants an int but gets given an exp.Literal instead.
The validator is defined through the type annotation:
https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/utils/pydantic.py#L342C5-L342C23 which in turn calls https://github.com/TobikoData/sqlmesh/blob/main/sqlmesh/utils/pydantic.py#L204
Ah, that's true for other model kinds but not IncrementalByUniqueKeyKind, where it was defined as:
batch_concurrency: t.Literal[1] = 1
which doesn't invoke the validator.
However, it looks like changing it to the following does the trick:
batch_concurrency: t.Annotated[t.Literal[1], BeforeValidator(positive_int_validator)] = 1
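A minimal, self-contained sketch of why the annotation matters (not the SQLMesh source; `coerce_literal_to_int` is a hypothetical stand-in for `positive_int_validator`): a plain `t.Literal[1]` field rejects anything that isn't already the int 1, while a `BeforeValidator` can first coerce a SQLGlot literal expression back to an int.

```python
# Sketch only: demonstrates the Annotated + BeforeValidator pattern discussed above.
import typing as t

from pydantic import BaseModel, BeforeValidator
from sqlglot import exp


def coerce_literal_to_int(v: t.Any) -> t.Any:
    # Hypothetical stand-in for SQLMesh's positive_int_validator.
    if isinstance(v, exp.Literal) and not v.is_string:
        return int(v.this)
    return v


class Kind(BaseModel):
    batch_concurrency: t.Annotated[t.Literal[1], BeforeValidator(coerce_literal_to_int)] = 1


# Round-trips the parsed expression instead of raising a literal_error:
print(Kind(batch_concurrency=exp.Literal(this="1", is_string=False)).batch_concurrency)  # 1
```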
)

class DbtConversionConsole(TerminalConsole):
Does this need to inherit TerminalConsole?
I was using it for things like _print and _confirm. How would you prefer this was structured?
Those are just one-liners, and I’m not sure they’re worth coupling over
No problem, I've made this class standalone now
yield prev, curr

class JinjaGenerator:
Just curious: any reason to have this class? It doesn't look like the methods benefit from the shared self instance in any way. Should these just be top-level functions?
I did think about that but having the class made it easy to do the following:
generator_fn_name = f"_generate_{node_type.__name__.lower()}"
if generator_fn := getattr(self, generator_fn_name, None):
    generator_fn(...)
i.e. it gives a handle (self) to call getattr on, and limits the functions to just those available on this class
Just curious: can't you replace self with __module__? No objections otherwise
Probably, but it does mean that it's easier to accidentally call something unrelated that is defined in the module but not part of the generator.
I guess I'm just thinking in terms of self-containment. I could probably add a generate() top-level module function and make the JinjaGenerator class private to make this API more ergonomic, but I do see it evolving a bit as this feature evolves
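To make the trade-off concrete, here's a minimal self-contained sketch of the dispatch pattern under discussion (the node type and generator method are illustrative, not the PR's code): the class namespace scopes the getattr() lookup to generator methods only.

```python
# A minimal sketch (hypothetical, not the PR's JinjaGenerator) of getattr-based
# dispatch keyed on the Jinja AST node type name.
from jinja2 import nodes


class _Generator:
    def generate(self, node: nodes.Node) -> str:
        generator_fn_name = f"_generate_{type(node).__name__.lower()}"
        if generator_fn := getattr(self, generator_fn_name, None):
            return generator_fn(node)
        raise NotImplementedError(f"No generator for {type(node).__name__}")

    def _generate_const(self, node: nodes.Const) -> str:
        # Constants render as their literal representation.
        return repr(node.value)


print(_Generator().generate(nodes.Const(42)))  # 42
```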
@property
def migrated_dbt_project_name(self) -> t.Optional[str]:
    return self.config.variables.get(c.MIGRATED_DBT_PROJECT_NAME)
Since we know from the config that this is a migrated project, can we put all of the logic below into a separate loader that inherits from SqlMeshLoader?
I've moved the logic to a MigratedDbtProjectLoader.
I couldn't see a good way to hook into the _load_scripts lifecycle of the parent SqlMeshLoader to change the builtins module and infer / set the dbt package name for macros under __dbt_packages__/, so I ended up duplicating a bunch of the existing code.
Not super happy with this, but modifying the parent SqlMeshLoader to take callbacks to set these things would be just as invasive as the original changes
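For what it's worth, a sketch of the overridable-hook alternative (every name below is invented for illustration; the real SqlMeshLoader's internals differ):

```python
# A hypothetical sketch of the template-method alternative: the parent loader
# exposes small overridable seams instead of taking callbacks.
import typing as t


class BaseLoaderSketch:
    def load_scripts(self) -> None:
        builtins_module = self._builtins_module()
        for path in self._script_paths():
            package = self._infer_package(path)  # hook: None for native projects
            print(f"loading {path} (builtins={builtins_module}, package={package})")

    def _builtins_module(self) -> str:
        return "sqlmesh.core.macros"  # assumption: the native default

    def _infer_package(self, path: str) -> t.Optional[str]:
        return None

    def _script_paths(self) -> t.List[str]:
        return ["macros/helpers.py", "macros/__dbt_packages__/dbt_utils/helpers.py"]


class MigratedDbtLoaderSketch(BaseLoaderSketch):
    def _builtins_module(self) -> str:
        return "sqlmesh.dbt.builtin"  # assumption: where the dbt shims live

    def _infer_package(self, path: str) -> t.Optional[str]:
        # e.g. macros/__dbt_packages__/dbt_utils/... -> "dbt_utils"
        parts = path.split("/")
        if "__dbt_packages__" in parts:
            idx = parts.index("__dbt_packages__")
            return parts[idx + 1] if len(parts) > idx + 1 else None
        return None


MigratedDbtLoaderSketch().load_scripts()
```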
if self.is_migrated_dbt_project:
    from sqlmesh.dbt.target import TARGET_TYPE_TO_CONFIG_CLASS

    connection_config = self.context._connection_config
Why not use the equivalent public method?
    )
except NotImplementedError:
    raise ConfigError(
        f"No DBT 'Target Type' mapping for connection type: {connection_config.type_}"
| f"No DBT 'Target Type' mapping for connection type: {connection_config.type_}" | |
| f"Unsupported dbt target type: {connection_config.type_}" |
if migrated_dbt_project_name:
    # note: JinjaMacroRegistry is trimmed here so "all_macros" should just be all the macros used by this model
    for _, _, jinja_macro in jinja_macros.all_macros:
        _, extracted_variable_names = extract_macro_references_and_variables(
Previously we were doing this unconditionally, but now it appears that we only extract if this is a migrated dbt project? Why is that?
It's an oversight although surprisingly no tests failed. I'll revert it
I've added it to the else condition
| for prefix in ("start", "end", "execution"): | ||
| yield f"{prefix}_{suffix}" | ||
|
|
||
| for item in ("runtime_stage", "gateway", "this_model", "this_env", "model_kind_name"): |
There's no way we'll remember to update this once we add more macro variables
Yeah I added this when I was in "hack whatever it takes to get the thing to work" mode.
Is there somewhere that can be referenced to get a definitive list?
Good question. I don't think we have any centralized place atm.
if self.is_migrated_dbt_project:
    if path.is_relative_to(migrated_dbt_package_base_path):
        package = str(
            path.relative_to(migrated_dbt_package_base_path).parents[0]
Don't we have infer_dbt_package_from_path for this?
We sure do, and it was written after this. I'll update it
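For reference, a minimal sketch of what such a helper boils down to (assumed behaviour; the real infer_dbt_package_from_path may differ in signature and details): the first directory under macros/__dbt_packages__/ names the package.

```python
# Sketch only: path-based package inference, assuming the layout described
# elsewhere in this PR (macros/__dbt_packages__/<package>/...).
import typing as t
from pathlib import Path


def infer_dbt_package_from_path(
    path: Path, base: Path = Path("macros/__dbt_packages__")
) -> t.Optional[str]:
    if path.is_relative_to(base):
        relative = path.relative_to(base)
        if len(relative.parts) > 1:
            return relative.parts[0]
    return None


print(infer_dbt_package_from_path(Path("macros/__dbt_packages__/dbt_utils/helpers.sql")))
# -> dbt_utils
```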
ctx, dbt_project = _load_project(src)
dbt_load_context = dbt_project.context
dbt_project.load
Absolutely nothing, I'll remove it
    ctx.create_external_models()
else:
    console.log_warning(
        "More than 1 config detected; not sure how to handle this. External models file not created"
Not sure this is a helpful message. What can the user do to mitigate this?
It's more to trigger a bug report. I didn't have an example of how someone could get into this situation but the data structure makes it possible.
If someone did hit this then it would be helpful to understand how so we can add a test
I've removed the external model creation entirely
if output_external_models:
    # External models
    if len(ctx.configs) == 1:
How can it be more than 1?
ctx.configs is defined as t.Dict[Path, Config], so based on that there can be any number.
However, having said that, I don't think this check matters. I'll remove it
test_rendered = _render(model.query)
if is_self_referencing(model, test_rendered):
    report.self_referencing_models.append((model_output_path, model))
    # if it's self referencing, we need to output the columns or we get an error like: Error: Self-referencing models require inferrable column types.
Can we still convert them as INCREMENTAL_UNMANAGED and keep the is_incremental() macro?
Some are legitimately self-referencing; is_self_referencing also returns True if model.depends_on_self.
For the others that only become self-referencing after removing is_incremental(), I guess we can test for that and not apply the transform, which will result in them being output with the is_incremental() checks intact
I've updated this to test-load the converted model.
If it throws the "Self-referencing models require inferrable column types" error, I re-run the conversion but exclude the transforms that remove is_incremental().
This works, although on the 1000+ model test project it has the downside that context.upsert_model() busts the fingerprint cache, so it spends a bunch of extra time re-rendering queries for the dependency tree
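In rough Python, the flow described above looks something like this (all names are placeholders, not the PR's API):

```python
# A rough sketch (placeholder names) of the test-load-and-fall-back flow.
import typing as t

COLUMN_TYPES_ERROR = "Self-referencing models require inferrable column types"


def convert_with_fallback(
    model: t.Any,
    convert: t.Callable[[t.Any, bool], t.Any],  # (model, remove_is_incremental) -> converted
    test_load: t.Callable[[t.Any], None],  # e.g. upsert into a Context and render
) -> t.Any:
    converted = convert(model, True)  # first try with is_incremental() removed
    try:
        test_load(converted)
    except Exception as e:
        if COLUMN_TYPES_ERROR in str(e):
            # keep the is_incremental() guards so the model isn't self-referencing
            converted = convert(model, False)
        else:
            raise
    return converted
```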
orig_depends_on = META_FIELD_CONVERTER["depends_on_"]

# The 'depends_on' key always gets rendered but the value is missing if it's an empty list
I'm confused why we need this. And even if we do, why can't we update the original converter?
We need this because if we don't have it, the following can occur when calling model.render_definition():
MODEL (
  ...,
  depends_on ,
  ...
)
Which is a syntax error when you try to read it back.
Initially I was trying to wrap things rather than change them. I'll update the original converter instead
I've updated the original converter
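A minimal sketch of the shape of that fix (assumed, since the actual converter isn't shown in this thread): only emit an expression when the dependency set is non-empty, so render_definition() round-trips cleanly.

```python
# A hedged sketch of the converter fix (the real META_FIELD_CONVERTER entry may
# differ): return None for an empty set so no bare "depends_on ," is rendered.
import typing as t

from sqlglot import exp


def depends_on_to_expr(depends_on: t.Set[str]) -> t.Optional[exp.Expression]:
    if not depends_on:
        return None  # omit the property entirely instead of rendering an empty value
    return exp.Tuple(expressions=[exp.to_table(name) for name in sorted(depends_on)])


print(depends_on_to_expr({"db.a", "db.b"}).sql())  # (db.a, db.b)
print(depends_on_to_expr(set()))  # None
```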
Force-pushed from f59c316 to 4db71b5.
This PR contains an initial implementation of a command that can take a DBT project, read it into memory and write the result out as a native SQLMesh project.
To invoke it, use:
The way this works is that the project is first loaded into memory using our existing `DbtLoader`. The resulting models and macros are extracted from the context and their Jinja is parsed into an AST (using the Jinja library). A bunch of AST transforms are run to replace DBT-isms with SQLMesh native concepts as much as possible:
- `{{ ref() }}` and `{{ source() }}` calls are replaced with the actual model names they reference (where possible)
- `{% is_incremental() %}` blocks are removed

Jinja has no way of turning a Jinja AST back into a Jinja template string (since its goal is to generate Python code, not more Jinja), so I wrote a `JinjaGenerator` class to go from AST back to `str`.

If, after applying all the Jinja AST transforms, there is still some Jinja left in the result, it is written to the SQLMesh model file as a Jinja model surrounded with `JINJA_QUERY_BEGIN; JINJA_END;` blocks. If there is no Jinja left after applying the transforms, it is written directly as a native SQL model with no Jinja wrapping.
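As a rough illustration of that output decision (a hypothetical helper with a deliberately naive residue check, not the PR's approach):

```python
# Sketch only: wrap the body in Jinja blocks only if unconverted Jinja remains
# after the AST transforms; substring detection is a simplification.
def render_model_body(sql: str) -> str:
    has_jinja_left = "{{" in sql or "{%" in sql
    if has_jinja_left:
        return f"JINJA_QUERY_BEGIN;\n{sql}\nJINJA_END;"
    return sql


print(render_model_body("SELECT 1 AS col"))  # plain native SQL model body
print(render_model_body("SELECT {{ leftover_macro() }}"))  # wrapped in Jinja blocks
```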
Macros that call into DBT packages are also handled. The dependency tree is migrated and put in the target folder under `macros/__dbt_packages__` so that the macro hierarchy is still available when the macros are called on the SQLMesh side.

A migrated project isn't truly native until all the dbt-isms have been removed. If the native loader detects it is loading a migrated project, it injects DBT shims into the Jinja context to make the migrated macros still work. The bulk of the DBT shim code is re-used from our existing DBT loader.
Known limitations:
- Only connection types with a `from_sqlmesh()` implementation in the relevant dbt `TargetConfig` class are supported
- The Jinja AST does not record whether a `{%-` or `{%` block was used. This can lead to less than ideal formatting once the AST transforms are run and the resulting AST is turned back into a Jinja string.
- `{{ source() }}` calls on the DBT side that used dynamic inputs and were aliased in the DBT config are not correctly migrated