February 18, 2025

What is the Datafold Migration Agent, and how does it work?

Discover how the Datafold Migration Agent (DMA) automates code translation and data validation, cutting migration timelines by up to 6x while ensuring data quality.

Kira Furuichi

What is the Datafold Migration Agent, and how does it work?

The data world has largely accepted that migrations are destined to be painful. For decades, we've resigned ourselves to months or years of manual code translation, endless validation cycles, and astronomical consulting fees. It's been seen as an unavoidable cost of modernization—just the way things are done.

The Datafold Migration Agent (DMA) challenges this status quo. By combining advanced AI, sophisticated LLMs, and precise data diffing technology, we've created something that many thought impossible: a way to automate the most painful parts of migrations while maintaining perfect data integrity; DMA takes the two most manual, taxing, and time-intensive parts of a migration—code translation and data validation—and automates them at previously inconceivable speeds.

This article is an under the hood look into how DMA works, and how it accelerates migration timelines without forfeiting data quality.

Why use automation and AI for data migrations

Before I jump into the technical details (I know, I’m keeping you on your toes!), it’s important to address why Datafold is tackling data migrations with AI and automation.

Migrations are time-consuming, manual, and resource-intensive projects: There are many steps to a typical migration: auditing assets to migrate, moving the data, translating code, validating data across systems, earning stakeholder sign-off, and more. Each of these steps requires data teams to spend considerable internal (and often external) resources to working on repetitive, manual work. And as we all know, the more manual the work, the more time-consuming, error-prone, and expensive it is.
Modernization is vital, but comes at the cost of near-term innovation: With every migration, there is a cost-opportunity tradeoff; with your team focused on a migration, innovation is often pushed to the side. When data teams can automate the most laborious parts of a migration, it creates space and time for them to focus on higher leverage work.
AI and LLMs have caught up with technological needs: 2023 and 2024 were tremendous for technological advances in AI and LLMs, and everyone (data teams, SaaS companies like ourselves) was figuring out what this means for our day-to-days. Technology is now sophisticated enough (and will continue to improve) to automate critical workflows in data engineers’ work.

The simultaneous convergence of technology advancing rapidly with the growing acute pain of migrations is why we at Datafold believe that automation, AI, and the Datafold Migration Agent, have never been more vital to evolving the way teams plan and execute migrations with speed and quality.

The technical deep-dive on DMA

The Datafold Migration Agent, which I’ll be addressing as DMA from this point on, consists of three core parts:

The DMA Source Aligner
The Feedback Loop: Translate, Diff, Repeat
Delivery of Code with Data Diff Validation

Together, these three parts of the DMA experience automate the most laborious parts of a migration, and accelerate timelines by up to 6x the speed without sacrificing on data quality.

The DMA Source Aligner: "All things considered equal"

Before DMA translates and validates your new SQL code, Datafold ensures that the inputs into your translation pipeline match exactly. To do this, we create a frozen version of the input datasets in both the legacy and new system, and ensure they match exactly—no mismatched values, no missing or added rows.

We call this the DMA Source Aligner.

The DMA Source Aligner enables inputs across legacy and new databases to be the same. By holding the inputs constant, we ensure that the data diff—a value-level comparison of tables across databases—validation is not compromised by noise from inconsistent environments.

With aligned sources, we check the work of DMA by automatically running cross-database data diff across your entire production dataset—without any manual effort from your team.

With this process, DMA supports true end-to-end validation, ensuring all data source inputs to transformed outputs match exactly between legacy and new databases.

The Feedback Loop: Translate, diff, repeat

In the core feedback loop of DMA, legacy code is translated to the new SQL dialect or framework (e.g., dbt), tested for accuracy, and fine-tunes itself until parity between legacy output matches the new output produced by DMA.

LLM code translation

In the first step of the feedback loop of DMA, DMA’s LLM takes the legacy code you’re migrating from (e.g., stored procedures, XML/GUI-based business logic) and translates each model to the SQL dialect of your new warehouse or framework.

For example, if your team is migrating from stored procedures in SQL Server to a dbt project built in BigQuery, DMA would map stored procedures 1-1 with new dbt models that adhere to dbt syntax, templating, formatting and best practices and use BigQuery’s SQL syntax.

This is the stage of DMA where the agent sifts through a range of possible challenging issues that can trip up a migration due to conflicts between legacy and new databases, including:

Different default collation
Different handling of special (non-ASCII) characters (irrespective of collation)
Different handling of upper/lower casing (irrespective of collation)
Conversion of special non-ANSI code syntax that is used by a downstream BI tool (think, curly brackets containing dynamic variables)
Identification and resolution of queries with non-deterministic outputs (e.g., unordered concatenation)

Automated data validation: The "data diff"

Once the legacy code has been translated, the input data provided by the DMA Source Aligner is input into the new translated code. DMA will then automatically run a data diff of the outputs across the legacy code and the new code. The data diff will identify any data differences (e.g., columns added or removed, missing rows, incorrect values) between these code outputs across legacy and new databases.

Fine-tune until parity is met

If there are any differences between the output of the legacy code and the new code output by DMA, DMA’s model will automatically fine-tune the SQL translation and repeat the process until parity is achieved and validated with data diff..

Delivery of code with data diff validation

Once DMA is confident in the code that it produces (e.g., the new output code produces parity between the output in the legacy database), DMA will automatically provide two types of reports:

A git repository or PR of translated and validated code: This contains all the new code DMA translated and validated, so teams can merge in new code to their existing repository or use a new one to contain their data transformations.
Links to auditable data diffs in Datafold: There’s no reason to trust a migration tool unless it can show you proof it’s working; DMA outputs value-level table comparisons (data diffs) of the full dataset automatically as its final validation step. With this, you have evidence that the translation is successful. DMA customers can visually see for themselves that parity has been met; more importantly, data diffs provide concrete evidence data teams can provide to stakeholders to achieve migration sign-off.

A data diff between a table in Oracle and Snowflake

DMA as the standard for migration acceleration

The Datafold Migration Agent is the only tool to both automate code conversion and validate that the data migration is complete because both inputs and outputs match across systems. DMA continues to stand out with its:

True automation: DMA drastically reduces the amount of manual work done in a migration by automating code translation and value-level validation.
Workflow migration: DMA does not have to just output standard SQL during its translation process; it can translate GUI-based transformation workflows and also adhere SQL to code-based engines like dbt or Coalesce.io.
Integrations: DMA integrates with a variety of legacy databases including, but not limited to, SQL Server, SAP Hana, MySQL, Netezza, Oracle, Teradata, and more.
Security: Datafold offers multiple deployment types of meet the security needs of your team.

DMA is pushing the boundaries of how data teams can approach and execute migrations to ultimately allow them to modernize faster and focus on higher-leverage work.

If you’d like to learn more about how DMA can accelerate your migration, please book time with our team.