Databricks hackathon/migrating to databricks page by Lsnaathorst1 · Pull Request #263 · dfe-analytical-services/analysts-guide

Lsnaathorst1 · 2026-06-01T13:01:26Z

Overview of changes

Adding in new Migrating to Databricks page to the Databricks and ADA section.

Why are these changes being made?

This is to help teams understand their options and the different approaches they can take when migrating, and to point teams to relevant guidance in other sections of the Analysts Guide (e.g., connecting to RStudio from Databricks or Databricks Notebooks).

Detailed description of changes

Adding in Migrating to Databricks page after discussion within SDT team. We decided to include:

Introduction explaining there are multiple approaches, each has pros and cons and it is down to the analyst to decide which is most suitable for their process.
Connecting to Databricks from RStudio approach with pros and cons
Coding in Databricks approach with pros and cons
Added 'What if I have both R and SSMS code in my current process?' as lots of teams will be in this position.

Databricks fundamentals

Removing the 'What this means for existing code' section from Databricks fundamentals and using some of this detail where relevant in the above.
Removing the diagram after a team discussion, as this is publication focused and doesn't show all options.

Issue ticket number/s and link

Resolves issue #184

Checklist before requesting a review

I have checked the contributing guidelines
I have checked for and linked any relevant issues that this may resolve
I have checked that these changes build locally
I understand that if merged into main, these changes will be publicly available

…ndex - Removing content from fundamentals page where this is now covered or redundant due to new page. Also includes removing the diagram after team discussion. - Moving workflows section up to still be covered in fundamentals but out of What Databricks means for existing code section

laragarbett

Thank you @Lsnaathorst1 for setting up the new page of recommendations! It's got a nice, clear structure and the two approaches are well laid out :)
I've added various suggestions for wording changes, as well as a couple of restructuring/layout changes. Do push back on any you disagree with!
It's also highlighted I should think about RStudio references in my own PR too.

Lsnaathorst1 · 2026-06-11T11:10:27Z

Hey @laragarbett, thank you for the detailed feedback! I think this is now back with you, with most changes implemented and just a few unresolved conversations to look at above.

laragarbett · 2026-06-12T15:00:16Z

Thank you @Lsnaathorst1 for addressing my comments - there are a couple of unresolved ones from my first review but they're only small things.

I then had another review of the page now that you resolved my first round of comments, as it's such an important one for us to get right. I did pick up on some more things I think we should adjust. I've added some more comments, mainly around structuring but a few more wording suggestions. :)

… a bit smaller than previously, so you can tell they are smaller than H4, and then reducing H4 headers in size for the same reason. Also adding in formatting lines for consistency with other pages

Lsnaathorst1 · 2026-06-15T13:05:09Z

Hey @laragarbett, hopefully all of those changes are now reflected and this is back with you for re-review :)

laragarbett · 2026-06-18T12:57:56Z

+
+------------------------------------------------------------------------
+
+## Guidance on SQL code


Sorry, I know I'm going back on a previously requested change, but seeing this section rendered, I feel it's actually a bit confusing: we first point people to Approach 2 for SQL-only code, then here we say that Approach 1 is good for complex SQL code, which we talk about in the last section as one of the "hybrid" options anyway.

I suggest we:

Delete the "Guidance on SQL code" heading and paragraph under it

Put the "Translating T-SQL..." heading and paragraph into a callout box, so it's not a section but more of a note which appears after we've explained the 2 approaches

In the Approach 2 section, under the 3 option bullet points, add a sentence saying "SQL Editor is recommended for short, ad hoc SQL queries. For longer or more complex SQL analysis, consider using notebooks."

laragarbett · 2026-06-18T13:07:19Z

+
+------------------------------------------------------------------------
+
+This approach is a useful short-term or transitional approach when you want to reuse existing SQL code with minimal changes. It keeps SQL and R closely linked by embedding SQL queries within an R workflow. Any SQL code would first need updating from T-SQL to Spark SQL, where it could then be passed via R code using wrapper functions to run in Databricks, whilst R controls execution.You can run SQL from R by creating a reusable wrapper function that uses a Databricks connection (e.g., via the DBI and odbc packages) and executes queries with a function like dbGetQuery()


We should put "DBI", "odbc" and "dbGetQuery()" in apostrophes here, and this para also needs a full stop at the end :)

laragarbett · 2026-06-18T13:10:21Z

+
+------------------------------------------------------------------------
+
+This approach is a useful short-term or transitional approach when you want to reuse existing SQL code with minimal changes. It keeps SQL and R closely linked by embedding SQL queries within an R workflow. Any SQL code would first need updating from T-SQL to Spark SQL, where it could then be passed via R code using wrapper functions to run in Databricks, whilst R controls execution.You can run SQL from R by creating a reusable wrapper function that uses a Databricks connection (e.g., via the DBI and odbc packages) and executes queries with a function like dbGetQuery()


It would be helpful to link to an example in the last sentence, perhaps to this section: https://dfe-analytical-services.github.io/analysts-guide/ADA/databricks_rstudio_sql_warehouse.html#pulling-data-into-rstudio-from-databricks

laragarbett · 2026-06-18T13:24:57Z

+
+This approach is most suitable where your code is written primarily in R and supports a quick and low-disruption migration. If you have an existing pipeline set up using RStudio / Positron / another IDE, there is no expectation that you must migrate your existing code or scripts into the Databricks platform (although there's no reason you shouldn't if you'd like to!).
+
+Code that reads or writes data from or to SSMS databases will need to be redirected to your Databricks catalog. To do this, you'll need to manually set up a connection to a Databricks compute. The best compute option for this is an SQL Warehouse. You can find more information about setting up a connection to an SQL Warehouse on our [set up Databricks SQL Warehouse with RStudio](../ADA/databricks_rstudio_sql_warehouse.html) page. 


Given we say this (or near enough this) at 4 different points on the page, I think it would be best to state it once at the bottom of the page and reference it with an asterisk or something at those 4 places.

So at the bottom of the page:
*For all processes that run SQL code or read from / write to the Databricks catalog from outside Databricks, you’ll need to manually set up a connection to a Databricks compute resource. The best compute option for this is an SQL Warehouse. You can find more information about setting up a connection to an SQL Warehouse on our Databricks SQL Warehouse with RStudio page.

And then here just say "Code that reads or writes data from or to SSMS databases will need to be redirected to your Databricks catalog*."

laragarbett · 2026-06-18T13:29:00Z

+
+------------------------------------------------------------------------
+
+This approach is suitable when your existing SQL code is complex, well-tested or often reused and you want to keep it as SQL code. It involves running the SQL code directly in Databricks to create intermediate or final tables, which are then written to the Databricks catalog. The SQL processing happens entirely in Databricks, after which you connect to the Databricks catalog from RStudio, Positron or another IDE to read in the created tables and continue the process with R code. 


If we have the note on connecting to Databricks with a SQL warehouse at the bottom of the page, then we'd put an asterisk here:
"and continue the process with R code.*"

and delete the "For all processes...." paragraph

laragarbett · 2026-06-18T13:29:35Z

+
+------------------------------------------------------------------------
+
+This approach is appropriate when your team primarily works in R and your SQL logic is not particularly complex or lengthy. It involves translating all existing SQL logic into R so that all your code is in the same language. The R code would then be run from RStudio / Positron / another IDE and would connect to the Databricks catalog to access the data as in Approach 1 above.


If we have the note on connecting to Databricks with a SQL warehouse at the bottom of the page, then we'd put an asterisk here:
"as in Approach 1 above.*"

and delete the "For all processes...." paragraph

laragarbett · 2026-06-18T13:30:59Z

+
+------------------------------------------------------------------------
+
+This approach is a useful short-term or transitional approach when you want to reuse existing SQL code with minimal changes. It keeps SQL and R closely linked by embedding SQL queries within an R workflow. Any SQL code would first need updating from T-SQL to Spark SQL, where it could then be passed via R code using wrapper functions to run in Databricks, whilst R controls execution.You can run SQL from R by creating a reusable wrapper function that uses a Databricks connection (e.g., via the DBI and odbc packages) and executes queries with a function like dbGetQuery()


If we have the note on connecting to Databricks with a SQL warehouse at the bottom of the page, then we'd put an asterisk here:
"...function like dbGetQuery().*"

and delete the "For all processes...." paragraph

laragarbett · 2026-06-18T13:32:07Z

+
+------------------------------------------------------------------------
+
+This approach is a useful short-term or transitional approach when you want to reuse existing SQL code with minimal changes. It keeps SQL and R closely linked by embedding SQL queries within an R workflow. Any SQL code would first need updating from T-SQL to Spark SQL, where it could then be passed via R code using wrapper functions to run in Databricks, whilst R controls execution.You can run SQL from R by creating a reusable wrapper function that uses a Databricks connection (e.g., via the DBI and odbc packages) and executes queries with a function like dbGetQuery()


Need to add space between "execution." and "You"

laragarbett · 2026-06-18T13:32:57Z

Almost there @Lsnaathorst1! Just a handful of comments now, most very small!

Lsnaathorst1 added 2 commits June 1, 2026 12:17

Adding in section on combined approach

fd24e62

Lsnaathorst1 requested review from cjrace, laragarbett, mzayeddfe and rmbielby as code owners June 1, 2026 13:01

Lsnaathorst1 marked this pull request as draft June 1, 2026 13:02

Lsnaathorst1 added 2 commits June 1, 2026 14:57

Amending to use call out collapsible boxes, as well as fixing spellin…

342243b

…g errors and removing incorrect formatting

Adding more detail to combined approaches

c7850d4

Lsnaathorst1 marked this pull request as ready for review June 1, 2026 15:03

laragarbett requested changes Jun 8, 2026

Updates based on PR feedback to improve wording and layout

6d6a502