CI/CD for Databases

Thanks to modern DevOps practices, we have come to make liberal use of systems that integrate and deploy our code using containers, artifact managers, and the like, but little ink is spilled discussing how to manage our databases with those same practices. Many applications have a data layer, which often makes use of a relational database such as MySQL, MSSQL, PostgreSQL, or others. Fortunately, managing CI/CD for databases is not that hard; we simply need a recipe to follow. Let’s don our chef hats and get cooking.

Environments

First and foremost, we’ll need to set up our environments. At a minimum, we’ll need test and production environments, and perhaps also a dev environment, a validation environment – however many your process dictates. These environments may need to exist on separate servers, in separate data centers, or in separate regions – ultimately, you should have enough environments to exercise your most complex configuration. If you have only a development and production environment within a single cloud region for your North America deployments, but an additional QA region for your Eurozone datacenter, you’re best off normalizing the number of environments across every valid combination of regions.

Once these environments are set up, you can implement a proper replication strategy that reliably ports data from your production environment down the chain to your lowest environment on a regular schedule, most effectively daily. Don’t worry about losing work in lower environments – we’ll cover that shortly. At first, just get that data flowing on a regular basis.
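For illustration, here is a minimal sketch of such a refresh job in Python, assuming PostgreSQL with pg_dump, pg_restore, and psql available on the PATH; the hostnames (prod-db, qa-db, dev-db), database name, and scheduling are all hypothetical, and your own engine and tooling will differ.

    import subprocess

    # Hypothetical hosts and database name; adjust to your own topology
    LOWER_ENVIRONMENTS = ["qa-db", "dev-db"]
    DATABASE = "myschema_db"
    DUMP_FILE = "/tmp/prod_refresh.dump"

    def refresh_lower_environments():
        # Take a dump of production in custom format so pg_restore can recreate objects
        subprocess.run(
            ["pg_dump", "-h", "prod-db", "-d", DATABASE, "-Fc", "-f", DUMP_FILE],
            check=True,
        )
        # Restore that dump into each lower environment, dropping existing objects first
        for host in LOWER_ENVIRONMENTS:
            subprocess.run(
                ["pg_restore", "-h", host, "-d", DATABASE, "--clean", "--if-exists", DUMP_FILE],
                check=True,
            )

    if __name__ == "__main__":
        refresh_lower_environments()  # typically scheduled nightly via cron or a job manager

In practice you would layer in authentication, masking of sensitive data, and error reporting, but the shape of the job stays the same.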

If you’re working with an existing database, start from production and work downward. It’s not a safe assumption that your other environments are the same – oftentimes the ‘lower’ the environment, the more the machines and their specifications differ, which can negatively affect the choices developers make with regard to performance, space, or method.

Development

Ready, set….open your SQL editor….right? Not so fast! One of the hurdles in introducing CI/CD to the data pipeline is understanding the disruption developers bring to the process by directly editing table definitions, data, stored procedures, functions, etc. through their IDE in the development environment. That’s not to suggest using that SQL editor is off the table, but let’s think of it a different way – what if I (gasp!) turn OFF auto-commit? What if I took every command I executed, put it into a script, and was able to have that script executed instantly? The difference is dramatic – by lifting our scripts out of the editor and into a process, we’re introducing the opportunity for continuous integration. Develop SQL as scripts, not as commands that execute in our IDE. Now, what to do with those scripts….
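As a small sketch of that habit in Python, assuming PostgreSQL and the psql client, with a hypothetical script file and hostnames: the developer keeps the change in a plain .sql file and applies the whole file as one transaction, rather than committing statements one at a time from the editor.

    import subprocess

    # Hypothetical file holding the developer's SQL (e.g. an ALTER TABLE statement)
    SCRIPT = "add_foo_column.sql"

    # Run the whole file as a single transaction against the dev database.
    # ON_ERROR_STOP aborts (and rolls back) on the first failing statement,
    # which is the scripted equivalent of working with auto-commit turned off.
    subprocess.run(
        ["psql", "-h", "dev-db", "-d", "myschema_db",
         "-v", "ON_ERROR_STOP=1", "--single-transaction", "-f", SCRIPT],
        check=True,
    )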

Scripts as code

(Deep movie announcer voice) Imagine a world…where code repositories managed SQL….okay, not much of a selling point said like that.

Let’s give it another shot – let’s say we’ve written a nice “ALTER TABLE MYSCHEMA.FOO” script that adds a column to my FOO table. It’s working beautifully in my SQL editor, but like a good practitioner, I haven’t committed it to the database yet. I navigate to my ‘Databases’ repository, which has been set up with a /master branch and a /develop branch, locate the folder “Databases\MYSCHEMA”, and… [Choose your own adventure! If you employ a branching strategy, choose #1. If you use a trunk flow, choose #2!]

1. …create a branch off /develop. Within my branch, I find or create the masterScript.sql file, which is either empty or has other developers’ code in it, append my block of SQL, and check in. Create a pull request for your DBA to review and approve, which when merged to /develop calls a webhook that…

2. …find or create the masterScript.sql file, which is either empty or has other developers’ code in it, append my block of SQL, and check in to /develop. The check-in calls a webhook that…

…wait, a webhook? All modern repository hosts have the ability to call a pre-set URL when certain events occur. In our case, we’re going to have a webhook set to respond to all push events (assuming git here; it could be commit in svn, etc.) in the Databases repository. That webhook will call a URL that does a fresh checkout of the repository and, for each schema, runs the masterScript.sql file. There are many ways to accomplish this: Jenkins jobs, a job scheduler such as Automic, Control-M, or AutoSys…the list goes on. The important part is that the continuous deployment aspect of our pipeline is now being triggered by the continuous integration aspect. Nirvana!
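To make that concrete, here is a minimal sketch of the deployment job in Python, assuming git and psql on the PATH, a PostgreSQL target, and a repository laid out as one folder per schema, each containing a masterScript.sql; the repository URL, hostnames, and database name are hypothetical, and in practice this logic would live in your Jenkins job or scheduler of choice, triggered by the webhook.

    import pathlib
    import subprocess
    import tempfile

    REPO_URL = "git@example.com:ourteam/Databases.git"  # hypothetical repository URL
    TARGET_HOST = "dev-db"                              # environment this job deploys to
    DATABASE = "myschema_db"

    def deploy_master_scripts():
        with tempfile.TemporaryDirectory() as workdir:
            # Fresh checkout of the /develop branch of the Databases repository
            subprocess.run(["git", "clone", "--depth", "1", "--branch", "develop",
                            REPO_URL, workdir], check=True)
            # For each schema folder, run its masterScript.sql in a single transaction
            for script in sorted(pathlib.Path(workdir).glob("*/masterScript.sql")):
                subprocess.run(["psql", "-h", TARGET_HOST, "-d", DATABASE,
                                "-v", "ON_ERROR_STOP=1", "--single-transaction",
                                "-f", str(script)], check=True)

    if __name__ == "__main__":
        deploy_master_scripts()  # called by the webhook receiver on every push to /develop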

Back in batch

In the first step, we set up daily replication of data and structures from our production environment downward into our dev, QA, and other lower environments. Doesn’t that mean we’re going to overwrite all the great development work we’ve just done? Yes – unless we call that same continuous deployment job after the replication to each environment completes, by making it the last step of our replication process. Likewise, as a developer, if I change my mind about that table alteration, I can simply delete my portion of the script from masterScript.sql and the table will be restored to its previous production state the next morning. Things are really clicking now….
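Stitching the two earlier sketches together, the nightly batch might look like the following, assuming the earlier hypothetical functions were saved as refresh_job.py and deploy_job.py:

    # Assumes the earlier sketches live in refresh_job.py and deploy_job.py
    from refresh_job import refresh_lower_environments
    from deploy_job import deploy_master_scripts

    def nightly_refresh():
        # Copy production data and structures down to the lower environments...
        refresh_lower_environments()
        # ...then immediately replay the in-flight development scripts so they survive the overwrite
        deploy_master_scripts()

    if __name__ == "__main__":
        nightly_refresh()  # scheduled daily by cron or your job manager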

Getting to production

This cycle has worked reliably throughout all the integration testing our teams needed to do with our new column, and we’re ready for production. Very much like our /develop branch, the /master branch contains an *empty* masterScript.sql file. Why empty? Follow the flow: we create a pull request from /develop to /master which, if accepted, merges the now battle-tested development script into an empty script. Again using a webhook, our deployment engine then runs the script against the production environment and, when complete, clears the masterScript.sql file, commits, and pushes to /master. This step is intentionally automated, but if you favor manual execution of these scripts as a quality gate, you could simply not set the webhook for production and have the DBA perform the execution, clearing, and check-in. Likewise, you could vary the process to run in automated batch (as opposed to on demand) if that’s your particular preference.
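Here is a minimal sketch of that production step in Python, under the same hypothetical assumptions as the earlier deployment job (PostgreSQL, psql, git, one folder per schema); it runs each script against production, empties it, and pushes the cleared file back to /master.

    import pathlib
    import subprocess
    import tempfile

    REPO_URL = "git@example.com:ourteam/Databases.git"  # hypothetical repository URL
    PROD_HOST = "prod-db"
    DATABASE = "myschema_db"

    def deploy_to_production():
        with tempfile.TemporaryDirectory() as workdir:
            # Fresh checkout of the /master branch
            subprocess.run(["git", "clone", "--branch", "master", REPO_URL, workdir], check=True)
            scripts = sorted(pathlib.Path(workdir).glob("*/masterScript.sql"))
            # Run each schema's script against production in a single transaction
            for script in scripts:
                subprocess.run(["psql", "-h", PROD_HOST, "-d", DATABASE,
                                "-v", "ON_ERROR_STOP=1", "--single-transaction",
                                "-f", str(script)], check=True)
            # Clear the now-applied scripts, then commit and push the empty files to /master
            for script in scripts:
                script.write_text("")
            subprocess.run(["git", "-C", workdir, "add", "-A"], check=True)
            subprocess.run(["git", "-C", workdir, "commit", "-m",
                            "Applied to production; reset masterScript.sql"], check=True)
            subprocess.run(["git", "-C", workdir, "push", "origin", "master"], check=True)

    if __name__ == "__main__":
        deploy_to_production()  # triggered by the webhook on merges to /master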

There’s a last step here that’s critical – after any changes are made to production, any database object that was altered should have a ‘CREATE’ script generated for it and checked into the /master branch under its name (i.e. tables/tablename, procedures/procedurename). There are a number of ways to automate this process that are outside the scope of this article, but the point is to have an easy reference for answering “when did this table last change?” or “what’s changed about this stored procedure?” without having to wade through commit messages to find the difference.
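As one possible sketch, again assuming PostgreSQL, pg_dump, and hypothetical names, the DDL for a single table can be captured like this; procedures and other object types would need their own extraction, and the resulting files would be committed to /master just as before.

    import pathlib
    import subprocess

    PROD_HOST = "prod-db"
    DATABASE = "myschema_db"
    REPO_DIR = pathlib.Path("/path/to/Databases")  # local checkout of the /master branch

    def snapshot_table(schema: str, table: str):
        # Write the current CREATE statement (DDL only) for one production table
        out_file = REPO_DIR / schema / "tables" / f"{table}.sql"
        out_file.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["pg_dump", "-h", PROD_HOST, "-d", DATABASE,
                        "--schema-only", "-t", f"{schema}.{table}",
                        "-f", str(out_file)], check=True)

    if __name__ == "__main__":
        snapshot_table("myschema", "foo")  # then git add / commit / push to /master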

Lean flow – wrapping up

Many of the pitfalls of managing our databases at release time are deliberately addressed by this process. Have you ever stayed past regular hours to watch a DBA execute your script on a shared screen so you could validate the result? Ever noticed at the last minute that the production data creates an issue you never saw in development or testing? These are wasteful artifacts that run contrary to the principles of lean flow. There are, of course, other variations to be applied that are beyond the scope of this article, including how to manage sensitive data, encryption schemas, truncating large databases for size management, capacity testing, and many more – but these become bolt-on capabilities to the core process outlined above.

As always, Scaled Labs would be privileged to guide you on your Agile journey; simply send us a note at inquiry@scaledlabs.com so we can connect and get working. Until then, happy coding!
