A pattern for smooth and live microservice migrations

Nicholas Suter · YounitedTech · Feb 27, 2020

Back in early 2016, we decided here at Younited to transition to a microservice architecture. We have learned a great deal since. And one thing we learned is how to replace a microservice with a new one without downtime, and without requiring all callers to be updated at the same time.

A bit of context

You have a piece of software that provides a service to other parts of a larger system. For instance, a service that tells you if a customer on your website is a returning customer. This piece of software can be a feature of an existing monolith or a stand-alone component. Whatever. And this component is coming up to the end of its life, for any reason, be it technical or functional. So basically, you want to build a new piece of software and move all the calls made to the old piece of software to this shiny new one.

The way the calls are being made doesn't really matter. At Younited, our microservices interact over REST API calls or through a message broker. But the problem and the solutions remain the same if the calls are made in memory inside the same runtime, or even by shifting data files around (please stop doing that).

In this example, let's say you want to replace a legacy microservice with a shiny new one. This microservice has 3 callers and its own database.

Brutal options

In some simple scenarios, Caller1, Caller2 and Caller3 don’t share a common state and can just move progressively from Legacy Api to VNext Api. Pretty straightforward, but nothing very interesting to write about. In our case, we want to ensure that at any point in time, the data is consistent. This means that we can’t find ourselves in a position where Caller1 writes data through Legacy Api, and Caller2 tries to read this data using VNext Api. Hmm.

One solution would be to imagine some synchronization mechanism. But whoever has already worked on developing and running a bidirectional synchronization mechanism would probably rather swallow a bag of rusty nails than go down that path again.

The most common practice we see goes this way:

Step 1: turn off all callers

Yup, that means service interruption

Step 2: Migrate data

Shift your data around

This can be rather tricky depending on how different the data models are. They might also be using different technologies, like moving from a relational database to a NoSQL database.

Step 2': redirect all calls from Legacy Api to VNext Api

Move your calls

This step can also be quite tricky if the public surface of VNext Api is very different from Legacy Api's. But be careful! Do not shape your new API according to your old API just to make the migration easier. Rethink it as if you were building it from scratch, and benefit from the domain knowledge you accumulated building and running Legacy Api. Hopefully, you'll spend far more time running VNext Api than migrating to it, so don't make short-term decisions. What you can do, though, is invest in readable documentation to ease the migration and give other teams a hand.

Steps 2 and 2' can be done in any order, or in parallel, since we shut down all reading and writing operations by turning the callers off.

Step 3: turning callers back on

Power on

At this point, the migration is over from the caller perspective. They're on target, and you should now be monitoring VNext Api closely and making sure all is running well.

Step 4: garbage day

This is the moment we live for, really. Throw the bloody stuff away. Maybe not on the day after step 3, in case you have to roll back in an emergency. But pretty soon after. Burn everything. Code, infrastructure resources, everything. The only thing you can keep a trace of is the code in your source control system. The rest is history.

Brutal?

Well yeah, this is a brutal system. First, you shut down part of your system by turning off all the callers for the whole time it takes to complete steps 1 to 3. Second, you migrate all your callers in one go, making the migration a critical operation.

This doesn't mean it's a bad way to proceed. Actually, it's the cheapest one if you're OK with the 2 points I've just raised. Sadly, that isn't always the case. Let's now dig into an alternative.

The Smooth and Live Migration Pattern (c)

Step 1: build the proxy

Extend VNext Api

The first step is to build the Legacy Api proxy inside VNext Api. The goal of this temporary feature is to be able to call VNext Api, which will internally call Legacy Api. Keep in mind that this proxy has a cost, which can be pretty high. But what it buys you is the ability to move the callers progressively. This can be done caller by caller, or even by rerouting one caller's calls progressively with a canary release mechanism.

Even though the public surfaces of Legacy Api and VNext Api are different, you absolutely must be able to translate VNext's input models into Legacy's, and Legacy Api's output models into VNext's. Example:

Legacy Api exposes an IsAReturningCustomer endpoint that takes a LegacyCustomer parameter and returns a boolean. Our shiny new API exposes a similar endpoint, but it is called GetCustomerType. It takes a VNextCustomer parameter and returns a CustomerStatusCode that can be "Unknown" or "Customer".
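To make the example concrete, here is roughly what these models could look like, sketched in TypeScript. The exact shapes are illustrative assumptions; only the field names used below (Email, FirstName, LastName, DateOfBirth, Age) come from this example.

```typescript
// Illustrative model shapes only; the real ones are of course richer.
interface LegacyCustomer {
  email: string;
  firstName: string;
  lastName: string;
  age: number;          // Legacy Api stores a precomputed age
}

interface VNextCustomer {
  email: string;
  firstName: string;
  lastName: string;
  dateOfBirth: string;  // ISO date; VNext stores the birth date instead of the age
}

// Legacy's IsAReturningCustomer returns a boolean; VNext's GetCustomerType returns this status code.
type CustomerStatusCode = "Unknown" | "Customer";
```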

We need to ensure 2 things:

  1. You can map inputs. Meaning you can map the input instance of a VNextCustomer to a new instance of a LegacyCustomer: f(VNextApiInputModel) = LegacyApiInputModel
Don’t we all love mapping?

The mapping can be straightforward, as for Email, FirstName and LastName. But it can require some computation, like deriving Age from the provided DateOfBirth.

  2. You can map outputs. Meaning you can map a boolean to a CustomerStatusCode: f(LegacyApiOutputModel) = VNextApiOutputModel

If you can ensure these 2 prerequisites, we can move on, and you'll soon see why: you've done the heavy lifting.
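Here is a minimal TypeScript sketch of the two mappings and the proxy endpoint, assuming the model shapes above and a hypothetical callLegacyApi helper for the HTTP call to Legacy Api:

```typescript
// Hypothetical HTTP helper for calling Legacy Api.
declare function callLegacyApi(
  endpoint: "IsAReturningCustomer",
  payload: LegacyCustomer
): Promise<boolean>;

// Input mapping: f(VNextApiInputModel) = LegacyApiInputModel
function toLegacyCustomer(customer: VNextCustomer): LegacyCustomer {
  return {
    email: customer.email,
    firstName: customer.firstName,
    lastName: customer.lastName,
    age: computeAge(new Date(customer.dateOfBirth)), // Age derived from DateOfBirth
  };
}

// Output mapping: f(LegacyApiOutputModel) = VNextApiOutputModel
function toCustomerStatusCode(isReturningCustomer: boolean): CustomerStatusCode {
  return isReturningCustomer ? "Customer" : "Unknown";
}

// The proxy: VNext's GetCustomerType endpoint internally calls Legacy's IsAReturningCustomer.
async function getCustomerType(customer: VNextCustomer): Promise<CustomerStatusCode> {
  const isReturning = await callLegacyApi("IsAReturningCustomer", toLegacyCustomer(customer));
  return toCustomerStatusCode(isReturning);
}

function computeAge(dateOfBirth: Date): number {
  const now = new Date();
  const age = now.getFullYear() - dateOfBirth.getFullYear();
  const birthdayPassed =
    now.getMonth() > dateOfBirth.getMonth() ||
    (now.getMonth() === dateOfBirth.getMonth() && now.getDate() >= dateOfBirth.getDate());
  return birthdayPassed ? age : age - 1;
}
```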

Step 2: rerouting the callers

Progressive migration

This is where you benefit from building the proxy. You can now reroute the callers one after the other from Legacy Api to VNext Api. What you write through one API can be read from the other, so there is no big bang. Move your callers one after the other; you can even canary release each caller. You are now in a position to test and consolidate VNext Api by routing the traffic to it progressively.

What we strongly recommend for this step is to monitor calls to both Legacy Api and VNext Api very closely. This step has a cost, since VNext Api is temporarily more complex than it will be at target. Also, keep in mind that the Legacy proxy should be designed to be easy to drop: it needs to be strictly isolated in VNext Api's code base. The comfort of migrating progressively shifts all the complexity of this migration onto the team building VNext Api, so this is not a place you want to stay in too long. Sketch out a migration calendar and push the callers to reroute their traffic to VNext Api.
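As an illustration of the canary release mechanism mentioned above, here is a minimal caller-side sketch that routes a configurable share of traffic to VNext Api. It reuses the model types and toLegacyCustomer mapping from the earlier sketches; vNextApiClient and legacyApiClient are hypothetical clients, and in practice the routing can also live in an API gateway or service mesh rather than in the caller's code.

```typescript
// Hypothetical clients for both APIs.
declare const legacyApiClient: {
  isAReturningCustomer(customer: LegacyCustomer): Promise<boolean>;
};
declare const vNextApiClient: {
  getCustomerType(customer: VNextCustomer): Promise<CustomerStatusCode>;
};

// Share of traffic sent to VNext Api; raised progressively from 0 to 100 during the canary phase.
const VNEXT_TRAFFIC_PERCENTAGE = 10;

async function isReturningCustomer(customer: VNextCustomer): Promise<boolean> {
  if (Math.random() * 100 < VNEXT_TRAFFIC_PERCENTAGE) {
    // New route: VNext Api, translating its status code back to the boolean this caller expects.
    return (await vNextApiClient.getCustomerType(customer)) === "Customer";
  }
  // Old route: Legacy Api, using the same input mapping as the proxy.
  return legacyApiClient.isAReturningCustomer(toLegacyCustomer(customer));
}
```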

Step 3: closing up the migration (from the caller point of view)

Callers have all been rerouted

At this point, all callers have been rerouted, and Legacy Api should only receive calls from VNext Api, effectively becoming its backend. This can be enforced by cutting authorizations for all callers except VNext Api. The important part is that the team building VNext Api no longer depends on other teams: the rest of the migration is now an internal project.

Step 4: moving the data

Data migration!

In this step, we execute 2 tasks:

  • migrating all of Legacy Api's data to VNext's data storage. This includes transformation, since the data storage models may be different. But the work done on the proxy should also help here, since it's the opposite operation, this time applied to storage models: f(LegacyApiStorageModel) = VNextApiStorageModel (a sketch follows this list)
  • rerouting the internal calls inside VNext Api from the proxy to the target.
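A one-shot version of the first task is essentially a loop over Legacy Api's records, applying the storage-model mapping on the way. Here is a sketch, with hypothetical legacyDb and vNextDb data-access helpers and illustrative record shapes:

```typescript
// Illustrative storage models on each side of the migration.
type LegacyCustomerRecord = { email: string; firstName: string; lastName: string; age: number; lastModified: Date };
type VNextCustomerRecord = { email: string; firstName: string; lastName: string; dateOfBirth?: string; lastModified: Date };

// Hypothetical data-access helpers.
declare const legacyDb: { streamAllCustomers(): AsyncIterable<LegacyCustomerRecord> };
declare const vNextDb: { upsertCustomer(record: VNextCustomerRecord): Promise<void> };

// Storage-model mapping: f(LegacyApiStorageModel) = VNextApiStorageModel
declare function toVNextStorageModel(record: LegacyCustomerRecord): VNextCustomerRecord;

// One-shot migration: stream every legacy record, map it, write it to VNext's database.
async function migrateAllCustomers(): Promise<void> {
  for await (const legacyRecord of legacyDb.streamAllCustomers()) {
    await vNextDb.upsertCustomer(toVNextStorageModel(legacyRecord));
  }
}
```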

This can be tricky, since you want to ensure data integrity. The simplest solution is a service interruption during which these 2 tasks are performed as a single operation. We rather prefer an intelligent migration, where VNext Api looks for its data in its internal database first, and falls back on the proxy.

A groundbreaking algorithm

This way, you can move the data progressively without service interruption. But you must ensure that each piece of data is moved quickly. And you can find yourself in a situation where data is in both VNext's database AND Legacy Api's database, but in different versions. In this case, you need to determine where the freshest data is, which can be as simple as comparing last modification dates.
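In TypeScript terms, the read path of this intelligent migration could look like the following simplified sketch, reusing the record types and toVNextStorageModel mapping from the previous sketch (vNextDb and legacyProxy are again hypothetical helpers):

```typescript
declare const vNextDb: { findCustomerByEmail(email: string): Promise<VNextCustomerRecord | undefined> };
declare const legacyProxy: { findCustomerByEmail(email: string): Promise<LegacyCustomerRecord | undefined> };

// Read path: VNext's own database first, Legacy Api (through the proxy) as a fallback.
// This simplified sketch queries both sources so it can pick the freshest version when both exist.
async function readCustomer(email: string): Promise<VNextCustomerRecord | undefined> {
  const local = await vNextDb.findCustomerByEmail(email);
  const legacy = await legacyProxy.findCustomerByEmail(email);

  if (local && legacy) {
    // The record exists on both sides, possibly in different versions: the freshest one wins.
    return local.lastModified >= legacy.lastModified ? local : toVNextStorageModel(legacy);
  }
  return local ?? (legacy ? toVNextStorageModel(legacy) : undefined);
}
```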

A more refined algorithm

Another option is to move the data on demand and in real time. This is an interesting option if moving pieces of data from Legacy Api's database to VNext Api's database is quick enough to guarantee sufficient performance from the caller's point of view.

Rocket science!

The cost of the migration is of course paid only by the first call. Please note that this operation must be atomic.
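Here is a sketch of this on-demand variant, where the first read of a record moves it into VNext's database. It uses the same hypothetical helpers and record types as above, plus a hypothetical markAsMigrated call on the legacy side:

```typescript
declare const vNextDb: {
  findCustomerByEmail(email: string): Promise<VNextCustomerRecord | undefined>;
  upsertCustomer(record: VNextCustomerRecord): Promise<void>;
};
declare const legacyProxy: {
  findCustomerByEmail(email: string): Promise<LegacyCustomerRecord | undefined>;
  markAsMigrated(email: string): Promise<void>; // hypothetical: flag or remove the record on the legacy side
};

// On-demand migration: the first read of a record pays the cost of moving it.
async function readCustomerOnDemand(email: string): Promise<VNextCustomerRecord | undefined> {
  const local = await vNextDb.findCustomerByEmail(email);
  if (local) return local; // already migrated: later calls never go back to Legacy Api

  const legacy = await legacyProxy.findCustomerByEmail(email);
  if (!legacy) return undefined;

  // The two writes below must be made atomic in a real implementation (transaction, idempotent retry...)
  // so that concurrent calls cannot observe a half-migrated record.
  const migrated = toVNextStorageModel(legacy);
  await vNextDb.upsertCustomer(migrated);
  await legacyProxy.markAsMigrated(email);
  return migrated;
}
```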

Here again, you must choose between a smooth but costly option and a simpler, cheaper one.

Step 5: garbage day

Time to burn stuff

No difference here from the brutal migration pattern. And you get the same fun out of it.

Fancy options

In the example above, we worked with a version of VNext Api that has features equivalent to Legacy Api's. But this pattern leaves room for a few options.

New features!

It is possible — and quite easy — to add new features to VNext Api. This can be an incentive to accelerate the migration to the target. For instance, let’s imagine that we want to enrich the customer data we store in our database. We want to add an ID and a MiddleName property. These properties will only be exposed through VNext Api. So now, VNext’s customer class looks like this:

VNextCustomer v2
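In TypeScript terms, the enriched model could be sketched as follows (only Id and MiddleName are new; the rest is unchanged from the earlier sketch):

```typescript
// VNextCustomer v2: the enriched model, exposed through VNext Api only.
interface VNextCustomerV2 {
  id: string;           // new: an identifier owned by VNext Api
  middleName?: string;  // new: optional, stored only in VNext's local database
  email: string;
  firstName: string;
  lastName: string;
  dateOfBirth: string;
}
```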

We don't want to make major changes to Legacy Api, and want to keep this a VNext Api feature only. What we recommend is to store the new ID and MiddleName properties in the VNext local database, and then call Legacy Api through the proxy:

This ensures backward compatibility for Legacy Api callers on reads and writes. And when retrieving data through VNext Api, we call both Legacy Api (through the proxy) and VNext's local database, and then merge the data to populate a complete instance of VNextCustomer.

A multiple datasource workflow

You’ll need a correlation key in order to merge the data. In this case, the email will do the trick.
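Here is a sketch of that merge, using the email as the correlation key. As in the earlier sketches, legacyProxy and vNextDb are hypothetical helpers, and we assume the proxy already maps the legacy data to the VNext base model.

```typescript
// The proxy returns the legacy data already mapped to the VNext base model.
declare const legacyProxy: {
  findCustomerByEmail(email: string): Promise<VNextCustomer | undefined>;
};
declare const vNextDb: {
  findCustomerExtrasByEmail(email: string): Promise<{ id: string; middleName?: string } | undefined>;
};

// Reads hit both sources and are merged on the correlation key (here, the email).
async function getCustomerV2(email: string): Promise<VNextCustomerV2 | undefined> {
  const base = await legacyProxy.findCustomerByEmail(email);     // historical fields, through the proxy
  if (!base) return undefined;

  const extras = await vNextDb.findCustomerExtrasByEmail(email); // new ID and MiddleName, stored locally
  if (!extras) return undefined; // not yet enriched; a real implementation could create the extras here

  return { ...base, ...extras }; // a complete instance of VNextCustomer v2
}
```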

This kind of scenario is temporary, until all data is moved to VNext Api's local database. And it's a lot easier to work with a flexible kind of database, such as a NoSQL document database. We like CosmosDb, but MongoDb-like databases should also work fine.

Feature shifting

A kind of operation useful in a microservice environment is to shift features around when adapting the high-level architecture of the system and the organization of the teams. In this example, we'll pull into VNext Api a feature that has lived in another microservice up to now.

Well, it works just the same as pulling features from Legacy Api into VNext Api:

Shifting a feature from Some other Api to VNext Api

Once Caller 3 and Caller 4 call VNext Api instead of Some other Api for this feature, you find yourself at Step 3 of the smooth pattern. The rest goes the same.

Wrapping up

We’ve found this pattern to be really useful in the past year. We’ve used it in several cases, including to merge a set of really critical microservices in our system into a new one. And we moved features from yet another microservice into this new one. The legacy microservices had been built in a context that had severely changed over the years. We needed to rebuild them from the ground up, but had 2 strong constraints:

  • these legacy microservices represent our loan application referential (our system of record for loan applications), and have 18 different callers. That's a lot. So stopping the service and performing a brutal migration was simply out of the question
  • we did not want to replicate the public API of the obsolete microservices simply to make the migration easier, because we would then have had to perform another complex migration later, moving all callers from a temporary set of REST endpoints to a more long-term public API. Or worse, we could have stopped there and been left with an unfinished project.

So we decided to go along with this migration pattern, which has proven to be actionable and useful. We invested heavily in the proxy, and making it perform well required a lot of effort. But we believe it has been worth it, because we would probably never have even started this project otherwise. We were able to start small by moving small callers and fine-tuning the new microservice. We're really happy with the public API of our new microservice. It's radically different from the old one's, but all the complex mapping from old to new data structures is encapsulated in the proxy; it is not a concern for the callers. What we did do was invest in documenting the migration to help other teams reroute their calls to the new microservice.

The counterpart of a long but incremental migration is that we have had to manage it like a long-term project: a lot of team synchronization and calendar management. Especially since the other parts of the system continue to evolve on their side too.

Yet, we were able to expose a new microservice with new features while remaining backward compatible. This led to some really complex parts in the proxy, which we heavily unit tested even though it is temporary code.

But it works!
