How to improve implementation to be resilient to network failures – LaunchDarkly

Affected: All SDKs

Overview

The behavior of the LaunchDarkly SDK in its default configuration can be affected by many factors outside your control. Points of failure can include internal networks, CDNs, and cloud computing services.

In the case of an outage, all currently connected SDK instances will continue to operate as expected because the SDKs keep the last known flag data from LaunchDarkly cached locally. When new SDK instances are initialized, they will fail to connect to LaunchDarkly. The SDK will continuously attempt to connect, and until it can, it will serve the fallback value when variation is called (the fallback value is the optional last argument of the variation method).
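The behavior described above can be illustrated with a minimal sketch. This is not the LaunchDarkly API: the class FlagClientSketch and its receive_update method are hypothetical stand-ins, used only to show why a connected instance keeps serving cached values while a freshly started instance serves the fallback.

```python
from typing import Any, Dict, Optional


class FlagClientSketch:
    """Hypothetical stand-in for an SDK client (assumption, not the real SDK)."""

    def __init__(self) -> None:
        # None means no flag data has ever been received,
        # for example after a fresh start during an outage.
        self._cache: Optional[Dict[str, Any]] = None

    def receive_update(self, flags: Dict[str, Any]) -> None:
        # Keep the last known flag data in memory.
        self._cache = dict(flags)

    def variation(self, key: str, fallback: Any) -> Any:
        # Serve the fallback only when there is no cached value for the flag.
        if self._cache is None or key not in self._cache:
            return fallback
        return self._cache[key]


# An instance that connected before the outage still has cached data.
connected = FlagClientSketch()
connected.receive_update({"new-checkout": True})

# An instance started during the outage has received nothing yet.
fresh = FlagClientSketch()

connected_value = connected.variation("new-checkout", False)
fresh_value = fresh.variation("new-checkout", False)
```

In this sketch, connected_value comes from the cache and fresh_value is the fallback passed by the caller, which is why the fallback should always be a value your application can safely run with.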

Solution

By default, all SDKs cache the most recent flag state in memory, but you can also cache it locally on the device. A local cache lets a newly started SDK instance use this information without waiting for initialization to finish.

Do not block your application or service from starting up if LaunchDarkly fails to initialize. If the SDK fails to initialize on a fresh start, it serves the fallback values that you have defined in your code. For more information on this behavior for all SDKs, read: A Deeper Look at LaunchDarkly Architecture: More than Feature Flags
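One way to avoid blocking startup is to wait for initialization only up to a fixed time limit and then continue either way. The sketch below assumes a hypothetical connect callable that blocks until the SDK is ready. The function name start_with_bounded_wait and the timeout values are illustrative and not part of any SDK.

```python
import threading
import time
from typing import Callable


def start_with_bounded_wait(connect: Callable[[], None], timeout_seconds: float) -> bool:
    """Run `connect` in the background and wait at most `timeout_seconds`.

    Returns True if `connect` finished in time and False otherwise.
    The caller continues starting up in both cases.
    """
    ready = threading.Event()

    def worker() -> None:
        connect()  # may block for a long time during an outage
        ready.set()

    # A daemon thread does not keep the process alive if it never finishes.
    threading.Thread(target=worker, daemon=True).start()
    return ready.wait(timeout_seconds)


def slow_connect() -> None:
    # Simulates a connection attempt that takes longer than the time limit.
    time.sleep(1.0)


initialized = start_with_bounded_wait(slow_connect, timeout_seconds=0.1)
# The application keeps starting here. While `initialized` is False,
# flag evaluations would return the fallback values defined in code.
```

The connection attempt keeps running in the background, so the application picks up real flag values as soon as the connection succeeds.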

If you are using SDK methods that get all flags on the server side, you may not be able to predefine the fallback values in your code. For more information on how the all flags behavior works for your specific SDK(s), read: Getting all flags

Mobile SDKs

To handle the inherently unstable nature of mobile environments, all of our mobile SDKs cache the most recent flag state for the five most recent users locally on the device.
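The idea of keeping flag state for a fixed number of recent users can be sketched as a small bounded cache. This is an illustration only: the class name RecentUserFlagCache is hypothetical, the eviction order (oldest stored user first) is an assumption, and the sketch keeps data in memory whereas the mobile SDKs store it on the device.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional


class RecentUserFlagCache:
    """Keeps flag values for the most recently stored users (hypothetical sketch)."""

    def __init__(self, max_users: int = 5) -> None:
        self._max_users = max_users
        self._entries: "OrderedDict[str, Dict[str, Any]]" = OrderedDict()

    def store(self, user_key: str, flags: Dict[str, Any]) -> None:
        self._entries[user_key] = dict(flags)
        self._entries.move_to_end(user_key)  # mark this user as most recent
        while len(self._entries) > self._max_users:
            self._entries.popitem(last=False)  # evict the least recent user

    def load(self, user_key: str) -> Optional[Dict[str, Any]]:
        # Returns None when nothing is cached for this user.
        return self._entries.get(user_key)


cache = RecentUserFlagCache()
for i in range(1, 7):
    cache.store(f"user-{i}", {"flag": i})

evicted = cache.load("user-1")  # the sixth user pushed the first one out
kept = cache.load("user-6")
```

With six users stored and a limit of five, the first user has no cached flags left, so switching back to that user while offline would yield fallback values.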

For more information on this behavior on Android, read: Identifying and changing contexts
For more information on this behavior on iOS (Swift and Objective-C), read: Identifying and changing contexts

Client-side JavaScript-based SDKs

Client-side JavaScript-based SDKs can cache the most recent flag values in the browser's local storage. This is achieved by configuring the SDK with the bootstrap: "localStorage" option. The caveat is that if you change the user object to a new user, the SDK will serve the fallback value for all flags until it has finished fetching the flags for the new user. This is the optimal solution if you expect your users to:

Use your app offline frequently.
Load the app under poor/unpredictable network conditions.

You can also use this solution if you want to quickly use the SDK without waiting for it to initialize.

It is also possible to evaluate the user’s flags server-side and render them into the page you send to your users, then bootstrap your JavaScript-based SDKs with those values. This is optimal in any situation where you serve the page to the user over the internet. It also provides redundancy to all users, whereas bootstrapping from local storage only offers redundancy to returning users.
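As a rough sketch of the server-side half of this approach, the function below embeds an already evaluated set of flag values into a script tag. The function name render_bootstrap_script and the global window.bootstrapFlags are hypothetical names chosen for this example, and the flags dictionary stands in for whatever your server-side SDK's all flags method returns.

```python
import json
from typing import Any, Dict


def render_bootstrap_script(flags: Dict[str, Any]) -> str:
    """Return a script tag that exposes `flags` to client-side code."""
    payload = json.dumps(flags, sort_keys=True)
    # Escape "<" so that a flag value cannot close the script tag early.
    # JavaScript decodes the \u003c escape back to "<" inside strings.
    payload = payload.replace("<", "\\u003c")
    return f"<script>window.bootstrapFlags = {payload};</script>"


html = render_bootstrap_script({"banner-text": "</script>", "new-checkout": True})
```

The client-side code would then pass the embedded object to the SDK's bootstrap option instead of waiting for a network request. See the Bootstrapping documentation for the exact configuration in your SDK.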

For more information about bootstrapping, read: Bootstrapping

Server-side SDKs

To add redundancy to server-side SDKs, you have three different options.

Persistent Feature Store

One option is to configure your SDKs to use a persistent feature store. The SDK will update the cache when it receives an update, and it will read from the cache when trying to evaluate flags. This provides the following benefits:

The SDK can operate even if LaunchDarkly is unavailable.
You can evaluate flags without having to wait for the SDK to initialize its feature requester.

The caveat is that the SDK keeps flags cached in memory and only reaches out to the persistent store to refresh that in-memory cache when an entry expires. This means that SDK instances can be out of sync for up to X seconds, where X is the cache's TTL.
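The staleness window can be seen in a minimal read-through cache with a TTL. The class TtlReadThroughCache is a hypothetical illustration and not the SDK's implementation. It takes an injectable clock so the example can advance time without sleeping, and a plain dictionary stands in for the persistent store.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TtlReadThroughCache:
    """In-memory cache that re-reads the store only after an entry expires."""

    def __init__(
        self,
        read_from_store: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read_from_store = read_from_store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # not expired yet, even if the store has changed
        value = self._read_from_store(key)
        self._entries[key] = (now + self._ttl, value)
        return value


store = {"new-checkout": False}  # stands in for the persistent store
now = [0.0]                      # fake clock, in seconds
cache = TtlReadThroughCache(store.get, ttl_seconds=30.0, clock=lambda: now[0])

first = cache.get("new-checkout")  # read from the store and cached
store["new-checkout"] = True       # another process updates the store
now[0] = 10.0
stale = cache.get("new-checkout")  # still within the TTL
now[0] = 31.0
fresh = cache.get("new-checkout")  # TTL expired, so the store is read again
```

A shorter TTL narrows the window in which instances can disagree, at the cost of more reads against the persistent store.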

For more information about using a persistent feature store, read: Persistent data stores

Relay Proxy

A second option is to use the Relay Proxy. If you have a very large number of server-side connections, you may need to set up several Relay Proxy instances behind a load balancer. Because the Relay Proxy uses a LaunchDarkly SDK to receive updates from LaunchDarkly, it follows the same rules as any other SDK instance. If the Relay Proxy starts while LaunchDarkly is unavailable, it won’t be able to provide anything to the SDK instances that connect to it. If LaunchDarkly becomes unavailable while the Relay Proxy is running, it will still serve flags from its in-memory cache. Using the Relay Proxy offers the following benefits:

SDKs can initialize and operate normally while the Relay Proxy is running, even if LaunchDarkly is unavailable.
SDKs could potentially initialize faster if the Relay Proxy is “closer” than LaunchDarkly. This is usually the case if the Relay Proxy runs in the same infrastructure as your SDKs.
Network congestion inside your architecture is reduced, because each Relay Proxy instance needs only one outgoing streaming connection to LaunchDarkly.

Note that the cost of setting up the Relay Proxy may not be worth it, depending on your use case. To read more about the Relay Proxy, visit: The Relay Proxy

Daemon Mode

Another option is “daemon mode,” which combines both of the previous solutions. You set up a persistent feature store that provides redundancy if LaunchDarkly is down, and configure both the SDK and the Relay Proxy to use that store. Then you enable “daemon mode” on your SDKs, which causes them to disable their streaming connection entirely and rely on the Relay Proxy to populate the persistent store whenever it receives an update. This offers the following benefits:

All of the benefits of both of the above solutions.
Further reduced network congestion due to fewer write requests to the persistent store, and because SDKs do not need to open streaming connections.
The Relay Proxy can initialize when LaunchDarkly is unavailable because it can read from the persistent store.

This is likely the best solution based on the benefits listed above. To read more about daemon mode, visit: Persistent data stores.
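The data flow in daemon mode can be sketched as follows. All names here (RelaySketch, DaemonModeClientSketch, on_update_from_launchdarkly) are hypothetical, and a plain dictionary stands in for the persistent store. The point is only that the relay is the single writer and the SDK instances are read-only consumers with no streaming connection of their own.

```python
from typing import Any, Dict


class RelaySketch:
    """Stand-in for the Relay Proxy: the only component that writes to the store."""

    def __init__(self, store: Dict[str, Any]) -> None:
        self._store = store

    def on_update_from_launchdarkly(self, flags: Dict[str, Any]) -> None:
        # Persist each update so that readers see it on their next lookup.
        self._store.update(flags)


class DaemonModeClientSketch:
    """Stand-in for an SDK in daemon mode: reads the store and never connects out."""

    def __init__(self, store: Dict[str, Any]) -> None:
        self._store = store

    def variation(self, key: str, fallback: Any) -> Any:
        return self._store.get(key, fallback)


persistent_store: Dict[str, Any] = {}  # stands in for the shared persistent store
relay = RelaySketch(persistent_store)
relay.on_update_from_launchdarkly({"new-checkout": True})

# LaunchDarkly becomes unavailable here. A client started afterwards
# can still evaluate flags, because it only needs the store.
late_client = DaemonModeClientSketch(persistent_store)
value = late_client.variation("new-checkout", False)
missing = late_client.variation("unknown-flag", "fallback")
```

Because the store outlives both the relay and the clients, a restart of either one during an outage does not lose the last known flag data.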