eye tee: How does FlexNet/FlexLM triad high availability work, and should it be used?

What's more annoying than setting up a FlexNet license server? Setting up a FlexNet high availability group! In FlexNet terminology, this is called a triad configuration. I'm not sure why it wasn't called something more industry standard like a "cluster". Perhaps because triad sounds cooler.

Welcome to 1994, FlexNet user.

An architecture pattern for improving a system's uptime is to remove single points of failure (SPOFs). A good way of doing this involves reviewing the components of a system, mapping how they communicate, and understanding how single points of failure can be removed. Let's review the license check-out process.

FlexNet has a traditional client-server style architecture. Simply put, FlexNet-protected apps like AutoCAD are configured to contact a FlexNet license server.

Client contacts the FlexNet Server
The client (typically a laptop user in a north pole igloo) starts AutoCAD. AutoCAD looks in a few places (system environment variables, Windows registry, licensing files) for the hostname and port of the FlexNet license server. It will ask the FlexNet license server for the vendor daemon port.
Server responds with vendor daemon port
The FlexNet license manager replies to the client with the network port of the vendor daemon. Client responds with license check-out request The client responds with “AutoCAD license: give me 1.”
Vendor daemon checks license file
The vendor daemon determines whether any valid licenses are available.
Vendor daemon replies with yay/nay

In this architecture, it's clear that the license server is the single point of failure.

How does FlexNet/FlexLM over come this?

FlexNet/FlexLM uses a triad configuration which requires three servers: a primary, secondary and tertiary server. Of the primary and secondary servers, one of these may hold the master role. A master server is elected if a quorum (two out of three servers) is available. The tertiary server cannot become the master, and it can never serve licenses: it acts only as a tie-breaker. If the FLEXenabled application does not support three-server redundancy, the license will be placed on the primary server.

Unlike other application-level clustering systems, there are no other quantities of servers in a triad configuration. You cannot have any less or more than 3 servers.

If the tertiary tie-breaker server cannot serve licenses, why does its Host ID need to be hard coded into licenses?

This is a stupid architectural oversight of the product and another example of why software license management systems only disadvantage legitimate users.

How does a FlexNet triad achieve quorum?

The quorum process is as follows:

FlexNet license manager starts.
The license manager starts and is placed in a waiting for quorum state.
Attempt to establish quorum.
The FlexNet license manager contacts the other license servers listed in the license file. If the license manager finds another more server in a waiting for quorum state, the two servers create a quorum.
Quorum established, master elected.
When a quorum is established, the license servers check the license file. If the file contains the keyword PRIMARY_IS_MASTER, the primary server is elected the master. If the file does not contain this keyword, the server with the earliest start time is elected the master. At this point, the master server will begin serving licenses. The solution is not redundant at this stage: if either server fail, licenses will not be served.
Third server joins.
If a server in the waiting for quorum state and contacts another server which returns a quorum established status, the server will join the quorum. At this point, the solution becomes redundant: any single server failure will not affect license operations.
Quorum established.

Keep in mind that steps 1 to 5 can occur within a second.

How does a FlexNet triad retain quorum?

After quorum is established, the primary, secondary and tertiary FlexNet servers will send a regular cluster heartbeats to each other. Then on the heartbeat frequency, the primary and secondary server will ask itself "Did I receive heartbeats from from both other license servers?". The quantity of heartbeats it receives affects its behavour:

If it received zero heartbeats: The FlexNet server assumes that it is isolated from the other two servers and will shut down its vendor daemon.
If it received one heartbeat from the other primary/secondary server: The FlexNet server assumes the tertiary server is dead. No change in operation.
If it receives one heartbeat from the tertiary server: The FlexNet server assumes the other license server is dead. It will begin to serve licenses.

The heartbeats are sent on the FlexNet server port, not the vendor daemon port. This prevents you from creating dedicated cluster networks.

How long does a FlexNet triad need to wait before failing over?

You can define the heartbeat interval in the license file as HEARTBEAT_INTERVAL=x , where x is the number of seconds in the formula (3*seconds)+(seconds-1). If HEARTBEAT_INTERVAL is not defined, it will default to 20 which equals a timeout of 79 seconds. If the FlexNet server doesn't receive a heartbeat in this period of time, the FlexNet server and vendor daemon will shut down.

The valid values are 0-120 seconds, which means the smallest timeout is 1 second and the longest is 479 seconds (~8 minutes). Increasing HEARTBEAT_INTERVAL will mean that failure is detected slower.

Under what circumstances would I increase or decrease HEARTBEAT_INTERVAL from the default?

This depends on the application and users. Some applications perform license checkout on start, and license return on application closure. Increase the HEARTBEAT_INTERVAL if your network is unreliable, decrease it if you want to detect failure and failover sooner.

Clusters require reliable network connectivity as a means of determining node status, and using triads means that you need to consider other items in the service chain: Windows/Linux firewalls. These aren't new items in the service chain; FlexNet/FlexLM always used them. But now, they have to be reliable and can't interrupt service for more than the HEARTBEAT_INTERVAL.

Should I use a triad?

Avoid it if your availability requirements allow you to.

In theory, triads will remove a single point of failure. In reality, FlexNet/FlexLM is a steaming pile of garbage and the triad functionality introduces more dependencies (reliable network connectivity) and complexity than risks it mitigates.

Generally speaking, I recommend that an application's HA and DR is performed at the highest possible level (e.g. application level) rather than provided by the underlying infrastructure (e.g. VMware HA). My observation is that application vendors understand availability more than infrastructure providers. However, FlexNet/FlexLM is one of the few pieces of software that I actively recommend not using the application-level availability features.

eye tee

Monday, August 1, 2011

How does FlexNet/FlexLM triad high availability work, and should it be used?

No comments:

Post a Comment