eye tee: August 2011

My C lecturer said my linked list implementation was the best linked list code he'd seen written under exam conditions in his career. True story. The question involved writing a linked list in C to manage books in a bookshelf (each book was an instance of a struct). But unlike my fellow students who took time to memorise linked list code, I didn't bother as I was too busy playing Civilization. Priorities people!

If you're sitting a C exam that has linked lists, you can expect a generic question about linked lists. The question is usually in the form "You have a bunch of (objects) in a (collection). Write a linked list to store this data." The objects are usually something like

books in a bookshelves
words in a sentence
phone numbers in a phone book
spells in a spellbook if your lecturer likes Harry Potter.

The logic behind each is the same, and the only thing that changes is the name of the variables and the amount of variables in the struct. If you understand one linked list, you'll understand them all.

Nodes

In algorithms and data structures terminology, the items in a linked list are called nodes. If you're having trouble conceptualising a node, think of it as a business card. Just like a business card, a node can contain a single piece of data on it (like "MIB") or it can contain multiple types of data (like a name, address, phone number).

A business card with a single piece of data

In addition to the useful data you might put in a node, each node also contains a pointer to the next node in the list. If you were a visual person, you might visualize this as a piece of string connecting two business cards.

Linking the nodes

All nodes in a linked list are connected to each sequentially using pointers. This is why they are called linked lists. The first node points to the second node. The second node points to the third node, and so on and so forth until it gets to...

The last node

If it's the last node in the list, the pointer points to NULL. This is important, because checking for NULL is the way we know we're at the end of the list.

Is there any way of skipping to the 27th node?

No. You go to the head node, and then you follow the pointer to the next node. Then you do that 26 more times.

The head node

You'll need a way to find the first node in the list. This is known as the head of a linked list. The head tells you where the linked list starts: if you lose the contents of the head node, the whole list is screwed and you're a buffoon. Don't be a buffoon! Tips for not being a buffoon:

When creating the first node in a linked list, the first node is the head.
Point the head at the first node.
When deleting the first node in a list, point the head at the second node.

The current node

As you iterate through a linked list to look at its contents, you'll need a way of tracking which node you're looking at. This is the current node. To cycle to the next node, we change where the current node points with

Current = Current->Next

What's more annoying than setting up a FlexNet license server? Setting up a FlexNet high availability group! In FlexNet terminology, this is called a triad configuration. I'm not sure why it wasn't called something more industry standard like a "cluster". Perhaps because triad sounds cooler.

Welcome to 1994, FlexNet user.

An architecture pattern for improving a system's uptime is to remove single points of failure (SPOFs). A good way of doing this involves reviewing the components of a system, mapping how they communicate, and understanding how single points of failure can be removed. Let's review the license check-out process.

FlexNet has a traditional client-server style architecture. Simply put, FlexNet-protected apps like AutoCAD are configured to contact a FlexNet license server.

Client contacts the FlexNet Server
The client (typically a laptop user in a north pole igloo) starts AutoCAD. AutoCAD looks in a few places (system environment variables, Windows registry, licensing files) for the hostname and port of the FlexNet license server. It will ask the FlexNet license server for the vendor daemon port.
Server responds with vendor daemon port
The FlexNet license manager replies to the client with the network port of the vendor daemon. Client responds with license check-out request The client responds with “AutoCAD license: give me 1.”
Vendor daemon checks license file
The vendor daemon determines whether any valid licenses are available.
Vendor daemon replies with yay/nay

In this architecture, it's clear that the license server is the single point of failure.

How does FlexNet/FlexLM over come this?

FlexNet/FlexLM uses a triad configuration which requires three servers: a primary, secondary and tertiary server. Of the primary and secondary servers, one of these may hold the master role. A master server is elected if a quorum (two out of three servers) is available. The tertiary server cannot become the master, and it can never serve licenses: it acts only as a tie-breaker. If the FLEXenabled application does not support three-server redundancy, the license will be placed on the primary server.

Unlike other application-level clustering systems, there are no other quantities of servers in a triad configuration. You cannot have any less or more than 3 servers.

If the tertiary tie-breaker server cannot serve licenses, why does its Host ID need to be hard coded into licenses?

This is a stupid architectural oversight of the product and another example of why software license management systems only disadvantage legitimate users.

How does a FlexNet triad achieve quorum?

The quorum process is as follows:

FlexNet license manager starts.
The license manager starts and is placed in a waiting for quorum state.
Attempt to establish quorum.
The FlexNet license manager contacts the other license servers listed in the license file. If the license manager finds another more server in a waiting for quorum state, the two servers create a quorum.
Quorum established, master elected.
When a quorum is established, the license servers check the license file. If the file contains the keyword PRIMARY_IS_MASTER, the primary server is elected the master. If the file does not contain this keyword, the server with the earliest start time is elected the master. At this point, the master server will begin serving licenses. The solution is not redundant at this stage: if either server fail, licenses will not be served.
Third server joins.
If a server in the waiting for quorum state and contacts another server which returns a quorum established status, the server will join the quorum. At this point, the solution becomes redundant: any single server failure will not affect license operations.
Quorum established.

Keep in mind that steps 1 to 5 can occur within a second.

How does a FlexNet triad retain quorum?

After quorum is established, the primary, secondary and tertiary FlexNet servers will send a regular cluster heartbeats to each other. Then on the heartbeat frequency, the primary and secondary server will ask itself "Did I receive heartbeats from from both other license servers?". The quantity of heartbeats it receives affects its behavour:

If it received zero heartbeats: The FlexNet server assumes that it is isolated from the other two servers and will shut down its vendor daemon.
If it received one heartbeat from the other primary/secondary server: The FlexNet server assumes the tertiary server is dead. No change in operation.
If it receives one heartbeat from the tertiary server: The FlexNet server assumes the other license server is dead. It will begin to serve licenses.

The heartbeats are sent on the FlexNet server port, not the vendor daemon port. This prevents you from creating dedicated cluster networks.

How long does a FlexNet triad need to wait before failing over?

You can define the heartbeat interval in the license file as HEARTBEAT_INTERVAL=x , where x is the number of seconds in the formula (3*seconds)+(seconds-1). If HEARTBEAT_INTERVAL is not defined, it will default to 20 which equals a timeout of 79 seconds. If the FlexNet server doesn't receive a heartbeat in this period of time, the FlexNet server and vendor daemon will shut down.

The valid values are 0-120 seconds, which means the smallest timeout is 1 second and the longest is 479 seconds (~8 minutes). Increasing HEARTBEAT_INTERVAL will mean that failure is detected slower.

Under what circumstances would I increase or decrease HEARTBEAT_INTERVAL from the default?

This depends on the application and users. Some applications perform license checkout on start, and license return on application closure. Increase the HEARTBEAT_INTERVAL if your network is unreliable, decrease it if you want to detect failure and failover sooner.

Clusters require reliable network connectivity as a means of determining node status, and using triads means that you need to consider other items in the service chain: Windows/Linux firewalls. These aren't new items in the service chain; FlexNet/FlexLM always used them. But now, they have to be reliable and can't interrupt service for more than the HEARTBEAT_INTERVAL.

Should I use a triad?

Avoid it if your availability requirements allow you to.

In theory, triads will remove a single point of failure. In reality, FlexNet/FlexLM is a steaming pile of garbage and the triad functionality introduces more dependencies (reliable network connectivity) and complexity than risks it mitigates.

Generally speaking, I recommend that an application's HA and DR is performed at the highest possible level (e.g. application level) rather than provided by the underlying infrastructure (e.g. VMware HA). My observation is that application vendors understand availability more than infrastructure providers. However, FlexNet/FlexLM is one of the few pieces of software that I actively recommend not using the application-level availability features.

eye tee

Tuesday, August 16, 2011

The basics of ordered linked lists in C

Monday, August 1, 2011

How does FlexNet/FlexLM triad high availability work, and should it be used?