Tuesday, May 28, 2013

Windows Server 2012 SuperFeature: DHCP App-Level Failover/Loadbalance

Foreward


One of my favorite additions to Windows Server for the 2012 version is application level load balancing and fail-over. This all new implementation does away with the previous solutions: "scope-splitting" and Windows clustering, neither of which I have ever felt comfortable recommending to a client. Splitting scopes doesn't do enough to prevent outages, and clustering is too complex to be a DHCP solution.

Fortunately, Microsoft recognized this gap in their product and released this new app level failover solution that aims to be as simple and straightforward as possible. I've deployed this a couple of times now and I'm blown away. Here's a high level overview of the implementation:

  • A single implementation can have two servers, no more. 
  • Servers can be configured in load-balancing or hot standby mode. 
  • Servers can reside across routing boundaries. (Enables unified management as well!)
  • Failover/Loadbalance Limited to IPv4
  • DHCP supported on server core
  • (Optional) Replication encryption 
  • Limit of one replication relationship type between two partners
And best of all,
  • Easy to set up and maintain.  (With a couple caveats I'll list below)

Hot Standby vs. Load Balancing


The hot standby option utilizes one DHCP server to service requests while the other waits to step in should the primary fail. A percentage (generally single digit) of the scope in question needs to be dedicated to the passive standby server for slack address space to allocate in a failover event where the backup hasn't yet asserted primary status. Microsoft states that hot standby is useful for multiple multi-site deployments wherein the primary would be onsite and a secondary would be located offsite should the primary fail. Here are a couple scenarios well suited to hot standby:

Multi-Site, Single Backup



Two sites backing each other up


The load balancing strategy splits client servicing based off of a MAC address hashing algorithm and will still respond to client requests that the other member in the pair should service in a situation where the client has gone unanswered. Provided you're using a datacenter licensing model and virtualization, most folks will want to utilize load balancing with two DHCP servers per site, generally on different hosts connected to different switches. If needed, load distribution mechanisms like F5s will work with this tech.

Two Sites Each Independently Load Balanced


Now let's set up DHCP failover or load balancing:


Assumptions


  • Basic knowledge re: Windows server 2012 and DHCP
  • Two 2012 servers ready to go and fully patched
Since we need to set up at least two servers, we'll do this twice, once with the GUI and once with Powershell.

DHCP Server Setup (GUI)


  1. Install the DHCP server role by using server manager and selecting Manage->Add Roles and Features


  2. After bypassing the intro screen, select "Role Based or Feature Based Installation" and select your server.


  3. Select the "DHCP Server" role. Admin tools will be auto-selected as needed. 


  4. Click "Next" through the rest of the Wizard. Once it completes, you'll be notified that DHCP configuration needs to be completed.


  5. Launch the DHCP Post-Install configuration wizard and complete the DHCP setup by authorizing the DHCP server. 

 

DHCP Server Setup (Powershell)

Where (Servername) is the FQDN of the server you wish to install, execute the following on a domain connected computer with proper rights on the target machine:
  1. Load the servermanager module:
    Import-Module Servermanager
  2. Install DHCP:
    install-windowsfeature -ComputerName servername.domain.lan -name dhcp -IncludeManagementTools
  3. When complete, authorize in AD:
    Add-DhcpServerInDC -DnsName servername.domainname.lan
Note that in step #3 you must specify the -ipaddress parameter (i.e. -ipaddress 10.0.0.10) if your server has either multiple NICs or has messed up registration in DNS. See here for more info.

Prep for Server Pairing


After authorization, the DHCP services need be restarted due to group add/creates. Do that or reboot the servers in question, whichever is easier. Set up your DHCP scopes as you normally would on one of the two servers. (More info, ignore the 80/20 part) 

Configure DHCP Server Pair (GUI)

  1. Open up the DHCP management GUI and right click on the scope you would like to load balance and select "Configure Failover..."


  2. On the "Introduction to DHCP Failover" screen, select all scopes you would like to configure (or "Select all" for all) and click "Next".


  3. On the "Specify the partner server to use for failover" screen select the other DHCP server. This can be looked up provided the server has been registered in Active Directory. 


  4. On the "Create a new failover relationship" page configure the following:
    1. Relationship name: Configure a name for this partnership; you may want to manipulate this via Powershell so take that into account when considering a very complex name. 
    2. Maximum Client Lead Time: This determines two three things: A) The lease time for a new client request if the server responsible for that client is down and the other answers the request and B) The amount of time one server will wait for a dead partner server before it takes control of the entire IP address block. C) (added 8/5/13) The amount of time one server that had been down must be available to the other before "Partner Down" status will automatically be changed to "Normal" status. (See comments for an example of this) The default of 1 hour is generally good but you may want to tweak depending on your setup. 
    3. Mode= Load Balance / Load Balance Percentage: This determines how much of the total load each server will take. 
    4. Mode= Hot Standby / Role of Partner Server/Addresses reserved for standby server: This determines if the partner sever is the primary or the standby and how much of each scope is reserved for distribution should the primary go down. Be careful that you have enough reserved here so that you won't run out of IP addresses prior to switching to "Partner Down" mode while also ensuring you won't run out of IPs on the primary server due to reserved addresses on the standby.
    5. State Switchover Interval: Selecting this enables either server to enter "Partner Down" state should communication be interrupted for the number of minutes specified after the option (default 60) resulting in the remaining server taking full responsibility for the scope(s). If this is not selected, an admin must manually choose to put the server into partner down state.
    6. Enable Message Authentication and Shared Secret: I highly recommend checking this box and specifying a long (14+ character) shared secret. This will encrypt messages between the two servers by using SHA-256. Should you wish to change the crypto, navigate to "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DHCPServer\Parameters\" in the registry and add/change "FailoverCryptoAlgorithm".


  5. You'll then be met with a failover summary screen. Verify the info and click "Finish". 


  6. You will be shown the failover setup process. 



Configure DHCP Server Pair (Powershell) 


This time around you don't need to do configuration twice, so this section would be in lieu of the previous. Refer to the section above for full descriptions, I'll just do mappings here:

Load Balanced


Add-DhcpServerv4Failover [-Name] (String) [-ScopeId] (IPAddress[]) [-PartnerServer] (String) [-AutoStateTransition (Boolean)] [-ComputerName (String)] [-LoadBalancePercent (UInt32)] [-MaxClientLeadTime (TimeSpan)] [-SharedSecret (String)] [-StateSwitchInterval (TimeSpan)]


Where:

-Name = Name above
-ScopeID = The IP of the scope to be partnered
-PartnerServer = DHCP Server 2
-AutoStateTransition = "State Switchover Interval" above. Note that if the "StateSwitchInterval" argument is used in the powershell command then this value is assumed TRUE, otherwise the default is FALSE
-ComputerName = DHCP Server 1
-Load Balance Percent = The % to be serviced by DHCP Server 1
-Max Client Lead Time = Same as outlined in the GUI section
-SharedSecret = Same as outlined in the GUI section
-StateSwitchInterval = Int; specifies how long to wait until auto transition to Partner Down. Makes AutoStateTransition assumed to be true.

Failover


Add-DhcpServerv4Failover [-Name] (String) [-ScopeId] (IPAddress[]) [-PartnerServer] (String) [-AutoStateTransition (Boolean)] [-ComputerName (String)] [-MaxClientLeadTime (TimeSpan)] [-ReservePercent (UInt32)] [-ServerRole (String)] [-SharedSecret (String)] [-StateSwitchInterval (TimeSpan)]


Where:
-ReservePercent = Same as outlined in the GUI section
-ServerRole = Active or Standby

Important Usage Notes!


  • Server Options are NOT replicated! Take this into account when setting up replication; you may want to specify options at a scope level so that if they are changed you don't need to manually do it on each server. 


  • There have been some reports of replication breaking custom options. See here for more info. 
  • How do "Communication Interrupted" and "Partner Down" states get initiated and what effect do they have? Refer to this handy flowchart I whipped up for reference:  


References: 



MSFT DHCP Team Blog: Hot-Standby
MSFT DHCP Team Blog: Failover Load Balance
Technet: Step by Step Configure DHCP for Failover
MSFT Doco: Understand and Troubleshoot DHCP Failover in Win8 Beta (Still Relevant)

Closing


Thanks for taking the time to read; should you have any questions leave them in the comments!

4 comments:

Gaw said...

Hi Toby, thanks for our article, written much better than MS. One question, once a DHCP server enter the partner down state and the other server take over, when the down partner comes back up, will the two DHCP servers go back to 50/50 automatically?

Toby Meyer said...

Good question @Gaw!
There is some conflicting information in the documentation out there, so to confirm the answer I ran a few tests. The tests revealed that the Maximum Client Lead Time also corresponds to the automatic fail-back period. When one partner enters PartnerDown state, it will stay in that state until the other server has come back up and then the maximum client lead time has elapsed. This holds true regardless of if the transition to PartnerDown state is automated (via State Switchover Interval) or manual.

For example, if the Maximum Client Lead Time were set to 1 hour a sequence may go as follows:

- Server 2 goes down @ 1:00
- Server 1 is manually transitioned from "Communication Interrupted" state to "Partner Down" state @ 1:05.
- Server 2 comes back up and is reachable @ 1:10
- Server 1 & 2 will automatically transition to "Normal" state @ 2:10.

Be careful if lowering it though, remember that this value also configures lease duration when in communication interrupted state; one would not want to flood the remaining DHCP server with DHCP renewal requests due to the temporary lease time being too short.

I'll update the main article with this information. Thanks for the great question!

Luis said...

Hi Toby,

How it works in a multi-scope environment (multiple networks)?

Do you know if the DHCP Relay parameter can be used in this type of configuration mode?

Regards

Luis López

Toby Meyer said...

Hi Luis!

Absolutely; you can replicate all scopes on your server should you desire. Things get a bit trickier using DHCP relay agents on your routers/switches however. For full fault tolerance your routers/switches must support either DHCP broadcast relay to the target subnet or support multiple IP addresses for the DHCP relay target. I know most new Cisco devices do the latter with just a couple commands. I've also tested F5 big-IP as a solution successfully in a Primary/Standby configuration using standard TMOS failover functionality. If you can't facilitate any of the above, be aware you will have to change the target IP on your relay agents should the "primary" fail.

Also note you can have multiple failover relationships but no more than two servers in any given set of relationships. You can have as many scopes as you want in any given failover relationship.

If you have any other questions please post them up. Thanks for reading!