Accidental SQL admin
I promise, I am not becoming a SQL expert/consultant. Honestly I try to know enough about SQL to do what I need from a Citrix perspective and let the true experts handle the complex stuff. Alas, sometimes you have to dig in and get your hands dirty. In this case it was very informative, and a little fun too.
So I had posted previously about fun with the Citrix broker logins to SQL AlwaysOn groups (https://www.icrill.com/single-post/2017/05/04/SQL-AlwaysOn-lessons-learned). My most recent experience is in leveraging SQL licensing to our advantage to provide the level of redundancy required for an active/active Citrix environment.
The background on the environment is a customer is wanting a new XenApp 7.x environment to migrate off the existing 6.5 farm. Customer has two main data centers about 20 miles apart with very high speed, low latency connections between them. A third minor data computing facility with basic server capabilities. The new environment needs to be highly available at all levels, however there is no DBA team nor budget for SQL Enterprise.....it isn't happening.
We could go the route of SQL Mirroring with SQL Standard edition. The problem with that route is that all momentum is going away from mirroring to AlwaysOn. We can do an AlwaysOn availability group with SQL Enterprise that would allow us to fail over all our Citrix databases. Don't forget though, no budget for SQL Enterprise. So what are we to do?
Thankfully Microsoft provided us with a way out. With SQL Standard edition you can still provide AlwaysOn availability groups. There has to be a catch right? You bet! The caveat is that with SQL Standard edition there is only one database allowed per availability group. So in our case with a XenApp site (3 databases - site, logging, and monitoring) and a Provisioning Services farm (1 database), we would need to have 4 availability groups. For those not terribly familiar with SQL or SQL AlwaysOn, an availability group presents itself to the application like a SQL cluster name. That way when you install your application(s) you point to the cluster name and let it determine the active SQL server.
With those requirements in mind, the customer architecture is like so:
Site A: SQL Server 2016 Standard edition (chosen primary for availability groups)
Site B: SQL Server 2016 Standard edition (chosen secondary for availability groups)
Site C: File server witness (since we only have two nodes and want maximum availability)
Since we are going to have 4 availability groups, we are going to require more IP addresses. This is because each SQL server node will need to be individually addressable for each SQL AlwaysOn availability group. Most times this will not be an issue, but if you are tight on IP address space, take note of the additional addressing required.
So with all that knowledge in mind we setup our availability groups, backed up our databases from the standalone instance, restored them to the AlwaysOn availability group, and updated our database connections on the brokers and PVS servers. All done right? WRONG!!!!
We need to test fail over of the SQL environment. I mean what's the point of configuring high availability if you aren't sure it'll work? The obvious first test would be to test what happens if the file share witness becomes unavailable. So we shut down the file server hosting the file share; cluster stays up as expected. OK we are down to two nodes, so if we lose another node the cluster goes down. So customer powers down the node in Site B. We log onto the Citrix environment and Studio/Director report fine. We look at SQL management studio and it reports the database as active. Did we just get very lucky? In a certain sense, yes.
In Server 2012 Microsoft introduced a clustering feature called Dynamic Quorum that neither me nor my customer was familiar with. The basic premise of it is if you plan on taking nodes offline and take them offline gracefully, then the cluster is going to re-balance its quorum to be among only the remaining nodes. If the nodes had truly failed, quorum would be lost. The other cool part of it is dynamic weighting. Rather than botch the explanation of it I'll post it from the article my customer directed me where I read about it(http://windowsitpro.com/windows-server-2012/dynamic-quorum-windows-server-2012):
"When there are only two nodes left, one of the nodes loses its DynamicWeight so only one of the nodes now has a vote (this is chosen randomly). This assures that if the second node crashes (the node without a vote), the first node can stay active (giving you a 50/50 chance of surviving an unplanned failure of a node)."
So even after you exhaust your many layers of redundancy available to us now (hypervisor HA, file share witness, etc) you still have a coin flips chance should something happen that everything will work. For anybody that's worked with Microsoft clustering in the past, that's better odds than when you had a full traditional failover cluster in previous OS iterations.
To recap the architecture and moving parts:
Active/active XenApp with Provisioning Server in main campus and secondary data center
2 instances of SQL Server 2016 Standard edition, one in main campus and one in secondary data center
File share witness in third mini data center location with direct connections to both data centers
4 SQL AlwaysOn Availability Groups for Citrix databases
Pros
Enterprise level redundancy for Citrix environment
Lower cost with SQL Server Standard over SQL Server Enterprise
Databases able to fail over independently of each other
Cons
Longer setup time given each database protected requires new availability group
More IP addresses required
Database strings are not uniform due to each XenApp database pointing to different availability group
Databases able to fail over independently of each other (yes it is a pro and a con, think about it and you'll get it)
In conclusion, the customer is happy with the solution and it works well for the customer. It provides them with a highly redundant supporting infrastructure to their new XenApp environment. It did take some time to learn the ins and outs of what was going on with SQL, but that's because there was thorough testing taking place.
Let me know your thoughts on the setup. If you have questions or think I'm crazy (not ruling it out) feel free to send me a note at benjamin.crill@icrill.com.