Background: I have mentioned earlier that we were baffled by Amazon's Elastic Load Balancer (ELB). We tested Nginx as a reverse proxy, and it was hands down better than ELB. (I will cover Nginx performance in another post.) To scale this setup horizontally, we put multiple Nginx servers behind a weighted round robin (WRR) DNS configuration to spread the load across them.
We later decided to run redundant Nginx servers in an N+1 setup, so that if one server goes down the redundant one takes over. That is how I stumbled on Heartbeat. Unfortunately, there is no good tutorial on how this can be done on Amazon EC2.
Warning: I will keep this tutorial brief, limited to Heartbeat configuration, as the Wiki mentions:
In order to be useful to users, the Heartbeat daemon needs to be combined with a cluster resource manager (CRM) which has the task of starting and stopping the services (IP addresses, web servers, etc.) that cluster will make highly available. Pacemaker is the preferred cluster resource manager for clusters based on Heartbeat.
Goal: We need to make sure the auxiliary node takes over as soon as the main node dies. This is what should happen:
- We initially assign an Elastic IP (EIP) to the main node that runs Nginx.
- When the main node goes down, we want:
  - the aux node to get assigned the EIP (refer to the /etc/init.d/updateEIP script below);
  - the aux node's Nginx to start up (refer to the /etc/init.d/nginx script below).
- When the main node comes back up:
  - the aux node relinquishes the EIP;
  - the aux node shuts down Nginx;
  - the main node acquires the EIP;
  - the main node starts Nginx.
Preparation: Assuming you have launched two instances of your favorite Unix flavor.
- Get an Elastic IP. We will use this to switch over.
ec2-allocate-address
ADDRESS 50.17.213.152
- Install Nginx on both machines. I will install from source, downloaded from http://nginx.org/en/download.html
wget http://nginx.org/download/nginx-1.2.5.tar.gz
tar xvzf nginx-1.2.5.tar.gz
cd nginx-1.2.5
./configure
make
make install
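If ./configure complains about missing prerequisites (nginx needs PCRE and zlib to build), install the development packages first. The package names below are the usual ones; adjust for your distro:
#RHEL, CentOS
yum install gcc pcre-devel zlib-devel
#Debian, Ubuntu
sudo apt-get install build-essential libpcre3-dev zlib1g-dev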
- Configure Nginx
cd /usr/local/nginx/conf
mv nginx.conf nginx.conf.orig
vi nginx.conf
My nginx.conf looks like this:
worker_processes 2;
worker_rlimit_nofile 100000;

events {
    worker_connections 1024;
}

http {
    access_log off;

    upstream myapp {
        server ip-10-10-225-146.ec2.internal weight=10 max_fails=3 fail_timeout=10s; # Reverse proxy to app01
        server ip-10-226-95-45.ec2.internal weight=10 max_fails=3 fail_timeout=10s; # Reverse proxy to app02
        server ip-10-244-145-130.ec2.internal weight=10 max_fails=3 fail_timeout=10s; # Reverse proxy to app03
    }

    server {
        listen 80;
        add_header X-Whom node_1;
        location / {
            proxy_pass http://myapp;
        }
    }
}
This configuration is the same on both Nginx servers except for one thing: the add_header X-Whom node_1; part. This identifies which Nginx load balancer is serving, which will help us debug the situation later. On the second Nginx, we have add_header X-Whom node_2;. This directive tells Nginx to inject a header, X-Whom, into each response.
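Before wiring this into failover, it is worth confirming the file parses cleanly with nginx's standard -t flag:
/usr/local/nginx/sbin/nginx -t          # test the configuration
/usr/local/nginx/sbin/nginx -s reload   # apply it if nginx is already running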
- Install Heartbeat:
#RHEL, CentOS
yum install heartbeat
#Debian, Ubuntu
sudo apt-get install heartbeat-2
Configure Heartbeat: There are three things you need to configure to get everything working with Heartbeat.
- ha.cf: /etc/ha.d/ha.cf is the main configuration file. It lists the nodes and the features to be enabled, and its contents are order sensitive.
- authkeys: It maintains the security of the cluster by providing a means to authenticate the nodes in it.
- haresources: the list of resources to be managed by Heartbeat. A line looks like this: preferredHost service1 service2, where preferredHost is the hostname on which you prefer the subsequent services to run, and service1 service2 are scripts living in /etc/init.d/ that stop and start the service when called as /etc/init.d/service1 stop|start.
When a node is woken up, the services are started from left to right: service1 first, then service2. When a node is brought down, service2 is stopped first, then service1.
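With the haresources line we will use below, this ordering works out as follows (you can verify it later in the ResourceManager lines of the debug log):
# haresources: ip-10-144-75-85 updateEIP nginx
# node comes up:   /etc/init.d/updateEIP start, then /etc/init.d/nginx start
# node goes down:  /etc/init.d/nginx stop, then /etc/init.d/updateEIP stop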
---------
For the sake of simplicity, let's call one node the main node and the other the aux node. In this setup, the main node is ip-10-144-75-85 and the aux node is ip-10-36-5-11 (the hostnames each machine reports via uname -n).
----
/etc/ha.d/ha.cf
On main node,
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85
On aux node, exactly the same as on the main node:
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85
where,
keepalive: seconds between two heartbeat packets.
deadtime: seconds after which a host is considered dead if not responding.
warntime: seconds after which the late-heartbeat warning is issued.
initdead: seconds to wait for the other host after starting Heartbeat, before it is considered dead.
udpport: port number used for bcast/ucast communication. The default value for this is 694.
bcast/ucast: the interface (and, for ucast, the peer IP) on which to broadcast/unicast heartbeats.
auto_failback: if 'on', resources are automatically failed back to their primary node.
node: a node in the HA set-up, identified by uname -n.
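Since the node entries must match what each machine calls itself, double-check on both hosts:
uname -n   # must match a node line in ha.cf, e.g. ip-10-36-5-11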
----
/etc/ha.d/authkeys must be the same on both machines. You may generate a random secret key using:
date | md5sum
authkeys on both the machines:
auth 1
1 sha1 1e8d28a4627ed7f83faf1d57f5b11645
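A minimal way to generate and install this file (this sketch assumes the sha1 method shown above; the chmod is not optional, since Heartbeat refuses to start if authkeys is readable by others):
key=$(date | md5sum | awk '{print $1}')
printf 'auth 1\n1 sha1 %s\n' "$key" > /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys   # Heartbeat insists on restrictive permissions
# copy the identical file to the other node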
----
/etc/ha.d/haresources
haresources on both machines:
ip-10-144-75-85 updateEIP nginx
updateEIP and nginx are shell scripts that I wrote and stored under /etc/init.d/.
updateEIP
#!/bin/bash
# description: this script associates $eip with this instance on "start";
# "stop" is deliberately a no-op.
# it assumes you have a JDK and the AWS EC2 API tools installed;
# their locations are given as $JAVA_HOME and $EC2_HOME.
# author: Nishant Neeraj
export EC2_HOME=/home/ec2
export JAVA_HOME=/usr/java/default

eip="50.17.213.152"
pk="/mnt/pk-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"
cert="/mnt/cert-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"

function updateEIP(){
    # ask the instance metadata service which instance we are running on
    instance="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
    echo "Instance ID is: ${instance}"
    echo "Assigning ${eip} to ${instance}..."
    /home/ec2/bin/ec2-associate-address -K ${pk} -C ${cert} -i ${instance} ${eip}
    echo "done!"
}

param=$1
if [ "start" == "$param" ] ; then
    echo "Starting..."
    updateEIP
    exit 0
elif [ "stop" == "$param" ] ; then
    echo "stopping..."
    exit 0
else
    echo "no such command $param"
    exit 1
fi
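Note that the stop branch above deliberately does nothing: the address simply stays where it is until the other node re-associates it. If you would rather release it explicitly on stop, a sketch using ec2-disassociate-address from the same API tools might look like this (an assumption on my part, not something this setup relies on):
# hypothetical stop handler: explicitly releases the EIP
function releaseEIP(){
    echo "Disassociating ${eip}..."
    /home/ec2/bin/ec2-disassociate-address -K ${pk} -C ${cert} ${eip}
}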
nginx
#!/bin/bash
# starts and stops the Nginx we built from source under /usr/local/nginx
function start(){
    echo "starting nginx..."
    /usr/local/nginx/sbin/nginx
}
function stop(){
    echo "stopping nginx..."
    /usr/local/nginx/sbin/nginx -s stop
}

param=$1
if [ "start" == "$param" ] ; then
    start
    exit 0
elif [ "stop" == "$param" ] ; then
    stop
    exit 0
else
    echo "no such command: $param"
    exit 1
fi
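Remember that both scripts must be executable on both nodes, and you can sanity-check the nginx one by hand before handing it over to Heartbeat:
chmod +x /etc/init.d/updateEIP /etc/init.d/nginx
/etc/init.d/nginx start
curl -I http://localhost/   # should answer with the X-Whom header set earlier
/etc/init.d/nginx stop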
----
Start services:
on main,
service heartbeat start
/etc/init.d/nginx start
on aux,
service heartbeat start
You can tail -f /tmp/ha-debug on these nodes to watch things rolling.
Test: Time to test. Note that Heartbeat is only watching nodes, not the Nginx service itself. So we need to make the heartbeat service on 'aux' see 'main' as if the server were down.
Keep tailing the debug file on the 'aux' machine; it will show you the transition. Now it's time to kill the 'main' node. On main, do this:
service heartbeat stop
/etc/init.d/nginx stop
You can see in the debug file on 'aux' that 'aux' is taking over from 'main'. Now, to see how service switches back to the main node when it comes up, start just the heartbeat service on main. You will see: a. aux: the EIP is given up (not really detached; our stop script is a no-op, so the address only moves when main re-associates it), b. aux: Nginx stops, c. main: the EIP is assigned, d. main: Nginx is started.
Note: While we flip EIPs, you may get disconnected from your SSH session. So keep an eye on the AWS web console to find the newly assigned public DNS name of the node the EIP was revoked from, and use the EIP itself to connect to the node it was assigned to.
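If you lose track of a node after the flip, the same EC2 API tools can look its addresses up; for instance, with the instance ID that appears in the log below (pk and cert as in the updateEIP script):
/home/ec2/bin/ec2-describe-instances -K ${pk} -C ${cert} i-6ff07d10 | grep INSTANCE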
Debug Logs: Here is how the log on the auxiliary machine looks:
heartbeat[9133]: 2012/11/21_15:12:44 info: Received shutdown notice from 'ip-10-144-75-85'.
heartbeat[9133]: 2012/11/21_15:12:44 info: Resources being acquired from ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10055]: 2012/11/21_15:12:44 info: acquire local HA resources (standby).
heartbeat[10055]: 2012/11/21_15:12:44 info: local HA resource acquisition completed (standby).
heartbeat[9133]: 2012/11/21_15:12:44 info: Standby resource acquisition done [foreign].
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10056]: 2012/11/21_15:12:44 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ip-10-36-5-11] to acquire.
heartbeat[10081]: 2012/11/21_15:12:44 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10081]: 2012/11/21_15:12:44 info: Running /etc/ha.d/rc.d/status status
mach_down[10097]: 2012/11/21_15:12:44 info: Taking over resource group updateEIP
ResourceManager[10123]: 2012/11/21_15:12:45 info: Acquiring resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10123]: 2012/11/21_15:12:45 info: Running /etc/init.d/updateEIP start
ResourceManager[10123]: 2012/11/21_15:12:45 debug: Starting /etc/init.d/updateEIP start
Starting...
Instance ID is: i-6ff07d10
Assigning 50.17.213.152 to i-6ff07d10...
ADDRESS 50.17.213.152 i-6ff07d10
done!
ResourceManager[10123]: 2012/11/21_15:12:58 debug: /etc/init.d/updateEIP start done. RC=0
ResourceManager[10123]: 2012/11/21_15:12:58 info: Running /etc/init.d/nginx start
ResourceManager[10123]: 2012/11/21_15:12:58 debug: Starting /etc/init.d/nginx start
starting nginx...
ResourceManager[10123]: 2012/11/21_15:12:58 debug: /etc/init.d/nginx start done. RC=0
mach_down[10097]: 2012/11/21_15:12:58 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[10097]: 2012/11/21_15:12:58 info: mach_down takeover complete for node ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:58 info: mach_down takeover complete.
heartbeat[9133]: 2012/11/21_15:13:16 WARN: node ip-10-144-75-85: is dead
heartbeat[9133]: 2012/11/21_15:13:16 info: Dead node ip-10-144-75-85 gave up resources.
heartbeat[9133]: 2012/11/21_15:13:16 info: Link ip-10-144-75-85:eth0 dead.
...
heartbeat[9133]: 2012/11/21_15:19:59 info: Heartbeat restart on node ip-10-144-75-85
heartbeat[9133]: 2012/11/21_15:19:59 info: Link ip-10-144-75-85:eth0 up.
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status init
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status up
heartbeat[9133]: 2012/11/21_15:19:59 debug: StartNextRemoteRscReq(): child count 1
heartbeat[9133]: 2012/11/21_15:19:59 debug: get_delnodelist: delnodelist=
heartbeat[10262]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10262]: 2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[10278]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10278]: 2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: Status update for node ip-10-144-75-85: status active
heartbeat[10294]: 2012/11/21_15:20:00 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10294]: 2012/11/21_15:20:00 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:00 info: ip-10-36-5-11 wants to go standby [foreign]
heartbeat[9133]: 2012/11/21_15:20:01 info: standby: ip-10-144-75-85 can take our foreign resources
heartbeat[10310]: 2012/11/21_15:20:01 info: give up foreign HA resources (standby).
ResourceManager[10323]: 2012/11/21_15:20:01 info: Releasing resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10323]: 2012/11/21_15:20:01 info: Running /etc/init.d/nginx stop
ResourceManager[10323]: 2012/11/21_15:20:01 debug: Starting /etc/init.d/nginx stop
stopping nginx...
ResourceManager[10323]: 2012/11/21_15:20:01 debug: /etc/init.d/nginx stop done. RC=0
ResourceManager[10323]: 2012/11/21_15:20:01 info: Running /etc/init.d/updateEIP stop
ResourceManager[10323]: 2012/11/21_15:20:01 debug: Starting /etc/init.d/updateEIP stop
stopping...
ResourceManager[10323]: 2012/11/21_15:20:01 debug: /etc/init.d/updateEIP stop done. RC=0
heartbeat[10310]: 2012/11/21_15:20:01 info: foreign HA resource release completed (standby).
heartbeat[9133]: 2012/11/21_15:20:01 info: Local standby process completed [foreign].
heartbeat[9133]: 2012/11/21_15:20:18 WARN: 4 lost packet(s) for [ip-10-144-75-85] [17:22]
heartbeat[9133]: 2012/11/21_15:20:18 info: Other node completed standby takeover of foreign resources.
heartbeat[9133]: 2012/11/21_15:20:18 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:18 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:20 WARN: 1 lost packet(s) for [ip-10-144-75-85] [22:24]
heartbeat[9133]: 2012/11/21_15:20:20 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:28 WARN: 3 lost packet(s) for [ip-10-144-75-85] [24:28]
heartbeat[9133]: 2012/11/21_15:20:28 info: No pkts missing from ip-10-144-75-85!
Watch debug logs, watch headers: besides the debug file, you may want to use curl -I to confirm that the switchover actually occurs. Here is what I see when I kill the main node, the aux node takes over, and finally the main node comes back:
~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:10:39 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=j1qov13qkfca1pxm6usm3ulyi;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1
~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:13:38 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=1ujszsyow5rc0ws4v2rir06sk;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_2
~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:22:09 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=khbf8ovc6nr1rwr4o3hzmflt;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1
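To watch the flip happen live rather than running curl by hand, a one-line poll of the header does the trick:
# print which node is answering, once a second
while true; do curl -sI 50.17.213.152 | grep X-Whom; sleep 1; done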
Conclusion: So we have the basic HA configuration ready. To make it genuinely useful, we still need to monitor the services themselves (not just the nodes) and respond to their failures; that means adding a cluster resource manager (CRM) layer, such as Pacemaker, on top of this setup.