Friday, November 23, 2012

High-Availability Nginx using Heartbeat on Amazon EC2

Background: I mentioned earlier that we were baffled by Amazon's Elastic Load Balancer (ELB). We tested Nginx as a reverse proxy, and it seemed hands down better than ELB. (I will cover Nginx performance in another post.) To scale this setup horizontally, we configured multiple Nginx servers behind a Weighted Round Robin (WRR) DNS configuration, to take some load off each Nginx server.

We later decided to run a redundant Nginx server in an N+1 setup, so that if one server goes down, the redundant one takes over. I stumbled on Heartbeat. Unfortunately, there is no good tutorial on how to do this on Amazon EC2.

Warning: I will keep this tutorial brief and limited to Heartbeat configuration. As the Wiki mentions:
In order to be useful to users, the Heartbeat daemon needs to be combined with a cluster resource manager (CRM) which has the task of starting and stopping the services (IP addresses, web servers, etc.) that cluster will make highly available. Pacemaker is the preferred cluster resource manager for clusters based on Heartbeat.
Goal: We need to make sure the auxiliary (aux) node takes over as soon as the main node dies. This is what should happen:
  1. Initially, we assign the Elastic IP (EIP) to the main node, which runs Nginx.
  2. When the main node goes down,
    1. the aux node is assigned the EIP (see the /etc/init.d/updateEIP script below).
    2. the aux node's Nginx starts up (see the /etc/init.d/nginx script below).
  3. When the main node comes back up,
    1. the aux node relinquishes the EIP.
    2. the aux node shuts down Nginx.
    3. the main node acquires the EIP.
    4. the main node starts Nginx.
Preparation: I assume you have launched two instances of your favorite Unix flavor.
  1. Get an Elastic IP. We will use it for the switch-over.
    ec2-allocate-address
    ADDRESS 50.17.213.152 
  2. Install Nginx on both machines. I install from source, available at http://nginx.org/en/download.html
    wget http://nginx.org/download/nginx-1.2.5.tar.gz
    tar xvzf nginx-1.2.5.tar.gz 
    cd nginx-1.2.5
    ./configure 
    make
    make install
  3. Configure Nginx
    cd /usr/local/nginx/conf
    mv nginx.conf nginx.conf.orig
    vi nginx.conf

    My nginx.conf looks like this.
    worker_processes  2;
    worker_rlimit_nofile 100000;
    
    events {
        worker_connections  1024;
    }
    
    http {
    
      access_log off;
    
      upstream myapp {
        server ip-10-10-225-146.ec2.internal weight=10 max_fails=3 fail_timeout=10s;  # Reverse proxy to app01
        server ip-10-226-95-45.ec2.internal weight=10 max_fails=3 fail_timeout=10s;   # Reverse proxy to app02
        server ip-10-244-145-130.ec2.internal weight=10 max_fails=3 fail_timeout=10s; # Reverse proxy to app03
      }
    
      server {
        listen 80;
        add_header X-Whom node_1;
        location / {
          proxy_pass http://myapp;
        }
      }
    }
    This configuration is the same on both Nginx servers except for one thing: the add_header X-Whom node_1; line. It tells Nginx to inject an X-Whom header into each response, which identifies the Nginx load balancer that served the request. This will help with debugging later. On the second Nginx, we have add_header X-Whom node_2;.
  4. Install Heartbeat:
    #RHEL, CentOS
    yum install heartbeat
    
    #Debian, Ubuntu
    sudo apt-get install heartbeat-2
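
Since the two nginx.conf files in step 3 differ only in the X-Whom value, one way to avoid maintaining them by hand is to derive the second from the first. The sketch below is only an illustration: node_conf is a hypothetical helper name, and it assumes the node ID is a plain word such as node_2.

```shell
#!/bin/bash
# node_conf NODE_ID
# Reads an nginx.conf on stdin and writes it to stdout with the X-Whom
# header value replaced by NODE_ID. Hypothetical helper; assumes NODE_ID
# is a plain word such as node_2 (no "/" or "&" characters).
node_conf() {
  sed -E "s/(add_header[[:space:]]+X-Whom[[:space:]]+)[^;]+;/\1$1;/"
}

# Example: show the rewritten header line for the second load balancer.
printf '    add_header X-Whom node_1;\n' | node_conf node_2
```

On the second machine this could be used as node_conf node_2 < nginx.conf.main > nginx.conf, where nginx.conf.main is an assumed name for a copy of the main node's file.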

Configure Heartbeat: There are three things you need to configure to get Heartbeat working.
  1. ha.cf: /etc/ha.d/ha.cf is the main configuration file. It lists the nodes and the features to enable. The directives in it are order sensitive.
  2. authkeys: This file secures the cluster. It provides a means to authenticate the nodes in a cluster.
  3. haresources: The list of resources to be managed by Heartbeat. A line looks like this: preferredHost service1 service2, where preferredHost is the hostname on which you prefer the subsequent services to run. service1 and service2 are scripts that live in /etc/init.d/ and start or stop the service when called as /etc/init.d/service1 start|stop.

    When a node takes over, the services are started from left to right: service1 first, then service2. When a node gives up its resources, they are stopped in reverse: service2 first, then service1.
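
To make the ordering concrete, here is a small sketch that prints the calls in the order described above for a given haresources line. It only mimics the ordering for illustration; resource_order is a hypothetical name and this is not how Heartbeat itself is implemented.

```shell
#!/bin/bash
# resource_order ACTION PREFERRED_HOST SERVICE...
# Prints the init-script calls in the order described above:
# left to right for "start", right to left for "stop".
resource_order() {
  action="$1"
  shift            # drop ACTION
  shift            # drop PREFERRED_HOST; only services remain
  if [ "$action" = "start" ]; then
    for svc in "$@"; do
      echo "/etc/init.d/$svc start"
    done
  else
    reversed=""
    for svc in "$@"; do
      reversed="$svc $reversed"   # prepend to reverse the list
    done
    for svc in $reversed; do
      echo "/etc/init.d/$svc stop"
    done
  fi
}

resource_order start preferredHost service1 service2
resource_order stop  preferredHost service1 service2
```

The first call prints service1 then service2 with start; the second prints service2 then service1 with stop.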
---------
For the sake of simplicity, let's call one node the main node and the other the aux node.

On main node,
uname -n
ip-10-144-75-85

On aux node,
uname -n
ip-10-36-5-11

----
/etc/ha.d/ha.cf
On main node,
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85

On aux node, exactly the same as on the main node
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85

where,
  keepalive: seconds between heartbeat packets.
  deadtime: seconds after which a host is considered dead if it is not responding.
  warntime: seconds after which a late-heartbeat warning is issued (not set in the files above).
  initdead: seconds to wait for the other host after starting Heartbeat, before it is considered dead.
  udpport: port number used for bcast/ucast communication. The default value is 694.
  bcast/ucast: interface on which to broadcast/unicast.
  auto_failback: if ‘on’, resources are automatically failed back to their primary node.
  node: nodes in the HA setup, identified by uname -n.
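
A quick sanity check of these timing values can catch mistakes before Heartbeat is started. The sketch below is a hypothetical helper (check_ha_cf is not part of Heartbeat). The rules it applies are assumptions based on common guidance: deadtime larger than keepalive, initdead at least twice deadtime, and at least two node lines.

```shell
#!/bin/bash
# check_ha_cf: read an ha.cf on stdin, print "ok" or one line per problem.
# Hypothetical helper; the thresholds are rules of thumb, not hard limits.
check_ha_cf() {
  awk '
    $1 == "keepalive" { keepalive = $2 }
    $1 == "deadtime"  { deadtime  = $2 }
    $1 == "initdead"  { initdead  = $2 }
    $1 == "node"      { nodes++ }
    END {
      bad = 0
      if (deadtime <= keepalive)   { print "deadtime should exceed keepalive"; bad = 1 }
      if (initdead < 2 * deadtime) { print "initdead should be at least 2 x deadtime"; bad = 1 }
      if (nodes < 2)               { print "expected at least two node lines"; bad = 1 }
      if (!bad) print "ok"
      exit bad
    }'
}

# Example with the values used in this post.
check_ha_cf <<'EOF'
keepalive 2
deadtime 30
initdead 120
node ip-10-36-5-11
node ip-10-144-75-85
EOF
```

On a real node it would be run as check_ha_cf < /etc/ha.d/ha.cf.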
----

/etc/ha.d/authkeys must be the same on both machines. You can generate a random secret key with date|md5sum. Heartbeat also expects this file to be readable by root only (chmod 600).

authkeys on both the machines:
auth 1
1 sha1 1e8d28a4627ed7f83faf1d57f5b11645
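
As a sketch of how this file could be produced, the helper below generates a secret from /dev/urandom (a swapped-in alternative to date|md5sum, since the date is guessable) and writes the file with mode 600. write_authkeys is a hypothetical name, and the target path is passed in so it can be tried on a scratch file first.

```shell
#!/bin/bash
# write_authkeys TARGET
# Generates a random 32-character hex secret and writes an authkeys file
# to TARGET, readable by its owner only. Hypothetical helper.
write_authkeys() {
  target="$1"
  secret="$(head -c 64 /dev/urandom | md5sum | cut -d' ' -f1)"
  # Create the file under a restrictive umask so it is never world-readable.
  ( umask 077; printf 'auth 1\n1 sha1 %s\n' "$secret" > "$target" )
  chmod 600 "$target"
}

# Usage (as root, on one node; then copy the same file to the other node):
#   write_authkeys /etc/ha.d/authkeys
```

The file has to be generated once and copied, not generated separately on each node, because both nodes must share the same secret.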
----

/etc/ha.d/haresources

haresources on both the machines:
ip-10-144-75-85 updateEIP nginx
updateEIP and nginx are shell scripts that I wrote and stored under /etc/init.d/.

updateEIP
#!/bin/bash

# description: this script associates $eip to this instance.
# it assumes you have JDK and AWS API-tools installed.
# location of these are given as $JAVA_HOME and $EC2_HOME
# author: Nishant Neeraj

export EC2_HOME=/home/ec2
export JAVA_HOME=/usr/java/default

eip="50.17.213.152"
pk="/mnt/pk-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"
cert="/mnt/cert-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"

function updateEIP(){
        instance="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
        echo "Instace ID is: ${instance}"

        echo "Assingning $eip to ${instance}..."
        /home/ec2/bin/ec2-associate-address -K ${pk} -C ${cert} -i ${instance} ${eip}

        echo "done!"
}

param=$1

if [ "start" == "$param" ] ; then
  echo "Starting..."
  updateEIP
  exit 0
elif [ "stop" == "$param" ] ; then
  echo "stopping..."
  exit 0;
else
  echo "no such command $param"
  exit 1
fi
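
The stop branch above does nothing, so the EIP stays attached until the other node associates it. If you would rather release it explicitly, a function along these lines could be called from the stop branch. This is a sketch under assumptions: it assumes EC2-Classic, where the API tools' ec2-disassociate-address takes the IP address as its argument, and DRY_RUN is a hypothetical switch added here so the command can be inspected without touching AWS.

```shell
#!/bin/bash
# Sketch only: release the Elastic IP when Heartbeat calls "updateEIP stop".
# Same variables as the updateEIP script; DRY_RUN is a hypothetical switch.
export EC2_HOME=/home/ec2

eip="50.17.213.152"
pk="/mnt/pk-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"
cert="/mnt/cert-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"

releaseEIP() {
  cmd="${EC2_HOME}/bin/ec2-disassociate-address -K ${pk} -C ${cert} ${eip}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "would run: ${cmd}"
  else
    ${cmd}
  fi
}

# Try it safely first:
#   DRY_RUN=1; releaseEIP
```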

nginx
#!/bin/bash

function start(){
  echo "starting nginx..."
  /usr/local/nginx/sbin/nginx 
}

function stop(){
  echo "stoping nginx..."
  /usr/local/nginx/sbin/nginx -s stop
}

param=$1

if [ "start" == "$param" ] ; then
  start
  exit 0
elif [ "stop" == "$param" ] ; then
  stop
  exit 0
else
  echo "no such command: $param"
  exit 1
fi
----

Start services:
on main,
service heartbeat start
/etc/init.d/nginx start

on aux,
service heartbeat start

You can tail -f /tmp/ha-debug on these nodes to watch things rolling.
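
The debug file is noisy, so it can help to filter it down to the lines that matter during a failover. The sketch below is one possible filter; ha_events is a hypothetical name, and the patterns are assumptions taken from the log lines shown later in this post.

```shell
#!/bin/bash
# ha_events: keep only resource actions, warnings and takeover messages
# from a Heartbeat log read on stdin. Hypothetical helper.
ha_events() {
  grep --line-buffered -E \
    'ResourceManager.*info: (Acquiring|Releasing|Running)|WARN:|takeover complete'
}

# Usage:
#   tail -f /tmp/ha-debug | ha_events
```

The --line-buffered option makes grep print each matching line immediately when it is fed from tail -f.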

Test: Time to test. Note that Heartbeat is only watching nodes, not the Nginx service. So, to the heartbeat service on 'aux', we need to make it look as if the 'main' server is down.

Keep tailing the debug file on the 'aux' machine. It will show you the transition. Now it's time to kill the 'main' node. On main, do this:
service heartbeat stop
/etc/init.d/nginx stop

The debug file on 'aux' shows that 'aux' is taking over from 'main'. Now, to see how the service switches back when the main node comes up, start just the heartbeat service on main. You will see: a. aux: EIP gets detached (not really, since the updateEIP stop action does nothing, but you can make it do so), b. aux: Nginx stops, c. main: EIP is assigned, d. main: Nginx is started.

Note: While the EIP flips, you may get disconnected from your SSH session. Check the AWS web console for the newly assigned public DNS name of the node that lost the EIP, and use the EIP to connect to the node that now holds it.

Debug Logs: Here is what the log on the auxiliary machine looks like:
heartbeat[9133]: 2012/11/21_15:12:44 info: Received shutdown notice from 'ip-10-144-75-85'.
heartbeat[9133]: 2012/11/21_15:12:44 info: Resources being acquired from ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10055]: 2012/11/21_15:12:44 info: acquire local HA resources (standby).
heartbeat[10055]: 2012/11/21_15:12:44 info: local HA resource acquisition completed (standby).
heartbeat[9133]: 2012/11/21_15:12:44 info: Standby resource acquisition done [foreign].
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10056]: 2012/11/21_15:12:44 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ip-10-36-5-11] to acquire.
heartbeat[10081]: 2012/11/21_15:12:44 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10081]:    2012/11/21_15:12:44 info: Running /etc/ha.d/rc.d/status status
mach_down[10097]:    2012/11/21_15:12:44 info: Taking over resource group updateEIP
ResourceManager[10123]:    2012/11/21_15:12:45 info: Acquiring resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10123]:    2012/11/21_15:12:45 info: Running /etc/init.d/updateEIP  start
ResourceManager[10123]:    2012/11/21_15:12:45 debug: Starting /etc/init.d/updateEIP  start
Starting...
Instace ID is: i-6ff07d10
Assingning 50.17.213.152 to i-6ff07d10...
ADDRESS    50.17.213.152    i-6ff07d10
done!
ResourceManager[10123]:    2012/11/21_15:12:58 debug: /etc/init.d/updateEIP  start done. RC=0
ResourceManager[10123]:    2012/11/21_15:12:58 info: Running /etc/init.d/nginx  start
ResourceManager[10123]:    2012/11/21_15:12:58 debug: Starting /etc/init.d/nginx  start
starting nginx...
ResourceManager[10123]:    2012/11/21_15:12:58 debug: /etc/init.d/nginx  start done. RC=0
mach_down[10097]:    2012/11/21_15:12:58 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[10097]:    2012/11/21_15:12:58 info: mach_down takeover complete for node ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:58 info: mach_down takeover complete.
heartbeat[9133]: 2012/11/21_15:13:16 WARN: node ip-10-144-75-85: is dead
heartbeat[9133]: 2012/11/21_15:13:16 info: Dead node ip-10-144-75-85 gave up resources.
heartbeat[9133]: 2012/11/21_15:13:16 info: Link ip-10-144-75-85:eth0 dead.
...
heartbeat[9133]: 2012/11/21_15:19:59 info: Heartbeat restart on node ip-10-144-75-85
heartbeat[9133]: 2012/11/21_15:19:59 info: Link ip-10-144-75-85:eth0 up.
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status init
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status up
heartbeat[9133]: 2012/11/21_15:19:59 debug: StartNextRemoteRscReq(): child count 1
heartbeat[9133]: 2012/11/21_15:19:59 debug: get_delnodelist: delnodelist=
heartbeat[10262]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10262]:    2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[10278]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10278]:    2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: Status update for node ip-10-144-75-85: status active
heartbeat[10294]: 2012/11/21_15:20:00 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10294]:    2012/11/21_15:20:00 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:00 info: ip-10-36-5-11 wants to go standby [foreign]
heartbeat[9133]: 2012/11/21_15:20:01 info: standby: ip-10-144-75-85 can take our foreign resources
heartbeat[10310]: 2012/11/21_15:20:01 info: give up foreign HA resources (standby).
ResourceManager[10323]:    2012/11/21_15:20:01 info: Releasing resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10323]:    2012/11/21_15:20:01 info: Running /etc/init.d/nginx  stop
ResourceManager[10323]:    2012/11/21_15:20:01 debug: Starting /etc/init.d/nginx  stop
stoping nginx...
ResourceManager[10323]:    2012/11/21_15:20:01 debug: /etc/init.d/nginx  stop done. RC=0
ResourceManager[10323]:    2012/11/21_15:20:01 info: Running /etc/init.d/updateEIP  stop
ResourceManager[10323]:    2012/11/21_15:20:01 debug: Starting /etc/init.d/updateEIP  stop
stopping...
ResourceManager[10323]:    2012/11/21_15:20:01 debug: /etc/init.d/updateEIP  stop done. RC=0
heartbeat[10310]: 2012/11/21_15:20:01 info: foreign HA resource release completed (standby).
heartbeat[9133]: 2012/11/21_15:20:01 info: Local standby process completed [foreign].
heartbeat[9133]: 2012/11/21_15:20:18 WARN: 4 lost packet(s) for [ip-10-144-75-85] [17:22]
heartbeat[9133]: 2012/11/21_15:20:18 info: Other node completed standby takeover of foreign resources.
heartbeat[9133]: 2012/11/21_15:20:18 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:18 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:20 WARN: 1 lost packet(s) for [ip-10-144-75-85] [22:24]
heartbeat[9133]: 2012/11/21_15:20:20 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:28 WARN: 3 lost packet(s) for [ip-10-144-75-85] [24:28]
heartbeat[9133]: 2012/11/21_15:20:28 info: No pkts missing from ip-10-144-75-85! 

While testing, watch the debug logs and watch the X-Whom header.

You may want to use curl -I to confirm that the switch occurs. Here is what I see when I kill the main node, the aux node takes over, and finally the main node comes back:
~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:10:39 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=j1qov13qkfca1pxm6usm3ulyi;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1

~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:13:38 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=1ujszsyow5rc0ws4v2rir06sk;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_2

~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:22:09 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=khbf8ovc6nr1rwr4o3hzmflt;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1
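
Instead of running curl -I by hand, the check can be looped. The sketch below polls the EIP and prints a line whenever the serving node changes. whom and watch_whom are hypothetical helper names, and the two-second default interval is an arbitrary assumption.

```shell
#!/bin/bash
# whom: read HTTP response headers on stdin and print the X-Whom value.
whom() {
  tr -d '\r' | awk 'tolower($1) == "x-whom:" { print $2 }'
}

# watch_whom HOST [INTERVAL]
# Polls HOST with curl -I and reports each change of the serving node.
# Runs until interrupted with Ctrl-C.
watch_whom() {
  host="$1"
  interval="${2:-2}"
  last=""
  while true; do
    now="$(curl -sI --max-time 5 "$host" | whom)"
    if [ -z "$now" ]; then
      now="unreachable"
    fi
    if [ "$now" != "$last" ]; then
      echo "$(date '+%H:%M:%S') served by: $now"
      last="$now"
    fi
    sleep "$interval"
  done
}

# Usage:
#   watch_whom 50.17.213.152
```

During a failover, the "unreachable" window between the two node names gives a rough idea of how long the switch takes.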

Conclusion: We now have a basic HA configuration ready. It still needs toppings: something to monitor the services themselves and respond to their failures. To be useful, you need to add a Cluster Resource Management (CRM) layer over it.
