Friday, November 23, 2012

High-Availability Ngnix using Heartbeat on Amazon EC2

Background: I have mentioned earlier that we were baffled by Amazon's Elastic Load Balancer(ELB). We tested with Nginx as reverse proxy, it seems to be hands down better than ELB. (I will have Nginx performance in another post.) To scale this setup horizontally, we configured multiple Nginx with Weighted Round Robin (WRR) DNS configuration; to alleviate some load off Nginx server.

We later decided to have redundant Nginx server in N+1 setup, so that if one server goes down the redundant takes over. I stumbled on Heartbeat. Unfortunately there is no good tutorial on how this can be done on Amazon EC2.

Warning: I will keep this tutorial brief. Limited to Heartbeat configuration, as the Wiki mentioned:
In order to be useful to users, the Heartbeat daemon needs to be combined with a cluster resource manager (CRM) which has the task of starting and stopping the services (IP addresses, web servers, etc.) that cluster will make highly available. Pacemaker is the preferred cluster resource manager for clusters based on Heartbeat.
Goal: We need to make sure auxiliary node takes over, as soon as a main node dies. This is what should happen:
  1. We assign Elastic IP(EIP) to main node that runs Nginx, initially.
  2. When main node goes down, we want
    1. the aux node gets assingned the EIP. (refer /etc/init.d/updateEIP script below)
    2. the aux node's nginx starts up. (refer /etc/init.d/nginx script below)
  3. When main node comes back up,
    1. aux node relinquishes the EIP
    2. aux node shuts down Nginx
    3. main node acquires EIP.
    4. main node starts Nginx. 
Preparation: Assuming you have launched two instances of your favorite Unix flavor.
  1. Get elastic IP. We will use this to switch over.
    ec2-allocate-address
    ADDRESS 50.17.213.152 
  2. Install Nginx on both machines. I will install from source from here http://nginx.org/en/download.html
    wget http://nginx.org/download/nginx-1.2.5.tar.gz
    tar xvzf nginx-1.2.5.tar.gz 
    cd nginx-1.2.5
    ./configure 
    make
    make install
  3. Configure Nginx
    cd /usr/local/nginx/conf
    mv nginx.conf nginx.conf.orig
    vi nginx.conf

    My nginx.conf looks like this.
    worker_processes  2;
    worker_rlimit_nofile 100000;
    
    events {
        worker_connections  1024;
    }
    
    http {
    
      access_log off;
    
      upstream myapp {
        server ip-10-10-225-146.ec2.internal weight=10 max_fails=3 fail_timeout=10s;  # Reverse proxy to app01
        server ip-10-226-95-45.ec2.internal weight=10 max_fails=3 fail_timeout=10s;   # Reverse proxy to app02
        server ip-10-244-145-130.ec2.internal weight=10 max_fails=3 fail_timeout=10s; # Reverse proxy to app03
      }
    
      server {
        listen 80;
        add_header X-Whom node_1;
        location / {
          proxy_pass http://myapp;
        }
      }
    }
    This configuration is same on both Nginx server except one thing, the add_header X-Whom node_1; part. This is to identify which Nginx loadbalancer is serving. This will help to debug the situation later. On the second Nginx, we have add_header X-Whom node_2;. This line says Nginx to inject a header X-Whom to each response.
  4. Install Heartbeat:
    #RHEL, CentOS
    yum install heartbeat
    
    #Debian, Ubuntu
    sudo apt-get install heartbeat-2

Configure Heartbeat: There are three things you need to configure to get stuffs working with Heartbeat.
  1. ha.cf: /etc/ha.d/ha.cf is main configuration file. It has list of nodes, features to be enabled. And the stuffs in it is order sensitive. 
  2. authkeys: It is basically to maintain security of cluster. It provide a mean to authenticate nodes in a cluster. 
  3. haresources: List of resources to be managed by heartbeat. It looks like this preferredHost service1 service2. where preferredHost is the hostname where you prefer the subsequent services to be executed. service1 service2 are the service that stay inside /etc/init.d/ and stops and starts the service when called in /etc/init.d/service1 stop|start.

    When a node is woken up, the service is called from left to right. Means service1 first; and then service2. And when a node is brought down service2 is terminated first, and then service1.
---------
 For sake of simplicity lets call one node main node and the other aux node.

On main node,
uname -n
ip-10-144-75-85

On aux node,
uname -n
ip-10-36-5-11

----
/etc/ha.d/ha.cf
On main node,
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85

On aux node, exact as on main node
logfile /tmp/ha-log
debugfile /tmp/ha-debug
logfacility local0
keepalive 2
deadtime 30
initdead 120
udpport 694
ucast eth0 10.36.5.11
ucast eth0 10.144.75.85
auto_failback on
node ip-10-36-5-11
node ip-10-144-75-85

where,
  deadtime: seconds after which a host is considered dead if not responding.
  warntime: seconds after which the late Heartbeat warning will be issued.
  initdead: seconds to wait for the other host after starting Heartbeat, before it is considered dead.
  udpport: port number used for bcast/ucast communication. The default value for this is 694.
  bcast/ucast: interface on which to broadcast/unicast.
  auto_failback: if ‘on’, resources are automatically failed back to its primary node.
  node: nodes in the HA set-up, dentified by uname -n.
----

/etc/ha.d/authkeys must be the same on both the machines. You may generate a random auth-secretkey using, date|md5sum

authkeys on both the machines:
auth 1
1 sha1 1e8d28a4627ed7f83faf1d57f5b11645
----

/etc/ha.d/haresources

haresources on both the machines:
ip-10-144-75-85 updateEIP nginx
updateEIP and nginx are shell script that I wrote and stored under /etc/init.d/.

updateEIP
#!/bin/bash

# description: this script associates $eip to this instance.
# it assumes you have JDK and AWS API-tools installed.
# location of these are given as $JAVA_HOME and $EC2_HOME
# author: Nishant Neeraj

export EC2_HOME=/home/ec2
export JAVA_HOME=/usr/java/default

eip="50.17.213.152"
pk="/mnt/pk-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"
cert="/mnt/cert-2ITPGLG6XXXXXXXXXXXXQUEK2PUOVFJB.pem"

function updateEIP(){
        instance="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
        echo "Instace ID is: ${instance}"

        echo "Assingning $eip to ${instance}..."
        /home/ec2/bin/ec2-associate-address -K ${pk} -C ${cert} -i ${instance} ${eip}

        echo "done!"
}

param=$1

if [ "start" == "$param" ] ; then
  echo "Starting..."
  updateEIP
  exit 0
elif [ "stop" == "$param" ] ; then
  echo "stopping..."
  exit 0;
else
  echo "no such command $param"
  exit 1
fi

nginx
#/bin/bash

function start(){
  echo "starting nginx..."
  /usr/local/nginx/sbin/nginx 
}

function stop(){
  echo "stoping nginx..."
  /usr/local/nginx/sbin/nginx -s stop
}

param=$1

if [ "start" == "$param" ] ; then
  start
  exit 0
elif [ "stop" == "$param" ] ; then
  stop
  exit 0
else
  echo "no such command: $param"
  exit 1
----

Start services:
on main,
service heartbeat start
/etc/init.d/nginx start

on aux,
service heartbeat start

You can tail -f /tmp/ha-debug on these nodes to watch things rolling.

Test: Time to test. Note that we are just watching nodes, not Nginx service. So, we need to make heartbeat service on 'aux' look like as if 'main' server is down.

Keep tailing the debug file in 'aux' machine. This will show you the transition. Now, it's time to kill 'main' node. On main, so this:
service heartbeat stop
/etc/init.d/nginx stop

You can see the debug file on 'aux' shows that 'aux' is taking over 'main'. Now, to see how service switches back to main node when it comes up. Start just the heartbeat service, you will see: a. aux: EIP gets detached. (not really, but you can), b. aux: nginx stops, c. main: EIP is assigned, d. main: Nginx is started.

Note: While we flip EIPs, you may get disconnected from SSH connection. So, keep looking into AWS web console to get new assigned public DNS for node which we revoked the EIP from. And use EIP to connect to the node which assigned it.

Debug Logs: Here is how log on auxiliary machine look like
heartbeat[9133]: 2012/11/21_15:12:44 info: Received shutdown notice from 'ip-10-144-75-85'.
heartbeat[9133]: 2012/11/21_15:12:44 info: Resources being acquired from ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10055]: 2012/11/21_15:12:44 info: acquire local HA resources (standby).
heartbeat[10055]: 2012/11/21_15:12:44 info: local HA resource acquisition completed (standby).
heartbeat[9133]: 2012/11/21_15:12:44 info: Standby resource acquisition done [foreign].
heartbeat[9133]: 2012/11/21_15:12:44 debug: StartNextRemoteRscReq(): child count 1
heartbeat[10056]: 2012/11/21_15:12:44 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ip-10-36-5-11] to acquire.
heartbeat[10081]: 2012/11/21_15:12:44 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10081]:    2012/11/21_15:12:44 info: Running /etc/ha.d/rc.d/status status
mach_down[10097]:    2012/11/21_15:12:44 info: Taking over resource group updateEIP
ResourceManager[10123]:    2012/11/21_15:12:45 info: Acquiring resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10123]:    2012/11/21_15:12:45 info: Running /etc/init.d/updateEIP  start
ResourceManager[10123]:    2012/11/21_15:12:45 debug: Starting /etc/init.d/updateEIP  start
Starting...
Instace ID is: i-6ff07d10
Assingning 50.17.213.152 to i-6ff07d10...
ADDRESS    50.17.213.152    i-6ff07d10
done!
ResourceManager[10123]:    2012/11/21_15:12:58 debug: /etc/init.d/updateEIP  start done. RC=0
ResourceManager[10123]:    2012/11/21_15:12:58 info: Running /etc/init.d/nginx  start
ResourceManager[10123]:    2012/11/21_15:12:58 debug: Starting /etc/init.d/nginx  start
starting nginx...
ResourceManager[10123]:    2012/11/21_15:12:58 debug: /etc/init.d/nginx  start done. RC=0
mach_down[10097]:    2012/11/21_15:12:58 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[10097]:    2012/11/21_15:12:58 info: mach_down takeover complete for node ip-10-144-75-85.
heartbeat[9133]: 2012/11/21_15:12:58 info: mach_down takeover complete.
heartbeat[9133]: 2012/11/21_15:13:16 WARN: node ip-10-144-75-85: is dead
heartbeat[9133]: 2012/11/21_15:13:16 info: Dead node ip-10-144-75-85 gave up resources.
heartbeat[9133]: 2012/11/21_15:13:16 info: Link ip-10-144-75-85:eth0 dead.
...
heartbeat[9133]: 2012/11/21_15:19:59 info: Heartbeat restart on node ip-10-144-75-85
heartbeat[9133]: 2012/11/21_15:19:59 info: Link ip-10-144-75-85:eth0 up.
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status init
heartbeat[9133]: 2012/11/21_15:19:59 info: Status update for node ip-10-144-75-85: status up
heartbeat[9133]: 2012/11/21_15:19:59 debug: StartNextRemoteRscReq(): child count 1
heartbeat[9133]: 2012/11/21_15:19:59 debug: get_delnodelist: delnodelist=
heartbeat[10262]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10262]:    2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[10278]: 2012/11/21_15:19:59 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10278]:    2012/11/21_15:19:59 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: Status update for node ip-10-144-75-85: status active
heartbeat[10294]: 2012/11/21_15:20:00 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
harc[10294]:    2012/11/21_15:20:00 info: Running /etc/ha.d/rc.d/status status
heartbeat[9133]: 2012/11/21_15:20:00 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:00 info: ip-10-36-5-11 wants to go standby [foreign]
heartbeat[9133]: 2012/11/21_15:20:01 info: standby: ip-10-144-75-85 can take our foreign resources
heartbeat[10310]: 2012/11/21_15:20:01 info: give up foreign HA resources (standby).
ResourceManager[10323]:    2012/11/21_15:20:01 info: Releasing resource group: ip-10-144-75-85 updateEIP nginx
ResourceManager[10323]:    2012/11/21_15:20:01 info: Running /etc/init.d/nginx  stop
ResourceManager[10323]:    2012/11/21_15:20:01 debug: Starting /etc/init.d/nginx  stop
stoping nginx...
ResourceManager[10323]:    2012/11/21_15:20:01 debug: /etc/init.d/nginx  stop done. RC=0
ResourceManager[10323]:    2012/11/21_15:20:01 info: Running /etc/init.d/updateEIP  stop
ResourceManager[10323]:    2012/11/21_15:20:01 debug: Starting /etc/init.d/updateEIP  stop
stopping...
ResourceManager[10323]:    2012/11/21_15:20:01 debug: /etc/init.d/updateEIP  stop done. RC=0
heartbeat[10310]: 2012/11/21_15:20:01 info: foreign HA resource release completed (standby).
heartbeat[9133]: 2012/11/21_15:20:01 info: Local standby process completed [foreign].
heartbeat[9133]: 2012/11/21_15:20:18 WARN: 4 lost packet(s) for [ip-10-144-75-85] [17:22]
heartbeat[9133]: 2012/11/21_15:20:18 info: Other node completed standby takeover of foreign resources.
heartbeat[9133]: 2012/11/21_15:20:18 info: remote resource transition completed.
heartbeat[9133]: 2012/11/21_15:20:18 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:20 WARN: 1 lost packet(s) for [ip-10-144-75-85] [22:24]
heartbeat[9133]: 2012/11/21_15:20:20 info: No pkts missing from ip-10-144-75-85!
heartbeat[9133]: 2012/11/21_15:20:28 WARN: 3 lost packet(s) for [ip-10-144-75-85] [24:28]
heartbeat[9133]: 2012/11/21_15:20:28 info: No pkts missing from ip-10-144-75-85! 
watch debug logs, watch header

You may want to use curl -I to ensure if switching occures. Here is what I see when I kill main node, aux node takes overs, and finally main node comes back:
~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:10:39 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=j1qov13qkfca1pxm6usm3ulyi;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1

~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:13:38 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=1ujszsyow5rc0ws4v2rir06sk;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_2

~$ curl -I 50.17.213.152
HTTP/1.1 200 OK
Server: nginx/1.2.5
Date: Wed, 21 Nov 2012 15:22:09 GMT
Content-Type: text/html
Content-Length: 0
Connection: keep-alive
Set-Cookie: JSESSIONID=khbf8ovc6nr1rwr4o3hzmflt;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
X-Whom: node_1

Conclusion: So we have basic HA configuration ready, we need to add toppings to this setup to be able to monitor services and respond them. You need to add a Cluster Resource Management (CRM) layer over it to be useful.

Wednesday, October 24, 2012

Adding First Slave to Running MySQL

Problem: We had a MySQL server running for sometime now, we wanted to add a slave to it.

Steps: It's really simple.
  1. Configure server-id and binary logging: vi /etc/my.cnf. If there exists none, create one.
    [mysqld]
     
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    user=mysql
     
    # master id
    server-id=1
     
    # binary logging for replication)
    log-bin=mysql-bin
     
     
    # Default to using old password format for compatibility with mysql 3.x
    # clients (those using the mysqlclient10 compatibility package).
    old_passwords=1
     
    # Disabling symbolic-links is recommended to prevent assorted security risks;
    # to do so, uncomment this line:
    # symbolic-links=0
     
    [mysqld_safe]
    log-error=/var/log/mysql/mysqld.log
    pid-file=/var/run/mysqld/mysqld.pid
    restart MySQL service. service mysql restart
  2. Create a replicator user in master:
    GRANT REPLICATION SLAVE ON *.* TO 'replicator' IDENTIFIED BY 'replicatorPassword';
    FLUSH PRIVILEGES;
  3. Get master binary log position:
    mysql> SHOW MASTER STATUS;
    +------------------+----------+--------------+------------------+
    | File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +------------------+----------+--------------+------------------+
    | mysql-bin.000013 |     107|              |                  |
    +------------------+----------+--------------+------------------+
    1 row in set (0.00 sec)

  4. Take a MySQL dump:
    mysqldump --user USERNAME --password=PASSWORD --add-locks --flush-privileges \ 
    --add-drop-table --complete-insert --extended-insert --single-transaction  --database DBNAME \
    -h MASTER_HOST_IP_DNS > backup.sql
    Stop the master MySQL at this point to avoid data alteration.
  5. Configure slave: Keep the configuration same as master #1, except give it a different server-id. You may not include log-bin in slave configuration. But it is good idea to keep it. It will be useful in case where you want to make this slave master.

    Restart slave and login to it. Drop the database which you have taken dump of, if exists; and recreate it. Load the db dump.
    mysql -u USERNAME -pPASSWORD -h SLAVE_IP_DNS DBNAME < backup.sql
  6. Setup replication: Login to slave MySQL. Configure replication:
    CHANGE MASTER TO
       MASTER_HOST='MASTER_IP_OR_DNS',
       MASTER_USER='replicator',
       MASTER_PASSWORD='replicatorPassword',
       MASTER_LOG_FILE='mysql-bin.000013',
       MASTER_LOG_POS=107; 
        
    START SLAVE;


    And start master.
Done!

Tuesday, October 23, 2012

Key Learning from AWS Outage

"A service outage for Amazon’s Web Service (AWS) affected websites like Reddit, Flipboard, FastCompany, Heroku, Airbnb, Imgur, Pinterest and others late last night. It also seems that the outage affected some of Amazon’s storage units on the US East coast, and thus these websites which are hosted by AWS went down." - FirstPost

This was pretty exciting night of US Prez debate on October 22, 2012. We were having all our infrastructure and code-base made of solid steel. At least that's what we thought, until we realized our infrastructure wasn't as solid as Amazon claimed. About six hours before the the debate, 3PM EDT we started to see smaller glitches. And we realized that we can't reboot one of our EBS backed servers. We quickly launched another, but then ELB (Elastic Load Balancer) was was paralyzed. Then complete infrastructure started to fall like dominoes.

What went down? Some of the AWS's most marketed and claimed to be the best alternatives available, failed.
  1. ELB: Elastic load balancers hit hard. This is the gateway to our 20+ servers. It was hosed. It was unpredictably throwing instance in-service/out-of-service. And complete blackouts. We launched new ELBs, no avail.

    On one of the ELBs, we had to continuously re-save the instances in ELB to get them active once ELB marks them as unhealthy and throws out. (They were healthy).

    If you think the last para was bad. Prepare for the worse. One other load balancer just went inactive. Showing registering instances. Forever. Nothing worked. Then we launched another ELB, in the healthy zone us-east-1a, same problem for next 4 hour.
  2. EBS Backed Instances:  They were complete zombies. A couple if EBS backed instances would just do not do anything. SSH access, Nagios' NRPE checks all gone.
  3. EBS Volumes: Talk about tall claims, talk about EBS. Amazon market it as a better alternative to instance store backed. And this is not the first time EBS gave us a pain in the neck. It's slow for very fast IO like we do on Cassandra. It's expensive. And opposite to what Amazon says, it's not rock solid yet. The only guarantee you get is that your EBS volume will not be terminated unexpectedly. So, you will have your data safe. But, as we found, it may be unavailable/inaccessible for long enough time to spoil the game. In fact, one of the EBS volumes attached to our MySQL could not be detached until now (about 10 hours). And I do not know what's going to happen. We terminated the instance, tried force-detach. No luck.
How did we survive? We were little overcautious.
  1. Copy of a copy of a copy: We just needed one MySQL server. And stuffs were cached efficiently in MemcacheD. So, we initially thought that having one instance-store backed AMI with MySQL data directory in external EBS volume is safe enough. Since AWS says EBS won't lose data. Later, our CTO suggested to have a slave, created in much the same way. It would just be an extra layer of security.

    It turns out that we could not use MySQL master. No access. Pending queries. The data containing EBS volume was inaccessible. Cloudwatch was showing no R/RW active while we were loading MySQL with decent load. After 90 minutes of exercise, we realized that shit hit the fan. we turned our slave into new master. One front fixed.

    We were lucky that slave was not screwed the same way as master did. It was quite likely that slave would fail. It had the same configuration, lived in same availability zone.
  2. Never break a protocol: One of very crucial server was having EBS root device (not a instance-store with extra EBS volume like the last case). This machine went into coma. This was a complete setup of Wordpress, with all Wordpress tweaking, plug-ins, custom code, Apache mod-rewrite, mod-proxy el al, and MySQL.

    The survival trick lies in one of consistent behavior: whenever make any change to instance, create an AMI of it. So, we relaunched the the server from latest AMI. But data? Our daily backup script was just ready for this particular situation. So we launched a new instance, took the latest MySQL-dump file. Loaded. Updates DNS records. back up.
  3. ELB Outage: ELB outage made us helpless. There is nothing much that one can do with ELB. We were lucky that the ELB on top of our main application was not completely dead. ELB was just going crazy and marking instances out of service -- randomly. So, I jot down to the ELB screen  and refresh it regularly. If instances go out of service, I will re-save, and ELB get back up.

    You must be wondering if our instances were sick, or health-check was failing. None. These were instance-store machines, unaffected by the outage. The direct access to machines were fast and correct. There are two parts of it. 1. In past, I have observed ELB does not scale, with a surge of traffic. It ramp up slowly. I observed that during load tests. High latency of connection via ELB compared to direct access was also observed. 2. ELBs were sick during the outage as per Amazon's status.
Key Takeaway:
  1. EBS ain't that great. Stick to instance store, fault tolerant design. 
  2. Backup. Everything that matters, should be backed up in such a way they can be reloaded really fast when  required.
  3. Spread out. If you can, scatter instances in different availability zone. The best system would have the system designed in such a way that if an availability zone goes complete berserk, the user wouldn't feel the impact.
  4. ELB can be SPOF. If ELB is entry point to your application. It's single point of failure. Even if there is no downtime, we have seen poor scaling/slow ramp up on surge traffic. I am not sure if I have another alternative. But I am actively looking.

Monday, October 22, 2012

Dutch National Flag Problem

Here is the problematic flag:


It certainly does not says what the problem is. It's actually a three way partitioning problem.

Definition: An array A[1..n] needs to be rearranged in such a way that moving from left to right, we see three subgroups in this order: (-∞, low), [low, high], (high, ∞).

Algorithm: I came to this solution while working on a StackOverflow question about rearranging an array in such that all the elements less than zero are in left zero; and all the elements greater than zero lies to the right of it. My initial intuition was that I can do it by the improvising the partition part of Quicksort mechanism. It turns out while partitioning works for single pivot element. If you have a dupe of pivot, it may not work.

So, here is algorithm for three way partitioning:
A[1..n]

def three-way-part (A, low, high):
  left = 0
  right = n+1
  i = 1
  
  while i < right :
    if A[i] < low :
      swap(i, ++left)
    else if A[i] > high :
      swap(i, --right)
    else :
      i++
Complexity: Time complexity: O(n), space complexity: O(1)

Footnote:
[1] View longer discussion at Wikipedia
[2] View/download Java code from GitHub

Monday, October 15, 2012

Red Black Tree: Insertion

If this is your first attempt to understand red-black tree, please watch this video lecture. It will give you a solid foundation. Next, before you start to this tutorial, make sure:
  1. You are not in a hurry. (In that case, just use the code from the link in footnote)
  2. You do not get impatient.
  3. You do not bother why it works. It works. Once you get the algorithm and code it, it will be much simpler to visualize how it works. But, for 'why' -- you may need to research.
So, here we go

One Aim: Approximately balanced tree creation. From root to leaf, no (simple) path is more than twice as long as any other.

Two Rules: (CLRS says five, but others are trivial)
  1. Color Constraint: Red nodes will have both of it's children black. Black nodes can have either.
  2. Number Constraint: For each node, simple path to all the leaves will have same numbers of black children.
Three Situations: Couple of points to be noted before we see the situations
  1. Wherever I refer Uncle, it means (same as in common life) -- "sibling of parent node", "the other child of node's parent's parent".
  2. We assume as NIL nodes of all the leaves are black.
  3. We always insert red node. So, We are not breaking #2: Number Constraint by adding a black node, but we are breaking constaint #1: Color Constraint.
  4. Keep the root node black. You will see, it makes life easier. It does not breaks any of the constraints.
Situations:
  1. A Red uncle is a good uncle: If newly inserted node has red uncle, we can fix the situation just by fixing colors. Invert parent, grand parent, and uncle's colors. Now we need to check if grandparent is satisfying the Color Constraint.

  2. A black uncle on opposite side wants one rotation: If the uncle is black, we got to rotate. A slightly better case is when the newly inserted node is left child of its parent and uncle is right child of its parent or vice versa. In that case, you make a left-rotation or a right-rotation (rotations will be discussed later in this post). And then fix colors.

  3. A black uncle on the same side is double as dangerous: A case when newly inserted node is right child and the black uncle is also right child or both are left children, you need to make a left-rotation if inserted node is right-child or right rotation if newly inserted node is left child. This will make it situation 2.

The good thing about finding a black uncle is your quest ends there. No need to look further up the tree.

Conceptually, you are done here. All left now is some slight minor details about plumbing rotation and a couple of fine prints in algorithm like some times rotation may cause root node reassignment, how we treat a NIL node a black node, etc.

Rotation: A rotation is basically to swap a parent and child such that the BST property still holds. So, why do we call it rotation? Well, the process mutates the tree in such a way that it looks, to a casual observer, as if the tree just did a Toe Loop Jump. (well, this is the best I can explain.)

People claim there are two type of rotations -- right, when left child swaps with it's parent (i.e. moves to right); and left rotation, when a right child swaps with it's parent. Basically, if you start with a tree and right rotate a child, then left rotate the original parent the tree stays unchanged. (Figure out how and why the nodes are snipped and pasted the way they are.)



The algorithm: (Please note the below is not a valid Python code, I have use Python-esque code so that code highlighter shows key areas)
root = NIL

def insert (node)
  if root = NIL
    root = node
    root.color = BLACK
    return
    
  parent = root
  current = root
  
  while current != NIL
    parent = current
    if node.value < root.value 
      current = current.left
    else
      current = current.right
      
  if node.value < parent.value
    parent.left = node
  else
    parent.right = node
    
  node.color = RED
  fixTree(node)
  
def fixTree (node)
  
  if node.parent == RED
    uncle = getUncle(node)
    
      
    #case 1: Uncle Red
    if uncle.color = RED
      node.parent.color = BLACK
      node.parent.parent.color = RED
      uncle.color = BLACK
      fixTree(node.parent.parent)
    
    #Case 2, 3: Sirius Black.
    else if parentInLeftSubTree(node)
      #we are in left subtree, uncle is right child
      #case 3: rotate to case 2
      if node.parent.right = node
        rotateLeft(node)
        
      #case 2: color fix and rotate
      node.color = BLACK
      node.parent = RED
      rotateRight(node)
        
    else
      #we are in right subtree, uncle is left child
      #case 3: rotate to case 2
      if node.parent.left = node
        rotateRight(node)
        
      #case 2: color fix and rotate
      node.color = BLACK
      node.parent = RED
      rotateLeft(node)
      
    root.color = BLACK
  
def getUncle (node)
  if node = NIL or node.parent = NIL
    return NIL
    
  parent = node.parent
  grandParent = parent.parent
  if grandParent = NIL
    return NIL
    
  if parent = grandParent.left
    return grandParent.right
  return 
    return grandParent.left
    
def parentInLeftSubTree (node)
    if node.parent = node.parent.parent.left    
      return true
    else
      return false

"""
 *         g                               g
 *        /                               /
 *       x        Left Rotation(y)       y
 *      / \      ---------------->      / \
 *  alpha  y     <----------------     x   gamma
 *        / \    Right Rotation(x)    / \
 *     beta gamma                 alpha  beta
"""
 
def rotateLeft(y)
   x = y.parent
   g = x.parent
   alpha = x.left
   beta = y.left
   gamma = y.right
   
   y.parent = g
   if g != NIL
     if g.left = x
       g.left = y
     else
       g.right = y
   else
     root = y
     
   x.parent = y
   y.left = x 
   
   x.right = beta
   if beta != NIL
     beta.parent = x
   
def rotateRight(x)
  y = x.parent
  g = y.parent
  alpha = x.left
  beta = x.right
  gamma = y.right
  
  x.parent = g
  if g != NIL
    if g.left = y
      g.left = x
    else
      g.right = x
  else
    root = x
    
  x.right = y
  y.parent = x
  
  y.left = beta
  if beta != NIL
    beta.parent = y
Footnotes:

[1] View a foundation video lecture. it worth watching.
[2] Simulate your input on of a red-black tree creation -- here
[3] View/download Java code for red-black tree from GitHub
[4] I haven't referred much from this chapter, but it may worth reading: Introduction to Algorithms: CLRS, chapter 13: Red-Black Trees
[5] The execution of the code for a sorted input a = {8, 11, 14, 18, 22, 23, 31, 37, 41, 47, 60, 74, 84, 87, 88, 97, 99}, generates the following tree