The maximum heap size for a 32-bit or 64-bit JVM looks easy to determine by looking at the addressable memory space: 2^32 (4GB) for a 32-bit JVM and 2^64 for a 64-bit JVM. The confusion starts because you cannot actually set 4GB as the maximum heap size for a 32-bit JVM using the -Xmx option; you will get a "Could not create the Java virtual machine. Invalid maximum heap size: -Xmx" error. There are several reasons why the maximum heap space for a JVM is less than its theoretical limit, and they vary from one operating system to another, e.g. the limits differ between Windows, Linux and Solaris. I have seen a couple of comments on my post 10 points on Java Heap Space asking what the maximum heap space is for a 32-bit or 64-bit JVM, and why Windows allows only up to about 1.6GB as maximum heap space. In this Java article, I have collected some of the frequently asked questions about maximum heap space on both 32-bit and 64-bit JVMs and tried to answer them.


Maximum Java Heap Space on 32 and 64 bit JVM

Here is a list of common points of confusion I have seen among Java programmers regarding the maximum heap space of 32-bit and 64-bit Java Virtual Machines:

  1. What is the maximum heap size for a 32-bit JVM? 2GB or 4GB?
  2. Why is my JVM not able to start on Windows when the maximum heap space is around 1600M?
  3. Why do Linux or Solaris allow a larger maximum heap size than Windows for the same 32-bit JVM?
  4. Can we set more than 4GB as the maximum heap size for a 32-bit JVM running on a 64-bit (x64) operating system?
  5. What is the maximum heap size for a 64-bit (x64) JVM? Is it 8GB or 16GB?
  6. Can I specify more than 1GB as heap space if the physical memory is less than 1GB?

If you have similar confusion about the JVM maximum heap space, whether for your own Java application or for a Java web or application server like Tomcat, JBoss or WebLogic, this discussion applies to all of them.

What is the maximum heap size for a 32-bit JVM? 2GB or 4GB?

This confusion comes from the sign bit: many programmers think in terms of signed integers and assume the maximum addressable memory for a 32-bit architecture is 2^31 bytes, i.e. 2GB, and this confusion is reinforced by the fact that you cannot set the maximum heap space to 2GB on a Windows machine. But this is wrong. Memory addressing has nothing to do with signed or unsigned values, as there is no negative memory address. So the theoretical limit for the maximum heap size on a 32-bit JVM is 4GB, and for a 64-bit JVM it is 2^64 bytes.
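As a quick, hedged illustration (actual limits vary by operating system, JVM vendor and build), you can experiment with the -Xmx flag directly:

java -Xmx4g -version       (normally accepted by a 64-bit JVM)
java -Xmx1500m -version    (a realistic upper bound for a 32-bit JVM on Windows, as discussed in the next question)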

Why is the JVM not able to start on Windows XP when the maximum heap space is around 1600M?

This problem is most obvious on Windows platforms like Windows XP, where the JVM tries to allocate a contiguous chunk of memory for the heap requested by the -Xmx parameter.

Windows reserves some of the address space for its own use and also appears to map system DLLs around the middle of the address range, which reduces the largest contiguous block to somewhat less than 2GB, around 1500 to 1600M. When you request more than this, the JVM fails with an error such as:

Could not create the Java virtual machine. Invalid initial heap size: -Xms1.5G

Remember, this limit on heap space is due to the Windows operating system's own behavior. You can set a larger maximum heap space on Linux or Solaris, though the maximum heap size for a 32-bit or 64-bit JVM will always be less than the theoretical limit of addressable memory. By the way, you can get this error for many different reasons; see How to fix Invalid Initial and Maximum heap size in JVM for more details.

Why do Linux or Solaris allow a larger maximum heap size than Windows for the same 32-bit JVM?

This point is related to the previous one. There could be multiple reasons, but I think it comes down to Windows needing to allocate the Java heap as a contiguous chunk of memory. I would be happy to hear your opinion on this.

Can we set more than 4GB as the maximum heap size for a 32-bit JVM running on a 64-bit (x64) operating system?

This is a tricky question because you are running a 32-bit JVM on an x64 server. In my opinion you can set up to 4GB for a 32-bit JVM, but not more than that. Even though an x64 server has more memory at its disposal and every process could in theory address up to 2^64 bytes, it may look perfectly fine for a 32-bit JVM to accept 4GB as the maximum heap size. In practice, I have tried setting the maximum heap size to 4G on both Linux and Solaris servers and it was not accepted. Solaris comes closest to 4GB, allowing approximately 3.6G.

What is the maximum heap size for a 64-bit (x64) JVM? Is it 8GB or 16GB?

This question mostly arises because of the physical memory available on the machine. No system currently has 2^64 bytes of physical memory, and even high-end servers typically have around 8GB, 16GB or 32GB of RAM. The theoretical maximum for an x64 machine is 2^64 bytes, but again it depends on how much your operating system allows. I have read somewhere that Windows allowed a maximum of 32GB for a 64-bit JVM.

Can I specify more than 1GB as heap space if physical memory is less than 1GB?

Theoretically yes, because the operating system can use virtual memory and swap pages between physical memory and swap space when physical memory runs out. Practically, how far you can go depends on the operating system; on Windows, I have run a Java program with -Xmx1124M even though my machine had less than 1GB of RAM.

That’s all on the maximum Java heap space for 32-bit and 64-bit JVMs. As you can see, the maximum heap size depends on the host operating system. Solaris and Linux provide more heap space than Windows, and that could be one of the many reasons why Java server applications mostly run on UNIX-based systems. Let me know your thoughts and experience with the maximum Java heap space for x86 and x64 JVMs running on both x86 and x64 machines.


Preparing the nodes

For each node of the cluster (master and slaves) do the following:
superuser@master:~$ sudo apt-get install sun-java6-jdk

superuser@master:~$ sudo addgroup hadoop
superuser@master:~$ sudo adduser --ingroup hadoop hadoop

Create hadoop directory in all nodes and give appropriate permissions:

superuser@master:~$ sudo mkdir /usr/local/hadoop
superuser@master:~$ sudo chown -R hadoop:hadoop /usr/local/hadoop

superuser@slave1:~$ .... (execute the above 5 commands for each slave node)

Establishing authentication from master to slave nodes

The next step is to set up SSH key authentication from the master to each slave node. It is imperative that the following commands be executed as the hadoop user on the master node.

By default, if a user on NodeA wants to log in to a remote NodeB using SSH, they will be asked for NodeB’s password for authentication. However, it is impractical to enter the password every time the master node needs to operate on a slave node. In this situation, we adopt public key authentication. Simply speaking, every node generates a public/private key pair, and NodeA can log in to NodeB without password authentication as long as NodeB has a copy of NodeA’s public key. In other words, if NodeB has NodeA’s public key, NodeA is a trusted node for NodeB.

In a Hadoop cluster, all the slave nodes must have a copy of the master node’s public key. In the following section we will discuss how to generate the keys and how to authenticate the master node.

Generate keys:

Login to each node with the account “hadoop” and run the following command:

hadoop@master:~$ ssh-keygen -t rsa

hadoop@slave_i:~$ ssh-keygen -t rsa

....

This command generates the public/private key pair. The -t option specifies the key type; here we use the RSA algorithm. When questions are asked, simply press Enter to continue. Two files, id_rsa and id_rsa.pub, are then created under the folder /home/hadoop/.ssh/.

Establish authentications:

Now we can copy the master node’s public key to all the slave nodes. Log in to the master node with the hadoop account and run the following commands:

hadoop@master:~$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
hadoop@master:~$ scp /home/hadoop/.ssh/id_rsa.pub ip_address_of_slavenode_i:/home/hadoop/.ssh/master.pub

The second command should be executed once per slave node until the public key has been copied to all of them. Please note that ip_address_of_slavenode_i should be replaced with the hostname or IP address of slave node i.
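If the slave hostnames are known, the copy can be scripted in one line; here is a minimal sketch, assuming hypothetical hostnames slave1, slave2 and slave3:

hadoop@master:~$ for node in slave1 slave2 slave3; do scp /home/hadoop/.ssh/id_rsa.pub ${node}:/home/hadoop/.ssh/master.pub; done

On each slave node, the received master.pub is then appended to authorized_keys: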

hadoop@slave_i:~$ cat /home/hadoop/.ssh/master.pub >> /home/hadoop/.ssh/authorized_keys

Then we can log in from the master node to each slave node with the hadoop account and run the following command:

ssh ip_address_of_slavenode_i

to test whether the master node can log in to the slave nodes without password authentication.

E.g.

hadoop@master:~$ ssh slave1

Hadoop installation on Master node

Download the latest version of Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hadoop user and group, for example:

superuser@master:~$ cd /usr/local
superuser@master:/usr/local$ sudo wget http://apache.cc.uoc.gr/hadoop/core/...-0.20.2.tar.gz
superuser@master:/usr/local$ sudo tar xzf hadoop-0.20.2.tar.gz
superuser@master:/usr/local$ sudo mv hadoop-0.20.2 hadoop
superuser@master:/usr/local$ sudo chown -R hadoop:hadoop hadoop

Hadoop configuration in the master node

Go to the hadoop/conf folder:

hadoop@master:~$ cd /usr/local/hadoop/conf

hadoop-env.sh

The only required environment variable we have to configure is JAVA_HOME. Open <HADOOP_INSTALL>/conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change

 # The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

conf/masters

The conf/masters file defines the master nodes of our multi-node cluster. In our case, this is just the master machine.

On master, update <HADOOP_INSTALL>/conf/masters so that it looks like this:

master_hostname

conf/slaves

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master machine and the slave machines to act as Hadoop slaves because we want all of them to store and process data.

On master, update <HADOOP_INSTALL>/conf/slaves so that it looks like this:

master_hostname

slave1_hostname

slave2_hostname

...

conf/*-site.xml

You can leave the settings below as they are, with the exception of the hadoop.tmp.dir variable, which you have to change to a directory of your choice, for example /tmp/datastore/hadoop-${user.name}. Hadoop will expand ${user.name} to the system user running Hadoop, so in our case this will be hadoop and the final path will be /tmp/datastore/hadoop-hadoop.

Note: Depending on your choice of location, you might have to create the directory manually with sudo mkdir /your/path; sudo chown hadoop:hadoop /your/path in case the hadoop user does not have the required permissions to do so (otherwise, you will see a java.io.IOException when you try to format the name node in the next section).

e.g.

superuser@master:~$ sudo mkdir /tmp/datastore
superuser@master:~$ sudo chown -R hadoop:hadoop /tmp/datastore

<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
<description>A base for other temporary directories</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://master_hostname:54310</value>
<description>The name of the default file system.  A URI whose
scheme and authority determine the FileSystem implementation.  The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class.  The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

E.g. in our case hadoop.tmp.dir is /tmp/datastore/hadoop-${user.name}

Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case. Replace master_hostname with the DNS name of the master machine.

<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master_hostname:54311</value>
<description>The host and port that the MapReduce job tracker runs
at.  If “local”, then jobs are run in-process as a single map
and reduce task.
</description>
</property>

<property>
<name>mapred.map.tasks</name>
<value>50</value>
<description>As a rule of thumb, use 10x the number of slaves (i.e., number of tasktrackers).
</description>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>50</value>
<description>As a rule of thumb, use 2x the number of slave processors (i.e., number of tasktrackers).
</description>
</property>


Third, we change the dfs.replication variable (in conf/hdfs-site.xml), which specifies the default block replication. It defines how many machines a single block should be replicated to before it becomes available. If you set this to a value higher than the number of slave nodes (more precisely, the number of datanodes) that you have available, you will start seeing a lot of "(Zero targets found, forbidden1.size=1)" type errors in the log files.

The default value of dfs.replication is 3. dfs.replication should always be <= the number of datanodes.

<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

Remote copy hadoop folder from master node to slave nodes

Run the following command from the master node for each slave node:

hadoop@master:~$ scp -r /usr/local/hadoop ip_address_of_slavenode_i:/usr/local/hadoop

This copies our entire setup to the slave node without having to go through unzipping and the subsequent steps again. You can do this after configuring Hadoop so that you don’t have to configure each slave independently, but only do this if you don’t plan to have any custom settings per node. Also, do not do this if you have already attempted to start Hadoop.

Formatting the namenode

Before we start our new multi-node cluster, we have to format Hadoop’s distributed filesystem (HDFS) via the namenode. You only need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop namenode; this will erase all the data in the HDFS filesystem.

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable on the namenode), run the command:

hadoop@master:~$ /usr/local/hadoop/bin/hadoop namenode -format

Starting the multi-node cluster

Starting the cluster is done in two steps. First, the HDFS daemons are started: the namenode daemon is started on master, and datanode daemons are started on all slaves (here: the master and all slave machines). Second, the MapReduce daemons are started: the jobtracker is started on master, and tasktracker daemons are started on all slaves (here: the master and all slave machines).

1) Starting HDFS daemons

Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the previous command on, and datanodes on the machines listed in the conf/slaves file.

In our case, we will run bin/start-dfs.sh on master:

hadoop@master:~$ /usr/local/hadoop/bin/start-dfs.sh

2) Starting MapReduce daemons

Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file.

In our case, we will run bin/start-mapred.sh on master:

hadoop@master:~$ /usr/local/hadoop/bin/start-mapred.sh

The following URLs are useful to check the dfs health and track the jobs:

http://master_hostname:50030/ (jobtracker)
http://master_hostname:50070/ (namenode / dfs health)


1. ASCII summation

Now we will examine some hash functions suitable for storing strings of characters. We start with a simple summation function.

int h(String x, int M) {
    char ch[] = x.toCharArray();
    int sum = 0;
    // Sum the character (ASCII) values of every character in the string.
    for (int i = 0; i < ch.length; i++)
        sum += ch[i];
    // Map the sum into the table range 0..M-1.
    return sum % M;
}

This function sums the ASCII values of the letters in a string. If the hash table size M is small compared to the resulting summations, then this hash function should do a good job of distributing strings evenly among the hash table slots, because it gives equal weight to all characters in the string. This is an example of the folding approach to designing a hash function. Note that the order of the characters in the string has no effect on the result. A similar method for integers would add the digits of the key value, assuming that there are enough digits to

  1. keep any one or two digits with bad distribution from skewing the results of the process and
  2. generate a sum much larger than M.

As with many other hash functions, the final step is to apply the modulus operator to the result, using table size M to generate a value within the table range. If the sum is not sufficiently large, then the modulus operator will yield a poor distribution. For example, because the ASCII value for “A” is 65 and “Z” is 90, sum will always be in the range 650 to 900 for a string of ten upper case letters. For a hash table of size 100 or less, a reasonable distribution results. For a hash table of size 1000, the distribution is terrible because only slots 650 to 900 can possibly be the home slot for some key value, and the values are not evenly distributed even within those slots.
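To see the problem concretely, here is a small, hedged demonstration (the class name AsciiSumRange and the helper sum() are just for illustration) of the narrow range produced by summing ten upper-case letters:

public class AsciiSumRange {
    public static void main(String[] args) {
        System.out.println(sum("AAAAAAAAAA"));  // 650  (10 * 65, the ASCII value of 'A')
        System.out.println(sum("ZZZZZZZZZZ"));  // 900  (10 * 90, the ASCII value of 'Z')
        // With M = 1000, h() above could therefore only ever return slots 650..900.
    }

    static int sum(String x) {
        int s = 0;
        for (char ch : x.toCharArray()) s += ch;
        return s;
    }
}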

Further Optimization:

Here is a much better hash function for strings.

// Use folding on a string, summed 4 bytes at a time
long sfold(String s, int M) {
    int intLength = s.length() / 4;
    long sum = 0;
    // Process the string four characters at a time, treating each
    // group as a base-256 integer.
    for (int j = 0; j < intLength; j++) {
        char c[] = s.substring(j * 4, (j * 4) + 4).toCharArray();
        long mult = 1;
        for (int k = 0; k < c.length; k++) {
            sum += c[k] * mult;
            mult *= 256;
        }
    }

    // Fold in any remaining characters at the end of the string.
    char c[] = s.substring(intLength * 4).toCharArray();
    long mult = 1;
    for (int k = 0; k < c.length; k++) {
        sum += c[k] * mult;
        mult *= 256;
    }

    return (Math.abs(sum) % M);
}

This function takes a string as input. It processes the string four bytes at a time, and interprets each of the four-byte chunks as a single long integer value. The integer values for the four-byte chunks are added together. In the end, the resulting sum is converted to the range 0 to M-1 using the modulus operator.

For example, if the string “aaaabbbb” is passed to sfold, then the first four bytes (“aaaa”) will be interpreted as the integer value 1,633,771,873, and the next four bytes (“bbbb”) will be interpreted as the integer value 1,650,614,882. Their sum is 3,284,386,755 (when treated as an unsigned integer). If the table size is 101 then the modulus function will cause this key to hash to slot 75 in the table. Note that for any sufficiently long string, the sum for the integer quantities will typically cause a 32-bit integer to overflow (thus losing some of the high-order bits) because the resulting values are so large. But this causes no problems when the goal is to compute a hash function.
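As a quick check (a hedged usage sketch, assuming sfold is in scope), calling the function with the values from this example should indeed land in slot 75:

long slot = sfold("aaaabbbb", 101);
System.out.println(slot);  // prints 75, matching the calculation above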

The reason that hashing by summing the integer representation of four letters at a time is superior to summing one letter at a time is because the resulting values being summed have a bigger range. This still only works well for strings long enough (say at least 7-12 letters), but the original method would not work well for short strings either. Another alternative would be to fold two characters at a time.

2. String hash function

In the String class, for example, the hash code h of a string s of length n is calculated as

\( \texttt{h} \;=\; \texttt{s[0]}*31^{n-1} + \texttt{s[1]}*31^{n-2} + \cdots + \texttt{s[n-1]} \)

or, in code,

int h = 0; for (int i = 0; i < n; i++) { h = 31*h + s.charAt(i); } 

For example, the hash code of hello uses the Unicode values of its characters

h = 104, e = 101, l = 108, l = 108, o = 111

to give the value

\( 99162322 \;=\; 104*31^{4} + 101*31^{3} + 108*31^{2} + 108*31 + 111 \)

In general the arithmetic operations in such expressions will use 32-bit modular arithmetic ignoring overflow. For example

Integer.MAX_VALUE + 1 = Integer.MIN_VALUE

where

Integer.MAX_VALUE = $2147483647$
Integer.MIN_VALUE = $-2147483648$

Note that, because of wraparound associated with modular arithmetic, the hash code could be negative, or even zero. It happened to be positive in this case because hello is a fairly short string. Thus, for example, the hash code for helloa, which is 31 times the hash code for hello plus 97, would not be $3074032079$, which is outside the range of 32-bit signed integers, but $3074032079-2^{32} = -1220935217$.
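These numbers can be verified directly with String.hashCode(); here is a minimal check (the class name HashCodeDemo is just for illustration):

public class HashCodeDemo {
    public static void main(String[] args) {
        System.out.println("hello".hashCode());   // 99162322
        System.out.println("helloa".hashCode());  // -1220935217 (wrapped around)
    }
}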

Introduction to Python

Posted: June 25, 2012 in Interviews

In an object-based application, most objects are passive. A passive object just sits there waiting for one of its methods to be invoked. A passive object’s private member variables can only be changed by the code in its own methods, so its state remains constant until one of its methods is invoked. In a multithreaded environment like Java, threads can run within objects to make the objects active. Objects that are active make autonomous changes to themselves.

Sometimes in modeling a system, it becomes apparent that if some of the objects were active, the model would be simplified. Earlier in this book, classes that implemented Runnable were instantiated, passed to one of the constructors of Thread, and then start() was invoked. This style required a user of a class to know that a thread needed to be started to run within it, creating a burden on the user of the class. In addition, because the user of the class created the Thread object for it, a reference to the Thread was available for misuse. The user of the class could erroneously set the priority of the thread, suspend it at a bad time, or outright stop the thread when the object it was running in was in an inconsistent state. Having to activate objects externally is both inconvenient and potentially hazardous. In this chapter, I’ll show you how to have an active object transparently create and start up its own internal thread.

Simple Self-Running Class

The class SelfRun, shown in the following listing, demonstrates a simple example of an active object. During construction, it automatically starts an internal thread running.

Code Listing SelfRun.java—A Simple Self-Running Class
public class SelfRun extends Object implements Runnable {
    private Thread internalThread;
    private volatile boolean noStopRequested;

    public SelfRun() {
        // other constructor stuff should appear here first ...
        System.out.println("in constructor - initializing...");

        // Just before returning, the thread should be
        // created and started.
        noStopRequested = true;
        internalThread = new Thread(this);
        internalThread.start();
    }

    public void run() {
        // Check that no one has erroneously invoked
        // this public method.
        if ( Thread.currentThread() != internalThread ) {
            throw new RuntimeException("only the internal " +
                "thread is allowed to invoke run()");
        }

        while ( noStopRequested ) {
            System.out.println("in run() - still going...");

            try {
                Thread.sleep(700);
            } catch ( InterruptedException x ) {
                // Any caught interrupts should be habitually
                // reasserted for any blocking statements
                // which follow.
                Thread.currentThread().interrupt();
            }
        }
    }

    public void stopRequest() {
        noStopRequested = false;
        internalThread.interrupt();
    }

    public boolean isAlive() {
        return internalThread.isAlive();
    }
}
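As a quick illustration of how an active object is used (a hedged sketch, not part of the original listing; the class name SelfRunMain is hypothetical), a caller simply constructs the object, lets it run for a while, and then requests a stop:

public class SelfRunMain {
    public static void main(String[] args) throws InterruptedException {
        SelfRun sr = new SelfRun();   // the internal thread starts in the constructor
        Thread.sleep(3000);           // let it print a few "still going" messages
        sr.stopRequest();             // ask the internal thread to finish
        while (sr.isAlive()) {
            Thread.sleep(100);        // wait for it to wind down
        }
        System.out.println("stopped");
    }
}

Note that the user of SelfRun never sees or manages the internal Thread, which is exactly the point of the pattern.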

Introducing Hadoop

Posted: May 16, 2012 in Interviews

Hadoop is the Apache Software Foundation top-level project that holds the various Hadoop subprojects that graduated from the Apache Incubator. The Hadoop project provides and supports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications. The Hadoop framework handles the processing details, leaving developers free to focus on application logic.

The introduction on the Hadoop project web page (http://hadoop.apache.org/) states:

  • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:
    • Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
    • HBase builds on Hadoop Core to provide a scalable, distributed database.
    • Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
    • ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
    • Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, adhoc querying and analysis of datasets.

The Hadoop Core project provides the basic services for building a cloud computing environment with commodity hardware, and the APIs for developing software that will run on that cloud. The two fundamental pieces of Hadoop Core are the MapReduce framework (the cloud computing environment) and the Hadoop Distributed File System (HDFS).

The Hadoop Core MapReduce framework requires a shared file system. This shared file system does not need to be a system-level file system, as long as there is a distributed file system plug-in available to the framework. While Hadoop Core provides HDFS, HDFS is not required. In Hadoop JIRA (the issue-tracking system), item 4686 is a tracking ticket to separate HDFS into its own Hadoop project. In addition to HDFS, Hadoop Core supports the CloudStore (formerly Kosmos) file system (http://kosmosfs.sourceforge.net/) and Amazon Simple Storage Service (S3) file system (http://aws.amazon.com/s3/). The Hadoop Core framework comes with plug-ins for HDFS, CloudStore, and S3. Users are also free to use any distributed file system that is visible as a system-mounted file system, such as Network File System (NFS), Global File System (GFS), or Lustre.

When HDFS is used as the shared file system, Hadoop is able to take advantage of knowledge about which node hosts a physical copy of input data, and will attempt to schedule the task that is to read that data, to run on that machine. This book mainly focuses on using HDFS as the file system.

Hadoop Core MapReduce

The Hadoop Core MapReduce environment provides the user with a sophisticated framework to manage the execution of map and reduce tasks across a cluster of machines. The user is required to tell the framework the following:

  • The location(s) in the distributed file system of the job input
  • The location(s) in the distributed file system for the job output
  • The input format
  • The output format
  • The class containing the map function
  • Optionally, the class containing the reduce function
  • The JAR file(s) containing the map and reduce functions and any support classes

If a job does not need a reduce function, the user does not need to specify a reducer class, and a reduce phase of the job will not be run. The framework will partition the input, and schedule and execute map tasks across the cluster. If requested, it will sort the results of the map task and execute the reduce task(s) with the map output. The final output will be moved to the output directory, and the job status will be reported to the user.
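To make the list above concrete, here is a hedged sketch of a job driver written against the Hadoop 0.20 API; WordCountMapper and WordCountReducer are hypothetical classes standing in for your own map and reduce implementations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");        // job name
        job.setJarByClass(WordCountDriver.class);     // the JAR containing map/reduce classes
        job.setMapperClass(WordCountMapper.class);    // the class containing the map function
        job.setReducerClass(WordCountReducer.class);  // optionally, the reduce function
        job.setOutputKeyClass(Text.class);            // output key/value types
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // job input in the distributed file system
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output in the distributed file system
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The input and output formats default to TextInputFormat and TextOutputFormat unless set explicitly on the job.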

MapReduce is oriented around key/value pairs. The framework will convert each record of input into a key/value pair, and each pair will be input to the map function once. The map output is a set of key/value pairs—nominally one pair that is the transformed input pair, but it is perfectly acceptable to output multiple pairs. The map output pairs are grouped and sorted by key. The reduce function is called one time for each key, in sort sequence, with the key and the set of values that share that key. The reduce method may output an arbitrary number of key/value pairs, which are written to the output files in the job output directory. If the reduce output keys are unchanged from the reduce input keys, the final output will be sorted.

The framework provides two processes that handle the management of MapReduce jobs:

  • TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.
  • JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Generally, there is one JobTracker process per cluster and one or more TaskTracker processes per node in the cluster. The JobTracker is a single point of failure, and the JobTracker will work around the failure of individual TaskTracker processes.

The Hadoop Distributed File System

HDFS is a file system designed for MapReduce jobs that read input in large chunks, process it, and write potentially large chunks of output. HDFS does not handle random access particularly well. For reliability, file data is simply mirrored to multiple storage nodes. This is referred to as replication in the Hadoop community. As long as at least one replica of a data chunk is available, the consumer of that data will not know of storage server failures.

HDFS services are provided by two processes:

  • NameNode handles management of the file system metadata, and provides management and control services.
  • DataNode provides block storage and retrieval services.

There will be one NameNode process in an HDFS file system, and this is a single point of failure. Hadoop Core provides recovery and automatic backup of the NameNode, but no hot failover services. There will be multiple DataNode processes within the cluster, with typically one DataNode process per storage node in a cluster.

Behavioral patterns are concerned with the interaction and responsibility of objects. They help make complex behavior manageable by specifying the responsibilities of objects and the ways they communicate with each other. The following behavioral patterns are described by the GoF:

  • Chain of Responsibility
  • Command
  • Interpreter
  • Iterator
  • Mediator
  • Memento
  • Observer
  • State
  • Strategy
  • Template Method
  • Visitor


Chain of Responsibility 

The Chain of Responsibility pattern’s intent is to avoid coupling the sender of a request to its receiver by giving multiple objects a chance to handle the request. The request is passed along the chain of receiving objects until an object processes it. The figure below shows the UML.


Figure  UML for the Chain of Responsibility pattern

Benefits  Following are the benefits of using the Chain of Responsibility pattern:

  • It reduces coupling.
  • It adds flexibility when assigning responsibilities to objects.
  • It allows a set of classes to act as one; events produced in one class can be sent to other handler classes within the composition.

Applicable Scenarios  The following scenarios are most appropriate for the Chain of Responsibility pattern:

  • More than one object can handle a request and the handler is unknown.
  • A request is to be issued to one of several objects and the receiver is not specified explicitly.
  • The set of objects able to handle the request is to be specified dynamically.

J2EE Technology Feature  The J2EE technology feature associated with the Chain of Responsibility pattern is RequestDispatcher in the servlet/JSP API.

Example Code  The following example Java code demonstrates the Chain of Responsibility pattern:

ConcreteHandler1.java

public class ConcreteHandler1 implements HandlerIF {

    public void processRequest(Request parm) {
        // Start the processing chain here...
        switch (parm.getType()) {
            case Request.EQUITY_ORDER:  // This object processes equity orders
                handleIt(parm);         // so call the function to handle it.
                break;
            case Request.BOND_ORDER:    // Another object processes bond orders so
                System.out.println("Creating 2nd handler.");  // pass request along.
                new ConcreteHandler2().processRequest(parm);
                break;
        }
    }

    private void handleIt(Request parm) {
        System.out.println("ConcreteHandler1 has handled the processing.");
    }
}

ConcreteHandler2.java

public class ConcreteHandler2 implements HandlerIF {

    public void processRequest(Request parm) {
        // You could add on to the processing chain here...
        handleIt(parm);
    }

    private void handleIt(Request parm) {
        System.out.println("ConcreteHandler2 has handled the processing.");
    }
}

HandlerIF.java

public interface HandlerIF {
    public void processRequest(Request request);
}

Request.java

public class Request {

    // The universe of known requests that can be handled.
    public final static int EQUITY_ORDER = 100;
    public final static int BOND_ORDER   = 200;

    // This object's type of request.
    private int type;

    public Request(int parm) throws Exception {
        // Validate the request type against the known universe.
        if ((parm == EQUITY_ORDER) || (parm == BOND_ORDER))
            // Store this request type.
            this.type = parm;
        else
            throw new Exception("Unknown Request type " + parm + ".");
    }

    public int getType() {
        return type;
    }
}

ChainOfResponsibilityPattern.java

public class ChainOfResponsibilityPattern {

    public static void main(String[] args) {
        System.out.println("Chain Of Responsibility Pattern Demonstration.");
        System.out.println("-----------------------------------------------");
        try {
            // Create Equity Order request.
            System.out.println("Creating Equity Order request.");
            Request equityOrderRequest = new Request(Request.EQUITY_ORDER);

            // Create Bond Order request.
            System.out.println("Creating Bond Order request.");
            Request bondOrderRequest = new Request(Request.BOND_ORDER);

            // Create a request handler.
            System.out.println("Creating 1st handler.");
            HandlerIF handler = new ConcreteHandler1();

            // Process the Equity Order.
            System.out.println("Calling 1st handler with Equity Order.");
            handler.processRequest(equityOrderRequest);

            // Process the Bond Order.
            System.out.println("Calling 1st handler with Bond Order");
            handler.processRequest(bondOrderRequest);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        System.out.println();
    }
}

Yesterday I appeared for an interview at VMware, Pune.

The office is located in Wakdewadi, in the new Bajaj building. There were 4 technical rounds. I spent almost 5 hours there for the interview.

Below are the technical interview questions they asked:

1. Design a HashMap using an array. Write the program for the same.

2. How do you find the duplicate elements in an array without using extra space?

3. Design the popular Producer/Consumer problem. There are a couple of constraints: the consumer must consume the data in order, and no event may be lost.

4. How do you determine whether a linked list is circular without a min and fwd pointer? What is the complexity? (A sketch of one common approach appears at the end of this post.)

5. What is the difference between SOAP and REST, and where would you use each?

6. What is the difference between the Abstract Factory and Factory design patterns?

7. Java-related questions like:

  • Volatile,
  • Abstract Class Vs Interface,
  • Collection,
  • Threads,
  • Dead Lock,
  • How to design your own implementation of ArrayList
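For question 4, one common approach is Floyd's cycle-detection algorithm (the "tortoise and hare"). Here is a minimal, hedged sketch; the Node class is hypothetical and only illustrates the idea:

class Node {
    int value;
    Node next;
}

class CycleCheck {
    // Returns true if the list starting at head loops back on itself.
    // Runs in O(n) time and uses O(1) extra space.
    static boolean isCircular(Node head) {
        Node slow = head, fast = head;
        while (fast != null && fast.next != null) {
            slow = slow.next;        // advances one node per step
            fast = fast.next.next;   // advances two nodes per step
            if (slow == fast) {
                return true;         // the pointers can only meet inside a cycle
            }
        }
        return false;                // fast fell off the end, so the list is not circular
    }
}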