HADOOP QUESTIONS:

Monetized analytics helps businesses make important and better decisions and helps them earn revenue. However, Big Data analytics is also used to derive revenue beyond the insights it provides: you might be able to obtain a unique data set that is valuable to other companies.

The Sqoop merge tool works hand in hand with the incremental import lastmodified mode. Each import creates a new file, so if you want to keep the table data together in one file, you use the merge tool.
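
A hedged example of a merge invocation, with made-up directories, key column, and the record class/jar produced by an earlier sqoop codegen run:

sqoop merge --new-data /user/foo/orders_new --onto /user/foo/orders_old \
  --target-dir /user/foo/orders_merged \
  --jar-file Orders.jar --class-name Orders --merge-key id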

An Oozie activity is any possible entity that can be tracked in Oozie functional subsystems and Hadoop jobs. The Oozie SLA defines and stores the desired SLA information for any Oozie activity.

Ensemble methods refer to the process of generating multiple models and combining them to solve a specific problem. The process that we follow in an ensemble method is quite similar to what we follow in our day-to-day life: we take opinions from different experts before arriving at a final decision.

Database sharding can be defined as a partitioning scheme for large databases distributed across various servers, and it is responsible for new levels of database performance and scalability. It divides a database into smaller parts called “shards” and replicates those across a number of distributed servers.

Polyglot persistence refers to big data applied to a set of applications that use several core database technologies. A polyglot approach is often used to solve a complex problem by breaking it into fragments and applying a different database modelling technique to each.

The input split defines the unit of work in a MapReduce program, but the input split does not describe the way to access that unit of work. The RecordReader class loads all required data from its source and converts it into key/value pairs that can be read by the mapper.
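
To make the hand-off concrete, here is a minimal sketch using the newer mapreduce API (the class name OffsetMapper is hypothetical). A text-file RecordReader such as LineRecordReader hands the mapper the byte offset of each line as the key and the line itself as the value:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key = byte offset supplied by the RecordReader; value = one line of text
    context.write(key, value);
  }
}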

The visualization layer handles the task of interpreting and visualizing Big Data. Visualization can be described as viewing a piece of information from different perspectives and interpreting it in different manners.

SSH provides password-less secure communication in which data packets are sent across to the slaves in a defined format. SSH is used not only between masters and slaves but also between any two hosts.

In the case of distributed databases, the three important aspects of the CAP theorem are Consistency (C), Availability (A), and Partition tolerance (P). Consistency means that every read sees the most recent write; availability means that every request receives a response; and partition tolerance means that the system continues to operate even when network failures separate the nodes. The theorem states that a distributed system can guarantee at most two of these three properties at the same time.

The major difference between the HDFS File Sink and the File Roll Sink is that the HDFS File Sink writes events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores events on the local file system.
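
An illustrative Flume configuration fragment for the two sinks, assuming a hypothetical agent named agent1 bound to a channel ch1, with made-up paths:

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events

agent1.sinks.rollSink.type = file_roll
agent1.sinks.rollSink.channel = ch1
agent1.sinks.rollSink.sink.directory = /var/log/flume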

The BloomMapFile is a class that extends MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is also used in the HBase file format.

Never. The namenode needs to be formatted only once, in the beginning; it is the only node that is formatted at all. Reformatting the namenode would lead to loss of the data in the entire file system. Formatting creates the directory structure for the file system metadata and creates the namespaceID for the entire file system.
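
For reference, the one-time format is typically performed with a command along these lines (older releases use the hadoop namenode -format form):

hdfs namenode -format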

Data Definition Language (DDL) is used to describe data and data structures of a database. Hive has its own DDL, similar to SQL DDL, which is used for managing, creating, altering, and dropping databases, tables, and other objects in a database.
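
For illustration, a short HiveQL DDL sketch with made-up database and table names:

CREATE DATABASE IF NOT EXISTS sales;
CREATE TABLE sales.orders (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE sales.orders ADD COLUMNS (region STRING);
DROP TABLE IF EXISTS sales.orders;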

Ans: The purpose of Zookeeper is cluster management. Zookeeper will help you achieve coordination between Hadoop nodes. Zookeeper also helps to:

a. Manage configuration across nodes
b. Implement reliable messaging
c. Implement redundant services
d. Synchronize process execution

Oozie SLA specifies the quality of an Oozie application in measurable terms. The SLA can be determined after taking the business requirements and the nature of the software into consideration.

With checkpoint functionality, the Backup node maintains the current state of all the HDFS block metadata in memory, just like the namenode. If you are using the Backup node, you can't run the Checkpoint node, and there is no need to do so, because the checkpointing process is already being taken care of. You can say that the Checkpoint node is the replacement for the secondary namenode.

Any asynchronous action in the Hadoop cluster can be executed in the form of Hadoop MapReduce jobs. This makes Oozie scalable. When you use Hadoop to perform processing/computation tasks triggered by a workflow action, workflow jobs must wait for the completion of these tasks before moving to the next node in the workflow.

Oozie can recover workflow jobs in two ways. First, when an action starts successfully, Oozie applies the MapReduce retry mechanisms for recovery. On the other hand, if an action fails to start, Oozie uses other recovery techniques according to the nature of the failure.

The Oozie bundle is a top-level abstraction. In other words, it is a bundle, or set, of coordinator applications. The Oozie bundle enables a user to start, stop, suspend, resume, or rerun a job at the bundle level, which provides better operational control over the set of coordinator applications.

The SPLIT operator partitions a given relation into two or more relations. The FLATTEN operator is used for un-nesting tuples as well as bags. The FLATTEN operator looks syntactically similar to a user-defined function statement.

The Oozie coordinator is used to specify the conditions for a workflow in the form of predicates. It triggers the execution of the workflow at a specified time, at regular intervals, or on the basis of available data.

Static partition: the name of the partition is hard-coded in the insert statement. Dynamic partition: Hive automatically determines the partition based on the value of the partition field.
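
A short HiveQL sketch of both styles, with made-up table names (dynamic partitioning may need to be enabled first):

-- static: the partition value is hard-coded
INSERT INTO TABLE logs PARTITION (dt='2015-01-01')
SELECT ip, url FROM staging WHERE dt='2015-01-01';

-- dynamic: Hive derives the partition from the last column selected
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE logs PARTITION (dt)
SELECT ip, url, dt FROM staging;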

Ans: By using the -Djava.library.path option on the command line, or else by setting LD_LIBRARY_PATH in the .bashrc file.
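
A sketch of both approaches (passing the JVM option through HADOOP_OPTS is one common route; the native library directory is made up):

export HADOOP_OPTS="-Djava.library.path=/usr/local/lib/native"
# or, in ~/.bashrc:
export LD_LIBRARY_PATH=/usr/local/lib/native:$LD_LIBRARY_PATH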

Ans: The classpath will contain a list of directories containing the jar files required to stop/start daemons.

Ex: $HADOOP_HOME/share/hadoop/common/lib contains all the common utility jar files.

Ans: You can either do it programmatically, by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.
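
A brief sketch of both options with the old mapred API (the driver class name MyJob is hypothetical):

JobConf conf = new JobConf(MyJob.class);
conf.setNumReduceTasks(10);            // programmatic
conf.set("mapred.reduce.tasks", "10"); // equivalent configuration setting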

Ans: There are a few ways to do this. See the illustrations below.

hadoop fs -setrep -R -w 5 hadoop-test

hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

Ans: The cluster is in safe mode. The administrator needs to wait for the data replication to complete. Depending on the data size, the replication will take some time; the Hadoop cluster still needs to copy data around, and if the data size is big enough, it is not uncommon for replication to take from a few minutes to a few hours.
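
To check the state, and to leave safe mode manually when appropriate, commands along these lines can be used (older releases use the hadoop dfsadmin form):

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave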

Ans: Deploy the namenode and job tracker on the master node, and deploy datanodes and tasktrackers on multiple slave nodes. There is a need for only one namenode and one job tracker in the system. The number of datanodes depends on the available hardware.

Ans: There are no special requirements for datanodes. However, the namenodes require a specific amount of RAM to store the filesystem image in memory. Based on the design of the primary namenode and secondary namenode, the entire filesystem information will be stored in memory; therefore, both namenodes need enough memory to contain the entire filesystem image.

Operationalized analytics means making analytics an important part of the business process. For instance, an insurance company can use a model to predict the probability of a claim being fraudulent.

Monetized analytics helps businesses make important and better decisions and helps them earn revenue.

The RecordReader class generates key/value pairs from the data within the boundaries created by the input split. In the input file, each split has a start and a corresponding end. The start is a byte position that tells the RecordReader where to begin generating key/value pairs.

The results thus obtained must always be placed in a business context as the final step of validation. Let's assume that an executive is 99 percent confident that a change in a process would result in a 10 percent hike in revenue.

The monitoring layer consists of a number of monitoring systems. These systems remain automatically aware of all the configurations and functions of different operating systems and hardware.

A container is nothing but a set of physical resources on a single node. A container consists of memory, CPU cores, and disks. Depending upon the resources in a node, a node can have multiple containers, each assigned to a specific ApplicationMaster.

Zookeeper: does the coordination work between the client and the HBase Master.
HBase Master: monitors the RegionServers.
RegionServer: monitors the Regions.
Region: contains the in-memory data store (MemStore) and HFiles.
Catalog tables: consist of ROOT and META.

This invokes an action in a specified shell script located on an Oozie server node (not in HDFS).

TeraSort is a popular benchmark that measures the amount of time taken to sort one terabyte of randomly distributed data on a given computer system. It is commonly used to measure the MapReduce performance of an Apache™ Hadoop® cluster.
More on this:
https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
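
A hedged run sketch using the bundled examples jar (the jar path varies by distribution; 10,000,000,000 rows of 100 bytes each make one terabyte):

hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /tera/in
hadoop jar hadoop-mapreduce-examples.jar terasort /tera/in /tera/out
hadoop jar hadoop-mapreduce-examples.jar teravalidate /tera/out /tera/report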

HBase runs the master and its informational HTTP server at 60000 and 60010 respectively, and the regionservers at 60020 with their informational HTTP server at 60030.

Classpath will consist of a list of directories containing jar files to stop or start daemons.

The three main hdfs-site.xml properties are:

1. dfs.name.dir, which gives you the location where the metadata will be stored and where DFS is located – on disk or on a remote host.

2. dfs.data.dir, which gives you the location where the data is going to be stored.

3. fs.checkpoint.dir, which is for the secondary Namenode.
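
An illustrative hdfs-site.xml fragment with made-up local paths:

<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/1/dfs/snn</value>
</property>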

Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they are sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools.

In Hadoop-2.x, we have two Namenodes – Active “Namenode” and Passive “Namenode”. Active “Namenode” is the “Namenode” which works and runs in the cluster. Passive “Namenode” is a standby “Namenode”, which has similar data as active “Namenode”. When the active “Namenode” fails, the passive “Namenode” replaces the active “Namenode” in the cluster. Hence, the cluster is never without a “Namenode” and so it never fails.

WebDAV is a set of extensions to HTTP to support editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

Cloud computing provides fault tolerance by offering uninterrupted services to customers, especially in cases of component failure. The responsibility of handling the workload is shifted to other components of the cloud.

Apache Hadoop is an ecosystem used for processing large amounts of data through the MapReduce data processing model. The MapReduce model was originally developed by Google; Hadoop is an open-source implementation of it. Hadoop supports distributed processing of large amounts of data through the core MapReduce processing mechanism.

Intelligent keys: because the data stored in HBase is ordered by row key, and the row key is the only native index provided by the system, careful, intelligent design of the row key can make a huge difference.

Kick-off time refers to the time when a bundle application starts. A bundle action refers to the start of a coordinator job of a coordinator application by the Oozie server.

It refers to the identification of anomalies, that is, the identification of an event that shows a difference between the actual observation and what you expected in your data.

YARN is backward compatible, which means that code developed using MapReduce can run on YARN in Hadoop 2 without any changes, or with only minor ones. This is a very important feature, as applications developed using MapReduce usually cater to a large user base and run on widespread distributed systems.

Sometimes, a task is unable to invoke the callback URL on completion for some reason, such as a transient network failure. In this case, Oozie uses a polling mechanism, in which Oozie itself polls the task for its completion status.

Hive is a batch-oriented, data-warehousing layer created on the basic elements of Hadoop, such as HDFS and MapReduce. This layer plays an important role in the mining of big data. Hive offers a simple SQL-like implementation called HiveQL to SQL users without losing access through mappers and reducers.

Compute node: this is the computer or machine where your actual business logic will be executed.

Storage node: this is the computer or machine where your file system resides to store the data for processing. In most cases, the compute node and the storage node are the same machine.

Besides using the jps command, to check whether the Namenode is working you can also use /etc/init.d/hadoop-0.20-namenode status.

The distributed cache is much faster. It copies the file to all task trackers at the start of the job. Now if a task tracker runs 10 or 100 mappers or reducers, they will all use the same copy from the distributed cache. On the other hand, if you put code in the file to read it from HDFS in the MR job, then every mapper will try to access it from HDFS; hence, if a tasktracker runs 100 map tasks, it will try to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
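
A minimal sketch of both sides using the newer mapreduce API (the file path and symlink name are made up for illustration):

// imports: java.net.URI, java.io.BufferedReader, java.io.FileReader,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job

// Driver: register the file once; the framework copies it to every node.
Job job = Job.getInstance(new Configuration(), "cache-example");
job.addCacheFile(new URI("/user/foo/lookup.txt#lookup"));

// Mapper setup(): read the localized copy through its symlink name.
BufferedReader in = new BufferedReader(new FileReader("lookup"));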

A cloud that is owned and managed by a company other than the one using it is known as a public cloud. In this cloud, there is no need for the organizations (customers) to control or manage the resources; they are administered by a third party.

The cloud that remains entirely in the ownership of the organization using it is known as a private cloud. In other words, this cloud computing infrastructure is solely designed for a single organization and can't be accessed by other organizations. However, the organization may allow this cloud to be used by its employees, partners, and customers.

org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.

org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.

Consider a case where a customer files a claim to get the insurance money for a car, claiming that it was destroyed in a fire. However, the customer's records on file indicate that most of the valuable items were removed from the car prior to the fire. This might indicate that the car was torched on purpose.

This is a big marketing and analytics platform for mobile and web apps. Its developer is Localytics, based in Boston. It supports cross-platform and web-based applications. Localytics supports push messaging, business analytics, and acquisition campaign management.

Hive allows most SQL queries, but HBase does not allow SQL queries directly.

Hive doesn't support record-level update, insert, and delete operations on tables, but HBase can do this.

Hive is a data warehouse framework, whereas HBase is a NoSQL database.

Hive runs on top of MapReduce; HBase runs on top of HDFS. Note: Hive supports update and delete in its latest versions.