Tuesday, December 1, 2015

Cassandra cqlsh




1. cqlsh is a command line utility for issuing query statements to Cassandra or altering schemas in Cassandra.
                    install/bin/cqlsh
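For example, cqlsh can be launched against a local node and used to run a simple query (the host is an assumption; 9042 is the default native protocol port in recent versions):

```
$ bin/cqlsh localhost 9042
cqlsh> SELECT release_version FROM system.local;
cqlsh> EXIT;
```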

2. Some of the options that you can pass to the cqlsh command are:
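A few commonly used options, as a sketch (the username, password, keyspace, and file names here are placeholders):

```
$ bin/cqlsh -u myuser -p mypassword      # authenticate as a user
$ bin/cqlsh -k mykeyspace                # start with a keyspace already selected
$ bin/cqlsh -f statements.cql            # execute the statements in a file, then exit
$ bin/cqlsh --cqlversion "3.1.1" host    # force a specific CQL version
```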


  
3. cqlsh has some commands that are not part of the Cassandra Query Language.
       Below are some such commands:
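A quick sketch of some cqlsh-only (shell) commands (the file names are placeholders):

```
cqlsh> DESCRIBE KEYSPACES;        -- list all keyspaces
cqlsh> DESCRIBE TABLE system.local;
cqlsh> CONSISTENCY QUORUM;        -- set the consistency level for the session
cqlsh> TRACING ON;                -- enable request tracing
cqlsh> SOURCE 'statements.cql';   -- execute statements from a file
cqlsh> CAPTURE 'out.txt';         -- send query output to a file
```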

4. Copying data between a specified table and a CSV file
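For example (the keyspace, table, and column names are assumptions):

```
cqlsh> COPY musicdb.songs (id, title, artist) TO '/tmp/songs.csv' WITH HEADER = TRUE;
cqlsh> COPY musicdb.songs (id, title, artist) FROM '/tmp/songs.csv' WITH HEADER = TRUE;
```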
       

5. Default keyspaces in Cassandra:
             a) system_traces
             b) system
          


6.  cqlsh commands and CQL commands
     






Monday, November 30, 2015

Cassandra Nodetool


Introduction to Cassandra Nodetool:

1. Nodetool is the command line utility for managing a Cassandra cluster.
             /install/bin/nodetool

2. To connect to a node other than the one you are currently on, use the below command.
           $ bin/nodetool -h 'hostname' -p 'jmx_port' [command] [options]
               > jmx_port is configured in cassandra-env.sh
               > default jmx port is 7199.
          Example:  
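A sketch of connecting to a remote node (the hostname is an assumption; 7199 is the default JMX port):

```
$ bin/nodetool -h 192.168.1.50 -p 7199 status
```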


3. Nodetool supports over 60 commands including:
         > status
         > info
         > ring 
            Example:
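Typical invocations against the local node:

```
$ bin/nodetool status   # state (Up/Down), load, and ownership of every node in the cluster
$ bin/nodetool info     # statistics for this node: uptime, load, heap usage, caches
$ bin/nodetool ring     # the token ring and which node owns which token ranges
```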
4. Sample output of the nodetool info command.



5. Additional nodetool commands
    















Cassandra



Installing, Configuring and Running Cassandra locally

1.  Prepare the Operating System
            a)  Install latest Java 7
            b)  Configure JAVA_HOME 
            c)   Install JNA ( Java Native Access) libraries
            d)   Synchronize clocks on each node by using NTP  protocol.
            e)  Disable swap      (sudo swapoff --all)
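A rough sketch of steps b), d) and e) on a Debian-style system (the paths and package names are assumptions and vary by platform):

```
$ export JAVA_HOME=/usr/lib/jvm/java-7-oracle   # b) point JAVA_HOME at the Java 7 install
$ sudo apt-get install ntp                      # d) keep clocks synchronized via NTP
$ sudo swapoff --all                            # e) disable swap
```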
         

 2. Select and install a Cassandra distribution.
         There are three distributions 
            a)  Cassandra Open Source (Apache Cassandra)
            b)  DSE(Datastax Enterprise)
            c)  DSC(Datastax Community)

                Directory Structure after you install Cassandra

   3. Configure Cassandra for a single node
          Configuration files include
             a) cassandra.yaml
                
             b) cassandra-env.sh
                        
             c)  logback.xml
                      
             d) cassandra-rackdc.properties
             e) cassandra-topology.properties
             f) bin/cassandra.in.sh
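As a minimal illustration, the handful of cassandra.yaml settings most often touched for a single local node (the values shown are the common defaults):

```yaml
cluster_name: 'Test Cluster'
num_tokens: 256
listen_address: localhost
rpc_address: localhost
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"
```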

4. Start and stop the Cassandra instance.

                a) Starting the instance

                b)  Stopping the instance
                       

            c) System logs
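A sketch of steps a) through c) for a tarball install (the paths are assumptions):

```
$ bin/cassandra -f                 # a) start in the foreground; Ctrl+C stops it
$ bin/cassandra -p cassandra.pid   # a) or start in the background, writing a pid file
$ kill $(cat cassandra.pid)        # b) stop the background instance
$ tail -f logs/system.log          # c) watch the system log
```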



         Summary






Saturday, November 28, 2015

Key Features And Benefits Of Cassandra


 Cassandra provides the following features and benefits.

  1. Massively scalable architecture
  2. Active everywhere design
  3. Linear scalable performance
  4. Continuous availability
  5. Transparent fault detection and recovery
  6. Flexible and dynamic data model
  7. Strong data protection
  8. Tunable data consistency
  9. Multi-data center replication
  10. Data compression
  11. CQL

Tuesday, November 3, 2015

Hive Basics for Beginners #2



11. Loading multi-delimiter data into a hive table
       There are four steps that I follow to load this kind of data    
       a)  Creating a single column table in hive
           hive> create table multi_temp(content String);         


        b) Loading data from local file system to the hive single column table.




       c) Creating the desired table in hive
     

        d)  Loading the data from single column table to the desired table 
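The four steps can be sketched as follows, assuming the raw file uses '#' as the delimiter and the target is an employee table (the file path, delimiter, and column names are assumptions):

```sql
-- a) single-column staging table
create table multi_temp(content string);

-- b) load the raw file into the staging table
load data local inpath '/tmp/HadoopPractice/emp_multi.txt' into table multi_temp;

-- c) the desired target table
create table emp_multi(eid int, ename string, salary int, gender string, dept_no int);

-- d) split each line on the delimiter and populate the target table
insert overwrite table emp_multi
select cast(split(content, '#')[0] as int),
       split(content, '#')[1],
       cast(split(content, '#')[2] as int),
       split(content, '#')[3],
       cast(split(content, '#')[4] as int)
from multi_temp;
```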
         

12. Loading XML data into a hive table.
  We can use the same four step approach that we used for the multi-delimiter data, of which the first three steps are the same.
        d) Loading data from the single column table to the desired table.

   Till now we have executed all the queries and performed operations in the hive terminal. We can also do this by writing a script and executing it from the local terminal.

13. Loading nested XML data into hive table.





14. Creating and Executing Hive Scripts.
    a) creating the hive script
   


     b) Executing the hive script.
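A minimal sketch of both steps, assuming a script file named hive_script.hql:

```
$ cat hive_script.hql
use default;
select count(*) from emp_temp;

$ hive -f hive_script.hql
```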











Monday, October 12, 2015

Hive basics for Beginners #1


 1. How to get into hive terminal
        $ hive



 2. Command to display the databases in hive 
    hive> show databases;


    Note:  Hive has a default database named 'default'; if you don't specify a database, it uses the default one.

3. Command to use a database
      hive> use <database_name>;
4. Command to list tables in a database
     hive> show tables;
 5. Create table syntax in hive
   hive> create table emp(eid int, ename string, salary int, gender string, dept_no int ); 
6. Load data into a Hive table from the Local File System
  hive> load data local inpath '<Local_Directory_Path>' into table <hive_table_name>;



  emp.txt is a comma-delimited file, but the table was created with Hive's default field delimiter, hence we are getting NULLs in all the columns. To overcome this we have to modify our query as below

 7.  Creating a table in hive which can accept the comma-delimited file
   hive>create table emp_temp(eid int, ename string, salary int, gender string, dept_no int )
           > row format delimited fields terminated by ',';


   8. Loading data into the above created hive table
      hive>load data local inpath '/tmp/HadoopPractice/emp.txt' into table emp_temp;
     

 Note: In the above output you don't see headers (column names) for the columns. So use the below command to set the headers
  hive> set hive.cli.print.header=true;

9. Now that we have data in our table, let's do some analysis on the data
hive> select sum(salary) as salaries_sum from emp_temp;


10. Import data from one hive table to another hive table;
     hive>    insert overwrite table <to_table_name> select eid,ename from <from_table_name>;


     
________________________________________________________________________________

More Hive Commands

hive> describe <table_name>;
hive>describe extended <table_name>;
hive>show functions;
hive> set hive.cli.print.header=true;
hive> describe function <function_name>
hive> load data  inpath '/tmp/HadoopPractice/emp.txt' into table emp_temp; (Loading data from HDFS to Hive).

Note: when you load data from HDFS into Hive, the file is moved from its original HDFS location into the Hive warehouse directory, so it no longer exists at the source path.


By default all the hive databases are stored in /user/hive/warehouse/
[training@localhost /]$  hadoop fs -ls /user/hive/warehouse/;


















Thursday, October 1, 2015

Basic HDFS Commands for Beginners


1. To display files and directories in HDFS
  $ hadoop fs -ls



 2. To create a directory in HDFS
$ hadoop fs -mkdir HDFSPractice



3. To display the contents of a directory
$ hadoop fs -ls <Directory_Name>  
           


Since our directory is new, we don't have any files in it. So let's add some files to our directory.



4. Loading files into HDFS from our local file system
 $hadoop fs -copyFromLocal /usr/hadoopPractice/employee.txt HDFSPractice/
   


Note:  1. Hadoop is case sensitive, so copyFromLocal is different from copyfromlocal.
      
Now let's display the contents of the HDFSPractice directory as in point 3.


5. Remove/Delete file from HDFS
$ hadoop fs -rm HDFSPractice/employee.txt






6. Loading the file in our HDFS to our local file system
$  hadoop fs -copyToLocal HDFSPractice/employee.txt /usr/





Goals of HDFS
1. Very large distributed file system
          ---- 10k nodes, 10PB data, 100 million files.
2. User Space, runs on heterogeneous OS
3. Optimized for batch processing
          -----  Locations of data exposed so that the computations can move to where data resides
          -----  Provides very high aggregate bandwidth
4. Assumes commodity hardware
         -----  Files are replicated to handle hardware failures.
         -----  Detects the failures and recovers from them.

Thursday, September 3, 2015

Hive

How to handle XML data ?
Using the hive functions xpath and xpath_string
step 1:  Create single column table
create table xmldataTable(col1 string);
step 2 : Load data into single column table
load data local inpath 'xmldata' into table xmldataTable;
step 3: create the required table
create table xml_table2(name string, age int,gend string);

step 4: load data from single column table to final table
insert overwrite table xml_table2 select xpath_string(col1,'rec/name'), xpath_int(col1,'rec/age'), xpath_string(col1,'rec/sex') from xmldataTable;
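For the xpath expressions above to match, each row of the xmldata file should hold one record like the following (the element names follow the query; the values are made up for illustration):

```
<rec><name>John</name><age>30</age><sex>M</sex></rec>
<rec><name>Mary</name><age>28</age><sex>F</sex></rec>
```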





hive> select * from (select * from emp union all select * from emp2) e;



[cloudera@localhost ~]> hive -e 'select eid from emp'


Running hive scripts
[cloudera@localhost ~]$ gedit hivesc.hive
       hivesc.hive file
            use nareshdb;


[cloudera@localhost ~]$ hive -f hivesc.hive