Tuesday 3 March 2020

AWS Lambda Function Runtime Upgrade from Python 2.7 to 3.7.x




Why should this upgrade happen?

Python has announced that the 2.x series has reached its end of life (EOL). The message from the Python Software Foundation is:
We have decided that January 1, 2020, will be the day that we sunset Python 2. That means that we will not improve it anymore after that day, even if someone finds a security problem in it. You should upgrade to Python 3 as soon as you can.

Who is this for?

AWS Lambda users whose functions are still running on the Python 2.7 runtime.

Getting ready for some action

1. Run the 2to3 command-line utility

Verify that Python 3 is the default version on the operating system by typing:
$ python --version
Python 3.7.6

2to3 utility
2to3 is a command-line utility that ships with Python 3's default installation (this was tested on macOS). If it is missing, there are numerous ways to get it installed on Windows or Linux.
Target the Lambda function's Python file and run this utility so it can upgrade the code to Python 3 syntax as needed.

If no changes are needed, the utility output looks like this:

$ 2to3 lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: No changes to lambda_function.py
RefactoringTool: Files that need to be modified:
RefactoringTool: lambda_function.py

If changes are needed, the utility outputs a diff like this:

$ 2to3 lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored lambda_function.py
--- lambda_function.py (original)
+++ lambda_function.py (refactored)
@@ -1,5 +1,5 @@
def greet(name):
- print "Hello, {0}!".format(name)
-print "What's your name?"
-name = raw_input()
+ print("Hello, {0}!".format(name))
+print("What's your name?")
+name = input()
greet(name)
RefactoringTool: Files that need to be modified:
RefactoringTool: lambda_function.py
Use the -w flag to write the changes back to the file:
$ 2to3 -w lambda_function.py
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: Skipping optional fixer: ws_comma
RefactoringTool: Refactored lambda_function.py
--- lambda_function.py (original)
+++ lambda_function.py (refactored)
@@ -1,5 +1,5 @@
def greet(name):
- print "Hello, {0}!".format(name)
-print "What's your name?"
-name = raw_input()
+ print("Hello, {0}!".format(name))
+print("What's your name?")
+name = input()
greet(name)
RefactoringTool: Files that were modified:
RefactoringTool: lambda_function.py

2. Package the Python file

The Lambda Python file now has the required Python 3 syntax. Use your regular process to check the file into the repository and build the package with the Python 3 boto3 libraries.
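
For a single-file function, the packaging step can be sketched as below. This is a minimal example, assuming the handler lives in lambda_function.py and that boto3 is bundled alongside it; the file and folder names are illustrative.

# install dependencies into a local folder and zip them with the converted handler
pip3 install boto3 -t package/
cp lambda_function.py package/
cd package && zip -r ../function.zip .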

3. Upload the build to S3 and deploy it, or deploy it using pipelines or Terraform, then test it on Lambda with the Python 3.7 runtime. This worked successfully in our tests.
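
If you prefer the AWS CLI for a quick test, switching the runtime and pushing the new package can be sketched as below; the function name, bucket and key are placeholders.

# point the function at the Python 3.7 runtime (function name is illustrative)
aws lambda update-function-configuration --function-name my-function --runtime python3.7
# update the code from the package uploaded to S3 (bucket/key are illustrative)
aws lambda update-function-code --function-name my-function --s3-bucket my-bucket --s3-key function.zip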

4. Roll out the same process for every other lambda_function.py file that is due for an upgrade.

More references and usage examples for the 2to3 utility can be found at this link.

Friday 8 February 2019

Some useful big-data ecosystem interview questions

So here are some good big-data ecosystem questions, shared with the community for interview preparation.

HDFS
  • What are the main components and node types that make up HDFS?
  • Briefly explain the purpose and function of the FsImage and EditLog.
  • What is the difference between a secondary NameNode and a standby NameNode?
  • What are the options for replicating an HDFS cluster, for example between a production and a DR cluster?
  • How are corrupted HDFS files detected and subsequently recovered?
  • Describe HDFS tiered storage concepts.
Kafka
  • Describe the functions of Kafka MirrorMaker.
  • How do you integrate Kafka MirrorMaker into an existing architecture?
Spark
  • What are the definitions in Spark of jobs, stages and tasks?
  • Describe each of the execution modes that Spark supports.
YARN
  • What is YARN used for? Describe each of its core components.
Cassandra
  • Explain how Cassandra stores data.
  • What is the difference between a partition key and a clustering key?
Thanks to the interview panel for sharing a copy of these questions.

Wednesday 26 September 2018

Access VBA with MSSQL Server - Passing Parameter via VBA code to MSSQL SQL SERVER query





As part of consulting life, you never know what project you will be working on next, so here is a quick little transition story from the open-source world to the Microsoft world.



Tasks Explored

Migrate Access to SQL Server

The first task was to demonstrate that Access VBA code can pass a parameter to a SQL Server query using an ADODB connection.


FYI: SSMA won't work with SQL Server 2008 R2 as it is EOL, so we have to live with a precise manual migration.


What was used


  • MSSQL Server 2008 R2 (or any version)
  • Access 2010 (or any version)
  • Firewall exception for port 1433 added, named pipes enabled
  • MSSQL Server Pubs database with the standard stored procedure





VBA Code


'BeginAppendVB 

    'To integrate this code 
    'replace the data source and initial catalog values 
    'in the connection string 

Public Sub Main() 
    On Error GoTo ErrorHandler 

    'recordset, command and connection variables 
    Dim Cnxn As ADODB.Connection 
    Dim cmdByRoyalty As ADODB.Command 
    Dim prmByRoyalty As ADODB.Parameter 
    Dim rstByRoyalty As ADODB.Recordset 
    Dim rstAuthors As ADODB.Recordset 
    Dim strCnxn As String 
    Dim strSQLAuthors As String 
    Dim strSQLByRoyalty As String 
     'record variables 
    Dim intRoyalty As Integer 
    Dim strAuthorID As String 

    ' Open connection 
    Set Cnxn = New ADODB.Connection 
    strCnxn = "Provider='sqloledb';Data Source='MySqlServer';" & _ 
        "Initial Catalog='Pubs';Integrated Security='SSPI';" 
    Cnxn.Open strCnxn 

    ' Open command object with one parameter 
    Set cmdByRoyalty = New ADODB.Command 
    cmdByRoyalty.CommandText = "byroyalty" 
    cmdByRoyalty.CommandType = adCmdStoredProc 

    ' Get parameter value and append parameter 
    intRoyalty = Trim(InputBox("Enter royalty:")) 
    Set prmByRoyalty = cmdByRoyalty.CreateParameter("percentage", adInteger, adParamInput) 
    cmdByRoyalty.Parameters.Append prmByRoyalty 
    prmByRoyalty.Value = intRoyalty 

    ' Create recordset by executing the command 
    Set cmdByRoyalty.ActiveConnection = Cnxn 
    Set rstByRoyalty = cmdByRoyalty.Execute 

    ' Open the Authors Table to get author names for display 
    ' and set cursor client-side 
    Set rstAuthors = New ADODB.Recordset 
    strSQLAuthors = "Authors" 
    rstAuthors.Open strSQLAuthors, Cnxn, adUseClient, adLockOptimistic, adCmdTable 

    ' Print recordset adding author names from Authors table 
    Debug.Print "Authors with " & intRoyalty & " percent royalty" 

    Do Until rstByRoyalty.EOF 
        strAuthorID = rstByRoyalty!au_id 
        Debug.Print "   " & rstByRoyalty!au_id & ", "; 
        rstAuthors.Filter = "au_id = '" & strAuthorID & "'" 
        Debug.Print rstAuthors!au_fname & " " & rstAuthors!au_lname 
        rstByRoyalty.MoveNext 
    Loop 

    ' clean up 
    rstByRoyalty.Close 
    rstAuthors.Close 
    Cnxn.Close 
    Set rstByRoyalty = Nothing 
    Set rstAuthors = Nothing 
    Set Cnxn = Nothing 
    Exit Sub 

ErrorHandler: 
    ' clean up 
    If Not rstByRoyalty Is Nothing Then 
        If rstByRoyalty.State = adStateOpen Then rstByRoyalty.Close 
    End If 
    Set rstByRoyalty = Nothing 

    If Not rstAuthors Is Nothing Then 
        If rstAuthors.State = adStateOpen Then rstAuthors.Close 
    End If 
    Set rstAuthors = Nothing 

    If Not Cnxn Is Nothing Then 
        If Cnxn.State = adStateOpen Then Cnxn.Close 
    End If 
    Set Cnxn = Nothing 

    If Err <> 0 Then 
        MsgBox Err.Source & "-->" & Err.Description, , "Error" 
    End If 
End Sub 
'EndAppendVB



Database : Microsoft Pubs


Stored proc definition

CREATE PROCEDURE [dbo].[byroyalty] @percentage int

AS

select au_id from titleauthor

where titleauthor.royaltyper = @percentage
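
To sanity-check the stored procedure outside VBA, it can be run directly in SQL Server Management Studio; a minimal sketch (the royalty value 100 is just an illustrative input):

EXEC dbo.byroyalty @percentage = 100;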

Connection String

    ' Open connection
    Set Cnxn = New ADODB.Connection
    strCnxn = "Provider=SQLOLEDB; Data Source=localhost;Database=pubs;Trusted_Connection=Yes;"
    Cnxn.Open strCnxn




Voilà, it works!





Saturday 17 September 2016

Kafka Consumer Plugin Integration with PDI 6.1


Configuration of Kafka consumer Plugin for PDI (Pentaho data integration)



Issue Background:
The Kafka consumer plugin does not work out of the box when integrated with the PDI Spoon plugins, whereas the Kafka producer works out of the box. Many will be wondering how to fix this, so here we go with a workaround.
Tested Environment:
  • PDI 6.1 EE or CE
  • pentaho-kafka-producer-v1.6.zip
  • pentaho-kafka-consumer-v1.4.zip

Producer and Consumer Installation
  • Download the latest Kafka consumer & producer from GitHub - https://github.com/RuckusWirelessIL
  • Wherever you have PDI installed, extract the plugins to the appropriate directory.
Example:

C:\pentaho\pdi6.1\data-integration\plugins\steps or your matching Linux path
Note: if you don't find the steps folder under data-integration\plugins, please create a new folder named "steps".



Open up C:\pentaho\pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer\lib and check that the 17 jar files are present.


Procedure to Copy & Update Library Files

  • Compare the *.jar files in C:\pentaho\pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer\lib against those in pdi6.1\data-integration\lib (see the sketch below).
  • Perform the file copy & replace operation so that the plugin's lib directory carries the same jar versions that ship under data-integration\lib.
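
A quick way to spot version differences on Linux is to diff the two directory listings; a minimal sketch, assuming PDI is installed under /opt/pentaho/pdi6.1 (the path is illustrative):

# compare the jar names shipped with PDI against those bundled with the plugin
diff <(ls /opt/pentaho/pdi6.1/data-integration/lib) \
     <(ls /opt/pentaho/pdi6.1/data-integration/plugins/steps/pentaho-kafka-consumer/lib)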





  • Edit plugin.xml under \pdi6.1\data-integration\plugins\steps\pentaho-kafka-consumer and update the version numbers to match the jar files under \data-integration\lib.

Sample - untouched original plugin.xml
                <libraries>
                                <library name="pentaho-kafka-consumer.jar"/>
                                <library name="lib/jline-0.9.94.jar"/>
                                <library name="lib/jopt-simple-3.2.jar"/>
                                <library name="lib/junit-3.8.1.jar"/>
                                <library name="lib/kafka_2.10-0.8.2.1.jar"/>
                                <library name="lib/kafka-clients-0.8.2.1.jar"/>
                                <library name="lib/log4j-1.2.16.jar"/>
                                <library name="lib/lz4-1.2.0.jar"/>
                                <library name="lib/metrics-core-2.2.0.jar"/>
                                <library name="lib/netty-3.7.0.Final.jar"/>
                                <library name="lib/scala-library-2.10.4.jar"/>
                                <library name="lib/slf4j-api-1.7.2.jar"/>
                                <library name="lib/slf4j-log4j12-1.6.1.jar"/>
                                <library name="lib/snappy-java-1.1.1.6.jar"/>
                                <library name="lib/xercesImpl-2.9.1.jar"/>
                                <library name="lib/xml-apis-1.3.04.jar"/>
                                <library name="lib/zkclient-0.3.jar"/>
                                <library name="lib/zookeeper-3.4.6.jar"/>
                </libraries>
Sample - modified plugin.xml
<libraries>
                                <library name="pentaho-kafka-consumer.jar"/>
                                <library name="lib/jline-0.9.94.jar"/>
                                <library name="lib/jopt-simple-3.2.jar"/>
                                <library name="lib/junit-3.8.1.jar"/>
                                <library name="lib/kafka_2.10-0.8.2.1.jar"/>
                                <library name="lib/kafka-clients-0.8.2.1.jar"/>
                                <library name="lib/log4j-1.2.17.jar"/>  
                                <library name="lib/lz4-1.2.0.jar"/>
                                <library name="lib/metrics-core-2.2.0.jar"/>
                                <library name="lib/netty-3.7.0.Final.jar"/>
                                <library name="lib/scala-library-2.10.4.jar"/>
                                <library name="lib/slf4j-api-1.7.7.jar"/>
                                <library name="lib/slf4j-log4j12-1.7.7.jar"/>
                                <library name="lib/snappy-java-1.1.0.jar"/>
                                <library name="lib/xercesImpl-2.9.1.jar"/>
                                <library name="lib/xml-apis-2.0.2.jar"/> 
                                <library name="lib/zkclient-0.3.jar"/>
                                <library name="lib/zookeeper-3.4.6.jar"/>
                </libraries>

  • Restart Spoon and test the producer and consumer.

Friday 10 June 2016

Data Modelling in Impala - Cloudera aka BIG DATA

Data Modelling in Impala - Cloudera aka BIG DATA -  STAR Schema /  ER Schema - Using Parquet Format using Snappy / GZIP compression 

Background for the research:

Being a SQL fanatic from the relational RDBMS world, I was curious to explore how data modelling [table creation] is done inside the big data world. The data model can be of any schema, like ER, Star or Data Vault. At the end of the day I see these schemas as plain tables with some extra bits and bolts: relationship keys and data organisation among them.

The goal was to create some table structures inside Impala with a good amount of data. Later on, this can be re-implemented as a star schema / ER / Data Vault model for a bigger picture and use case.

I planned to load 46 million records into my Impala table and analyse query performance with sum and average functions over the entire dataset. The query performance results are published in this post.

I selected Impala because, in theory, it can handle a bulk volume of data with an efficient query mechanism, and it can be used for any cube analytics / BI reporting need with an enormous data volume without any MapReduce jobs.

By the end of this lengthy post you will be able to figure out how things work inside Impala. I have tried my best to organise the topics to cover the important areas below.


  1. Impala background & theory
  2. Impala environment recommendations
  3. DDL for internal & external tables
  4. Parquet & CSV file format based DDL
  5. Gzip & Snappy codecs for Parquet
  6. HDFS file ingestion - ingesting sample data into Impala - 46 million rows
  7. Querying Impala
  8. Query performance analysis
  9. Indyco - a data modeller for Impala



Disclaimer:

The information presented here was reproduced after a lot of research and reading from many different sources, to help the open-source community gain knowledge and share thoughts. I take no responsibility for any inaccurate information conveyed. I am human and I make mistakes in understanding theory, but in the end everything produced here works up to some experimental level. Some software download or sample data links may stop working over time, and I take no responsibility for updating this post. So enjoy reading and spreading the knowledge, and you are free to reproduce any content published here.



Prerequisites & knowledge recommendations:
  1. SQL DDL knowledge
  2. Linux knowledge
  3. General exposure to HDFS & the Hadoop ecosystem

My Environment Setup

  • Laptop RAM - 16 GB (this is very much needed, as 10 GB is the minimum required for the Cloudera Enterprise sandbox, and only the enterprise edition has Impala on it)
  • VM - Oracle VirtualBox
  • Cloudera Enterprise sandbox - cloudera-quick-start-vm-5.7.0-0-virtualbox
  • Processor - 2 CPU cores dedicated to the VM





Bingo, you should be all set to run the Cloudera sandbox without any hurdles!






Launching Impala

Impala can be launched and executed via the HUE web interface. It has to be said, Cloudera has made life easy for running the various big data components: just launch Firefox and the browser is already bookmarked with all the necessary URLs to get started. Impala can also be used from the terminal - just launch a terminal and execute impala-shell.

Impala introduction & some theory:

Impala is an MPP SQL engine running inside the Cloudera big data ecosystem. It supports native ANSI-92 SQL (compatible with Hive SQL) - refer to the Impala SQL Reference - so happy days, SQL folks. But be aware that Impala has been designed for efficient query retrieval, hence it supports operations like SELECT, ORDER BY, GROUP BY, joins, views and aggregate functions.

Impala does not support operations like DELETE, UPDATE, foreign keys, constraints, indexes or rollback, so the OLTP scope is limited.


Internal table creation without file formats:

By default, the table creation statement creates an internal table under the HDFS location /user/hive/warehouse/uk_regions:

CREATE TABLE uk_regions (name STRING, abbr STRING); 
INSERT INTO uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");


System logs:


[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/uk_regions
Found 2 items
-rw-r--r-- 1 impala supergroup 210 2016-06-01 12:54 /user/hive/warehouse/uk_regions/2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0.
drwxrwxrwx - impala supergroup 0 2016-06-01 12:54 /user/hive/warehouse/uk_regions/_impala_insert_staging

During the table creation process I did not explicitly mention any table storage format, so the contents get the default storage type: 2345e31cd0a1c8d2-3c2fbcda5d8d159d_1304749469_data.0. The same table can be rewritten in MySQL as follows; this is given just to show the difference in DDL.

CREATE TABLE `uk_regions`( `name` varchar(20) NOT NULL, `abbr` varchar(20) NOT NULL , PRIMARY KEY (`name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
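
Back on the Impala side, to confirm where the table's data landed and which file format it uses, the table metadata can be inspected from impala-shell; a minimal sketch:

-- show location, file format and other table metadata
DESCRIBE FORMATTED uk_regions;
-- show file counts and sizes for the table
SHOW TABLE STATS uk_regions;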

Internal table creation with multiple file formats:

Impala supports the text, RCFile, SequenceFile, Parquet and Avro file formats with their appropriate compression codecs. File formats help Impala store and retrieve data from HDFS efficiently, in either columnar or row-based ordering. The Cloudera documentation table on file formats helps to understand in more depth the different formats and their use cases for INSERT and CREATE statements.



Parquet File format : 

Impala also helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is especially good for queries scanning particular columns within a table, for example to query "wide" tables with many columns, or to perform aggregation operations such as SUM() and AVG() that need to process most or all of the values from a column. Each data file contains the values for a set of rows (the "row group"). Within a data file, the values from each column are organized so that they are all adjacent, enabling good compression for the values from that column. Queries against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O.

Internal table creation in parquet format:

CREATE TABLE parquet_uk_regions (name STRING, abbr STRING) STORED AS PARQUET ;
INSERT INTO parquet_uk_regions VALUES ("Sheffield" ,"SHF"), ("Leeds","LDS"), ("London","LDN");

System logs:



[cloudera@quickstart ~]$ hadoop fs -ls /user/hive/warehouse/parquet_uk_regions/
Found 2 items
drwxrwxrwx - impala supergroup 0 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/_impala_insert_staging
-rw-r--r-- 1 impala supergroup 379 2016-06-01 15:05 /user/hive/warehouse/parquet_uk_regions/c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

Here the file content gets stored as a Parquet file: c044e267795afc59-89f2340de627089b_1709498250_data.0.parq

External tables:
Impala lets you create external tables. The main difference between internal and external tables is that an external table lets the end user hold their data anywhere in the HDFS directory structure. To explain this more clearly, internal tables create and hold data only under the location explained earlier:



/user/hive/warehouse/parquet_uk_regions/

So all internal table content is placed under /user/hive/warehouse, where parquet_uk_regions is a table name. An external table, however, lets the user create folders anywhere in Hadoop as required, as in the example below:

/cheran/CSV

Here CSV is my table name and all the data related to the CSV table is placed under that directory. Another major advantage of external tables is that when you drop them in impala-shell / Hue, the data does not get deleted in HDFS. Before creating external tables we need to create the directory and set the necessary permissions for Impala to execute any query or operation against HDFS.




  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/csv
  • hadoop fs -chmod 777 /user/cheran/csv
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/csv
  • Please note: without executing the above commands, Impala will throw a file permission exception when you execute any query on top of this directory.
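
    To double-check that the directory is in place with the expected owner and permissions before creating the table, a quick listing helps; a minimal sketch:

    hadoop fs -ls -d /user/cheran/csv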

    External table creation in CSV format: the reason to create a CSV-based DDL is that our downloaded sample data is in CSV format. This DDL can store / read CSV files from HDFS.

     create external table if not exists logs (
    TimeSeconds double,
    parsedDate timestamp,
    dateTimeStr string,
    ipLayerProtocol int,
    ipLayerProtocolCode string,
    firstSeenSrcIp string,
    firstSeenDestIp string,
    firstSeenSrcPort int,
    firstSeenDestPor int,
    moreFragment int,
    contFragment int,
    durationSecond int,
    firstSeenSrcPayloadByte bigint,
    firstSeenDestPayloadByte bigint,
    firstSeenSrcTotalByte bigint,
    firstSeenDestTotalByte bigint,
    firstSeenSrcPacketCoun int,
    firstSeenDestPacketCoun int,
    recordForceOut int)
    row format delimited fields terminated by ',' lines terminated by '\n'
    location '/user/cheran/csv/';

    table & sample data referenced from   : http://raffy.ch/

    Now we have tables, and as a next step let's download some sample data that matches the above DDL. The public dataset used here has 46 million records, so it's worth checking how Cloudera handles this volume against time.

    Steps to get the sample data and move to HDFS 
    • wget -P /home/cloudera/Downloads/ http://www.vacommunity.org/dl388
    This is a 500 MB download and it has 46 million records in 3 zip files. Unzip this to a local location.

    • unzip /home/cloudera/Downloads/VAST2013MC3_NetworkFlow.zip -d /home/cloudera/Downloads/

    When extracted, the 500 MB download occupied 5.6 GB of my disk space.

    • hdfs dfs -put /home/cloudera/Downloads/nf/nf-chunk*.csv /user/cheran/csv/

    Moved all the CSV files to the external table location under HDFS. That's all you need to do; you can start querying the data from impala-shell. Below are my query response times when handling 46 million records.
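
    One caveat worth noting: if the external table was created before the files were copied into HDFS, Impala may still hold stale metadata for it. Asking Impala to re-scan the table's files from impala-shell is a one-liner; a minimal sketch:

    -- pick up data files added to the table's HDFS location after creation
    REFRESH logs;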

    CSV tables Query performance 

    [quickstart.cloudera:21000] > select count(*)  from logs;
    Query: select count(*)  from logs
    +----------+
    | count(*) |
    +----------+
    | 46138313 |
    +----------+
    Fetched 1 row(s) in 19.8s

    Query: select avg(durationsecond) from logs
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+
    Fetched 1 row(s) in 23.3s

    CSV tables file storage inside HDFS


    When I queried the HDFS file system:

    [cloudera@quickstart Downloads]$ hadoop fs -du -s -h /user/cheran/csv/
    5.4 G  5.4 G  /user/cheran/csv


    I can see all the CSV files occupied 5.4 GB to hold the 46 million records. Please note Impala can give better query response times if all these CSV files are converted to Parquet format.


    External table creation in Parquet format 

    Just before doing this, let's set up the required folders & permissions in HDFS:



  • hadoop fs -mkdir /user/cheran
  • hadoop fs -mkdir /user/cheran/log_parquet_format
  • hadoop fs -chmod 777 /user/cheran/log_parquet_format
  • sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/cheran/log_parquet_format


  • Please note: without executing the above commands, Impala will throw a file permission exception when you execute any query on top of this directory.

    External table creation in Parquet format: this DDL will store / read Parquet files.

    CREATE EXTERNAL TABLE IF NOT EXISTS log_parquet_format (
                    TimeSeconds double,
                    parsedDate timestamp,
                    dateTimeStr string,
                    ipLayerProtocol int,
                    ipLayerProtocolCode string,
                    firstSeenSrcIp string,
                    firstSeenDestIp string,
                    firstSeenSrcPort int,
                    firstSeenDestPor int,
                    moreFragment int,
                    contFragment int,
                    durationSecond int,
                    firstSeenSrcPayloadByte bigint,
                    firstSeenDestPayloadByte bigint,
                    firstSeenSrcTotalByte bigint,
                    firstSeenDestTotalByte bigint,
                    firstSeenSrcPacketCoun int,
                    firstSeenDestPacketCoun int,
                    recordForceOut int)
    STORED AS PARQUETFILE
    LOCATION '/user/cheran/log_parquet_format/';

    Converting all the existing CSV table data into Parquet format:

    The default compression codec for the Parquet format is Snappy; this can be changed to Gzip as required.

    set compression_codec=snappy;
    insert overwrite table log_parquet_format select * from logs;

    or

    set compression_codec=gzip;
    insert overwrite table log_parquet_format select * from logs;

    This stores the Parquet files in compressed form using Gzip or Snappy compression. Cloudera mostly recommends Snappy because of its query retrieval performance. I tried both Snappy and Gzip to see how they differ in terms of storage space.
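
    To verify which codec is active for the current session, and to see the size and format of the files the conversion produced, both can be checked from impala-shell; a minimal sketch:

    -- display current query options, including COMPRESSION_CODEC
    set;
    -- show file counts, sizes and format for the converted table
    show table stats log_parquet_format;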



    Parquet tables Query performance 

    select sum(durationsecond) from log_parquet_format;
    +---------------------+
    | sum(durationsecond) |
    +---------------------+
    | 922064755           |
    +---------------------+
    Fetched 1 row(s) in 1.85s


    [quickstart.cloudera:21000] > select avg(durationsecond) from log_parquet_format;
    Query: select avg(durationsecond) from log_parquet_format
    +---------------------+
    | avg(durationsecond) |
    +---------------------+
    | 19.98479690738564   |
    +---------------------+

    Fetched 1 row(s) in 1.95s


    Parquet tables file storage inside HDFS

      
    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/gzip/
    448.4 M  448.4 M  /user/cheran/gzip


    [cloudera@quickstart Downloads]$   hadoop fs -du -s -h /user/cheran/log_parquet_format/
    798.6 M  798.6 M  /user/cheran/log_parquet_format



    Getting Started with Indyco:


    I have tried the Indyco tool and it has some cool features. Indyco has efficient data governance & modelling capabilities. It lets you create physical / logical / conceptual data models and deploy them to Impala through a JDBC driver. It also supports Parquet file format based DDL creation. So try it out and let me know your thoughts. You may not get this tool with a straight open download link; you have to request an evaluation download link from Indyco.





    Conclusion

    So we have discussed in detail how to create tables using DDL, and also introduced Indyco. Now it is down to your requirements & hard work to model an ER / STAR / Data Vault schema.


    Let me know your thoughts and the challenges you face whilst doing data modelling. Best of luck with all your Impala adventures!