Thursday, November 12, 2020

Apache Derby Replace function installation for Hive metastore host migration

I was working on a Hive data migration from one host to another. The source host's Hive metastore was configured with Derby, and after we moved the metastore_db folder to the destination server, the Hive tables still referenced the old HDFS host location.


I followed the steps below to replace the old HDFS server reference with the new host.


Download and unzip the commons-lang3 jar, then launch the Derby ij client and install the jar:

wget http://archive.apache.org/dist/commons/lang/binaries/commons-lang3-3.9-bin.zip

unzip commons-lang3-3.9-bin.zip

./ij

 connect 'jdbc:derby:;databaseName=/data/hive/metastore_db/';

call SQLJ.INSTALL_JAR('file:/var/tmp/commons-lang3-3.9/commons-lang3-3.9.jar', 'APP.commons_lang', 0);

call SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.database.classpath', 'APP.commons_lang');

 create function replace (sourceString varchar(8000), searchString varchar(8000), replaceString varchar(8000)) returns varchar(8000) parameter style java no sql language java external name 'org.apache.commons.lang3.StringUtils.replace';


Find the existing records that match the old host:

select sd_id,LOCATION from sds where LOCATION like '%test.cloudapp.net%';

Create a SQL file to export the matching SD_IDs, like below:

vi /var/tmp/exportid.sql

-----------------------------------------------------------------------------

connect 'jdbc:derby:;databaseName=/data/hive/metastore_db/';

select sd_id from sds where LOCATION like '%test.cloudapp.net%';

-----------------------------------------------------------------------------

Then run the command below to export the output:

java -cp "/home/derby/lib/*:." org.apache.derby.tools.ij /var/tmp/exportid.sql >/var/tmp/output.txt

Edit output.txt and remove every line except the SD_ID values.

Create an update SQL script to fix the records

This shell loop creates the update SQL file:

for x in `cat /var/tmp/output.txt`; do echo "UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=$x;" >> /var/tmp/newsql.sql; done
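
If you prefer Python over the shell loop, the same file can be generated with a short script. This is a minimal sketch using the paths from above; it assumes output.txt has already been trimmed down to bare SD_ID values:

-----------------------------------------------------------------------------

# Generate newsql.sql from the cleaned-up output.txt (one SD_ID per line).
with open('/var/tmp/output.txt') as ids, open('/var/tmp/newsql.sql', 'w') as out:
    for line in ids:
        sd_id = line.strip()
        if not sd_id.isdigit():
            continue  # skip any header/blank lines left over from ij output
        out.write("UPDATE sds SET LOCATION=replace(LOCATION,"
                  "'test.cloudapp.net','dummy.test.com') "
                  "WHERE SD_ID=%s;\n" % sd_id)

-----------------------------------------------------------------------------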

Prepend the lines below to the top of newsql.sql; the final SQL file should look like this:

------------------------------------------------------------------------------------------------------------------

connect 'jdbc:derby:;databaseName=/data/hive/metastore_db/';

call SQLJ.INSTALL_JAR('file:/var/tmp/commons-lang3-3.9/commons-lang3-3.9.jar', 'APP.commons_lang', 0);

call SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.database.classpath', 'APP.commons_lang');

create function replace (sourceString varchar(8000), searchString varchar(8000), replaceString varchar(8000)) returns varchar(8000) parameter style java no sql language java external name 'org.apache.commons.lang3.StringUtils.replace';

UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=150;
UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=151;
UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=152;
UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=153;
UPDATE sds SET LOCATION=replace(LOCATION,'test.cloudapp.net','dummy.test.com') WHERE SD_ID=154;

------------------------------------------------------------------------------------------------------------------

java -cp "/home/derby/lib/*:." org.apache.derby.tools.ij /var/tmp/newsql.sql


ij> select replace(LOCATION,'test.cloudapp.net', 'dummy.test.com') from sds;

1

-----------------------------------------------------------------------------------------------------------------------------

hdfs://dummy.test.com:9000/data/hive/warehouse/students

hdfs://dummy.test.com:9000/data/hive/warehouse/students1

2 rows selected

ij> UPDATE sds SET LOCATION=replace(LOCATION, 'test.cloudapp.net', 'dummy.test.com') WHERE SD_ID=1;

1 row inserted/updated/deleted



Thursday, April 2, 2020

Python script to automate Yarn pool utilization report mail

I was working on a simple requirement: send each YARN pool's utilization to the management team every hour.

Although Cloudera offers a cluster utilization report that provides detailed YARN utilization, it has some limitations, so I used my scripting knowledge to implement my own.

Please find the Python script and the email output below.
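
The script essentially polls the ResourceManager REST API and mails the summary. Here is a minimal sketch of that approach, not the full production script: the RM host, sender, and recipient addresses are placeholders, and the per-queue JSON path is for the Fair Scheduler (it differs for the Capacity Scheduler, so inspect the payload for your version):

-----------------------------------------------------------------------------

# Sketch: pull cluster totals and per-queue usage from the YARN
# ResourceManager REST API and email the summary.
import json
import smtplib
import urllib2  # Python 2, which these clusters ran at the time
from email.mime.text import MIMEText

RM = 'http://devmn-02.tanu.com:8088'  # placeholder RM host

metrics = json.load(urllib2.urlopen(RM + '/ws/v1/cluster/metrics'))['clusterMetrics']
scheduler = json.load(urllib2.urlopen(RM + '/ws/v1/cluster/scheduler'))
# Fair Scheduler layout; the path differs for the Capacity Scheduler.
queues = scheduler['scheduler']['schedulerInfo']['rootQueue']['childQueues']['queue']

lines = ['Total Vcores: %s, used: %s (%.2f%%)' % (
             metrics['totalVirtualCores'], metrics['allocatedVirtualCores'],
             100.0 * metrics['allocatedVirtualCores'] / metrics['totalVirtualCores']),
         'Total memory MB: %s, used: %s (%.2f%%)' % (
             metrics['totalMB'], metrics['allocatedMB'],
             100.0 * metrics['allocatedMB'] / metrics['totalMB'])]
for q in queues:
    used = q.get('usedResources', {})
    lines.append('%s: vcores=%s, memoryMB=%s' % (
        q['queueName'], used.get('vCores', 0), used.get('memory', 0)))

msg = MIMEText('\n'.join(lines))
msg['Subject'] = 'Prod Cluster Yarn Utilization report'
msg['From'] = 'clouderaadmin@tanu.com'  # placeholder sender
msg['To'] = 'admins@tanu.com'           # placeholder recipient
smtplib.SMTP('localhost').sendmail(msg['From'], [msg['To']], msg.as_string())

-----------------------------------------------------------------------------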

It generates a report like the one below.

From: ClouderaAdministrator-
Sent: Thursday, April 2, 2020 8:00:01 AM (UTC+00:00) Monrovia, Reykjavik
To: admins; admins@tanu.com
Subject: Prod Cluster Yarn Utilization report

Report generation time: 2020-04-02 04:00:01.043615

Total Resource available: 1440 Vcores, 7.03TB

Total Resource Utilized (value): 916 Vcores, 3.95TB

Total Resource Utilized (%): 63.61%, 56.24%

+------------------+---------------------+------------------------+-----------------+--------------------+
| Queue Name       | Max Allocation(cpu) | Max Allocation(memory) | Cpu Utilized(%) | Memory Utilized(%) |
+------------------+---------------------+------------------------+-----------------+--------------------+
| root.default     | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.dev1        | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.hive        | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.qa1         | 60.0                | 60.0                   | 23.0            | 23.0               |
| root.admin1      | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.dev2        | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.bigdadmin   | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.qa2         | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.cloudera    | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.bigdata     | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.reporting   | 10.0                | 10.0                   | 6.0             | 10.0               |
| root.testing     | 1.0                 | 1.0                    | 1.0             | 0.0                |
| root.informatica | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.admin2      | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.banking     | 5.0                 | 5.0                    | 5.0             | 4.0                |
| root.informatica | 1.0                 | 1.0                    | 0.0             | 0.0                |
| root.java        | 1.0                 | 1.0                    | 0.0             | 1.0                |
| root.users       | 1.0                 | 1.0                    | 0.0             | 0.0                |
+------------------+---------------------+------------------------+-----------------+--------------------+

Friday, February 7, 2020

Cloudera HIVE BDR schedule via Python script

I worked on a new use case, or maybe I was just forced to write this script to reduce my workload. Replicating all the Hive databases from production to the DR cluster was very slow, taking a full day to complete.

I thought of splitting the 100 databases into multiple batches of 10, one batch per Hive BDR configuration. Creating ten batches and updating each configuration by hand felt painful, so I wrote the script below to do it.
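
The core idea, in sketch form: chunk the database list and point one BDR replication schedule at each chunk through the Cloudera Manager REST API. The CM URL, credentials, schedule IDs, and the exact payload shape here are assumptions taken from the API docs, so verify against your CM version:

-----------------------------------------------------------------------------

# Sketch: split databases into batches of 10 and update one pre-created Hive
# BDR replication schedule per batch via the Cloudera Manager REST API.
import requests

CM = 'http://cmhost.tanu.com:7180/api/v19'  # placeholder CM URL/version
AUTH = ('admin', 'admin')                   # placeholder credentials
CLUSTER, SERVICE = 'cluster', 'hive'

databases = [l.strip() for l in open('/var/tmp/databases.txt') if l.strip()]
batches = [databases[i:i + 10] for i in range(0, len(databases), 10)]

# One pre-created BDR schedule per batch; the IDs are placeholders.
schedule_ids = range(1, len(batches) + 1)

for sched_id, batch in zip(schedule_ids, batches):
    url = '%s/clusters/%s/services/%s/replications/%s' % (CM, CLUSTER, SERVICE, sched_id)
    schedule = requests.get(url, auth=AUTH).json()
    # Point this schedule's Hive arguments at the batch of databases.
    # Note: you may need to strip read-only fields (e.g. history) before PUT.
    schedule['hiveArguments']['tableFilters'] = [
        {'database': db, 'tableName': '.*'} for db in batch]
    requests.put(url, auth=AUTH, json=schedule).raise_for_status()

-----------------------------------------------------------------------------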



Tuesday, February 4, 2020

Python script to monitor Cloudera services and roles health status and send email alerts

I know Cloudera has a built-in feature to send cluster status via email, but in our environment the mail function was not working through Cloudera for some reason. I tried every configuration with no luck.

We could not get the mail server support team to resolve the issue. Fortunately, the mail command works from the Unix server, so I thought of writing my own monitoring script using the Cloudera cm_client Python module and sending the alerts to the support group.

Try my script below.
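
Here is a minimal sketch of the approach, using the cm_client module and piping alerts to the Unix mail command. The CM host, credentials, and recipient are placeholders, not our real values:

-----------------------------------------------------------------------------

# Sketch: read service/role health from Cloudera Manager via cm_client and
# mail anything that is not GOOD (this also catches NOT_AVAILABLE etc.).
import subprocess
import cm_client

cm_client.configuration.username = 'admin'  # placeholder
cm_client.configuration.password = 'admin'  # placeholder
api_client = cm_client.ApiClient('http://cmhost.tanu.com:7180/api/v19')

bad = []
clusters_api = cm_client.ClustersResourceApi(api_client)
services_api = cm_client.ServicesResourceApi(api_client)
roles_api = cm_client.RolesResourceApi(api_client)

for cluster in clusters_api.read_clusters(view='SUMMARY').items:
    for service in services_api.read_services(cluster.name, view='FULL').items:
        if service.health_summary != 'GOOD':
            bad.append('%s %s' % (service.name, service.health_summary))
        for role in roles_api.read_roles(cluster.name, service.name).items:
            if role.health_summary != 'GOOD':
                bad.append('%s %s %s' % (service.name, role.name,
                                         role.health_summary))

if bad:
    # The Unix mail command works in our environment, so send through it.
    mail = subprocess.Popen(['mail', '-s', 'Cloudera health alert',
                             'admins@tanu.com'], stdin=subprocess.PIPE)
    mail.communicate('\n'.join(bad).encode())

-----------------------------------------------------------------------------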

Monday, February 3, 2020

Sorting Yarn running jobs using Python

Although the Hadoop YARN ResourceManager provides a nice UI for sorting YARN jobs, it is sometimes difficult to filter out idle sessions with custom sorting.

My script below sorts the idle sessions running for more than an hour (3600000 ms) and prints them with the application ID.

We can manually kill a session using the yarn command, or we could automate that in the same script.
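
Reduced to a sketch, the approach is: pull RUNNING applications from the ResourceManager REST API, keep those past the one-hour threshold, and sort by elapsed time. The RM host matches the tracking URLs in the output below:

-----------------------------------------------------------------------------

# Sketch: list RUNNING apps from the ResourceManager REST API, keep those
# running longer than one hour, and print them sorted by elapsed time.
import json
import urllib2  # Python 2, matching the u'...' output below

RM = 'http://devmn-02.tanu.com:8088'
THRESHOLD_MS = 3600000  # one hour

data = json.load(urllib2.urlopen(RM + '/ws/v1/cluster/apps?state=RUNNING'))
apps = (data.get('apps') or {}).get('app', [])

long_running = [a for a in apps if a['elapsedTime'] > THRESHOLD_MS]
for a in sorted(long_running, key=lambda a: a['elapsedTime'], reverse=True):
    print('App Name: %s' % a['name'])
    print('Application id: %s' % a['id'])
    print('Total elapsed time: %.1f hours' % (a['elapsedTime'] / 3600000.0))
    print('queue Name: %s' % a['queue'])
    print('Allocated memory: %s gb' % (a['allocatedMB'] / 1024))
    print('Tracking Url: %s\n' % a['trackingUrl'])
if not long_running:
    print('No long running jobs!')

-----------------------------------------------------------------------------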



App Name: dev-claim_report1
Application id: application_1580724162250_1127
Total elapsed time: 3.0 hours
queue Name: root.devqueue-1
Allocated memory: 1651 gb
('Tracking Url: ', u'http://devmn-02.tanu.com:8088/proxy/application_1580724162250_1127/')
No long running jobs!

App Name: Spark shell
Application id: application_1580724162250_1151
Total elapsed time: 1.0 hours
queue Name: root.devqueue-2
Allocated memory: 55 gb
('Tracking Url: ', u'http://devmn-02.tanu.com:8088/proxy/application_1580724162250_1151/')

App Name: Spark shell
Application id: application_1580724162250_1152
Total elapsed time: 1.0 hours
queue Name: root.devqueue-3
Allocated memory: 55 gb
('Tracking Url: ', u'http://devmn-02.tanu.com:8088/proxy/application_1580724162250_1152/')

App Name: dev-claim_report2
Application id: application_1580724162250_1141
Total elapsed time: 1.0 hours
queue Name: root.devqueue-4
Allocated memory: 1 gb
('Tracking Url: ', u'http://devmn-02.tanu.com:8088/proxy/application_1580724162250_1141/')




Kudu tablet server metrics check using Python

Recently I got a chance to work on a Kudu issue. Developers started getting the error below, and after going through a lot of documents I came to know that the cluster had not been planned according to the Kudu recommendations.

dropped due to backpressure. The service queue is full; it has 50 items
Below are the Kudu scale recommendations.

Scale

  • Recommended maximum number of tablet servers is 100.
  • Recommended maximum number of masters is 3.
  • Recommended maximum amount of stored data, post-replication and post-compression, per tablet server is 8TB.
  • Recommended maximum number of tablets per tablet server is 2000, post-replication.
  • Maximum number of tablets per table for each tablet server is 60, post-replication, at table-creation time.

I needed to report the number of tablets per server to the developers so that they could clean up tables or reduce partitions to meet the recommendation.

Below is my Python script, which queries the tablet server metrics and prints the details.
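
Sketched down to its core, the script hits each tablet server's /metrics endpoint and counts the tablet entities, printing a PrettyTable like the output below. The server list is the one from the output; whether a per-tablet 'state' attribute is exposed depends on the Kudu version, so treat that filter as an assumption:

-----------------------------------------------------------------------------

# Sketch: count tablet replicas per tablet server from the /metrics JSON.
# The 'state' attribute used for the running count is version-dependent.
import json
import urllib2
from prettytable import PrettyTable, ALL

SERVERS = ['devkn-%02d.tanu.com:8050' % n for n in range(1, 25)]

table = PrettyTable(['Server', 'Running tablets', 'Total_tablets'])
table.hrules = ALL  # border after every row, as in the output below

for server in SERVERS:
    metrics = json.load(urllib2.urlopen('http://%s/metrics' % server))
    tablets = [e for e in metrics if e.get('type') == 'tablet']
    running = [t for t in tablets
               if t.get('attributes', {}).get('state') == 'RUNNING']
    table.add_row([server, len(running), len(tablets)])

print(table)

-----------------------------------------------------------------------------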




+------------------------+-----------------+---------------+
|         Server         | Running tablets | Total_tablets |
+------------------------+-----------------+---------------+
| devkn-01.tanu.com:8050 |      1790       |     11235     |
+------------------------+-----------------+---------------+
| devkn-02.tanu.com:8050 |      1787       |     10970     |
+------------------------+-----------------+---------------+
| devkn-03.tanu.com:8050 |      1924       |     11349     |
+------------------------+-----------------+---------------+
| devkn-04.tanu.com:8050 |      1923       |     11325     |
+------------------------+-----------------+---------------+
| devkn-05.tanu.com:8050 |      1838       |     11297     |
+------------------------+-----------------+---------------+
| devkn-06.tanu.com:8050 |      1924       |     11299     |
+------------------------+-----------------+---------------+
| devkn-07.tanu.com:8050 |      1788       |     11050     |
+------------------------+-----------------+---------------+
| devkn-08.tanu.com:8050 |      1790       |     10564     |
+------------------------+-----------------+---------------+
| devkn-09.tanu.com:8050 |      1921       |     10758     |
+------------------------+-----------------+---------------+
| devkn-10.tanu.com:8050 |      1923       |     10899     |
+------------------------+-----------------+---------------+
| devkn-11.tanu.com:8050 |      1868       |     9254      |
+------------------------+-----------------+---------------+
| devkn-12.tanu.com:8050 |      2269       |     8101      |
+------------------------+-----------------+---------------+
| devkn-13.tanu.com:8050 |      1802       |     10467     |
+------------------------+-----------------+---------------+
| devkn-14.tanu.com:8050 |      1927       |     10875     |
+------------------------+-----------------+---------------+
| devkn-15.tanu.com:8050 |      1601       |     10867     |
+------------------------+-----------------+---------------+
| devkn-16.tanu.com:8050 |      2017       |     10088     |
+------------------------+-----------------+---------------+
| devkn-17.tanu.com:8050 |      1793       |     10391     |
+------------------------+-----------------+---------------+
| devkn-18.tanu.com:8050 |      1683       |     11631     |
+------------------------+-----------------+---------------+
| devkn-19.tanu.com:8050 |      1946       |     9793      |
+------------------------+-----------------+---------------+
| devkn-20.tanu.com:8050 |      1719       |     10488     |
+------------------------+-----------------+---------------+
| devkn-21.tanu.com:8050 |      1703       |     9213      |
+------------------------+-----------------+---------------+
| devkn-22.tanu.com:8050 |      1740       |     9920      |
+------------------------+-----------------+---------------+
| devkn-23.tanu.com:8050 |      1827       |     9953      |
+------------------------+-----------------+---------------+
| devkn-24.tanu.com:8050 |      1929       |     10094     |
+------------------------+-----------------+---------------+



Friday, January 3, 2020

Cloudera cluster creation on google compute instance

I wanted to quickly launch my Cloudera cluster in Google Cloud (I had some free credit and wanted to use it effectively), similar to an on-prem cluster, with single sign-on across all the Linux nodes.

Initially I thought of integrating all the Linux servers with Active Directory + SSSD client, but later I moved to MIT Kerberos + OpenLDAP client + SASL passthrough.

This script has two parts:

PART 1: creates the required number of GCP instances, creates Hadoop users/groups, and installs SASL/OpenLDAP/MIT Kerberos/Cloudera agent and Manager (a sketch of the instance-creation step follows below).

PART 2: adds the hosts into Cloudera Manager, creates the cluster, and adds the HDFS and ZooKeeper services. (Still working on adding more services.)
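
For reference, the instance-creation step in Part 1 boils down to a loop over the gcloud CLI, along these lines (node names, zone, machine type, and disk size are placeholders, not my actual layout):

-----------------------------------------------------------------------------

# Sketch of PART 1's instance-creation step: shell out to gcloud per node.
import subprocess

NODES = ['cdh-mn-01', 'cdh-dn-01', 'cdh-dn-02', 'cdh-dn-03']  # placeholders
ZONE = 'us-central1-a'                                        # placeholder

for node in NODES:
    subprocess.check_call([
        'gcloud', 'compute', 'instances', 'create', node,
        '--zone', ZONE,
        '--machine-type', 'n1-standard-4',
        '--image-family', 'centos-7',
        '--image-project', 'centos-cloud',
        '--boot-disk-size', '100GB',
    ])

-----------------------------------------------------------------------------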

Part 2 of this script can easily scale to any cloud provider (AWS, Azure) as long as the Cloudera Manager URL is exposed to the internet.