Tuesday, October 7, 2008

clusterssh : a life saver for cluster system administrators

As an administrator of a cluster system, I have to repeat the same work on each node, sometimes occasionally and sometimes very often. If the cluster has some useful software installed, such as management tools or an administration GUI, the job can be done nicely and elegantly. Otherwise, it's a disaster. Just imagine typing the same bunch of commands over and over again on each node.

Well, there is a life saver for under-budgeted cluster system administrators.

It is 'clusterssh'.

A few days ago, I was assigned the job of installing Matlab 2008a on 8 Linux machines. They are not a cluster system, but what I had to do was basically install Matlab 2008a 8 times, which means logging in to each node, starting the installation procedure, and typing all the parameters at each step of the installation process again and again.

I used 'clusterssh' instead of normal ssh, and it saved me a lot of time and a lot of typing the same thing over and over.

It's never been easier. Just type this command in your xterm:

$ cssh root@machine1 root@machine2 root@machine3 root@machine4 root@machine5

Then it magically fires up 5 xterm windows concurrently, plus one small 'command window'. Click that command window to give it keyboard focus, and type any command. Now the real magic starts.

For example, if you type 'ls', you will see the command sent to each node and run there, with the result shown in each window.

It's kind of like killing multiple birds with one shot.

So, what I did to install Matlab 2008a on 8 Linux machines was simply this:

(1) put the Matlab DVD in machine 1
(2) export the DVD through NFS
(3) cssh to 7 machines
(4) mount the DVD on them using NFS
(5) go through the Matlab installation only once
(6) Boom! It's done.
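
Steps (2) and (4) might look roughly like the sketch below. It is only an illustration under assumptions: the DVD mount point /media/cdrom and the client mount point /mnt/matlab are hypothetical, the helper names run, export_dvd and mount_dvd are made up for this example, and the real commands need root and a running NFS server on machine 1. By default the helpers only print what they would run; set DRY_RUN=0 to execute for real.

```shell
#!/bin/sh
# Sketch only: paths and helper names are assumptions, not from the original post.

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step (2), on machine 1: export the mounted DVD read-only to all hosts.
# $1 = directory where the DVD is mounted
export_dvd() {
    run exportfs -o ro "*:$1"
}

# Step (4), on each node: create a mount point and mount the export.
# $1 = NFS server, $2 = exported directory, $3 = local mount point
mount_dvd() {
    run mkdir -p "$3"
    run mount -t nfs "$1:$2" "$3"
}
```

For example, export_dvd /media/cdrom prints the exportfs command for machine 1, and mount_dvd machine1 /media/cdrom /mnt/matlab prints the two commands to type once into the cssh command window.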

Friday, August 29, 2008

Enhanced bhist source code

LSF is one of the popular batch schedulers in the supercomputing area. Like many other folks out there, I used LSF for a while until we switched to another one. bhist is an LSF command that displays historical information about jobs.

I was assigned to develop a shell script that lists the jobs of a given user that consumed more than a given amount of wall clock time. So there are 2 input parameters: wall clock time and user ID. The output should be the list of jobs that used more than the given wall clock time, with detailed information about each job. Finally, the output should be suitable for printing and reporting.

This is what I came up with after spending a day or two. To align the columns and rows for the printed layout, it includes lots of intentional tabs and white space.

Here is a sample output:

Extended bhist

=====================================================================================
DATE             JOBID  USER    JOB_NAME        PEND    PSUSP   RUN     SSUSP   TOTAL
-------------------------------------------------------------------------------------
Aug 29 13:33:51  16464  xxx     test1           10      0       82      0       92
Aug 29 13:46:51  16465  xxx     test1           10      0       4402    0       4412
...
...
...
Aug 29 15:20:37  16471  xxx     test1           8       0       13639   0       13647
=====================================================================================
Total CPU time: 569:53:52


Source code

#!/bin/sh
#
# NAME : ebhist.new
# Display output of bhist with time & date information of each job of LSF.
#
# AUTHOR : Brian Kim
# Supercomputer Center
#
# DATE : SEPTEMBER 12, 2005
#

function print_title {
echo $1 $2 $3
echo "Extended bhist : " $1 $2 $3
echo
}

function print_usage {
echo "Usage: ebhist WALLTIME ACCOUNT"
echo
echo " Display job information of [ACCOUNT]"
echo " which consumed more than [WALLTIME] second"
echo
echo "Example:"
echo " ebhist 1000 guest"
echo
}

function print_heading {
# echo "\tDATE\tJOBID\tUSER\tJOB_NAME\tPEND\tPSUSP\tRUN\tUSUSP\tSSUSP\tUNKWN\tTOTAL"
# echo "==============================================================================="
echo "====================================================================================="
echo "DATE\t\tJOBID\tUSER\tJOB_NAME\tPEND\tPSUSP\tRUN\tSSUSP\tTOTAL"
echo "-------------------------------------------------------------------------------------"
# echo "-------------------------------------------------------------------------------"
}

function print_footer {
a=$1
hh=`expr $a \/ 3600`
tmp=`expr $a \% 3600`
mm=`expr $tmp \/ 60`
ss=`expr $tmp \% 60`
if [ $ss -lt 10 ]
then
ss="0"$ss
fi
if [ $mm -lt 10 ]
then
mm="0"$mm
fi
if [ $hh -lt 10 ]
then
hh="0"$hh
fi
echo "====================================================================================="
# echo "==============================================================================="
echo "\t\t\t\t\t\tTotal CPU time: "$hh:$mm:$ss
# echo $hh:$mm:$ss

}
function set_parameter {
min_cpu_time=$1
id=$2
}

# Start of program


print_title $0 $1 $2

if [ $# -ne 2 ]
then
print_usage
exit
fi

set_parameter $1 $2
print_heading

MYSUM=0
my_date=""
# The loop and the footer are grouped in { } so that MYSUM is still set when
# print_footer runs, even in shells that run a piped loop in a subshell.
bhist -a -u $id | {
while read jobid user job_name pend psusp run ususp ssusp unkwn total
do
# Skip blank lines, the column header and the summary line of bhist output
case $jobid in
"") continue ;;
JOBID) continue ;;
Summary) continue ;;
esac

if [ $run -lt $min_cpu_time ]
then
continue
else
# Long job names need one tab less to keep the columns aligned
if [ ${#job_name} -gt 7 ]
then
NUM_TAB='\t'
else
NUM_TAB='\t\t'
fi
MYSUM=`expr $MYSUM + $run`
# echo $MYSUM $run
# echo `bhist -l $jobid | grep Submitted | cut -d: -f1-3` "\t$jobid\t$user\t$job_name$NUM_TAB$pend\t$psusp\t$run\t$ususp\t$ssusp\t$unkwn\t$total"
# Submission date and time of the job, taken from the long listing
my_date=`bhist -l $jobid | grep Submitted | cut -c5-19`
my_date_length=`echo $my_date | wc -c`
if [ $my_date_length -lt 16 ]
then
PADDING=" "
else
PADDING=""
fi

echo $my_date$PADDING" $jobid\t$user\t$job_name$NUM_TAB$pend\t$psusp\t$run\t$ssusp\t$total"
fi
done

print_footer $MYSUM
}
# End of program
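
As an aside, the zero-padding that print_footer does with expr and three if blocks can also be written with printf and shell arithmetic. This is a minimal sketch, not part of the original script: the function name fmt_hms is hypothetical, and it assumes a POSIX shell that supports $(( )) arithmetic.

```shell
#!/bin/sh
# Sketch only: fmt_hms is a hypothetical helper, not from the original script.

# Convert a number of seconds into zero-padded hh:mm:ss.
# $1 = total seconds (non-negative integer)
fmt_hms() {
    total=$1
    # %02d pads each field to at least two digits; hours may grow beyond two.
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}
```

For example, fmt_hms 92 prints 00:01:32, and fmt_hms 2051632 prints 569:53:52, the same format as the total in the sample output.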

Tuesday, July 22, 2008

What is the biggest file on my hard disk?

du + sort

About ten years ago, whenever my hard disk space was running low, I used to run a shell script built on the 'du' command to find out which files were taking the most space, then picked a victim file or directory to delete and free up some room on my hard disk. Basically, the script ran 'du' on each directory recursively, starting from the root directory; after that, the 'sort' command sorted the result in descending order. The first item on the list is then the one taking the most space, and so on.
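
The original script is not shown here, but a minimal sketch of the same idea could look like this. The function name biggest and its arguments are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: biggest is a hypothetical helper, not the original script.

# List the largest files and directories under a starting directory.
# $1 = directory to scan (default: current directory)
# $2 = number of entries to show (default: 10)
biggest() {
    dir=${1:-.}
    count=${2:-10}
    # du -a reports files as well as directories, -k reports sizes in KB;
    # sort -rn orders numerically with the largest first.
    du -ak "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

The first line is the total for the starting directory itself, followed by the largest entries below it. Scanning from / may need root and can take a while.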

KDirStat
Now my SUSE Linux has a GUI application called 'KDirStat'. It does exactly the same thing and also shows the result graphically, which makes it easier to see which file you should delete to get some extra room on your hard disk. :)

For my experiments, I had 3 versions of Matlab installed on my hard disk, each taking about 2GB. Because of these 3 Matlabs, my hard disk was running low, with only 1GB left. I removed the oldest version of Matlab, and now 3GB of disk space is available.

Friday, March 7, 2008

Ubuntu/VirtualBox/OpenSuSE10.3/Full screen

Ubuntu 7.10/VirtualBox1.5.2_OSE/SuSE 10.3 x86_64

I have been playing with VirtualBox on OpenSuSE 10.3 and have been successful with a Windows guest, but the Ubuntu guest irritated me for a few days by refusing to go into full screen mode. I installed the 'Guest Additions' and tried everything I found on Internet forums, but nothing worked.

Finally, I found out the reason. When I was installing the 'Guest Additions', I just clicked the icon of the installation script file and thought it was done. Today, I opened a terminal window and typed the installation script name to run it, and it turned out that I have to be root to install it. I had totally forgotten about this.

So I used the 'sudo' command to run it with root privileges, and it works fine now.

CentOS5/Remote desktop

Using remote desktop on CentOS5 (RHEL5) looks simple until you face the

'unable to connect to host: No route to host (113)'

error message.

This is caused by the firewall settings of the target machine, that is, the machine you want to connect to.

Don't panic!

It's very simple. Log in to the target machine and click the menu 'System/Administration/Security level and Firewall'

and click 'Other ports', then add port 5900.

Once that is done, try connecting again.
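
If the target machine has no GUI at hand, the same port can probably be opened from a shell. The sketch below is an assumption on my part, not something from the steps above: it assumes the stock iptables-based firewall of CentOS5, and the helper name vnc_firewall_cmds is hypothetical. It only prints the commands, so you can review them before running them as root.

```shell
#!/bin/sh
# Sketch only: assumes the default iptables firewall on CentOS5 (RHEL5).
# vnc_firewall_cmds is a hypothetical helper that prints commands, it does
# not change the firewall by itself.

# $1 = TCP port to open (default: 5900, the port added in the GUI above)
vnc_firewall_cmds() {
    port=${1:-5900}
    # Accept new TCP connections to the port ahead of the existing rules
    echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
    # Save the running rules so they survive a reboot
    echo "service iptables save"
}
```

Running vnc_firewall_cmds prints the two commands for port 5900; after checking them, they can be typed in as root on the target machine.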