Complex Scripting in a Heterogeneous Environment

Introduction

I've uploaded this article for a few reasons. Let me first describe what the article is about, and then describe why I've uploaded it.

This article contains code used to solve a real-life problem faced at work (with proprietary information removed, obviously!). We have some Solaris servers running a mission-critical Domino application that leaks memory. I wrote this solution to monitor the servers using standard Solaris and Domino-specific commands. There were some restrictions. The application developers didn't want lsof to run unless the CPU was more than 50% idle. Some Domino-specific metrics needed to be gathered, as well as standard Solaris metrics (prstat, vmstat, etc). The files should be collected at six-hourly intervals (i.e. four times per day), on a rolling basis (i.e. this Tuesday's files would overwrite last Tuesday's). An intermediate server would then fetch these files once per day and shift them over to a network share on a Windoze server for analysis. As the files would eventually reside on a Windoze box, line-ending conversion would have to occur at some point. Finally, the monitoring should be easy for the application developers to disable (i.e. without editing the root cronjob).

These requirements are the reason I've uploaded the code presented here. It shows *many* scripting techniques, best practices (except for storing passwords in files, ahem), a couple of languages (standard shell scripting and Perl), and working in a heterogeneous environment (even though I am a Unix/Linux System Administrator, sadly I have to provide the Windoze jockeys with data). It shows how a solution to a fairly complex problem can be created easily using *nix.

The Solution

I decided to tackle this in stages. We connect to the Solaris servers via a Linux "drop-box" that has firewall rules tailored to suit. On the Solaris boxes, I would cron a job running as root that gathers the required data. I would build in a simple mechanism so that if the applications guys wanted to disable the monitoring (as "lsof is too resource hungry", apparently - well it is on an ill-equipped E280R running a badly written Domino app...) they could do so by touching a file somewhere. The job would deposit its files in the Domino administrator's home directory. Once per day, a Perl script on the Linux drop-box connects to each of the three Solaris application servers and grabs the files. They are then bundled, compressed into a zip file, and shifted across to the Windoze server. Somewhere along the line, unix2dos should be used so the files can be easily opened by the analysts on their desktops.

The Implementation

OK, I'm not going to dwell too much here. The code is easy to read. The main points to note are as follows: The applications guys can simply disable the monitoring script by running:

    
$ touch /var/tmp/monitor_control.tmp
    
    

And then re-enable it as they wish with:

    
$ rm -f /var/tmp/monitor_control.tmp
    
    
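The check behind this switch is just a file-existence test. A minimal sketch of it as a standalone function follows; the function name monitoring_enabled and the CONTROL_FILE override are my own additions for illustration, not part of the production script.

```shell
#!/bin/sh
# Sketch of the control-file guard. CONTROL_FILE defaults to the path used
# in this article; the override exists only so the check can be tried safely.
CONTROL_FILE="${CONTROL_FILE:-/var/tmp/monitor_control.tmp}"

# Returns 0 (true) when monitoring may run, 1 when the control file exists.
# monitoring_enabled is a hypothetical helper name.
monitoring_enabled() {
    [ ! -f "$1" ]
}

if monitoring_enabled "${CONTROL_FILE}"; then
    echo "Monitoring enabled"
else
    echo "Control file exists - job disabled..." >&2
fi
```

Because the control file lives in /var/tmp, anyone who can write there can flip the switch without touching root's crontab, which is the whole point.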

The lsof check will only execute if CPU is greater than 50% idle. They didn't end up using the NSD lsof which is why this check isn't built into the nsd_lsof_mode() function. If there is less than 51% idle the script will sleep for 60 seconds and then try again (in our case up to a maximum of 30 minutes).
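To make the idle test concrete, here is a small sketch that parses vmstat-style output the same way the script does (last field of the last line) and applies the 50% threshold. The sample text is made up for illustration, and the helper names idle_from_vmstat and cpu_idle_enough are hypothetical. Two samples are requested from vmstat because the first report is a summary since boot rather than a current reading.

```shell
#!/bin/sh
# Print the CPU idle percentage from vmstat-style text on stdin:
# the last field of the last line, as in the script's sed/awk pipeline.
idle_from_vmstat() {
    sed -n '$p' | awk '{print $NF}'
}

# Succeed only when idle ($1) is strictly greater than the threshold
# ($2, defaulting to 50).
cpu_idle_enough() {
    [ "$1" -gt "${2:-50}" ]
}

# Made-up sample in the shape of "vmstat 1 2" output (not real measurements).
SAMPLE=' kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 5123456 812345 12 45  0  0  0  0  0  1  0  0  0  412  903  377  3  2 95
 1 0 0 5120000 810000 30 98  0  0  0  0  0  4  0  0  0  655 2210  702 38 11 51'

IDLE=$(printf '%s\n' "${SAMPLE}" | idle_from_vmstat)
if cpu_idle_enough "${IDLE}"; then
    echo "idle=${IDLE}: run lsof"
else
    echo "idle=${IDLE}: sleep 60 and retry"
fi
```

With this sample the last line reports 51% idle, so the check passes; exactly 50% would not, since the comparison is strictly greater-than.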

The first component of the solution is the main statistics gathering script, which resides on each of the Solaris / Domino application servers, with the various variables tailored to suit each server.

    
#!/usr/bin/ksh

#
# Variable Initialisation
#
PROGNAME=`basename $0`
NOTES_USER=notes
DATA_DIR=/n1/${NOTES_USER}
LOG_DIR=/export/home/${NOTES_USER}/logdir
LSOF=/usr/local/bin/lsof
GZIP=/usr/bin/gzip
NSD=/opt/lotus/bin/nsd
PRSTAT=/usr/bin/prstat
VMSTAT=/usr/bin/vmstat
WHOAMI=/usr/ucb/whoami
RM=/usr/bin/rm
AWK=/usr/bin/awk
SED=/usr/bin/sed
DATE=/usr/bin/date
PWD=/usr/bin/pwd
CHOWN=/usr/bin/chown
TOUCH=/usr/bin/touch
CONTROL_FILE=/var/tmp/monitor_control.tmp
DATE_DAY=`${DATE} +%a`
DATE_HOUR=`${DATE} +%H`
PATH=/usr/bin:/usr/sbin:/usr/local/bin
export PATH

#
# Function Definition
#
usage() {
  {
     echo "Usage: ${PROGNAME} [[-l|-n|-i|-p|-v]|-a]"
     echo "   -l     Run lsof"
     echo "   -n     Run nsd -lsof"
     echo "   -i     Run nsd -info"
     echo "   -p     Run prstat"
     echo "   -v     Run vmstat"
     echo "   -a     Run all diagnostics"
  } >&2
  exit 1
}

all_mode() {
  prstat_mode
  vmstat_mode
  lsof_mode
  nsd_lsof_mode
  nsd_info_mode
  # Return rather than exit so the chown at the end of the script still runs
  return 0
}

prstat_mode() {
 echo "Executing prstat_mode()"
 PRSTAT_LOG="${LOG_DIR}/prstat_${DATE_DAY}_at_${DATE_HOUR}"
 if [ -f "${PRSTAT_LOG}.gz" ]; then
    ${RM} -f ${PRSTAT_LOG}.gz
 else
    ${TOUCH} ${PRSTAT_LOG}
 fi
 # Single snapshot of top 10 processes
 ${PRSTAT} -n 10 1 1 > ${PRSTAT_LOG}
 ${GZIP} -f ${PRSTAT_LOG}
}

vmstat_mode() {
 echo "Executing vmstat_mode()"
 VMSTAT_LOG="${LOG_DIR}/vmstat_${DATE_DAY}_at_${DATE_HOUR}"
 if [ -f "${VMSTAT_LOG}.gz" ]; then
    ${RM} -f ${VMSTAT_LOG}.gz
 else
    ${TOUCH} ${VMSTAT_LOG}
 fi
 # Ten second sample
 ${VMSTAT} 1 10 > ${VMSTAT_LOG}
 ${GZIP} -f ${VMSTAT_LOG}
}

lsof_mode() {
 echo "Executing lsof_mode()"
 MIN_COUNTER=0
 MAX_MIN=30
 FLAG=0
 while [ "${MIN_COUNTER}" -lt "${MAX_MIN}" ]; do
    CURRENT_CPU_IDLE=`${VMSTAT} 1 2 | ${SED} -n '$p' | ${AWK} '{print $NF}'`
    echo "CURRENT_CPU_IDLE->[${CURRENT_CPU_IDLE}]"
    if [ "${CURRENT_CPU_IDLE}" -gt "50" ]; then
       LSOF_LOG="${LOG_DIR}/lsof_${DATE_DAY}_at_${DATE_HOUR}"
       if [ -f "${LSOF_LOG}.gz" ]; then
          ${RM} -f ${LSOF_LOG}.gz
       else
          ${TOUCH} ${LSOF_LOG}
       fi
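       # Use pwd directly, as the nsd functions do: PWD is maintained by
       # the shell itself, so ${PWD} is unreliable as a command path
       WORKING_DIR=`pwd`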
       cd ${DATA_DIR}
       ${LSOF} > ${LSOF_LOG}
       cd ${WORKING_DIR}
       ${GZIP} -f ${LSOF_LOG}
       FLAG=1
       break
    else
       sleep 60
       (( MIN_COUNTER = MIN_COUNTER + 1 ))
    fi
 done
 if [ "${FLAG}" -eq "0" ]; then
    echo "Error: LSOF not executed due to excessive CPU utilisation" >&2
 fi
}

nsd_lsof_mode() {
 echo "Executing nsd_lsof_mode()"
 NSD_LSOF_LOG="${LOG_DIR}/nsd_lsof_${DATE_DAY}_at_${DATE_HOUR}"
 if [ -f "${NSD_LSOF_LOG}.gz" ]; then
    ${RM} -f ${NSD_LSOF_LOG}.gz
 else
    ${TOUCH} ${NSD_LSOF_LOG}
 fi
 WORKING_DIR=`pwd`
 cd ${DATA_DIR}
 ${NSD} -lsof > ${NSD_LSOF_LOG}
 cd ${WORKING_DIR}
 ${GZIP} -f ${NSD_LSOF_LOG}
}

nsd_info_mode() {
 echo "Executing nsd_info_mode()"
 NSD_INFO_LOG="${LOG_DIR}/nsd_info_${DATE_DAY}_at_${DATE_HOUR}"
 if [ -f "${NSD_INFO_LOG}.gz" ]; then
    ${RM} -f ${NSD_INFO_LOG}.gz
 else
    ${TOUCH} ${NSD_INFO_LOG}
 fi
 WORKING_DIR=`pwd`
 cd ${DATA_DIR}
 ${NSD} -info > ${NSD_INFO_LOG}
 cd ${WORKING_DIR}
 ${GZIP} -f ${NSD_INFO_LOG}
}

#
# Various Checks
#
# First, check we're running this script as root
[[ `"${WHOAMI}"` != "root" ]] && {
  echo "Error: You must be root to run this script" >&2
  exit 1
}

# If our control file exists, do not run - this enables us to disable
# the job by just touching the file and not having to edit the crontab as root
[[ -f "${CONTROL_FILE}" ]] && {
  echo "Control file exists - job disabled..." >&2
  exit 1
}

[[ ! -d "${LOG_DIR}" ]] && {
  echo "Error: Output directory ${LOG_DIR} does not exist" >&2
  exit 1
}

if [ "$#" -eq "0" -o "$#" -gt "4"  ]; then
  usage
fi

#
# Argument processing
#
while [ "$#" -gt "0" ]; do
  case "${1}" in
     -p) PRSTAT_MODE="true"
         shift
         ;;
     -v) VMSTAT_MODE="true"
         shift
         ;;
     -l) LSOF_MODE="true"
         shift
         ;;
     -n) NSD_LSOF_MODE="true"
         shift
         ;;
     -i) NSD_INFO_MODE="true"
         shift
         ;;
     -a) ALL_MODE="true"
         shift
         ;;
     *)  usage
         ;;
  esac
done

#
# main()
#
[[ "${ALL_MODE}" = "true" ]] && all_mode
[[ "${PRSTAT_MODE}" = "true" ]] && prstat_mode
[[ "${VMSTAT_MODE}" = "true" ]] && vmstat_mode
[[ "${LSOF_MODE}" = "true" ]] && lsof_mode
[[ "${NSD_LSOF_MODE}" = "true" ]] && nsd_lsof_mode
[[ "${NSD_INFO_MODE}" = "true" ]] && nsd_info_mode

${CHOWN} -R ${NOTES_USER}:notes ${LOG_DIR}

exit 0

    

This script would be saved as /usr/local/bin/system_monitor.ksh on each of the servers. lsof would need to be downloaded from SunFreeware if it wasn't already present on the systems. On all servers, the following cronjob would enable the monitoring at six-hourly intervals:

    
0 4,10,16,22 * * * /usr/local/bin/system_monitor.ksh -l -i -p >/dev/null 2>&1
    
    

Obviously, LOG_DIR should already exist before monitoring commences.

Once this has run a few times, the LOG_DIR will start to populate. Now it's time to get the files across to the Windoze share.
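The rolling behaviour comes purely from the file names: each log is named after the abbreviated weekday and the hour, so next week's run at the same time overwrites this week's file. A rough sketch of the naming follows; log_name and rolling_names are hypothetical helpers, and the weekday abbreviations assume an English/C locale for date +%a.

```shell
#!/bin/sh
# Build a log name the same way the monitoring script does:
# <metric>_<abbreviated weekday>_at_<two-digit hour>
log_name() {
    echo "${1}_${2}_at_${3}"
}

# List every name one metric can produce under the 4,10,16,22 cron schedule.
rolling_names() {
    for DAY in Mon Tue Wed Thu Fri Sat Sun; do
        for HOUR in 04 10 16 22; do
            log_name "$1" "${DAY}" "${HOUR}"
        done
    done
}

# Seven days times four runs gives 28 names per metric
rolling_names prstat | wc -l

# The name a run started right now would use
log_name lsof "$(date +%a)" "$(date +%H)"
```

With the cronjob above passing -l -i -p, that is three metrics, so LOG_DIR should settle at no more than 84 gzipped files per server.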

So, on our Linux "drop-box", the following shell-script wrapper (wrapper.sh) was coded:

    
#!/bin/bash

GET_FILES="/home/someuser/bin/get_files.pl"
BASEDIR="/home/someuser/logdir/"
HOSTS="appsrv01 appsrv02 appsrv03"
USERNAME="dozedomain/dozeuser"
PASSWORD="dozepassword"
TARGET="//dozeserver/dozeshare"

${GET_FILES}
if [ "$?" -ne "0" ]; then
  echo "${GET_FILES} returned with errors - exiting" >&2
  exit 1
fi

for SOURCE in ${HOSTS}; do
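  # Skip the host if its directory is missing, so the rm below can never
  # run in the wrong place
  cd ${BASEDIR}/${SOURCE} || { echo "Cannot cd to ${BASEDIR}/${SOURCE} - skipping" >&2; continue; }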
  if [ "`ls | wc  -l`" -eq 0 ]; then
     echo "No files for ${SOURCE} - skipping" >&2
     continue
  fi
  for SOURCEFILE in *; do
     gzip -d ${SOURCEFILE}
     unix2dos < ${SOURCEFILE%.*} > ${SOURCEFILE%.*}.dos
     mv ${SOURCEFILE%.*}.dos ${SOURCEFILE%.*}
  done
  zip -m ${SOURCE}.zip *
  smbclient ${TARGET} -U "${USERNAME}%${PASSWORD}" -c "cd some/final/dir/${SOURCE};prompt;mput *"
  rm *
done

exit 0
    
    

You can see that this performs our unix2dos conversion, and also converts these files from individually gzipped files into one single zip bundle.
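Two small pieces of that loop are worth isolating: the ${SOURCEFILE%.*} expansion, which drops the .gz suffix to get the uncompressed name, and the line-ending conversion itself. The sketch below shows both; strip_suffix and to_dos are hypothetical names, and the awk one-liner is only a stand-in for unix2dos on boxes where that tool is missing (it assumes an awk that understands \r, such as nawk or gawk).

```shell
#!/bin/sh
# Strip the last suffix from a file name, as ${SOURCEFILE%.*} does in the
# wrapper. A name without a dot comes back unchanged.
strip_suffix() {
    echo "${1%.*}"
}

# Convert LF line endings to CRLF, stdin to stdout. Any existing CR at the
# end of a line is removed first, so converting twice does not double up.
to_dos() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

strip_suffix "prstat_Tue_at_04.gz"

# Show the converted bytes; each line should now end in \r \n
printf 'one\ntwo\n' | to_dos | od -c
```

Redirecting to a temporary name and then moving it into place, as the wrapper does, avoids reading and writing the same file at once.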

You can also see this script calling our Perl get_files.pl script, which connects to each of the application servers in turn and downloads the gathered statistics.

    
#!/usr/bin/perl

use strict;
use warnings;
use Net::SFTP;

my @config = (
  { hostname => "appsrv01", username => "notes", password => "notespass", 
    source => "/export/home/notes/logdir", dest => "/home/someuser/logdir/appsrv01" },
  { hostname => "appsrv02", username => "notes", password => "notespass", 
    source => "/export/home/notes/logdir", dest => "/home/someuser/logdir/appsrv02" },
  { hostname => "appsrv03", username => "notes", password => "notespass", 
    source => "/export/home/notes/logdir", dest => "/home/someuser/logdir/appsrv03" }
);

foreach my $entry ( @config ) {
  printf( "Getting files from %s\n", $entry->{hostname} );
  printf( "Source %s Destination %s\n", $entry->{source}, $entry->{dest} );
  my %args = ( ssh_args => [] );
  $args{user} = $entry->{username};
  $args{password} = $entry->{password};

  my $sftp = Net::SFTP->new($entry->{hostname}, %args);
  my @files = $sftp->ls( $entry->{source} );
  foreach my $file ( @files ) {
     # Skip the directory entries themselves
     next if $file->{filename} eq '.' || $file->{filename} eq '..';
     printf( "Retrieving %s\n", "$entry->{source}/$file->{filename}" );
     $sftp->get( "$entry->{source}/$file->{filename}", 
                 "$entry->{dest}/$file->{filename}" );
  }
}

exit 0;
    
    

As you can see, you'll need to grab Net::SFTP from CPAN for this. Each midnight, the following cronjob fires up the wrapper script, which downloads the files from each of the application servers and transfers them to the Windoze share using smbclient:

    
0 0 * * * /home/someuser/bin/wrapper.sh >/dev/null 2>&1
    
    

Conclusion

Whilst the actual application itself is probably useful only to me, the techniques described here show how versatile the *nix Operating System is. Here, we have system monitoring scripts running on Solaris servers, the generated statistics then gathered by a Linux server, with the converted files finally ending up on a Windoze box.

Cheers
Kevin Waldron
kevin@zazzybob.com

Disclaimer! - This article is provided for guidance only, and does not replace the relevant official documentation and manuals. I will not be held liable for any hosed systems and/or data.
