myNMS

Implementation

Most of the code is written in Perl, using the SNMP.pm module to interface to the NET-SNMP (formerly UCD-SNMP) package to gather SNMP data. (Originally Tcl scripts using the Scotty extensions for network management were used.) A MySQL database is used to store most of the data, with a little held in flat files. (Previously mSQL was used but proved unstable.)
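As an illustration, a minimal sketch of that arrangement, querying a device with SNMP.pm and storing the result via DBI (the host name, community string and table layout here are hypothetical):

    use SNMP;
    use DBI;

    # Query the system group of a device
    my $sess = SNMP::Session->new(
        DestHost  => "switch1.example.net",
        Community => "public",
        Version   => 2,
    );
    my $sysDescr = $sess->get("sysDescr.0");

    # Store the result in the MySQL database
    my $dbh = DBI->connect("DBI:mysql:database=mynms", "mynms", "secret");
    $dbh->do("UPDATE devices SET sysDescr = ? WHERE hostname = ?",
             undef, $sysDescr, "switch1.example.net");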

Some of the collection scripts run continuously, some several times an hour, some daily. Some involve collection of data from other machines using SSH: this includes the scripts which gather NIS information as the NMS machine itself does not run NIS (to avoid dependency on another machine which may not be available).

The NMS machine has an Apache web server to present information and interact with users, using a mix of pages generated periodically (for time-consuming reports such as those for all hosts and all users) and pages generated on-the-fly by CGI scripts (such as individual host, user and device SNMP queries). (The Apache server should be upgraded with the mod_perl module to allow faster execution of the CGI scripts.)

Access control

Since information presented by myNMS should not be made available to all users (either for security reasons or because it is not appropriate for their needs), myNMS integrates with NCSA/Apache-style htaccess mechanisms for web pages, both to control users' access to information and to present appropriate views to different groups of users.
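For example, a directory of reports could be protected with a conventional .htaccess file along these lines (the file locations and group name here are hypothetical):

    AuthType       Basic
    AuthName       "myNMS reports"
    AuthUserFile   /usr/local/myNMS/etc/htpasswd
    AuthGroupFile  /usr/local/myNMS/etc/htgroup
    Require        group netadmin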

Components

This diagram shows how the components relate to each other:

Database structure

Configuration files

These are in the configured etc directory (by default, /usr/local/myNMS/etc).

KeepAlive

KeepAlive is a housekeeping program which keeps the collection and processing scripts running continuously, restarting them if/when they die. Most are designed to exit at midnight, so that any memory leaks or other problems cannot build up over many days, weeks or months of running, and also so that log files can conveniently be renamed with the current day's date.

Output of programs is directed to logfiles, constructed from the programs' names with datestamps in the form YYYYMMDD, located in the configured log directory (default /usr/local/myNMS/var/log unless specified otherwise in the configuration file).
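A minimal sketch of constructing such a datestamped log name (the exact filename layout is an assumption; the directory shown is the default):

    use POSIX qw(strftime);

    my $logdir  = "/usr/local/myNMS/var/log";
    my $stamp   = strftime("%Y%m%d", localtime);    # e.g. 20031104
    my $logfile = "$logdir/KeepAlive.$stamp";       # hypothetical layout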

KeepAlive traps signals INT and HUP: both cause it to close down all running programs; after HUP the programs are restarted, whereas after INT the program removes its own .pid file (in the configured var directory) and exits.
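A minimal sketch of this signal handling, assuming hypothetical helper routines stop_all(), start_all() and remove_pidfile():

    # HUP: close down all running programs, then restart them
    $SIG{HUP} = sub {
        stop_all();
        start_all();
    };

    # INT: close down all running programs, tidy up and exit
    $SIG{INT} = sub {
        stop_all();
        remove_pidfile();    # remove our own .pid file
        exit 0;
    };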

If run with one of the command-line parameters -HUP, -INT, -KILL or -SUSPEND, the program finds (from the .pid file) the PID of the running instance of itself and sends a HUP or INT signal to it as appropriate. In the case of the -SUSPEND option it then remains running (in a loop) until it is itself terminated by an INT or KILL signal, so that it prevents the normal restarting of an instance by cron. This allows the programs it runs to be suspended without having to alter either the KeepAlive config file or crontab.

KeepAlive itself is (re)started every minute from cron: on startup it checks to see if a copy of itself is already running (by reading the .pid file and checking that a process with that PID is running) and exits if it is.
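The startup check might look like this minimal sketch (the pid-file path is assumed), driven by a crontab entry of the form "* * * * * /usr/local/myNMS/bin/KeepAlive" (path also assumed):

    # Exit if another instance is already running
    my $pidfile = "/usr/local/myNMS/var/KeepAlive.pid";   # assumed location
    if (open my $fh, "<", $pidfile) {
        chomp(my $pid = <$fh>);
        close $fh;
        exit 0 if $pid && kill 0, $pid;   # signal 0 just tests existence
    }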

Which commands are run is specified in the KeepAlive configuration file etc/KeepAlive.cf: this is re-read every 10 seconds (or whatever interval is configured in the KeepAlive script) and any changes are acted upon by killing any commands no longer found in the config file and starting any new commands.
The module Proc::Simple would probably do the job better than the current implementation:
Proc::Simple helps controlling background processes in Perl. It provides "Process Objects" that mimic their real world counterparts. You don't have to deal with fork and wait and friends; Proc::Simple is very easy to use: you just start processes in the background, poll their status once in a while and kill them if necessary.
However Proc::Simple requires Perl 5.6, which was not available on the development system.
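For reference, a minimal Proc::Simple usage sketch (the command shown is hypothetical):

    use Proc::Simple;

    my $proc = Proc::Simple->new();
    $proc->start("/usr/local/myNMS/bin/SNMP_info");      # run in background

    # ... later, polled once in a while ...
    if (! $proc->poll()) {                               # true while running
        $proc->start("/usr/local/myNMS/bin/SNMP_info");  # restart if dead
    }

    $proc->kill();    # terminate when no longer wanted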


Query

This is the myNMS front-end to data stored in the database (and other files). It operates in two distinct modes: as a CGI script (it is symlinked as Query.cgi from web pages) and in CLI (batch) mode.
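A minimal sketch of how such a dual-mode script might tell the two cases apart (the actual test used by Query is not documented here, so this is an assumption):

    # CGI sets GATEWAY_INTERFACE in the environment; the symlinked
    # name Query.cgi also identifies web use
    if (defined $ENV{GATEWAY_INTERFACE} or $0 =~ /\.cgi$/) {
        print "Content-type: text/html\n\n";
        # ... emit HTML for the web user ...
    } else {
        # ... plain-text batch output for command-line use ...
    }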

SNMP_info

This program runs in three modes (governed by command-line switches).

A note about data structures:

The routines getting/updating devices' SNMP info use a data structure - usually called %SNMP - which is structured thus:

    $SNMP   {deviceID}      deviceID, assigned as the unix time when the device was discovered
            {mtime}         time (in unix seconds) when device info updated from live SNMP query (of system group)
            {sysName}       }
            {sysDescr}      }
            {sysContact}    }   usual system group variables
            {sysLocation}   }
            {sysObjectID}   }
            {sysObjIDtxt}   version of sysObjectID translated to text string (not currently implemented - used to get this from Scotty)
            {UpSinceTime}   sysUpTime translated to an absolute time in unix seconds
            {ifNumber}      number of records in ifTable:
            {ifTable}       a set of data structures, indexed on ifIndex values, i.e.:
            {ifTable}{$index}   each comprising the elements:
                            {ifDescr}       }
                            {ifType}        }
                            {ifSpeed}       }   values from the SNMP ifTable group
                            {ifPhysAddress} }
                            {ifAdminStatus} }
                            {ifOperStatus}  }
                            {ifLastChangeAt}    ifLastChange translated to an absolute time in unix seconds
                            {ifName}        from the ifName table
            {ipAddrTable}   a set of data structures, indexed on IP address i.e.:
            {ipAddrTable}{$Addr}    each comprising the elements:
                            {ipAdEntIfIndex}    ifIndex with which this address is associated
                            {ipAdEntNetMask}    netmask associated with this address
In addition, when information for this device is retrieved from the DB it is added to the structure thus:
            {DB}
                {deviceID}      }
                {sysName}       }
                {sysDescr}      }
                {sysContact}    }
                {sysLocation}   }
                {sysObjectID}   }   values from DB corresponding to 'live' values (above)
                {sysObjIDtxt}   }
                {sysServices}   }
                {ifNumber}      }
                {UpSinceTime}   }
                {mtime}         }
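As an illustration, a minimal sketch of how the 'live' part of such a structure might be filled in from the system group via SNMP.pm (the session parameters and surrounding code are assumptions):

    use SNMP;

    my %SNMP;
    my $sess = SNMP::Session->new(DestHost => $host, Community => $community);

    # Fetch the system group variables in one request
    my @vars = qw(sysName sysDescr sysContact sysLocation sysObjectID sysUpTime);
    my $vl   = SNMP::VarList->new(map { [ $_, 0 ] } @vars);
    my @vals = $sess->get($vl);

    @SNMP{qw(sysName sysDescr sysContact sysLocation sysObjectID)} = @vals[0..4];
    # sysUpTime is in hundredths of a second; convert to absolute unix time
    $SNMP{UpSinceTime} = time() - int($vals[5] / 100);
    $SNMP{mtime}       = time();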

LOG_info

Part of this script processes the logs of squid-type authenticating web caches, correlating records of time + username + IP address in the logs with the IP + MAC + time information from the IP_MAC table to generate the IP_MAC_user table, and also checking for users logged on simultaneously on two or more machines. Normally we will see this sort of thing:
u  u    u u    u   u uu    u u /.../   u   u     u       <- user from logs
I  I    I I    I   I II    I I \...\   I   I     I       <- IP from logs
       |------------ IP-MAC ---/.../----------|          <- IP & MAC from IP_MAC
These sightings will persist over periods of minutes or hours in the case of user activity, and indefinitely (hours -> years) for IP-MAC records. Where a host is used sequentially by many users (e.g. a shared PC) we will see:
u1  u1    u1 u1    u1   u2 u2 u2    u2 u2    u2
       |------------ MAC ----------------|     

or on a multi-user (timesharing) host:
u1  u2    u1 u3    u4   u1 u4 u2  u2 u3 u4    u1
       |------------ MAC ----------------|     
ANOMALIES:

When we have duplicate IP addresses we will see this:

u1  u1    u1 u1    u1       ...  u2 u2 u2     u2 u2
  |------- MAC1 --------|         |------ MAC2 ------|
or with multiple users:
u1  u2 u4    u3 u1    u3     ...  u4 u4   u4   u5 u5 u5
  |------- MAC1 --------|         |------ MAC2 ------|
We can also check (solely from web logs) for accounts possibly shared amongst multiple users:
    u1   u1  u1   u1  u1  u1                        
    I1   I1  I1   I1  I1  I1                        
                  u2     u2  u2 u2   u2
                  I2     I2  I2 I2   I2
                  |----------|
Depending on the duration of the overlap and the type and location of the machines, this may be fairly innocent: a user moving to a new PC in the same lab having failed to log out correctly from their first machine (or using two machines simultaneously), or a user innocently using web browsers on more than one machine at once (e.g. on a PC and, via X, on a unix host to which they connect from the PC). Where the same user is logged into single-user machines in physically separate parts of the campus, however, the account is being shared, and this may have security implications (e.g. a compromised user account).

The sources of data we have are:

In order to process this data we need to correlate the timings of the two sets of data. We need to take account of possible skew between the clocks of the machine recording the IP-MAC timings and the machine making the host-user logs, and also of the fact that IP-MAC records will persist in router ARP tables for some time after the host involved was last active. Allowances for these factors can quite easily be made by adjusting the times of the IP-MAC records (increasing the time-first-seen to allow for clock skew, and reducing the time-last-seen to allow for ARP persistence).
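A minimal sketch of those adjustments (the allowance values and record fields here are illustrative):

    my $skew        = 60;     # allowance for clock skew (seconds)
    my $arp_persist = 600;    # allowance for ARP persistence (seconds)

    for my $rec (@IP_MAC_records) {     # each: { first, last, IPadd, MAC }
        $rec->{first} += $skew;         # increase time-first-seen
        $rec->{last}  -= $arp_persist;  # reduce time-last-seen
    }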

The actual correlation of the two data sets looks as if it ought to be simple, but I have not found it to be so without either holding impracticably large data sets in memory (for example, reading the entire web cache logs and ARP cache data into structures such as hashes indexed on IP address, and correlating one with the other for each address seen in either) or being impracticably slow (e.g. traversing the web cache files once for each host and matching against ARP records).

However a variation of the first approach outlined above seems a possibility:

The arp table itself is small enough to reside in memory, so we could break the cache logs into chunks covering relatively short periods of time, read these into data structures, and correlate them with the arp table.

Thus we might have a table of IP-MAC-time records comprising:

    IPadd           (index), 
    time-first-seen (index),
    time-last-seen  (index),
    MAC address
And, for the web logs, a table of IP-user-time records comprising:
    IPadd       (index),
    time-seen   (index),
    userID
These would be easier to implement as SQL tables than as Perl data structures such as hash tables, as we would want to retrieve records for a given IPadd combined with a time t satisfying t1 <= t <= t2; however, writing these tables out and then retrieving them would be slow. The alternative would be to search through a hash indexed on IPadd and time, looking for records whose time range matched that of the web records. We could have:

    my %IP_MAC;
    %{ $IP_MAC{$IPadd} } = (
        time1 => $t1,
        time2 => $t2,
        MAC   => $MAC,
    );

    my %host_user;
    $host_user{$IPadd}{$time} = $user;
and then, for each host of %host_user:

    retrieve all records from %IP_MAC with the same IP,
        and for each time of $host_user{$IPadd}
            find all records from %IP_MAC with matching times
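A minimal Perl sketch of this loop, using the structures exactly as shown above (note that the hash as written holds only one IP-MAC period per address; handling duplicate addresses over time would need a list of periods per key). %IP_MAC_user here is a hypothetical accumulator for the resulting table:

    my %IP_MAC_user;
    for my $IPadd (keys %host_user) {
        my $rec = $IP_MAC{$IPadd} or next;    # no IP-MAC record for this IP
        for my $time (keys %{ $host_user{$IPadd} }) {
            next unless $time >= $rec->{time1} && $time <= $rec->{time2};
            my $user = $host_user{$IPadd}{$time};
            # record IP + MAC + user + time for the IP_MAC_user table
            push @{ $IP_MAC_user{$IPadd}{ $rec->{MAC} }{$user} }, $time;
        }
    }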
As mentioned, the granularity of ARP records is on the order of 10 minutes, and in this period web cache logs can contain many hundreds or even thousands of records, many of which differ by only milliseconds in timing; a great deal of compaction is therefore possible by amalgamating web log records for the same host + user occurring within a certain period of time, such as 10 or 100 seconds.
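Such compaction might be sketched as follows, assuming the web log records have already been read into a time-sorted list of hashes (the field names and the 100-second window are illustrative):

    my $window = 100;                          # amalgamation window (seconds)
    my (%current, @compacted);
    for my $rec (@web_records) {               # each: { IPadd, time, user }
        my $key  = "$rec->{IPadd} $rec->{user}";
        my $prev = $current{$key};
        if ($prev && $rec->{time} - $prev->{time} <= $window) {
            $prev->{time} = $rec->{time};      # extend the existing record
        } else {
            push @compacted, { %$rec };        # start a new compacted record
            $current{$key} = $compacted[-1];
        }
    }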