myNMS

Implementation

Most of the code is written in Perl, using the SNMP.pm module to interface to the NET-SNMP (formerly UCD-SNMP) package to gather SNMP data. (Originally Tcl scripts using the Scotty extensions for network management were used.) A MySQL database is used to store most of the data, with a little held in flat files. (Previously mSQL was used but proved unstable.)
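As an illustration, a minimal sketch of that arrangement, querying a device with SNMP.pm and storing the result via DBI (the host name, community string and table layout here are hypothetical):

    use SNMP;
    use DBI;

    # Query the system group of a device
    my $sess = SNMP::Session->new(
        DestHost  => "switch1.example.net",
        Community => "public",
        Version   => 2,
    );
    my $sysDescr = $sess->get("sysDescr.0");

    # Store the result in the MySQL database
    my $dbh = DBI->connect("DBI:mysql:database=mynms", "mynms", "secret");
    $dbh->do("UPDATE devices SET sysDescr = ? WHERE hostname = ?",
             undef, $sysDescr, "switch1.example.net");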

Some of the collection scripts run continuously, some several times an hour, some daily. Some involve collection of data from other machines using SSH: this includes the scripts which gather NIS information as the NMS machine itself does not run NIS (to avoid dependency on another machine which may not be available).

The NMS machine has an Apache web server to present information and interact with users, using a mix of pages generated periodically (for time-consuming reports such as those for all hosts and all users) and pages generated on-the-fly by CGI scripts (such as individual host, user and device SNMP queries). (The Apache server should be upgraded with the mod_perl module to allow faster execution of the CGI scripts.)

Access control

Since information presented by myNMS should not be made available to all users (either for security reasons or because it is not appropriate for their needs), myNMS integrates with NCSA/Apache-style htaccess mechanisms for web pages, both to control users' access to information and to present appropriate views to different groups of users.
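For example, a directory of reports could be protected with a conventional .htaccess file along these lines (the file locations and group name here are hypothetical):

    AuthType       Basic
    AuthName       "myNMS reports"
    AuthUserFile   /usr/local/myNMS/etc/htpasswd
    AuthGroupFile  /usr/local/myNMS/etc/htgroup
    Require        group netadmin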

Components

This diagram shows how the components relate to each other:

Database structure

Configuration files

These are in the configured etc directory (by default, /usr/local/myNMS/etc).

KeepAlive

KeepAlive is a housekeeping program which keeps the collection and processing scripts running continuously, restarting them if/when they die. Most are designed to exit at midnight, so that any memory leaks or other problems cannot build up over many days, weeks or months of running, and also so that log files can conveniently be renamed with the current day's date.

Output of programs is directed to logfiles, constructed from the programs' names with datestamps in the form YYYYMMDD, located in the configured log directory (default /usr/local/myNMS/var/log unless specified otherwise in the configuration file).
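A minimal sketch of constructing such a datestamped log name (the exact filename layout is an assumption; the directory shown is the default):

    use POSIX qw(strftime);

    my $logdir  = "/usr/local/myNMS/var/log";
    my $stamp   = strftime("%Y%m%d", localtime);    # e.g. 20031104
    my $logfile = "$logdir/KeepAlive.$stamp";       # hypothetical layout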

KeepAlive traps signals INT and HUP: both cause it to close down all running programs; after HUP the programs are restarted, whereas after INT the program removes its own .pid file (in the configured var directory) and exits.
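A minimal sketch of this signal handling, assuming hypothetical helper routines stop_all(), start_all() and remove_pidfile():

    # HUP: close down all running programs, then restart them
    $SIG{HUP} = sub {
        stop_all();
        start_all();
    };

    # INT: close down all running programs, tidy up and exit
    $SIG{INT} = sub {
        stop_all();
        remove_pidfile();    # remove our own .pid file
        exit 0;
    };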

If run with one of the command-line parameters -HUP, -INT, -KILL or -SUSPEND, the program finds (from the .pid file) the PID of the running instance of itself and sends a HUP or INT signal to it as appropriate. In the case of the -SUSPEND option it then remains running (in a loop) until it is itself terminated by an INT or KILL signal, so that it prevents the normal restarting of an instance by cron. This allows the programs it runs to be suspended without having to alter either the KeepAlive config file or crontab.

KeepAlive itself is (re)started every minute from cron: on startup it checks to see if a copy of itself is already running (by reading the .pid file and checking that a process with that PID is running) and exits if it is.
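The startup check might look like this minimal sketch (the pid-file path is assumed), driven by a crontab entry of the form "* * * * * /usr/local/myNMS/bin/KeepAlive" (path also assumed):

    # Exit if another instance is already running
    my $pidfile = "/usr/local/myNMS/var/KeepAlive.pid";   # assumed location
    if (open my $fh, "<", $pidfile) {
        chomp(my $pid = <$fh>);
        close $fh;
        exit 0 if $pid && kill 0, $pid;   # signal 0 just tests existence
    }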

Which commands are run is specified in the KeepAlive configuration file etc/KeepAlive.cf: this is re-read every 10 seconds (or whatever interval is configured in the KeepAlive script) and any changes are acted upon by killing any commands no longer found in the config file and starting any new commands.
The module Proc::Simple would probably do the job better than the current implementation:
Proc::Simple helps controlling background processes in Perl. It provides "Process Objects" that mimic their real world counterparts. You don't have to deal with fork and wait and friends; Proc::Simple is very easy to use: you just start processes in the background, poll their status once in a while and kill them if necessary.
However Proc::Simple requires Perl 5.6, which was not available on the development system.
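For reference, a minimal Proc::Simple usage sketch (the command shown is hypothetical):

    use Proc::Simple;

    my $proc = Proc::Simple->new();
    $proc->start("/usr/local/myNMS/bin/SNMP_info");      # run in background

    # ... later, polled once in a while ...
    if (! $proc->poll()) {                               # true while running
        $proc->start("/usr/local/myNMS/bin/SNMP_info");  # restart if dead
    }

    $proc->kill();    # terminate when no longer wanted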


Query

This is the myNMS front-end to data stored in the database (and other files). It operates in two distinct modes: as a CGI script (it is symlinked as Query.cgi from web pages) and in CLI (batch) mode.
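A minimal sketch of how such a dual-mode script might tell the two cases apart (the actual test used by Query is not documented here, so this is an assumption):

    # CGI sets GATEWAY_INTERFACE in the environment; the symlinked
    # name Query.cgi also identifies web use
    if (defined $ENV{GATEWAY_INTERFACE} or $0 =~ /\.cgi$/) {
        print "Content-type: text/html\n\n";
        # ... emit HTML for the web user ...
    } else {
        # ... plain-text batch output for command-line use ...
    }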

SNMP_info

This program runs in three modes (governed by command-line switches).

A note about data structures:

The routines getting/updating devices' SNMP info use a data structure - usually called %SNMP - which is structured thus:

    $SNMP   {deviceID}      deviceID, assigned as the unix time when the device was discovered
            {mtime}         time (in unix seconds) when device info updated from live SNMP query (of system group)
            {sysName}       }
            {sysDescr}      }
            {sysContact}    }   usual system group variables
            {sysLocation}   }
            {sysObjectID}   }
            {sysObjIDtxt}   version of sysObjectID translated to text string (not currently implemented - used to get this from Scotty)
            {UpSinceTime}   sysUpTime translated to an absolute time in unix seconds
            {ifNumber}      number of records in ifTable:
            {ifTable}       a set of data structures, indexed on ifIndex values, i.e.:
            {ifTable}{$index}   each comprising the elements:
                            {ifDescr}       }
                            {ifType}        }
                            {ifSpeed}       }   values from the SNMP ifTable group
                            {ifPhysAddress} }
                            {ifAdminStatus} }
                            {ifOperStatus}  }
                            {ifLastChangeAt}    ifLastChange translated to an absolute time in unix seconds
                            {ifName}        from the ifName table
            {ipAddrTable}   a set of data structures, indexed on IP address i.e.:
            {ipAddrTable}{$Addr}    each comprising the elements:
                            {ipAdEntIfIndex}    ifIndex with which this address is associated
                            {ipAdEntNetMask}    netmask associated with this address
In addition, when information for this device is retrieved from the DB it is added to the structure thus:
            {DB}
                {deviceID}      }
                {sysName}       }
                {sysDescr}      }
                {sysContact}    }
                {sysLocation}   }
                {sysObjectID}   }   values from DB corresponding to 'live' values (above)
                {sysObjIDtxt}   }
                {sysServices}   }
                {ifNumber}      }
                {UpSinceTime}   }
                {mtime}         }
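As an illustration, a minimal sketch of how the 'live' part of such a structure might be filled in from the system group via SNMP.pm (the session parameters and surrounding code are assumptions):

    use SNMP;

    my %SNMP;
    my $sess = SNMP::Session->new(DestHost => $host, Community => $community);

    # Fetch the system group variables in one request
    my @vars = qw(sysName sysDescr sysContact sysLocation sysObjectID sysUpTime);
    my $vl   = SNMP::VarList->new(map { [ $_, 0 ] } @vars);
    my @vals = $sess->get($vl);

    @SNMP{qw(sysName sysDescr sysContact sysLocation sysObjectID)} = @vals[0..4];
    # sysUpTime is in hundredths of a second; convert to absolute unix time
    $SNMP{UpSinceTime} = time() - int($vals[5] / 100);
    $SNMP{mtime}       = time();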

LOG_info

Part of this script processes the logs of squid-type authenticating web caches, correlating records of time + username + IP address in the logs with the IP + MAC + time information from the IP_MAC table to generate the IP_MAC_user table, and also checking for users logged on simultaneously on two or more machines. Normally we will see this sort of thing:
u  u    u u    u   u uu    u u /.../   u   u     u       <- user from logs
I  I    I I    I   I II    I I \...\   I   I     I       <- IP from logs
       |------------ IP-MAC ---/.../----------|          <- IP & MAC from IP_MAC
These sightings will persist over periods of minutes or hours in the case of user activity, and indefinitely (hours -> years) for IP-MAC records. Where a host is used sequentially by many users (e.g. a shared PC) we will see:
u1  u1    u1 u1    u1   u2 u2 u2    u2 u2    u2
       |------------ MAC ----------------|     

or on a multi-user (timesharing) host:
u1  u2    u1 u3    u4   u1 u4 u2  u2 u3 u4    u1
       |------------ MAC ----------------|     
ANOMALIES:

When we have duplicate IP addresses we will see this:

u1  u1    u1 u1    u1       ...  u2 u2 u2     u2 u2
  |------- MAC1 --------|         |------ MAC2 ------|
or with multiple users:
u1  u2 u4    u3 u1    u3     ...  u4 u4   u4   u5 u5 u5
  |------- MAC1 --------|         |------ MAC2 ------|
We can also check (solely from web logs) for accounts possibly shared amongst multiple users:
    u1   u1  u1   u1  u1  u1                        
    I1   I1  I1   I1  I1  I1                        
                  u2     u2  u2 u2   u2
                  I2     I2  I2 I2   I2
                  |----------|
Depending on the duration of the overlap and the type and location of the machines, this may be fairly innocent: a user moving to a new PC in the same lab having failed to log out correctly from their first machine (or using two machines simultaneously), or a user innocently using web browsers on more than one machine at once (e.g. on a PC and, via X, on a unix host to which they connect from the PC). Where the same user is logged into single-user machines in physically separate parts of the campus, however, the account is being shared, and this may have security implications (e.g. a compromised user account).

The sources of data we have are:

In order to process this data we need to correlate the timings of the two sets of data. We need to take account of possible skew between the clocks of the machine recording the IP-MAC timings and the machine making the host-user logs, and also of the fact that IP-MAC records will persist in router ARP tables for some time after the host involved was last active. Allowances for these factors can quite easily be made by adjusting the times of the IP-MAC records (increasing the time-first-seen to allow for clock skew, and reducing the time-last-seen to allow for ARP persistence).
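A minimal sketch of those adjustments (the allowance values and record fields here are illustrative):

    my $skew        = 60;     # allowance for clock skew (seconds)
    my $arp_persist = 600;    # allowance for ARP persistence (seconds)

    for my $rec (@IP_MAC_records) {     # each: { first, last, IPadd, MAC }
        $rec->{first} += $skew;         # increase time-first-seen
        $rec->{last}  -= $arp_persist;  # reduce time-last-seen
    }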

The actual correlation of the two data sets looks as if it ought to be simple, but I have not found it to be so without either holding impracticably large data sets in memory (for example, reading the entire web cache logs and ARP cache data into structures such as hashes indexed on IP address, and correlating one with the other for each address seen in either) or being impracticably slow (e.g. traversing the web cache files once for each host and matching against ARP records).

However a variation of the first approach outlined above seems a possibility:

The arp table itself is small enough to reside in memory, so we could break the cache logs into chunks covering relatively short periods of time, read these into data structures, and correlate them with the arp table.

Thus we might have a table of IP-MAC-time records comprising:

    IPadd           (index), 
    time-first-seen (index),
    time-last-seen  (index),
    MAC address
And, for the web logs, a table of IP-user-time records comprising:
    IPadd       (index),
    time-seen   (index),
    userID
These would be easier to implement as SQL tables than as Perl data structures such as hash tables, as we would want to retrieve records for a given IPadd combined with a time t satisfying t1 <= t <= t2; however, writing these tables out and then retrieving them would be slow. The alternative would be to search through a hash indexed on IPadd and time, looking for records whose time range matched that of the web records. We could have:

    my %IP_MAC;
    %{ $IP_MAC{$IPadd} } = (
        time1 => $t1,
        time2 => $t2,
        MAC   => $MAC,
    );

    my %host_user;
    $host_user{$IPadd}{$time} = $user;
and then, for each host of %host_user:

    retrieve all records from %IP_MAC with the same IP,
        and for each time of $host_user{$IPadd}
            find all records from %IP_MAC with matching times
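A minimal Perl sketch of this loop, using the structures exactly as shown above (note that the hash as written holds only one IP-MAC period per address; handling duplicate addresses over time would need a list of periods per key). %IP_MAC_user here is a hypothetical accumulator for the resulting table:

    my %IP_MAC_user;
    for my $IPadd (keys %host_user) {
        my $rec = $IP_MAC{$IPadd} or next;    # no IP-MAC record for this IP
        for my $time (keys %{ $host_user{$IPadd} }) {
            next unless $time >= $rec->{time1} && $time <= $rec->{time2};
            my $user = $host_user{$IPadd}{$time};
            # record IP + MAC + user + time for the IP_MAC_user table
            push @{ $IP_MAC_user{$IPadd}{ $rec->{MAC} }{$user} }, $time;
        }
    }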
As mentioned, the granularity of ARP records is on the order of 10 minutes, and in this period web cache logs can contain many hundreds or even thousands of records, many of which differ by only milliseconds in timing; a great deal of compaction is therefore possible by amalgamating web log records for the same host + user occurring within a certain period of time, such as 10 or 100 seconds.
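Such compaction might be sketched as follows, assuming the web log records have already been read into a time-sorted list of hashes (the field names and the 100-second window are illustrative):

    my $window = 100;                          # amalgamation window (seconds)
    my (%current, @compacted);
    for my $rec (@web_records) {               # each: { IPadd, time, user }
        my $key  = "$rec->{IPadd} $rec->{user}";
        my $prev = $current{$key};
        if ($prev && $rec->{time} - $prev->{time} <= $window) {
            $prev->{time} = $rec->{time};      # extend the existing record
        } else {
            push @compacted, { %$rec };        # start a new compacted record
            $current{$key} = $compacted[-1];
        }
    }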