WatchDawg Alarm

Introduction

The WatchDawg script watches several aspects of the DMT operation to make sure that everything is running normally. Items checked include:
  1. DMT resource usage
  2. Trigger/Segment insertion process
  3. Data flow from the DAQ system
Each of these points is checked every 30 seconds. If an error condition is found, an alarm is generated on the DMT alarm page for LHO or LLO.
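
As a rough illustration of the 30-second check cycle, the Python sketch below runs a list of check functions and collects any alarm messages they return. The names `run_checks` and `poll`, and the convention that each check returns a list of alarm strings, are assumptions made for this example; the actual WatchDawg script is not shown in this document and may be organized differently.

```python
import time
from typing import Callable, List

# Hypothetical convention: each check returns a list of alarm messages
# (an empty list means the check passed).
Check = Callable[[], List[str]]

CHECK_INTERVAL_S = 30  # check interval stated in the text


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check once and return all alarm messages in order."""
    alarms: List[str] = []
    for check in checks:
        alarms.extend(check())
    return alarms


def poll(checks: List[Check], cycles: int, sleep=time.sleep) -> List[List[str]]:
    """Run the checks `cycles` times, pausing CHECK_INTERVAL_S between runs."""
    history = []
    for i in range(cycles):
        history.append(run_checks(checks))
        if i < cycles - 1:
            sleep(CHECK_INTERVAL_S)
    return history
```

Passing `sleep` as a parameter keeps the loop easy to exercise without waiting for real time to elapse.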

DMT Resource Usage

WatchDawg checks resource usage to make sure there is no imminent danger of failure. At present, disk space and FPU traps are checked.

DeviceFull

Text:
Device: $1 is >95% full. $2 kB of $3 kB left.
Meaning:
More than 95% of the available space is in use on the specified device.
Action:
Contact a GDS expert.
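
The 95% threshold can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the WatchDawg implementation: the function names are hypothetical, usage is computed as (total - free) / total, and `shutil.disk_usage` stands in for whatever mechanism WatchDawg actually uses to read disk space.

```python
import shutil
from typing import Optional

FULL_THRESHOLD = 0.95  # alarm when more than 95% of the space is in use


def device_full_alarm(device: str, total_kb: int, free_kb: int) -> Optional[str]:
    """Return the DeviceFull alarm text, or None if usage is at or below 95%."""
    if total_kb <= 0:
        return None  # nothing sensible to report for an empty or unknown device
    used_fraction = (total_kb - free_kb) / total_kb
    if used_fraction > FULL_THRESHOLD:
        return f"Device: {device} is >95% full. {free_kb} kB of {total_kb} kB left."
    return None


def check_path(path: str) -> Optional[str]:
    """Check the file system that holds `path` (assumed stand-in for a device)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return device_full_alarm(path, usage.total // 1024, usage.free // 1024)
```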

DeviceFull

Text:
Device: $1 is not mounted on node $2.
Meaning:
The specified device is not mounted on the named node.
Action:
Contact a GDS or CDS expert.

FPU_Traps

Text:
High FPU trap rate ($1 Hz) on node $2.
Meaning:
The specified node has a high FPU trap rate. This usually indicates that one or more monitors on the node have an error in their calculations that causes floating point exceptions. These exceptions may increase the CPU load significantly.
Action:
Contact a GDS expert.

StatLimit

Text:
System stat exceeds limit: $1 on node $2.
Meaning:
The specified system statistic condition was satisfied on the specified node. The typical statistics
Action:
Contact a GDS expert.

SeqInsert Errors

SeqInsert is the script that submits triggers processed by the trigger manager to the LDAS meta-database. The WatchDawg verifies that an instance of SeqInsert is running on the local communications node (stone or delaronde), that SeqInsert is in the run state, and that it is reasonably up to date. An alarm is generated if any of these tests fails. The alarms are described below.

SeqInsertError

Text:
SeqInsert is in state: $1 on node: $2
Meaning:
SeqInsert has detected an error and entered an error state.
Actions:
  • Verify that LDAS is running.
  • Log on to the communications node (stone or delaronde) as ops and type 'SeqInsert retry'.
  • If the alarm does not clear within 40 seconds, type 'SeqInsert cont'.
  • If the alarm persists after ~1 minute, contact a GDS expert.

NoSeqInsert

Text:
SeqInsert is not running on node: $1
Meaning:
WatchDawg was unable to locate a running instance of SeqInsert. SeqInsert is run by the process manager, so it is likely that the process manager has failed or has given up on restarting SeqInsert.
Action:
Contact a GDS expert.

SeqInsertBacklog

Text:
SeqInsert has backlog of $1 on node: $2
Meaning:
SeqInsert has more than 20 files queued for insertion into the LDAS meta-database.
Actions:
  • Check the alarm page for the SeqInsertError alarm. If this alarm is asserted, handle it first as described above.
  • Verify that the local LDAS system is running.
  • Verify on the current LDAS jobs page (LHO, LLO) that a DMT insertion job (command putMetaData and user ldasdmt) is running. The run time should increase, or a new job should be created, between updates of the page. Note that the jobs come and go 3-4 times per minute, so you may need to update the page a couple of times to see an insertion job.
  • If the job is running, then SeqInsert is handling the backlog. The backlog is normally reduced by about 2 files per minute. Check occasionally to make sure the insertion jobs are running and that the backlog is decreasing.
  • If no DMT jobs are running, log on to stone or delaronde as ops and type 'dfmkill'. This will restart the data-flow manager, and should cause SeqInsert to submit a new job.
  • If the insertion jobs still do not show up in the running job list, contact a GDS expert.
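
To judge whether the backlog is shrinking as expected, the figures quoted above (an alarm threshold of 20 files and a typical reduction of about 2 files per minute) can be turned into a rough estimate. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes a constant net drain rate, which real insertion jobs may not maintain.

```python
import math

BACKLOG_LIMIT = 20         # alarm when more than 20 files are queued (from the text)
DRAIN_FILES_PER_MIN = 2.0  # typical reduction rate quoted in the text


def backlog_alarm(backlog: int) -> bool:
    """True when the queue is large enough to raise SeqInsertBacklog."""
    return backlog > BACKLOG_LIMIT


def minutes_until_clear(backlog: int, rate: float = DRAIN_FILES_PER_MIN) -> int:
    """Estimate whole minutes until the backlog falls back to the limit,
    assuming a constant net drain of `rate` files per minute."""
    excess = backlog - BACKLOG_LIMIT
    if excess <= 0:
        return 0
    return math.ceil(excess / rate)
```

For example, under these assumptions a backlog of 50 files would take about 15 minutes to fall back to the threshold. If the observed backlog is not dropping at roughly that pace, continue with the steps above.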

Shared Memory Errors

The DMT monitors receive data via the shared memory partitions of the DMT data distribution sub-system. The WatchDawg verifies that the data contained by the shared memory partition are current.

DataNotCurrent

Text:
Data on node $1 are not current ($2).
Meaning:
The time stamp of the most recent data is more than 5 seconds behind the current time.
Actions:
  • Verify that the DAQ system and the frame broadcaster are running correctly.
  • Log on to the specified node as ops and run smrepair.
  • Contact a GDS expert if the alarm doesn't clear within 2 minutes.

DateInFuture

Text:
Data on node $1 have future date ($2).
Meaning:
The time stamp of the most recent data in the shared memory partition is in the future.
Action:
Contact a GDS expert.
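
The two time stamp conditions (DataNotCurrent and DateInFuture) can be summarized in a short Python sketch. It assumes that the data time stamp and the current time are expressed in seconds on the same time scale, and that no tolerance is applied to future time stamps; neither detail is stated in this document, and the function name is hypothetical.

```python
from typing import Optional

STALE_LIMIT_S = 5  # data older than this many seconds are "not current" (from the text)


def classify_data_time(latest_data_s: float, now_s: float) -> Optional[str]:
    """Return the alarm name implied by the newest data time stamp, or None."""
    if latest_data_s > now_s:
        return "DateInFuture"
    if now_s - latest_data_s > STALE_LIMIT_S:
        return "DataNotCurrent"
    return None
```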

SMDataLost

Text:
SM Data ($1 seconds) lost on node $2.
Meaning:
The specified number of seconds of data was not copied to the shared memory partition of the specified node.
Action:
Data losses occur on occasion due to DAQ or frame builder restarts and when DMT nodes are heavily loaded. If data losses persist, contact a GDS expert.
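
One way to picture how lost seconds could be counted is to look for gaps between consecutive data segments. The Python sketch below is an assumption about the general technique, not a description of how WatchDawg measures losses: the function name and the (start, duration) input format are hypothetical.

```python
from typing import Iterable, List, Tuple


def lost_seconds(segments: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Given (start, duration) pairs in arrival order, return a
    (gap_start, gap_length) pair for every stretch of missing seconds."""
    gaps = []
    expected = None  # first second we expect the next segment to cover
    for start, duration in segments:
        if expected is not None and start > expected:
            gaps.append((expected, start - expected))
        end = start + duration
        expected = end if expected is None else max(expected, end)
    return gaps
```

With one-second segments starting at 100, 101, 105 and 106, this reports a single gap of 3 seconds beginning at 102.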