Setting Alarms Using AutoSys Jobs and Tools for Handling Alarms

By Lakshmi - June 22, 2019

Alarms are automated notifications triggered when a system detects an abnormal condition that requires human attention to resolve. Unlike automated recovery processes, alarms highlight scenarios where:

Manual intervention is necessary (e.g., decision-making, troubleshooting).

Automated fixes are not possible (e.g., missing dependencies, corrupted data).

Timely action is critical to prevent cascading failures.

Key Characteristics of Such Alarms

Event-Driven

Triggered by specific conditions (e.g., a file not arriving on time, a job failing repeatedly).

Example: A scheduled ETL job fails because an expected input file is missing.

Requires Human Judgment

The system cannot auto-resolve (e.g., deciding whether to proceed with incomplete data or wait).

Example: A payment processing system flags a transaction for fraud review.

Escalation Mechanisms

If unacknowledged, alarms escalate (e.g., email → SMS → phone call → on-call engineer).

Example: A server disk reaching 95% capacity triggers an alert to IT ops, and the alert moves to the next channel if nobody acknowledges it.
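The escalation chain above can be sketched in a few lines of Python. This is only an illustration: the wait times per channel and the names ESCALATION_STEPS and channel_at are assumptions made for the example, not settings from any particular tool.

```python
# Hypothetical escalation ladder: (channel, minutes to wait for an acknowledgement
# before moving on). The wait times are assumed values for illustration only.
ESCALATION_STEPS = [
    ("email", 10),
    ("sms", 10),
    ("phone call", 15),
    ("on-call engineer", 0),  # last step, nothing further to escalate to
]


def channel_at(minutes_unacknowledged: int) -> str:
    """Return the channel that should be in use after the given wait."""
    elapsed = 0
    for channel, wait in ESCALATION_STEPS[:-1]:
        elapsed += wait
        if minutes_unacknowledged < elapsed:
            return channel
    # Every earlier step timed out, so the alarm sits with the final step.
    return ESCALATION_STEPS[-1][0]


if __name__ == "__main__":
    for minutes in (0, 10, 20, 35):
        print(minutes, "min unacknowledged ->", channel_at(minutes))
```

With these assumed waits, an alarm left unacknowledged for 20 minutes has already moved past email and SMS and is at the phone call step.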

Prioritized by Severity

Critical (Red): Immediate action needed (e.g., production outage).

Warning (Yellow): Needs investigation but not urgent (e.g., delayed file arrival).

Informational (Blue): Logged for awareness (e.g., job completed with warnings).
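One possible way to represent these three levels in code is a small enumeration. The classify function and its two boolean inputs are hypothetical; a real setup would derive severity from whatever signals the monitoring tool provides.

```python
from enum import Enum


class Severity(Enum):
    """The three levels listed above, with their colour codes."""
    CRITICAL = "red"
    WARNING = "yellow"
    INFORMATIONAL = "blue"


def classify(production_outage: bool, needs_investigation: bool) -> Severity:
    """Hypothetical rule: map two simple flags to a severity level."""
    if production_outage:
        return Severity.CRITICAL       # immediate action needed
    if needs_investigation:
        return Severity.WARNING        # look into it, but not urgent
    return Severity.INFORMATIONAL      # logged for awareness only


if __name__ == "__main__":
    # A delayed file arrival needs investigation but is not an outage.
    print(classify(production_outage=False, needs_investigation=True))
```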

Example Scenario: Missing File Alarm

Situation

A scheduled batch job depends on an input file expected by 2:00 AM.

By 2:30 AM, the file hasn’t arrived, and the job fails to start.

Alarm Behavior

Detection: Monitoring tool identifies the file is overdue.
Notification: Sends an alert (email/SMS) to the support team.

Manual Intervention Required:

Investigate: Check if the file was delayed or lost.
Decide: Proceed with a backup file, rerun the job later, or abort.
Resolve: Manually trigger the job or fix the upstream issue.
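The detection step in this scenario can be sketched as a simple check in Python. This is not how any specific scheduler implements it; the function name file_is_overdue, the 30-minute grace period, and the file path are assumptions chosen to match the 2:00 AM / 2:30 AM example.

```python
from datetime import datetime, timedelta
from pathlib import Path


def file_is_overdue(path: Path, deadline: datetime,
                    grace: timedelta, now: datetime) -> bool:
    """Return True when the file is still missing once deadline + grace has passed."""
    return now >= deadline + grace and not path.exists()


if __name__ == "__main__":
    # Hypothetical input file and times taken from the scenario above.
    input_file = Path("/data/incoming/input_file.dat")
    deadline = datetime(2019, 6, 22, 2, 0)
    now = datetime(2019, 6, 22, 2, 30)
    if file_is_overdue(input_file, deadline, timedelta(minutes=30), now):
        print("ALARM: input file overdue, notify the support team")
```

The check only detects and notifies. The investigate, decide, and resolve steps stay with the support team.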

Why Automation Isn’t Enough

The system doesn’t know whether to:
Wait longer (network delay?).
Use an older file (is it acceptable?).
Skip the job (will downstream processes break?).
A human must assess and decide.

Best Practices for Such Alarms

Clear Alert Messaging

Include what failed, why it matters, and possible actions.
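A message with those three parts could be assembled as below. The function format_alert and the sample wording are made up for illustration; the layout is just one reasonable choice.

```python
from typing import List


def format_alert(what_failed: str, why_it_matters: str,
                 possible_actions: List[str]) -> str:
    """Build an alert body with the three parts every alarm should carry."""
    lines = [
        f"WHAT FAILED: {what_failed}",
        f"WHY IT MATTERS: {why_it_matters}",
        "POSSIBLE ACTIONS:",
    ]
    # Number the actions so the responder can refer to them by step.
    lines += [f"  {i}. {action}" for i, action in enumerate(possible_actions, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_alert(
        "Input file expected by 2:00 AM has not arrived",
        "The scheduled batch job cannot start without it",
        ["Proceed with a backup file", "Rerun the job later", "Abort"],
    ))
```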

Avoid Alarm Fatigue

Only alert for true exceptions (not every minor delay).
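A minimal sketch of this idea, assuming two simple rules: ignore delays below a tolerance, and do not resend the same alarm within a cooldown window. The cooldown rule is an addition for the example and is not stated above; the names is_true_exception and AlertThrottle are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def is_true_exception(delay: timedelta, tolerance: timedelta) -> bool:
    """Treat a delay as alarm-worthy only once it reaches the tolerance."""
    return delay >= tolerance


class AlertThrottle:
    """Suppress repeats of the same alarm inside a cooldown window (assumed rule)."""

    def __init__(self, cooldown: timedelta) -> None:
        self.cooldown = cooldown
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, alarm_key: str, now: datetime) -> bool:
        last = self._last_sent.get(alarm_key)
        if last is not None and now - last < self.cooldown:
            return False  # same alarm was sent recently, stay quiet
        self._last_sent[alarm_key] = now
        return True


if __name__ == "__main__":
    throttle = AlertThrottle(cooldown=timedelta(minutes=60))
    start = datetime(2019, 6, 22, 2, 30)
    if is_true_exception(timedelta(minutes=30), tolerance=timedelta(minutes=30)):
        print(throttle.should_send("missing_file", start))                          # first alert
        print(throttle.should_send("missing_file", start + timedelta(minutes=5)))   # repeat
```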

Runbook Integration

Link alarms to troubleshooting guides (e.g., "Steps if a file is missing").
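One simple way to link alarms to guides is a lookup table keyed by alarm type. The alarm type names and the fallback guide title below are hypothetical placeholders; only "Steps if a file is missing" comes from the example above.

```python
# Hypothetical mapping from alarm type to runbook title.
RUNBOOKS = {
    "missing_file": "Steps if a file is missing",
    "job_failed": "Steps if a job fails repeatedly",
}

DEFAULT_RUNBOOK = "General troubleshooting guide"


def runbook_for(alarm_type: str) -> str:
    """Return the runbook for an alarm type, falling back to a general guide."""
    return RUNBOOKS.get(alarm_type, DEFAULT_RUNBOOK)


if __name__ == "__main__":
    print("Runbook:", runbook_for("missing_file"))
```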

Post-Incident Review

Analyze if the alarm could be automated or prevented in the future.

Tools Handling Such Alarms:

IT Operations: PagerDuty, Opsgenie, Nagios

Data Pipelines: Apache Airflow (alerts on DAG failures)

Cloud Monitoring: AWS CloudWatch, Azure Monitor
