Using the Standalone Analysis Tool (SAT)

All too often we get a support call or email about a system interruption with very scant details: a string of Bxxx DEAD hex codes, or a system abort number with subsystem and status codes. The worst case is a system hang with no status codes at all. The severity of a hang may range from a running system that no one can access to a total eclipse, where even a CTRL-B is ignored. Generally the consensus is, ‘I won’t take the time for a DUMP until it happens again.’ This stance is acceptable if a reasonable amount of time passes before the next event, but it can come back to bite you if ‘the problem’ starts to recur with a vengeance. At that point, even a small amount of data from the first event would have been very valuable. So the question is, what are the alternatives when:

  1. Time is of the essence, and
  2. There is too much data to manually collect given issue #1.

The HP3000 platform has a powerful combination to address these issues: the remote console facility and the Standalone Analysis Tool (SAT). This article focuses on using remote console and SAT as first-line diagnostic tools. For a more comprehensive discussion of how to handle unexpected system interruptions, you are encouraged to review the article titled “Handling System Aborts and System Failures”.

The remote console facility has two major features. First, as its name implies, it allows console control of the system remotely. Second, when used with a terminal emulator, it allows large amounts of data to be collected in a short amount of time. Please contact us if you need help configuring remote console access on your system.

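In summary, the SAT commands used in the procedure below are a stack trace for each CPU (three in this example), followed by exit:

 cpu 0 ; tr,i,d
 cpu 1 ; tr,i,d
 cpu 2 ; tr,i,d
 exit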
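The command list grows with the number of processors, so it can be handy to generate it ahead of time and paste it into the terminal emulator. The following is a minimal sketch, not part of SAT itself. The function name sat_trace_commands is our own invention, and it assumes the CPUs are numbered consecutively from 0, as they are in the three-CPU example in this article.

```python
def sat_trace_commands(cpu_count):
    """Build the SAT command lines: one trace per CPU, then exit.

    Assumption: CPUs are numbered 0 .. cpu_count - 1, matching the
    three-CPU example in this article.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # Same command text as used in the article, one line per CPU.
    commands = [f"cpu {n} ; tr,i,d" for n in range(cpu_count)]
    commands.append("exit")
    return commands


if __name__ == "__main__":
    # Print the list for a three-CPU system, ready to copy and paste.
    print("\n".join(sat_trace_commands(3)))
```

For a three-CPU system this reproduces the four lines listed above.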

After a system abort, the console shows in inverse video:

SYSTEM ABORT 0 FROM SUBSYSTEM 0
SYSTEM HALT 7, $0000

With Status Codes of:

FLT DEAD FLT B907 FLT 0100
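If the console text has been captured with a terminal emulator, the abort banner and status codes can be pulled out automatically rather than retyped. The sketch below is only an illustration. It assumes the capture uses exactly the wording shown in this example ("SYSTEM ABORT n FROM SUBSYSTEM n" and "FLT xxxx" groups), and the name parse_abort is hypothetical. The abort and subsystem numbers are kept as text because we make no assumption about their number base.

```python
import re

# Patterns modeled on the sample console output in this article only.
ABORT_RE = re.compile(r"SYSTEM ABORT (\S+) FROM SUBSYSTEM (\S+)")
FLT_RE = re.compile(r"\bFLT\s+([0-9A-Fa-f]{4})\b")


def parse_abort(console_text):
    """Extract the abort banner and FLT status codes from captured text."""
    match = ABORT_RE.search(console_text)
    # Keep the numbers as strings; their base is not assumed here.
    abort = (match.group(1), match.group(2)) if match else None
    codes = [code.upper() for code in FLT_RE.findall(console_text)]
    return {"abort": abort, "status_codes": codes}


if __name__ == "__main__":
    sample = (
        "SYSTEM ABORT 0 FROM SUBSYSTEM 0\n"
        "SYSTEM HALT 7, $0000\n"
        "FLT DEAD FLT B907 FLT 0100\n"
    )
    print(parse_abort(sample))
```

A system hang that produces no banner simply yields no abort entry and an empty list of status codes.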

At this point, press CTRL-B. (On 9x9 systems, ensure that the key is in the SERVICE position; on 9x8 systems, verify that the toggle switch on the back of the system is in the SERVICE position.)

CM> tc 

Wait for the system to reset. Interrupt the AUTOBOOT process if necessary by pressing any key within 10 seconds when prompted.

Main Menu: Enter command or menu > boot pri
 Interact with IPL (Y or N)?> 
 ISL> sat
 ...
 Processor: 00 HPA: fffa0000 IVA: 00148000 Config: TRUE
 Processor: 01 HPA: fffa2000 IVA: 00970000 Config: TRUE
 Processor: 02 HPA: fffa4000 IVA: 00976000 Config: TRUE 

Current CPU: 0 Original CPU: 0 Monarch CPU: 0 MP array at: c8000

Main memory: 80000000
 HPDIR table: 1000000, len 1000000 HPDIROF table: 2000000, len 800000
 IPDIR table: 2822000
 RGLOB: 0 ICS: 0
 Last PIN: 5 ON ICS DISP running
 Processing dumpworthy file NMLOGMON.PUB.SYS, SID $128, #946 symbols...Done
 Processing dumpworthy file NMCONSOL.PUB.SYS, SID $129, #456 symbols...Done

$2 ($0) nmsat > cpu 0 ; tr,i,d
 PC=a.00178648 idle_disable_int+$8
 NM* 0) SP=81e41390 RP=a.002c5c44 dispatcher+$790
 NM 1) SP=81e41390 RP=a.00177800 iexit
 --- Interrupt Marker
 (end of NM stack)

$4 ($0) nmsat > cpu 1 ; tr,i,d
 PC=a.00178648 idle_disable_int+$8
 NM* 0) SP=81fd0390 RP=a.002c5c44 dispatcher+$790
 NM 1) SP=81fd0390 RP=a.00177800 iexit
 --- Interrupt Marker
 (end of NM stack)

$6 ($0) nmsat > cpu 2 ; tr,i,d
 PC=a.0018304c system_abort
 NM* 0) SP=41855428 RP=a.00182dd8 ?system_abort+$8
 export stub: a.008b61a0 make_pcall_from_debug+$330
 NM 1) SP=41855428 RP=a.007a5d2c func_code_eval+$998
 NM 2) SP=41855128 RP=a.007aa868 func_evaluate+$594
 NM 3) SP=41854ee8 RP=a.007c7dd0 operand_search+$6a4
 NM 4) SP=41852428 RP=a.007cada8 getvalue+$558
 NM 5) SP=4184c7e8 RP=a.007c9e20 getaddrvalue+$d0
 NM 6) SP=4184b6a8 RP=a.007c9900 getfactor+$c4
 NM 7) SP=4184a568 RP=a.007c93cc getterm+$c0
 NM 8) SP=41848c28 RP=a.007c8eb0 getsimpexpr+$1b0
 NM 9) SP=41847ae8 RP=a.007c8934 getanyexpression+$f0
 NM a) SP=418469a8 RP=a.007a28b0 scn_calc+$44
 NM b) SP=41845068 RP=a.00726ca0 do_the_command+$104
 NM c) SP=41844768 RP=a.00727b94 secondary_cmd_loop+$204
 NM d) SP=41844668 RP=a.007280f8 main_cmd_loop+$98
 NM e) SP=418443e8 RP=a.010465c4 nm_debug+$ca0
 NM f) SP=41843ae8 RP=a.00e2d098 dbg_tell_the_owner+$36c
 NM 10) SP=418438a8 RP=a.001ae65c dbg_break_handler+$688
 NM 11) SP=418436e8 RP=a.0036c858 hpe_debug+$504
 NM 12) SP=418435a8 RP=a.00383cd4 recovery_counter+$5c
 NM 13) SP=41843468 RP=a.0013b038 hpe_interrupt_marker_stub
 --- Interrupt Marker
 NM 1) SP=418433e8 RP=a.00a5a064 hxdebug+$e4
 --- End Interrupt Marker Frame ---
 $a ($2f) nmsat > exit

ISL> start norecovery

Gathering information from SAT in this manner takes only a few minutes and is well worth the effort. By comparison, a memory dump can take an hour or more, depending largely on the amount of memory in the system, the number and type of disks in the system volume set that contain virtual memory, and the speed of the tape drive to which memory is dumped. As the sample output above shows, there is far too much information to write down by hand. Connecting to the remote console (or using telnet to reach the GSP on an A-class or N-class system) via a terminal emulator is clearly the way to go. You can capture the entire screen contents and email them to us, both to record the event and for analysis. Alternatively, we can log on to the remote console ourselves and collect the data directly.
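Once the session has been captured to a text file, a short script can split it into one trace per CPU and point out which processor was executing system_abort. The following is a rough sketch under stated assumptions. It expects the capture to look like the sample session above (an "nmsat >" prompt, a "PC=" line, then "NM n) SP=... RP=..." frames), and the function names split_traces and aborting_cpus are our own and not part of SAT. Output from other releases may differ and would need adjusted patterns.

```python
import re

# Patterns modeled on the sample SAT session in this article only.
PROMPT_RE = re.compile(r"nmsat\s*>\s*(.*)$")
CPU_CMD_RE = re.compile(r"cpu\s+(\d+)\s*;\s*tr", re.IGNORECASE)
PC_RE = re.compile(r"PC=\S+\s+(\S+)")
FRAME_RE = re.compile(r"NM\*?\s+[0-9a-f]+\)\s+SP=\S+\s+RP=\S+\s+(\S+)")


def split_traces(capture):
    """Return {cpu_number: {"pc": routine, "frames": [routines]}}."""
    traces = {}
    current = None  # CPU whose trace output is currently being read
    for line in capture.splitlines():
        prompt = PROMPT_RE.search(line)
        if prompt:
            cmd = CPU_CMD_RE.match(prompt.group(1).strip())
            # Any other command (such as exit) ends the current trace.
            current = int(cmd.group(1)) if cmd else None
            if current is not None:
                traces[current] = {"pc": None, "frames": []}
            continue
        if current is None:
            continue
        pc = PC_RE.search(line)
        if pc and traces[current]["pc"] is None:
            traces[current]["pc"] = pc.group(1)
            continue
        frame = FRAME_RE.search(line)
        if frame:
            traces[current]["frames"].append(frame.group(1))
    return traces


def aborting_cpus(traces):
    """List the CPUs whose program counter was in system_abort."""
    return [
        cpu
        for cpu, trace in sorted(traces.items())
        if trace["pc"] and trace["pc"].startswith("system_abort")
    ]
```

Run against the sample session above, this sketch would report CPU 2 as the processor in system_abort, with the other two CPUs idle in the dispatcher. It is meant as a quick first look and does not replace sending us the complete capture.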