How to log errors on OPAL
=========================
Currently the errors reported by OPAL interfaces are free-form, whereas
errors reported by the service processor are in the standard Platform Error
Log (PEL) format. For out-of-band management via IPMI interfaces, the errors
reported by OPAL must be pushed down to the service processor via mailbox
in PEL format.

PEL size can vary from 2K to 16K bytes, and its fields need to be populated
based on the kind of event or error being reported. All the information
to be reported as part of the error is passed by the user through the
error-logging interfaces outlined below. A PEL structure is then generated
from that input and passed on to the service processor.

OPAL does create the eSEL error log format for some service processors, but
it is just a wrapper around the PEL format. The actual data still stays in
PEL format.

Error logging interfaces in OPAL
--------------------------------

Interfaces are provided for the user to log or report an error in OPAL. Using
these interfaces, the relevant error information is collected, later converted
to PEL format, and then pushed to the service processor.

Step 1: To report an error, invoke ``opal_elog_create()`` with the required
arguments.

``struct errorlog *opal_elog_create(struct opal_err_info *e_info, uint32_t tag);``

Parameters:

* ``struct opal_err_info *e_info``:
  Struct holding information that identifies the error/event source.

* ``uint32_t tag``: Unique value to identify the data.
  Ideally the ASCII value of a 4-byte string.

The ``opal_err_info`` struct holds several pieces of information that help
identify the error/event. The struct can be obtained via the
``DEFINE_LOG_ENTRY`` macro as shown below; it only needs to be invoked once.
::
DEFINE_LOG_ENTRY(OPAL_RC_ATTN, OPAL_PLATFORM_ERR_EVT, OPAL_CHIP,
OPAL_PLATFORM_FIRMWARE, OPAL_PREDICTIVE_ERR_GENERAL,
OPAL_NA);
The various attributes set by this macro are described below.
``uint8_t opal_error_event_type``: Classification of the error/event
type reported on OPAL. ::
/* Platform Events/Errors: Report Machine Check Interrupt */
#define OPAL_PLATFORM_ERR_EVT 0x01
/* INPUT_OUTPUT: Report all I/O related events/errors */
#define OPAL_INPUT_OUTPUT_ERR_EVT 0x02
/* RESOURCE_DEALLOC: Hotplug events and errors */
#define OPAL_RESOURCE_DEALLOC_ERR_EVT 0x03
/* MISC: Miscellaneous error */
#define OPAL_MISC_ERR_EVT 0x04
``uint16_t component_id``: Component ID of OPAL component as
listed in ``include/errorlog.h``.
``uint8_t subsystem_id``: ID of the subsystem reporting the error. ::
/* OPAL Subsystem IDs listed for reporting events/errors */
#define OPAL_PROCESSOR_SUBSYSTEM 0x10
#define OPAL_MEMORY_SUBSYSTEM 0x20
#define OPAL_IO_SUBSYSTEM 0x30
#define OPAL_IO_DEVICES 0x40
#define OPAL_CEC_HARDWARE 0x50
#define OPAL_POWER_COOLING 0x60
#define OPAL_MISC 0x70
#define OPAL_SURVEILLANCE_ERR 0x7A
#define OPAL_PLATFORM_FIRMWARE 0x80
#define OPAL_SOFTWARE 0x90
#define OPAL_EXTERNAL_ENV 0xA0
``uint8_t event_severity``: Severity of the event/error to be reported. ::
#define OPAL_INFO 0x00
#define OPAL_RECOVERED_ERR_GENERAL 0x10
/* 0x2X series is to denote set of Predictive Error */
/* 0x20 Generic predictive error */
#define OPAL_PREDICTIVE_ERR_GENERAL 0x20
/* 0x21 Predictive error, degraded performance */
#define OPAL_PREDICTIVE_ERR_DEGRADED_PERF 0x21
/* 0x22 Predictive error, fault may be corrected after reboot */
#define OPAL_PREDICTIVE_ERR_FAULT_RECTIFY_REBOOT 0x22
/*
* 0x23 Predictive error, fault may be corrected after reboot,
* degraded performance
*/
#define OPAL_PREDICTIVE_ERR_FAULT_RECTIFY_BOOT_DEGRADE_PERF 0x23
/* 0x24 Predictive error, loss of redundancy */
#define OPAL_PREDICTIVE_ERR_LOSS_OF_REDUNDANCY 0x24
/* 0x4X series for Unrecoverable Error */
/* 0x40 Generic Unrecoverable error */
#define OPAL_UNRECOVERABLE_ERR_GENERAL 0x40
/* 0x41 Unrecoverable error bypassed with degraded performance */
#define OPAL_UNRECOVERABLE_ERR_DEGRADE_PERF 0x41
/* 0x44 Unrecoverable error bypassed with loss of redundancy */
#define OPAL_UNRECOVERABLE_ERR_LOSS_REDUNDANCY 0x44
/* 0x45 Unrecoverable error bypassed with loss of redundancy
* and performance
*/
#define OPAL_UNRECOVERABLE_ERR_LOSS_REDUNDANCY_PERF 0x45
/* 0x48 Unrecoverable error bypassed with loss of function */
#define OPAL_UNRECOVERABLE_ERR_LOSS_OF_FUNCTION 0x48
#define OPAL_ERROR_PANIC 0x50
``uint8_t event_subtype``: Event Sub-type ::
#define OPAL_NA 0x00
#define OPAL_MISCELLANEOUS_INFO_ONLY 0x01
#define OPAL_PREV_REPORTED_ERR_RECTIFIED 0x10
#define OPAL_SYS_RESOURCES_DECONFIG_BY_USER 0x20
#define OPAL_SYS_RESOURCE_DECONFIG_PRIOR_ERR 0x21
#define OPAL_RESOURCE_DEALLOC_EVENT_NOTIFY 0x22
#define OPAL_CONCURRENT_MAINTENANCE_EVENT 0x40
#define OPAL_CAPACITY_UPGRADE_EVENT 0x60
#define OPAL_RESOURCE_SPARING_EVENT 0x70
#define OPAL_DYNAMIC_RECONFIG_EVENT 0x80
#define OPAL_NORMAL_SYS_PLATFORM_SHUTDOWN 0xD0
#define OPAL_ABNORMAL_POWER_OFF 0xE0
``uint8_t opal_srctype``: SRC type; the value should be ``OPAL_SRC_TYPE_ERROR``.
SRC refers to System Reference Code.
It is a 4-byte hexadecimal number that reflects the
current system state.

E.g. BB821010:

* 1st byte -> BB -> SRC Type
* 2nd byte -> 82 -> Subsystem
* 3rd, 4th byte -> Component ID and Reason Code

The SRC needs to be generated on the fly depending on the state
of the system. All the parameters needed to generate an SRC
should be provided when reporting an event/error.
``uint32_t reason_code``: Reason for the failure, as listed in ``include/errorlog.h`` for OPAL.
E.g. the reason code for code-update failures can be:

* ``OPAL_RC_CU_INIT`` -> Initialisation failure
* ``OPAL_RC_CU_FLASH`` -> Flash failure
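
The SRC layout described above (SRC type, subsystem, then a 16-bit component
ID and reason code) can be sketched as follows. This is only an illustration
under stated assumptions, not the OPAL implementation: ``format_src()`` and
its parameters are hypothetical names, and the byte layout is taken solely
from the ``BB821010`` example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SRC_STR_LEN 8 /* 4 bytes rendered as 8 hex digits */

/*
 * Hypothetical helper: render an SRC as an 8-character hex string.
 * Layout assumed from the example above:
 *   byte 1 = SRC type, byte 2 = subsystem,
 *   bytes 3-4 = component ID and reason code.
 * Returns the number of characters written (8 on success).
 */
static int format_src(char out[SRC_STR_LEN + 1], uint8_t src_type,
		      uint8_t subsystem, uint16_t reason)
{
	assert(out != NULL);
	return snprintf(out, SRC_STR_LEN + 1, "%02X%02X%04X",
			(unsigned)src_type, (unsigned)subsystem,
			(unsigned)reason);
}
```

Calling ``format_src(buf, 0xBB, 0x82, 0x1010)`` fills ``buf`` with the string
``BB821010`` used in the example.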
Step 2: Data can be appended to the user data section using either of
the two interfaces below: ::
void log_append_data(struct errorlog *buf, unsigned char *data,
uint16_t size);
Parameters:

``struct errorlog *buf``: ``struct errorlog`` pointer
returned by the ``opal_elog_create()`` call.

``unsigned char *data``: Pointer to the dump data.

``uint16_t size``: Size of the dump data.

``void log_append_msg(struct errorlog *buf, const char *fmt, ...);``

Parameters:

``struct errorlog *buf``: Pointer returned by the ``opal_elog_create()``
call.

``const char *fmt``: Formatted error log string.
Additional user data sections can be added to the error log to
separate data (e.g. readable text vs. binary data) by calling
``log_add_section()``. The interfaces in Step 2 operate on the last
user data section of the error log.

``void log_add_section(struct errorlog *buf, uint32_t tag);``

Parameters:

``struct errorlog *buf``: Pointer returned by the ``opal_elog_create()`` call.

``uint32_t tag``: Unique value to identify the data.
Ideally the ASCII value of a 4-byte string.
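
Since a tag is ideally the ASCII value of a 4-byte string, it can be built
from four characters rather than written as a raw hex constant. The helper
below is a minimal sketch; ``elog_tag()`` is a hypothetical name and is not
part of the OPAL interfaces.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack four ASCII characters into a 32-bit tag,
 * with the first character in the most significant byte.
 */
static uint32_t elog_tag(char a, char b, char c, char d)
{
	return ((uint32_t)(uint8_t)a << 24) |
	       ((uint32_t)(uint8_t)b << 16) |
	       ((uint32_t)(uint8_t)c << 8) |
	       (uint32_t)(uint8_t)d;
}
```

With this helper, ``elog_tag('K', 'K', 'K', 'K')`` evaluates to ``0x4b4b4b4b``,
which a hex dump of the section renders as ``KKKK``.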
Step 3: There is a platform hook for committing the OPAL error log to any
service processor (currently used for FSP- and BMC-based machines).
Below is a snippet showing how this hook is called.
::

    void log_commit(struct errorlog *elog)
    {
        ....
        ....
        if (platform.elog_commit) {
            rc = platform.elog_commit(elog);
            if (rc)
                prerror("ELOG: Platform commit error %d"
                        "\n", rc);
            return;
        }
        ....
        ....
    }
Step 3.1 FSP:
::
.elog_commit = elog_fsp_commit
Once all the data for an error has been logged, the error needs to
be committed to the FSP.

In the process of committing an error to the FSP, the log information is
first internally converted to PEL format and then pushed to the FSP.
The FSP then takes care of sending all logs, both its own and
OPAL's, to POWERNV.

OPAL maintains a timeout field for every error log it sends to the
FSP. If a log is not logged within the allotted time period (e.g. if the
FSP is down), OPAL sends those logs to POWERNV.
Step 3.2 BMC:
::
.elog_commit = ipmi_elog_commit
On BMC machines, error logs are first converted to eSEL format, i.e.:

::

    eSEL = SEL header + PEL data

The SEL header contains the following fields.

::

    struct sel_header {
        uint16_t id;
        uint8_t record_type;
        uint32_t timestamp;
        uint16_t genid;
        uint8_t evmrev;
        uint8_t sensor_type;
        uint8_t sensor_num;
        uint8_t dir_type;
        uint8_t signature;
        uint8_t reserved[2];
    };
After filling in the SEL header fields, OPAL copies the error log
PEL data after the header section in the error log buffer. The eSEL is
then logged to the BMC using the IPMI interface.
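
The "SEL header + PEL data" layout described above can be sketched as a
simple buffer concatenation. This is an illustration only: ``build_esel()``
is a hypothetical helper, and the ``packed`` attribute (GCC/Clang syntax) is
an assumption made here so that the header occupies exactly 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SEL header fields as listed above; packing is an assumption. */
struct sel_header {
	uint16_t id;
	uint8_t record_type;
	uint32_t timestamp;
	uint16_t genid;
	uint8_t evmrev;
	uint8_t sensor_type;
	uint8_t sensor_num;
	uint8_t dir_type;
	uint8_t signature;
	uint8_t reserved[2];
} __attribute__((packed));

/*
 * Hypothetical helper: write the SEL header followed by the PEL data
 * into buf. Returns the total eSEL length, or 0 if buf is too small.
 */
static size_t build_esel(uint8_t *buf, size_t buf_len,
			 const struct sel_header *hdr,
			 const uint8_t *pel, size_t pel_len)
{
	if (buf_len < sizeof(*hdr) || pel_len > buf_len - sizeof(*hdr))
		return 0;

	memcpy(buf, hdr, sizeof(*hdr));            /* SEL header first */
	memcpy(buf + sizeof(*hdr), pel, pel_len);  /* PEL data right after */
	return sizeof(*hdr) + pel_len;
}
```

The size check comes before any copy, so an undersized buffer is rejected
without being partially written.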
If the user does not intend to dump various user data sections, but just
wants to log the error with a short description, they can do so using the
simple error logging interface. ::
log_simple_error(uint32_t reason_code, char *fmt, ...);
For example: ::
log_simple_error(OPAL_RC_SURVE_STATUS,
"SURV: Error retrieving surveillance status: %d\n",
err_len);
Using the reason code, an error log is generated with the information derived
from the look-up table, populated, and committed to the service processor. All
of this is done with a single call.
Error logging retrieval from FSP:
=================================
The FSP sends error log notifications to OPAL via the mailbox protocol.

OPAL maintains the lists below:

* Free list: List of free nodes.
* Pending list: List of nodes which are yet to be read by POWERNV.
* Processed list: List of nodes which have been read but are still waiting
  for acknowledgement.
Below is the structure of the node: ::

    struct fsp_log_entry {
        uint32_t log_id;
        size_t log_size;
        struct list_node link;
    };

OPAL maintains a state machine which has the following states. ::

    enum elog_head_state {
        ELOG_STATE_FETCHING,     /* In the process of reading log from FSP */
        ELOG_STATE_FETCHED_INFO, /* Indicates reading log info is completed */
        ELOG_STATE_FETCHED_DATA, /* Indicates reading log is completed */
        ELOG_STATE_HOST_INFO,    /* Host read log info */
        ELOG_STATE_NONE,         /* Indicates to fetch next log */
        ELOG_STATE_REJECTED,     /* Resend all pending logs to Linux */
    };
Initially, the state machine is in ``ELOG_STATE_NONE``. When OPAL gets
a notification about an error log, it takes a node from the free list,
puts it on the pending list, and moves the state machine to the fetching
state (``ELOG_STATE_FETCHING``). It also sends a response back to the FSP
for the received error log notification.

It then queues a mailbox message to fetch the error log data into the OPAL
error log buffer. Once that is done, the state machine enters the fetched
state (``ELOG_STATE_FETCHED_DATA``). After that, OPAL notifies the POWERNV
host to fetch the new error log.

POWERNV uses the OPAL interface to first get the error log info (elogid,
elog_size, elog_type), and then reads the error log data into its buffer,
which moves the pending error log to the processed list. After the read,
the state machine moves back to ``ELOG_STATE_NONE``.

POWERNV acknowledges the error log ID after reading the error log data by
making a call to OPAL, which in turn sends the acknowledgement mbox message
to the FSP and moves the error log ID from the processed list back to the
free list. This process repeats for every FSP error log.
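
The movement of a node between the three lists can be modelled with a small
stand-alone sketch. It is not the OPAL code: every ``sketch_*`` name is
hypothetical, the lists are simulated with a per-node marker instead of
``struct list_node``, and only the states named in the walk-through above are
modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 4 /* small stand-in for the real node count */

enum sketch_list { ON_FREE, ON_PENDING, ON_PROCESSED };
enum sketch_state { ST_NONE, ST_FETCHING, ST_FETCHED_DATA };

struct sketch_node {
	uint32_t log_id;
	enum sketch_list where; /* which list the node is on */
};

struct sketch_ctx {
	struct sketch_node nodes[SKETCH_MAX_NODES];
	enum sketch_state state;
};

static void sketch_init(struct sketch_ctx *c)
{
	for (size_t i = 0; i < SKETCH_MAX_NODES; i++) {
		c->nodes[i].log_id = 0;
		c->nodes[i].where = ON_FREE;
	}
	c->state = ST_NONE;
}

/* Count the nodes currently on one list. */
static size_t sketch_count(const struct sketch_ctx *c, enum sketch_list where)
{
	size_t n = 0;

	for (size_t i = 0; i < SKETCH_MAX_NODES; i++)
		if (c->nodes[i].where == where)
			n++;
	return n;
}

/* Move the node holding log_id from one list to another; -1 if absent. */
static int sketch_move(struct sketch_ctx *c, uint32_t log_id,
		       enum sketch_list from, enum sketch_list to)
{
	for (size_t i = 0; i < SKETCH_MAX_NODES; i++) {
		if (c->nodes[i].where == from && c->nodes[i].log_id == log_id) {
			c->nodes[i].where = to;
			return 0;
		}
	}
	return -1;
}

/* FSP notification: claim a free node and queue it as pending. */
static int sketch_notify(struct sketch_ctx *c, uint32_t log_id)
{
	for (size_t i = 0; i < SKETCH_MAX_NODES; i++) {
		if (c->nodes[i].where == ON_FREE) {
			c->nodes[i].log_id = log_id;
			c->nodes[i].where = ON_PENDING;
			c->state = ST_FETCHING;
			return 0;
		}
	}
	return -1; /* no free node left */
}

/* The log data has arrived in the OPAL error log buffer. */
static void sketch_fetch_done(struct sketch_ctx *c)
{
	c->state = ST_FETCHED_DATA;
}

/* Host read: pending -> processed, state machine back to NONE. */
static int sketch_host_read(struct sketch_ctx *c, uint32_t log_id)
{
	if (sketch_move(c, log_id, ON_PENDING, ON_PROCESSED) != 0)
		return -1;
	c->state = ST_NONE;
	return 0;
}

/* Host acknowledgement: processed -> free. */
static int sketch_host_ack(struct sketch_ctx *c, uint32_t log_id)
{
	return sketch_move(c, log_id, ON_PROCESSED, ON_FREE);
}
```

Calling ``sketch_notify()``, ``sketch_fetch_done()``, ``sketch_host_read()``
and ``sketch_host_ack()`` in that order for one log ID returns its node to
the free list, mirroring the cycle described above.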
Design constraints:
-------------------
::
#define ELOG_READ_MAX_RECORD 128
Currently, the number of error logs from the FSP that OPAL can hold is limited
to 128. If OPAL runs out of free nodes in the list for a new error log, it
sends a 'Discarded by OPAL' message to the FSP. It is then up to the FSP to
notify OPAL about the discarded error log again at some point in the future.
::
#define ELOG_WRITE_MAX_RECORD 64
There is also a limit of 64 on the number of OPAL error logs that OPAL can
hold. If it runs out of buffers in the pool, it logs a message saying
'Failed to get the buffer'.
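
The effect of a fixed-size pool can be illustrated with the sketch below.
It is a simplified model rather than the OPAL pool implementation:
``sketch_buf``, ``sketch_get_buf()`` and ``sketch_put_buf()`` are
hypothetical names, and a real buffer would carry the error log contents.

```c
#include <stdbool.h>
#include <stddef.h>

#define ELOG_WRITE_MAX_RECORD 64

/* Hypothetical stand-in for an error log buffer. */
struct sketch_buf {
	bool in_use;
};

static struct sketch_buf sketch_pool[ELOG_WRITE_MAX_RECORD];

/* Hand out a free buffer, or NULL once all 64 are in use. */
static struct sketch_buf *sketch_get_buf(void)
{
	for (size_t i = 0; i < ELOG_WRITE_MAX_RECORD; i++) {
		if (!sketch_pool[i].in_use) {
			sketch_pool[i].in_use = true;
			return &sketch_pool[i];
		}
	}
	return NULL; /* caller would report 'Failed to get the buffer' */
}

/* Return a buffer to the pool so that it can be reused. */
static void sketch_put_buf(struct sketch_buf *buf)
{
	if (buf)
		buf->in_use = false;
}
```

A caller has to handle the ``NULL`` return, in the same way the sample code
in this document checks the result of ``opal_elog_create()``.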
Note
----
* For more information regarding error logging and the PEL format,
  refer to the PAPR doc and the P7 PEL and SRC PLDD document.
* Refer to ``include/errorlog.h`` for all the error logging
  interface parameters and ``include/pel.h`` for PEL
  structures.
Sample error logging
--------------------
::
DEFINE_LOG_ENTRY(OPAL_RC_ATTN, OPAL_PLATFORM_ERR_EVT, OPAL_ATTN,
OPAL_PLATFORM_FIRMWARE, OPAL_PREDICTIVE_ERR_GENERAL,
OPAL_NA);
void report_error(int index)
{
struct errorlog *buf;
char data1[] = "This is a sample user defined data section1";
char data2[] = "Error logging sample. These are dummy errors. Section 2";
char data3[] = "Sample error Sample error Sample error Sample error \
Sample error abcdefghijklmnopqrstuvwxyz";
int tag;
printf("ELOG: In machine check report error index: %d\n", index);
/* To report an error, create an error log with the relevant information
 * using opal_elog_create(). The call returns a pre-allocated buffer of
 * type 'struct errorlog' with the relevant fields updated.
 */
/* tag -> unique ascii tag to identify a particular data dump section */
tag = 0x4b4b4b4b;
buf = opal_elog_create(&e_info(OPAL_RC_ATTN), tag);
if (buf == NULL) {
printf("ELOG: Error getting buffer.\n");
return;
}
/* Append data or text with log_append_data() or log_append_msg() */
log_append_data(buf, data1, sizeof(data1));
/* In case of user wanting to add multiple sections of various dump data
* for better debug, data sections can be added using this interface
* void log_add_section(struct errorlog *buf, uint32_t tag);
*/
tag = 0x4c4c4c4c;
log_add_section(buf, tag);
log_append_data(buf, data2, sizeof(data2));
log_append_data(buf, data3, sizeof(data3));
/* Once all info is updated, ready to be sent to FSP */
printf("ELOG:commit to FSP\n");
log_commit(buf);
}
Sample PEL dump output obtained from FSP
----------------------------------------
::
$ errl -d -x 0x533C9B37
| 00000000 50480030 01004154 20150728 02000500 PH.0..AT ..(.... |
| 00000010 20150728 02000566 4B000107 00000000 ..(...fK....... |
| 00000020 00000000 00000000 B0000002 533C9B37 ............S..7 |
| 00000030 55480018 01004154 80002000 00000000 UH....AT.. ..... |
| 00000040 00002000 01005300 50530050 01004154 .. ...S.PS.P..AT |
| 00000050 02000008 00000048 00000080 00000000 .......H........ |
| 00000060 00000000 00000000 00000000 00000000 ................ |
| 00000070 00000000 00000000 42423832 31343130 ........BB821410 |
| 00000080 20202020 20202020 20202020 20202020 |
| 00000090 20202020 20202020 4548004C 01004154 EH.L..AT |
| 000000A0 38323836 2D343241 31303738 34415400 8286-42A10784AT. |
| 000000B0 00000000 00000000 00000000 00000000 ................ |
| 000000C0 00000000 00000000 00000000 00000000 ................ |
| 000000D0 00000000 00000000 20150728 02000500 ........ ..(.... |
| 000000E0 00000000 4D54001C 01004154 38323836 ....MT....AT8286 |
| 000000F0 2D343241 31303738 34415400 00000000 -42A10784AT..... |
| 00000100 5544003C 01004154 4B4B4B4B 00340000 UD....ATKKKK.4.. |
| 00000110 54686973 20697320 61207361 6D706C65 This is a sample |
| 00000120 20757365 72206465 66696E65 64206461 user defined da |
| 00000130 74612073 65637469 6F6E3100 554400A7 ta section1.UD.. |
| 00000140 01004154 4C4C4C4C 009F0000 4572726F ..ATLLLL....Erro |
| 00000150 72206C6F 6767696E 67207361 6D706C65 r logging sample |
| 00000160 2E205468 65736520 61726520 64756D6D . These are dumm |
| 00000170 79206572 726F7273 2E205365 6374696F y errors. Sectio |
| 00000180 6E203200 53616D70 6C652065 72726F72 n 2.Sample error |
| 00000190 2053616D 706C6520 6572726F 72205361 Sample error Sa |
| 000001A0 6D706C65 20657272 6F722053 616D706C mple error Sampl |
| 000001B0 65206572 726F7220 09090953 616D706C e error ...Sampl |
| 000001C0 65206572 726F7220 61626364 65666768 e error abcdefgh |
| 000001D0 696A6B6C 6D6E6F70 71727374 75767778 ijklmnopqrstuvwx |
| 000001E0 797A00 yz. |
|------------------------------------------------------------------------------|
| Platform Event Log - 0x533C9B37 |
|------------------------------------------------------------------------------|
| Private Header |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| Created at : 07/28/2015 02:00:05 |
| Committed at : 07/28/2015 02:00:05 |
| Creator Subsystem : OPAL |
| CSSVER : |
| Platform Log Id : 0xB0000002 |
| Entry Id : 0x533C9B37 |
| Total Log Size : 483 |
|------------------------------------------------------------------------------|
| User Header |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Log Committed by : 4154 |
| Subsystem : Platform Firmware |
| Event Scope : Unknown - 0x00000000 |
| Event Severity : Predictive Error |
| Event Type : Not Applicable |
| Return Code : 0x00000000 |
| Action Flags : Report Externally |
| Action Status : Sent to Hypervisor |
|------------------------------------------------------------------------------|
| Primary System Reference Code |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| SRC Format : 0x80 |
| SRC Version : 0x02 |
| Virtual Progress SRC : False |
| I5/OS Service Event Bit : False |
| Hypervisor Dump Initiated: False |
| Power Control Net Fault : False |
| |
| Valid Word Count : 0x08 |
| Reference Code : BB821410 |
| Hex Words 2 - 5 : 00000080 00000000 00000000 00000000 |
| Hex Words 6 - 9 : 00000000 00000000 00000000 00000000 |
| |
|------------------------------------------------------------------------------|
| Extended User Header |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| Reporting Machine Type : 8286-42A |
| Reporting Serial Number : 10784AT |
| FW Released Ver : |
| FW SubSys Version : |
| Common Ref Time : 07/28/2015 02:00:05 |
| Symptom Id Len : 0 |
| Symptom Id : |
|------------------------------------------------------------------------------|
| Machine Type/Model & Serial Number |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| Machine Type Model : 8286-42A |
| Serial Number : 10784AT |
|------------------------------------------------------------------------------|
| User Defined Data |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| |
| 00000000 4B4B4B4B 00340000 54686973 20697320 KKKK.4..This is |
| 00000010 61207361 6D706C65 20757365 72206465 a sample user de |
| 00000020 66696E65 64206461 74612073 65637469 fined data secti |
| 00000030 6F6E3100 on1. |
| |
|------------------------------------------------------------------------------|
| User Defined Data |
|------------------------------------------------------------------------------|
| Section Version : 1 |
| Sub-section type : 0 |
| Created by : 4154 |
| |
| 00000000 4C4C4C4C 009F0000 4572726F 72206C6F LLLL....Error lo |
| 00000010 6767696E 67207361 6D706C65 2E205468 gging sample. Th |
| 00000020 65736520 61726520 64756D6D 79206572 ese are dummy er |
| 00000030 726F7273 2E205365 6374696F 6E203200 rors. Section 2. |
| 00000040 53616D70 6C652065 72726F72 2053616D Sample error Sam |
| 00000050 706C6520 6572726F 72205361 6D706C65 ple error Sample |
| 00000060 20657272 6F722053 616D706C 65206572 error Sample er |
| 00000070 726F7220 09090953 616D706C 65206572 ror ...Sample er |
| 00000080 726F7220 61626364 65666768 696A6B6C ror abcdefghijkl |
| 00000090 6D6E6F70 71727374 75767778 797A00 mnopqrstuvwxyz. |
| |
|------------------------------------------------------------------------------|
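
Each section in the raw dump above starts with the same 8 bytes, for example
``5544003C 01004154`` for the first User Defined Data section. The sketch
below decodes those bytes. The field layout is inferred from this dump and
its decoded output rather than taken from ``include/pel.h``, and the
``sketch_*`` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of the 8-byte header that starts each section in the dump. */
struct sketch_section_hdr {
	uint16_t id;      /* two ASCII characters, e.g. "UD" = 0x5544 */
	uint16_t length;  /* section length in bytes, header included */
	uint8_t version;  /* shown as "Section Version" */
	uint8_t subtype;  /* shown as "Sub-section type" */
	uint16_t creator; /* shown as "Created by", 0x4154 in the dump */
};

/* Read a big-endian 16-bit value. */
static uint16_t sketch_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode one section header; returns -1 if fewer than 8 bytes remain. */
static int sketch_parse_hdr(const uint8_t *p, size_t len,
			    struct sketch_section_hdr *h)
{
	if (len < 8)
		return -1;

	h->id = sketch_be16(p);
	h->length = sketch_be16(p + 2);
	h->version = p[4];
	h->subtype = p[5];
	h->creator = sketch_be16(p + 6);
	return 0;
}
```

Because the length includes the header, adding it to a section's offset gives
the offset of the next section: the first User Defined Data section starts at
``0x100`` with length ``0x3C``, and the second one starts at ``0x13C``.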