Documentation/driver-api/libata.rst +1 −1

@@ -401,7 +401,7 @@ Error handling
 ==============

 This chapter describes how errors are handled under libata. Readers are
-advised to read SCSI EH (Documentation/scsi/scsi_eh.txt) and ATA
+advised to read SCSI EH (Documentation/scsi/scsi_eh.rst) and ATA
 exceptions doc first.

 Origins of commands

Documentation/scsi/index.rst +1 −0

@@ -32,5 +32,6 @@ Linux SCSI Subsystem
    ppa
    qlogicfas
    scsi-changer
+   scsi_eh

    scsi_transport_srp/figures

Documentation/scsi/scsi_eh.txt→Documentation/scsi/scsi_eh.rst +124 −87

+.. SPDX-License-Identifier: GPL-2.0

+=======
 SCSI EH
-======================================
+=======

 This document describes SCSI midlayer error handling infrastructure.
 Please refer to Documentation/scsi/scsi_mid_low_api.txt for more
 information regarding SCSI midlayer.

-TABLE OF CONTENTS
+.. TABLE OF CONTENTS

 [1] How SCSI commands travel through the midlayer and to EH
     [1-1] struct scsi_cmnd

@@ -25,9 +27,11 @@ TABLE OF CONTENTS
         [2-2-3] Things to consider


-[1] How SCSI commands travel through the midlayer and to EH
+1. How SCSI commands travel through the midlayer and to EH
+==========================================================

-[1-1] struct scsi_cmnd
+1.1 struct scsi_cmnd
+--------------------

 Each SCSI command is represented with struct scsi_cmnd (== scmd).  A
 scmd has two list_head's to link itself into lists.  The two are

@@ -38,14 +42,16 @@ otherwise stated scmds are always linked using scmd->eh_entry in this
 discussion.


-[1-2] How do scmd's get completed?
+1.2 How do scmd's get completed?
+--------------------------------

 Once LLDD gets hold of a scmd, either the LLDD will complete the
 command by calling scsi_done callback passed from midlayer when
 invoking hostt->queuecommand() or the block layer will time it out.


-[1-2-1] Completing a scmd w/ scsi_done
+1.2.1 Completing a scmd w/ scsi_done
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 For all non-EH commands, scsi_done() is the completion callback.  It
 just calls blk_complete_request() to delete the block layer timer and

@@ -57,6 +63,7 @@ looks at the scmd->result value and sense data to determine what to do
 with the command.

  - SUCCESS
+
        scsi_finish_command() is invoked for the command.  The
        function does some maintenance chores and then calls
        scsi_io_completion() to finish the I/O.

@@ -66,15 +73,19 @@ with the command.
        of the data in case of an error.

  - NEEDS_RETRY
+
  - ADD_TO_MLQUEUE
+
        scmd is requeued to blk queue.

  - otherwise
+
        scsi_eh_scmd_add(scmd) is invoked for the command.  See
        [1-3] for details of this function.


-[1-2-2] Completing a scmd w/ timeout
+1.2.2 Completing a scmd w/ timeout
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The timeout handler is scsi_times_out().  When a timeout occurs, this
 function

@@ -101,16 +112,19 @@ function
  3. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the
     command.  See [1-4] for more information.

-[1-3] Asynchronous command aborts
+1.3 Asynchronous command aborts
+-------------------------------

  After a timeout occurs a command abort is scheduled from
  scsi_abort_command(). If the abort is successful the command
  will either be retried (if the number of retries is not exhausted)
  or terminated with DID_TIME_OUT.
+
  Otherwise scsi_eh_scmd_add() is invoked for the command.
  See [1-4] for more information.

-[1-4] How EH takes over
+1.4 How EH takes over
+---------------------

 scmds enter EH via scsi_eh_scmd_add(), which does the following.


@@ -147,7 +161,8 @@ timer has already expired.
 forget about - timed out scmds later.


-[2] How SCSI EH works
+2. How SCSI EH works
+====================

 LLDD's can implement SCSI EH actions in one of the following two
 ways.

@@ -177,9 +192,11 @@ calling scsi_restart_operations(), which
  4. Kicks queues in all devices on the host in the asses


-[2-1] EH through fine-grained callbacks
+2.1 EH through fine-grained callbacks
+-------------------------------------

-[2-1-1] Overview
+2.1.1 Overview
+^^^^^^^^^^^^^^

 If eh_strategy_handler() is not present, SCSI midlayer takes charge
 of driving error handling.  EH's goals are two - make LLDD, host and

@@ -194,6 +211,8 @@ others are performed by invoking one of the following fine-grained
 hostt EH callbacks.  Callbacks may be omitted and omitted ones are
 considered to fail always.

+::
+
     int (* eh_abort_handler)(struct scsi_cmnd *);
     int (* eh_device_reset_handler)(struct scsi_cmnd *);
     int (* eh_bus_reset_handler)(struct scsi_cmnd *);

@@ -232,47 +251,61 @@ EH), REQ_FAILFAST is not set and ++scmd->retries is less than
 scmd->allowed.


-[2-1-2] Flow of scmds through EH
+2.1.2 Flow of scmds through EH
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  1. Error completion / time out
-    ACTION: scsi_eh_scmd_add() is invoked for scmd
+
+    :ACTION: scsi_eh_scmd_add() is invoked for scmd
+
        - add scmd to shost->eh_cmd_q
        - set SHOST_RECOVERY
        - shost->host_failed++
-    LOCKING: shost->host_lock
+
+    :LOCKING: shost->host_lock

  2. EH starts
-    ACTION: move all scmds to EH's local eh_work_q.  shost->eh_cmd_q
+
+    :ACTION: move all scmds to EH's local eh_work_q.  shost->eh_cmd_q
             is cleared.
-    LOCKING: shost->host_lock (not strictly necessary, just for
+
+    :LOCKING: shost->host_lock (not strictly necessary, just for
              consistency)

  3. scmd recovered
-    ACTION: scsi_eh_finish_cmd() is invoked to EH-finish scmd
+
+    :ACTION: scsi_eh_finish_cmd() is invoked to EH-finish scmd
+
        - scsi_setup_cmd_retry()
        - move from local eh_work_q to local eh_done_q
-    LOCKING: none
-    CONCURRENCY: at most one thread per separate eh_work_q to
+
+    :LOCKING: none
+
+    :CONCURRENCY: at most one thread per separate eh_work_q to
                  keep queue manipulation lockless

  4. EH completes
-    ACTION: scsi_eh_flush_done_q() retries scmds or notifies upper
+
+    :ACTION: scsi_eh_flush_done_q() retries scmds or notifies upper
             layer of failure. May be called concurrently but must have
             a no more than one thread per separate eh_work_q to
             manipulate the queue locklessly
+
        - scmd is removed from eh_done_q and scmd->eh_entry is cleared
        - if retry is necessary, scmd is requeued using
          scsi_queue_insert()
        - otherwise, scsi_finish_command() is invoked for scmd
        - zero shost->host_failed
-    LOCKING: queue or finish function performs appropriate locking
+
+    :LOCKING: queue or finish function performs appropriate locking


-[2-1-3] Flow of control
+2.1.3 Flow of control
+^^^^^^^^^^^^^^^^^^^^^^

  EH through fine-grained callbacks start from scsi_unjam_host().

-<<scsi_unjam_host>>
+``scsi_unjam_host``

     1. Lock shost->host_lock, splice_init shost->eh_cmd_q into local
        eh_work_q and unlock host_lock.  Note that shost->eh_cmd_q is

@@ -280,7 +313,7 @@ scmd->allowed.

     2. Invoke scsi_eh_get_sense.

-    <<scsi_eh_get_sense>>
+    ``scsi_eh_get_sense``

        This action is taken for each error-completed
        (!SCSI_EH_CANCEL_CMD) commands without valid sense data.  Most

@@ -315,7 +348,7 @@ scmd->allowed.

     3. If !list_empty(&eh_work_q), invoke scsi_eh_abort_cmds().

-    <<scsi_eh_abort_cmds>>
+    ``scsi_eh_abort_cmds``

        This action is taken for each timed out command when
        no_async_abort is enabled in the host template.

@@ -339,14 +372,14 @@ scmd->allowed.

     4. If !list_empty(&eh_work_q), invoke scsi_eh_ready_devs()

-    <<scsi_eh_ready_devs>>
+    ``scsi_eh_ready_devs``

        This function takes four increasingly more severe measures to
        make failed sdevs ready for new commands.

        1. Invoke scsi_eh_stu()

-       <<scsi_eh_stu>>
+       ``scsi_eh_stu``

            For each sdev which has failed scmds with valid sense data
            of which scsi_check_sense()'s verdict is FAILED,

@@ -369,7 +402,7 @@ scmd->allowed.

        2. If !list_empty(&eh_work_q), invoke scsi_eh_bus_device_reset().

-       <<scsi_eh_bus_device_reset>>
+       ``scsi_eh_bus_device_reset``

            This action is very similar to scsi_eh_stu() except that,
            instead of issuing STU, hostt->eh_device_reset_handler()

@@ -379,7 +412,7 @@ scmd->allowed.

        3. If !list_empty(&eh_work_q), invoke scsi_eh_bus_reset()

-       <<scsi_eh_bus_reset>>
+       ``scsi_eh_bus_reset``

            hostt->eh_bus_reset_handler() is invoked for each channel
            with failed scmds.  If bus reset succeeds, all failed

@@ -388,7 +421,7 @@ scmd->allowed.

        4. If !list_empty(&eh_work_q), invoke scsi_eh_host_reset()

-       <<scsi_eh_host_reset>>
+       ``scsi_eh_host_reset``

            This is the last resort.  hostt->eh_host_reset_handler()
            is invoked.  If host reset succeeds, all failed scmds on

@@ -396,14 +429,14 @@ scmd->allowed.

        5. If !list_empty(&eh_work_q), invoke scsi_eh_offline_sdevs()

-       <<scsi_eh_offline_sdevs>>
+       ``scsi_eh_offline_sdevs``

            Take all sdevs which still have unrecovered scmds offline
            and EH-finish the scmds.

     5. Invoke scsi_eh_flush_done_q().

-    <<scsi_eh_flush_done_q>>
+    ``scsi_eh_flush_done_q``

        At this point all scmds are recovered (or given up) and
        put on eh_done_q by scsi_eh_finish_cmd().  This function

@@ -411,7 +444,8 @@ scmd->allowed.
        layer of failure of the scmds.


-[2-2] EH through transportt->eh_strategy_handler()
+2.2 EH through transportt->eh_strategy_handler()
+------------------------------------------------

 transportt->eh_strategy_handler() is invoked in the place of
 scsi_unjam_host() and it is responsible for whole recovery process.

@@ -422,7 +456,8 @@ SCSI midlayer.  IOW, of the steps described in [2-1-2], all steps
 except for #1 must be implemented by eh_strategy_handler().


-[2-2-1] Pre transportt->eh_strategy_handler() SCSI midlayer conditions
+2.2.1 Pre transportt->eh_strategy_handler() SCSI midlayer conditions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  The following conditions are true on entry to the handler.


@@ -435,7 +470,8 @@ except for #1 must be implemented by eh_strategy_handler().
  - shost->host_failed == shost->host_busy


-[2-2-2] Post transportt->eh_strategy_handler() SCSI midlayer conditions
+2.2.2 Post transportt->eh_strategy_handler() SCSI midlayer conditions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  The following conditions must be true on exit from the handler.


@@ -453,7 +489,8 @@ except for #1 must be implemented by eh_strategy_handler().
    ->allowed to limit the number of retries.


-[2-2-3] Things to consider
+2.2.3 Things to consider
+^^^^^^^^^^^^^^^^^^^^^^^^

  - Know that timed out scmds are still active on lower layers.  Make
    lower layers forget about them before doing anything else with

@@ -469,7 +506,7 @@ except for #1 must be implemented by eh_strategy_handler().
    offline.


---
 Tejun Heo
 htejun@gmail.com
+
 11th September 2005
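
As a rough illustration of the escalation described for scsi_eh_ready_devs in the patch above, the following userspace C sketch walks the reset steps from least to most severe and stops at the first one that succeeds. It is not kernel code: struct toy_cmnd, struct toy_host_template, toy_ready_dev(), the sample handlers and the TOY_* names are all hypothetical stand-ins, and the STU step is left out for brevity. The only behaviour taken from the text is the ordering of the steps and the rule that an omitted callback is considered to fail.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct scsi_cmnd. */
struct toy_cmnd {
	int id;
};

/* Verdicts modelled loosely on the SUCCESS/FAILED values named in the text. */
enum toy_result { TOY_SUCCESS, TOY_FAILED };

/* Which escalation step finally made the device usable again. */
enum toy_step {
	TOY_STEP_DEVICE_RESET,
	TOY_STEP_BUS_RESET,
	TOY_STEP_HOST_RESET,
	TOY_STEP_OFFLINE,	/* nothing worked, the device is taken offline */
};

typedef enum toy_result (*toy_eh_fn)(struct toy_cmnd *);

/* Hypothetical subset of a host template: one callback per reset level. */
struct toy_host_template {
	toy_eh_fn eh_device_reset_handler;
	toy_eh_fn eh_bus_reset_handler;
	toy_eh_fn eh_host_reset_handler;
};

/* An omitted (NULL) callback is treated as if it had failed. */
static enum toy_result toy_try(toy_eh_fn fn, struct toy_cmnd *cmd)
{
	return fn ? fn(cmd) : TOY_FAILED;
}

/* Try each step in order of severity; give up and go offline if all fail. */
static enum toy_step toy_ready_dev(const struct toy_host_template *t,
				   struct toy_cmnd *cmd)
{
	if (toy_try(t->eh_device_reset_handler, cmd) == TOY_SUCCESS)
		return TOY_STEP_DEVICE_RESET;
	if (toy_try(t->eh_bus_reset_handler, cmd) == TOY_SUCCESS)
		return TOY_STEP_BUS_RESET;
	if (toy_try(t->eh_host_reset_handler, cmd) == TOY_SUCCESS)
		return TOY_STEP_HOST_RESET;
	return TOY_STEP_OFFLINE;
}

/* Sample handlers for experimenting: one always succeeds, one always fails. */
static enum toy_result toy_reset_ok(struct toy_cmnd *cmd)
{
	(void)cmd;
	return TOY_SUCCESS;
}

static enum toy_result toy_reset_fail(struct toy_cmnd *cmd)
{
	(void)cmd;
	return TOY_FAILED;
}
```

With a template that only sets eh_bus_reset_handler to toy_reset_ok, toy_ready_dev() skips past the missing device reset and returns TOY_STEP_BUS_RESET; with no handlers at all it falls through to TOY_STEP_OFFLINE.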