Benutzer:Rdiez/ErrorHandling: Difference between revisions

  
 
Proper error-handling logic is what sets professional developers apart. Writing quality error handlers requires continuous discipline during development, because it is a tedious task that can easily cost more than the normal application logic for the sunny-day scenario that the developer is paid to write. Testing error paths manually with the debugger is recommended practice, but that does not make it any less time-consuming. Repeatable test cases that feed the code invalid data sequences in order to trigger and test each possible error scenario are a rare luxury. This is why error handling in general needs constant encouragement through systematic code reviews or through separate testing personnel. In my experience, a lack of good error handling is also a symptom that the code has not been properly developed and tested. A quick look at the error handlers in the source code can give you a pretty reliable measurement of the general code quality.
The fact that most example code snippets in software documentation do not bother checking for error conditions, let alone handling them gracefully, does not help either. It gives the impression that even the official developers do not take error handling seriously, so you don't need to either. Sometimes you'll find the excuse of keeping those examples free from "clutter". However, when using a new syscall or library call, working out how to check for errors and how to collect the corresponding error information can take longer than coding its normal usage scenario, so this aspect would actually be the most helpful part of a usage example. There is some hope, however: I have noticed that some API usage examples in Microsoft's documentation now include error checks with "handle the error here" comments below. While that is still not enough, it is better than nothing.
  
 
It is hard to assess how much value robust error handling brings to the end product, and therefore any extra development costs in this field are hard to justify. Free-software developers are often investing their own spare time and frequently take shortcuts in this area. Software contracts are usually drafted in positive terms describing what the software should do, and robustness in the face of errors then gets relegated to some implied general quality standards that are not properly described or quantified. Furthermore, when a customer tests a software product for acceptance, he is primarily worried about fulfilling the contractual obligations in the "normal", non-error case, and that tends to be hard enough. That the software is brittle either goes unnoticed or is not properly rated in the software bug list.
 
In addition to all of the above, when the error-handling logic does fail, or when it does not yield helpful information for troubleshooting purposes, it tends to impact first and foremost the users' budget, and not the developer's, and that normally happens after the delivery and payment dates. Even if the error does come back to the original developer, it may find its way through a separate support department, which may even be able to provide a work-around and further justify the business case for that same support department. If nothing else helps, the developer's urgent help is then suddenly required for a real-world, important business problem, which may help make that original developer a well-regarded, irreplaceable person. After all, only the original person understands the code well enough to figure out what went wrong, and any newcomers will shy away from making any changes to a brittle codebase. This scenario can also hold true in open-source communities, where social credit from quickly fixing bugs may be more relevant than introducing those bugs in the first place. All these factors conspire to make poor error handling an attractive business strategy.
  
In the end, error handling gets mostly neglected, and that shows in our day-to-day experience with computer software. I have seen plenty of jokes about unhelpful or funny error messages. Many security issues have their roots in incorrect error detection or handling, and such issues are still being patched every week for operating system releases that have been considered stable for years.
  
== General Strategy ==
  
Robust error handling is costly, but it is an important aspect of software development. Choosing a good strategy from the beginning reduces costs in the long run. These are the main goals:
  
 
# Provide helpful error messages.
 
#* adding error checks to the source code.
 
#* repurposing existing code.
Non-goals are:

# Improve error tolerance. <br/> Normally, when an error occurs, the operation that caused it is considered to have failed. This document does not deal with error tolerance at all.
# Optimise error-handling performance. <br/> In normal scenarios, only the successful (non-error) paths need to be fast. This may not hold true on critical, real-time systems, where the error response time needs to meet certain constraints.
# Optimise memory consumption. <br/> Good error messages and proper error handling come at a cost, but the investment almost always pays off.
  
 
=== Compromises ===
  
Writing good error-handling logic can be costly, and sometimes compromises must be made:
  
 
==== Unpleasant Error Messages ====
 
Besides, a complete crash will trigger any installed emergency mechanisms, like automatically restarting the failed service/daemon and promptly alerting your system administrators. Such a recovery course may be better than any unpredictable behaviour further down the line due to a previous error handled incorrectly.
  
==== Do Not Install Your Own Critical Error Handler ====

Some people are tempted to write clever unexpected-error handlers to help deal with panics, or even to avoid them completely. However, it is usually better to focus on the emergency recovery procedures after the crash rather than installing your own crash handler in an attempt to capture more error information or to survive the unknown error condition.

Your operating system will probably do a better job of collecting crash information; you may just need to enable its crash-reporting features. You don't want to interfere with that process, because, if your memory contents or window handles are already corrupt or invalid, trying to run your crash handler may make matters worse and corrupt or even mask the original crash report altogether.

Getting an in-application crash handler right is hard, if not downright impossible, and I've seen quite a few of them fail after the first application failure they were supposed to report. If you have time to spare on fatal error scenarios, try to minimise their consequences by saving the user data at regular intervals before the crash (like some text editors or word processors do) or by configuring the system to automatically restart any important service when it crashes. You could also direct your remaining efforts at improving your software quality process instead.
  
 
== How to Generate Helpful Error Messages ==
 
     }
  }
== Why You Should Use Exceptions ==

The exception mechanism is the best way to write general error-handling logic. After all, it was designed specifically for that purpose. Even though the C++ language shows some weaknesses (lack of a ''finally'' clause, the need for several helper routines), the exception-enabled code example above shows a clear improvement. However, there are surprisingly many opponents, especially in the context of the C++ programming language. While I don't share most of the critique, there are still issues with some compilers and some C++ runtime libraries, even as late as the year 2013.
Modern applications and software frameworks tend to rely on C++ exceptions for error handling, and it is impractical to ignore C++ exceptions nowadays. The C++ Standard Template Library (STL), Microsoft's ATL and MFC are prominent examples. Just by using them, you need to cater for any exceptions they might throw.

In the case of the STL, the designers seem to have put a lot of effort into avoiding making C++ exceptions a hard requirement in practice. The most common STL implementations still work even if C++ exception support is turned off when compiling. As long as you use the STL containers carefully and do not trigger any errors, everything will work fine. You don't miss exceptions very much because the STL has always favoured speed over robustness and does little error checking except for some debug-only assertions. Using an invalid iterator, for example, leads quickly to fatal crashes. However, the official documentation still states that the library reports errors by throwing exceptions. In fact, an out-of-memory condition when inserting elements into an STL container can only be reported by throwing a C++ exception, as the library has no other way to return such an error indication. Without C++ exception support, a release build will fail or even crash without warning if that happens. However, out-of-memory conditions are often ignored altogether during development, so this particular shortcoming does not show up in practice either.

Exceptions are prevalent outside the C++ world: Java, JavaScript, C#, Objective-C, Perl and Emacs Lisp, for example, use exceptions for error-handling purposes. And the list [http://en.wikipedia.org/wiki/Exception_handling_syntax goes on].

Even plain C has a similar [http://en.wikipedia.org/wiki/Setjmp/longjmp setjmp/longjmp] mechanism. The need to quickly unwind the call stack on an error condition is a very old idea indeed.
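As a minimal sketch of that plain-C mechanism (the routine names parse_value() and process() are made up for illustration), longjmp() transfers control straight back to the recovery point without returning through the intermediate callers. Note that, unlike a C++ exception, longjmp() runs no destructors on the way up, which is one reason the C++ mechanism is preferable:

```cpp
#include <csetjmp>
#include <cstdio>

static std::jmp_buf error_jump;

// Deeply nested routine: on error, jump straight back to the
// recovery point instead of returning an error code through
// every intermediate caller.
static void parse_value ( int value )
{
    if ( value < 0 )
        std::longjmp( error_jump, 1 );   // roughly a "throw"
    std::printf( "parsed %d\n", value );
}

static bool process ( int value )
{
    if ( setjmp( error_jump ) != 0 )     // roughly a "catch"
        return false;                    // we land here after longjmp()
    parse_value( value );
    return true;
}
```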
=== Downsides of Using C++ Exceptions ===

==== Exceptions make the code bigger and/or slower ====

This should not be the case, and even if it is, it is almost always an issue with the current version of the compiler or its C++ runtime library. For example, in my experience, GCC generates smaller exception-handling code for the ARM platform than for the PowerPC.

But first of all, even if the code size does increase or the software does become slower, it may not matter much. Better error-handling support may be much more important.

In theory, logic that uses C++ exceptions should generate smaller code than the traditional if/else approach, because the exception-handling support is normally implemented with stack unwind tables that can be efficiently shared (commoned up) at link time.

Because source code that uses exceptions does not need to check for errors at each call (with the associated branch/jump instruction), the resulting machine code should run faster in the normal (non-error) scenario and slower when an exception occurs (as the stack unwinder is a generic, table-driven routine). This is actually an advantage, as speed is not normally important when handling error conditions.

However, code size or speed may still be an issue in severely-constrained embedded environments. Enabling C++ exceptions has an initial impact on the code size, as the stack unwinding support needs to be linked in. Compilers may also conspire against you. Let's say you are writing a bare-metal embedded application for a small microcontroller that does not use dynamic memory at all (everything is static). With GCC, turning on C++ exceptions means pulling in the malloc() library, as its C++ runtime library creates exception objects on the heap. Such a strategy may be faster on average, but it is not always welcome. The C++ specification allows for exception objects to be placed on the stack and copied around when necessary during stack unwinding. Another implementation could also use a separate, fixed-size memory area for that purpose. However, GCC offers no alternative implementation.

GCC's development is particularly sluggish in the embedded area. After years of neglect, version 4.8.0 finally gained the configuration switch ''--disable-libstdcxx-verbose'', which avoids linking in big chunks of the standard C I/O library just because you enabled C++ exception support. If you are not compiling a standard Linux application, chances are that the C++ exception tables are generated in the "old-fashioned" way, which means that the stack unwind tables have to be sorted on first touch. The first ''throw'' statement will incur a runtime penalty, and, depending on your target embedded OS, this table sorting may not be thread safe, so you may have to sort the tables on start-up, increasing the boot time.

Debug builds may get bigger when turning C++ exceptions on. The compiler normally assumes that any routine can throw an exception, so it may generate more exception-handling code than necessary. Ways to avoid this are:

# Append "''throw()''" to the function declarations in the header files. <br/> This indicates that the function will never throw an exception. Use it sparingly, or you may find it difficult to add an error check in one of those routines at a later point in time.
# Turn on global optimisation (LTO). <br/> The compiler will then be able to determine whether a function called from another module could ever throw an exception, and optimise the callers accordingly. <br/> Unfortunately, using GCC's LTO is not yet a viable option on many architectures. You may be tempted to discard LTO altogether because of the lack of debug information in LTO-optimised executables (as of GCC version 4.8).
==== Exceptions are unsafe because they tend to break existing code more easily ====

The usual argument is that, if you make a change somewhere deep down in the code, an exception might come up at an unexpected place higher up the call stack and break the existing software. However, I believe that developers are better off embracing the idea of [http://en.wikipedia.org/wiki/Defensive_programming defensive programming] from the start. With or without exceptions, errors do tend to come up at unexpected places in the end.

It is true that, if the old code handles errors with manual if() statements, adding a new error condition normally means adding extra if() statements that make the new code paths more obvious. However, when a routine gains an error return code, existing callers are often not amended to check it. Furthermore, it is unlikely that developers will review the higher software layers, or even test the new error scenario, so as to make sure that the application can handle the new error condition correctly.

More importantly, in such old code there is a strong urge to handle errors only where necessary, that is, only where error checks occur. As a result, if a piece of code was not expecting any errors from the routines it calls, and one of those routines can now report an error, the calling code will not be ready to handle it. Therefore, the developer adding an error condition deep down below may need to add a whole bunch of if() statements in many layers above in order to handle that new error condition. You need to be careful when adding such if() statements: if any new error check could trigger an early return, you need to know what resources must be cleaned up beforehand. That means inspecting a lot of older code that other developers have written. Anything that breaks further up is now your responsibility, for the older code was "working correctly" in the past. This amounts to a great social deterrent against adding new error checks.

Let's illustrate the problem with an example. Say you have this kind of code, which does not use exceptions at all:
 void a ( void )
 {
   my_class * my_instance = new my_class();
   ...
   b();
   ...
   delete my_instance;
 }
If b() does not return any error indication, there is no need to protect ''my_instance'' with a smart pointer. If b()'s implementation changes and it now needs to return an error indication, you should amend routine a() to deal with it as follows:
 bool a ( void )
 {
   my_class * my_instance = new my_class();
   ...
   if ( ! b() )
   {
       delete my_instance;
       return false;
   }
   ...
   delete my_instance;
   return true;
 }
That means you have to read and understand a() in order to add the "return false;" statement in the middle. You need to check whether it is safe to destroy the object at that point in time. Maybe you should change the implementation to use a smart pointer now, which may affect other parts of the code. Note that a() has gained a return value, so all callers need to be amended similarly. In short, you have to analyse and modify existing code all the way upwards in order to support the new error path.

If the original code had been written in a defensive manner, with techniques like [http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization Resource Acquisition Is Initialization], and had used C++ exceptions from scratch, chances are it would already have been ready to handle any new error conditions that could come up in the middle of execution. If not, any unsafe code (any resource not managed by a smart pointer, and so on) is a bug which can be more easily assigned to the original developer. Unsafe code may also be fixed during code reviews before the new error condition arrives. Such code makes it easier to add error checks, because a developer does not need to check and modify so much code in the layers above, and is less exposed to blame if something breaks somewhere higher up as a result.

Therefore, for the reasons above, I am not convinced that relying on old-style if() statements for error-handling purposes helps write better code in the long run.
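For comparison, here is a hedged sketch of how a() could look when written defensively with RAII and exceptions; my_class and b() are stand-ins for the routines in the example, with b() made to fail on demand for illustration:

```cpp
#include <memory>
#include <stdexcept>

// Stand-ins for the my_class and b() of the example above.
struct my_class { int value; };

static void b ( bool fail )
{
    if ( fail )
        throw std::runtime_error( "b() failed" );
}

// With RAII there is no manual clean-up and no return-code plumbing:
// if b() throws, the smart pointer deletes the object during stack
// unwinding, and a() needs no changes to stay correct.
static void a ( bool fail )
{
    std::unique_ptr< my_class > my_instance( new my_class() );
    // ...
    b( fail );
    // ...
}   // my_instance is freed here in all cases
```

If b() later gains a new error condition, it simply throws, and neither a() nor any of its callers has to be modified.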
  
 
== Never Ignore Error Indications ==
 
=== Error Codes Should Not Be Used for Logic Decisions ===
  
It is hard enough to design and maintain a stable [http://en.wikipedia.org/wiki/API Application Programming Interface], so your error codes should not be part of it. Error codes change constantly, as new ones are added in order to help pinpoint problems more accurately. Therefore, it is best not to document all of your error codes formally; just mention a few of them as examples of what the caller may get back from a failed operation.

The logic on the caller's side should not make decisions based on the returned error codes. All the caller needs to know is whether there was an error or not. Only the human user will eventually benefit from knowing exactly what went wrong. Therefore, your library users should not actually need a header file that exposes a constant for each error code. In other words, do not publish your library's version of errno.h for the end user.

TODO: Still to write further.
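As a sketch of this principle, the hypothetical library routine below reports failure with a human-readable message only; the caller decides solely on success or failure and passes the text on unchanged to the user:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical library routine: it reports failure with a human-readable
// message, not with a documented error-code constant that callers
// could be tempted to branch on.
static void read_config ( const std::string & filename )
{
    if ( filename.empty() )
        throw std::runtime_error( "Cannot open the configuration file: empty filename." );
    // ... normal processing ...
}

// The caller only decides "did it work?". On failure, the message
// text is handed on for the human user.
static std::string try_read_config ( const std::string & filename )
{
    try
    {
        read_config( filename );
        return "";            // success: empty message
    }
    catch ( const std::exception & e )
    {
        return e.what();      // failure: text for the user, no code to branch on
    }
}
```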
  
 
=== Managing Error Codes in Source Code ===
  
Whenever a developer writes code for a new error condition, there is a strong urge to re-use some existing error code, instead of adding a new one. This is why generic error codes like EINVAL constantly get abused until they are all but meaningless. If you insist on using error codes in order to report error conditions, you should at least state a policy that discourages error code reuse.
  
 
TODO: Still to write.
  
 
TODO: Still to write.

TODO: Sending a format string for internationalisation purposes is normally the wrong thing to do.
  
 
=== Using Positional Parameters ===
  
 
TODO: Still to write.
== Helper Routines ==

Writing robust code in the face of errors is a chore that can be alleviated with the right set of helper routines. In my experience, you don't need too many of them, most of them are trivial wrappers, and, once written, they tend to remain stable. What matters most is the convention and the associated development policy to use them whenever appropriate.

=== Throw Utility ===

Instead of throwing an error object directly whenever an error is detected, you may want to use a ''LogAndThrow( errObj )'' routine instead. This way, you could easily turn on some compilation option in order to dump the call stack at the point of error to a log file, in case a developer needs such detailed information to track down a nasty bug. A shallower form of logging could still be useful for statistical purposes.
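A minimal sketch of such a LogAndThrow() routine might look like this; the LOG_ERRORS_AT_THROW_POINT compilation option is a made-up name for illustration, and a call-stack dump could be added at the marked spot:

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Central throw point: every detected error goes through here, so a
// single compilation option can enable extra logging for all of them.
static void LogAndThrow ( const std::string & errMsg )
{
#ifdef LOG_ERRORS_AT_THROW_POINT   // made-up option name; a call-stack dump could go here too
    std::fprintf( stderr, "About to throw: %s\n", errMsg.c_str() );
#endif
    throw std::runtime_error( errMsg );
}
```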
=== Global Panic Routines ===

See the section about abrupt termination above for more information.

=== Wrappers for printf() and the like ===

# void printf_e ( const char * format_str, ... );

See the section about printf() above for more information.
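A possible implementation sketch of that printf_e() wrapper, following the convention that an error indication from the C library is converted into an exception instead of being silently ignored:

```cpp
#include <cstdarg>
#include <cstdio>
#include <stdexcept>

// Like printf(), but a negative return value from the C library
// is turned into an exception instead of being silently ignored.
static void printf_e ( const char * format_str, ... )
{
    va_list args;
    va_start( args, format_str );
    const int res = std::vprintf( format_str, args );
    va_end( args );

    if ( res < 0 )
        throw std::runtime_error( "Error writing to standard output." );
}
```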
=== String-handling helpers to build error messages ===

I recommend using the standard std::string class. Routines like this would be helpful:

# std::string format_msg ( const char * format_str, ... );

TODO: Still to write.
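One possible sketch of such a format_msg() routine, built on the standard two-pass vsnprintf() idiom (first measure, then format):

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// printf-style formatting into a std::string, convenient for
// building error messages.
static std::string format_msg ( const char * format_str, ... )
{
    va_list args;
    va_start( args, format_str );
    va_list args_copy;
    va_copy( args_copy, args );

    // First pass: measure the formatted length.
    const int len = std::vsnprintf( nullptr, 0, format_str, args );
    va_end( args );

    std::string result;
    if ( len >= 0 )
    {
        // Second pass: do the actual formatting.
        std::vector< char > buffer( static_cast< std::size_t >( len ) + 1 );
        std::vsnprintf( buffer.data(), buffer.size(), format_str, args_copy );
        result.assign( buffer.data(), static_cast< std::size_t >( len ) );
    }
    va_end( args_copy );
    return result;
}
```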
=== Utility functions that collect OS error messages ===

TODO: errno, GetLastError(), ...

=== Wrappers for the most common OS syscalls ===

TODO: open(), close(), fopen_e(). In the case of close(), you will probably want a close_p() for "close with panic" semantics.

=== Smart pointers for memory, file descriptors and so on ===

Classes like ''auto_ptr_free'' for memory pointers obtained through ''malloc()'', or ''auto_file_close'' for Linux file descriptors.

TODO: Still to write.
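A hedged sketch of what such an auto_file_close class could look like for Linux file descriptors, in the pre-C++11 style matching the era of this article:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Smart-pointer-like wrapper for a Linux file descriptor: the destructor
// closes it during stack unwinding, so an early throw cannot leak it.
class auto_file_close
{
public:
    explicit auto_file_close ( int fd ) : m_fd( fd ) {}

    ~auto_file_close ()
    {
        if ( m_fd >= 0 )
            close( m_fd );   // a close_p() "close with panic" could be used here instead
    }

    int get () const { return m_fd; }

private:
    int m_fd;

    // Non-copyable: two owners would close the descriptor twice.
    auto_file_close ( const auto_file_close & );
    auto_file_close & operator= ( const auto_file_close & );
};
```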
== Assertions Are No Substitute for Proper Error Handling ==

[http://en.wikipedia.org/wiki/Assertion_%28computing%29 Assertions] are designed to help developers quickly find bugs, but they are not actually a form of error handling, as assertions are normally left out of release builds. Even if they were not, they tend to generate error messages that a normal user would not understand.

Therefore, you need to resist the temptation to assert on syscall return codes and the like.

However, an assertion is still better than no error check at all. If you have no time to write proper error-checking code, at least add a "to do" comment to the source code. That will remind other developers (or even yourself in the distant future) that this aspect needs to be considered. For example, under Win32:

 if ( FALSE == GetCursorPos( &cursorPos ) )
   assert( false );  // FIXME: Write proper error handling here.

I have grown fond of Microsoft's VERIFY macro for this kind of check:

 VERIFY( FALSE != GetCursorPos( &cursorPos ) );  // FIXME: Write proper error handling here.
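For toolchains without Microsoft's headers, a VERIFY-style macro can be sketched as follows; note that, unlike assert(), the expression is still evaluated in release builds (NDEBUG), only the check itself disappears:

```cpp
#include <cassert>

// VERIFY-style macro for non-Microsoft toolchains: the expression
// is always evaluated; only the check is compiled out with NDEBUG.
#ifdef NDEBUG
  #define VERIFY( expr )  ( ( void ) ( expr ) )
#else
  #define VERIFY( expr )  assert( expr )
#endif
```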
  
 
== The rest of the article has not been written yet ==

Revision as of 20 September 2013, 19:33

These are the personal user pages of rdiez; please do not modify them! The only exceptions are simple language corrections such as typos, wrong prepositions and the like. Please report anything else to the user!



= Error Handling in General and C++ Exceptions in Particular =

== Introduction ==

=== Motivation ===

The software industry does not seem to take software quality seriously, and a good part of it falls into the error-handling category. After putting up for years with so much misinformation, so many half-truths and with a general sentiment of apathy on the subject, I finally decided to write a lengthy article about error handling in general and C++ exceptions in particular.

I am not a professional technical writer and I cannot afford the time to start a long discussion on the subject, but I still welcome feedback, so feel free to drop me a line if you find any mistakes or would like to see some other aspect covered here.

=== Scope ===

This document focuses on the "normal" software development scenarios for user-facing applications or for non-critical embedded systems. There are of course other areas not covered here: there are systems where errors are measured, tolerated, compensated for or even incorporated into the decision process.

=== Audience ===

This document is meant for software developers who have already gathered a reasonable amount of programming experience. The main goal is to give practical information and describe effective techniques for day-to-day work.

Although you can probably guess how C++ exceptions work from the source code examples below, it is expected that you already know the basics, especially the concept of stack unwinding upon raising (throwing) an exception. Look into your favourite C++ book for a detailed description of exception semantics and their syntax peculiarities.

Causes of Neglect

Proper error-handling logic is what sets professional developers apart. Writing quality error handlers requires continuous discipline during development, because it is a tedious task that can easily cost more than the normal application logic for the sunny-day scenario that the developer is paid to write. Testing error paths manually with the debugger is recommended practice, but that doesn't make it any less time consuming. Repeatable test cases that feed the code with invalid data sequences in order to trigger and test each possible error scenario is a rare luxury. This is why error handling in general needs constant encouraging through systematic code reviews or through separate testing personnel. In my experience, lack of good error handling is also symptomatic that the code hasn't been properly developed and tested. A quick look at the error handlers in the source code can give you a pretty reliable measurement of the general code quality.

The fact that most example code snippets in software documentation do not bother checking for error conditions, let alone handling them gracefully, does not help either. This gives the impression that the official developers do not take error handling seriously, like everybody else, so you don't need to either. Sometimes you'll find the excuse of keeping those examples free from "clutter". However, when using a new syscall or library call, working out how to check for errors and how to collect the corresponding error information can take longer than coding its normal usage scenario, so this aspect would actually be most helpful in the usage example. There is some hope, however, as I have noticed that some API usage examples in Microsoft's documentation now include error checks with "handle the error here" comments below. While it is still not enough, it is better than nothing.

It is hard to assess how much value robust error handling brings to the end product, and therefore any extra development costs in this field are hard to justify. Free-software developers are often investing their own spare time and frequently take shortcuts in this area. Software contracts are usually drafted on positive terms describing what the software should do, and robustness in the face of errors gets then relegated to some implied general quality standards that are not properly described or quantified. Furthermore, when a customer tests a software product for acceptance, he is primarily worried about fulfilling the contractual obligations in the "normal", non-error case, and that tends to be hard enough. That the software is brittle either goes unnoticed or is not properly rated in the software bug list.

As a result, small software errors often cascade into great disasters, because the error paths in between fail one after another across all the different software layers and communicating devices, as the error handlers hardly ever got any attention. But even in this scenario, the common excuse sounds like "yes, but my part wouldn't have failed if the previous one hadn't failed in the first place".

In addition to all of the above, when the error-handling logic does fail, or when it does not yield helpful information for troubleshooting purposes, it tends to impact first and foremost the users' budget, and not the developer's, and that normally happens after the delivery and payment dates. Even if the error does come back to the original developer, it may find its way through a separate support department, which may even be able to provide a work-around and further justify the business case for that same support department. If nothing else helps, the developer's urgent help is then suddenly required for a real-world, important business problem, which may help make that original developer a well-regarded, irreplaceable person. After all, only the original person understands the code well enough to figure out what went wrong, and any newcomers will shy away from making any changes to a brittle codebase. This scenario can also hold true in open-source communities, where social credit from quickly fixing bugs may be more relevant than introducing those bugs in the first place. All these factors conspire to make poor error handling an attractive business strategy.

In the end, error handling gets mostly neglected, and that is reflected in our day-to-day experience with computer software. There are plenty of jokes around about unhelpful or funny error messages. Many security issues have their roots in incorrect error detection or handling, and such issues are still getting patched on a weekly basis for operating system releases that have been considered stable for years.

General Strategy

Robust error handling is costly but it is an important aspect of software development. Choosing a good strategy from the beginning reduces costs in the long run. These are the main goals:

  1. Provide helpful error messages.
  2. Deliver the error messages timely and to the right person.
    The developer may want more information than the user.
  3. Limit the fallout after an error condition.
    Only the operation that failed should be affected, the rest should continue to run.
  4. Reduce the development costs of:
    • adding error checks to the source code.
    • repurposing existing code.

Non-goals are:

  1. Improve error tolerance.
    Normally, when an error occurs, the operation that caused it is considered to have failed. This document does not deal with error tolerance at all.
  2. Optimise error-handling performance.
    In normal scenarios, only the successful (non-error) paths need to be fast. This may not hold true on critical, real-time systems, where the error response time needs to meet certain constraints.
  3. Optimise memory consumption.
    Good error messages and proper error handling come at a cost, but the investment almost always pays off.

Compromises

Writing good error-handling logic can be costly, and sometimes compromises must be made:

Unpleasant Error Messages

In order to keep development costs under control, the techniques described below may tend to generate error messages that are too long or unpleasant to read. However, such drawbacks are easily outweighed by the disadvantages of delivering too little error information. After all, errors should be the exception rather than the rule, so users should not need to read too many error messages during normal operation.

Abrupt Termination

It may often be preferable to let an application panic on a severe error rather than to try to cope with the error condition or ignore it altogether.

Some errors are just too expensive or virtually impossible to handle. An example could be a failed close( file_descriptor ); syscall, which should never fail; when it does, there is not much the error handler can do about it. Such a failure is symptomatic of a serious logic error, but usually one that is easy to fix.
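As an illustration, here is a minimal sketch of a wrapper that follows this advice (the name close_e is made up for this example): since a failed close() points to a logic error that the caller cannot meaningfully handle, the wrapper reports it and terminates instead of returning a normal error.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

// Hypothetical wrapper: a failed close() indicates a serious logic error
// (a wrong or already-closed descriptor), so panic with a helpful report
// instead of raising a normal error.
void close_e ( const int file_descriptor )
{
  if ( close( file_descriptor ) != 0 )
  {
    std::fprintf( stderr, "Panic: close() failed for file descriptor %d: %s\n",
                  file_descriptor, std::strerror( errno ) );
    std::abort();  // Controlled crash, as discussed above.
  }
}
```

Callers then simply write close_e( fd ); with no error check of their own.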

Other error conditions may indicate that some memory is corrupt or that some data structure has invalid information that hasn't been detected soon enough. If the application carries on, its behaviour may well be undefined (it may act randomly), which may be even more undesirable than an instant crash.

Leaving a memory, handle or resource leak behind is not an option either, because the application will crash later on for a seemingly random reason. The user will probably not be able to provide an accurate error report, and the error will not be easy to reproduce either. The real cause will be very hard to discover and the user will quickly lose confidence in the general application stability.

Raising a normal error from inside clean-up code may give the caller the wrong impression about what really happened. Let's go back to the failing close( file_descriptor ); syscall, as it is a relatively common scenario. In most cases, a file descriptor is closed after the work has been done. If the descriptor fails to close, the code probably attempted to close the wrong one, leaving a handle leak behind. Or the descriptor was already closed, in which case it's probably a very simple logic error to fix. If you decide to raise the error nevertheless, instead of aborting the whole application with a sudden panic, you have the following issues:

  1. You risk a file handle leak, which may be worse than the panic, see above.
  2. You report an error for an operation that has actually successfully completed.
    Unless there is some sort of transaction support, and the caller can undo the work (for example, in the context of an SQL database transaction), the user and/or the software layers above will get the wrong impression that the operation failed when in fact it succeeded. This can have dangerous consequences, especially when money transfers are involved. Therefore, it could be best to crash straight away, as the operator will not be so sure whether the transaction did complete and will probably want to check manually after the system restarts.

Sometimes in practical life, you just have to give in. Let's consider the case where some clean-up code fails and the caller can perform a transaction roll-back. What happens if the roll-back itself fails then? If you raise a normal error again, you're back to the scenario where the operation succeeded, but was reported as a failure. You could indicate at that point that the software is unsure whether the operation completed or not. However, such an error message is only helpful to humans, other automated callers may assume that it failed in the presence of an error indication. You could also add some flag to the error information indicating whether the caller should carry on or assume a corruption that can only be manually fixed. But, at some point during development of clean-up and error-handling code, you'll have to draw the line and treat normal errors as irrecoverable panics. Otherwise, the code will get too complicated to maintain economically.

Abrupt termination is always unpleasant, but a controlled crash at least lets the user know what went wrong. Although it may sound counterintuitive, such an immediate crash will probably help improve the software quality in the long run, as there will be an incentive to fix the error quickly together with a helpful panic report.

After all, if you are worried about adding panic points to your source code, keep in mind that you will not be able to completely rule out abrupt termination anyway. Just touching a NULL pointer, calling some OS syscall with the wrong memory address or using too much stack space at the wrong place may terminate your application at once.

Besides, a complete crash will trigger any emergency mechanism installed, like automatically restarting the failed service/daemon and timely alerting your system administrators. Such a recovery course may be better than any unpredictable behaviour down the line due to a previous error handled incorrectly.

Do Not Install Your Own Critical Error Handler

Some people are tempted to write clever unexpected error handlers to help deal with panics, or even avoid them completely. However, it is usually better to focus on the emergency recovery procedures after the crash rather than installing your own crash handler in an attempt at capturing more error information or surviving the unknown error condition.

Your Operating System will probably do a better job at collecting crash information, you may just need to enable its crash reporting features. You don't want to interfere with that process, because, if your memory contents or window handles are already corrupt or invalid, trying to run your crash handler may make matters worse and corrupt or even mask the original crash altogether.

Getting an in-application crash handler right is hard if not downright impossible, and I've seen quite a few of them failing themselves after the first application failure they were supposed to report. If you have time to spare on fatal error scenarios, try to minimise their consequences by saving the user data at regular intervals before the crash (like some text editors or word processors do) or by configuring the system to automatically restart any important service when it crashes. You could also direct your remaining efforts at improving your software quality process instead.

How to Generate Helpful Error Messages

Let's say you press the 'print' button on your accounting application and the printing fails. Here are some example error messages, ordered by message quality:

  1. Kernel panic / blue screen / access violation.
  2. Nothing gets printed, and there is no error message.
  3. There was an error.
  4. Error 0x03A5.
  5. Error 0x03A5: Write access denied.
  6. Error opening file: Error 0x03A5: Write access denied.
  7. Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  8. Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  9. I cannot start printing the letters because write access to file "invoice arrears.txt" was denied.
  10. Before trying to print those letters, please remove the write-protection tab from the SD Card.
    In order to do that, remove the little memory card you just inserted and flip over the tiny white plastic switch on its left side.
  11. You don't need to print those letters. Those customers are not going to pay. Get over it.

Let's evaluate each of the error messages above:

  1. Worst-case scenario.
  2. Awful. Have you ever waited to no avail for a page to come out of a printer?
    When printing, there usually is no success indication either, so the user will wonder and probably try again after a few seconds. If the operation did not actually fail, but the printer just happens to be a little slow, he will end up with 2 or more printed copies. It happens to me all the time, and we live in 2013 now.
    If the printing did fail, where should the user find the error cause? He could try and find the printer's spooler queue application. Or he could try with 'strace'. Or look in the system log file. Or maybe the CUPS printing service maintains a separate log file somewhere?
  3. Negligent development.
  4. Unprofessional development.
  5. You show some hope as a programmer.
  6. You are getting the idea.
  7. You are implementing the idea properly.
  8. This is the most that you can achieve in practice.
    The error message has been generated by a computer, and it shows: it is too long, clunky and sounds artificial. But the error message is still helpful, and it contains enough information for the user to try to understand what went wrong, and for the developer to quickly pin-point the issue. It's a workable compromise.
  9. Unrealistic. This text implies that the error message generation was deferred to a point where both knowledge was available about the high-level operation that was being carried out (printing letters) and about the particular low-level operation that failed (opening a file). This kind of error-handling logic would be too hard to implement in real life.
    Alternatively, the software could check the most common error scenarios upfront, before attempting to print the letters. However, that strategy does not scale, and it's not worth implementing if the standard error handling is properly written. Consider checking beforehand if there is any paper left in the printer. If the user happens to have a printer where the paper level reporting does not work properly, the upfront check would not let him print, even though printing would actually work. Implementing an "ignore this advance warning" button would fix it, but you don't want the user to dismiss that warning every time. Should you also implement a "don't show this warning again today" button for each possible advance warning?
  10. In your dreams. But there is an aspect of this message that the Operating System could have provided in the messages above: instead of saying "write access denied", it could have said "write access denied because the storage medium is write protected". Or, better still, "cannot modify the file because the memory card is physically write protected". That is doable, because it's a common error and the OS could internally find out the reason for the write protection and provide a textual description of the write-protected media type. But Linux could never build such error messages with its errno-style error reporting.
    Providing a hint about fixing the problem is not as unrealistic as it might appear at first. After all, Apple's NSError class in the Cocoa framework has fields like localizedRecoverySuggestion, localizedRecoveryOptions and even NSErrorRecoveryAttempting. I do think that such a fine-grained implementation is overkill and hard to implement in practice across operating systems and libraries, but providing a helpful recovery hint in the error message could be achievable.
  11. Your computer has become self-aware. You may stop worrying now about error handling in your source code.

Therefore, the best achievable error message in practice, assuming that the Operating System developers have read this guide too, would be:

Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Cannot open the file with write access. Try switching the write protection tab over on the memory card.

Note that I left the error code out, as it does not really help. More on that further below.

The end-user will read such long error messages left-to-right, and may only understand them up to a point, but that could be enough to make out the problem and maybe to work around it. If there is a useful hint at the end, hopefully the user will also read it. Should the user decide to send the error message to the software developer, there will be enough detail towards the right to help locate the exact issue down some obscure library call.

Such an error message gets built from right to left. When the 'open' syscall fails, the OS delivers the error code (0x03A5) and the low-level error description "Cannot open the file with write access". The OS may add the suffix "Try switching the write protection tab over on the memory card" or an alternative like "The file system was mounted in read-only mode" after checking whether the card actually has such a switch that is currently switched on. A single string is built out of these components and gets returned to the level above in the call stack. Instead of a normal 'return' statement, you would raise a C++ exception with 'throw' (or fill in some error information object passed from above). At every relevant stage on the way up while unwinding the call stack (at every 'catch' point), the error string gets a new prefix (like "Error opening file "invoice arrears.txt": "), and the exception gets passed further up (gets 'rethrown'). At the top level (the last 'catch'), the final error message is presented to the user.

The source code will contain a large number of 'throw' statements but only a few 'catch/rethrow' points. There will be very few final 'catch' levels, except for GUI applications, where each button-press event handler will need one. However, all such GUI 'catch' points will look the same: they will probably call some helper routine in order to display a standard modal error message box.

How to Write Error Handlers

Say you have a large program written in C++ with many nested function calls, like this example:

int main ( int argc, char * argv[] )
{
   ...
   b();
   ...
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   // Error check example: we only accept filenames that are at least 10 characters long.

   if ( strlen( filename ) < 10 )
   {
     // What now? Ideally, we should report that the filename should be at least 10 characters long.
   }

   // Yes, you should check the return value of printf(), see further below for more information.

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     // What now?
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     // What now?
   }
   ...
}

Let's try to deal with the errors in routine e() above. It's a real pain, as it distracts us from the real work we need to do. But it has to be done.

Here is a very common approach where all routines return an integer error code, like most Linux system calls do. Note that zero means no error.

int main ( int argc, char * argv[] )
{
   ...
   int error_code = b();
   if ( error_code != 0 )
   {
     fprintf( stderr, "Error %d calling b().", error_code );
     return 1;  // This is equivalent to exit(1);
                // We could also return error_code directly, but you need to check
                // what the exit code limit is on your operating system.
   }
   ...
}

int b ( void )
{
   ...
   int err_code_1 = c("file1.txt");
   if ( err_code_1 != 0 )
   {
     return err_code_1;
   }

   int err_code_2 = c("file2.txt");
   if ( err_code_2 != 0 )
   {
     return err_code_2;
   }
   ...
}

int c ( const char * filename )
{
   ...
   int err_code = d( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int d ( const char * filename )
{
   ...
   int err_code = e( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     // Return some non-zero value, but which one?
     // Shall we create our own list of error codes?
     // Or should we just pick a random one from errno.h, like EINVAL?
     return EINVAL;
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     return errno;  // printf() sets errno, but is that the right code to pass up?
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     fprintf( stderr, "Error opening file %s: %s", filename, strerror( errno ) );
     return errno;  // fopen() sets errno too.
   }
   ...
}

As shown in the example above, the code has become less readable. All function calls are now inside if() statements, and you have to manually check the return values for possible errors. Maintaining the code has become cumbersome.

There is just one place in routine main() where the final error message gets printed, which means that only the original error code makes its way to the top and any other context information gets lost, so it's hard to know what went wrong during which operation. We could call printf() at each point where an error is detected, like we do after the fopen() call, but then we would be calling printf() all over the place. Besides, we may want to return the error message to a caller over the network or display it to the user in a dialog box, so printing errors to the standard output may not be the right thing to do.

The same code uses C++ exceptions and looks much more readable:

int main ( int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     // We can decide here whether we want to print the error message to the console, write it to a log file,
     // display it in a dialog box, send it back over the network, or all of those options at the same time.
     fprintf( stderr, "Error calling b(): %s", e.what() );
     return 1;
   }
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     throw std::runtime_error( "The filename should be at least 10 characters long." );
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     throw std::runtime_error( collect_errno_msg( "Error opening file %s: ", filename ) );
   }
   ...
}

If the strlen() check above fails, the 'throw' statement stops execution of routine e(), and control returns all the way up to the 'catch' statement in routine main() without executing any more code in any of the intermediate callers b(), c(), etc.

We still have a number of error-checking if() statements in routine e(), but we could write thin wrappers for library or system calls like printf() and fopen() in order to remove most of those if()'s. A wrapper like fopen_e() would just call fopen() and throw an exception in case of error, so the caller does not need to check with if() any more.
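A minimal sketch of such a wrapper, assuming that errors are reported as std::runtime_error carrying the errno description (the name fopen_e matches the naming convention used later in this article):

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical thin wrapper: behaves like fopen(), but throws on error,
// so the caller no longer needs an if() check.
FILE * fopen_e ( const char * filename, const char * mode )
{
  FILE * const f = std::fopen( filename, mode );

  if ( f == NULL )
  {
    throw std::runtime_error( std::string( "Error opening file \"" ) + filename +
                              "\": " + std::strerror( errno ) );
  }

  return f;
}
```

Callers can then write FILE * f = fopen_e( filename, "rb" ); and let the exception mechanism deal with the error path.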

Improving the Error Message with try/catch Statements

Let's improve routine e() so that all error messages generated by that routine automatically mention the filename. That should also be the case for any errors generated by any routines called from e(), even though those routines may not get the filename passed as a parameter. The improved code looks like this:

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       throw std::runtime_error( "The filename should be at least 10 characters long." );
     }

     if ( printf( "About to open file %s", filename ) < 0 )
     {
       throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
     }

     FILE * f = fopen( filename, ... );
     if ( f == NULL )
     {
       throw std::runtime_error( collect_errno_msg( "Error opening the file." ) );
     }
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}

In the example above, helper routines format_msg() and collect_errno_msg() have not been introduced yet, see below for more information.
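In the meantime, here is one possible sketch of what those helpers could look like, assuming printf-style formatting and a fixed buffer limit (the actual implementations may differ):

```cpp
#include <cerrno>
#include <cstdarg>
#include <cstdio>
#include <cstring>
#include <string>

// Possible sketch: format_msg() works like sprintf(), but returns a std::string.
std::string format_msg ( const char * format_str, ... )
{
  char buffer[ 1024 ];  // Fixed message length limit, an assumption of this sketch.

  va_list arg_list;
  va_start( arg_list, format_str );
  std::vsnprintf( buffer, sizeof( buffer ), format_str, arg_list );
  va_end( arg_list );

  return buffer;
}

// Possible sketch: collect_errno_msg() appends the errno description
// to a formatted message prefix.
std::string collect_errno_msg ( const char * format_str, ... )
{
  const int saved_errno = errno;  // Save errno before any other call can change it.

  char buffer[ 1024 ];

  va_list arg_list;
  va_start( arg_list, format_str );
  std::vsnprintf( buffer, sizeof( buffer ), format_str, arg_list );
  va_end( arg_list );

  return std::string( buffer ) + std::strerror( saved_errno );
}
```

Note that collect_errno_msg() saves errno first, because any later library call could overwrite it.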

Note that all exception types are converted to an std::exception object, so only the error message is preserved. There are other options that will be discussed in another section further ahead.

You may not need a catch(...) statement if your application exclusively uses exception classes ultimately derived from std::exception. However, if you always add one, the code will generate better error messages if an unexpected exception type does come up. Note that, in this case, we cannot recover the original exception type or error message (if there was a message at all), but the resulting error message should get the developer headed in the right direction. You should add at least one catch(...) statement at the application's top level, in the main() function. Otherwise, the application might end up in the unhandled exception handler, which may not be able to deliver a clue to the right person at the right time.

We could improve routine b() in the same way too:

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

You need to find a good compromise when placing such catch/rethrow blocks in the source code. Write too many, and the error messages will become bloated. Write too few, and the error messages may miss some important clue that would help troubleshoot the problem. For example, the error message prefix we just added to routine b() may help the user realise that the affected file is part of his personal address book. If the user has just added a new address book entry, he will probably guess that the new entry is invalid or has rendered the address book corrupt. In this situation, that little error message prefix provides the vital clue that removing the new entry or reverting to the last address book backup may work around the problem.

If you look at the original code, you'll realise that routine c() is actually the first one to get the filename as a parameter, so routine c() may be the optimal place for the try/catch block we added to routine e() above. Whether the best place is c() or e(), or both, depends on who may call these routines. If you move the try/catch block from e() to c() and someone calls e() directly from outside, he will need to provide the same kind of try/catch block himself. You need to be careful with your call-tree analysis, or you may end up mentioning the filename twice in the resulting error message, but that's still better than not mentioning it at all.

Using try/catch Statements to Clean Up

Sometimes, you need to add try/catch blocks in order to clean up after an error. Consider this modified c() routine from the example above:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();
  ...
  d( filename );
  ...
  delete my_instance;
}

If d() were to throw an exception, we would get a memory leak. This is one way to fix it:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();

  try
  {
    ...
    d( filename );
    ...
  }
  catch ( ... )
  {
    delete my_instance;
    throw;
  }

  delete my_instance;
}

Unfortunately, C++ lacks a 'finally' clause, which I consider to be a glaring oversight. Many other languages, such as Java or Object Pascal, do have 'finally' clauses. Without it, we need to write "delete my_instance;" twice in the example above. See further below for an alternative approach with smart pointers and other wrapper classes.
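The smart-pointer alternative can be sketched as follows. This sketch uses std::unique_ptr from C++11 rather than the std::auto_ptr of the article's 2013 era, and my_class plus the simulated error are stand-ins for the example above. The smart pointer's destructor also runs during stack unwinding, so neither the catch block nor the duplicated delete is needed:

```cpp
#include <memory>
#include <stdexcept>

// Stand-in for the my_class of the example above; instance_count tracks
// live instances for demonstration purposes only.
struct my_class
{
  static int instance_count;
  my_class ()  { ++instance_count; }
  ~my_class () { --instance_count; }
};

int my_class::instance_count = 0;

void c_with_raii ( const bool simulate_error_in_d )
{
  std::unique_ptr< my_class > my_instance( new my_class() );

  if ( simulate_error_in_d )  // Stands in for d( filename ) throwing an exception.
    throw std::runtime_error( "Simulated error in d()." );

}  // my_instance is deleted here, on both the normal and the error path.
```

The clean-up code now appears exactly once, and it cannot be skipped by an early 'throw'.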

The Final Version

This is what the example code above looks like with smart pointers, wrapper functions and a little extra polish:

int main ( const int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     return top_level_error( e.what() );
   }
   catch ( ... )
   {
     return top_level_error( "Unexpected C++ exception." );
   }
}

int top_level_error ( const char * const msg )
{
  if ( fprintf( stderr, "Error calling b(): %s", msg ) < 0 )
  {
    // It's hard to decide what to do here. At least let the developer know.
    assert( false );
  }

  return 1;
}

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

void c ( const char * filename )
{
  std::auto_ptr< my_class > my_instance( new my_class() );
  ...
  d( filename );
  ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       throw std::runtime_error( "The filename should be at least 10 characters long." );
     }

     printf_to_log_e( "About to open file %s", filename );

     auto_close_file f( fopen_e( filename, ... ) );

     const size_t read_count = fread_e( some_buffer, some_byte_count, 1, f.get_FILE() );

     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}
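The article does not show the implementation of the auto_close_file wrapper used above, but it could be sketched like this: a small RAII class that owns a FILE * and closes it in its destructor, which also runs during stack unwinding. A failed fclose() triggers a panic, as discussed in the section on abrupt termination, because destructors must never throw.

```cpp
#include <cstdio>
#include <cstdlib>

// Possible sketch of the auto_close_file wrapper used above (an assumption,
// not the article's actual implementation).
class auto_close_file
{
public:
  explicit auto_close_file ( FILE * const f )
    : m_file( f )
  {
  }

  ~auto_close_file ()
  {
    if ( std::fclose( m_file ) != 0 )
    {
      std::abort();  // A failed close is a serious logic error; never throw from a destructor.
    }
  }

  FILE * get_FILE () const { return m_file; }

private:
  FILE * m_file;

  // Forbid copying, otherwise the file would be closed twice.
  auto_close_file ( const auto_close_file & );
  auto_close_file & operator= ( const auto_close_file & );
};
```

Together with fopen_e(), this removes both the error check after opening and the clean-up logic for closing.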

Why You Should Use Exceptions

The exception mechanism is the best way to write general error-handling logic. After all, it was designed specifically for that purpose. Even though the C++ language shows some weaknesses (the lack of a 'finally' clause, the need for several helper routines), the exception-enabled code example above shows a clear improvement. However, there are surprisingly many opponents, especially in the context of the C++ programming language. While I don't share most of the critique, there are still issues with some compilers and some C++ runtime libraries, even as late as year 2013.

Modern applications and software frameworks tend to rely on C++ exceptions for error handling, and it is impractical to ignore C++ exceptions nowadays. The C++ Standard Template Library (STL), Microsoft's ATL and MFC are prominent examples. Just by using them you need to cater for any exceptions they might throw.

In the case of the STL, the designers seem to have put a lot of effort into avoiding making C++ exceptions a hard requirement in practice. The most common STL implementations still work even if C++ exception support is turned off when compiling. As long as you use the STL containers carefully and do not trigger any errors, everything will work fine. You don't miss exceptions very much because the STL has always favoured speed over robustness and does little error checking except for some debug-only assertions. Using an invalid iterator, for example, quickly leads to fatal crashes. However, the official documentation still states that the library reports errors by throwing exceptions. In fact, an out-of-memory condition when inserting elements into an STL container can only be reported by throwing a C++ exception, as the library has no other way to return such an error indication. Without C++ exception support, a release build will fail or even crash without warning if that happens. However, out-of-memory conditions are often ignored altogether during development, so this particular shortcoming does not show up in practice either.

Exceptions are prevalent outside the C++ world: Java, Javascript, C#, Objective-C, Perl and Emacs Lisp, for example, use exceptions for error-handling purposes. And the list goes on.

Even plain C has a similar setjmp/longjmp mechanism. The need to quickly unwind the call stack on an error condition is a very old idea indeed.
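That mechanism can be sketched as follows (the routine names are made up): longjmp() transfers control straight back to the matching setjmp() call, skipping all intermediate stack frames, but, unlike C++ exceptions, without running any destructors on the way.

```cpp
#include <csetjmp>

// Hypothetical example of plain-C-style unwinding with setjmp/longjmp.
static std::jmp_buf error_jump_point;

void low_level_routine ( const bool fail )
{
  if ( fail )
    std::longjmp( error_jump_point, 1 );  // "Throw": jump back to the setjmp() point.
}

int run ( const bool fail )
{
  if ( setjmp( error_jump_point ) != 0 )  // "Catch": longjmp() lands here with a non-zero value.
    return -1;  // The error path.

  low_level_routine( fail );
  return 0;  // The normal path.
}
```

The lack of destructor calls is precisely why setjmp/longjmp is dangerous in C++ code, and why the language provides exceptions instead.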

Downsides of Using C++ Exceptions

Exceptions make the code bigger and/or slower

This should not be the case, and even if it is, it is almost always an issue with the current version of the compiler or its C++ runtime library. For example, in my experience, GCC generates smaller exception-handling code for the ARM platform than for the PowerPC.

But first of all, even if the code size does increase or if the software becomes slower, it may not matter much. Better error-handling support may be much more important.

In theory, logic that uses C++ exceptions should generate smaller code than the traditional if/else approach, because the exception-handling support is normally implemented with stack unwind tables that can be efficiently shared (commoned up) at link time.

Because source code that uses exceptions does not need to check for errors at each call (with the associated branch/jump instruction), the resulting machine code should run faster in the normal (non-error) scenario and slower if an exception occurs (as the stack unwinder is a generic, table-driven routine). This is actually an advantage, as speed is not normally important when handling error conditions.

However, code size or speed may still be an issue in severely-constrained embedded environments. Enabling C++ exceptions has an initial impact on the code size, as the stack unwinding support needs to be linked in. Compilers may also conspire against you. Let's say you are writing a bare-metal embedded application for a small microcontroller that does not use dynamic memory at all (everything is static). With GCC, turning on C++ exceptions means pulling in the malloc() library, as its C++ runtime library creates exception objects on the heap. Such a strategy may be faster on average, but it is not always welcome. The C++ specification allows for exception objects to be placed on the stack and copied around when necessary during stack unwinding. Another implementation could also use a separate, fixed-size memory area for that purpose. However, GCC offers no alternative implementation.

GCC's development is particularly sluggish in the embedded area. After years of neglect, version 4.8.0 finally gained configuration switch --disable-libstdcxx-verbose, which avoids linking in big chunks of the standard C I/O library just because you enabled C++ exception support. If you are not compiling a standard Linux application, chances are that the C++ exception tables are generated in the "old fashioned" way, which means that the stack unwind tables will have to be sorted on first touch. The first throw statement will incur a runtime penalty, and, depending on your target embedded OS, this table sorting may not be thread safe, so you may have to sort the tables on start-up, increasing the boot time.

Debug builds may get bigger when turning C++ exceptions on. The compiler normally assumes that any routine can throw an exception, so it may generate more exception-handling code than necessary. Ways to avoid this are:

  1. Append "throw()" to the function declarations in the header files.
    This indicates that the function will never throw an exception. Use it sparingly, or you may find it difficult to add an error check in one of those routines at a later point in time.
  2. Turn on global optimisation (LTO).
    The compiler will then be able to determine whether a function called from another module could ever throw an exception, and optimise the callers accordingly.
    Unfortunately, using GCC's LTO is not yet a viable option on many architectures. You may be tempted to discard LTO altogether because of the lack of debug information on LTO-optimised executables (as of GCC version 4.8).
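For illustration, here is a minimal sketch of the first technique. Note that "throw()" is the pre-C++11 spelling; since C++11 the preferred form is noexcept, which is what the sketch below uses:

```cpp
#include <cstddef>

// Marking a routine as non-throwing lets the compiler drop the
// exception-handling code at its call sites. "throw()" was the
// pre-C++11 spelling for this; noexcept is the modern equivalent.
std::size_t get_buffer_size ( void ) noexcept
{
  return 1024;
}
```

As mentioned above, use such annotations sparingly, or you may find it difficult to add an error check to one of those routines later on.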

Exceptions are unsafe because they tend to break existing code more easily

The usual argument is that, if you make a change somewhere deep down the code, an exception might come up at an unexpected place higher up the call stack and break the existing software. However, I believe that developers are better off embracing the idea of defensive programming from the start. With or without exceptions, errors do tend to come up at unexpected places in the end.

It is true that, if the old code handles errors with manual if() statements, adding a new error condition normally means adding extra if() statements that make the new code paths more obvious. However, when a routine gains an error return code, existing callers are often not amended to check it. Furthermore, it is unlikely that developers will review the higher software layers, or even test the new error scenario, so as to make sure that the application can handle the new error condition correctly.

More importantly, in such old code there is a strong urge to handle errors only where strictly necessary, that is, only where error checks already occur. As a result, if a piece of code was not expecting any errors from all the routines it calls, and one of those routines can now report an error, the calling code will not be ready to handle it. Therefore, the developer adding an error condition deep down below may need to add a whole bunch of if() statements in many layers above in order to handle that new error condition. You need to be careful when adding such if() statements around: if any new error check could trigger an early return, you need to know what resources need to be cleaned up beforehand. That means inspecting a lot of older code that other developers have written. Anything that breaks further up is now your responsibility, for the older code was "working correctly" in the past. This amounts to a great social deterrent against adding new error checks.

Let's illustrate the problem with an example. Say you have this kind of code, which does not use exceptions at all:

void a ( void )
{
  my_class * my_instance = new my_class();
  ...
  b();
  ...
  delete my_instance;
}

If b() does not return any error indication, there is no need to protect my_instance with a smart pointer. If b()'s implementation changes and it now needs to return an error indication, you should amend routine a() to deal with it as follows:

bool a ( void )
{
  my_class * my_instance = new my_class();
  ...
  if ( ! b() )
  {
     delete my_instance;
     return false;
  }
  ...
  delete my_instance;
  return true;
}

That means you have to read and understand a() in order to add the "return false;" statement in the middle. You need to check whether it is safe to destroy the object at that point in time. Maybe you should change the implementation to use a smart pointer now, which may affect other parts of the code. Note that a() has gained a return value, so all callers need to be amended similarly. In short, you have to analyse and modify existing code all the way upwards in order to support the new error path.

If the original code had been written in a defensive manner, with techniques like Resource Acquisition Is Initialization, and had used C++ exceptions from scratch, chances are it would already have been ready to handle any new error conditions that could come up in the middle of execution. If not, any unsafe code (any resource not managed by a smart pointer, and so on) is a bug which can be more easily assigned to the original developer. Unsafe code may also be fixed during code reviews before the new error condition arrives. Such code makes it easier to add error checks, because a developer does not need to check and modify so much code in the layers above, and is less exposed to blame if something breaks somewhere higher up as a result.
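For comparison, here is a sketch of how a() could look if written defensively with a smart pointer from the start. The destructor counter and the failure flag are only there to make the example self-contained; they are not part of the original example:

```cpp
#include <memory>
#include <stdexcept>

static int live_instances = 0;  // For illustration only: tracks object leaks.

class my_class
{
public:
  my_class  ( void ) { ++live_instances; }
  ~my_class ( void ) { --live_instances; }
};

static bool b_should_fail = false;  // For illustration only.

void b ( void )  // May now report errors by throwing; a() needs no changes.
{
  if ( b_should_fail )
    throw std::runtime_error( "Error in b()." );
}

void a ( void )
{
  std::unique_ptr< my_class > my_instance( new my_class );
  // ...
  b();  // If b() throws, my_instance is destroyed during stack unwinding.
  // ...
}  // The smart pointer also releases the object on normal return.
```

If b() later gains a new error condition and reports it with an exception, a() and all of its callers remain correct without any modification.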

Therefore, for the reasons above, I am not convinced that relying on old-style if() statements for error-handling purposes helps writing better code in the long run.

Never Ignore Error Indications

I once had Win32 GetCursorPos() failing on a remote Windows 2000 without a mouse when remotely controlled with a kind of VNC software. Because the code ignored the error indication, a random mouse position was passed along and it made the application randomly fail later on. As soon as I sat in front of the remote PC, I connected the mouse and then I could not reproduce the random failures any more. The VNC software emulated a mouse on the remote PC (a mouse cursor was visible), so the cause wasn't immediately obvious. And Windows 2000 without that VNC software also provided a mouse pointer position even if you didn't have one. It was the combination of factors that triggered the issue. It was probably a bug in Windows 2000 or in the VNC software, but still, even a humble assert around GetCursorPos(), which is by no means a proper error check, would have saved me much grief.

The upshot is, everything can fail at any point in time. Always check. If you really, really don't have the time, add at least a "to do" comment to the relevant place in the source code, so that the next time around you or some other colleague eventually adds the missing error condition check.

Checking Errors from printf()

Nobody checks the return value from printf() calls, but you should, for the following reasons:

  1. A failing printf() may provide an early warning for some other severe error condition, like the disk being full.
    If the program ignores the error and happily keeps chugging along, it may be hard to tell what went wrong. Or it may fail later on at an inconvenient time, when you actually need the software to do something useful.
  2. Writing to a log file with printf() may be part of the software's feature set, and it may be unacceptable to carry on performing actions without generating the corresponding log entries. Besides, if something else fails later on and you need to troubleshoot the problem, you'll have no log file to look at.
  3. A process' stdout may be piped to another process or even redirected over the network. These days you can easily access a remote serial port on another PC, so things that should always work, like serial port writes, could suddenly fail because of network issues. If such errors are not handled properly, the program may quit without a proper message or just crash with a generic SIGPIPE signal. Worse still, if errors are ignored, other programs down the pipe may attempt to process faulty or non-existent output from the previous process. The whole pipe may hang without an error indication.

The points above apply of course to all I/O operations on stdin, stdout and stderr, and to many languages other than C++.

Although ignoring error codes from function calls is obviously bad practice, it is so widespread that the GCC developers have added an extra warning designed to catch such sloppy code; see function attribute warn_unused_result and compiler switch -Wunused-result for more information.

Checking the error code from printf() and the like is tedious, so you are better off writing wrapper functions. Here is a routine I have been using in Perl. It does need some improvement, but it's better than nothing. The main problem is remembering to use it instead of the prevalent print built-in routine.

sub write_stdout ( $ )
{
  my $str = shift;

  ( print STDOUT $str ) or
     die "Error writing to standard output: $!\n";
}

About Error Codes

Error Codes Are a Waste of Time

Error codes do not really help, only the error message does. To the average user, error codes are just noise on the line. Developers will also need to translate such codes into some source code constant or human-readable error string, so it's a waste of time for them too.

The only scenario where an error code could be helpful is if the developer does not understand the end-user's language at all. For example, an English developer may get an error message in Chinese. However, nowadays most error messages are communicated electronically, so the developer should have no trouble searching for the translated text in the source code in order to locate the corresponding error strings in English.

That is the reason why your error dialog boxes should always include a prominent button to copy the error message to the clipboard in plain-text form. If the user takes a screen snapshot as a bitmap, you may have trouble typing text in a foreign alphabet into your search box.

Beware that error codes are firmly entrenched in computer error messages. They have become part of the general computing culture. Therefore, you may find it difficult to leave them out of your error messages.

Providing Error Codes Only

There are many Operating Systems, libraries, APIs, etc. that rely exclusively on error codes in order to report error conditions. This is always a mistake, as error codes are not immediately useful without the user manual, and they can only describe a particular low-level condition, without any of the circumstantial information that could also help identify the problem.

Error codes are often re-used, as they would otherwise grow with each new software version to an unmanageably long list. With each reuse, error codes gain ambiguity and thus lose expressiveness. Furthermore, error codes often force you to discard error information coming from lower software layers whose error codes cannot be effectively mapped to your own codes. That may not be the case now, but it may happen to your code library in the future. Say for example that your library starts storing its data in an SQL database, but this backend change should be transparent to the library user. Any SQL error messages will get discarded if your library's API can only report errors with a set of predefined error codes. In the end, an error code alone does not normally provide enough information to be really useful, so the user needs to resort to other sources like the application's console output or its log file.

Even if you are writing firmware for a severely-constrained embedded environment, you should strive to provide a way to report proper error messages in your code and communication protocols. Until your embedded hardware platform grows, you may have to limit your error messages to a set of predefined mnemonics. You may even have to re-use some of them. However, there is a good chance that your software will be re-used on a bigger platform, and then you can start reporting errors properly without having to re-engineer your code and protocols.

Look at Linux routine open(), for example. The Linux Programmer's Manual documents 21 error codes as of September 2013. Wouldn't life be great if there were only 21 reasons why open() could fail? The entire errno enumeration consists of 34 entries (ENOENT, EINVAL, etc).

Microsoft has done a better job: the "Win32 Error Codes" page documents several thousand. Error codes are 32-bit values where you'll find some flags, a facility code and an error code. For further information, look at the documentation for GetLastError() and for HRESULT. It's a good try, but in the end it falls short again. As the Operating System grows, and as you start dynamically loading libraries from all sorts of vendors, there is no way you can accommodate all the different error codes and make any sense out of them.

Error Codes Should Not Be Used for Logic Decisions

It is hard enough to design and maintain a stable Application Programming Interface, so your error codes should not be a part of it. Error codes will constantly change, as new error codes are added in order to help pin-point problems more accurately. Therefore, it's best not to document all of your error codes formally, just mention a few of them as examples of what the caller may get back from a failed operation.

The logic on the caller's side should not make decisions based on the returned error codes. All the caller needs to know is whether there was an error or not. Only the human user will eventually benefit from knowing exactly what went wrong. Therefore, your library users should not actually need a header file that exposes a constant for each error code. In other words, do not publish your library's version of errno.h for the end user.
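As a sketch, an API following this advice could look like the hypothetical routine below: the caller gets a plain success/failure indication for its logic, plus an error message that is destined only for the human user and whose exact wording is not part of the API contract:

```cpp
#include <string>

// Hypothetical library routine, for illustration purposes only.
// The caller's logic branches on the bool alone; the message is
// for display and logging, never for programmatic decisions.
bool load_config ( const std::string & filename,
                   std::string * const error_msg )
{
  if ( filename.empty() )
  {
    *error_msg = "The configuration filename is empty.";
    return false;
  }

  // ... the real loading logic would go here ...

  return true;
}
```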

TODO: Still to write further.

Managing Error Codes in Source Code

Whenever a developer writes code for a new error condition, there is a strong urge to re-use some existing error code, instead of adding a new one. This is why generic error codes like EINVAL constantly get abused until they are all but meaningless. If you insist on using error codes in order to report error conditions, you should at least state a policy that discourages error code reuse.

TODO: Still to write.

Providing Rich Error Information

After its first attempt with HRESULT codes (see further above), Microsoft must have realised that error codes weren't cutting it, and provided the IErrorInfo COM interface and the _com_error wrapper class for error-reporting purposes. In addition to a flexible error message, IErrorInfo allows you to set a help filename and a help context ID, a feature which is rarely used. Even if you take care that a help file is always correctly installed next to your code library, the file path will become invalid as soon as you cross process, user or network boundaries.

Look also at C++, .NET, Java, Perl, etc.: they all offer some sort of way of reporting error messages as variable-length character strings. Some of them offer fancy additions too: Apple's Cocoa has fields named localizedRecoverySuggestion, NSHelpAnchorErrorKey, NSURLErrorFailingURLErrorKey and even localizedRecoveryOptions and NSErrorRecoveryAttempting in its NSError class. I wonder how much those fields are actually used in real life. Should you decide to move your code to a shared library, most such options become unusable. After all, in the end the caller may not even be a human that can read a help file or attempt an automatic recovery.

C++ and Java offer subclasses derived from a common error base class in order to differentiate between types of errors, which does not really help in most circumstances.

In .NET you can provide nested exception objects, so you can build a chain of exception objects. As a result, error dialogs often allow you to expand the next level of error information by clicking on the next '+' icon, as if you were expanding subdirectories in a file tree view. How cumbersome is that for a normal user. A couple of standard buttons like "copy error message" and "e-mail error message" would have been more helpful.

For some software environments, it's too late. Consider process exit codes under Unix: their structure (or lack of it) is cast in stone, the exit codes are always custom values. But there's some minimum standard here too: if the exit code is non-zero, you can probably retrieve some sort of error description if you capture the process' stderr output as plain text.

Imagine now that you need to write software that interoperates with other languages in other computers over the network. Your software may get ported to another platform, and parts of it may get moved to a shared library. In the end, the only practical way to pass error information back and forth is to reduce it to a single, possibly lengthy error message string. Therefore, you should focus on generating good error strings in the first place. As soon as you get an error code from some system or library call, turn it into a string and forget the original error code. And don't worry too much about all the fancy additions listed above.

Internationalisation

Selecting the User Language

Error messages should be in a language the user is comfortable with, be it English, German, Spanish or whatever language they happen to speak. Sometimes it would be helpful to get an error message in several languages at the same time, but this normally isn't worth the development effort.

Your code should not guess the user's language based on the current operating system and the like at the point where an error happens. Instead, the language should be determined upfront. If you are writing a library, you could pass some language ID to the library's init() routine. Keep in mind that your library may end up in some multi-user server, so it's best to negotiate the language early as part of the connection handshake.

Providing Translations

TODO: Still to write. TODO: Sending a format string for internationalisation purposes is normally the wrong thing to do.

Using Positional Parameters

TODO: Still to write.

Helper Routines

Writing robust code in the face of errors is a chore that can be alleviated with the right set of helper routines. In my experience, you don't need too many of them, most of them are trivial wrappers, and, once written, they tend to remain stable. What matters most is the convention and the associated development policy to use them whenever appropriate.

Throw Utility

Instead of throwing an error object directly whenever an error is detected, you may want to use a LogAndThrow( errObj ) routine. This way, you could easily turn on a compilation option in order to dump the call stack at the point of error to a log file, in case a developer needs such detailed information to track down a nasty bug. A shallower level of logging could still be useful for statistical purposes.
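A minimal sketch of such a routine could look like this; the LOG_ERRORS_AT_THROW_POINT compilation option is an assumption for illustration purposes:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: every throw site goes through this routine, so a
// single compilation option can add logging, or a call stack dump,
// without touching the throw sites themselves.
void LogAndThrow ( const std::string & errMsg )
{
#ifdef LOG_ERRORS_AT_THROW_POINT  // Assumed build option, for illustration.
  std::cerr << "About to throw: " << errMsg << std::endl;
  // A call stack dump could be written to the log file here too.
#endif

  throw std::runtime_error( errMsg );
}
```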

Global Panic Routines

See the section about abrupt termination above for more information.

Wrappers for printf() and the like

  1. void printf_e ( const char * format_str, ... );

See the section about printf() above for more information.
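The wrapper could be sketched as follows, assuming "throw on error" semantics; collecting the errno details for the error message, as discussed elsewhere in this article, is left out for brevity:

```cpp
#include <cstdarg>
#include <cstdio>
#include <stdexcept>

// Sketch of the printf_e() wrapper named above: it forwards to vprintf()
// and throws if the write fails, so callers cannot silently ignore
// I/O errors on stdout.
void printf_e ( const char * format_str, ... )
{
  va_list arg_list;
  va_start( arg_list, format_str );
  const int res = vprintf( format_str, arg_list );
  va_end( arg_list );

  if ( res < 0 )
    throw std::runtime_error( "Error writing to standard output." );
}
```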

String-handling helpers to build error messages

I recommend using the standard std::string class. Routines like this would be helpful:

  1. std::string format_msg ( const char * format_str, ... );

TODO: Still to write.

Utility functions that collect OS error messages

TODO: errno, GetLastError(), ...
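As a sketch of the errno case, such a utility only needs to capture errno promptly and convert it to text; a Windows counterpart would use GetLastError() and FormatMessage() instead. The routine name is an assumption. Note that std::strerror() is not guaranteed to be thread safe; strerror_r() would be the robust choice on Linux:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Turn the current errno value into a human-readable message and
// forget the numeric code from then on.
std::string collect_errno_msg ( const std::string & prefix )
{
  const int saved_errno = errno;  // Save it before any other call changes it.
  return prefix + ": " + std::strerror( saved_errno );
}
```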

Wrappers for the most common OS syscalls

TODO: open(), close(), fopen_e(). In the case of close(), you will probably want a close_p() for "close with panic" semantics.
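For example, a wrapper for open() with "throw on error" semantics could be sketched like this; the name open_e() and the message format are assumptions for illustration:

```cpp
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Sketch of an open() wrapper that throws on failure. The error message
// carries both the filename and the errno description, which the raw
// error code alone cannot provide.
int open_e ( const std::string & filename, const int flags )
{
  const int fd = ::open( filename.c_str(), flags );

  if ( fd == -1 )
    throw std::runtime_error( "Cannot open file \"" + filename + "\": "
                              + std::strerror( errno ) );
  return fd;
}
```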

Smart pointers for memory, file descriptors and so on

Classes like auto_ptr_free for memory pointers obtained through malloc(), or auto_file_close for Linux file descriptors.

TODO: Still to write.
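A minimal sketch of the auto_file_close class mentioned above could look like this (written in pre-C++11 style, to match the auto_ptr_free naming):

```cpp
#include <unistd.h>

// RAII wrapper that closes a Linux file descriptor when it goes
// out of scope.
class auto_file_close
{
public:
  explicit auto_file_close ( const int fd )
    : m_fd( fd )
  {
  }

  ~auto_file_close ( void )
  {
    if ( m_fd != -1 )
      ::close( m_fd );  // A close_p() "close with panic" variant could check the result here.
  }

  int get ( void ) const { return m_fd; }

private:
  int m_fd;

  // Copying is not allowed, as two copies would close the descriptor twice.
  auto_file_close ( const auto_file_close & );
  auto_file_close & operator= ( const auto_file_close & );
};
```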

Assertions Are No Substitute for Proper Error Handling

Assertions are designed to help developers quickly find bugs, but they are not actually a form of error handling, as assertions are normally left out of release builds. Even if they were not, they tend to generate error messages that a normal user would not understand.

Therefore, you need to resist the temptation of asserting on syscall return codes and the like.

However, an assertion is still better than no error check at all. If you have no time to write proper error checking code, at least add a "to do" comment to the source code. That will remind other developers (or even yourself in the distant future) that this aspect needs to be considered. For example, under Win32:

if ( FALSE == GetCursorPos( &cursorPos ) )
  assert( false );  // FIXME: Write proper error handling here.

I have grown fond of Microsoft's VERIFY macro for this kind of check.

VERIFY( FALSE != GetCursorPos( &cursorPos ) );  // FIXME: Write proper error handling here.

The rest of the article has not been written yet