Benutzer:Rdiez/ErrorHandling

From /dev/tal
Revision as of 15 August 2014, 19:54, by Rdiez (Link to lwn.net/Articles/576478/)

These are the personal user pages of rdiez, please do not modify them! The only exceptions are simple language corrections such as typos, wrong prepositions or the like. Please report anything else to the user instead!



Error Handling in General and C++ Exceptions in Particular

Introduction

Motivation

The software industry does not seem to take software quality seriously, and a good part of it falls into the error-handling category. After putting up for years with so much misinformation, so many half-truths and with a general sentiment of apathy on the subject, I finally decided to write a lengthy article about error handling in general and C++ exceptions in particular.

I am not a professional technical writer and I cannot afford the time to start a long discussion on the subject, but I still welcome feedback, so feel free to drop me a line if you find any mistakes or would like to see some other aspect covered here.

Scope

This document focuses on the "normal" software development scenarios for user-facing applications or for non-critical embedded systems. There are of course other areas not covered here: there are systems where errors are measured, tolerated, compensated for or even incorporated into the decision process.

Audience

This document is meant for software developers who have already gathered a reasonable amount of programming experience. The main goal is to give practical information and describe effective techniques for day-to-day work.

Although you can probably guess how C++ exceptions work from the source code examples below, it is expected that you already know the basics, especially the concept of stack unwinding upon raising (throwing) an exception. Look into your favourite C++ book for a detailed description of exception semantics and their syntax peculiarities.

Causes of Neglect

Proper error-handling logic is what sets professional developers apart. Writing quality error handlers requires continuous discipline during development, because it is a tedious task that can easily cost more than the normal application logic for the sunny-day scenario that the developer is paid to write. Testing error paths manually with the debugger is recommended practice, but that doesn't make it any less time-consuming. Repeatable test cases that feed the code with invalid data sequences in order to trigger and test each possible error scenario are a rare luxury. This is why error handling in general needs constant encouragement through systematic code reviews or through separate testing personnel. In my experience, a lack of good error handling is also a symptom that the code hasn't been properly developed and tested. A quick look at the error handlers in the source code can give you a pretty reliable measurement of the general code quality.

The fact that most example code snippets in software documentation do not bother checking for error conditions, let alone handling them gracefully, does not help either. This gives the impression that the official developers do not take error handling seriously, like everybody else, so you don't need to either. Sometimes you'll find the excuse of keeping those examples free from "clutter". However, when using a new syscall or library call, working out how to check for errors and how to collect the corresponding error information can take longer than coding its normal usage scenario, so this aspect would actually be most helpful in the usage example. There is some hope, however, as I have noticed that some API usage examples in Microsoft's documentation now include error checks with "handle the error here" comments below. While it is still not enough, it is better than nothing.

It is hard to assess how much value robust error handling brings to the end product, and therefore any extra development costs in this field are hard to justify. Free-software developers are often investing their own spare time and frequently take shortcuts in this area. Software contracts are usually drafted in positive terms describing what the software should do, and robustness in the face of errors then gets relegated to some implied general quality standards that are not properly described or quantified. Furthermore, when a customer tests a software product for acceptance, he is primarily worried about fulfilling the contractual obligations in the "normal", non-error case, and that tends to be hard enough. That the software is brittle either goes unnoticed or is not properly rated in the software bug list.

As a result, small software errors often cascade into great disasters, because all the error paths in between fail one after the other across all the different software layers and communicating devices, as error handlers hardly ever got any attention. But even in this scenario, the common excuse sounds like "yes, but my part wouldn't have failed if the previous one hadn't in the first place".

In addition to all of the above, when the error-handling logic does fail, or when it does not yield helpful information for troubleshooting purposes, it tends to impact first and foremost the users' budget, and not the developer's, and that normally happens after the delivery and payment dates. Even if the error does come back to the original developer, it may find its way through a separate support department, which may even be able to provide a work-around and further justify the business case for that same support department. If nothing else helps, the developer's urgent help is then suddenly required for a real-world, important business problem, which may help make that original developer a well-regarded, irreplaceable person. After all, only the original person understands the code well enough to figure out what went wrong, and any newcomers will shy away from making any changes to a brittle codebase. This scenario can also hold true in open-source communities, where social credit from quickly fixing bugs may be more relevant than introducing those bugs in the first place. All these factors conspire to make poor error handling an attractive business strategy.

In the end, error handling gets mostly neglected, and that is reflected in our day-to-day experience with computer software. I have seen plenty of jokes about unhelpful or funny error messages. Many security issues have their roots in incorrect error detection or handling, and such issues are still getting patched on a weekly basis for operating system releases that have been considered stable for years.

Looking for a Balanced Strategy

Definition of Error

In the context of this document, an error is an indication that an operation failed to execute. The reason why it failed is normally included, in the form of an error code, error message, source code position, etc. An error is normally considered to be fatal, which means that retrying the failed operation straight away would only lead to the same error again.

When writing code, the general assumption is that everything should work fine all of the time, so errors are exceptions to the rule. In fact, most code a computer runs executes successfully.

When an error happens, the software should deal with it. Here are some possibilities:

  1. Retry the failed operation a few times before giving up.
  2. Tolerate the error. Maybe look for an alternative action.
  3. Automatically correct the error.
  4. Report the error and let the next level up decide what to do.
    Errors are often forwarded from software layer to software layer until they reach the human operator.

Non-errors

This document does not cover the subjects of error correction or error tolerance. However, it is worth noting that, if there is a way to tolerate or correct an error, or if there is an alternative action, chances are that the first failure was not completely unexpected. If a particular error condition is expected to occur often and has specific code to deal with it, it should probably not be regarded as a fatal error, but as a normal scenario that is handled in the standard application logic. Normal scenarios should not raise C++ exceptions or use the error-handling support routines.

Consider an application that looks for its configuration files in several places. The bash shell, for example, reads and executes commands from /etc/bash.bashrc and ~/.bashrc on start-up, if these files exist. If the application tries to open the first configuration file, and it does not exist, it should not regard it as an error condition and raise a standard error exception. Such an exception would have to be caught in the standard error handler, which would have to implement a filter in order to ignore that particular error for that particular file. If you think about it, it is documented that those files may not exist, and that is not really an error.

Instead, the application should check beforehand if the first configuration file exists with the stat syscall. Alternatively it should check if the open syscall returns error code ENOENT. That may mean calling open directly, or an alternative wrapper function open_if_exists_e(), for the first configuration file, instead of using the usual open_e() wrapper (see further below about writing such helper routines). The open_e() wrapper would raise a standard error if the file does not exist, and should be used only in situations where a file is expected to exist, and if it does not, then it's a real, unexpected, fatal error condition.
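As a rough illustration of the two wrappers, here is a minimal sketch that assumes a POSIX system. The signatures, the bool/out-parameter convention and the message layout are assumptions made for this example only, not a fixed interface.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

#include <fcntl.h>

// Builds the exception for a failed open(). The message layout is an assumption.
static std::runtime_error make_open_error(const char* path, int err)
{
    return std::runtime_error(std::string("Error opening file \"") + path + "\": " +
                              std::strerror(err));
}

// For files that are expected to exist: any failure is a real error.
int open_e(const char* path, int flags)
{
    assert(path != nullptr);
    const int fd = ::open(path, flags);
    if (fd == -1)
        throw make_open_error(path, errno);
    return fd;
}

// For optional files: a missing file is a normal outcome and yields 'false'.
// Every other failure (for example, permission denied) is still raised as an error.
bool open_if_exists_e(const char* path, int flags, int* fd_out)
{
    assert(path != nullptr && fd_out != nullptr);
    const int fd = ::open(path, flags);
    if (fd != -1)
    {
        *fd_out = fd;
        return true;
    }
    if (errno == ENOENT)
        return false;
    throw make_open_error(path, errno);
}
```

With such a pair, the start-up code would call open_if_exists_e() for each optional configuration file and simply skip the ones that are missing, so no error filter is needed in the standard error handler.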

Goals

Robust error handling is costly but it is an important aspect of software development. Choosing a good strategy from the beginning reduces costs in the long run. These are the main goals:

  1. Provide helpful error messages.
  2. Deliver the error messages timely and to the right person.
    The developer may want more information than the user.
  3. Limit the fallout after an error condition.
    Only the operation that failed should be affected, the rest should continue to run.
  4. Reduce the development costs of:
    • adding error checks to the source code.
    • repurposing existing code.

Non-goals are:

  1. Improve software fault tolerance.
    Normally, when an error occurs, the operation that caused it is considered to have failed. This document does not deal with error tolerance at all.
  2. Optimise error-handling performance.
    In normal scenarios, only the successful (non-error) paths need to be fast. This may not hold true on critical, real-time systems, where the error response time needs to meet certain constraints.
  3. Optimise memory consumption.
    Good error messages and proper error handling come at a cost, but the investment almost always pays off.

Implicit Transaction Semantics

Whenever a routine reports an error, the underlying assumption is that the whole operation failed, and not just part of it, because recovering from a partial failure is very difficult. Transaction semantics simplify error handling considerably and are usually expected. This means that, when a routine fails, it should automatically "clean up" before reporting the error. That is, the routine must roll back any steps it had performed towards its goal. The idea is that, once the error cause is eliminated, calling the routine again should always succeed.

Now consider a scenario where routine PrintFile() opens a given file and then fails to print its contents because the printer happens to be off-line. The caller expects that the opened filehandle has been automatically closed upon failure. Otherwise, when the printer comes back online, the next call to PrintFile() could fail, because the previous unclosed filehandle may hold an exclusive file lock. The automatic clean-up assumption is the only feasible way to write code, for the caller cannot close the file handle itself in the case of error, as it may not even know that a file handle was involved at all.
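In C++, the usual way to get this automatic clean-up is a small guard object whose destructor runs during stack unwinding. The following is only a sketch: FileCloser and SendToPrinter() are hypothetical names, and the printer failure is simulated.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Closes the wrapped FILE* when the scope is left, whether normally or
// because an exception is propagating. A failure to close is ignored
// here for brevity.
class FileCloser
{
public:
    explicit FileCloser(std::FILE* f) : m_file(f) { assert(f != nullptr); }
    ~FileCloser() { std::fclose(m_file); }
    FileCloser(const FileCloser&) = delete;
    FileCloser& operator=(const FileCloser&) = delete;
    std::FILE* get() const { return m_file; }

private:
    std::FILE* m_file;
};

// Hypothetical stand-in for the real printing code. Here it always fails.
void SendToPrinter(std::FILE*)
{
    throw std::runtime_error("The printer is off-line.");
}

void PrintFile(const std::string& filename)
{
    std::FILE* f = std::fopen(filename.c_str(), "rb");
    if (f == nullptr)
        throw std::runtime_error("Error opening file \"" + filename + "\".");

    FileCloser closer(f);         // From here on, the file is closed on every exit path.
    SendToPrinter(closer.get());  // May throw. The caller never sees the file handle.
}
```

The caller only has to catch the exception. It does not need to know that a file handle was involved at all.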

Another example would be a failed bank money transfer: if the source account has been charged, and the destination account fails to receive the money, you need to undo the charge on the source account before reporting the money transfer as failed.

Errors When Handling a Previous Error

An error handler may have several tasks to perform, such as cleaning up resources, rolling any half-finished actions back, writing an entry to the application log or adding further information to the original error. If one of those operations fails, you will have the unpleasant situation of dealing with a secondary error. It is always hard to deal with errors inside error handlers for the following reasons:

  1. Error handlers can usually deal with a single error. The secondary error may get lost, or it may mask the original error.
  2. An error during the clean-up phase may yield a memory or resource leak.
  3. A failed roll-back phase may break the "complete success or complete failure" rule and leave a partially-completed operation behind.
  4. There is a never-ending recursion here: if you write code in the first-level error handler in order to deal with a secondary error, you may encounter yet another error there too. That would be a tertiary error then. Now, if you try to deal with a tertiary error in the second-level error handler, then...

The following rules help keep the cost of writing error handlers under control:

  1. Write code with an eventual roll-back in mind.
  2. Minimize room for failure in the error handlers.
  3. Assume that error handlers have no bugs. If you detect an error within an error handler, terminate the application abruptly. This means that testing error handlers becomes critical. Fortunately, bugs inside error handlers tend to be rare.

For example, say routine ModifyFiles() needs to modify 2 files named A and B. You could do this:

  1. Open file A.
  2. Modify file A.
  3. Close file A.
  4. Open file B.
  5. Modify file B.
  6. Close file B.

The trouble is, if an error happens opening file B, it's hard to roll back any changes in file A. Keep in mind that opening a file is the operation most likely to fail.

You would be less exposed to errors if you implemented the following sequence instead:

  1. Open file A.
  2. Open file B.
  3. Modify file A.
  4. Modify file B.
  5. Close file A.
  6. Close file B.

The clean-up logic in case of error has been reduced to 2 filehandle closing operations, which are unlikely to fail. If something as straightforward as that still fails, look at section "Abrupt Termination" below for reasons why such a drastic action may be the best option.

This approach is even better:

  1. Open file A.
  2. Open file B.
  3. Create file A2.
  4. Create file B2.
  5. Copy file A contents to file A2.
  6. Modify file A2.
  7. Copy file B contents to file B2.
  8. Modify file B2.
  9. Close file A.
  10. Close file A2.
  11. Close file B.
  12. Close file B2.
  13. Checkpoint, see below for more information.
  14. Delete file A.
  15. Delete file B.
  16. Rename file A2 to file A.
  17. Rename file B2 to file B.

If anything fails before the checkpoint, the error handler only has to close file descriptors, which is unlikely to fail. After the checkpoint, there is no easy way to recover from errors. However, if the files already exist, deleting and renaming them is also likely to succeed.
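The copy-then-rename sequence could look roughly like the sketch below. It is an illustration under several assumptions: the ".new" suffix for A2 and B2 is hypothetical, the "modification" is a placeholder that converts the text to upper case, and the clean-up before the checkpoint also deletes the temporary copies.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

static std::runtime_error FileError(const char* action, const std::string& name)
{
    return std::runtime_error(std::string("Error ") + action + " file \"" + name + "\".");
}

// Placeholder modification: copies the contents converted to upper case.
static void CopyModified(std::ifstream& in, std::ofstream& out, const std::string& outName)
{
    char c;
    while (in.get(c))
        out.put(static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    out.flush();
    if (in.bad() || !out)
        throw FileError("writing", outName);
}

void ModifyFiles(const std::string& a, const std::string& b)
{
    assert(a != b);
    const std::string a2 = a + ".new";  // Hypothetical names for A2 and B2.
    const std::string b2 = b + ".new";

    try
    {
        // Open and create everything first, as these steps are the most likely to fail.
        std::ifstream inA(a, std::ios::binary);
        if (!inA) throw FileError("opening", a);
        std::ifstream inB(b, std::ios::binary);
        if (!inB) throw FileError("opening", b);
        std::ofstream outA(a2, std::ios::binary | std::ios::trunc);
        if (!outA) throw FileError("creating", a2);
        std::ofstream outB(b2, std::ios::binary | std::ios::trunc);
        if (!outB) throw FileError("creating", b2);

        CopyModified(inA, outA, a2);
        CopyModified(inB, outB, b2);
    }
    catch (...)
    {
        // Before the checkpoint: all streams are already closed at this point.
        // Discard the temporary copies. Files A and B are still untouched.
        std::remove(a2.c_str());
        std::remove(b2.c_str());
        throw;
    }

    // Checkpoint. From here on, there is no easy way to roll back.
    if (std::remove(a.c_str()) != 0) throw FileError("deleting", a);
    if (std::remove(b.c_str()) != 0) throw FileError("deleting", b);
    if (std::rename(a2.c_str(), a.c_str()) != 0) throw FileError("renaming", a2);
    if (std::rename(b2.c_str(), b.c_str()) != 0) throw FileError("renaming", b2);
}
```

Whether an error after the checkpoint should be reported like this or should terminate the application is a separate design decision.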

The scenario above occurs so often that even mainstream Operating Systems are starting to implement transactional support in their filesystems. There is even transactional memory support on some platforms, although it is not normally designed for error-handling purposes. If your filesystem supports transactions, and assuming that modifying the files takes a long time and consumes many system resources, you could protect the whole operation with minimal effort like this:

  1. Do everything above up until the checkpoint.
  2. Begin filesystem transaction.
  3. Delete file A.
  4. Delete file B.
  5. Rename file A2 to file A.
  6. Rename file B2 to file B.
  7. End filesystem transaction.

If an error occurs within the transaction, it is unlikely that the transaction rollback fails. After all, it is a service provided by the Operating System specifically designed for that purpose. Should the rollback nevertheless fail, these are your options:

  1. Ignore the rollback error and report the original error.
    You may leave inconsistent data behind, see below.
  2. Report the secondary error.
    Note that raising a normal error from inside clean-up code may give the caller the wrong impression about what really happened, as the context information for the first error will not be present in the error message.
  3. Report both errors together. But merging the errors could also fail, generating a tertiary error. And so on.
  4. Terminate the application abruptly.

Reporting any of the errors and then carrying on is risky, as you may leave a half-finished operation behind. Think of a money transfer where the money is neither in the source nor in the destination account. You could try to add to the error message an indication that the data may be corrupt, but that hint is only useful to a human operator. Consider adding to our ModifyFiles() routine a boolean flag to indicate that the transaction has only partially succeeded. What should an automated caller do in that case? Let's say that a human operator gets the message with that special hint: he will probably ignore it if in a hurry. Assume for a moment that the human operator wants to do something about it. He will probably not be able to fix the data inconsistency with normal means anyway. Shortly afterwards, further transactions could keep coming, maybe through other human operators, and those transactions could now build upon the inconsistent data. Fixing the mess afterwards may be really hard indeed.

In such a situation, abruptly terminating the application may actually be the best option. At the very least, the operator will wonder whether the transaction succeeded or not, and will probably check afterwards. But more often than not, such a catastrophic crash will prompt the intervention of a system administrator, or trigger some higher-level backup file recovery mechanism that can deal with the data consistency problem more effectively.

Compromises

Writing good error-handling logic can be costly, and sometimes compromises must be made:

Unpleasant Error Messages

In order to keep development costs under control, the techniques described below may tend to generate error messages that are too long or unpleasant to read. However, such drawbacks are easily outweighed by the disadvantages of delivering too little error information. After all, errors should be the exception rather than the rule, so users should not need to read too many error messages during normal operation.

Abrupt Termination

Sometimes, it may be more desirable to let an application panic on a severe error than to try and cope with the error condition or ignore it altogether.

Some error conditions may indicate that memory is corrupt or that a data structure has invalid information that hasn't been detected soon enough. If the application carries on, its behaviour may well be undefined (it may act randomly), which may be even more undesirable than an instant crash.

Leaving a memory, handle or resource leak behind is not an option either, because the application will crash or fail later on for a seemingly random reason. The user will probably not be able to provide an accurate error report, and the error will not be easy to reproduce either. The real cause will be very hard to discover and the user will quickly lose confidence in the general application stability.

Some errors are just too expensive or virtually impossible to handle, especially when they occur in clean-up sections. An example could be a failed close( file_descriptor ); syscall in a clean-up section, which should never fail, and when it does, there is not much the error handler can do about it. In most cases, a file descriptor is closed after the work has been done. If the descriptor fails to close, the code probably attempted to close the wrong one, leaving a handle leak behind. Or the descriptor was already closed, in which case it's probably a very simple logic error that will manifest itself early and is easy to fix. For more details about close()'s possible error codes, check out LWN article Returning EINTR from close().
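A clean-up helper along these lines could be as small as the following sketch, which assumes a POSIX system. The names close_or_panic() and can_open() are hypothetical, and every close() failure, including EINTR, is treated as a panic here, which may be too strict on some platforms.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper for clean-up sections: a failed close() is treated
// as an irrecoverable logic error and terminates the application at once.
void close_or_panic(int fd)
{
    assert(fd >= 0);  // Debug builds catch obviously invalid descriptors earlier.
    if (::close(fd) != 0)
    {
        std::fprintf(stderr, "Panic: close(%d) failed: %s\n", fd, std::strerror(errno));
        std::abort();
    }
}

// Usage example: the descriptor is released in the clean-up path
// without any further error-handling code.
bool can_open(const char* path)
{
    const int fd = ::open(path, O_RDONLY);
    if (fd == -1)
        return false;
    close_or_panic(fd);
    return true;
}
```

Calling std::abort() leaves the crash reporting to the Operating System, which can then generate a crash dump if that feature is enabled.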

See section "Errors When Handling a Previous Error" above for other error conditions that could break the transaction semantics (the 'fully succeeded' or 'fully failed' rule). Leaving corrupt or inconsistent data behind is probably worse than an instant crash too. At some point during development of clean-up and error-handling code, you'll have to draw the line and treat some errors as irrecoverable panics. Otherwise, the code will get too complicated to maintain economically.

Abrupt termination is always unpleasant, but a controlled crash at least lets the user know what went wrong. Although it may sound counterintuitive, such an immediate crash will probably help improve the software quality in the long run, as there will be an incentive to fix the error quickly together with a helpful panic report.

After all, if you are worried about adding "artificial" panic points to your source code, keep in mind that you will not be able to completely rule out abrupt termination anyway. Just touching a NULL pointer, freeing the same memory block twice, calling some OS syscall with the wrong memory address or using too much stack space at the wrong place may terminate your application at once.

Besides, a complete crash will trigger any emergency mechanism installed, like reverting to the last consistent data backup, automatically restarting the failed service/daemon and timely alerting human system administrators. Such a recovery course may be better than any unpredictable behaviour down the line due to a previous error that was handled incorrectly.

Do Not Install Your Own Critical Error Handler

Some people are tempted to write clever unexpected error handlers to help deal with panics, or even avoid them completely. However, it is usually better to focus on the emergency recovery procedures after the crash rather than installing your own crash handler in an attempt at capturing more error information or surviving the unknown error condition.

Your Operating System will probably do a better job at collecting crash information; you may just need to enable its crash reporting features. You may have to provide some end-user documentation on how to set up WinDbg on Windows in order to automatically store a local crash dump file if your application crashes. Some Linux distributions do not collect crash dump files by default, so you may have to find out how to enable it on the typical user's PC. Other than enabling such OS features, you don't want to interfere with the system's crash dump handler, because, if your application's memory or resource handles are already corrupt or invalid, trying to run your own crash handler may make matters worse and corrupt the crash information or even mask the crash reason altogether.

Getting an in-application crash handler right is hard if not downright impossible, and I've seen quite a few of them failing themselves after the first application failure they were supposed to report. If you have time to spare on fatal error scenarios, try to minimise their consequences by designing the software so that it can crash at any time without serious consequences. For example, you could save the user data at regular intervals before the crash, like some text editors or word processors do. In fact, this approach is gaining popularity with the advent of smartphone apps and Windows 8 Metro-style applications, where the Operating System may suddenly yank an app from memory without warning.

For other kinds of software, you can also consider configuring the system to automatically restart any important service when it crashes, or to trigger some automated data recovery mechanism. Finally, you may also direct your remaining efforts at improving your software quality process instead.

Do Not Translate Windows SEH Exceptions into C++ Exceptions

Under Windows, you may be tempted to translate Structured Exception Handling (SEH) exceptions into C++ exceptions. In fact, I think that some version of Microsoft's Visual Studio used to do that automatically, unless you manually turned it off on start-up. See compiler switch /EHa for more information.

You should resist the temptation. If your application raises an SEH exception, chances are that there is something seriously wrong and aborting the application may be your best option. See section "Abrupt Termination" above for more information.

The only case where you could translate the SEH exception half-way safely is when your application "cleanly" references a NULL pointer. After all, most code uses the special NULL value to indicate that the information a pointer refers to is not available yet.

But even in this scenario, if an SEH exception triggers, you cannot be sure that this was a "clean" NULL pointer access with a pointer where that special meaning of NULL makes sense. Instead, your application may have just read a piece of zeroed memory, or jumped to a random piece of code. Furthermore, this kind of translation mechanism is not portable to other platforms.

Handling an Out-of-Memory Situation

Handling an out-of-memory situation is difficult and most software does not bother even trying. In my experience, even mainstream Operating Systems tend to crash completely when the system is short of memory. If not, at the very least important system services stop functioning properly at that point. After all, most system services are now implemented as normal user-space processes.

Fortunately, out-of-memory scenarios are rare nowadays. Some systems take a cavalier approach and start killing user applications in order to release memory; look for the Linux OOM Killer for more information. And, due to the wonders of virtual memory, your system will probably thrash itself to death before your application ever gets a NULL pointer back from malloc.

If you are writing embedded software, you must prevent excessive memory usage by design. This can be hard, as it is often difficult to estimate how much heap memory a given data structure will consume. If an application needs to process random external events and uses several complex data structures, you may have to resort to empirical evidence. In debug builds, you should run a periodic assertion on the amount of free memory left, so that you get an early warning when you approach the limit during development.

In any case, when your application encounters an out-of-memory situation, there is not much it can do about it. Attempting to handle such an error will probably fail as well, as handling an error usually needs memory too.

In order to alleviate the problem, your application could stop accepting new requests or processing new events if the available memory falls under some limit. The trouble is, the amount of available memory can change drastically without notice, so memory may suddenly run out just after the free-memory check has passed. For many embedded applications, not accepting requests any more is far worse than a controlled crash. Keep in mind that most out-of-memory conditions are caused by a memory or resource leak in your own software. If your application stopped accepting requests when low on memory, you would be effectively turning a memory leak into an application freeze or denial-of-service situation. A controlled crash-and-restart would clean the memory leak and allow the application to function again, if only until the next restart.

A common "solution" is to write a malloc wrapper that terminates the application abruptly if there is no memory left. This way, the user will get a notification at the first point of failure. Otherwise, if the wrapper throws a standard error, the error handler will probably fail again for the same reason anyway, and the application might hang or misbehave instead of terminating "properly". The malloc wrapper could also monitor the amount of free memory left in debug builds, in order to provide the early warning described above.
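Such a wrapper takes only a few lines. The sketch below uses the hypothetical name malloc_or_panic() and leaves out the free-memory monitoring for debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical malloc wrapper: callers never see a NULL pointer, because
// the application terminates at the first point of failure instead.
void* malloc_or_panic(std::size_t size)
{
    // malloc(0) may legitimately return NULL, so zero-sized requests are
    // ruled out here in order to keep the failure check unambiguous.
    assert(size > 0);

    void* p = std::malloc(size);
    if (p == nullptr)
    {
        // A fixed string literal is used so that reporting the problem
        // does not need to format a message first.
        std::fputs("Panic: out of memory.\n", stderr);
        std::abort();
    }
    return p;
}
```

The returned block is released with std::free() as usual.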

You could also write your error-handling helper routines in such a robust manner that they still work in low-memory situations. For example, you can pre-allocate a fixed memory buffer per thread in order to hold at least a reasonably-long error message, should a normal malloc() call fail. But then you will not be able to use the standard std::runtime_error or std::string classes. And your compiler and run-time libraries will conspire against you, for even throwing an empty C++ exception with GCC will allocate a piece of dynamic memory behind the scenes. And the first pre-allocation per thread may fail. And the code will become more complex than you might imagine. And so on.
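To give an idea of the pre-allocated buffer approach, here is a minimal sketch. The buffer size and the function name format_error_message() are assumptions. Whether thread-local storage and vsnprintf() really avoid dynamic memory depends on the platform and C library, which is part of the complexity mentioned above.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <cstring>

// One fixed-size buffer per thread, so that building an error message
// does not call malloc(). Longer messages get truncated.
static thread_local char t_errorMessage[512];

// Hypothetical helper: formats an error message into the calling thread's
// buffer and returns a pointer to it. The next call on the same thread
// overwrites the previous message.
const char* format_error_message(const char* format, ...)
{
    assert(format != nullptr);

    std::va_list args;
    va_start(args, format);
    const int n = std::vsnprintf(t_errorMessage, sizeof(t_errorMessage), format, args);
    va_end(args);

    if (n < 0)
        std::strcpy(t_errorMessage, "Error message formatting failed.");

    return t_errorMessage;
}
```

A call like format_error_message("Error 0x%04X: %s", code, text) then yields a message that could be attached to a custom exception class that stores only a pointer.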

Therefore, if your application allocates big memory buffers, you should probably handle the eventual out-of-memory error when creating them. For all other small memory allocations, it is probably not worth writing special code in order to deal with it. It may even be desirable to crash straight away.

How to Generate Helpful Error Messages

Let's say you press the 'print' button on your accounting application and the printing fails. Here are some example error messages, ordered by message quality:

  1. Kernel panic / blue screen / access violation.
  2. Nothing gets printed, and there is no error message.
  3. There was an error.
  4. Error 0x03A5.
  5. Error 0x03A5: Write access denied.
  6. Error opening file: Error 0x03A5: Write access denied.
  7. Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  8. Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Error 0x03A5: Write access denied.
  9. I cannot start printing the letters because write access to file "invoice arrears.txt" was denied.
  10. Before trying to print those letters, please remove the write-protection tab from the SD Card.
    In order to do that, remove the little memory card you just inserted and flip over the tiny white plastic switch on its left side.
  11. You don't need to print those letters. Those customers are not going to pay. Get over it.

Let's evaluate each of the error messages above:

  1. Worst-case scenario.
  2. Awful. Have you ever waited to no avail for a page to come out of a printer?
    When printing, there usually is no success indication either, so the user will wonder and probably try again after a few seconds. If the operation did not actually fail, but the printer just happens to be a little slow, he will end up with 2 or more printed copies. It happens to me all the time, and we live in 2013 now.
    If the printing did fail, where should the user find the error cause? He could try and find the printer's spooler queue application. Or he could try with 'strace'. Or look in the system log file. Or maybe the CUPS printing service maintains a separate log file somewhere?
  3. Negligent development.
  4. Unprofessional development.
  5. You show some hope as a programmer.
  6. You are getting the idea.
  7. You are implementing the idea properly.
  8. This is the most that you can achieve in practice.
    The error message has been generated by a computer, and it shows: it is too long, clunky and sounds artificial. But the error message is still helpful, and it contains enough information for the user to try to understand what went wrong, and for the developer to quickly pin-point the issue. It's a workable compromise.
  9. Unrealistic. This text implies that the error message generation was deferred to a point where both knowledge was available about the high-level operation that was being carried out (printing letters) and about the particular low-level operation that failed (opening a file). This kind of error-handling logic would be too hard to implement in real life.
    Alternatively, the software could check the most common error scenarios upfront, before attempting to print the letters. However, that strategy does not scale, and it's not worth implementing if the standard error-handling is properly written. Consider checking beforehand if there is any paper left in the printer. If the user happens to have a printer where the paper level reporting does not work properly, the upfront check would not let him print, even if it would actually work. Implementing an "ignore this advance warning" would fix it, but you don't want the user to dismiss that warning every time. Should you also implement a "don't show this warning again today" button for each possible advance warning?
  10. In your dreams. But there is an aspect of this message that the Operating System could have provided in the messages above: instead of saying "write access denied", it could have said "write access denied because the storage medium is write protected". Or, better still, "cannot modify the file because the memory card is physically write protected". That is doable, because it's a common error and the OS could internally find out the reason for the write protection and provide a textual description of the write-protected media type. But Linux could never build such error messages with its errno-style error reporting.
    Providing a hint about fixing the problem is not as unrealistic as it might appear at first. After all, Apple's NSError class in the Cocoa framework has fields like localizedRecoverySuggestion, localizedRecoveryOptions and even NSErrorRecoveryAttempting. I do think that such a fine-grained implementation is overkill and hard to implement in practice across operating systems and libraries, but providing a helpful recovery hint in the error message could be achievable.
  11. Your computer has become self-aware. You may stop worrying now about error handling in your source code.

Therefore, the best achievable error message in practice, assuming that the Operating System developers have read this guide too, would be:

Error printing letters for invoice arrears: Error opening file "invoice arrears.txt": Cannot open the file with write access. Try switching the write protection tab over on the memory card.

Note that I left the error code out, as it does not really help. More on that further below.

The end-user will read such a long error message left-to-right, and may only understand it up to a point, but that could be enough to make the problem out and maybe to work around it. If there is a useful hint at the end, hopefully the user will also read it. Should the user decide to send the error message to the software developer, there will be enough detail towards the right to help locate the exact issue down some obscure library call.

Such an error message gets built from right to left. When the 'open' syscall fails, the OS delivers the error code (0x03A5) and the low-level error description "Cannot open the file with write access". The OS may add the suffix "Try switching the write protection tab over on the memory card" or an alternative like "The file system was mounted in read-only mode" after checking whether the card actually has such a switch that is currently switched on. A single string is built out of these components and gets returned to the level above in the call stack. Instead of a normal 'return' statement, you would raise a C++ exception with 'throw' (or fill in some error information object passed from above). At every relevant stage on the way up while unwinding the call stack (at every 'catch' point), the error string gains a new prefix (like "Error opening file "invoice arrears.txt": "), and the exception gets passed further up (gets 'rethrown'). At the top level (the last 'catch'), the final error message is presented to the user.
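The right-to-left construction described above can be sketched as follows. This is only an illustration under stated assumptions: the routine names open_for_writing(), write_letters_file(), print_letters() and run_print_command() are hypothetical, and the lowest level simply pretends that the OS delivered the description and the hint.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical lowest level: stands in for the failing 'open' call together
// with the description and hint that the OS would ideally supply.
void open_for_writing ( const std::string & /* filename */ )
{
  throw std::runtime_error( "Cannot open the file with write access. "
                            "Try switching the write protection tab over on the memory card." );
}

// Middle level: adds the filename as a prefix and passes the error further up.
void write_letters_file ( const std::string & filename )
{
  try
  {
    open_for_writing( filename );
  }
  catch ( const std::exception & ex )
  {
    throw std::runtime_error( "Error opening file \"" + filename + "\": " + ex.what() );
  }
}

// High level: adds the name of the user-visible operation.
void print_letters ( void )
{
  try
  {
    write_letters_file( "invoice arrears.txt" );
  }
  catch ( const std::exception & ex )
  {
    throw std::runtime_error( std::string( "Error printing letters for invoice arrears: " ) + ex.what() );
  }
}

// Top level (the last 'catch'): returns the text that would be shown to the user.
std::string run_print_command ( void )
{
  try
  {
    print_letters();
    return "";
  }
  catch ( const std::exception & ex )
  {
    return ex.what();
  }
}
```

Calling run_print_command() yields the complete message quoted earlier, with the most general part on the left and the most specific part on the right.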

The source code will contain a large number of 'throw' statements but only a few 'catch/rethrow' points. There will be very few final 'catch' levels, except for user interface applications, where each button-press event handler will need one. However, all such user interface 'catch' points will look the same: they will probably call some helper routine in order to display a standard modal error message box.
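As a rough sketch of such a shared helper, every event handler could be run through one routine that contains the final 'catch' points. The names run_event_handler(), show_error_box(), on_save_clicked() and on_print_clicked() are made up for this example, and the "message box" merely records the text so that the sketch does not depend on any GUI toolkit.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a modal error message box: it only records the text.
std::string g_last_error_box_text;

void show_error_box ( const std::string & msg )
{
  g_last_error_box_text = msg;
}

// Every event handler is run through this helper, so all final 'catch'
// points share the same code. Returns true if the handler succeeded.
bool run_event_handler ( const std::function< void () > & handler )
{
  try
  {
    handler();
    return true;
  }
  catch ( const std::exception & ex )
  {
    show_error_box( ex.what() );
  }
  catch ( ... )
  {
    show_error_box( "Unexpected C++ exception." );
  }

  return false;
}

// Two hypothetical button-press handlers: one succeeds, one fails.
void on_save_clicked ( void )
{
}

void on_print_clicked ( void )
{
  throw std::runtime_error( "The printer is out of paper." );
}
```

A toolkit callback would then shrink to a single line such as run_event_handler( on_print_clicked ).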

How to Write Error Handlers

Say you have a large program written in C++ with many nested function calls, like this example:

int main ( int argc, char * argv[] )
{
   ...
   b();
   ...
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   // Error check example: we only accept filenames that are at least 10 characters long.

   if ( strlen( filename ) < 10 )
   {
     // What now? Ideally, we should report that the filename should be at least 10 characters long.
   }

   // Yes, you should check the return value of printf(), see further below for more information.

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     // What now?
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     // What now?
   }
   ...
}

Let's try to deal with the errors in routine e() above. It's a real pain, as it distracts us from the real work we need to do. But it has to be done.

Here is a very common approach where all routines return an integer error code, like most Linux system calls do. Note that zero means no error.

int main ( int argc, char * argv[] )
{
   ...
   int error_code = b();
   if ( error_code != 0 )
   {
     fprintf( stderr, "Error %d calling b().", error_code );
     return 1;  // This is equivalent to exit(1);
                // We could also return error_code directly, but you need to check
                // what the exit code limit is on your operating system.
   }
   ...
}

int b ( void )
{
   ...
   int err_code_1 = c("file1.txt");
   if ( err_code_1 != 0 )
   {
     return err_code_1;
   }

   int err_code_2 = c("file2.txt");
   if ( err_code_2 != 0 )
   {
     return err_code_2;
   }
   ...
}

int c ( const char * filename )
{
   ...
   int err_code = d( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int d ( const char * filename )
{
   ...
   int err_code = e( filename );
   if ( err_code != 0 )
     return err_code;
   ...
}

int e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     return some non-zero value, but which one?
            Shall we create our own list of error codes?
            Or should we just pick a random one from errno.h, like EINVAL?
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     return some non-zero value, but which one? Note that printf() sets errno.
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     fprintf( stderr, "Error %d opening file %s", errno, filename );
     return some non-zero value, but which one? Note that fopen() sets errno.
   }
   ...
}

As shown in the example above, the code has become less readable. All function calls are now inside if() statements, and you have to manually check the return values for possible errors. Maintaining the code has become cumbersome.

There is just one place in routine main() where the final error message gets printed, which means that only the original error code makes its way to the top and any other context information gets lost, so it's hard to know what went wrong during which operation. We could call printf() at each point where an error is detected, like we do after the fopen() call, but then we would be calling printf() all over the place. Besides, we may want to return the error message to a caller over the network or display it to the user in a dialog box, so printing errors to the standard output may not be the right thing to do.

The same code written with C++ exceptions looks much more readable:

int main ( int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     // We can decide here whether we want to print the error message to the console, write it to a log file,
     // display it in a dialog box, send it back over the network, or all of those options at the same time.
     fprintf( stderr, "Error calling b(): %s", e.what() );
     return 1;
   }
}

void b ( void )
{
   ...
   c("file1.txt");
   c("file2.txt");
   ...
}

void c ( const char * filename )
{
   ...
   d( filename );
   ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   if ( strlen( filename ) < 10 )
   {
     throw std::runtime_error( "The filename should be at least 10 characters long." );
   }

   if ( printf( "About to open file %s", filename ) < 0 )
   {
     throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
   }

   FILE * f = fopen( filename, ... );
   if ( f == NULL )
   {
     throw std::runtime_error( collect_errno_msg( "Error opening file %s: ", filename ) );
   }
   ...
}

If the strlen() check above fails, the 'throw' statement stops execution of routine e() and control passes all the way up to the 'catch' statement in routine main() without executing any more code in any of the intermediate callers b(), c(), etc. (apart from the destructors of their local objects).

We still have a number of error-checking if() statements in routine e(), but we could write thin wrappers for library or system calls like printf() and fopen() in order to remove most of those if()'s. A wrapper like fopen_e() would just call fopen() and throw an exception in case of error, so the caller does not need to check with if() any more.
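A wrapper along those lines might look like the sketch below. It assumes a POSIX-like system where fopen() sets errno on failure (the C standard itself does not promise that), and it uses strerror(), which is not guaranteed to be thread safe. The exact message layout is just one possibility.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Thin wrapper around fopen() that throws instead of returning NULL.
FILE * fopen_e ( const char * filename, const char * mode )
{
  FILE * const f = std::fopen( filename, mode );

  if ( f == NULL )
  {
    // Capture errno straight away, before any other call can overwrite it.
    const int saved_errno = errno;

    throw std::runtime_error( std::string( "Error opening file \"" ) + filename + "\": " +
                              std::strerror( saved_errno ) );
  }

  return f;
}
```

The caller can then write "FILE * f = fopen_e( filename, "r" );" and carry on without any if() statement.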

Improving the Error Message with try/catch Statements

Let's improve routine e() so that all error messages generated by that routine automatically mention the filename. That should also be the case for any errors generated by any routines called from e(), even though those routines may not get the filename passed as a parameter. The improved code looks like this:

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       throw std::runtime_error( "The filename should be at least 10 characters long." );
     }

     if ( printf( "About to open file %s", filename ) < 0 )
     {
       throw std::runtime_error( collect_errno_msg( "Cannot write to the application log: " ) );
     }

     FILE * f = fopen( filename, ... );
     if ( f == NULL )
     {
       throw std::runtime_error( collect_errno_msg( "Error opening the file: " ) );
     }
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}

In the example above, helper routines format_msg() and collect_errno_msg() have not been introduced yet, see below for more information.
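Until then, here is one possible shape for these two helpers, inferred only from the way they are called in the examples. It is a sketch under assumptions: format_msg() is taken to be a printf-style routine that returns a std::string, collect_errno_msg() is taken to format its prefix in the same way and then append the text for the current errno value, and the internal routine format_msg_v() is an invented name. C++11 is assumed for va_copy.

```cpp
#include <cerrno>
#include <cstdarg>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Internal helper: formats printf-style from a va_list.
static std::string format_msg_v ( const char * format_str, va_list args )
{
  // The first pass only measures, so it needs its own copy of the arguments.
  va_list args_copy;
  va_copy( args_copy, args );
  const int len = std::vsnprintf( NULL, 0, format_str, args_copy );
  va_end( args_copy );

  if ( len < 0 )
    throw std::runtime_error( "Cannot format the error message." );

  std::vector< char > buffer( static_cast< std::size_t >( len ) + 1 );
  std::vsnprintf( &buffer[0], buffer.size(), format_str, args );

  return std::string( &buffer[0], static_cast< std::size_t >( len ) );
}

std::string format_msg ( const char * format_str, ... )
{
  va_list args;
  va_start( args, format_str );
  std::string msg;

  try
  {
    msg = format_msg_v( format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );
  return msg;
}

std::string collect_errno_msg ( const char * prefix_format_str, ... )
{
  const int saved_errno = errno;  // Save it before any other call can change it.

  va_list args;
  va_start( args, prefix_format_str );
  std::string msg;

  try
  {
    msg = format_msg_v( prefix_format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );
  msg += std::strerror( saved_errno );
  return msg;
}
```

Note that collect_errno_msg() must be called immediately after the failing library call, as almost any intervening call may overwrite errno.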

Note that all exception types are converted to a std::runtime_error object, so only the error message is preserved. There are other options that will be discussed in another section further ahead.

You may not need a catch(...) statement if your application uses exclusively exception classes ultimately derived from std::exception. However, if you always add one, the code will generate better error messages if an unexpected exception type does come up. Note that, in this case, we cannot recover the original exception type or error message (if there was a message at all), but the resulting error message should get the developer headed in the right direction. You should add at least one catch(...) statement at the application top-level, in the main() function. Otherwise, the application might end up in the unhandled exception handler, which may not be able to deliver a clue to the right person at the right time.

We could improve routine b() in the same way too:

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

You need to find a good compromise when placing such catch/rethrow blocks in the source code. Write too many, and the error messages will become bloated. Write too few of them, and the error messages may miss some important clue that would help troubleshoot the problem. For example, the error message prefix we just added to routine b() may help the user realise that the affected file is part of his personal address book. If the user has just added a new address book entry, he will probably guess that the new entry is invalid or has rendered the address book corrupt. In this situation, that little error message prefix provides the vital clue that removing the new entry or reverting to the last address book backup may work around the problem.

If you look at the original code, you'll realise that routine c() is actually the first one to get the filename as a parameter, so routine c() may be the optimal place for the try/catch block we added to routine e() above. Whether the best place is c() or e(), or both, depends on who may call these routines. If you move the try/catch block from e() to c() and someone calls e() directly from outside, he will need to provide the same kind of try/catch block himself. You need to be careful with your call-tree analysis, or you may end up mentioning the filename twice in the resulting error message, but that's still better than not mentioning it at all.

Using try/catch Statements to Clean Up

Sometimes, you need to add try/catch blocks in order to clean up after an error. Consider this modified c() routine from the example above:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();
  ...
  d( filename );
  ...
  delete my_instance;
}

If d() were to throw an exception, we would get a memory leak. This is one way to fix it:

void c ( const char * filename )
{
  my_class * my_instance = new my_class();

  try
  {
    ...
    d( filename );
    ...
  }
  catch ( ... )
  {
    delete my_instance;
    throw;
  }

  delete my_instance;
}

Unfortunately, C++ lacks the 'finally' clause, which I consider to be a glaring oversight. Many other languages, such as Java or Object Pascal, do have 'finally' clauses. Without it, we need to write "delete my_instance;" twice in the example above. The trouble is, the code inside such catch(...) blocks tends to become out of sync with its counterpart below, and it is rarely tested. There is no easy way to avoid this kind of duplication, not even with goto, as it cannot jump into a try or catch block. You can factor out the clean-up code to a separate routine and call it twice, passing all clean-up candidates as function arguments. But most people resort to smart pointers and other wrapper classes, see further below for more information.
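One such wrapper class is a small "scope guard" that runs a clean-up action from its destructor, which approximates a 'finally' clause. The sketch below assumes a C++11 compiler for the lambda; the class name scope_exit and the counter g_delete_count are invented for this illustration, and d() has been given a trivial body so that the error path can be exercised.

```cpp
#include <functional>
#include <stdexcept>

// Runs the given clean-up action when the object goes out of scope, whether
// the scope is left normally or because an exception is propagating.
class scope_exit
{
public:
  explicit scope_exit ( const std::function< void () > & action ) : m_action( action ) {}
  ~scope_exit () { m_action(); }  // The action itself must not throw.

private:
  scope_exit ( const scope_exit & );              // Not copyable.
  scope_exit & operator= ( const scope_exit & );  // Not assignable.

  std::function< void () > m_action;
};

struct my_class { };     // Hypothetical placeholder class.

int g_delete_count = 0;  // Only here to make the clean-up observable.

void d ( const char * filename )
{
  if ( filename == nullptr )
    throw std::runtime_error( "No filename given." );
}

void c ( const char * filename )
{
  my_class * my_instance = new my_class();

  // The clean-up code is written once and runs on both the normal and the error path.
  scope_exit cleanup( [&] { delete my_instance; ++g_delete_count; } );

  d( filename );
}
```

The clean-up code appears only once, right next to the allocation it belongs to, so it cannot get out of sync with a second copy.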

The Final Version

This is what the example code above looks like with smart pointers, wrapper functions and a little extra polish:

int main ( const int argc, char * argv[] )
{
   try
   {
     ...
     b();
     ...
   }
   catch ( const std::exception & e )
   {
     return top_level_error( e.what() );
   }
   catch ( ... )
   {
     return top_level_error( "Unexpected C++ exception." );
   }
}

int top_level_error ( const char * const msg )
{
  if ( fprintf( stderr, "Error calling b(): %s", msg ) < 0 )
  {
    // It's hard to decide what to do here. At least let the developer know.
    assert( false );
  }

  return 1;
}

void b ( void )
{
   try
   {
     ...
     c("file1.txt");
     c("file2.txt");
     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error loading your personal address book files: %s", e.what() ) );
   }
}

void c ( const char * filename )
{
  std::auto_ptr< my_class > my_instance( new my_class() );
  ...
  d( filename );
  ...
}

void d ( const char * filename )
{
   ...
   e( filename );
   ...
}

void e ( const char * filename )
{
   try
   {
     if ( strlen( filename ) < 10 )
     {
       log_and_throw( std::runtime_error( "The filename should be at least 10 characters long." ) );
     }

     printf_to_log_e( "About to open file %s", filename );

     auto_close_file f( fopen_e( filename, ... ) );

     const size_t read_count = fread_e( some_buffer, some_byte_count, 1, f.get_FILE() );

     ...
   }
   catch ( const std::exception & e )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, e.what() ) );
   }
   catch ( ... )
   {
     throw std::runtime_error( format_msg( "Error processing file \"%s\": %s", filename, "Unexpected C++ exception." ) );
   }
}

Helper function log_and_throw() can optionally write the error message together with a call stack dump to the application's debug log file, before raising an exception with the given std::runtime_error object.
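The following sketch shows how log_and_throw() and auto_close_file might be written. It is not the definitive implementation: the debug log is replaced by an in-memory list called g_debug_log (an invented name), the call stack dump is left out because it is platform-specific, and C++11 is assumed for the [[noreturn]] attribute.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the application's debug log file.
std::vector< std::string > g_debug_log;

// Records the error message and then raises the given exception object.
template < class exception_type >
[[noreturn]] void log_and_throw ( const exception_type & ex )
{
  try
  {
    g_debug_log.push_back( ex.what() );
  }
  catch ( ... )
  {
    // A logging failure must not mask the original error.
  }

  throw ex;
}

// Owns a FILE pointer and closes it when going out of scope.
class auto_close_file
{
public:
  explicit auto_close_file ( FILE * f ) : m_file( f ) {}

  ~auto_close_file ()
  {
    // Destructors should not throw, so an fclose() failure is ignored here.
    if ( m_file != NULL )
      std::fclose( m_file );
  }

  FILE * get_FILE ( void ) const { return m_file; }

private:
  auto_close_file ( const auto_close_file & );              // Not copyable.
  auto_close_file & operator= ( const auto_close_file & );  // Not assignable.

  FILE * m_file;
};
```

If losing an fclose() error is not acceptable, for example after writing to the file, the class would need an explicit close routine that can throw.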

Why You Should Use Exceptions

The exception mechanism is the best way to write general error-handling logic. After all, it was designed specifically for that purpose. Even though the C++ language shows some weaknesses (lack of a 'finally' clause, need for several helper routines), the exception-enabled code example above shows a clear improvement. For more details, take a look at Bjarne Stroustrup's reasons.

Another Code Example

Here is another example:

void SerialiseInObjects ( InputStream * file )
{
  ...
  file->ReadAndCheckHeader();
  file->ReadInt( objectCount );

  if ( objectCount > MAX_OBJ_COUNT )
    throw std::runtime_error( "Maximum number of objects exceeded." );

  for ( int i = 0; i < objectCount; ++i )
  {
    obj[i]->name   = file->ReadString();
    obj[i]->color  = file->ReadInt();
    obj[i]->shape  = file->ReadInt();
    obj[i]->posX   = file->ReadInt();
    obj[i]->posY   = file->ReadInt();
    obj[i]->height = file->ReadInt();
    obj[i]->width  = file->ReadInt();
    ...
  }
}

Let's think for a moment about not using exceptions in the example above. Any ReadInt() call may fail at any time, if the data read is not a valid integer, or if we have reached the end of file, so we would need to wrap every serialisation call in a separate if() statement, which would impact code readability. Also, if we need to abort early and we are not using exception-safe techniques, we have to be careful not to leave a memory leak behind with an early return.

We could try delaying some of the validation work, but then we have to validate at least some of the data read immediately. For example, in the case of an invalid objectCount, we may be reading forever or triggering an out-of-memory situation. Besides, keeping all the read and the validation code close together makes sense.

If something fails, we must not forget to store any error code or error message somewhere like errno before returning early. And don't forget to read it back later on when the error is being handled.
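To make the cost of the traditional approach concrete, here is a cut-down version of the routine without exceptions. Everything in it is a simplification for illustration purposes: the InputStream class is a hypothetical stand-in that serves integers from memory, the Object structure has only four fields, and ReadInt() reports failure with a bool.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stream that serves already-decoded integers from memory.
class InputStream
{
public:
  explicit InputStream ( const std::vector< int > & values ) : m_values( values ), m_pos( 0 ) {}

  // Returns false if there is no more data to read.
  bool ReadInt ( int * value )
  {
    if ( m_pos >= m_values.size() )
      return false;

    *value = m_values[ m_pos++ ];
    return true;
  }

private:
  std::vector< int > m_values;
  std::size_t        m_pos;
};

struct Object { int color, shape, posX, posY; };

enum { MAX_OBJ_COUNT = 1000 };

// Without exceptions, every single call needs its own check.
bool SerialiseInObjects ( InputStream * file, std::vector< Object > * objects )
{
  int objectCount;

  if ( ! file->ReadInt( &objectCount ) ) return false;
  if ( objectCount < 0 || objectCount > MAX_OBJ_COUNT ) return false;

  for ( int i = 0; i < objectCount; ++i )
  {
    Object obj;

    if ( ! file->ReadInt( &obj.color ) ) return false;
    if ( ! file->ReadInt( &obj.shape ) ) return false;
    if ( ! file->ReadInt( &obj.posX  ) ) return false;
    if ( ! file->ReadInt( &obj.posY  ) ) return false;

    objects->push_back( obj );
  }

  return true;
}
```

Even in this small sketch, the caller only learns that "something" failed: which object, which field and why are all lost unless yet more code is added to carry that information upwards.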

If you are serialising large amounts of information, traditional error checking becomes extremely tedious. That's probably the reason why, when an old document fails to read in a recent version of your favourite word processor, or when your favourite e-book format conversion tool cannot convert some old file, you get a generic failure message, and those bugs are rarely fixed. Without a proper error-handling strategy, it's just too hard to generate good error information. And without good error information, the software developers cannot know what went wrong, especially when dealing with complicated, half-documented file formats. After all, you cannot just e-mail them every document or every e-book that failed to load.

In spite of the reasons above, there are surprisingly many opponents of exception handling, especially in the context of the C++ programming language. While I don't share most of the criticism, there are still issues with some compilers and some C++ runtime libraries, even as late as 2013, see further below for details.

Exceptions Are Everywhere

Modern applications and software frameworks tend to rely on C++ exceptions for error handling, and it is impractical to ignore C++ exceptions nowadays. The C++ Standard Template Library (STL), Microsoft's ATL and MFC are prominent examples. Just by using them you need to cater for any exceptions they might throw.

Exceptions are prevalent outside the C++ world too: Java, JavaScript, C#, Objective-C, Perl and Emacs Lisp, for example, use exceptions for error-handling purposes. And the list goes on.

Even plain C has a similar setjmp/longjmp mechanism. The need to quickly unwind the call stack on an error condition is a very old idea pioneered by PL/I in 1964 and refined by CLU in the 1970s.
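For comparison, this is roughly what the setjmp/longjmp mechanism looks like, written here so that it also compiles as C++. The routine names are invented for the example. Keep in mind that longjmp() does not run any destructors on the way up, which is one of the things that C++ exceptions add, so this sketch deliberately uses no objects with destructors.

```cpp
#include <csetjmp>

static std::jmp_buf g_error_jump;  // Where to resume after an error.

static void low_level ( int value )
{
  if ( value < 0 )
    std::longjmp( g_error_jump, 1 );  // Does not return: jumps back to the setjmp() point.
}

static void middle_level ( int value )
{
  // No error-checking code is needed in the intermediate layers.
  low_level( value );
}

// Returns 0 on success and 1 if an error was raised somewhere below.
int top_level ( int value )
{
  if ( setjmp( g_error_jump ) != 0 )
    return 1;  // We land here when low_level() calls longjmp().

  middle_level( value );
  return 0;
}
```

The intermediate routine middle_level() stays free of error checks, just like routines c() and d() in the exception-based examples.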

Exceptions Make it Easier to Repurpose Code

Exception handling allows you to separate error generation from error handling, which isolates the error checking code from the surrounding environment and facilitates its portability.

Consider the following code snippet:

FILE * f = fopen( filename, ... );

if ( f == NULL )
{
  fprintf( stderr, "Error %d opening file %s", errno, filename );
  exit( 1 );
}

You cannot easily repurpose that kind of code to run in another environment. First of all, calling exit() at the point of error is not appropriate in most circumstances. Furthermore, if that piece of code ends up in a shared library, and a remote user calls it indirectly over a Remote Procedure Call (RPC), there may be no stderr to write the error message to, and even if there is, it will not be easy to collect the message text and forward it to the remote RPC client.

Let's think for a moment where an error message from a piece of repurposed code might possibly land:

  • Text console for an interactive process.
  • Log file for a background process.
  • System log for a service or daemon.
  • Database record for a failed task in a multiuser batch processing scenario over a Web interface.
  • Result of a remote procedure call (RPC).
  • Error message box in a graphical interactive application.

Clearly, the code raising an error can never deal with the error message delivery itself, if you ever want to reuse the code. The error message must be passed all the way up to the caller, mostly unchanged, a task for which the exception mechanism is most suitable.
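The sketch below illustrates that separation with one error-raising routine reused in two of the environments listed above. All names are hypothetical: set_copy_count() is the reusable code, background_job() delivers the message to a log (modelled as an in-memory list), and rpc_set_copy_count() returns it to a remote caller in an invented rpc_reply structure.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Reusable code: it detects the error but knows nothing about how to report it.
void set_copy_count ( int copies )
{
  if ( copies < 1 )
    throw std::runtime_error( "The number of copies must be at least 1." );
}

// Environment 1: a background process appends the message to its log.
std::vector< std::string > g_log_lines;  // Stand-in for a log file.

void background_job ( int copies )
{
  try
  {
    set_copy_count( copies );
  }
  catch ( const std::exception & ex )
  {
    g_log_lines.push_back( std::string( "Job failed: " ) + ex.what() );
  }
}

// Environment 2: an RPC server returns the message to the remote client.
struct rpc_reply
{
  bool        success;
  std::string error_message;
};

rpc_reply rpc_set_copy_count ( int copies )
{
  rpc_reply reply;
  reply.success = true;

  try
  {
    set_copy_count( copies );
  }
  catch ( const std::exception & ex )
  {
    reply.success       = false;
    reply.error_message = ex.what();
  }

  return reply;
}
```

Routine set_copy_count() needs no changes at all when it moves from one environment to the other; only the thin top-level layer differs.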

Downsides of Using C++ exceptions

Exceptions could make the code bigger and/or slower

This should not be the case, and even if it is, it is almost always an issue with the current version of the compiler or its C++ runtime library. For example, in my experience, GCC generates smaller exception-handling code for the ARM platform than for the PowerPC.

But first of all, even if the code size does increase or if the software becomes slower, it may not matter much. Better error-handling support may be much more important.

In theory, logic that uses C++ exceptions should generate smaller code than the traditional if/else approach, because the exception-handling support is normally implemented with stack unwind tables that can be efficiently shared (commoned up) at link time.

Because source code that uses exceptions does not need to check for errors at each call (with the associated branch/jump instruction), the resulting machine code should run faster in the normal (non-error) scenario and slower if an exception occurs (as the stack unwinder is a generic, table-driven routine). This is actually an advantage, as speed is not normally important when handling error conditions.

However, code size or speed may still be an issue in severely-constrained embedded environments. Enabling C++ exceptions has an initial impact on the code size, as the stack unwinding support needs to be linked in. Compilers may also conspire against you. Let's say you are writing a bare-metal embedded application for a small microcontroller that does not use dynamic memory at all (everything is static). With GCC, turning on C++ exceptions means pulling in the malloc() library, as its C++ runtime library creates exception objects on the heap. Such a strategy may be faster on average, but is not always welcome. The C++ specification allows for exception objects to be placed on the stack and to be copied around when necessary during stack unwinding. Another implementation could also use a separate, fixed-size memory area for that purpose. However, GCC offers no alternative implementation.

GCC's development is particularly sluggish in the embedded area. After years of neglect, version 4.8.0 finally gained configuration switch --disable-libstdcxx-verbose, which avoids linking in big chunks of the standard C I/O library just because you enabled C++ exception support. If you are not compiling a standard Linux application, chances are that the C++ exception tables are generated in the "old fashioned" way, which means that the stack unwind tables will have to be sorted on first touch. The first 'throw' statement will incur a runtime penalty, and, depending on your target embedded OS, this table sorting may not be thread safe, so you may have to sort the tables on start-up, increasing the boot time.

Debug builds may get bigger when turning C++ exceptions on. The compiler normally assumes that any routine can throw an exception, so it may generate more exception-handling code than necessary. Ways to avoid this are:

  1. Append "throw()" (or, since C++11, "noexcept") to the function declarations in the header files.
    This indicates that the function will never throw an exception. Use it sparingly, or you may find it difficult to add an error check in one of those routines at a later point in time.
  2. Turn on global optimisation (LTO).
    The compiler will then be able to determine whether a function called from another module could ever throw an exception, and optimise the callers accordingly.
    Unfortunately, using GCC's LTO is not yet a viable option on many architectures. You may be tempted to discard LTO altogether because of the lack of debug information on LTO-optimised executables (as of GCC version 4.8).
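As a small illustration of the first option, the sketch below uses the C++11 spelling "noexcept" instead of "throw()", because the older form was deprecated in C++11 and removed from later revisions of the language. The two routines are invented for this example. Remember that, if a routine declared this way does throw after all, the program is terminated.

```cpp
#include <stdexcept>

// Declared as never throwing: callers need no unwinding support for this call.
int clamp_percentage ( int value ) noexcept
{
  return value < 0 ? 0 : ( value > 100 ? 100 : value );
}

// No such promise here: the compiler must assume that this call may throw.
int parse_digit ( char c )
{
  if ( c < '0' || c > '9' )
    throw std::runtime_error( "Not a decimal digit." );

  return c - '0';
}

// The noexcept operator reports at compile time what the compiler may assume.
static_assert( noexcept( clamp_percentage( 1 ) ), "clamp_percentage() promises not to throw" );
static_assert( ! noexcept( parse_digit( '0' ) ), "parse_digit() makes no such promise" );
```

Adding an error check to clamp_percentage() later on would mean removing the promise from its declaration, which is why such declarations should be used sparingly.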

Exceptions are allegedly unsafe because they tend to break existing code more easily

The usual argument is that, if you make a change somewhere deep down the code, an exception might come up at an unexpected place higher up the call stack and break the existing software. For someone used to the traditional C coding style (assuming he is not using setjmp/longjmp), it is not immediately obvious that program execution may interrupt its normal top-to-bottom flow at (almost) any point in time, whenever an error occurs.

However, I believe that developers are better off embracing the idea of defensive programming and exception safety from the start. With or without exceptions, errors do tend to come up at unexpected places in the end. Even if you are writing a pure math library, someone at some point in time is going to try to divide by zero somewhere deep down in a complicated algorithm.

It is true that, if the old code handles errors with manual if() statements, adding a new error condition normally means adding extra if() statements that make new code paths more obvious. However, when a routine gains an error return code, existing callers are often not amended to check it. Furthermore, it is unlikely that developers will review the higher software layers, or even test the new error scenario, so as to make sure that the application can handle the new error condition correctly.

More importantly, in such old code there is a strong urge to handle errors only whenever necessary, that is, only where error checks occur. As a result, if a piece of code was not expecting any errors from all the routines it calls, and one of those routines can now report an error, the calling code will not be ready to handle it. Therefore, the developer adding an error condition deep down below may need to add a whole bunch of if() statements in many layers above in order to handle that new error condition. You need to be careful when adding such if() statements: if any new error check could trigger an early return, you need to know what resources need to be cleaned up beforehand. That means inspecting a lot of older code that other developers have written. Anything that breaks further up is now your responsibility, for the older code was "working correctly" in the past. This amounts to a great social deterrent from adding new error checks.

Let's illustrate the problem with an example. Say you have this kind of code, which does not use exceptions at all:

void a ( void )
{
  my_class * my_instance = new my_class();
  ...
  b();
  ...
  delete my_instance;
}

If b() does not return any error indication, there is no need to protect my_instance with a smart pointer. If b()'s implementation changes and it now needs to return an error indication, you should amend routine a() to deal with it as follows:

bool a ( void )
{
  my_class * my_instance = new my_class();
  ...
  if ( ! b() )
  {
     delete my_instance;
     return false;
  }
  ...
  delete my_instance;
  return true;
}

That means you have to read and understand a() in order to add the "return false;" statement in the middle. You need to check if it is safe to destroy the object at that point in time. Maybe you should change the implementation to use a smart pointer now, which may affect other parts of the code. Note that a() has gained a return value, so all callers need to be amended similarly. In short, you have to analyse and modify existing code all the way upwards in order to support the new error path.

If the original code had been written in a defensive manner, with techniques like Resource Acquisition Is Initialization, and had used C++ exceptions from scratch, chances are it would already have been ready to handle any new error conditions that could come up in the middle of execution. If not, any unsafe code (any resource not managed by a smart pointer, and so on) is a bug which can be more easily assigned to the original developer. Unsafe code may also be fixed during code reviews before the new error condition comes. Such code makes it easier to add error checks, because a developer does not need to check and modify so much code in the layers above, and is less exposed to blame if something breaks somewhere higher up as a result.
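Here is a sketch of routine a() written in that defensive manner. It uses std::unique_ptr, the C++11 replacement for std::auto_ptr. The counter g_live_instances and the flag g_b_should_fail are invented for this illustration: the former makes leaks observable, and the latter simulates the error check that gets added to b() at a later point in time.

```cpp
#include <memory>
#include <stdexcept>

int g_live_instances = 0;  // Only here to make leaks observable.

struct my_class
{
  my_class ()  { ++g_live_instances; }
  ~my_class () { --g_live_instances; }
};

bool g_b_should_fail = false;  // Simulates the error check added to b() later on.

void b ( void )
{
  if ( g_b_should_fail )
    throw std::runtime_error( "New error condition in b()." );
}

// a() owns its resource through a smart pointer, so it did not need any
// changes when b() gained the ability to fail.
void a ( void )
{
  std::unique_ptr< my_class > my_instance( new my_class() );

  b();

  // No explicit delete: the destructor of my_instance runs on every exit path.
}
```

Neither the signature of a() nor any of its callers had to be modified in order to support the new error path.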

Therefore, for the reasons above, I am not convinced that relying on old-style if() statements for error-handling purposes helps writing better code in the long run.

C++ exceptions may be unsafe in interrupt context

The C++ standard mandates that it should be safe to copy exception objects as they get propagated up the call stack. The reason is that, in some compilers, throwing an exception is implemented in the same way as if the throwing routine had returned a C++ object to the caller. The throwing routine creates the exception object in its stack context, fills it with data, and then calls the copy constructor or operator= in order to copy the object's contents to another instance in the caller's stack area. Before returning to the caller, the throwing routine destroys the local exception object instance as part of its stack clean-up procedure.

When returning C++ objects, the compiler is allowed to optimise away such copying, but that only works well between one caller and its callee. Exceptions can propagate up many levels, and it would be inefficient to attempt that kind of optimisation across big call trees, especially for exception objects, which should rarely be thrown.

Copying exception objects during call unwinding is slow, but predictable. It may be possible to optimise it by creating a separate memory area outside the standard call stack to hold the current exception object. Such an optimisation could get complicated if an exception is thrown while processing a previous one.

The trouble is, GCC's one and only C++ exception implementation follows a different approach: exception objects are created on the heap, and only a pointer to the current exception object is propagated up the call tree. That means that 'throw' ends up calling malloc() behind the scenes. I would normally prefer the other "slow and predictable" strategy to GCC's "optimised" implementation, because the time it takes to execute a malloc() call is usually not deterministic. However, given that exceptions will almost always contain heap-allocated, variable-length strings for rich error messages, GCC's choice does not really matter much in practice. It may even be faster, although speed shouldn't be so important when processing exceptions.

There is a catch, though: if you are writing embedded software on a bare-metal environment, it may not be safe to call malloc() within interrupt context. Therefore, even if you restrict your interrupt handler to simple exception objects which contain only integers, throwing an exception may still be unsafe. In order to avoid nasty surprises in such restricted environments, you should make sure that your bare-metal malloc() implementation asserts that it is not running in interrupt context.

Even if you make sure that malloc() is still safe within interrupt context, the stack unwinding logic may not be reentrant or thread safe. You will probably find it hard to find any documentation whatsoever about your compiler vendor's stack unwinding implementation.

Social Resistance

This is in fact the hardest obstacle. Most developers will have heard about the supposed drawbacks of exception handling. Quick excuses will come up instantly, like "nobody else around is doing it anyway". That may actually be true for embedded software and most C/C++ Linux development, but such reasoning has little value and tends to obstruct progress in the long run.

Somebody will have to start writing a few helper functions and make sure the compiler gets the right flags. Last but not least, you may have to re-educate older team members in order to change entrenched habits. Overcoming social inertia can be the greatest challenge.

General Do's and Don'ts

Do Not Make Logic Decisions Based on the Type of Error

Error causes are manifold and mostly unforeseeable during development. Consider a software driver for a local hardware component such as a standard serial port, where some operations can only fail because of a handful of reasons, if at all. At a later point in time, you may want to access a remote serial port on another PC as if it were locally connected, so you write a virtual serial port driver that acts as a bridge between the two PCs. Suddenly, some serial port operations may fail because of network issues, which is a different type of error altogether. If the original serial port driver did not provide a flexible way of reporting errors, the end user will lose error information every time there is a network problem.

Furthermore, generic errors are sometimes broken down into several more-specific errors in order to help troubleshoot a difficult issue that is happening at a customer site. A developer refining such error checks under time pressure will probably not realise that he might be changing the public routine interface as well.

Therefore, it is best to avoid making logic decisions based on the type of error (exception subclass, error code, etc). Normally, the action to be taken depends on the source code position (considering the whole call stack) where the error happened. If some indication about how to handle an error is required, consider setting an explicit parameter in the error information (such as a retry=yes/no flag), or, even better, try to move that kind of logic outside the error-handling domain.

Let's say that you wish to automatically retry a failed operation a few times before giving up, which is already a rare scenario. Chances are that there is a particular reason why something may need to be retried, like a critical resource being busy. You could check whether the resource is free upfront. If that is not possible, you may be able to resort to a version of the ResourceLock() routine which does not throw an error in the busy case. Such a routine could have a prototype like "bool TryResourceLock ( void );". If the resource locking fails, you should then pass the error up as a simple boolean indication instead of throwing an exception. After all, the busy condition is not an unforeseeable, fatal type of error, but a normal, expected condition.
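A minimal sketch of such a non-throwing variant, assuming that the resource is guarded by a C++11 std::mutex. The names g_resourceMutex, ResourceUnlock() and PerformWorkIfFree() are hypothetical, and a real caller would probably wait a little between attempts:

```cpp
#include <cassert>
#include <mutex>

static std::mutex g_resourceMutex;  // Hypothetical lock guarding the shared resource.

// Returns false if the resource is busy. Being busy is an expected
// condition here, so it is reported as a plain boolean, not as an exception.
bool TryResourceLock ( void )
{
  return g_resourceMutex.try_lock();
}

void ResourceUnlock ( void )
{
  g_resourceMutex.unlock();
}

// Hypothetical caller: tries a few times, then passes the busy indication up.
bool PerformWorkIfFree ( const int maxAttempts )
{
  for ( int i = 0; i < maxAttempts; ++i )
  {
    if ( TryResourceLock() )
    {
      // ... use the resource here ...
      ResourceUnlock();
      return true;
    }
  }

  return false;  // Still busy: a normal outcome, not a fatal error.
}
```

The retry decision stays in the caller, next to the code that knows why a retry makes sense, and the error-handling infrastructure is not involved at all.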

In the retry scenario described above, consider what would happen if you added a retry=yes/no flag to the error information, or if you implemented an exception subclass like CErrorNoRetry, instead of adding an explicit function argument for that purpose. This means that you would be using the error-handling infrastructure deep below in the specific task code in order to make logic decisions higher up in the global task retry module. If you then re-use the specific task code in some other software, it will not be immediately obvious that the new caller needs to support the retry mechanism as well, and the end user may end up getting a "retry error" before the developer notices the missing functionality.

Besides, it is unlikely that many error conditions should prompt a retry, while a number of other errors should not. Moreover, most subroutines called deep below will not be aware of the retry mechanism. In most cases, it's also hard to decide whether to retry or not: how should the program logic decide whether a network error is likely to be transient or permanent? Therefore, if you add a retry flag to the generic error information, you will probably find that only one or two places end up using it.

Never Ignore Error Indications

I once had Win32 GetCursorPos() failing on a remote Windows 2000 without a mouse when remotely controlled with a kind of VNC software. Because the code ignored the error indication, a random mouse position was passed along and it made the application randomly fail later on. As soon as I sat in front of the remote PC, I connected the mouse and then I could not reproduce the random failures any more. The VNC software emulated a mouse on the remote PC (a mouse cursor was visible), so the cause wasn't immediately obvious. And Windows 2000 without that VNC software also provided a mouse pointer position even if you didn't have one. It was the combination of factors that triggered the issue. It was probably a bug in Windows 2000 or in the VNC software, but still, even a humble assert around GetCursorPos(), which is by no means a proper error check, would have saved me much grief.

The upshot is, everything can fail at any point in time. Always check. If you really, really don't have the time, add at least a "to do" comment to the relevant place in the source code, so that the next time around you or some other colleague eventually adds the missing error condition check.

Restrict Expected, Ignored Failures to a Minimum

Sometimes failures are expected under certain circumstances. For example, if you are writing a "clean" rule for a GNU Make makefile, you have to consider the scenario where the user runs a "make clean" on an already-clean sandbox. Or maybe the last build failed somewhere halfway through, so not all files were generated. Therefore, the makefile code cannot complain and immediately stop if some of the files it is trying to delete are not present.

This is unfortunate, as chances are you will not realise if the makefile ever tries to delete a file that is no longer generated by the build rules. In the end, the "clean" target will probably drift out-of-sync with the build rules, and nobody will notice for a long time.

However, if you write an automated script that builds a program, installs it, and then deletes the temporary files, you know which files or directories should have been generated at the end of a successful build. Therefore, you should write the code so that it does complain and stop if it tries to delete a file or directory which does not exist. That way, you will notice straight away if the makefile has changed and a particular file or directory is not generated any more or now lands somewhere else.

The best way to prevent the "clean" rule from becoming out-of-sync with the build rules is to use a tool like autoconf in order to automatically generate the makefile. Otherwise, provide two targets, "clean-if-exists" and "clean-after-successful-build". Note that, if you use the standard "clean" name, many people will not realise that there is an alternative clean rule. If you do insist on using that name, make it an alias to "clean-if-exists", so at least it's clear to the developer what the distinction is. It is best to share the code between both makefile targets. Use a flag to tell whether the files or directories are expected to exist. Finally, as part of your release script, test both corner cases: "make clean-if-exists" on an already-clean sandbox, and "make clean-after-successful-build" after a successful build.

Never Kill Threads

Threads must always terminate by themselves. They can get an external notification that they should terminate, but the actual termination must be performed by the thread itself, ideally by returning from its top-level function. Avoid C#'s Thread.Abort, POSIX pthread_cancel() or similar syscalls. Killing a thread is at least not portable among platforms and tends to leak resources and offer no guarantees about how long it takes for the target thread to terminate.

There are only 2 ways for threads to notice that they have been requested to terminate:

  1. For threads that are busy, their work packets must be divided into smaller units and the thread must manually check (poll) the termination condition between work units. If the time between checks is too long, termination can be unnecessarily delayed. If it is too short, performance may be affected.
  2. For threads that wait on an event loop, the termination notification must wake them up. This means that the event loop must include the termination condition in the list of objects waited upon, or the master thread must set some boolean termination flag first and then send some message that is guaranteed to wake the child thread's event loop up. In a Unix environment, it is common to create pipes for the sole purpose of getting a file handle to wait upon. Closing the pipe may be all that's needed in order to get the termination notification across.

Any errors synchronising with a thread's termination (for example, a failed pthread_join(childThreadId)) should be treated as fatal and should lead to immediate application termination. Do not use a condition variable for termination purposes, because the child thread terminates after setting the condition variable, which introduces a race condition with the master thread waiting on that condition variable alone.

If the main thread encounters an error and decides to terminate, it should notify all other threads, wait for their termination, and then terminate itself. A graceful process termination allows the developer to trace all unreleased resources (memory blocks, file handles, etc) at the end, in order to verify that no leaks exist.

Do not wait for child threads with a timeout; always wait for an infinite time. Deadlocks should be identified, and not just hidden and ignored by means of a timer. Besides, on a system under heavy load, threads may take longer to terminate than originally estimated. Remember that, in a dead-locked application, you can break in with a debugger and dump the call stacks for all threads in order to find out what is causing the deadlock. Setting a timeout deprives the developer of that possibility.
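The polling variant for busy threads could be sketched as follows, assuming C++11 std::thread and std::atomic (some toolchains need the -pthread flag). All names are hypothetical. Note that std::thread::join() has no timeout and throws std::system_error on failure, which terminates the application if left uncaught:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static std::atomic< bool > g_terminateRequested( false );
static std::atomic< int >  g_unitsDone( 0 );

// The worker splits its job into small units and polls the flag in between.
void WorkerThread ( void )
{
  while ( !g_terminateRequested.load() )
  {
    // ... process one small work unit here ...
    ++g_unitsDone;
    std::this_thread::yield();
  }

  // Returning from the top-level function terminates the thread by itself.
}

// Starts a worker, asks it to stop, and waits for it without a timeout.
// Returns the number of work units the worker completed.
int RunWorkerOnce ( void )
{
  g_terminateRequested = false;
  g_unitsDone = 0;

  std::thread worker( WorkerThread );

  while ( g_unitsDone.load() == 0 )  // Let the worker make some progress first.
    std::this_thread::yield();

  g_terminateRequested = true;  // A notification only, the thread is never killed.
  worker.join();                // Waits for as long as it takes.

  return g_unitsDone.load();
}
```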

Check Errors from printf() too

Nobody checks the return value from printf() calls, but you should, for the following reasons:

  1. A failing printf() may provide an early warning for some other severe error condition, like the disk being full.
    If the program ignores the error and happily keeps chugging along, it may be hard to tell what went wrong. Or it may fail later on at an inconvenient time, when you actually need the software to do something useful.
  2. Writing to a log file with printf() may be part of the software's feature set, and it may be unacceptable to carry on performing actions without generating the corresponding log entries. Besides, if something else fails later on and you need to troubleshoot the problem, you'll have no log file to look at.
  3. A process' stdout may be piped to another process or even redirected over the network. These days you can easily access a remote serial port on another PC, so things that should always work, like serial port writes, could suddenly fail because of network issues. If such errors are not handled properly, the program may quit without a proper message or just crash with a generic SIGPIPE signal. Worse still, if errors are ignored, other programs down the pipe may attempt to process faulty or non-existent output from the previous process. The whole pipe may hang without an error indication.

The points above apply of course to all I/O operations on stdin, stdout and stderr, and to many languages other than C++.

Although ignoring error codes from function calls is obviously bad practice, it is so widespread that the GCC developers have added an extra warning designed to catch such sloppy code. Check out function attribute warn_unused_result and compiler switch -Wunused-result for more information.
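As a rough illustration, this is how the attribute could be applied to your own routines. The macro name MUST_CHECK_RESULT and the routine WriteGreeting() are made up for this sketch:

```cpp
#include <cassert>
#include <cstdio>

// GCC and Clang understand this attribute. Other compilers get an empty macro.
#if defined( __GNUC__ )
  #define MUST_CHECK_RESULT __attribute__(( warn_unused_result ))
#else
  #define MUST_CHECK_RESULT
#endif

// Hypothetical routine whose return value must not be ignored.
MUST_CHECK_RESULT bool WriteGreeting ( std::FILE * stream );

bool WriteGreeting ( std::FILE * const stream )
{
  return std::fputs( "hello\n", stream ) >= 0;
}

// WriteGreeting( stdout );         // GCC warns here, because the result is ignored.
//
// if ( !WriteGreeting( stdout ) )  // This is fine, the result is checked.
//   ... handle the error ...
```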

Checking the error code from printf() and the like is tedious, so you are better off writing wrapper functions. Here is a routine I have been using in Perl. It does need some improvement, but it's better than nothing. The main problem is remembering to use it instead of the prevalent print built-in routine.

sub write_stdout ( $ )
{
  my $str = shift;

  ( print STDOUT $str ) or
     die "Error writing to standard output: $!\n";
}
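A comparable wrapper for C++ code might look like the sketch below. The name PrintfOrThrow() is hypothetical. Keep in mind that, with buffered streams, some write errors only surface later when the buffer is flushed, so a complete solution would also need to check the result of fflush() or fclose():

```cpp
#include <cassert>
#include <cerrno>
#include <cstdarg>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Like fprintf(), but a failure is reported with an exception
// instead of a return value that is all too easy to ignore.
void PrintfOrThrow ( std::FILE * const stream, const char * const format, ... )
{
  va_list args;
  va_start( args, format );
  const int res = std::vfprintf( stream, format, args );
  const int errnoCode = errno;  // Capture errno straight after the call.
  va_end( args );

  if ( res < 0 )
  {
    throw std::runtime_error( std::string( "Error writing to stream: " ) +
                              std::strerror( errnoCode ) );
  }
}
```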

Use a Hardware Watchdog

Most embedded systems provide a hardware watchdog timer that restarts the firmware, should it become unresponsive. Otherwise, the device may require a manual reset, which is not always desirable or even feasible. Of course, a watchdog-induced reset constitutes a last-resort work-around.

Sometimes, you can configure the hardware watchdog so that it triggers a software interrupt instead of performing a hardware reset. However, this is something you must never do. The watchdog must ultimately perform a hardware reset without any software intervention. Otherwise, a severe software error or a nasty memory corruption could prevent the watchdog interrupt handler from resetting the device, rendering the watchdog useless.

If your system provides a last reset reason register (which it really should), make sure you read it on start-up. If the last reset was triggered by the hardware watchdog, you should alert the user, or at least write an entry on the device's log.

If you would rather handle the watchdog event in software, you can always use a standard timer with a shorter trigger period than the hardware watchdog. This kind of software watchdog will manage most firmware errors, and the hardware watchdog will still be there as the very last line of defense.

Assertions Are No Substitute for Proper Error Handling

Assertions are designed to help developers quickly find bugs, but are not actually a form of error handling, as assertions are normally left out of release builds. Even if they were not, they tend to generate error messages that a normal user would not understand.

Therefore, you need to resist the temptation of asserting on syscall return codes and the like.

However, an assertion is still better than no error check at all. If you have no time to write proper error checking code, at least add a "to do" comment to the source code. That will remind other developers (or even yourself in the distant future) that this aspect needs to be considered. For example, under Win32:

if ( FALSE == GetCursorPos( &cursorPos ) )
  assert( false );  // FIXME: Write proper error handling here.

I have grown fond of Microsoft's VERIFY macro for this kind of check.

VERIFY( FALSE != GetCursorPos( &cursorPos ) );  // FIXME: Write proper error handling here.
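If you are not using Microsoft's libraries, a similar macro is easy to write. The sketch below uses the made-up name MY_VERIFY. Unlike a plain assert(), the expression is still evaluated in release builds, only the check itself is dropped:

```cpp
#include <cassert>

#ifdef NDEBUG
  // Release build: evaluate the expression, but discard the result.
  #define MY_VERIFY( expr )  ( (void) ( expr ) )
#else
  // Debug build: evaluate the expression and assert on the result.
  #define MY_VERIFY( expr )  assert( expr )
#endif
```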

Exception-Specific Do's and Don'ts

Do Not Use Exceptions for Anything Else Than Error Conditions

Throwing an exception is a handy way to quickly return from a deeply-nested callee to a caller several levels up the call tree. Such a shortcut may save you the trouble of having to add a new argument to many routines all over the place. But you need to resist that temptation. Reasons are:

  1. Convention. Everybody expects that an exception means an error has occurred. Using exceptions for normal logic will cause confusion.
  2. Performance. The exception mechanism is slower than normal execution. Some implementations (like GCC's, as of version 4.8) require a malloc() call per throw statement.
  3. Developer productivity. It is handy to set the debugger to stop the application at the point where an exception is thrown, especially when testing the sunny-day scenario, where no errors are expected (which is what developers do most of the time). The trouble is, if exceptions are used in the sunny-day scenario too, then breaking on exception is no longer feasible. Therefore, whenever an error occurs, the developer loses the convenience of having the debugger stopping right where the error happened in the first place. See below for more information on this subject.

You might think that the rules above are so obvious that they are not worth mentioning here. After all, errors should be the exception rather than the rule (pun intended). However, I still get the odd surprise every now and then. Consider the relatively-modern Microsoft .NET framework. After opening a listening network socket with BeginAccept(), at some point in time you will want to close it. When you do, the callback routine gets called, and when this routine calls EndAccept(), an exception (an error) is raised. The callback routine needs to catch the exception and check its type in order to distinguish between the normal socket-closing scenario and a real error scenario. That's just the wrong way to indicate that a listening socket has been closed gracefully by the same application.

About automatically breaking on error, some environments such as Java or .NET offer the option to start a debugger whenever an exception of interest occurs. The debugger can then save detailed troubleshooting information like the call stack and a memory snapshot at the time the exception was raised.

Note that some tools, like BoundsChecker, allow you to set a general "break on exception" rule and then configure a few cases to skip, based on the source code location of the throw statement or on the call-stack shape. That kind of flexibility is hard to find, and, in any case, you need to spend some time upfront until the tool learns all the cases to skip. That's why it's best to restrict the usage of C++ exceptions to actual errors only.

Avoid Throwing in Destructors

Destructors are often used to provide automatic resource clean-up. If such a destructor fails, it may be difficult to handle the error properly, and aborting the application abruptly might actually be a better option. See section "Errors When Handling a Previous Error" above for more information.

Even if your destructor has no clean-up semantics, most developers do not expect that a destructor could ever throw. Therefore, if your destructor does throw, chances are that the user code around it will not be exception safe.

Note also that objects whose destructors may throw cause problems when inserted into STL containers, or when composing other bigger objects. What should operator delete[] do if two out of five objects fail to destruct?

Last but not least, the C++11 standard implicitly declares destructors as non-throwing, unless you manually add noexcept(false) declarations to your source code. If you fail to do that somewhere, your destructor may end up terminating the application abruptly via std::terminate.

This recommendation could be extended to any code that cleans resources up at the end of execution. Note that malloc()'s counterpart free() cannot report any errors, and the standard signature for operator delete has a "never throws" attribute.
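One common way to follow this recommendation is sketched below, under the assumption that you are using C++11. The class name CFileWrapper is hypothetical. The destructor silently drops any close error, and callers who do care about such errors call Close() explicitly beforehand:

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <type_traits>

class CFileWrapper  // Hypothetical clean-up wrapper for a C FILE handle.
{
public:
  explicit CFileWrapper ( std::FILE * const file ) : m_file( file ) { }

  // Implicitly non-throwing in C++11. A failure here cannot be reported, so it is dropped.
  ~CFileWrapper ()
  {
    if ( m_file != NULL )
      std::fclose( m_file );
  }

  // Callers who care about close errors call this routine before destruction.
  void Close ( void )
  {
    std::FILE * const file = m_file;
    m_file = NULL;  // The handle is gone after fclose(), even if it fails.

    if ( file != NULL && std::fclose( file ) != 0 )
      throw std::runtime_error( "Error closing file." );
  }

private:
  CFileWrapper ( const CFileWrapper & );              // Not copyable.
  CFileWrapper & operator= ( const CFileWrapper & );  // Not assignable.

  std::FILE * m_file;
};

static_assert( std::is_nothrow_destructible< CFileWrapper >::value,
               "The destructor is expected to be non-throwing." );
```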

Handle Unknown Exceptions

If an execution thread encounters an exception that none of its catch handlers can process, the whole application will abort via std::terminate. Such a drastic course of action may not be a bad idea after all; see section "Abrupt Termination" for more information.

In user interface applications, you should display an error message before aborting, so that the user knows what went wrong without having to look at some text console or hidden system log. Unfortunately, if you don't know the exception class upfront, you will not be able to extract any meaningful error information, so you will have to display some generic "Unexpected C++ exception" message, which is still better than nothing. You should not install a last-resort unhandled exception hook, as it may not always be safe to display an error message box when that hook gets called. Instead, write your own top-level, catch-all exception handler for that purpose alone.

When starting new threads with POSIX functions or the like, you should make sure that the C++ run-time library can handle unexpected exceptions on the new thread right from the beginning. If you are unsure, you should provide your own top-level, catch-all handler to deal with them.

Never Throw Pointers to Dynamically-Allocated Objects

Whenever an error happens, Microsoft's MFC library allocates an exception object in dynamic memory and then throws a pointer to it, in the hope that some catch handler will deal with it and release the associated memory in the end.

As a result, if you miss a catch statement somewhere, or if the exception hits some generic catch ( ... ) handler at intermediate level, chances are that you'll get a memory leak. Furthermore, you have to be careful in all your standard catch handlers: if the catch code throws a new exception, it'll probably leak the first exception object. For example:

try
{
  ...
}
catch ( CException * const e )
{
  DisplayMyStandardErrorDialogBox( e );  // WARNING: If this throws, you'll leak exception object 'e'.
  e->Delete();
}

In this scenario, it's best to write a small wrapper class for an MFC CException pointer, in order to make sure its associated memory gets automatically released no matter what happens in the catch handler. However, C++ functions do not usually get pointers passed as parameters for which they should immediately take ownership. Therefore, it's easy to forget to protect the pointer properly, even if a convenient wrapper class is available.

This advice obviously comes too late for the MFC. Everybody else should avoid pointers and stick to throwing and catching full objects of classes that implement a proper self-cleaning destructor.
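For those who are stuck with thrown pointers, the wrapper class mentioned above could look like this sketch. CFakeException is only a stand-in for MFC's CException, so that the code compiles without the MFC, and CExceptionDeleter is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Stand-in for MFC's CException, which also offers a Delete() method.
class CFakeException
{
public:
  static int s_liveCount;  // Number of live instances, for leak checking.

  CFakeException ()  { ++s_liveCount; }
  ~CFakeException () { --s_liveCount; }

  void Delete ( void ) { delete this; }
};

int CFakeException::s_liveCount = 0;

// Calls Delete() on scope exit, even if the catch handler throws.
class CExceptionDeleter
{
public:
  explicit CExceptionDeleter ( CFakeException * const e ) : m_e( e ) { }
  ~CExceptionDeleter () { if ( m_e != NULL ) m_e->Delete(); }

private:
  CExceptionDeleter ( const CExceptionDeleter & );              // Not copyable.
  CExceptionDeleter & operator= ( const CExceptionDeleter & );  // Not assignable.

  CFakeException * const m_e;
};

// Hypothetical error display routine that fails itself.
void ThrowingErrorDisplay ( CFakeException * )
{
  throw std::runtime_error( "Display failed." );
}

// Returns true if no exception object was leaked.
bool HandlerLeaksNothing ( void )
{
  try
  {
    try
    {
      throw new CFakeException();
    }
    catch ( CFakeException * const e )
    {
      const CExceptionDeleter deleter( e );
      ThrowingErrorDisplay( e );  // Throws, but 'e' still gets released.
    }
  }
  catch ( const std::runtime_error & )
  {
    // The second error lands here.
  }

  return CFakeException::s_liveCount == 0;
}
```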

Never Use std::uncaught_exception()

I cannot think of a single scenario where using std::uncaught_exception() would make sense, at least in release builds.

Using Smart Pointers for Clean-Up Purposes

Smart-pointers help automatically release resources no matter what. Although they can be useful alone to help with early function return statements, they are normally used in conjunction with exceptions.

Beware that smart pointers have a number of pitfalls, check out the following sites for detailed information:

Do Not Pass Smart Pointers Around

If you use a smart pointer class for clean-up purposes, it's best to have a single smart pointer instance per resource to clean up. This way, it's easier to work out the resource's lifetime. Passing a smart pointer as a function argument may imply transferring ownership or increasing a reference count, which advances or delays the resource's destruction and is not normally what you want. Furthermore, passing a raw pointer is always faster.

For example, if you use std::auto_ptr, the code should look like this:

std::auto_ptr< my_class > my_instance( new my_class() );
my_func_1( my_instance.get() );
my_func_2( my_instance.get() );

If the resource you need to clean up lives in a class instance, it is probably best to avoid adding a smart pointer for it as a class member, especially if those class instances can be moved or copied around. Instead, add a raw pointer and write code in the class destructor to manually clean up the embedded resource.

Watch Out with std::auto_ptr and Object Arrays

The standard library only provides auto pointer class std::auto_ptr for objects allocated with operator new. If you allocate an array of objects, instead of a single one, this class will compile fine, and will work on many platforms, but not on all of them. This illustrates the pitfall:

std::auto_ptr< char > my_var_1 ( new char );  // Fine.
std::auto_ptr< char > my_var_2 ( new char [ 100 ] );  // Wrong.

The problem is that operator delete is the wrong one to call for a char array: you need to call the array version of operator delete, namely operator delete[].

The Boost libraries offer scoped_ptr and scoped_array as a replacement for this usage scenario.
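If your compiler supports C++11, the standard library itself offers two alternatives, sketched below: the array specialisation of std::unique_ptr, which calls operator delete[], and std::vector. The function names are made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// std::unique_ptr< char[] > calls operator delete[] when it goes out of scope.
std::unique_ptr< char[] > AllocateBuffer ( const std::size_t size )
{
  return std::unique_ptr< char[] >( new char[ size ]() );  // The () zero-initialises the array.
}

// std::vector manages the array itself, and also remembers its size.
std::vector< char > AllocateVectorBuffer ( const std::size_t size )
{
  return std::vector< char >( size );
}
```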

Avoid std::shared_ptr

Smart pointers of type std::shared_ptr are fine when used for the purpose they were originally created, but some people use them as a replacement for the kind of clean-up wrapper classes described further above. These pointers are convenient because they let you specify a custom destructor routine that does the actual resource clean-up, which saves the effort of writing a new wrapper class. However, shared pointers have their own issues, so it's best to avoid them. After all, writing a wrapper class is not really a big deal. The issues with shared pointers are:

  1. It's easy to make the mistake of passing them around, which may inadvertently extend the object lifetime and, as a result, delay the clean-up operation. In the case of a delayed file handle close operation, the file may remain locked longer than necessary, preventing another program (or the same one) from immediately re-opening it.
  2. In order to avoid potential race conditions, a shared pointer implementation needs to synchronise access to its internal reference counter, which may mean locking a thread mutex (slow) or entering a critical section (threading bottleneck).
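If writing a full wrapper class feels like too much effort, C++11's std::unique_ptr also accepts a custom clean-up routine, but has a single owner and no reference counter. This is a different technique from the wrapper classes described in this article, and the names FileCloser, UniqueFile and WriteToTemporaryFile() are hypothetical:

```cpp
#include <cassert>
#include <cstdio>
#include <memory>

// Custom clean-up routine that closes a C FILE handle.
// std::unique_ptr never calls it with a null pointer.
struct FileCloser
{
  void operator() ( std::FILE * const file ) const
  {
    std::fclose( file );  // A close error cannot be reported from here.
  }
};

typedef std::unique_ptr< std::FILE, FileCloser > UniqueFile;

// Writes some text to a temporary file. The file is closed on every exit path.
bool WriteToTemporaryFile ( const char * const text )
{
  const UniqueFile file( std::tmpfile() );

  if ( file.get() == NULL )
    return false;

  return std::fputs( text, file.get() ) >= 0;
}
```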

Write Exception-Safe C++ Constructors

Most developers place clean-up code inside class destructors, so that resources are automatically released whenever necessary. However, if a constructor fails, the corresponding destructor is not called. That applies to copy constructors too. Therefore, writing an exception-safe C++ constructor requires special attention.

For example, this constructor is NOT exception safe:

MyClass::MyClass ( void )
{
  try
  {
    m_first  = new FirstClass();
    m_second = new SecondClass();
    ...
  }
  catch ( ... )
  {
    delete m_first;   // WRONG: This may crash.
    delete m_second;  // WRONG: This may crash too.
    throw;
  }
}

If FirstClass' constructor throws an exception, which may not necessarily be an out-of-memory condition, the code will try to release uninitialised pointers.

This version fixes the problem:

MyClass::MyClass ( void )
{
  try
  {
    m_first  = NULL;
    m_second = NULL;

    m_first  = new FirstClass();
    m_second = new SecondClass();
    ...
  }
  catch ( ... )
  {
    delete m_first;
    delete m_second;
    throw;
  }
}

The trick above does not work if those pointer members are marked as const. In fact, I believe that there is no way out in that case: you will have to declare those pointers as non-const members.

There is a relatively recent addition to the C++ language that allows you to write a try/catch block in order to handle exceptions from the constructor's initialiser list. Search the Internet for "Function try blocks" for more information. I believe you can only write one function try block per constructor, so they do not help either if the pointers are const members.
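For reference, this is roughly what a function try block looks like. The classes CMember and COwner are made up for this sketch. The handler should not touch the class members, as they may not have been constructed. It can only translate the exception or let it propagate, because the exception is automatically rethrown at the end of the handler otherwise:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

class CMember  // Hypothetical member whose constructor may fail.
{
public:
  explicit CMember ( const bool fail )
  {
    if ( fail )
      throw std::runtime_error( "member failed" );
  }
};

class COwner
{
public:
  explicit COwner ( const bool fail )
  try
    : m_member( fail )
  {
    // Normal constructor body.
  }
  catch ( const std::exception & e )
  {
    // Add a prefix with context information and throw a new exception.
    throw std::runtime_error( std::string( "Error constructing COwner: " ) + e.what() );
  }

private:
  CMember m_member;
};
```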

Writing Good Error-Handling Code without Exceptions

While exception handling generally helps, you do not have to use it in order to write good error handlers and generate helpful error messages. You still need to keep the same prefix-adding approach, where each "catch" point along the way up during call stack unwinding adds a prefix to the error message with further context information. But you can code it using old-fashioned if() / return statements in plain C.

First of all, you need to decide how much error information you want to collect and return to the caller. A simple variable-length character string for the generated error message is normally enough. If the returned string is empty, you can assume that execution was successful. Should you wish more information than a plain text message, it's best to create your own CErrorInformation class or structure, where all error information is conveniently kept together. You can take inspiration from Microsoft's _com_error, GLib's GError or Apple's NSError classes. It may be a good idea to derive your class from std::exception, in case you want to throw instances of it in other parts of the code which do use standard exceptions.

Then you need to decide how you want to return the error information to the caller. These are the alternatives:

  • Use a global, thread-local variable, like errno.
  • Pass a pointer to the string or error class on every function call.

Passing a pointer every time means typing more source code and may slightly degrade performance, but it's safer than using a global variable, as it's harder to forget collecting the associated error information after a failed call.

If you choose a global variable, you must also remember to clear the error information upon a successful call, which could also degrade performance. Or you could opt for clearing it only after an error is detected, but that's really easy to forget, so the next successful call may end up reporting a stale error message.

Assuming you chose a simple string that gets passed around, this is what the code could look like:

int main ( void )
{
  std::string errMsg;

  performWork( &errMsg );

  if ( !errMsg.empty() )
  {
     fprintf( stderr, "Error: %s\n", errMsg.c_str() );
     exit( 1 );
  }
 
  exit( 0 );
}

void performWork ( std::string * const errMsg )
{
  ...
  processFile( "myfile.txt", errMsg );

  if ( !errMsg->empty() )
  {
    AddPrefix( errMsg, "Failed to perform work of type %s:", typeOfWork );
    return;
  }
  ...
}

void processFile ( const char * const filename, std::string * const errMsg )
{
   FILE * f = fopen( filename, ... );

   if ( f == NULL )
   {
     const int errnoCode = errno;  // errno is very volatile, so capture its value straight after the error.
     FormatMsg( errMsg, "Error %d opening file %s", errnoCode, filename );
     return;
   }
   ...
}

This approach works well: it's easy to understand and delivers good error messages, and, just like when using exceptions, error message generation is isolated from error message delivery. You just need to remember to write an "if ( !errMsg->empty() )" statement after each and every call.
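The example above uses the helper routines FormatMsg() and AddPrefix() without showing them. One possible implementation is sketched below, under the assumptions that C++11's vsnprintf() is available and that the prefix and the existing message are separated by a single space. FormatV() is a made-up internal helper:

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats printf-style arguments into a std::string of the right length.
static std::string FormatV ( const char * const format, va_list args )
{
  va_list argsCopy;
  va_copy( argsCopy, args );
  const int len = std::vsnprintf( NULL, 0, format, argsCopy );  // First pass: measure.
  va_end( argsCopy );

  if ( len < 0 )
    return "<error formatting the error message>";

  std::vector< char > buffer( static_cast< std::size_t >( len ) + 1 );
  std::vsnprintf( &buffer[ 0 ], buffer.size(), format, args );  // Second pass: format.

  return std::string( &buffer[ 0 ], static_cast< std::size_t >( len ) );
}

// Replaces the contents of *errMsg with the formatted message.
void FormatMsg ( std::string * const errMsg, const char * const format, ... )
{
  va_list args;
  va_start( args, format );
  *errMsg = FormatV( format, args );
  va_end( args );
}

// Prepends the formatted prefix and a space to the existing message.
void AddPrefix ( std::string * const errMsg, const char * const format, ... )
{
  va_list args;
  va_start( args, format );
  *errMsg = FormatV( format, args ) + " " + *errMsg;
  va_end( args );
}
```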

Mixing and Matching Error-Handling Strategies

You have to deal with different error-handling strategies in your code, in the same way that you have to deal with different naming conventions and different source code styles. There is no way around it, as every system, every framework and every library follows its own guidelines, and you have no control over external code. Therefore, I wouldn't worry too much about unifying any of those aspects in your own source code. While a minimum level of standardisation and consistency is certainly pleasant to the eye, you must remember that writing software is not like writing a book. Code reliability is more important than visual consistency or appeal. Any productivity increases you may gain in a standard environment will be quickly offset by the discomfort resulting from the necessary policing and enforcing, especially if you insist on unifying every petty detail. Furthermore, if your developers cannot improve error reporting in a particular function or module because that would break the established standard, the end result will be deeply negative due to the increased troubleshooting costs.

The following sections describe the most common interfacing scenarios you will encounter in your software.

Exception-Based Code Calls Plain C-Style Code

As soon as you get an error indication, be it a returned error code or an errno-style error, you should convert it to an exception. See section "Helper Routines" for more information.
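As a rough illustration, an errno-style failure could be converted like this. The names ErrnoException() and OpenFileOrThrow() are made up for this sketch, and a real project would probably throw its own exception class instead of std::runtime_error:

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Builds an exception from an errno value that the caller has already captured.
std::runtime_error ErrnoException ( const std::string & prefix, const int errnoCode )
{
  return std::runtime_error( prefix + ": " + std::strerror( errnoCode ) );
}

// Like fopen(), but a failure is reported with an exception.
std::FILE * OpenFileOrThrow ( const char * const filename, const char * const mode )
{
  std::FILE * const f = std::fopen( filename, mode );

  if ( f == NULL )
  {
    const int errnoCode = errno;  // Capture errno straight after the failed call.
    throw ErrnoException( std::string( "Error opening file \"" ) + filename + "\"", errnoCode );
  }

  return f;
}
```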

Plain C-Style Code Calls Exception-Based Code

You will have to wrap every call in a try/catch block, and then try to recover and pass along at least the variable-length error message. Try to write a common routine to help with this task. If a generic conversion is not possible, you will have to write a trivial wrapper function for each exception-based callee.

If the C-style code can only return an error code, you will lose error information. You should write the discarded error information to an application log file, in case the error cause is needed in the future. Losing error information may be an acceptable compromise, especially if you plan to improve the C-style code in the near future.
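
The following sketch shows one possible shape for such a common routine, assuming that the C-style side can at least carry a std::string for the error message. ParseConfig, CallAndCaptureError and ParseConfigNoThrow are hypothetical names chosen for this example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Exception-based callee (hypothetical).
void ParseConfig ( const std::string & text )
{
  if ( text.empty() )
    throw std::runtime_error( "The configuration text is empty." );
}

// Generic conversion: runs 'func' and returns true on success.
// On failure, the error message is stored in *errMsg.
template < typename Func >
bool CallAndCaptureError ( Func func, std::string * const errMsg )
{
  try
  {
    func();
    errMsg->clear();
    return true;
  }
  catch ( const std::exception & e )
  {
    *errMsg = e.what();
  }
  catch ( ... )
  {
    *errMsg = "Unexpected exception type.";
  }

  return false;
}

// Trivial wrapper with a C-style signature for one particular callee.
bool ParseConfigNoThrow ( const char * const text, std::string * const errMsg )
{
  return CallAndCaptureError( [&]{ ParseConfig( text ); }, errMsg );
}
```

With the generic routine in place, each additional wrapper is a one-liner. Note that copying the message could itself fail in an out-of-memory situation, which this sketch does not attempt to handle.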

Mixing Different Types of Exception

Microsoft's MFC uses its own CException base class, which is not derived from std::exception. Chances are that you will use the MFC and the C++ standard library at the same time, so you will have to deal with both exception types at the same time. This can of course happen with other libraries. In the MFC case, your catch points should then look like this:

try
{
 ...
}
catch ( CException * const e )
{
  ...
}
catch ( const std::exception & e )
{
  ...
}
catch ( ... )
{
  assert( false );
  ...
}

Your code will have a few top-level error handlers, where you need to catch all 3 types above. If you have a lot of user interface event handlers, there will be many catch points like that scattered all around. However, chances are all such event handlers will be identical, so you could write a common function or macro to help reduce the boilerplate code.
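
One way to write such a common function is to rethrow the current exception inside a helper, so that the list of catch clauses only exists once. The sketch below cannot assume that the MFC is available, so LegacyException is a made-up stand-in for a foreign exception type that is thrown by pointer, and CurrentExceptionToMessage is a hypothetical name.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Stand-in for a foreign exception type that is thrown by pointer
// and is not derived from std::exception.
struct LegacyException
{
  std::string message;
};

// Must only be called from inside a catch block: rethrows the current exception
// and converts whatever it turns out to be into an error message.
std::string CurrentExceptionToMessage ( void )
{
  try
  {
    throw;
  }
  catch ( LegacyException * const e )
  {
    const std::string msg = e->message;
    delete e;  // With the MFC, this would be e->Delete() instead.
    return msg;
  }
  catch ( const std::exception & e )
  {
    return e.what();
  }
  catch ( ... )
  {
    assert( false );
    return "Unexpected exception type.";
  }
}
```

Every top-level catch point then shrinks to a single "catch ( ... )" clause that passes the result of CurrentExceptionToMessage() to the standard error dialog or to the log.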

For all intermediate catch points, where you just add a prefix to the error message and rethrow the same error, you could consistently convert one of the types to the other one in the hope that other higher-level catch points will have one less type to deal with. You probably don't need the "catch all" case all over the place, especially if you never expect to encounter one.

If you do have control of all the code, you could try to make all exception classes ultimately inherit from std::exception, so that the majority of catch points only have one type of exception to deal with. You could then amend your error-handling helper routines in order to distinguish between the different exception types by means of dynamic_cast. This way, all errors are handled in the same way, but any extra error information provided by different exception subclasses still gets properly collected and forwarded along the way.

Different Types of C-Style Error Information

If your function happens to return a Windows HRESULT error code and it calls an errno-style C runtime library routine, you will lose error information. Try to write the original error code to the application's log file in case it's needed later.

This is why it is always desirable to return error information as a variable-length string or as a custom CErrorInformation object. This way, no error information gets lost. You will have to write a few common conversion functions upfront, so that gathering error information in the foreign format becomes a trivial (if repetitive) task.

Improving Error Handling in an Old Codebase

Retrofitting good error handling is not actually as hard as you may think. The good news is that you don't have to reengineer the whole software at once: you can improve the source code step by step. And you don't need to start writing exception-safe code straight away either. See section "Writing Good Error-Handling Code without Exceptions" for more information.

The social or political factor is the most important one. You need to establish a target, so that there is a guideline for all new or amended code. Old code can initially remain as it is and be slowly converted as need arises.

Error Messages Must Be Returned to the Caller

The most important aspect is to avoid printing an error message to the console or sending it to the system log whenever an error condition is detected. Error messages must be returned to the caller, and only the top-level handler should deal with the error message.

Amending the whole source code at once is impractical, as you will not be able to test all modified error checks at a reasonable cost. There will be a period during which both strategies coexist: some errors will make it all the way up, while others will be output on the spot. However, this should not inconvenience the end user too much. After all, the error message will come up one way or another.

Enhancing Error-Handling Top-Levels

When it comes to error handling, there is not just the one top-level, but several of them:

  • The main() function needs to be able to handle both the old error-reporting style (like errno or function return codes) and the new one.
    If you are using exceptions, add a top-level try/catch block. On the other hand, if you are returning error information in the form of an error message or a CErrorInformation object, changes will be minimal.
  • Each thread constitutes an error-handling top-level. You should strive to forward to the main thread any error message that comes all the way up to the thread's starting point. Note that this is probably not an improvement that needs to be done right at the beginning.
  • Each user interface event handler, like a "button pressed" notification, should be considered an error-handling top-level too.
  • In a task-processing system, where independent tasks are submitted, the generic task top-level becomes an error-handling top-level too.
    You will probably want to store the error message as part of the task's processing result.

Interface Boundaries

An interface boundary can be an external library API, a well-encapsulated C module API or a network protocol.

If you are writing a public API for external users, you will have to convert the whole API at once, or the versioning costs will be too high. Otherwise, you can choose to improve single API function calls as the need arises, but there will be a trade-off between developer confusion and migration costs.

Do not overestimate the confusion risks, as most developers are already used to subtle differences in the error reporting. Just take a look at the Win32 API or at the standard C runtime library functions.

In the case of a network protocol, you will usually need to add a generic error response packet. See section "Delivering Error Information over the Network" for more information. Instead of converting all request packets at once, you can specify which requests may now send the new error response packet back instead of their normal successful response packet.

About Error Codes

Error Codes Are a Waste of Time

Error codes do not really help, only the error message does. To the average user, error codes are just noise on the line. Developers will also need to translate such codes into some source code constant or human-readable error string, so it's a waste of time for them too.

The only scenario where an error code could be helpful is if the developer does not understand the end-user's language at all. For example, an English developer may get an error message in Chinese. However, most Chinese users would probably gladly provide an English translation in order to get their problems solved quickly. Furthermore, nowadays most error messages are communicated electronically, so the developer should have no trouble searching for the translated text in the source code in order to locate the corresponding error strings in English.

That is the reason why your error dialog boxes should always include a prominent button to copy the error message to the clipboard in plain-text form. If the user takes a screen snapshot as a bitmap, you may have trouble typing text in a foreign alphabet into your search box.

Beware that error codes are firmly entrenched in computer error messages. They have become part of the general computing culture. Therefore, you may find it difficult to leave them out of your error messages.

Providing Error Codes Only

There are many Operating Systems, libraries, APIs, etc. that rely on error codes exclusively in order to report error conditions. This is always a mistake, as error codes are not immediately useful without the user manual and can only provide information about a particular, low-level condition, but not about any other circumstantial information that could also help identify the problem.

Error codes are often re-used, as they would otherwise grow with each new software version to an unmanageably long list. With each reuse, error codes gain ambiguity and thus lose expressiveness. Furthermore, error codes often force you to discard error information coming from lower software layers whose error codes cannot be effectively mapped to your own codes. That may not be the case now, but it may happen to your code library in the future. Say for example that your library starts storing its data in an SQL database, but this backend change should be transparent to the library user. Any SQL error messages will get discarded if your library's API can only report errors with a set of predefined error codes. In the end, an error code alone does not normally provide enough information to be really useful, so the user needs to resort to other sources like the application's console output or its log file.

Even if you are writing firmware for a severely-constrained embedded environment, you should strive to provide a way to report proper error messages in your code and communication protocols. Until your embedded hardware platform grows, you may have to limit your error messages to a set of predefined mnemonics. You may even have to re-use some of them. However, there is a good chance that your software will be re-used on a bigger platform, and then you can start reporting errors properly without having to re-engineer your code and protocols.

Look at the Linux routine open(), for example. The Linux Programmer's Manual documents 21 error codes as of September 2013. Wouldn't life be great if there were only 21 reasons why open() could fail? The entire errno enumeration consists of 34 entries (ENOENT, EINVAL, etc).

Microsoft has done a better job: the "Win32 Error Codes" page documents several thousand. Error codes are 32-bit values where you'll find some flags, a facility code and an error code. For further information, look at the documentation for GetLastError() and for HRESULT. It's a good try, but in the end, it falls short again. As the Operating System grows, and as you start dynamically loading libraries from all sorts of vendors, there is no way you can accommodate all the different error codes and make any sense out of them.

Here is a good example. I got this error dialog box in 2013. By the way, error code 0x80070002 does not seem to be particularly well documented in the MSDN.

MicrosoftErrorCode0x80070002.png

I personally find it disturbing that error codes are still generally accepted as good practice even for new projects.

Error Codes Should Not Be Used for Logic Decisions

It is hard enough to design and maintain a stable Application Programming Interface, so your error codes should not be a part of it. Error codes will constantly change, as new error codes are added in order to help pin-point problems more accurately. Therefore, it's best not to document all of your error codes formally, just mention a few of them as examples of what the caller may get back from a failed operation.

The logic on the caller's side should not make decisions based on the returned error codes anyway: all the caller needs to know is whether there was an error or not. See section "Do Not Make Logic Decisions Based on the Type of Error" for more information.

Your library users should not actually need a header file that exposes a constant for each error code. In other words, do not publish your library's equivalent of errno.h, as the library user should not need them. Only a human operator will eventually benefit from knowing exactly what went wrong, and a flexible text message is the only thing that is going to help him.

Managing Error Codes in Source Code

Whenever a developer writes code for a new error condition, there is a strong urge to re-use some existing error code, instead of adding a new one. This is why generic error codes like EINVAL constantly get abused until they are all but meaningless. Therefore, if you insist on using error codes in order to report error conditions, you should at least install a policy that discourages error code reuse.

Having a single, company-wide file with all possible error codes is not a workable solution. Just think of the recompilation time whenever someone adds a new error code. It's best to have a top-level file that defines error code ranges like this:

#define ERRCODE_MODULE_1_BEGIN 1000
#define ERRCODE_MODULE_1_END   1999

#define ERRCODE_MODULE_2_BEGIN 2000
#define ERRCODE_MODULE_2_END   2999

Each software component will need a separate header file for its error code range, like this:

enum module_2_error_code_enum
{
  err_before_first = ERRCODE_MODULE_2_BEGIN,

  err_bad_day      = ERRCODE_MODULE_2_BEGIN + 1,
  err_its_monday   = ERRCODE_MODULE_2_BEGIN + 2,
  err_its_raining  = ERRCODE_MODULE_2_BEGIN + 3,

  err_after_last   = ERRCODE_MODULE_2_BEGIN + 4
};
    
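// The preprocessor cannot see enum values (an unknown identifier evaluates to 0 in #if),
// so the range check needs a compile-time assertion instead (static_assert requires C++11).
static_assert( err_after_last < ERRCODE_MODULE_2_END, "Overstepped assigned range." );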

If you follow the advice above and do not publish the individual error codes, you don't need to explicitly assign a value to each enum entry; only the initial err_before_first = ERRCODE_MODULE_2_BEGIN needs an explicit value. However, if you do publish the error codes, they normally become part of the API and should not change afterwards, especially for a shared library that has to maintain binary compatibility with pre-compiled clients. In this case, deprecated errors will cause holes in the error code sequence, as their values will not be reused (but leave a comment at the corresponding place), and new error codes will get appended at the end of the list. In this scenario, it's best to explicitly specify each enum value, as that tends to remind the developer that error codes must remain fixed. Otherwise, adding a new entry in the middle of the enum will inadvertently shift all subsequent values by 1.

Providing Rich Error Information

In addition to an error message designed for human consumption, most platforms allow the developer to provide further error information, like an error class or the call stack at the point where the error was recorded. Such extra information is rarely used in practice. The error message is what matters in the end.

For example, Microsoft must have realised at some point in time that error codes, no matter how sophisticated (see section about HRESULT further above), weren't cutting it. The outcome was the IErrorInfo COM interface and its _com_error wrapper class. IErrorInfo allows you to set a help filename and a help context ID, a feature which is rarely used. Even if you take care that a help file is always correctly installed next to your code library, the file path will become invalid as soon as you cross process, user or network boundaries.

Look also at C++, .NET, Java, Perl, etc.: they all offer some sort of way of reporting error messages as variable-length character strings. Some of them offer fancy additions too: Apple's Cocoa has fields named localizedRecoverySuggestion, NSHelpAnchorErrorKey, NSURLErrorFailingURLErrorKey and even localizedRecoveryOptions and NSErrorRecoveryAttempting in its NSError class. I wonder how much those fields are actually used in real life. Should you decide to move your code to a shared library, most such options become unusable. After all, in the end the caller may not even be a human that can read a help file or attempt an automatic recovery.

In .NET you can provide nested exception objects, so you can build a chain of exception objects. As a result, error dialogs often allow you to expand the next level of error information by clicking on the next '+' icon, as if you were expanding subdirectories in a file tree view. How cumbersome that is for a normal user! A couple of standard buttons like "copy error message" and "e-mail error message" would have been more helpful.

For some software environments, it's too late. Consider process exit codes under Unix: their structure (or lack of it) is cast in stone, the exit codes are always custom values. But there's some minimum standard here too: if the exit code is non-zero, you can probably retrieve some sort of error description if you capture the process' stderr output as plain text.

Imagine now that you need to write software that interoperates with other languages in other computers over the network. Your software may get ported to another platform, and parts of it may get moved to a shared library. In the end, the only practical way to pass error information back and forth is to reduce it to a single, possibly lengthy error message string. Therefore, you should focus on generating good error strings in the first place. As soon as you get an error code from some system or library call, turn it into a string and forget the original error code. And don't worry too much about all the fancy additions listed above.

If you do decide to write your own extended error information class, bear in mind that such objects may get copied around pretty often during the stack unwinding phase. If error-handling performance or memory usage matter (they usually don't), you may want to optimise the code that handles the error message. You could use a fixed-size string buffer capable of holding most error messages and only resort to dynamically-allocated memory if the message is really long. Also consider using a string class which can share its internal string storage, so that copying a string object just means increasing the associated reference counter.
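
The shared-storage idea can be sketched with a std::shared_ptr, so that copying the exception object only increments a reference counter. The class name CSharedMsgError and its GetShareCount() accessor are invented for this illustration. Many standard library implementations probably do something similar internally for std::runtime_error, so measure before writing your own class.

```cpp
#include <exception>
#include <memory>
#include <string>

// Hypothetical exception class whose copies all share one immutable message buffer.
class CSharedMsgError : public std::exception
{
public:
  explicit CSharedMsgError ( const std::string & msg )
    : m_msg( std::make_shared< const std::string >( msg ) )
  {
  }

  virtual const char * what ( void ) const noexcept override
  {
    return m_msg->c_str();
  }

  // Exposed only to illustrate the sharing.
  long GetShareCount ( void ) const
  {
    return m_msg.use_count();
  }

private:
  std::shared_ptr< const std::string > m_msg;
};
```

The only memory allocation happens when the error is first created. All copies made afterwards, for example during stack unwinding, reuse the same buffer.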

Classifying Errors into Different Error Categories

Another way to extend error information is to divide errors into different categories. In Java, you can derive your own subclasses from the standard java.lang.Throwable, java.lang.Exception or java.lang.RuntimeException classes (among others). The idea behind this hierarchy is that some errors are coding bugs that should never happen (typically the unchecked exceptions derived from RuntimeException, like NullPointerException), whereas others can happen in real life even if the software was written correctly (the checked exceptions, like java.io.IOException). In C++ you can define your own classes, which need not have a common ancestor at all. The standard library uses std::exception as a base class, and you are encouraged to use std::runtime_error for "normal" errors.

You can also specify in both languages which exceptions a given routine is allowed to throw. This is called exception specification in C++. In Java, look for checked exceptions for more information.

The trouble is, having different error categories increases the administration costs and does not really help in most circumstances. Whether an error is a software bug or a "normal" error does not matter much to the end-user at the time when the error happens. In most situations, there is no obvious reason why the software should handle the error differently depending on its category. Typical scenarios where a different error-handling strategy could be desirable are:

  • User input validation errors.
    When the user forgets to fill in a mandatory form field, you may not want to display a standard error. Instead, you may want to switch the input focus to the affected field, maybe scrolling the page so that it comes into view, while painting the field in red and displaying a "required" hint next to it. But soon you'll find that some other fields need different highlighting or a different kind of hint. Or maybe 2 fields are related and must be visible at the same time. Sometimes you may want a confirmation rather than a hard error. In the end, it may be better to stop using the generic error-handling support and code the validation logic with standard means. But even if you insist on creating a separate error class for user validation failures, it probably will be limited to a few places where user input gets validated in that way.
  • Automatic retry scenarios.
    Some errors are transient and it may be worth retrying a few times before giving up. However, deciding which errors fall into that category is hard. There is often one particular foreseen condition that makes it worth retrying the operation, and that can be handled separately with an extra boolean flag. Otherwise, you will probably want to retry anything that fails in a given code area, which would be position-based and not error type-based. See section "Do Not Make Logic Decisions Based on the Type of Error" for more information on this.
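
The position-based retry mentioned in the last point could look like the following sketch, which retries whatever fails inside a given code area without inspecting the error type. RetryOperation is a hypothetical name, and production code would probably wait between attempts.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Retries 'operation' up to maxAttempts times, regardless of the type of error.
// If the last attempt fails too, the error is rethrown with the attempt count as a prefix.
template < typename Func >
void RetryOperation ( const unsigned maxAttempts, Func operation )
{
  for ( unsigned attempt = 1; ; ++attempt )
  {
    try
    {
      operation();
      return;
    }
    catch ( const std::exception & e )
    {
      if ( attempt >= maxAttempts )
        throw std::runtime_error( "Giving up after " + std::to_string( attempt ) +
                                  " attempts: " + e.what() );
    }
  }
}
```

The decision to retry lives at the call site, for example around the code that talks to a flaky device, and not in any error class hierarchy.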

For the reasons above, there has been a great deal of controversy in the Java community about using different error classes. In the C++ world, dynamic function exception specifications ended up being deprecated in the C++11 standard.

The only exception specification left that can still help optimise C++ code is the "never throws" indication, expressed as a throw() suffix with an empty exception list in the function declaration. If the compiler knows that a function never throws, it can usually generate faster or smaller code. If you are using the C++11 standard, you can also resort to the noexcept operator and the noexcept specifier.
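
As a short C++11 illustration, the noexcept specifier goes on the function declaration, and the noexcept operator lets you check at compile time what the compiler has been promised. SwapCounters and MayFail are hypothetical functions used only for this example.

```cpp
#include <utility>

// The noexcept specifier promises that the function never throws.
void SwapCounters ( int & a, int & b ) noexcept
{
  std::swap( a, b );
}

// No specifier: the compiler must assume that this function can throw.
void MayFail ( int value );

// The noexcept operator evaluates at compile time whether an expression is declared not to throw.
// Its operand is never executed, so MayFail() does not even need a definition here.
static_assert(  noexcept( SwapCounters( std::declval< int & >(), std::declval< int & >() ) ),
                "SwapCounters should be declared noexcept." );
static_assert( !noexcept( MayFail( 0 ) ),
                "MayFail should be potentially throwing." );
```

Bear in mind that, if an exception does try to escape from a noexcept function, the program gets terminated with std::terminate().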

Error Message Delivery

Generally, you should strive to deliver helpful error information in a timely manner and to the right person.

Preventing Sensitive Information Leakage

You may not always want to deliver full error information to a generic end user. Reasons are:

  1. In the context of a public service, like a Web server, you may give too much technical information away that an attacker could use to compromise the software.
  2. You may give away trade secrets or facilitate reverse engineering if your error information includes detailed call stacks or hints about the program's internal structure.
  3. If you include a call stack with a full argument and local variable dump, you may end up displaying a password in clear text that happened to be in one of the dumped variables.
  4. You may confuse the user if you display too much noise.

In such situations, you may need to install a try/catch filter that forwards all or part of the error information to a log file (maybe even encrypted), and then replace the error message with a generic failure indication or an abbreviated version for the end user.

Delivering Error Information in Applications with a User Interface

Every Event Handler Needs an Error Catch Point

You will probably need to add an error catch point to every user interface event handler (button press, menu option press, etc) in your application. They will all look the same, so it is best to write a comfortable helper routine for this purpose. For example:

void Button1Press ( void )
{
  try
  {
    ... whatever the button does ...
  }
  catch ( const std::exception & e )
  {
     DisplayStandardErrorDialogBox( e );
  }
  catch ( ... )
  {
     DisplayStandardErrorDialogBoxForUnexpectedException();
  }
}
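
The helper routine could take the event handler's body as a callable object, so that the catch clauses are written only once. In this sketch, RunEventHandler is a hypothetical name, and DisplayStandardErrorDialogBox is a stand-in that takes a string and merely records the message instead of opening a dialog.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for the application's error dialog; here it just records the text.
std::string g_lastDisplayedError;

void DisplayStandardErrorDialogBox ( const std::string & msg )
{
  g_lastDisplayedError = msg;
}

// Common catch point shared by all event handlers.
void RunEventHandler ( const std::function< void ( void ) > & handler )
{
  try
  {
    handler();
  }
  catch ( const std::exception & e )
  {
    DisplayStandardErrorDialogBox( e.what() );
  }
  catch ( ... )
  {
    DisplayStandardErrorDialogBox( "Unexpected exception type." );
  }
}

void Button1Press ( void )
{
  // The lambda contains whatever the button does. Here it always fails, for illustration.
  RunEventHandler( []{ throw std::runtime_error( "Cannot save the document." ); } );
}
```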

Write a Good Standard Error Dialog Box

You must provide an easy way for the user to copy the error message to an e-mail, preferably as text, instead of a bitmap like a screen snapshot. Here are several alternatives in order of preference:

  1. Provide a button labelled "Copy error message to clipboard".
    Note that all standard message boxes under Windows support Ctrl+C or Ctrl+Ins for this purpose, but it's not mentioned anywhere.
  2. Make the error text a normal read-only text box. The moment the user clicks and sees a text cursor, he'll probably know how to select and copy the text out of it.
  3. Implement a right-click pop-up menu and provide a "Copy error text" option.
  4. Mention in the error message text that there is a log file where the error information can be located afterwards.
  5. Add to the user's guide a section about how to report bugs, and explain how to copy the error message in text form.

Here is an example from a Java application. It does not satisfy all criteria above, but it is a step in the right direction. Besides, it is very easy to implement.

ErrorDialogBoxWithCopyFeature.gif

Here is another one from Calibre. It is not ideal, since you will always want to click on the "Show details" button in order to know exactly what failed and why, but at least there is a prominent "Copy to clipboard" button.

CalibreErrorMessageDialogBox.png

Provide a Log File

From the user's point of view, errors always come at the wrong time. There is a strong tendency to dismiss error dialogs immediately without reading the text inside, in order to try and get the job done quickly with other means. Afterwards, both the user and the developer will regret not having the original error message at hand. Therefore, your best bet is to save the error information to a log file before showing an error dialog box. Besides, a log file may help you find out whether an error happens often and under what circumstances.

You may want to save extra error information to the log file, in case the developer needs more details for troubleshooting purposes. See section "Throw Utility" below for a hint about how to implement such a feature.

If you mention in the standard error dialog that all error messages are also stored in the log file, you stand a better chance that the user or his IT administrator will find them later on.

Remember to let the user know where the log file is. You could provide a menu option or button titled "Locate Log File" that would act like Firefox's "Open containing folder" option.

It could happen that the error message is too long to display on the screen and gets chopped, so make sure you save the complete string to the log file, and not just what is displayed on the screen.

How to Handle Window Repaint Errors

If you are writing a graphical user interface application, the window repaint code (see the WM_PAINT message under Win32) will get called very often. If something fails while repainting, which is relatively rare, you are in an awkward situation. You cannot simply open a modal dialog box in order to display the error message, because closing it will trigger a repaint on the window below which will open the same dialog box again. The user will have no way of breaking this infinite loop.

You can deal with repaint errors in the following ways:

  1. Ignore any errors during the painting phase.
  2. Paint the error message instead.

Whether you are ignoring an error or not, you should send the associated error message to the system log or to the application log file. Even if you paint the error message, the user will probably not be able to copy and paste it off the screen with normal means. Do not generate an error entry on every repaint failure, especially when writing to the system log, otherwise you will fill the log with hundreds of error entries if the user happens to resize the window with the mouse. A simple throttling mechanism would be to skip logging an error if the application has seen more than 3 window paint errors in the last 60 seconds.
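
The throttling rule above could be implemented along these lines. CPaintErrorThrottle is a hypothetical class, and the current time is passed in as a parameter only to keep the sketch easy to test.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Decides whether a paint error should be logged, based on how many were seen recently.
class CPaintErrorThrottle
{
public:
  typedef std::chrono::steady_clock::time_point TimePoint;

  static const std::size_t MAX_LOGGED = 3;

  // Call on every paint error. Returns false if more than MAX_LOGGED paint errors
  // (including this one) have been seen in the last 60 seconds.
  bool ShouldLog ( const TimePoint now )
  {
    // Forget all errors that are at least 60 seconds old.
    while ( !m_recent.empty() && now - m_recent.front() >= std::chrono::seconds( 60 ) )
      m_recent.pop_front();

    m_recent.push_back( now );

    // Keeping the newest MAX_LOGGED + 1 entries is enough to make the decision,
    // and it bounds the memory usage during an error storm.
    if ( m_recent.size() > MAX_LOGGED + 1 )
      m_recent.pop_front();

    return m_recent.size() <= MAX_LOGGED;
  }

private:
  std::deque< TimePoint > m_recent;
};
```

The painting code would call ShouldLog( std::chrono::steady_clock::now() ) on every failure and skip the log entry whenever it returns false.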

Ignoring Errors When Painting

If you simply ignore any errors during the painting phase, the window contents will look wrong, and the user will hopefully notice that something is not quite right. This is not ideal, as the problem may actually go unnoticed for a long time.

All ignored errors should still trigger an assertion in debug builds, in order to increase the chances that such problems are detected during development.

When writing the painting code, you have these choices:

  1. Ignore errors at every painting step.
    This has the advantage that, if anything fails in the middle, all other graphic elements before and after the failed one might still be repainted correctly. For every failed call, you should still call a common routine in order to log the error and trigger an assert on debug builds.
  2. Ignore errors only at top-level.
    You can write the code with normal error handling techniques, so that the first error encountered gets thrown as an exception. The exception can then be caught and ignored at the painting top-level.
    The main advantage is that, if you re-use the code later on for other tasks, it may be possible to display a proper error message box in other scenarios. For example, if part of the same code gets called in order to generate printer output, the user would get a standard message box upon pressing the 'print' button.
    There is a downside: if anything fails in the middle, all graphic elements afterwards will be skipped.

Painting the Error Message

You could implement a top-level try/catch point which, whenever an error comes up, clears the whole drawing area and paints a simple text error message. Remember to add an assert in order to increase the chances that the bug will be discovered during development.

There are several advantages to this strategy:

  • The end-user will notice straight away that something is wrong.
  • The error-handling code will probably be the same for all painting call-backs, so that you can use standard boilerplate code that just calls some common routines.

Chances are that painting errors occur in complex, window-specific painting logic, and painting a simple text message upon error should be simple enough (and tested enough) so that it will always succeed. However, should that fail too, you will have to resort to the steps outlined above: log, assert and ignore. Alternatively, if even simple painting fails, you can assume the worst and terminate the application abruptly.

Delivering Error Information in Embedded Applications

There is nothing more frustrating than an embedded device without a user interface that crashes and immediately restarts, forgetting the error cause straight away. There are several ways to try and keep error information between restarts:

  1. Write error information synchronously to a serial console before restarting.
    However, chances are that, when the error triggers, there is no connected PC logging the serial output to disk.
  2. Send error information over the network before restarting.
    The trouble is, the kind of nasty crashes you really want to know about would probably prevent the software from using the network services too.
  3. Allocate a portion of SRAM for error logging purposes, and make sure that this SRAM area survives a soft restart.
  4. Write error information to non-volatile memory. A small FRAM chip would fit the bill perfectly.

You need to be careful with EEPROM or Flash memory. If your embedded software enters an infinite loop of failing and restarting, you could quickly exceed the memory's write endurance limit. There are some measures you can take to prevent that from happening:

  1. Skip writing a new error message if it is the same as last time.
    This means you should only try to keep the plain error message, without any date or sequence number, in the hope that the same error condition will generate exactly the same error message the next time around.
  2. Reserve enough space for several error messages (slots), and rotate among them.
  3. Discard any new errors until the last one is at least a given number of minutes old.
    This only works if your embedded device has a real-time clock or can synchronise its clock soon after starting up.
  4. Disable writing error information until a minimum uptime has been reached.
    The write-enable delay then becomes a trade-off between flash memory protection and the chance of losing error information.
    This strategy could be the best solution, because users would probably notice if an error happens shortly after starting the device. Many errors are often triggered by a change in the device configuration, so the user will have a better chance of troubleshooting it. However, after the device has been left alone running for some time, it is less likely that the user will be able to figure out what the error cause was.
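
Measures 1, 2 and 4 from the list above can be combined in a few lines. The sketch below models the non-volatile memory as a plain in-memory array, so CErrorLogStore and its slot count are assumptions made for illustration, and the actual Flash or EEPROM driver calls are left out.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical in-memory model of a small error log kept in Flash or EEPROM.
class CErrorLogStore
{
public:
  static const std::size_t SLOT_COUNT = 4;

  explicit CErrorLogStore ( const unsigned minUptimeSeconds )
    : m_minUptimeSeconds( minUptimeSeconds ), m_nextSlot( 0 ), m_writeCount( 0 )
  {
  }

  // Returns true if the message was actually written.
  bool Record ( const std::string & errMsg, const unsigned uptimeSeconds )
  {
    if ( uptimeSeconds < m_minUptimeSeconds )
      return false;  // Measure 4: too soon after start-up, protect the memory.

    const std::size_t lastSlot = ( m_nextSlot + SLOT_COUNT - 1 ) % SLOT_COUNT;

    if ( m_writeCount > 0 && m_slots[ lastSlot ] == errMsg )
      return false;  // Measure 1: same message as last time.

    m_slots[ m_nextSlot ] = errMsg;  // Measure 2: rotate among the slots.
    m_nextSlot = ( m_nextSlot + 1 ) % SLOT_COUNT;
    ++m_writeCount;
    return true;
  }

  unsigned GetWriteCount ( void ) const
  {
    return m_writeCount;
  }

private:
  const unsigned m_minUptimeSeconds;
  std::size_t m_nextSlot;
  unsigned m_writeCount;
  std::array< std::string, SLOT_COUNT > m_slots;
};
```

On real hardware, the position of the last written slot would have to be recovered from the memory contents after each restart, because the counters in this sketch live in RAM.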

Remember to provide a way to access the last reset reason, or write it every time to the device log on start-up, as knowing whether the last reset was caused by the software or hardware watchdog can make a difference.

Delivering Error Information over the Network

If you write your own communications protocol, chances are that you will create different request and response messages, so that a recorded conversation would look like this:

->  GET_FIRMWARE_VERSION
<-  VERSION: "v1.02"

->  SET_TIME "8:00 am"
<-  OK

->  GET_TIME
<-  TIME: "08:05 am"

->  REFRESH_CLOCK_DISPLAY
<-  OK

In the above protocol, some requests, like SET_TIME and REFRESH_CLOCK_DISPLAY, share the same OK response. Other requests have their own separate response messages.

You should specify your protocol so that a generic ERROR: "flexible error message here" response is always an acceptable answer to any request. This way, a global helper routine can effortlessly collect an arbitrary error message for any kind of failed request.
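
Such a global helper routine could look like the following sketch. The routine name throw_if_error_response is hypothetical, and the code assumes the text-based wire format shown in the conversation above. It does not deal with quote characters embedded in the error message, which a real protocol would have to escape.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: every request routine passes the raw response
// through this check before looking for its own specific answer.
// Assumes the wire format  ERROR: "message text"
inline void throw_if_error_response ( const std::string & response )
{
  const std::string prefix = "ERROR: \"";

  if ( response.compare( 0, prefix.size(), prefix ) != 0 )
    return;  // Not an error, let the caller parse the specific response.

  // Strip the prefix and the closing quote, if present.
  std::string msg = response.substr( prefix.size() );

  if ( !msg.empty() && msg[ msg.size() - 1 ] == '"' )
    msg.erase( msg.size() - 1 );

  throw std::runtime_error( "The remote device reported an error: " + msg );
}
```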

Remember to allow for international characters in error messages. Specify from day one whether the string encoding standard is UTF-8, UTF-32 or whatever you see fit. You should avoid the need to specify a codepage alongside the string.

Delivering Error Information between Threads

Every thread should have 2 error-handling layers at top level. The outer layer should set up the top error handler, perform some basic initialisation and finally enter the inner layer. The inner layer should set up the second-level error-handler and execute the rest of the thread code.

Should an exception make it all the way up to the thread's top level, the inner layer will catch it and try to forward the associated error information to the main thread. If that is not possible, it should store, log or display the error information somehow, and then terminate the application abruptly.

If the basic initialisation at the outer layer fails, or if the inner layer fails to process an error correctly, the outer layer should terminate the application abruptly. Only the most basic and safe ways to log error information should be attempted. The main objective is to prevent an unhandled C++ exception from unwinding the stack into the Operating System or library call that created the thread. The top-level catch handler in the outer layer should rarely be executed and would be a good candidate for an assert( false ); statement.
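
The two layers could be sketched as follows. ErrorMailbox, thread_inner_layer and thread_outer_layer are hypothetical names, and the mailbox is just one assumed way of forwarding the error text to the main thread.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical mailbox that the main thread polls for worker errors.
struct ErrorMailbox
{
  std::mutex  mutex;
  std::string message;  // Empty means no error.

  void post ( const std::string & msg )
  {
    std::lock_guard< std::mutex > lock( mutex );
    message = msg;
  }
};

// Inner layer: runs the thread's real work and forwards any error.
inline void thread_inner_layer ( const std::function< void () > & body,
                                 ErrorMailbox & mailbox )
{
  try
  {
    body();
  }
  catch ( const std::exception & e )
  {
    mailbox.post( e.what() );
  }
  catch ( ... )
  {
    mailbox.post( "Unexpected C++ exception" );
  }
}

// Outer layer: last line of defence, should rarely trigger.
inline void thread_outer_layer ( const std::function< void () > & body,
                                 ErrorMailbox & mailbox )
{
  try
  {
    thread_inner_layer( body, mailbox );
  }
  catch ( ... )
  {
    // Nothing may escape into the code that created the thread.
    assert( false );
    std::fputs( "Fatal error at thread top level.\n", stderr );
    std::abort();
  }
}
```

With C++11 threads, the thread would then be started with something like std::thread t( thread_outer_layer, body, std::ref( mailbox ) ). If posting to the mailbox itself fails, for example due to lack of memory, the exception reaches the outer layer, which terminates the application.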

Delivering Error Information about Failed Tasks

If you have a worker thread that processes tasks, you should add an m_errorMessage string member to the task's base class. You will probably not be able to store any std::exception object you catch, so you will have to copy the necessary information out of it by hand, and a flexible error string will almost always suffice.

The task's submitter should check for an eventual error message when the task is completed. If the error string is not empty, it should consider the task as failed and forward the error information as necessary.
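
Here is a minimal sketch of such a task base class. Only m_errorMessage comes from the text above. The class names and the other members are hypothetical.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical task base class.
class Task
{
public:
  virtual ~Task () {}

  // Called by the worker thread. Never lets an exception from execute() escape.
  void run ()
  {
    assert( !m_isCompleted );

    try
    {
      execute();
    }
    catch ( const std::exception & e )
    {
      m_errorMessage = e.what();  // Copy the text out, the exception object dies here.
    }
    catch ( ... )
    {
      m_errorMessage = "Unexpected C++ exception";
    }

    m_isCompleted = true;
  }

  // To be called by the submitter once the task has completed.
  bool has_failed () const
  {
    assert( m_isCompleted );
    return !m_errorMessage.empty();
  }

  const std::string & get_error_message () const { return m_errorMessage; }

protected:
  virtual void execute () = 0;

private:
  std::string m_errorMessage;  // Empty means success.
  bool m_isCompleted = false;
};

// Example task that always fails.
class FailingTask : public Task
{
protected:
  void execute () override
  {
    throw std::runtime_error( "Cannot open file \"data.bin\"." );
  }
};
```

This sketch relies on error messages never being empty, as an exception with an empty message would look like success.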

Just because an error happened processing one of many tasks on one of many threads does not mean that rich error information must get lost along the way, even if a particular task was submitted a long time ago by one of many users across a wide area network. You must make sure that each user gets the corresponding error message for every task he ever submitted. If the user can get an indication that a task has been completed, it must also be possible to get a proper error message if the task failed.

Internationalisation

A full translation is often required by law when selling to foreign public institutions, and error messages will probably be affected too. Preparing error messages for translation is not expensive when writing new code, and much more costly if it's an afterthought.

Selecting the User Language

Error messages should be in a language the user is comfortable with, be it English, German, Spanish or whatever he happens to speak. Sometimes it would be helpful to get an error message in several languages at the same time, but this normally isn't worth the development effort.

Your code should not guess the user's language based on the current operating system and the like at the point where an error happens. Instead, the language should be determined upfront. If you are writing a library, you could pass some language ID to the library's init() routine. Keep in mind that your library may end up in some multi-user server, so it's best to negotiate the language early as part of the connection handshake.

Providing Translations

You don't want human translators to be looking for scattered character strings in C++ source code they do not understand, so it is best to move all user messages to one or more separated files. There are several options:

  1. Under Windows, you can use resource files.
  2. In POSIX systems, there are message catalogs. Look for catgets.
  3. Use some library like gettext.
  4. Write your own message table system.

There are web-based tools that help translate user messages by providing hints like similar translations already available in the database.

Note that gettext has some drawbacks:

  1. It's slow, as the English strings are hashed and then used as keys to access the translated equivalent.
  2. English is faster than the other languages, which is not only unfair, but may hide performance problems from English-speaking developers.

Writing your own message table system may actually be the simplest and fastest option. Just remember that, if a translation is not available, you should retrieve the English version as a fall-back.

The solution proposed below has the following advantages:

  1. Messages are retrieved by index from an array, so it is fast.
  2. The message tables are easy to maintain.
  3. It is all standard source code, so you can use standard diff tools.
  4. Unlike resource-based alternatives, this solution is portable, you just recompile the same code on the next target platform.

Each module can define its own string table as follows:

enum module_1_messages
{
  msg_invalid_code = 0, // Zero is often a default memory value, so making it invalid helps catch bugs.
  msg_my_error,
  msg_my_warning,
  msg_my_info
};

const MSG_TABLE msg_table_english
{
  MSG_DEF( msg_invalid_code, NULL ),
  MSG_DEF( msg_my_error,    "My error" ),
  MSG_DEF( msg_my_warning,  "My warning" ),
  MSG_DEF( msg_my_info,     "My info" )
};

const MSG_TABLE msg_table_german
{
  MSG_DEF( msg_invalid_code, NULL ),
  MSG_DEF( msg_my_error,    "Mein Fehler" ),
  MSG_DEF( msg_my_warning,  "Meine Warnung" ),
  MSG_DEF( msg_my_info,     "Meine Information" )
};

Macro MSG_DEF can ignore the supplied enum symbol, but still reference it for compilation purposes; you can achieve this trick with C++'s comma operator. That way, compilation will fail if the enum symbol is not defined. Alternatively, you may leave the enum symbols out in release builds. You can also add a static_assert to check that the number of enum symbols matches the length of the message table.
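
One possible way to write such a MSG_DEF macro, together with the static_assert check, is sketched below. The MSG_ENTRY type, the msg_count symbol and the get_english_msg() routine are hypothetical additions. The sketch assumes C++11 and uses nullptr instead of NULL, because the result of a comma expression is not a null pointer constant unless its type is std::nullptr_t.

```cpp
#include <cassert>
#include <cstddef>

enum module_1_messages
{
  msg_invalid_code = 0,
  msg_my_error,
  msg_my_warning,
  msg_my_info,
  msg_count  // Hypothetical extra symbol, must remain the last one.
};

struct MSG_ENTRY  // Hypothetical table entry type.
{
  const char * str;
};

// The comma operator evaluates and discards the enum symbol, then yields
// the string. Compilation fails if the enum symbol is not defined.
#define MSG_DEF( enum_symbol, msg_str )  { ( (void)( enum_symbol ), ( msg_str ) ) }

const MSG_ENTRY msg_table_english[] =
{
  MSG_DEF( msg_invalid_code, nullptr ),
  MSG_DEF( msg_my_error,     "My error" ),
  MSG_DEF( msg_my_warning,   "My warning" ),
  MSG_DEF( msg_my_info,      "My info" )
};

static_assert( sizeof( msg_table_english ) / sizeof( msg_table_english[ 0 ] ) ==
               static_cast< std::size_t >( msg_count ),
               "The message table length does not match the enum." );

inline const char * get_english_msg ( const module_1_messages msg_code )
{
  assert( msg_code > msg_invalid_code && msg_code < msg_count );
  return msg_table_english[ msg_code ].str;
}
```

The macro only checks that the symbol exists. It does not verify that each entry sits at the array position given by its enum value, so that check is still left to a separate tool or to a debug-only routine.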

You may want to turn MSG_TABLE into a macro that takes the language ID as an argument and generates the table name automatically.

The advantage of this table-driven approach is that an automated tool can scan the message tables and make sure you haven't skipped any enumerated symbols. Error messages are not tested often, so such a tool could also check that all the translated format strings have the same number of arguments, and that the argument types are also the same. You may not need an external tool, you could implement it in the same source code and let it run on debug builds only.

A string access consists of a fast, index-based array access:

msg_table[ msg_my_warning ].str

For performance reasons, you may want to assign msg_table to msg_table_english or msg_table_german just once at start-up time.

You could also wrap the string access with an inline routine, in order to make sure that a valid enum symbol is used as an array index. The code would then look like this:

   inline const char * get_msg ( const module_1_messages msg_code )
   {
     assert( msg_table != NULL );  // Global pointer already initialised?
     assert( msg_code != msg_invalid_code );
     const char * const translated_msg = msg_table[ msg_code ].str;

     if ( translated_msg == NULL )
     {
       assert( false );  // Missing translation.
       return msg_table_english[ msg_code ].str;  // Return the English version as a fall-back.
     }
     else
     {
       return translated_msg;
     }
   }
   ...
   printf_e( get_msg( msg_my_warning ) );

The main drawback of the table-based approach is the inconvenience of adding a new message. The steps are:

  1. Add an enum symbol.
  2. Add a corresponding MSG_DEF entry.
  3. Instead of writing a format string directly in the printf() call, remember to type get_msg( msg_code ) every time.

You need to resist the temptation of placing all messages into a single message table. That may make translation easier, but it will also make it harder to share and re-purpose source code. Each logical module should have its own message table.

Try to keep the code together with its message table. When writing embedded software, there is often the proposal of sending just the message ID and its parameter in a compact message to the PC application, which can then perform the message look-up and the format string processing. Separating the message IDs from the message strings will cause versioning issues and will make the embedded code harder to share, so you should evaluate first whether the message tables are actually so big that they will not fit in the target device even in a compressed form. Bear in mind that many devices end up having a user interface that runs independently from the PC counterpart application. Even if there is no physical display, there may be a serial port or a future network interface that may need to send localised user messages too.

Using Positional Arguments

Messages like "Error %u found in %s when %s at time %s" are difficult to translate, as some languages often impose a certain syntactical order when writing a sentence. For example, in German, the time indication usually comes first, then the manner, and finally the place. The %s placeholder does not allow for the necessary rearrangement of these components based on the user's language.

This is where positional arguments come into play. In the Microsoft world (look at FormatMessage's documentation), a format string looks like this:

Error %1 found in %2 when %3 at time %4

A human translator may then reorder the arguments in the translated error message without modifying the C++ source code.

There are minor inconveniences with this syntax: because a placeholder actually looks like %n!format string! , in order to generate a string like "Error opening file xxx!", the ! character must be escaped. Furthermore, if the translator needs to place a digit right after a %N marker (which is unlikely), he needs to know about the full %n!format string! syntax.

There is a POSIX extension for positional arguments in printf() too, but you need to check if your platform implements it. It looks like this:

Error %1$u found in %2$s when %3$s at time %4$s

You may want to write your own hardened printf() substitute, as a single mistake in a translated format string, like writing %s instead of %d, can open a security hole or bring the application down if you are using the standard C runtime library version.
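
As an illustration, a very reduced hardened formatter could look like this. The routine name format_positional is hypothetical, and the sketch assumes that all arguments have already been converted to strings. It only supports the placeholders %1 to %9 and the %% escape sequence, so it is neither FormatMessage's nor printf()'s syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical hardened formatter. Supports %1 to %9 and %% only.
// All arguments are passed as strings, so a faulty format string cannot
// misinterpret memory. The worst outcome is an exception.
inline std::string format_positional ( const std::string & format_str,
                                       const std::vector< std::string > & args )
{
  std::string result;

  for ( std::size_t i = 0; i < format_str.size(); ++i )
  {
    const char c = format_str[ i ];

    if ( c != '%' )
    {
      result += c;
      continue;
    }

    if ( i + 1 >= format_str.size() )
      throw std::runtime_error( "The format string ends with a lone % character." );

    const char next = format_str[ ++i ];

    if ( next == '%' )
    {
      result += '%';
    }
    else if ( next >= '1' && next <= '9' )
    {
      const std::size_t index = static_cast< std::size_t >( next - '1' );

      if ( index >= args.size() )
        throw std::runtime_error( "The format string references a missing argument." );

      result += args[ index ];
    }
    else
    {
      throw std::runtime_error( "Invalid placeholder in the format string." );
    }
  }

  return result;
}
```

A translator can then reorder the placeholders freely, for example turning "Error %1 found in %2" into "In %2: Fehler %1".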

Helper Routines

Writing robust code in the face of errors is a chore that can be alleviated with the right set of helper routines. In my experience, you don't need too many of them, most of them are trivial wrappers, and, once written, they tend to remain stable. What matters most is the convention and associated development policy to use them whenever appropriate.

It is unfortunate that the C++ language and standard libraries offer little help when dealing with errors or writing resource clean-up code, which often leads to wheel reinvention.

Throw Utility

Instead of throwing an error object directly whenever an error is detected, you may want to use a LogAndThrow( errObj ) routine instead. This way, you could easily turn some compilation option on in order to dump the call stack at the point of error to a log file, in case a developer needs such detailed information to track down a nasty bug.

In some platforms, it's hard to collect the call stack at runtime, and retrieving a crash dump file from embedded targets may not be easy either. In order to help during development, you could provide a THROW_ERROR macro that automatically collects (maybe for debug builds only) the source code position where the error has been raised. Use compiler-provided macros like __FILE__, __LINE__ and __PRETTY_FUNCTION__ for this purpose.
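
A THROW_ERROR macro along those lines could be sketched as follows. The helper name throw_error_at is hypothetical, and the sketch assumes that the NDEBUG symbol distinguishes debug from release builds. It leaves __PRETTY_FUNCTION__ out because that macro is a compiler extension.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper behind the THROW_ERROR macro.
[[noreturn]] inline void throw_error_at ( const char * file, int line,
                                          const std::string & msg )
{
  assert( file != nullptr );

#ifndef NDEBUG
  // Debug builds append the source code position to the error message.
  std::ostringstream text;
  text << msg << " [raised at " << file << ":" << line << "]";
  throw std::runtime_error( text.str() );
#else
  // Release builds do not leak the source code position.
  (void) file;
  (void) line;
  throw std::runtime_error( msg );
#endif
}

#define THROW_ERROR( msg )  throw_error_at( __FILE__, __LINE__, ( msg ) )
```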

If you are writing a library and you provide such an optional feature, remember to mention that it may not be desirable to leak so much technical information to the end user. See section "Preventing Sensitive Information Leakage" for more information.

Even if you don't plan to collect detailed error information, a throw utility routine may prove useful in the future. For example, at the flick of a switch you could start logging basic error information for statistical purposes, like finding out how many errors a user gets to see in a day.

You will probably want to write a SimulateError() routine in order to quickly inject an error for testing purposes. In fact, developers should be expected to routinely test their new code for robustness with the help of this one routine. If you make your SimulateError() routine only available during debug builds, you will not forget to remove all such test errors from the source code when compiling the release build.

Standard Error Strings

The only ones you'll ever need at a global level are "Out of memory" and "Unexpected C++ exception".

Resist the temptation of adding a generic message like "Invalid argument". That is the indication all other developers are waiting for. It means that they do not need to bother generating precise, helpful error messages any more.

Global Panic Routines

You should provide a routine like this:

void Panic ( const char * format_str, ... );

See section "Abrupt Termination" above for more information about why these routines are desirable.

String-handling helpers to build error messages

I recommend using the standard std::string class. It is important to provide the ability to build custom error messages with a familiar printf-style format string. Otherwise, most developers will automatically resort to writing simple, generic, fixed error messages. Routines like this would be helpful:

  • std::string format_msg ( const char * format_str, ... );
  • void format_buffer ( std::string * result, const char * format_str, ... );

The format_buffer() variant allows you to re-use the same std::string instance for performance.
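
Both routines could be implemented on top of vsnprintf(), as in the following sketch. It assumes a C99-conforming vsnprintf() that returns the required length when the buffer is too small. The shared helper format_buffer_v is a hypothetical name.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Common implementation for both routines.
inline void format_buffer_v ( std::string * result, const char * format_str, va_list args )
{
  assert( result != nullptr );
  assert( format_str != nullptr );

  // First pass: find out the required length. Work on a copy, because
  // a va_list cannot be reused after vsnprintf() has consumed it.
  va_list args_copy;
  va_copy( args_copy, args );
  const int len = std::vsnprintf( nullptr, 0, format_str, args_copy );
  va_end( args_copy );

  if ( len < 0 )
    throw std::runtime_error( "Error formatting a message string." );

  // Second pass: generate the text, including the trailing NUL character.
  std::vector< char > buffer( static_cast< std::size_t >( len ) + 1 );
  std::vsnprintf( buffer.data(), buffer.size(), format_str, args );

  result->assign( buffer.data(), static_cast< std::size_t >( len ) );
}

inline void format_buffer ( std::string * result, const char * format_str, ... )
{
  va_list args;
  va_start( args, format_str );

  try
  {
    format_buffer_v( result, format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );
}

inline std::string format_msg ( const char * format_str, ... )
{
  std::string result;
  va_list args;
  va_start( args, format_str );

  try
  {
    format_buffer_v( &result, format_str, args );
  }
  catch ( ... )
  {
    va_end( args );
    throw;
  }

  va_end( args );
  return result;
}
```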

You may want to provide other versions that take a translated message ID instead of an inline format string.

Utility functions that collect OS error messages

You need to invest the time upfront to make it easy to collect proper error information from system calls, because most developers will fail to do it themselves later on whenever the opportunity arises. Sometimes it's not as straightforward as it may seem. For example, collecting the error message from an errno code is not easy, as strerror() has threading issues, its thread-safe variants are often not portable, and, even within the GNU/Linux platform, strerror_r()'s prototype has changed between glibc versions. No wonder that, if a helper routine has not been implemented beforehand, it's hard to find the time for such a complicated distraction when you are focused on some other task.

Common system error sources are:

  1. errno
  2. dlerror from the Unix Shared Object loader
  3. GetLastError in Windows
  4. IErrorInfo / _com_error from Windows OLE/ActiveX/COM APIs

The following routine makes it easy to build a custom error message after an errno-style error has been detected:

std::string format_errno_msg ( int errno_val,
                               const char * prefix_msg_format_str,
                               ... );

Such a routine should do the following:

  1. Call strerror() or strerror_r() in order to get the error message from the Operating System that corresponds to the given errno code.
  2. Build a flexible prefix text by calling vasprintf() or similar.
  3. Join the prefix together with the OS error message, and return it.
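
The three steps above can be sketched as follows. This is only one possible implementation. It assumes a POSIX platform that provides strerror_r(), and it truncates over-long texts because it uses fixed-size buffers. The two strerror_result() overloads are hypothetical adapters that cope with both strerror_r() prototypes.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdarg>
#include <cstdio>
#include <string>
#include <string.h>  // strerror_r() is POSIX, not standard C++.

// Adapter for the XSI variant, which returns an int and fills the buffer.
inline const char * strerror_result ( int ret, const char * buffer )
{
  return ret == 0 ? buffer : "Unknown error";
}

// Adapter for the GNU variant, which returns the message pointer.
inline const char * strerror_result ( const char * ret, const char * )
{
  return ret;
}

inline std::string format_errno_msg ( int errno_val,
                                      const char * prefix_msg_format_str,
                                      ... )
{
  assert( prefix_msg_format_str != nullptr );

  // Step 1: get the error message from the Operating System.
  char os_msg_buffer[ 256 ];
  os_msg_buffer[ 0 ] = '\0';
  const std::string os_msg =
    strerror_result( strerror_r( errno_val, os_msg_buffer, sizeof( os_msg_buffer ) ),
                     os_msg_buffer );

  // Step 2: build the prefix text. Over-long prefixes get truncated.
  char prefix[ 512 ];
  va_list args;
  va_start( args, prefix_msg_format_str );
  const int len = std::vsnprintf( prefix, sizeof( prefix ), prefix_msg_format_str, args );
  va_end( args );

  if ( len < 0 )
    prefix[ 0 ] = '\0';  // Formatting failed, keep at least the OS message.

  // Step 3: join both parts.
  return std::string( prefix ) + os_msg;
}
```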

Usage example:

if ( 0 != sigaction( signal_number, &act, NULL ) )
{
  const int errno_code = errno;  // errno is very volatile, so capture its value straight after the error.
  throw std::runtime_error( format_errno_msg( errno_code, "Error setting signal handler %d: ", get_signal_number() ) );
}

Manually storing errno in a temporary variable is a pain, but you should get used to saving the errno value directly after the error occurred. In the following alternative implementations, get_signal_number() could inadvertently destroy errno. That may not happen today, but it could in the future, if get_signal_number()'s implementation changes, or if another developer adds more function calls to the flexible parameter list. This is another drawback of the errno mechanism.

// Alternative 1: If format_errno_msg() automatically retrieves errno itself:

if ( 0 != sigaction( signal_number, &act, NULL ) )
{
  // WRONG: Before format_errno_msg() runs, get_signal_number() may overwrite errno.

  throw std::runtime_error( format_errno_msg( "Error setting signal handler %d: ", get_signal_number() ) );
}


// Alternative 2: If we pass errno directly to format_errno_msg():

if ( 0 != sigaction( signal_number, &act, NULL ) )
{
  // WRONG: The order in which function arguments are evaluated is unspecified. Therefore, get_signal_number() may get called
  //        and overwrite errno before errno's value is retrieved and pushed to the stack.

  throw std::runtime_error( format_errno_msg( errno, "Error setting signal handler %d: ", get_signal_number() ) );
}

Wrappers for OS syscalls

You will need wrappers for open(), fopen() and so on. Try to stick with an appropriate suffix; I suggest open_e(), fopen_e(), etc. They tend to be very thin and very similar, so write them on demand. For example:

void setsockopt_e ( int socket_fd,
                    int level,
                    int option_name,
                    const void * option_value,
                    socklen_t option_len )
{
  if ( 0 != setsockopt( socket_fd, level, option_name, option_value, option_len ) )
  {
    const int errno_code = errno;  // errno is very volatile, so capture its value straight after the error.
    throw std::runtime_error( format_errno_msg( errno_code, "Error setting socket option: " ) );
  }
}

It's best to write 2 wrapper functions for the open() syscall named open_e() and open_if_exists_e(). The first version should be called when the caller expects the file to be there, that is, when the file is required. The other version should return a boolean indication if the file is not there, and throw an exception for any other kind of error like insufficient access rights for an existing file. The latter should be used when a file is optional and its absence does not constitute a critical error (think of the bash shell looking for all its optional configuration files on start-up). See section "Do Not Make Logic Decisions Based on the Type of Error" for more information.
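
Both wrappers could be sketched like this. The exact signatures are an assumption, as the text above only names the routines. The sketch expects flags without O_CREAT, and it calls strerror() only for brevity, despite its threading issues.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <stdexcept>
#include <string>

// Builds the exception for a failed open() call.
inline std::runtime_error make_open_error ( const char * filename, int errno_code )
{
  return std::runtime_error( std::string( "Error opening file \"" ) + filename +
                             "\": " + std::strerror( errno_code ) );
}

// Use this version when the file is required. Throws on any error.
inline int open_e ( const char * filename, int flags )
{
  assert( filename != nullptr );

  const int fd = open( filename, flags );

  if ( fd == -1 )
  {
    const int errno_code = errno;  // Capture errno straight after the error.
    throw make_open_error( filename, errno_code );
  }

  return fd;
}

// Use this version when the file is optional. Returns false if the file
// does not exist, and throws for any other kind of error.
inline bool open_if_exists_e ( const char * filename, int flags, int * fd )
{
  assert( filename != nullptr );
  assert( fd != nullptr );

  const int result = open( filename, flags );

  if ( result == -1 )
  {
    const int errno_code = errno;  // Capture errno straight after the error.

    if ( errno_code == ENOENT )
      return false;

    throw make_open_error( filename, errno_code );
  }

  *fd = result;
  return true;
}
```

The decision about whether a missing file is an error is taken once, inside the wrapper, so the callers do not need to inspect error codes themselves.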

You should provide (and always use) a wrapper for printf() as follows. See section "Checking Errors from printf()" above for more information:

void printf_e ( const char * format_str, ... );

In the case of close(), you will probably want a close_p() instead, in order to issue a panic if it fails. See section "Abrupt Termination" above for information about why this may be desirable for routines that release system resources.

A malloc() wrapper could also prove useful, see section "Handling an Out-of-Memory Situation" for details.

Wrapper classes for resources

Wrapper classes implement automatic resource clean-up, as destructors are guaranteed to be executed in the event of C++ exceptions or early subroutine return.

Examples of resources to protect would be a file descriptor or the FILE * structure returned from fopen().

There are 2 approaches when writing these classes:

  1. Smart pointer-like classes, where the class instance can be used instead of the resource.
  2. Automatic clean-up classes that only handle the clean-up side. You still need the original resource around.

Smart Pointer-Like Classes

Here is a usage example for a hypothetical FILE_wrapper class:

FILE_wrapper f;  // Note there is no pointer in this declaration.

f = fopen_e();

do_whatever( f );  // This routine expects a "FILE *", so class FILE_wrapper must provide
                   // a type-cast operator for that pointer type.

f.close();  // Optional, helps prevent scoping mistakes. This also makes sure that
            // the file is closed at this point, in case you want to re-open it straight away.

Such smart pointer-like classes are convenient, but have a number of drawbacks:

  1. They are much harder to write.
    You have to provide constructors, assignment and type-cast operators, and so on, in order to cater for all possible uses of the FILE * object they are substituting.
  2. They introduce subtle C++ syntax pitfalls.
    For example, when using the 'address of' operator (&) for FILE_wrapper, you would get the address of the FILE_wrapper object itself, not the address of the "FILE *" pointer it substitutes. You can work-around this problem by defining "operator &" too, but then you have to provide a way to get the address of the FILE_wrapper object itself, just in case. Furthermore, what happens if a routine takes as a parameter a reference to a "FILE *" pointer? What if a routine used to do "if ( file1 == file2 )" in the past, and now those variables are no longer normal pointers?

Automatic Clean-Up Classes

Here is a usage example for a hypothetical auto_fclose class:

FILE * const f = fopen_e();

auto_fclose auto_f( f );
  or
auto_fclose auto_f;
auto_f.attach( f );

do_whatever( f );

auto_f.close();  // Optional, helps prevent scoping mistakes. This also makes sure that
                 // the file is closed at this point, in case you want to re-open it straight away.

This kind of wrapper class is very easy to write. The code that uses them is rather obvious too, as there are no surprises like hidden typecast operator calls and the like.

The main disadvantage is that f and auto_f are held separately, so someone could easily access f after auto_f has gone out of scope. In order to prevent such mistakes, you could limit the scope of f as much as possible, like this:

auto_fclose auto_f;

{ // Scope for f.
  FILE * const f = fopen_e();
  auto_f.attach( f );
}

do_whatever( auto_f.get_FILE() );

Beware that attach() must never fail, that is, it should be declared with throw() (or with noexcept from C++11 onwards), indicating that it never throws an exception. Otherwise, you'll end up leaking the file pointer.

You may be able to dispense with f altogether:

auto_fclose auto_f( fopen_e() );

do_whatever( auto_f.get_FILE() );
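
A minimal sketch of such an auto_fclose class follows, assuming C++11. The is_open() method is a hypothetical addition. Contrary to the close_p() recommendation above, this close() throws an exception instead of issuing a panic, just to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

class auto_fclose
{
public:
  auto_fclose () noexcept : m_file( nullptr ) {}
  explicit auto_fclose ( std::FILE * f ) noexcept : m_file( f ) {}

  // Copying would close the file twice, so forbid it.
  auto_fclose ( const auto_fclose & ) = delete;
  auto_fclose & operator= ( const auto_fclose & ) = delete;

  ~auto_fclose ()
  {
    if ( m_file != nullptr )
      std::fclose( m_file );  // A destructor must not throw, so errors are ignored here.
  }

  void attach ( std::FILE * f ) noexcept
  {
    assert( m_file == nullptr );
    m_file = f;
  }

  bool is_open () const noexcept { return m_file != nullptr; }

  std::FILE * get_FILE () const noexcept
  {
    assert( m_file != nullptr );
    return m_file;
  }

  void close ()
  {
    assert( m_file != nullptr );
    std::FILE * const f = m_file;
    m_file = nullptr;  // The stream is gone after fclose(), even if it fails.

    if ( 0 != std::fclose( f ) )
    {
      const int errno_code = errno;  // Capture errno straight after the error.
      throw std::runtime_error( std::string( "Error closing file: " ) +
                                std::strerror( errno_code ) );
    }
  }

private:
  std::FILE * m_file;
};
```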

Custom Smart Pointers

An example would be a class like auto_ptr_free for memory pointers obtained through malloc().
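
If C++11 is available, you may not need to write such a class by hand, as std::unique_ptr accepts a custom deleter. The following sketch defines auto_ptr_free as an alias on top of it. The names free_deleter and duplicate_string are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>

// Deleter that releases the memory with free() instead of delete.
struct free_deleter
{
  void operator() ( void * p ) const noexcept { std::free( p ); }
};

template < typename T >
using auto_ptr_free = std::unique_ptr< T, free_deleter >;

// Usage example: duplicates a C string into memory obtained through malloc().
inline auto_ptr_free< char > duplicate_string ( const char * str )
{
  assert( str != nullptr );

  const std::size_t size = std::strlen( str ) + 1;

  auto_ptr_free< char > copy( static_cast< char * >( std::malloc( size ) ) );

  if ( copy == nullptr )
    throw std::bad_alloc();

  std::memcpy( copy.get(), str, size );

  return copy;  // The memory is released automatically when the last owner goes away.
}
```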

Error Handling in Some Specific Platforms

C++ Standard Template Library (STL)

The STL designers seem to have put a lot of effort into avoiding making C++ exceptions a hard requirement in practice. The most common STL implementations still work even if C++ exception support is turned off when compiling.

As long as you use the STL containers carefully and do not trigger any errors, everything will work fine. You don't miss exceptions very much because the STL has always favoured speed over robustness and does little error checking except for some debug-only assertions. Using an invalid iterator, for example, leads quickly to fatal crashes.

Note that the official documentation still states that the library reports errors by throwing exceptions. In fact, an out-of-memory condition when inserting elements into an STL container can only be reported by throwing a C++ exception, as the library has no other way to return such an error indication. Without C++ exception support, a release build will fail or even crash without warning if that happens. However, out-of-memory conditions are often ignored altogether during development, see section "Handling an Out-of-Memory Situation" above, so this particular shortcoming does not show up in practice either.

If you look at the STL's source code, you'll see a few other error cases that are reported via exceptions. For example, class std::bitset can throw std::overflow_error if there are too many bits to fit, and method at() in class std::string will throw an std::out_of_range exception for an invalid character index.
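
Both cases are easy to reproduce, as the following small sketch shows. The routine names are hypothetical. For std::bitset, the exception is raised by the to_ulong() conversion, and the example assumes that unsigned long is narrower than 128 bits.

```cpp
#include <bitset>
#include <cassert>
#include <stdexcept>
#include <string>

// Returns true if std::bitset reports the overflow with an exception.
inline bool bitset_conversion_overflows ()
{
  std::bitset< 128 > bits;
  bits.set( 127 );  // This value does not fit in an unsigned long.

  try
  {
    (void) bits.to_ulong();
  }
  catch ( const std::overflow_error & )
  {
    return true;
  }

  return false;
}

// Returns true if std::string::at() reports the invalid index with an exception.
inline bool string_at_detects_bad_index ()
{
  const std::string text = "abc";

  try
  {
    (void) text.at( 10 );
  }
  catch ( const std::out_of_range & )
  {
    return true;
  }

  return false;
}
```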

The C++ Standard Library

Apart from the C++ STL (see section above), the rest of the standard C++ library does not handle error information well.

Take a look at class ofstream, for instance. After calling its constructor, you need to manually check the error status flags (failbit and badbit), which is very easy to forget. If an error happens, any detailed error information is lost, as those are just boolean flags. That alone is a good enough reason to avoid using the stream library altogether.

There are a few other places where exceptions may come up. For example, operator new can throw an std::bad_alloc exception, a failed dynamic_cast to a reference type throws an std::bad_cast exception, and applying typeid to a dereferenced NULL pointer of a polymorphic type throws an std::bad_typeid exception.

Use std::runtime_error in your own programs.

Internationalisation Considerations under Windows

The error message returned by std::exception::what() is of const char * type. Under modern Linux versions, you can safely assume that the string will be coded in UTF-8, so it should be safe for any Unicode character. However, I could not find any mention of the character encoding in the documentation. I assume Windows will use the current 8-bit codepage, so you have to be careful when porting Unix code or when converting the text to UTF-16, which is what the Win32 API uses.

Windows GDI+

Microsoft's C++ wrapper classes for the system's GDI+ library do not use C++ exceptions. Most methods return an error code, see the Gdiplus::Status enumeration for more information. If you have trouble navigating the sources, search for InvalidParameter in file GdiplusTypes.h. Therefore, you have to either constantly check the return value after each GDI+ call or write a trivial wrapper function for the most common ones.

GDI+ constructors can also fail and have no return type to check. You are supposed to call a GetLastStatus() method after each constructor call in case something went wrong. Look at the documentation for Font::GetLastStatus() for an example.

Qt

Qt's routines return error codes wherever necessary, because the Qt project actually predates the C++ exception specification. Recent Qt versions are exception safe, so that you can use exceptions in your Qt applications.

Microsoft Foundation Classes (MFC)

Microsoft's MFC uses its own CException base class, which is not derived from std::exception. If you decide not to use std::runtime_error, you will have to write your own CException-derived class, because none of the MFC exception classes can store a variable-length error message.

The convention is to throw a pointer to a CException object, which can easily lead to a memory leak, see section "Never Throw Pointers to Dynamically-Allocated Objects" for more information.

Java

Always use java.lang.RuntimeException. Do not bother with checked exceptions, see "Classifying Errors into Different Error Categories" for more information.

Beware that Java does not support deterministic destructors, which is risky when they are supposed to release system resources other than plain memory, like for example a file handle. For reliable clean-up behaviour, use try/finally blocks, look at the Closeable Interface for more information. You can also resort to the newer "try-with-resources" statement.

Microsoft's .NET C#

C# does not support deterministic destructors, which is risky when they are supposed to release system resources other than plain memory, like for example a file handle. For reliable clean-up behaviour, use try/finally blocks. You can also resort to the newer using statement, look at the IDisposable Interface for more information.

When you close a listening network socket gracefully, the callback routine gets called, and when this routine calls EndAccept(), an exception is raised. That is, a graceful close is reported as an error, which is bad practice.

Perl 6

Perl 6 introduces a 'user-facing' trait for routines, so that an associated message string appears in the call-stack dump for each tagged routine the exception propagates through. All other non-tagged routines do not come up in the call-stack dump.

I don't think this compromise really helps. A developer will always want to see the full call-stack, and an end-user will probably not benefit much from seeing an automatically-generated, ugly, partial call stack. The trait syntax does not appear to conveniently lend itself to internationalisation either. And the developer ends up having to maintain a second name or short description for many routines, which is tedious.

Python

Python has the with keyword and the __enter__ and __exit__ methods, which help factor out standard uses of try/finally statements.

Shell Scripting

Shell scripting has so many gotchas that it is virtually impossible to get the error handling right. But nevertheless, here are a few things you should try.

Set the errexit flag

Place this command at the beginning of your shell script:

 set -o errexit   # Equivalent to "set -e".

That works for most tools, but some need special care. For example, diff returns the following exit codes:

0 if inputs are the same
1 if different
2 if trouble

That means that exit code 1 may not be considered an error in your shell script. You need to turn off the errexit flag temporarily for such tools. Alternatively, you can write a couple of helper functions that help save and restore that flag. Fortunately, such exit code deviations are rare.

Avoid the AND and OR Operators in Shell Sentences

Avoid the AND ('&&') and OR ('||') operators in shell sentences, especially in combination with shell functions, because they temporarily turn errexit error-detection off, even in functions called within those expressions. For example, consider this script:

set -o errexit

f() {
  this_command_does_not_exist
  echo "The error above has been ignored."
}

f && echo "Hello!"

The script's output is:

test.sh: line 4: this_command_does_not_exist: command not found
The error above has been ignored.
Hello!

Set the nounset flag

set -o nounset   # Equivalent to "set -u".

You should get used to declaring all variables before using them. Otherwise, it's too easy to make mistakes which go unnoticed for a long time.

Set the pipefail flag

set -o pipefail

Without this flag, which bash supports but not every shell does, failures in some pipe components are ignored.

Expat XML Parser

The Expat XML Parser is a typical example of a library with a sub-optimal error-handling strategy.

Expat is a stream-based (push) parser: you register your call-back routines and then feed the parser with data chunks as they arrive (from a file, from a network socket, etc). Your call-back routines get called during the XML parsing phase so that you can perform any operations you wish straight away without having to wait for the whole XML file to come through. You could for example process a big Base64-encoded data block in small chunks as they trickle in. The alternative would be to parse the entire XML file beforehand and store all data in a temporary memory area before processing it, which may lead to increased memory usage and latency.

If an XML parsing error occurs, you get an error code back that you can convert to an error message. You also get the error's position (line and column numbers).

The first thing I noticed is that the error string for each code is fixed. You could never get an error message like "Tag 'mytag' is invalid", because "mytag" depends on the XML input.

Then I realised that these errors are not meant to represent generic errors, but only XML parsing errors. The only non-XML error code is XML_ERROR_NO_MEMORY. These errors are not actually the kind of errors that this document is all about: they are part of the normal parser output. Such errors are expected, and reporting them is what the parser is there for. If you think of them as "syntax violations" instead of "errors", the picture gets a little clearer. You can think of these errors as compilation errors: another XML parser might not stop at the first error, but carry on and generate a list of them, just like when compiling C++ code. That is the reason why you need the line and column position: your editor will recognise them as compilation errors and will jump to the right source code position when you click on them.

In fact, a single position may not be enough: you may want two XML source positions in some cases. Consider for example an error like: "Start tag name 'mytag1' does not match end tag name 'mytag2'". Of course, the parser could generate 2 separate error messages, like most C++ compilers nowadays do, but it would be neat if you could click on a single error message and the editor's window then split into 2 halves and showed the XML tags at the start and end positions at the same time.

The distinction between error types becomes crystal clear when you consider what happens if your call-back routine fails. Think about the scenario where a call-back has received the XML data correctly but has failed to save it to disk. That is a "real" error, in the sense that it is unexpected and fatal. It does not make sense to report an XML line and column position for that type of error. If you are using an alternative XML parser, it would not make sense to carry on parsing the incoming XML in search of further syntax errors. In the face of such a call-back error, it's probably desirable to stop the whole XML processing straight away.

With Expat, your call-back routines have to handle their generic errors themselves. You cannot let C++ exceptions cross the call-back boundary, because Expat is written in plain C and cannot handle C++ exceptions. Therefore, each call-back must implement its own try/catch block. Expat's optional parser context (the call-back's void * userData argument) becomes mandatory, because, upon catching a C++ exception, the call-back routine needs to temporarily store the error information in that context before calling XML_StopParser().

If you think about it, the fact that XML_StopParser() exists is already a sign of weakness. First of all, Expat is supposed to be optimised for speed, but it must store and check some boolean parameter pretty often in order to stop parsing whenever a call-back requests it. C++ exceptions would have been faster in this situation. Secondly, XML_StopParser() does not stop straight away: other call-backs may still run, but which ones get called in what situations is not properly documented. Therefore, you have to be extra careful not to do any more important processing in one call-back after a stop has been requested in another call-back due to a fatal error.

After XML processing finally stops, the top-level caller (the code that calls XML_Parse()) must check whether a call-back has stored error information in the parsing context. In that case, any XML syntax error reported by Expat should probably be discarded.