High Performance NT4 Optimization and Tuning: Demystifying The "Blue Screen"


If you see a list of drivers in the stack dump that includes NTOSKRNL.EXE and FASTFAT.SYS, you have a good indication that a file system error occurred that Windows NT could not repair. I’ve seen this error occur several times during a boot sequence after crashes. In this case, the root problem is that the registry files are damaged. NT tries to load these registry files before the auto-check of the hard drive, and, once loaded, these files are inaccessible (meaning they can’t be repaired by the auto-check process). Therefore, Windows NT displays another core dump. I solved this particular problem by running SCANDISK from a bootable MS-DOS diskette. Once the problem was solved, Windows NT booted up fine. This is one reason why I prefer to use a FAT partition rather than an NTFS partition for my boot partition. I can access a FAT partition from MS-DOS and can sometimes recover from a bad situation.

It really helps to have a listing of the kernel mode error messages. After all, if you can understand the message, you might be able to correct the problem. For that reason, the next section describes a variety of error messages you might encounter. While I have done my best to translate the error messages into plain English and point out specific instances of what might cause them, I have to warn you that some of them are only for the technically advanced user (such as someone who makes a living writing device drivers). Having said that, let’s move on and begin your tour of duty into the exciting world of kernel mode error messages.

Kernel Mode Error Messages

Windows NT includes so many different error messages that the only people to have a complete list are probably the people that developed Windows NT. I seriously doubt if even the device driver kit (DDK) has all the kernel mode error messages defined. I have done a bit of research on your behalf and the error messages that follow are those that have been defined and made available by Microsoft. This doesn’t mean that all the error messages are here, but rather, it is all the error messages I could find or have seen personally over the years.

• APC_INDEX_MISMATCH (0x00000001)—Specifies that a kernel internal error occurred. This error is often caused by a file system driver that improperly enters or leaves a critical section.

• IRQL_NOT_LESS_OR_EQUAL (0x0000000A)—Specifies that an attempt was made to access pageable memory at an interrupt request level (IRQL) that is too high. This error is usually caused by device drivers using improper addresses.
If the kernel debugger is available, you can use it to get a stack backtrace and determine the offending driver. The first parameter specifies the memory address that was referenced. The second parameter specifies the IRQL. The third parameter specifies a read operation (if zero) or a write operation (if one). The fourth parameter specifies the address of the code that referenced the memory.

TIP: The IRQL_NOT_LESS_OR_EQUAL error can also be caused by a resource conflict for a hardware device. So, it pays to check your installed hardware for any interrupt, I/O, DMA, or memory address (ROM BIOS/RAM buffer) conflicts.
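The parameter layout described above for IRQL_NOT_LESS_OR_EQUAL can be sketched in a few lines of Python. This is only an illustration: the layout of the STOP line that the regular expression expects is an assumption about how the blue screen prints it, and the function names `parse_stop_line` and `describe_irql_stop` are hypothetical helpers, not part of any Microsoft tool.

```python
import re

# Assumed layout of the STOP line, for example:
#   *** STOP: 0x0000000A (0x00000000,0x00000002,0x00000001,0x8013EEA4)
# Adjust the pattern if your screen or log formats it differently.
STOP_LINE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return (bug_check_code, [parameters]) as integers."""
    match = STOP_LINE.search(line)
    if match is None:
        raise ValueError("no STOP code found in line")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


def describe_irql_stop(params):
    """Label the four parameters of a 0x0000000A stop as the text describes them."""
    address, irql, access, referencing_address = params
    operation = {0: "read", 1: "write"}.get(access, "unknown")
    return {
        "referenced_address": hex(address),
        "irql": irql,
        "operation": operation,
        "referencing_address": hex(referencing_address),
    }
```

Feeding the line copied from the screen to `parse_stop_line` and passing the resulting parameter list to `describe_irql_stop` gives you a labeled record to note down next to the driver list.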

• KMODE_EXCEPTION_NOT_HANDLED (0x0000001E)—Acts as the default exception handler and is invoked if no other driver handles the exception. It can be a fairly common error message on a system with hardware that is seriously incompatible with Windows NT.
Usually the exception address pinpoints the driver or function that caused the problem. Always make a note of this address as well as the link date of the driver or image that contains this address. The first parameter specifies the exception code that was not handled. If you have the Windows NT DDK, you can examine NTSTATUS.H for a possible cause of the error. The second parameter specifies the address at which the exception occurred. The third parameter specifies parameter 0 of the exception. The fourth parameter specifies parameter 1 of the exception.

If you look at the driver list on the bottom right of the blue screen, you might find a file system (NTFS.SYS, FASTFAT.SYS, and so forth) or network driver (such as RDR.SYS). If you do, this often indicates a problem with the file system on the computer. Sometimes, Windows NT’s auto-check cannot detect or repair a file error. If you are able to boot the system, then, from a console window, issue the CHKDSK Drive: /F command where Drive is the drive letter of one of your volumes. Repeat this command for each volume on your system. This will then run a more thorough scan of your hard disk and attempt to repair any errors it encounters. If you have a SCSI disk drive, you might also want to use the /R parameter, which will perform a low-level scan and attempt to replace any bad sectors found on the hard disk drive.
Another commonly seen exception code is 0x80000003. This code means that a hard-coded breakpoint or assertion was hit, but the system was booted with the /NODEBUG switch. This problem should not occur often. If it does, make sure a debugger is connected and the system is booted with the /DEBUG switch. On non-Intel systems, if the address of the exception is 0x000000BFC0304, the bug code is the result of a cache-parity error on the CPU. If this problem occurs frequently, contact the hardware manufacturer for additional technical support.
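If you do not have NTSTATUS.H at hand, a small lookup can still tell you something about the first parameter of a KMODE_EXCEPTION_NOT_HANDLED stop. The sketch below is an assumption-laden illustration: the table holds only two well-known values (0x80000003, STATUS_BREAKPOINT, and 0xC0000005, STATUS_ACCESS_VIOLATION), the severity is read from the top two bits of the status value, and `describe_exception_code` is a hypothetical helper name.

```python
# A deliberately tiny table; extend it from NTSTATUS.H as needed.
KNOWN_STATUS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITY = ("success", "informational", "warning", "error")


def describe_exception_code(code):
    """Return a one-line description of an exception (NTSTATUS) code."""
    name = KNOWN_STATUS.get(code, "unknown, look it up in NTSTATUS.H")
    severity = SEVERITY[(code >> 30) & 0x3]
    return f"0x{code:08X} {name} ({severity})"
```

Calling `describe_exception_code` with the first stop parameter gives a line you can paste into your notes alongside the exception address.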

• KERNEL_APC_PENDING_DURING_EXIT (0x00000020)—Specifies that a file system driver or network redirector was possibly written incorrectly. Third-party redirectors are often the cause of this error because they do not always receive the heavy-duty testing of the base redirectors included with Windows NT. The first parameter specifies the address of the APC found pending during exit. The second parameter specifies the thread’s APC disable count. The third parameter specifies the current IRQL.

• FAT_FILE_SYSTEM (0x00000023)—Specifies that an error occurred on a FAT volume.

• NTFS_FILE_SYSTEM (0x00000024)—Specifies that an error occurred on an NTFS volume.

• CDFS_FILE_SYSTEM (0x00000025)—Specifies that an error occurred while accessing a CD-ROM volume.

File system errors are rare occurrences. Should you find yourself in this position, check the Microsoft Knowledge Base. For example, if you receive an error on an NTFS volume, you can specify STOP FileSystemHexCode ErrorCode within your query, where FileSystemHexCode would be 0x00000024 and the ErrorCode is the first parameter of the stop message. This might help determine if your error is a known error and if it has been fixed in a service pack or hot fix.
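The query string described above is easy to build by hand, but a short helper keeps the zero padding consistent. This is a minimal sketch: `kb_stop_query` is a hypothetical name, and printing the error code as eight hexadecimal digits is an assumption about what the search will match best.

```python
def kb_stop_query(file_system_code, error_code):
    """Build a 'STOP FileSystemHexCode ErrorCode' search string."""
    return f"STOP 0x{file_system_code:08X} 0x{error_code:08X}"
```

For an NTFS stop you would pass 0x24 as the first argument and the first parameter of the stop message as the second.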

• INCONSISTENT_IRP (0x0000002A)—Specifies that an I/O Request Packet (IRP) was encountered that was in an inconsistent state. This means that some field(s) of the IRP were inconsistent with the remaining state of the IRP. The first parameter specifies the address of the IRP that was found to be inconsistent.

• PANIC_STACK_SWITCH (0x0000002B)—Specifies that the kernel mode stack was overrun. This error normally occurs when a kernel mode driver uses too much stack space. It can also occur as a result of serious data corruption within the kernel.

• DATA_BUS_ERROR (0x0000002E)—Generally, specifies that a parity error occurred in the system memory. This error can also be caused by a device driver accessing an address in the 0x8XXXXXXX range that does not exist. The first parameter is the virtual address that caused the fault. The second parameter is the physical address that caused the fault. The third parameter is the processor status register (PSR). The fourth parameter is the faulting instruction register (FIR).

• PHASE0_INITIALIZATION_FAILED (0x00000031)—Specifies that the system initialization failed early in the boot process. The kernel debugger is required to determine the cause of the error because this code tells you almost nothing about the root cause of the problem.

• PHASE1_INITIALIZATION_FAILED (0x00000032)—Specifies that the system initialization failed during the boot process. The first parameter specifies the Windows NT status code that describes why the system thinks initialization failed. If you have the DDK, the second parameter indicates the location within INIT.C where phase one initialization failure occurred.

• NO_MORE_IRP_STACK_LOCATIONS (0x00000035)—Specifies that a higher-level driver has attempted to call a lower-level driver through the IoCallDriver() interface, but there are no more stack locations in the packet. So, the lower-level driver could not access the parameters passed to it by the higher-level driver. If you encounter this error, it is almost certainly caused by the manufacturer of the device driver. The first parameter contains the address of the IRP.

• DEVICE_REFERENCE_COUNT_NOT_ZERO (0x00000036)—Specifies that a device driver has attempted to delete one of its device objects from the system, but the reference count for that object was nonzero. This means there were still outstanding references to the device. If you encounter this type of problem, it is usually caused by a bug in the calling device driver, and you should contact the manufacturer for additional support. The first parameter specifies the address of the device object.

• MULTIPROCESSOR_CONFIGURATION_NOT_SUPPORTED (0x0000003E)—Specifies that the system has multiple processors, but they are asymmetric in relation to one another rather than symmetric. To be symmetric, all processors must be of the same type and level. Trying to mix two Pentium processors with different steppings, for example, could cause this error. Additionally, on x86 systems, floating-point capabilities must be present on all or no processors.

• MUST_SUCCEED_POOL_EMPTY (0x00000041)—Specifies that a request for a nonpaged pool memory resource could not be fulfilled. Generally speaking, if you encounter this error in a retail product, you might need to increase the amount of physical RAM on your system. The documentation included with the product should indicate its memory requirements. If your physical memory meets the manufacturer's requirements, either there is a bug in the product or you have other applications running (such as SQL Server) that are using too much of the nonpaged memory pool. If the latter is the case, you must add RAM or reconfigure the service to use less memory. If the kernel debugger is available, you can use the VM command to list the size of the various memory pools. The first parameter specifies the size of the request that could not be satisfied. The second parameter specifies the number of pages used in the nonpaged pool. The third parameter specifies the number of pages that could not be provided by the nonpaged pool. The fourth parameter specifies the number of pages available.

• MULTIPLE_IRP_COMPLETE_REQUESTS (0x00000044)—Specifies that a device driver has requested that an IRP be completed, but the packet has already been completed. Determining the device driver that caused the problem is difficult because the tracks of the first driver have been covered by the second. The driver stack for the current request, however, can be found by examining the DeviceObject fields in each of the stack locations. The first parameter specifies the address of the IRP.

• CANCEL_STATE_IN_COMPLETED_IRP (0x00000048)—Specifies that an I/O Request Packet (IRP) that is to be canceled has a cancel routine specified in it. This means that the packet is in a state in which the packet can be canceled. However, the packet no longer belongs to a driver, as it has entered I/O completion. This usually indicates either a driver bug or more than one driver accessing the same packet. The first parameter specifies the pointer to the IRP.

• PAGE_FAULT_WITH_INTERRUPTS_OFF (0x00000049)—Specifies that a request to load a page from the virtual memory pool failed because interrupts have been disabled. It should be treated similarly to an IRQL_NOT_LESS_OR_EQUAL (0x0000000A) error code.

• FATAL_UNHANDLED_HARD_ERROR (0x0000004C)—Specifies that a hard error occurred during the system boot. The following are some common examples:

• 0x218—Specifies that a necessary registry hive file could not be loaded. The hive file could be corrupted or missing. The Emergency Repair disk might be required to recover from this situation. The device driver might have corrupted the registry data while loading it into memory, or the memory where the registry file was loaded is not actually physically present. This last condition could be caused by an incorrect configuration on a computer with an EISA bus. In this case, examine the EISA configuration to specify the actual amount of physical memory installed on the system.

• 0x21A—Specifies that either Winlogon or CSRSS (Windows) terminated unexpectedly. The exit code provides more information. Usually the exit code is C0000005, meaning that an unhandled exception crashed either of these processes.

• 0x221—Specifies that a device driver is corrupted, or a system DLL was determined to be corrupted. Windows NT does its best to check the integrity of drivers and important system DLLs. If a corrupted file is detected during the load process, a message box might be displayed with the offending DLL listed. If the error occurs during the boot process, a blue screen might be displayed with the name of the corrupted file.

• NO_PAGES_AVAILABLE (0x0000004D)—Specifies that no free pages are available to continue operations. The first parameter is the number of dirty pages. The second is the number of physical pages on the computer. The third is the extended commit value in pages. The fourth specifies the total commit value in pages.

• PFN_LIST_CORRUPT (0x0000004E)—Specifies that the error is caused by corrupting I/O driver structures. If the first parameter is one, the second parameter specifies the ListHead value that was corrupted, and the third parameter specifies the number of pages available. The fourth parameter should be zero. If the first parameter is two, the second parameter is the entry list being removed, while the third parameter is the highest physical page number, and the fourth parameter is the reference count of the entry being removed.

• REGISTRY_ERROR (0x00000051)—Specifies that something has gone wrong with the registry. The first and second parameters specify where the bug code occurred. The third might be a pointer to the hive, and the fourth might be the return code from the HvCheckHive system call if the hive is corrupted.

The REGISTRY_ERROR error can also indicate that the registry received an I/O error while trying to read one of its files. This means the error could be caused by hardware problems or file system corruption. The error can also occur because of a failure in a refresh operation used only by the security system (if resource limits are encountered during the refresh operation). If you see this error code, note whether the machine is a PDC or BDC. Determine how many accounts are in its security account manager (SAM) database, if it can be a replication target, and if the volume where the hive files reside is nearly full. This can be useful in determining the root cause. If the computer is a replication target, for example, and disk space is low on the SystemRoot partition, the lack of disk storage could be the cause of the problem.

• FTDISK_INTERNAL_ERROR (0x00000058)—Specifies that your system was booted from the master (original) partition of a mirrored set when the slave (shadow) drive is more up-to-date. This situation can arise, for example, after the system has run from the slave drive while the master was offline and the master is then brought back and booted. You must boot from the slave (shadow) drive to correct the problem.

• CONFIG_INITIALIZATION_FAILED (0x00000067)—Specifies that the registry could not allocate the memory pool needed to contain the registry files. This error should never occur in reality because it is early enough in system initialization that there should always be plenty of space in the paged pool. Parameter one should be five, and parameter two indicates the location in NTOS\CONFIG\CMSYSINI that failed.

• IO1_INITIALIZATION_FAILED (0x00000069)—Specifies that initialization of the I/O system failed for some reason. There is usually no other information available. Generally, this error occurs because Setup made some incorrect decisions about the installation of the system or the user has reconfigured the system.

• PROCESS1_INITIALIZATION_FAILED (0x0000006B)—Specifies that a process failed to initialize early in the boot process. Parameter one specifies the status code that suggests why the Windows NT initialization failed, while parameter two specifies the location in NTOS\PS\PSINIT.C where the failure was detected.

The PROCESS1_INITIALIZATION_FAILED error can be caused by incompatibilities in the I/O subsystem. It is often seen on computers that have an older BIOS that does not support drives with greater than 1,024 cylinders. It can also be related to generic drive geometry problems.

• CONFIG_LIST_FAILED (0x00000073)—Specifies that one of the core system hives is corrupted or unreadable. The hive can be either SOFTWARE, SECURITY, or SAM. Parameter one is usually set to one, parameter two is usually set to two, parameter three is the index of the hive in the list, and parameter four is a pointer to a UNICODE_STRING containing the file name of the hive.

• BAD_SYSTEM_CONFIG_INFO (0x00000074)—Might indicate that the SYSTEM hive loaded by OSLOADER/NTLDR was corrupted. This is unlikely, however, because OSLOADER checks a hive after loading it to make sure it is not corrupted. This error can also indicate that some critical registry keys and values are not present in the hive. Rebooting and using the Last Known Good option might correct the problem. If that fails, you will probably need to reinstall or use the Emergency Repair Disk.

• CANNOT_WRITE_CONFIGURATION (0x00000075)—Specifies that the SYSTEM hive files (SYSTEM and SYSTEM.ALT) cannot be grown to accommodate additional data written into the hive between registry initialization and phase one initialization (when the file systems are available). Usually, this error means there are zero bytes of free space available on the drive, although it could be caused by trying to store the registry on a read-only device.

• PROCESS_HAS_LOCKED_PAGES (0x00000076)—Specifies that a device driver is not cleaning up completely after an I/O operation. The first parameter specifies the process address. The second parameter specifies the number of locked pages. The third specifies the number of private pages, and the fourth parameter is usually zero.

• KERNEL_STACK_INPAGE_ERROR (0x00000077)—Specifies that the requested page of kernel data could not be read. This error is usually caused by a bad block in a paging file or a disk controller error. In very rare cases, it is caused by running out of resources, specifically, the nonpaged pool, with a status code of C000009A. If the first and second parameters are zero (0), the third parameter is the PTE value at the time of the error, and the fourth parameter is the address of the signature on the kernel stack. If the first parameter is non-zero, the first parameter is a status code, and the second is an I/O status code. The third parameter is the page file number, and the fourth is an offset into the page file.

If you receive a KERNEL_STACK_INPAGE_ERROR error and the first and second arguments are both zero, the stack signature in the kernel stack was not found. This error is usually caused by bad hardware. An I/O status of C000009C (STATUS_DEVICE_DATA_ERROR) or C000016A (STATUS_DISK_OPERATION_FAILED) normally specifies that the data could not be read from the disk due to a bad block. Upon rebooting the system, auto-check will run and attempt to map out the bad sector. If the status is C0000185 (STATUS_IO_DEVICE_ERROR) and the paging file is on a SCSI disk device, the cabling and termination should be checked.
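The decision path just described for KERNEL_STACK_INPAGE_ERROR can be written down as a short Python sketch. Treat it as an illustration only: `describe_stack_inpage` and `STATUS_HINTS` are hypothetical names, the name STATUS_INSUFFICIENT_RESOURCES for C000009A comes from NTSTATUS.H rather than from the text, and the lookup checks the I/O status first and the status code second because the text does not say which parameter carries C000009A.

```python
# Status values mentioned in the text, with the advice given for each.
STATUS_HINTS = {
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES: nonpaged pool exhausted (rare)",
    0xC000009C: "STATUS_DEVICE_DATA_ERROR: bad block, auto-check should map it out on reboot",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED: bad block, auto-check should map it out on reboot",
    0xC0000185: "STATUS_IO_DEVICE_ERROR: check SCSI cabling and termination",
}


def describe_stack_inpage(params):
    """Interpret the four parameters of a 0x00000077 stop."""
    first, second, third, fourth = params
    if first == 0 and second == 0:
        # Both zero: the stack signature was not found.
        return {
            "cause": "stack signature not found, usually bad hardware",
            "pte_value": third,
            "signature_address": fourth,
        }
    hint = (
        STATUS_HINTS.get(second)
        or STATUS_HINTS.get(first)
        or "status not covered by the text"
    )
    return {
        "cause": hint,
        "status": first,
        "io_status": second,
        "page_file_number": third,
        "page_file_offset": fourth,
    }
```

The returned dictionary keeps the raw values next to the hint, so you can still pass the exact numbers to a support engineer.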

• MISMATCHED_HAL (0x00000079)—Specifies that the HAL revision level and HAL configuration type do not match those of the kernel or the machine type. This error probably occurs because the user has manually updated either NTOSKRNL.EXE or HAL.DLL. Alternatively, the machine has a multiprocessor HAL and a uniprocessor kernel, or the reverse. Parameter one specifies the type of mismatch, which can be one of the following:

    • 1—The PRCB release levels mismatch (that is, something is out of date). In this case, parameters two and three are as follows:

        • Parameter 2—Major PRCB level of NTOSKRNL.EXE

        • Parameter 3—Major PRCB level of HAL.DLL

    • 2—The build types mismatch. In this case, parameters two and three are as follows:

        • Parameter 2—Build type of NTOSKRNL.EXE

        • Parameter 3—Build type of HAL.DLL, where the build types can be one of the following:

            • 0—Free multiprocessor-enabled build

            • 1—Checked multiprocessor-enabled build

            • 2—Free uniprocessor build

    • 3—Micro Channel Architecture (MCA) computers require an MCA-specific HAL. In this case, parameters two and three are as follows:

        • Parameter 2—Machine type as detected by NTDETECT.COM, where a value of 2 means the computer is MCA

        • Parameter 3—Machine type that HAL supports, where a value of 2 means the HAL is built for MCA
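The nested cases for MISMATCHED_HAL are easier to follow as code. The sketch below is an illustration under stated assumptions: the values 1, 2, and 3 used for parameter one follow Microsoft's published bug check description and should be verified against your own documentation, and `describe_mismatched_hal` is a hypothetical helper name.

```python
BUILD_TYPES = {
    0: "free multiprocessor-enabled build",
    1: "checked multiprocessor-enabled build",
    2: "free uniprocessor build",
}


def describe_mismatched_hal(params):
    """Explain a 0x00000079 stop from its first three parameters."""
    kind, kernel_side, hal_side = params[0], params[1], params[2]
    if kind == 1:  # assumed value for a PRCB release level mismatch
        return (
            f"PRCB release levels differ: NTOSKRNL.EXE major level {kernel_side}, "
            f"HAL.DLL major level {hal_side}"
        )
    if kind == 2:  # assumed value for a build type mismatch
        kernel_build = BUILD_TYPES.get(kernel_side, f"unknown build type {kernel_side}")
        hal_build = BUILD_TYPES.get(hal_side, f"unknown build type {hal_side}")
        return (
            f"build types differ: NTOSKRNL.EXE is a {kernel_build}, "
            f"HAL.DLL is a {hal_build}"
        )
    if kind == 3:  # assumed value for an MCA mismatch
        machine = "MCA" if kernel_side == 2 else "not MCA"
        hal = "MCA" if hal_side == 2 else "not MCA"
        return f"machine type mismatch: computer detected as {machine}, HAL built for {hal}"
    return f"unrecognized mismatch type {kind}"
```

A uniprocessor kernel paired with a multiprocessor HAL, for instance, would show up as mismatch type 2 with parameters two and three set to 2 and 0.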

• KERNEL_DATA_INPAGE_ERROR (0x0000007A)—Specifies that the requested page of kernel data could not be read. This error is commonly caused by a bad block in the paging file or a disk controller error. The first parameter specifies the lock type that was held (a value of 1, 2, 3, or PTE address). The second parameter specifies the error status (normally an I/O status code). The third specifies the current process (a virtual address for lock type three or PTE). The fourth specifies the virtual address that could not be paged in.

• INACCESSIBLE_BOOT_DEVICE (0x0000007B)—Specifies that during the initialization of the I/O system, the driver for the boot device might have failed to initialize the device that the system is attempting to boot from. Or, the file system that is supposed to read the device might have either failed its initialization or simply not recognized the data on the boot device as a file system structure. In the first case, the first argument is the address of a Unicode string data structure that is the ARC name of the device from which the boot was being attempted. In the second case, the first argument is the address of the device object that could not be mounted.

If the INACCESSIBLE_BOOT_DEVICE error occurs during the initial setup of the system, the error might have occurred because the system was installed on an unsupported disk or SCSI controller. This error can also be caused by the installation of a new SCSI adapter or disk controller or by repartitioning the disk with the system partition. If this is the case, on x86 systems, the BOOT.INI file must be edited, and, on ARC systems, Setup must be run.

• INSTALL_MORE_MEMORY (0x0000007D)—Specifies that you need to add additional RAM. The first parameter specifies the number of physical pages found, the second the lowest physical page, the third the highest physical page, and the fourth should be zero.

• UNEXPECTED_KERNEL_MODE_TRAP (0x0000007F)—Specifies that a trap occurred in kernel mode: either a kind of trap that the kernel is not allowed to have or catch (a bound trap), or a kind of trap that is always instantly fatal, such as a double fault. The first parameter is the number of the trap (0x8, for example, is a double fault).
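On x86 systems, the trap number in the first parameter of UNEXPECTED_KERNEL_MODE_TRAP corresponds to a processor exception vector. The table below is a sketch based on Intel's architecture documentation, not on the text above, and `describe_trap` is a hypothetical helper name.

```python
# x86 processor exception vectors (from Intel's architecture manuals).
X86_TRAPS = {
    0x00: "divide error",
    0x01: "debug exception",
    0x02: "nonmaskable interrupt",
    0x03: "breakpoint",
    0x04: "overflow",
    0x05: "BOUND range exceeded",
    0x06: "invalid opcode",
    0x07: "device not available",
    0x08: "double fault",
    0x0A: "invalid TSS",
    0x0B: "segment not present",
    0x0C: "stack-segment fault",
    0x0D: "general protection fault",
    0x0E: "page fault",
    0x10: "x87 floating-point error",
    0x11: "alignment check",
    0x12: "machine check",
}


def describe_trap(number):
    """Return a readable name for an x86 trap number."""
    name = X86_TRAPS.get(number, "not in this table")
    return f"trap 0x{number:X}: {name}"
```

Passing the first stop parameter to `describe_trap` tells you at a glance whether you are looking at the bound trap or the double fault mentioned in the description.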

• NMI_HARDWARE_FAILURE (0x00000080)—Specifies that a hardware failure raised a nonmaskable interrupt (NMI). The HAL should report whatever specific data it has and tell the user to call the hardware vendor for support.

• MBR_CHECKSUM_MISMATCH (0x0000008B)—Specifies that the master boot record checksum that the system calculates does not match the checksum passed in by the loader. This usually indicates a virus has infected the boot sector. The first parameter is the disk signature from the MBR, the second is the MBR checksum calculated by OSLOADER, and the third is the MBR checksum calculated by the system.

• PP0_INITIALIZATION_FAILED (0x0000008F)—Specifies that the phase zero initialization of the kernel mode Plug And Play Manager failed.

• PP1_INITIALIZATION_FAILED (0x00000090)—Specifies that the phase one initialization of the kernel mode Plug And Play Manager failed.

• UP_DRIVER_ON_MP_SYSTEM (0x00000092)—Specifies that a uniprocessor-only driver is loaded on a multiprocessor system with more than one active processor. Parameter one specifies the base address of the driver.

• INVALID_KERNEL_HANDLE (0x00000093)—Specifies that the kernel code (server, redirector, or other driver) attempted to close a handle that is not a valid handle. The first parameter specifies the handle that was passed to NtClose. If the second parameter is zero, it means that a protected handle was closed. If the second parameter is one, it means an invalid handle was closed.

• KERNEL_STACK_LOCKED_AT_EXIT (0x00000094)—Specifies that a thread exited while its kernel stack was marked as not swappable.

• INVALID_WORK_QUEUE_ITEM (0x00000096)—Specifies that the kernel attempted to remove a queue item that contained a null parameter. This message usually indicates a device driver that incorrectly utilizes worker thread work items.

• BOUND_IMAGE_UNSUPPORTED (0x00000097)—Specifies that MmLoadSystemImage was called to load a bound image. Bound images are not supported in the kernel. To resolve the problem, make sure BIND.EXE was not run on the image in question.

• END_OF_NT_EVALUATION_PERIOD (0x00000098)—Specifies that your Windows NT system is an evaluation version with an expiration date and the trial period has expired. To resolve this problem, you must purchase a retail copy of the product and reinstall. The first parameter specifies the low order 32 bits of your installation date, the second parameter specifies the high order 32 bits of your installation date, and the third parameter specifies the trial period in minutes.
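The first two parameters of END_OF_NT_EVALUATION_PERIOD are the two halves of one 64-bit value, so they have to be joined before the date means anything. The sketch below shows the arithmetic. It assumes the joined value is an NT system time, that is, a count of 100-nanosecond intervals since January 1, 1601 (UTC); the text does not state the encoding, so treat that as an assumption to verify. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed epoch for NT system time values.
NT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def combine_halves(low, high):
    """Join the low-order and high-order 32 bits into one 64-bit value."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def nt_time_to_datetime(ticks):
    """Convert 100-nanosecond ticks since 1601 into a UTC datetime."""
    return NT_EPOCH + timedelta(microseconds=ticks // 10)


def evaluation_window(low, high, trial_minutes):
    """Return (installation date, expiration date) for the stop parameters."""
    installed = nt_time_to_datetime(combine_halves(low, high))
    return installed, installed + timedelta(minutes=trial_minutes)
```

Calling `evaluation_window` with the three stop parameters yields the installation date and the moment the trial period ran out, provided the encoding assumption holds.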

• INVALID_REGION_OR_SEGMENT (0x00000099)—Specifies that a device driver called ExInitializeRegion or ExInterlockedExtendRegion with an invalid set of parameters.

• SYSTEM_LICENSE_VIOLATION (0x0000009A)—Specifies that a violation of the software license agreement has occurred. This can often be caused by attempting to change the product type of an offline system or an attempt to change the trial period of an evaluation unit of NT. If the first parameter is set to:

• 0—An offline product type change was attempted. The second parameter should be set to 1 if the product is Windows NT Server or 0 for Windows NT Workstation. The third parameter should be a partial serial number, and the fourth parameter should be the first two characters of product type from the product options.

• 1—An offline change to a Windows NT evaluation version was attempted. The second parameter should be the registered evaluation time, the third parameter should be a partial serial number, and the fourth should be a registered evaluation time from an alternate source.

• 2—The setup key could not be opened. The second parameter specifies the status code associated with the open key failure.

• 3—The SetupType value from the setup key is missing. Therefore, the GUI setup mode could not be detected. The second parameter should specify the status code associated with the key lookup failure.

• 4—The SystemPrefix value from the setup key is missing. The second parameter should then be the status code associated with the key lookup failure.

• 5—Offline changes were made to the number of licensed processors. The second parameter should be a status code, the third should be the invalid value found in licensed processors, and the fourth should be the officially licensed number of processors.

• UDFS_FILE_SYSTEM (0x0000009B)—Specifies that an error occurred on a Universal Disk Format (UDF) volume.

• MACHINE_CHECK_EXCEPTION (0x0000009C)—Specifies that a fatal machine check exception has occurred. The parameters vary based on the processor type. If the processor has only the MCE feature available (the Intel Pentium, for example), the first parameter specifies the low 32 bits of P5_MC_TYPE MSR. The second parameter is not defined. The third parameter is the high 32 bits of P5_MC_ADDR MSR. The fourth parameter is the low 32 bits of P5_MC_ADDR MSR. If the processor also has the MCA feature available (Intel Pentium Pro, for example), the first parameter is the bank number. The second parameter is the address field of MCi_ADDR MSR for the MCA bank that had the error. The third parameter is the high 32 bits of MCi_STATUS MSR for the MCA bank that had the error. The fourth parameter is the low 32 bits of MCi_STATUS MSR for the MCA bank that had the error.
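For processors with the MCA feature, the third and fourth parameters of MACHINE_CHECK_EXCEPTION together form the 64-bit MCi_STATUS register. The sketch below joins the two halves and pulls out a few fields. The bit positions come from Intel's machine-check architecture documentation rather than from the text above, and `decode_mci_status` is a hypothetical helper name.

```python
# Flag bit positions in MCi_STATUS (from Intel's architecture manuals).
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(high, low):
    """Decode MCi_STATUS from stop parameters three (high) and four (low)."""
    status = ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "raw": status,
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
```

The flag list is the part most worth passing on to the hardware vendor, since it says whether the error was uncorrected and whether the recorded address is valid.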

The next step in the process of getting down and dirty with a core dump is to examine the MEMORY.DMP file. While this might not always prove useful to you, it will prove useful to the Microsoft technical support person you call when trying to solve a recurring problem that you cannot solve yourself. So, let’s take a look at the available tools and see what you can find in a dump file.
