BSODTutorials: August 2013

Thursday 29 August 2013

Update: Flipped Bits

Firstly, thanks to YoYo155 for asking his friend about flipped bits and segmentation faults. If the flipped bit occurs consistently within the exact same register, then the problem almost certainly lies with the CPU, and is less likely to be associated with the motherboard or PSU. If flipped bits occur in different registers, then the problem is most likely associated with the ALU, which again means the CPU is at fault.

ALU stands for Arithmetic Logic Unit; it is the part of the CPU that performs arithmetic and logic operations.

Segmentation Faults

A user at Sysnative Forums asked an interesting question, and in my attempt to answer it, I also started to teach myself about misaligned pointers and segmentation faults (also known as access violations and bus errors).

The thread can be found here - Misaligned IP - Sysnative Forums

I posted a few links related to the question in my two posts in that thread.

"A segmentation fault (often shortened to segfault), bus error, or access violation is generally an attempt to access memory that the CPU cannot physically address. It occurs when the hardware notifies an operating system about a memory access violation."

All the links to information about the specifics of misaligned pointers can be found in the thread.

The main reason for this blog post is that I think I've managed to relate a potentially bad CPU to an Access Violation error within this dump file.

The second parameter is the address of the exception record, and the third parameter contains the address of the context record. It's important to note that segmentation faults are usually caused by drivers, and only in rare cases is the cause really a hardware issue.

Notice the random b at the beginning of the address for the rcx register? This indicates a flipped bit which is usually related to a CPU, PSU or motherboard error.

Using the .formats command on the rcx and rbx registers, we can get the binary representation of each address, and therefore see which bit has been flipped. In fact, in an earlier dump file, there was also a flipped bit within the exact same register.

There seems to be a random 0 within the rcx register, whereas this isn't present within the normal rbx register.
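The comparison that .formats makes possible by eye can also be sketched in a few lines of Python. This is only an illustration: the two values are hypothetical and are not taken from the dump, and the helper name flipped_bits is my own.

```python
def flipped_bits(observed: int, expected: int, width: int = 64) -> list[int]:
    """Return the bit positions (0 = least significant) where two values differ."""
    diff = (observed ^ expected) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]

# Hypothetical register contents, NOT the values from the dump file.
expected = 0xFFFFF8A000123450
observed = expected ^ (1 << 42)  # simulate one flipped bit

print(f"{expected:064b}")  # same idea as the binary line of .formats
print(f"{observed:064b}")
print(flipped_bits(observed, expected))  # [42]
```

XOR leaves a 1 only where the two values disagree, so a single entry in the result suggests a single flipped bit.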

As I've said in other blog posts, access violation errors are usually the result of an invalid memory address being referenced, and this may be partly true here too. A program or driver has referenced a virtual address which the CPU isn't able to translate into a physical memory address.

I'll continue to read and learn about misaligned pointers.

Monday 26 August 2013

Debugging Stop 0xD5 - DRIVER_PAGE_FAULT_IN_FREED_SPECIAL_POOL

Another debugging lesson, with a simple bugcheck which is very similar to a Stop 0xD1 or Stop 0xA. I believe this bugcheck only occurs when Driver Verifier is in use, but I may be wrong about this.

The parameters are very similar to those of the Stop codes mentioned above; for instance, we can see the memory address referenced and the type of operation being performed.

This bugcheck is caused by drivers referencing memory addresses which have already been freed, and which they therefore no longer own. The page fault may have resulted because the driver referenced a page which isn't committed to its address space, which in turn would raise an access violation.

We can see that the page fault occurred on the memcpy function call, which is used to copy data between two different buffers or memory addresses.

More documentation here - memcpy function (Windows)

From viewing the call stack, we can see the klif.sys driver, which belongs to Kaspersky and is known to cause BSODs with Windows 7 (I'm not too sure about other operating systems).

I've suggested the driver and program be removed with the Kaspersky Removal Tool.

Saturday 24 August 2013

Debugging Stop 0x18 - REFERENCE_BY_POINTER

I've found a new interesting bugcheck which is commonly related to objects and handles. Please note before reading that this is a Minidump file, and some of the extensions which you would use may not work in this example unless you have a Kernel Memory dump available.

Objects are system resources and are managed by a subsystem called the Object Manager. There is a range of different object types, and here's the complete list:

Process - The 'box' of virtual address space in which thread objects execute and are controlled.

Thread - Executable object for a process.

Job - A group of processes connected to one object (More Information here - Job Objects (Windows))

Section - An area of shared memory.

File - Opened file or I/O device.

Token - Security privileges and properties for a process or thread.

Event - Synchronization and notification mechanism.

Semaphore - Synchronization mechanism which allows a certain number of threads access to a resource.

Mutex - Allows only one thread to have access to a resource (it's like a key).

Timer - Notifies a thread when a certain amount of time has elapsed.

IoCompletion - I/O Completion port

Key - Registry key (managed by configuration manager)

Directory - Held within the object manager namespace and used to store other objects.

TpWorkerFactory - Collection of threads assigned to complete a certain task (Worker Threads).

TmRm (Resource Manager), TmTx (Transaction), TmTm (Transaction Manager) and TmEn (Enlistment) - Used by the Kernel Transaction Manager.

WindowStation - Read here - Windows Stations

Desktop - Read here - Windows Desktops

Now that you understand the different types of objects, remember that references to an object are called handles. In order to fully dereference and delete an object, all the handles to that object must be closed.

In this bugcheck, it seems that the reference count either dropped below zero, or dropped to zero while there were still open handles to the object. We can see from the description of the bugcheck that the reference count is incremented by 1 when a handle is opened and decremented by 1 when a handle is closed. Please also note that Kernel Mode components of the operating system can reference objects with pointers rather than handles.

In this example, we would use the !object extension with the address of the object to obtain the handle count and pointer count, and the type of object being referenced.

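We could also use dt nt!_OBJECT_HEADER to obtain similar information about the object. Note that the object header sits immediately before the object body in memory, so the address passed to dt is the header's address rather than the object address itself.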

I noticed in the call stack the nt!NtWriteFile routine which is used to write data to file objects. I'm guessing the object may have been a file object which was being used by Google Chrome since parameter 2 matched the object address for the chrome.exe process.

Kernel Mode components and device drivers must increment the reference count for an object if they wish to use it, since doing so ensures the object isn't deallocated while in use. Failing to do so may be another reason why the bugcheck has occurred.
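To make the counting rules a little more concrete, here is a deliberately simplified toy model in Python. It is not how the Object Manager is implemented; the class and method names are hypothetical, and it assumes that opening a handle also takes a reference, as the bugcheck description suggests.

```python
class RefCountError(RuntimeError):
    """Stands in for the Stop 0x18 bugcheck in this toy model."""

class ToyObject:
    def __init__(self, name: str) -> None:
        self.name = name
        self.handle_count = 0
        self.pointer_count = 0  # every reference, whether by handle or by pointer
        self.deleted = False

    def reference(self) -> None:
        """A kernel-mode component takes a reference by pointer."""
        self.pointer_count += 1

    def dereference(self) -> None:
        self.pointer_count -= 1
        # Illegal states: negative count, or fewer references than open handles.
        if self.pointer_count < 0 or self.pointer_count < self.handle_count:
            raise RefCountError(f"bad reference count on {self.name}")
        if self.pointer_count == 0:
            self.deleted = True

    def open_handle(self) -> None:
        self.handle_count += 1
        self.pointer_count += 1

    def close_handle(self) -> None:
        self.handle_count -= 1
        self.dereference()

# A hypothetical driver that dereferences an object it never referenced.
file_object = ToyObject("file")
file_object.open_handle()
try:
    file_object.dereference()
except RefCountError as error:
    print("bugcheck:", error)
```

The extra dereference drops the count to zero while a handle is still open, which is the kind of inconsistency described above.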

Thursday 22 August 2013

Debugging Stop 0x1A - Out of Sync PFNs and Page Tables

I've seen this bugcheck and its parameter 403 becoming more common recently, and therefore thought I would share how I go about debugging the problem.

Stop 0x1As rarely tell us what the parameters actually indicate, so we need to check the documentation provided by Microsoft in the WDK (Windows Driver Kit) on MSDN. Stop 0x1A Documentation - Bug Check 0x1A: MEMORY_MANAGEMENT

"The page table and PFNs are out of sync. This is probably a hardware error, especially if parameters 3 & 4 differ by only a single bit."

The above is the meaning of a first parameter of 403. Remember to always check the first parameter, since it determines what the other parameters mean; without the documentation they are usually meaningless unless you work for Microsoft. So, in this example, we need to examine the binary representation of parameters 3 and 4 and then compare their bits.

We can use the .formats command to examine and compare the two parameters together like so:

The parameters differ greatly, which leads me to believe this is more of a software-related issue. To support my point further, using the !thread extension I was able to find a pending IRP for the crashed thread. However, since this is a Minidump and not a Kernel Memory dump, I'm not able to use the !irp extension to view the stack for the IRP.
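The "differ by only a single bit" test from the documentation can be written as a short check. This is a sketch only; the parameter values below are made up for illustration and do not come from the dump.

```python
def differs_by_single_bit(a: int, b: int) -> bool:
    """True when exactly one bit differs between the two values."""
    diff = a ^ b
    # A non-zero value with exactly one bit set is a power of two,
    # and clearing its lowest set bit leaves zero.
    return diff != 0 and (diff & (diff - 1)) == 0

# Hypothetical parameter 3 / parameter 4 pairs.
print(differs_by_single_bit(0x8A40000012345867, 0x8A40000012345827))  # True
print(differs_by_single_bit(0x8A40000012345867, 0x0000000000ABCDEF))  # False
```

A True result would point towards hardware, while a False result (as in this dump) makes a software cause more plausible.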

I've requested the use of Driver Verifier for the user. In an ideal world, I would have had a Kernel Memory dump and checked the IRP.

Side Note: I hope this article helps anyone, and I do try to update my blog as much as possible, but it may be only a few blog posts a month since I attempt to find good debugging examples and write blog posts with examples to support concepts e.g. Working Set Internals

Wednesday 14 August 2013

Process Working Set Internals

Hey everyone, I thought I'd add some information about the internal mechanisms of Windows. In this example I'm going to write about the process working set and the basic management of working sets.

The process working set is the set of virtual pages belonging to a process which are currently resident in physical memory.

By default, the minimum working set is 50 pages and the maximum is 345 pages for a single process, although these limits can be ignored if the system has enough free memory.
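To put those page counts into more familiar units, here is a quick conversion. It assumes the standard 4 KB small page size used on x86 and x64; that figure is my assumption and is not stated above.

```python
PAGE_SIZE = 4096  # assumption: 4 KB small pages

def pages_to_kb(pages: int) -> int:
    """Convert a page count into kilobytes."""
    return pages * PAGE_SIZE // 1024

print(pages_to_kb(50))   # 200
print(pages_to_kb(345))  # 1380
```

So the default limits work out to roughly 200 KB and 1380 KB.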

We can examine the current working set for a process by using the !process extension with either the Process ID or the process' address.

We can see the current working set, along with the working set maximum and minimum sizes. We can also see the page fault count for the current process and its pool usage.

On a page fault, the Working Set Manager must decide whether existing pages are to be replaced (trimmed) and sent to disk, or whether additional pages can simply be added to the working set of the process. The Working Set Manager arranges the processes into an order: processes which have the largest working sets and pages which haven't been accessed in a while are nominated first for trimming, and then smaller processes are considered.

The Working Set Manager examines processes with working sets above their minimum of 50 pages, and then checks the PTE status bits for each page. If the Accessed bit is clear, then the page hasn't been accessed recently and is considered to be aged. While scanning for pages to trim, the Working Set Manager may come across a page which does have the Accessed bit set; it will then clear this bit, and if the page still hasn't been accessed by the next scan, the page is considered aged.

The scans continue until no more trimming is required.
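The aging scan described above can be mimicked with a small simulation. This is a toy model under my own assumptions (a boolean stands in for the PTE Accessed bit, and the names Page and scan are hypothetical), not a description of the real Working Set Manager code.

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    accessed: bool  # stand-in for the PTE Accessed bit

def scan(pages: list[Page]) -> list[int]:
    """One pass: report aged pages and clear the Accessed bit on the others."""
    aged = []
    for page in pages:
        if page.accessed:
            page.accessed = False  # the page gets until the next scan to be touched again
        else:
            aged.append(page.number)
    return aged

pages = [Page(0, True), Page(1, False), Page(2, True)]
print(scan(pages))        # [1]
pages[2].accessed = True  # the process touches page 2 between scans
print(scan(pages))        # [0, 1]
```

Page 0 was accessed before the first scan but not afterwards, so it only becomes a trimming candidate on the second pass.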

The Working Set of a process can also be found in Process Explorer.

Monday 12 August 2013

Debugging Stop 0xFC - ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY

I've found a Minidump in which I can finally show the !pte extension. In this example, a driver has attempted to execute a non-executable region of memory. Basically, a device driver has referenced an invalid memory address.

The first parameter is the virtual address which was referenced by the device driver. The !pte extension shows the mapping between the virtual and physical memory addresses, and the PDE field gives the address of the Page Directory Entry.

I've highlighted the PTE status bits and protection bits, and here I will explain what the currently set bits mean:

D (Dirty) - The page has been written to previously.

A (Accessed) - The page has been read.

K (Kernel Mode) - The page has a Kernel Mode owner.

W - The page is writable.

E - The page is executable.

V - The page is valid, and has a physical page corresponding to it.

There are a few other PTE bits which are not set in this example, so I'll quickly explain them here:

C = Copy on Write (explained here)

G = Global; address translation will apply to all processes.

L = Large Page (not set means a Small Page).

Please note that on many x86 systems which do not support the hardware execute/no-execute bit, the E may still be displayed.
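For anyone curious where such flags live, here is a rough sketch that decodes the architectural x86-64 page table entry bits (as defined in the processor manuals) from a raw 64-bit value. The PTE value is hypothetical, and the exact way !pte derives its letters (including software-defined bits such as copy-on-write) is not modelled here.

```python
# Architectural bit positions in an x86-64 page table entry.
PTE_BITS = {
    0: ("V", "valid (present)"),
    1: ("W", "writable"),
    2: ("U", "user-mode accessible; clear means kernel-only"),
    5: ("A", "accessed"),
    6: ("D", "dirty"),
    7: ("L", "large page"),
    8: ("G", "global"),
    63: ("NX", "no-execute; clear means executable"),
}

def decode_pte(pte: int) -> list[str]:
    """Return the short names of the bits that are set in the PTE value."""
    return [name for bit, (name, _meaning) in PTE_BITS.items() if (pte >> bit) & 1]

# Hypothetical PTE contents, not the value from the dump file.
print(decode_pte(0x8000000012345163))  # ['V', 'W', 'A', 'D', 'G', 'NX']
```

In this made-up value the NX bit is set and the user bit is clear, which would correspond to a non-executable, kernel-only page.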

There wasn't much information within the dump file, so with this example, I'll recommend running Driver Verifier.

Debugging Stop 0xC4 - Invalid Handle

Stop 0xC4 is a bugcheck produced when Driver Verifier finds a driver which violates one or more of its current settings. The first parameter points to the type of violation, and in this example, the violation is the use of an invalid handle; a user-mode handle is being used within kernel-mode.

A handle is very simply a reference to an object. An object is usually some kind of system resource, but for this example, the handle belongs to a process object.

We can see the driver which caused the problem, but let's investigate the dump file further (please note this is a Minidump). The third parameter contains the address of the process which owns the handle.

We can use the !process extension with the third parameter to gain some information about the process, and any associated threads owned by the process.

We can view the working set for the current process (useful for Stop 0xF4), and also the currently associated thread.

Here, we could use the !handle extension with the address of the process, to view all the handles owned by the process, but unfortunately this information was not retained within the Minidump.

For all those interested, the process was related to ASUS AI Suite II, or more commonly ASUS bloatware.

Sunday 11 August 2013

Debugging Stop 0x101 - Solved Example

I thought I'd add a solved example for a Stop 0x101, to give you an idea of some of the commands which are useful when debugging. Please note, I found the pending IRP with the !irpfind extension, which finds all the IRPs on the system.

Stop 0x101 - Solved

Hope you enjoy reading :)

Stop 0x9F - Checking Devices and Sleep Compatibility

Hey everyone, I've got another Stop 0x9F example to show you. In this example I'm going to explain how to find the supported sleep states for a device and find the model of the hardware. I thought this would be especially helpful with the athrx.sys BSODs, since looking the driver up in the Driver Reference Table usually points to a generic entry.

Firstly, we'll use the !devstack extension on the second parameter of the bugcheck. The !devstack extension will display the device stack for an associated device object. Remember, parameter two is the physical device object.

The > symbol points to the entry which matches the device object address used. I should also explain what a device stack is. A device stack is simply a list of device objects associated with a device node; each device object also has an associated driver object.

IRPs are usually processed by multiple device stacks. It's important to remember that a single driver object can have multiple device objects.

A device node is simply a physical device within the device tree.
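The relationship between device nodes, device stacks and driver objects can be pictured with a tiny data model. This is purely illustrative: the classes are my own, and the driver names other than pci.sys are placeholders rather than real drivers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceObject:
    role: str    # e.g. "filter", "FDO" (functional) or "PDO" (physical)
    driver: str  # name of the driver object which owns this device object

@dataclass(frozen=True)
class DeviceNode:
    description: str
    stack: tuple  # device objects ordered from the top of the stack down to the PDO

# Hypothetical fragment of a device tree.
wifi = DeviceNode("wireless adapter", (
    DeviceObject("filter", "examplefilter.sys"),
    DeviceObject("FDO", "examplewifi.sys"),
    DeviceObject("PDO", "pci.sys"),
))
audio = DeviceNode("audio controller", (
    DeviceObject("FDO", "exampleaudio.sys"),
    DeviceObject("PDO", "pci.sys"),
))

def devices_owned_by(driver: str, nodes: list) -> list:
    """List every device object a driver owns across all device stacks."""
    return [(node.description, dev.role)
            for node in nodes for dev in node.stack if dev.driver == driver]

print(devices_owned_by("pci.sys", [wifi, audio]))
```

Here one driver owns a device object in each of the two stacks, which is the "single driver object, multiple device objects" point made above.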

Notice the !devnode extension? (refer to screenshot). We can expand upon that information by entering the !devnode extension with the exact same address.

The Compatibility flags section shows the sleep states, and the Instance Path shows us how to locate the exact model of the hardware device.

Use the VEN_XXXX and DEV_XXXX values in a PCI Database, and you should find the exact hardware device.
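If you have many instance paths to look up, pulling the two IDs out can be automated. The sketch below assumes the usual PCI\VEN_xxxx&DEV_xxxx form with four hexadecimal digits each; the path shown is a made-up example, not one from the dump.

```python
import re

PCI_ID = re.compile(r"VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})")

def pci_ids(instance_path: str):
    """Return (vendor_id, device_id) as upper-case hex strings, or None."""
    match = PCI_ID.search(instance_path)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2).upper()

# Hypothetical instance path with placeholder IDs.
path = r"PCI\VEN_1234&DEV_ABCD&SUBSYS_00000000&REV_00"
print(pci_ids(path))  # ('1234', 'ABCD')
```

The two values returned are what you would type into the vendor and device fields of a PCI database.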

Wednesday 7 August 2013

Stop 0x19 - Corrupt Pool Header - !pool, !poolval, dt nt!_POOL_HEADER

This blog post is going to be more of a link to a thread again, but I'm going to quickly explain the fields within the _POOL_HEADER data structure. Please note !pool and !poolval are explained in my Stop 0xC2 blog post.

Stop 0x19 Example - With Stop 0xC5 and Stop 0xC2

Every pool allocation has a data structure called the header, which is used to store information about the pool allocation such as its size, its owner and the previous allocation before it within the linked list.

The Block Size means the current size of the pool allocation.

Previous Size contains the size of the previous pool allocation.

Pool Tag is the owner of the pool allocation. You could use the !poolfind extension to find the allocation owned by that pool tag.
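As a rough illustration of how these fields fit together, the sketch below decodes the first 8 bytes of a pool header and checks that one block's size matches the next block's Previous Size. It assumes the Windows 7 x64 layout (8-bit PreviousSize, PoolIndex, BlockSize and PoolType fields followed by a 4-byte tag, with sizes in 16-byte units); the bytes are fabricated for the example and the helper names are my own.

```python
import struct

GRANULARITY = 16  # assumption: sizes are stored in 16-byte units on x64

def parse_pool_header(raw: bytes) -> dict:
    """Decode the first 8 bytes of a pool header (assumed x64 layout)."""
    (packed,) = struct.unpack_from("<I", raw, 0)
    return {
        "previous_size": (packed & 0xFF) * GRANULARITY,
        "pool_index": (packed >> 8) & 0xFF,
        "block_size": ((packed >> 16) & 0xFF) * GRANULARITY,
        "pool_type": (packed >> 24) & 0xFF,
        "pool_tag": raw[4:8].decode("ascii", errors="replace"),
    }

def make_header(previous_units: int, block_units: int, tag: bytes) -> bytes:
    """Build fabricated header bytes for the demonstration."""
    return struct.pack("<I4s", previous_units | (block_units << 16), tag)

def headers_consistent(current: dict, following: dict) -> bool:
    """The next block's Previous Size should equal this block's Block Size."""
    return following["previous_size"] == current["block_size"]

first = parse_pool_header(make_header(2, 4, b"Abcd"))
good_next = parse_pool_header(make_header(4, 3, b"Efgh"))
bad_next = parse_pool_header(make_header(7, 3, b"Efgh"))

print(first["block_size"], first["pool_tag"])  # 64 Abcd
print(headers_consistent(first, good_next))    # True
print(headers_consistent(first, bad_next))     # False
```

A mismatch like the last one is the sort of inconsistency that suggests a header has been overwritten.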

Debugging Stop 0xBE - !pte

I've found another interesting debugging example, which I would like to explain to anyone who follows and reads this blog. This particular example is a Stop 0xBE - Attempted_To_Write_To_ReadOnly_Memory. The bugcheck is usually caused by device drivers or memory.

The reason for the system crashing is quite obvious: a read-only memory address was referenced with a pointer, and then a write operation to this address was attempted.

I'm assuming you already understand about PTE's and virtual address translation, so I will not explain these terms here.

We can see from the parameters the virtual address which the write was attempted on, and the contents of the PTE. I decided to examine the PTE with the !pte extension; please also note this is documented in the debugger documentation.

The !pte extension would usually give some status flags for the given address, but since this was a Minidump, these may not always appear. The most important thing I noticed from this extension was the message about the virtual address (VA). I'm not sure if this is just because I specified the PTE instead of the virtual address, or because the virtual address is invalid. The 1 flag tells the debugger that this is a PTE and not a virtual address.

Anyhow, a non-canonical VA means that the virtual address is invalid, and referencing it will result in a crash.
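For reference, whether an x64 address is canonical can be checked with a few lines of Python. The sketch assumes 48 implemented virtual address bits, which is the common case for processors of this era, so bits 63 to 47 must all be equal.

```python
def is_canonical(va: int, implemented_bits: int = 48) -> bool:
    """True if the upper bits of a 64-bit virtual address are all 0 or all 1."""
    upper = va >> (implemented_bits - 1)               # bits 63..47 for 48-bit addressing
    all_ones = (1 << (64 - implemented_bits + 1)) - 1  # 17 one-bits for 48-bit addressing
    return upper == 0 or upper == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))  # True  (top of the user-mode range)
print(is_canonical(0xFFFF800000000000))  # True  (bottom of the kernel-mode range)
print(is_canonical(0x0000800000000000))  # False (falls in the gap between them)
```

A single flipped bit in the upper part of a pointer is enough to move an address into the non-canonical gap.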

This may be helpful, since the crash resulted from an invalid memory address being referenced and written to.

Just to add, from this bugcheck, the problem seemed to be linked to an outdated AMD/ATI graphics card driver from 2010.

Understanding !ipi - Stop 0x101

I've created a discussion at Sysnative.com regarding this mainly undocumented extension. I'm going to post the link to the discussion, as I did before with linked lists.

[QUESTION] !ipi extension - Sysnative Forums

Saturday 3 August 2013

Common Exceptions and Interrupt Numbers - Using .trap

You may notice within call stacks that some traps occur as a result of a certain exception. Using the .trap command with the address of the trap frame specified within the call stack, you may notice a small segment of the output called Error Code = #. This value accompanies the exception that was raised, and each exception has a predefined interrupt number within the IDT. You can use the !idt extension to view the IDT.

There are 20 predefined interrupt numbers (0 to 19) for 32-bit operating systems:

0 = Divide Error

This is usually when an instruction has attempted to divide by zero. It is usually caused by device drivers.

1 = Debug (Single Step)

This is a debugging feature used by debuggers: code is executed one instruction at a time, so that the instructions before and after a given point can be checked to see how an error occurred.

2 = Non-Maskable Interrupt (NMI)

A very high priority interrupt, which usually indicates an unrecoverable hardware error.

3 = Breakpoint

Another debugging feature. Breakpoints can be inserted into code; when a breakpoint is reached, it pauses the program to allow the programmer to check for any errors and examine the current state of the program.

4 = Overflow

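Raised when an arithmetic operation has overflowed and the INTO (interrupt on overflow) instruction is executed with the overflow flag set.

Note this is different from a buffer overflow, where a device driver writes beyond its allocated buffer, and from a stack overflow, where a program makes too many nested calls and runs off the end of its stack.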

5 = Bounds Check

Checking that an array index or variable is within the correct allocated range, and doesn't overrun this boundary (device drivers or programs).

6 = Invalid Opcode

Invalid operation code: an invalid assembly or machine code instruction was executed.

7 = NPX Not Available

http://www.gsp.com/cgi-bin/man.cgi?section=4&topic=npx

8 = Double Fault

An exception has occurred while handling another exception.

9 = NPX Segment Overrun

http://www.rcollins.org/secrets/NPXError.html

10 = Invalid TSS (Task State Segment)

http://wiki.osdev.org/Exceptions#Invalid_TSS

11 = Segment Not Present

http://wiki.osdev.org/Exceptions#Segment_Not_Present

12 = Stack Fault

http://wiki.osdev.org/Exceptions#Stack-Segment_Fault

13 = General Protection Fault

Covers several different exceptions, I would view the call stack for more information.

http://en.wikipedia.org/wiki/General_protection_fault

14 = Page Fault

A device driver accesses a page which is not available within physical memory. Normally the Memory Manager makes the page available; the exception becomes an error when an invalid memory address has been referenced.

15 = Intel Reserved

16 = Floating Point

http://wiki.osdev.org/Exceptions#x87_Floating-Point_Exception

17 = Alignment Check

http://wiki.osdev.org/Exceptions#Alignment_Check

18 = Machine Check

The CPU has found a hardware error and has raised the exception.

19 = SIMD Floating Point

http://wiki.osdev.org/Exceptions#SIMD_Floating-Point_Exception
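For quick reference while reading call stacks, the list above can be kept as a small lookup table. The names are taken from this post; the helper function is just a convenience of my own.

```python
EXCEPTION_VECTORS = {
    0: "Divide Error",
    1: "Debug (Single Step)",
    2: "Non-Maskable Interrupt (NMI)",
    3: "Breakpoint",
    4: "Overflow",
    5: "Bounds Check",
    6: "Invalid Opcode",
    7: "NPX Not Available",
    8: "Double Fault",
    9: "NPX Segment Overrun",
    10: "Invalid TSS (Task State Segment)",
    11: "Segment Not Present",
    12: "Stack Fault",
    13: "General Protection Fault",
    14: "Page Fault",
    15: "Intel Reserved",
    16: "Floating Point",
    17: "Alignment Check",
    18: "Machine Check",
    19: "SIMD Floating Point",
}

def describe_vector(number: int) -> str:
    """Return the exception name for an interrupt number from the list above."""
    return EXCEPTION_VECTORS.get(number, "Not in the list above")

print(describe_vector(14))    # Page Fault
print(describe_vector(0x0D))  # General Protection Fault
```

The second call shows that a hexadecimal number, as the debugger usually prints it, can be passed in directly.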

Understanding the !thread extension

The !thread extension is probably one of the most often used extensions when I'm debugging. It gives you several key pieces of information, which I will briefly describe in this post.

THREAD = This is the address of the current thread, we can use this address with the ETHREAD data structure.

IRP List = The list of currently associated IRPs with the current thread.

WAIT = Contains the current state of the thread, and any dispatcher objects which the thread may be waiting upon.

Owning Process = The process which the thread is currently associated with, we can use this address with the EPROCESS data structure.

Teb = Address of the Thread Environment Block (use !teb)

Cid = Client ID, shown in the form ProcessID.ThreadID (useful with the !locks extension)

Priority = The current priority of the thread.

Stack = The call stack of the current thread.