Wednesday, October 1, 2014

dump packets in memory buffers with tshark without leaving gdb

I got tired of running my error-prone eyeball packet decoder on GDB buffers, so I took a few minutes to figure out how to send packet data to tshark to let the computer do what it's good at. Here's the ~/.gdbinit snippet if you want to use it:
define pktdump
   dump binary memory /tmp/dgoodell-pktdump.bin (char*)$arg0 (((char*)$arg0)+$arg1)
   !od -Ax -tx1 -v /tmp/dgoodell-pktdump.bin | text2pcap - - | tshark -V -x -i -
end
document pktdump
Dump a network packet in memory with tshark.
Example usage: pktdump ADDRESS LENGTH
end
(Obviously, change "dgoodell" to your own user name, or point the dump at a file in your home directory; unfortunately, environment variables don't work in GDB.)
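Nothing here actually requires GDB: the same pipeline works on any file of raw packet bytes. Here's a minimal standalone sketch (the file name and sample bytes are invented for illustration; the text2pcap/tshark stage is shown commented out since it requires Wireshark to be installed):

```shell
# Write a few raw bytes to a file, standing in for the buffer GDB dumps.
# (These are the first six bytes of the example frame below: fc 99 47 25 2c 13.)
printf '\374\231\107\045\054\023' > /tmp/pktdump-demo.bin

# text2pcap consumes exactly this od output format: hexadecimal offsets
# (-Ax), one hex byte per column (-tx1), no duplicate-line suppression (-v).
od -Ax -tx1 -v /tmp/pktdump-demo.bin

# With Wireshark installed, the full decode pipeline is:
#   od -Ax -tx1 -v /tmp/pktdump-demo.bin | text2pcap - - | tshark -V -x -i -
```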

The results look like:
(gdb) x/16xb hdr_buf
0x60f5d0:       0xfc    0x99    0x47    0x25    0x2c    0x13    0xfc    0x99
0x60f5d8:       0x47    0x25    0x42    0xbd    0x08    0x00    0x45    0x88
(gdb) help pktdump
Dump a network packet in memory with tshark.
Example usage: pktdump ADDRESS LENGTH
(gdb) pktdump hdr_buf 64
Input from: Standard input
Output to: Standard output
Wrote packet of 64 bytes at 0
Read 1 potential packet, wrote 1 packet
Capturing on Standard input
Frame 1: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface 0
   Interface id: 0
   WTAP_ENCAP: 1
   Arrival Time: Oct  1, 2014 14:46:42.000000000 PDT
   [Time shift for this packet: 0.000000000 seconds]
   Epoch Time: 1412200002.000000000 seconds
   [Time delta from previous captured frame: 0.000000000 seconds]
   [Time delta from previous displayed frame: 0.000000000 seconds]
   [Time since reference or first frame: 0.000000000 seconds]
   Frame Number: 1
   Frame Length: 64 bytes (512 bits)
   Capture Length: 64 bytes (512 bits)
   [Frame is marked: False]
   [Frame is ignored: False]
   [Protocols in frame: eth:ip:udp:data]
Ethernet II, Src: Cisco_25:42:bd (fc:99:47:25:42:bd), Dst: Cisco_25:2c:13 (fc:99:47:25:2c:13)
   Destination: Cisco_25:2c:13 (fc:99:47:25:2c:13)
       Address: Cisco_25:2c:13 (fc:99:47:25:2c:13)
       .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
       .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
   Source: Cisco_25:42:bd (fc:99:47:25:42:bd)
       Address: Cisco_25:42:bd (fc:99:47:25:42:bd)
       .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
       .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
   Type: IP (0x0800)
   Padding: 0000
   Frame check sequence: 0x00000000 [incorrect, should be 0x2cf4146a]
       [FCS Good: False]
       [FCS Bad: True]
           [Expert Info (Error/Checksum): Bad checksum]
               [Message: Bad checksum]
               [Severity level: Error]
               [Group: Checksum]
Internet Protocol Version 4, Src: 10.10.0.1 (10.10.0.1), Dst: 10.10.0.2 (10.10.0.2)
   Version: 4
   Header length: 20 bytes
   Differentiated Services Field: 0x88 (DSCP 0x22: Assured Forwarding 41; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
       1000 10.. = Differentiated Services Codepoint: Assured Forwarding 41 (0x22)
       .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
   Total Length: 44
   Identification: 0x0000 (0)
   Flags: 0x00
       0... .... = Reserved bit: Not set
       .0.. .... = Don't fragment: Not set
       ..0. .... = More fragments: Not set
   Fragment offset: 0
   Time to live: 8
   Protocol: UDP (17)
   Header checksum: 0x9e23 [correct]
       [Good: True]
       [Bad: False]
   Source: 10.10.0.1 (10.10.0.1)
   Destination: 10.10.0.2 (10.10.0.2)
User Datagram Protocol, Src Port: 34066 (34066), Dst Port: 47559 (47559)
   Source port: 34066 (34066)
   Destination port: 47559 (47559)
   Length: 24
   Checksum: 0xf4fe [validation disabled]
       [Good Checksum: False]
       [Bad Checksum: False]
Data (16 bytes)
   Data: 00000000000000000000000000000000
   [Length: 16]

0000  fc 99 47 25 2c 13 fc 99 47 25 42 bd 08 00 45 88   ..G%,...G%B...E.
0010  00 2c 00 00 00 00 08 11 9e 23 0a 0a 00 01 0a 0a   .,.......#......
0020  00 02 85 12 b9 c7 00 18 f4 fe 00 00 00 00 00 00   ................
0030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................

That's also way easier to read than casting to a struct ether_header *, etc. and dereferencing, especially on x86, where all the multibyte fields end up displayed in the wrong endianness.
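To see the endianness problem concretely, consider the EtherType field of the frame above, the two bytes 0x08 0x00 (IPv4). A quick sketch with od (the 16-bit readback shown assumes a little-endian host):

```shell
# Dumped byte by byte, the bytes appear in network (big-endian) order:
printf '\010\000' | od -An -tx1

# Read back as a single 16-bit word on a little-endian x86 machine, the
# same two bytes come out swapped (0008 instead of 0800), which is exactly
# what you see when dereferencing struct fields without ntohs()/ntohl():
printf '\010\000' | od -An -tx2
```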

Friday, May 11, 2012

Text::CharWidth::PurePerl available on GitHub

I've put up a new Perl module on GitHub, Text::CharWidth::PurePerl.  As the name suggests, it's a pure-Perl implementation of Text::CharWidth.  I needed it to be able to ship nowrap as a script without any dependencies.  Amusingly, it seems to be more correct than the "native" wcwidth routine.

It is packaged in the usual CPAN format, but I'm not currently planning on uploading it to CPAN (I don't have a CPAN ID, and getting one seems like a real PITA), and I'm not sure whether anyone else even needs it.  If you do want me to go through the steps, feel free to add a comment here on the blog or over on the GitHub issue tracker.

Friday, April 15, 2011

Debugging Wily Stack Overflows in MPI Programs

I was recently debugging a problem in MPICH2 with the new MPI-3 nonblocking collectives implementation that involved a stack overflow.  The code in question was usually blowing up with a SIGABRT that came from a stack trace ending in __stack_chk_fail.  This bug was particularly nasty because it was a Heisenbug: it kept changing its failure location and mode whenever I inserted debugging code to track it down (hence the aforementioned "usually").

For reasons I will explain later, my usual trick for this sort of thing did not help.  Adding in the MPIU_SG_ debugging macros would cause the bug to vanish, or to reappear in a different function that had not been instrumented with them.

Luckily, I ran across this very helpful blog post that eventually helped me track down the real problem.  My basic approach was the same as the one outlined there, but it needed modification due to the pain inherent in debugging a parallel MPI application.  This particular bug required running at least 4 processes in order to manifest itself, and it would only appear after several iterations within the test program.  So I used mpiexec + xterm + gdb + a gdb script file to handle the job.

First, here's an example stack trace of the failure (using valgrind to easily obtain stack traces):

% mpiexec -l -n 4 valgrind -q ./icbarrier |& sort_labeled_output

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Abort trap (signal 6)
[1] ==47313== 
[1] ==47313== Process terminating with default action of signal 6 (SIGABRT)
[1] ==47313==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[1] ==47313==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[1] ==47313==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[1] ==47313==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[1] ==47313==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[1] ==47313==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[1] ==47313==    by 0x10000141D: main (icbarrier.c:37)
[2] ==47314== 
[2] ==47314== Process terminating with default action of signal 6 (SIGABRT)
[2] ==47314==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[2] ==47314==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[2] ==47314==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[2] ==47314==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[2] ==47314==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[2] ==47314==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[2] ==47314==    by 0x10000143B: main (icbarrier.c:45)
[3] ==47315== 
[3] ==47315== Process terminating with default action of signal 6 (SIGABRT)
[3] ==47315==    at 0x10027D616: __kill (in /usr/lib/libSystem.B.dylib)
[3] ==47315==    by 0x1003022CB: __stack_chk_fail (in /usr/lib/libSystem.B.dylib)
[3] ==47315==    by 0x10005D9D3: MPIDI_CH3I_Progress (ch3_progress.c:539)
[3] ==47315==    by 0x100109B0C: MPIR_Wait_impl (wait.c:67)
[3] ==47315==    by 0x100109FFA: PMPI_Wait (wait.c:171)
[3] ==47315==    by 0x100004506: MPI_Barrier (nbc_pmpi_adapter.c:24)
[3] ==47315==    by 0x10000143B: main (icbarrier.c:45)

So I took a quick gander at the assembly for MPIDI_CH3I_Progress to find the canary value initialization and comparison code:

% objdump -d icbarrier > icbarrier.s
% sed -n -e '/MPIDI_CH3I_Progress>:/,/^$/ p' icbarrier.s | head -n 15

000000010005c7f0 <_MPIDI_CH3I_Progress>:
   10005c7f0:   55                      push   %rbp
   10005c7f1:   48 89 e5                mov    %rsp,%rbp
   10005c7f4:   41 57                   push   %r15
   10005c7f6:   41 56                   push   %r14
   10005c7f8:   41 55                   push   %r13
   10005c7fa:   41 54                   push   %r12
   10005c7fc:   53                      push   %rbx
   10005c7fd:   48 81 ec 08 02 00 00    sub    $0x208,%rsp
   10005c804:   48 89 bd 10 fe ff ff    mov    %rdi,-0x1f0(%rbp)
   10005c80b:   89 b5 0c fe ff ff       mov    %esi,-0x1f4(%rbp)
   10005c811:   48 8b 05 60 c8 10 00    mov    0x10c860(%rip),%rax        # 100169078 <_pvars+0x78>
   10005c818:   48 8b 10                mov    (%rax),%rdx
   10005c81b:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
   10005c81f:   31 d2                   xor    %edx,%edx

So in this code the canary offset is -0x38 from %rbp rather than the -0x14 seen in that blog post.  Let's find the two addresses at which we will set our breakpoints:

% sed -n -e '/MPIDI_CH3I_Progress>:/,/^$/ p' icbarrier.s | grep -- '-0x38(%rbp)'

   10005c81b:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
   10005ce61:   48 8b 55 c8             mov    -0x38(%rbp),%rdx
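Those sed/grep passes can be rolled into a small helper. Here's a sketch (the canary_refs name and the demo listing file are my own, not part of MPICH2): it prints every instruction in a given function that touches a given frame offset, i.e. the canary store at entry and the canary load before the check:

```shell
# Print all instructions in a function that reference OFFSET(%rbp).
# Usage: canary_refs LISTING FUNC OFFSET
canary_refs() {
    sed -n "/$2>:/,/^\$/ p" "$1" | grep -- "$3(%rbp)"
}

# Demo against a short excerpt of the real listing:
cat > /tmp/demo.s <<'EOF'
000000010005c7f0 <_MPIDI_CH3I_Progress>:
   10005c81b:   48 89 55 c8             mov    %rdx,-0x38(%rbp)
   10005ce61:   48 8b 55 c8             mov    -0x38(%rbp),%rdx

EOF
canary_refs /tmp/demo.s MPIDI_CH3I_Progress -0x38
```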

And now for our gdb script:

% cat gdb.script

# breakpoint 1 fires just after the canary is stored at -0x38(%rbp);
# breakpoint 2 fires just before the stored canary is re-checked
break *0x10005c81b
break *0x10005ce61
# on each entry, plant a hardware watchpoint on this activation's canary slot
commands 1
silent
set variable $cow = (unsigned long*)($rbp - 0x38)
watch *$cow
continue
end
# watchpoints are numbered starting at 3 (the two breakpoints take 1 and 2);
# on each exit, retire the watchpoint that was planted at entry
set variable $count = 3
commands 2
silent
delete $count
set variable $count = $count + 1
continue
end
run

% mpiexec -n 4 xterm -e gdb -x gdb.script ./icbarrier

That command launches four xterm windows, each containing a single gdb instance running icbarrier as the inferior.  Here's the resulting xterm+gdb screenshot for the process that hit the watchpoint first:

Unlike in many situations, this backtrace alone was not sufficient for me to figure out what the real bug was.  But, given that I had changed some high-level code (the nonblocking version of barrier, MPIX_Ibarrier), I knew it was extremely unlikely that I had tickled some long-latent bug in the frequently used routine MPIDI_CH3U_Request_unpack_uebuf.  So after a few "WTFs" and some further head scratching, I took a hard look at the upper-level code (error checking and boring code omitted for clarity):


int MPIR_Ibarrier_inter(MPID_Comm *comm_ptr, MPID_Sched_t s)
{
    int mpi_errno = MPI_SUCCESS;
    int size, rank, root;
    int buf = 0;

    /* do a barrier on the local intracommunicator */
    mpi_errno = MPIR_Ibarrier_intra(comm_ptr->local_comm, s);
    MPID_SCHED_BARRIER(s);

    /* rank 0 on each group does an intercommunicator broadcast to the
       remote group to indicate that all processes in the local group
       have reached the barrier. We do a 1-byte bcast because a 0-byte
       bcast will just return without doing anything. */

    /* first broadcast from left to right group, then from right to
       left group */
    if (comm_ptr->is_low_group) {
        root = (rank == 0) ? MPI_ROOT : MPI_PROC_NULL;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
        MPID_SCHED_BARRIER(s);
        /* receive bcast from right */
        root = 0;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
    }
    else {
        /* receive bcast from left */
        root = 0;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
        MPID_SCHED_BARRIER(s);
        /* bcast to left */

        root = (rank == 0) ? MPI_ROOT : MPI_PROC_NULL;
        mpi_errno = MPIR_Ibcast_inter(&buf, 1, MPI_BYTE, root, comm_ptr, s);
    }


fn_exit:
    return mpi_errno;
fn_fail:
    goto fn_exit;
}

This function is called to add operations to a communication schedule; an MPI_Wait call is made later to actually force the bulk of the schedule to complete.

Do you see the bug?  I'd be surprised if you did, since there are few people familiar with the style of MPICH2 code, and even fewer people who know how the NBC implementation works.  But if you do, give yourself a pat on the back, and then please send me bug reports and patches for my other bugs  :)

Take a closer look at the buffer used for the intercomm broadcast.  It's a stack allocated variable, which worked great in the blocking version of this routine.  Unfortunately, this is a nonblocking barrier, so this routine exits before the ibcast is actually ever performed.  So now we've managed to schedule a broadcast to occur that will overwrite a location on the stack that no longer actually refers to int buf;.  Instead, if we are unlucky it will refer to some random piece of a different function's stack frame.  In the backtraces above, this random piece was the -fstack-protector/SSP canary memory.  The bug moved around when I would insert debugging variables because they would affect the specific location of the canary value in memory, sometimes yielding a corrupted parameter or local variable instead.

Changing that buffer to be malloced instead fixed the problem.

Sunday, April 3, 2011

Roku stands behind their product

My older Roku player (this model, N1000-B) died a couple of months ago.  The wireless NIC went out to lunch for some reason, showing 00:00:00:00:00:00 as its MAC address and totally unable to see any wireless networks, just as described here.  I was busy and lazy at the time, so we just switched over to watching Netflix streaming content on our Wii instead.

Last Friday I finally got around to calling Roku about the issue, and the technician I got was very helpful.  He didn't give me the "reboot runaround"; he just asked me what troubleshooting steps I had taken, and confirmed that my MAC address was showing all zeroes.  After a bit more discussion, he offered to replace the unit for free, even though it was past its warranty expiration.  They mailed the replacement unit (looks like it might be a factory refurb?) via FedEx, and 3 business days later the new unit was in my hands.  The new unit setup was painless (as expected), and a return shipping label was included, so even the round-trip shipping was free.

I was extremely pleased that Roku stood behind their product, even after it was out of warranty.  They did exactly the right thing to change my mind from "I'll never buy another one of these" to "I'd happily buy something from them again."

(full disclosure: I have no affiliation with Roku, other than being a fairly satisfied customer)

Saturday, March 5, 2011

m4_include and aclocal

I ran into some surprising behavior a little while back while hacking on the MPICH2 build system, that I thought I should mention here.  But first, some background info is in order.  We have numerous subsystems that are currently configured (in the autoconf sense) by running an autoconf-generated configure script from various points in the top level configure.  Think AC_CONFIG_SUBDIRS, but the subconfig happens at the place where the macro is used, rather than being deferred to config.status-time.  There are some advantages to this scheme, but there are two fairly severe disadvantages:
  1. The subconfigure scripts tend to re-test a whole bunch of information about the build environment (e.g., six repeated checks for "how to run the C preprocessor").  Not only is this wasteful in terms of time, but there is always the unlikely possibility that two configure scripts will come to subtly different conclusions about the build environment, which could cause wacky bugs later in the build.
  2. There's no easy mechanism to utilize the results of subconfigure tests that should be global.  That is, if we add -I/path/to/includes to CPPFLAGS that should be used for all compilation, then this won't be seen in the top-level configure without some additional work.  We currently "pass" these values back via a "localdefs" file that is listed in the subconfigure's AC_CONFIG_FILES and contains some @FOO@ substitutions that are either AC_SUBSTed or "precious" (which implies AC_SUBST for that var).  This stinks because it's a fairly manual process, and it's impenetrable to outsiders who are unfamiliar with our build system because the sourcing of "localdefs" is hidden inside a custom autoconf macro.
So I'm taking a stab at converting these configure files to be configure.in fragments that are then m4_included.  This will cause those tests to be run in the same shell as the top-level configure and will only run the additional tests that are needed by the subsystem.  In the process of that conversion, I ran into some odd error messages from the autotools that are represented by the following example program.

Quick Quiz: The code in Listing A will successfully autoreconf and configure.  The code in Listing B will fail during autoreconf's invocation of aclocal.  Why?


Common code:
## contents of configure.in
AC_INIT([foo],[1.0])
AM_INIT_AUTOMAKE([-Wall -Werror foreign 1.11 subdir-objects])
LT_INIT([])
AC_PROG_CC
var="yes"
AM_CONDITIONAL([COND],[test "$var" = "yes"])
dnl this makes m4 bail, claiming that AM_COND_IF is undefined
m4_foreach([subsys_i],[[subconfigure]],[m4_include(subsys_i[.m4])])
AC_OUTPUT([Makefile]) 

Listing A
## contents of subconfigure.m4
AC_PROG_GREP

Listing B
## contents of subconfigure.m4
AM_COND_IF([COND],[echo COND is true],[echo COND is false])


Do you see the problem here?  I sure didn't, even after some moderate debugging.  Go ahead and think about it for a minute, I'll wait...
.
.
.
.
.
.
.
.
.
Luckily, the nice folks on the autoconf list helped me understand what was going on.  Two things are happening here:
  1. Not all macros are created equal.  In my original world view, AC_PROG_GREP and AM_COND_IF are both just AC_DEFUNed m4 macros and should behave more or less identically.  However, this assumption is totally false.  AC_PROG_GREP is a macro that is built into autoconf, and therefore does not get distributed with your package via aclocal.m4, while AM_COND_IF comes with automake, is managed by aclocal, and must be placed into your package's aclocal.m4 in order to be properly expanded.
  2. It turns out that aclocal isn't as smart as I thought it was.  When it traces the m4 files looking for macro definitions that it must add to aclocal.m4, it doesn't perform real m4_includes at that time.  Instead it uses some regex heuristics in order to determine what file is being included and then traces that file separately.  Since the name of the file that is being included is being computed via an m4 macro in the example above, aclocal gets rather confused and just doesn't trace the subconfigure.m4 file.
The upshot of all of this is that I can't use m4 code to programmatically include subsystem m4 fragments.  They must either be hard-coded m4 file names or be autogenerated by some other non-autotools step, such as a quick shell script run before autoreconf.
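Such a generator script can be trivially small. A hypothetical sketch (the file names and subsystem list here are invented, and this is not the actual MPICH2 solution): it emits literal, hard-coded m4_include lines so that aclocal's regex-based scanner can see each included file name and trace it for macro definitions:

```shell
#!/bin/sh
# Regenerate the aggregate include file before running autoreconf.
out=/tmp/subsys_includes.m4
subsystems="subconfigure"   # one .m4 fragment per subsystem

: > "$out"
for subsys in $subsystems; do
    # Each include is a literal file name, so aclocal can trace it.
    printf 'm4_include([%s.m4])\n' "$subsys" >> "$out"
done
```

configure.in would then contain a single plain m4_include of the generated file, which aclocal traces without trouble.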

Friday, January 21, 2011

Midwest Ice Climbers Ice Festival, 2011

Last weekend I got to try out ice climbing at the Ice Pit near Green Bay, Wisconsin.  It was a very cool setup, sort of an outdoor ice climbing gym in an active quarry.  Normally you must be a member to climb there, but for the festival it was just $20 for the whole weekend.  I did the trip with five other guys, all rock climbers, three of whom also work at the Lab.

I found the climbing itself surprisingly difficult, despite my (at least modest) rock climbing abilities.  According to others I spoke with at the festival, most of the routes we were climbing were in the WI3-WI4 range, all about 100 feet high.  The hardest part was getting accustomed to the crampons and really trusting my feet.  When I didn't trust my feet enough, I would put too much weight on my hands, which led to totally pumped-out forearms after 100 feet of ice.  We were each able to get three climbs in each of the two days, which was enough to tire me out pretty effectively.

There was a fair amount of demo gear to borrow from various vendors, including boots, crampons, and ice tools.  However, we got in touch with Down Wind Sports (in Michigan's UP) via email before the festival and arranged to rent gear from them on Friday evening instead.  That way we saved a lot of time each morning and were certain we would have gear that fit.  Bill @ Down Wind Sports was extremely nice and helpful; I'd do business with them again without hesitation.  For $40/person we had boots and crampons for the whole weekend.  Ice tools were easy to come by as demo gear, and someone in our party already had a set as well.

Overall it was a very enjoyable trip.  I might do a little easy ice climbing in the future as the opportunity presents itself, but for now I'm not planning on picking it up as a regular hobby.  I already have too many hobbies as it is, and this isn't the cheapest or most accessible hobby around.  A brand new set of top-rope ice equipment (boots, crampons, tools) adds up to $800-$1000, while a used set is probably half that plus more time spent hunting for a decent deal.  And northern Illinois isn't exactly teeming with waterfall ice to climb.

You can find a few more photos of the event on my Smugmug page.

Thursday, January 20, 2011

sort_labeled_output script

Quick MPICH2 tip: passing the "-l" option to MPICH2's mpiexec (hydra) yields rank-labeled output like so:

% mpiexec -l -n 2 ./examples/cpi
[0] Process 0 of 2 is on g2-2.local
[1] Process 1 of 2 is on g2-2.local
[0] pi is approximately 3.1415926544231318, Error is 0.0000000008333387
[0] wall clock time = 0.000210

This is handy for figuring out which process is printing what without reaching in and modifying all of your printf statements to explicitly print the rank as well.

For more complicated output, and especially the multiline output that you would get from running valgrind, this interleaved output can be hard to follow.  Enter sort_labeled_output:

#!/bin/zsh
# assumes output that matches the following regex: /^\[\d+\] .*$/
exec sort --stable --key=1.2,1.0 -n

The script just stably sorts the output by rank prefix, which makes it much easier to read multiline output.

Given that this is just a thin wrapper around GNU sort, I actually didn't build this script for a long time.  But it turns out that this is a task I do all the time, and reconstructing the sort arguments always distracts me a bit from the debugging task at hand.  So the script has been a big productivity win.
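To make the effect concrete, here's the script's sort invocation applied to some fake interleaved output (the sample lines are invented):

```shell
# Interleaved, rank-labeled lines as "mpiexec -l" would emit them:
printf '%s\n' \
    '[1] ==47313== first valgrind line' \
    '[0] Process 0 of 2 is on g2-2.local' \
    '[1] ==47313== second valgrind line' \
    '[0] pi is approximately 3.14' |
sort --stable --key=1.2,1.0 -n
# The key starts at character 2 of field 1 (skipping the "["), so the
# stable numeric sort groups each rank's lines together while preserving
# their original relative order.
```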