Doesn't NumPy already use a C inner loop for the bulk of the data loading, since a NumPy array wraps a C buffer that the loader reads straight into? Surely for a reasonably sized array, the overhead of parsing the header info in Python would not be the bottleneck?
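To make that concrete, here is a rough sketch of the split I have in mind, assuming the file is a plain `.npy` written by `np.save` in format version 1.0 with a non-object dtype. The function name `load_npy_by_hand` is made up for illustration. The header helpers come from `numpy.lib.format`, and as far as I know `np.load` takes essentially the same path for ordinary files: a small header parse in Python, then one bulk read in C.

```python
import numpy as np
from numpy.lib import format as npy_format


def load_npy_by_hand(path):
    """Load a version 1.0 .npy file: Python parses the header, C reads the data.

    Hypothetical helper for illustration only; np.load handles more cases
    (other format versions, object arrays, memory mapping).
    """
    with open(path, "rb") as fp:
        # Python-side work: magic string plus a short dict-like header.
        version = npy_format.read_magic(fp)
        if version != (1, 0):
            raise ValueError(f"this sketch only handles format 1.0, got {version}")
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)

        # C-side work: a single bulk read of the raw bytes into the array buffer.
        count = int(np.prod(shape, dtype=np.int64))
        data = np.fromfile(fp, dtype=dtype, count=count)

    return data.reshape(shape, order="F" if fortran_order else "C")
```

The header is a short fixed-cost prefix regardless of array size, so its parse time should not grow with the data, while the `np.fromfile` call scales with the number of bytes read.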
I don't mean to write off the possibility (especially in 2014) that OP could have written faster C code for the loading process, especially if it was hardware-specific. But I would take that as more of a reason to contribute to the project than to do what's described here.