Understanding C by learning assembly (hackerschool.com)
188 points by davidbalbert on Sept 12, 2012 | 66 comments


Mixed feelings here:

* Knowing what the machine is doing is useful and important, especially while debugging. I think every really good C coder I know has been comfortable with assembly at least on some platform.

But…

* C is not a fancy macro assembler. C is defined in terms of an abstract machine, and any machine code that achieves the same output, assuming all the C behavior is defined, is equally valid. The belief that the compiler is just transliterating your code to asm is responsible for many serious bugs, e.g. aliasing rule violations. C compilers haven't worked like that for 20 years (or more), and the fact that they don't is utterly essential for performance, especially as things like good SIMD use become more critical.

Even when the compiler is mostly transliterating, the expectations that come from the simplified mental machine model can be misleading, e.g. alignment requirements on loads on some architectures.
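
A minimal sketch of the aliasing point, assuming GCC or Clang at -O2 (the exact miscompilation varies by compiler and flags):

    #include <stdio.h>

    int main(void)
    {
        float f = 1.0f;
        int *p = (int *)&f;   /* aliasing violation: int* pointing at a float */
        *p = 0;               /* UB: the compiler may assume this store
                                 cannot affect f, and drop or reorder it */
        printf("%f\n", f);    /* no guarantee this prints 0.000000 */
        return 0;
    }

The "store zero, read it back" mental model only holds if you do the type pun through memcpy or a union.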


  $ CFLAGS="-g -O0" make simple
  cc -g -O0    simple.c   -o simple
  $
This is so handy. I never knew that you could call make without writing a Makefile at all. Thanks!


Note that this uses the Bourne shell feature of modifying the environment of the make process by prefixing the command with an environment variable assignment. The more portable/shell-agnostic version is to pass such assignments as arguments to make instead, i.e.:

    make CFLAGS="-g -O0" simple
(This also works with many other build tools like CMake or configure scripts generated by autoconf.)

If you're curious which rules exist by default, try running:

    make -p
(This produces a lot of output; a bit much to study in detail, but still useful to scan just to get an idea of what default rules exist.)


"make -p"

Of course this only applies to GNU make.


Works on FreeBSD's make as well (with a slightly different output format).


Thanks. I was first exposed to this in Zed's Learn C the Hard Way: http://c.learncodethehardway.org/book/. I think it's quite handy.


You can use a Makefile with no targets to set options globally for a directory as well, i.e.:

   echo 'void main() { printf("omg\n"); }' > simple.c
   echo 'CFLAGS=-g' > Makefile
   make simple
   ./simple
(yes I know that is not valid C but it serves this example and compiles fine :-)


If you set CC=c99, then you can change void to int.
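
For example, make's built-in rule for linking a C program invokes $(CC), so the compiler can be swapped from the command line too:

    make CC=c99 CFLAGS="-g -O0" simple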


You've probably heard this already, but you should read the make documentation. It's full of conventions like this aimed at helping the C ecosystem.


And after you have read it, read the documentation of apenwarr's implementation of djb's "redo", and see how everything about make can be simplified, to the point that a 150-line portable bash script can serve as a non-dependency-tracking replacement (that is, one that rebuilds everything on each run).

redo makes everything much simpler, more consistent, more dependable, and more robust: e.g. all files are atomically replaced, dependencies are checked by a cryptographic hash of the content, dependency setup is sane, and it's all faster than make.


I wasn't praising make design though. Just mentioning that there are implicit decisions that could be of use if you have to deal with it.

Thanks for the info, I'll look into redo. I just watched a recent talk about Shake (a make-like tool written in Haskell) with interesting results, like 10x smaller makefiles and a 2x speed improvement.


Isn't dependency the whole point of make?


It's most of the point of make. The other point is templates. And it does both of them, but redo shows that it does them in an unnecessarily complex and inconsistent way.

A redo specification is generally much shorter than the equivalent Makefile, yet simpler to write, and (unlike make / make depend) it can be guaranteed to rebuild whenever necessary and only when necessary. And while a supersmart dependency-tracking incremental build version is not trivial, it's probably an order of magnitude or two shorter than make; and a 150-line bash script is enough to interpret the same specification without regard to dependencies or prior builds (that is: rebuild everything on every attempt).
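
For a taste, a generic rule in apenwarr's redo is just a shell script named after the target pattern. A sketch of a default.o.do, following the conventions in the redo documentation ($1 is the target, $2 the target without its extension, $3 a temporary output file):

    # default.o.do: build any foo.o from the matching foo.c
    redo-ifchange "$2.c"          # record the dependency on the source
    gcc -g -O0 -c -o "$3" "$2.c"  # write to $3; redo renames it into place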


There are actually a fair number of default rules you can use that way. Check the documentation for details...


A thousand times yes.

If you are going to use C, then at least know what it's doing!


Can't this reasoning be applied recursively?

If you're going to learn C, you better learn assembly, so you know what the C is doing.

If you're going to learn assembly, you better learn CPU architecture, so you know what the assembly is doing.

If you're going to learn CPU architecture, you better learn digital logic, so you know what the adders and multiplexers are doing.

If you're going to learn digital logic, you better learn solid-state physics...

I mean, it's always useful to know what's happening under the hood, but it's not always realistic to expect that to happen. I don't think the average C programmer needs to know assembly any more than the average Ruby programmer needs to know C.


I think someone who claims expertise in 'software' should understand the process until it stops being software, which is at the architecture level. Honestly, I think the 'average' Ruby programmer ('average' reflecting their understanding of Ruby) who really claims to understand software should know C and asm.

I really don't have an issue with someone being a dev in Ruby who doesn't understand pointers or how if/else statements can be constructed from jmp primitives. But if you're on the road to expertise, I don't think you can remain ignorant of these.


I think a good C programmer should have a grasp of how assembly works. This can come in incredibly handy for debugging weird errors; yes, there are bugs that are weird enough to only become evident in assembly code. An understanding of assembly also makes memory management and pointers in C so much more understandable. Also, assembly reveals quite a bit about performance: fewer instructions == faster code, roughly.

Then there are some platform-specific things that aren't available[1] from C (at least the standard library) like SSE or MMX instructions. These can give you a big speedup (think of it as "micro-parallelization") if used correctly. Some things in glibc (mainly in string.h) are implemented using them, but as far as I know the standard library doesn't provide an interface to them, and the compiler usually isn't smart enough to use them when appropriate. So if you're trying to squeeze the last bit of performance out of a program, you'll often have no choice but to implement a bit of assembly code as well.

Concerning architecture: one might argue it's not necessary, but I think a good C programmer should at least have scratched the surface here. No deep understanding, but a bit of an overview of what's going on. That is, I think, the point where it no longer concerns you as a programmer. Architecture and assembly really go hand in hand, since every architecture comes with its own assembly code (yes, I know, assemblers sometimes abstract that away). That is why some things work faster on, say, SPARC than on x86. It also helps you understand the limits we're given and what "64-bit" really means.

Understanding architecture doesn't really help you write better C code, but it explains things the compiler does: on a pipelined architecture, the compiler will rearrange assembly instructions so there are as few stalls as possible, gaining performance.

[1] I might be wrong on that though. Please correct me if I am.
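
(In practice the usual escape hatch is compiler intrinsics rather than hand-written assembly. A toy sketch for GCC/Clang on x86, using the vendor intrinsics header, which is a compiler extension and not part of standard C:)

    #include <immintrin.h>

    /* Add four packed floats with a single SSE instruction. */
    void add4(const float *a, const float *b, float *out)
    {
        __m128 va = _mm_loadu_ps(a);             /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* packed add, then store */
    }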


"I think someone who claims expertise in 'software' should understand the process until it stops being software, which is architecture."

Having a broad background is useful even if it's considered "architecture". Knowledge of networks and protocols is essential. Having IT chops is a great toolkit to draw from. A DBA will need to know more about storage than most developers. Understanding some RF basics helps when dealing with wireless. I've probably used more of my EE background dealing with lower levels of the stack than my CS background.


It certainly works back up the other way, lots of bits of Perl make a lot more sense with some C background.

And, FWIW, when I did CompSci back in the '80s, the first courses/tutorials involved building half-adders and flip-flops from NAND gates and wires on a plug-board.

You don't _need_ to know that sort of stuff, especially not in any level of detail, to build another CRUD webapp - and with luck a hugely successful business on the back of your CRUD app - but I think there's a personality trait that comes along with all the other stuff that makes some people "hackers" which characterises many of us as insanely curious. Many of us don't want to treat the latest Ruby ORM features or Node.js's async callbacks as a "black box", we want to "know how it works" - sometimes out of pure curiosity, and sometimes because it fails _hard_ and we can't work out why without understanding what it's doing "under the hood".


I've played around with all of those levels at various points in school and my career. My knowledge of solid-state physics mostly goes unused, but every other level is something I've had in mind at various points. I'd file them all under "good to know".

Sure, you can generally get by without low level knowledge, but it's nice knowing why certain design decisions were made. And as far as C goes, I think it's extremely useful to be familiar with how it translates to assembly, at least in general. It is knowledge that you will use regularly if you're programming in C in any significant capacity.

For example, reasoning about ABIs is something you'll have to do if you ever write a stable library interface. Reasoning about ABIs requires you to know the asm-level calling convention for the platform you're coding on.
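
For instance, assuming the x86-64 System V calling convention (other platforms differ), a toy function makes the contract concrete:

    /* For long add(long a, long b), the ABI dictates that a arrives in
       %rdi, b in %rsi, and the result returns in %rax. gcc -O2 compiles
       the body to roughly:  leaq (%rdi,%rsi), %rax ; ret
       A stable library interface means freezing exactly this contract. */
    long add(long a, long b) { return a + b; }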


The implication was that you need to know assembly if you're going to be a C programmer. If you can "get by fine without," that's close to the definition of not specifically needing it. Knowing the next layer down is definitely useful, though.


He stated that 'you can get by', not that 'you can get by fine'.

Writing C without knowing basic assembly is writing code you can't debug. I'd say that's clearly not acceptable.


What's an example of a bug in C that cannot be diagnosed without knowledge of assembly?


Can I include compiler bugs? I've had to deal with a slew of those over the years.

If we put those aside, then the simplest example is probably when someone else's code crashes due to your input (though the bug may be theirs), and you don't have their source code. I've spent a great deal of quality time tracking down bugs in vendor libraries for which I had no source, and then working around them.

There are a ton of other possibilities, though. For instance, ordering issues related to thread-safety, out-of-order execution, and synchronization points/barriers.

Or, hell, just being able to read a failure quickly, whether or not you have debug symbols handy. Contrived example:

  Program received signal EXC_BAD_ACCESS, Could not access memory.
  Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
  0x0000000100000f25 in main ()

  (gdb) disassemble 
  Dump of assembler code for function main:
  ...
  0x0000000100000f25 <main+21>:	movb   $0xff,(%rax)
  ...
  End of assembler dump.

  (gdb) info reg rax
  rax            0x0	0


I think this is a valid example of a case where it would be helpful to know assembly as a C programmer, but not a justification for the claim that writing C without knowing assembly is writing code you can't debug.

Even if you are having problems interfacing with a third party library, most of this stuff is pretty Googleable.

I think a good summary of my opinion would be that if you're going to primarily write C professionally throughout your career, it's best to understand fundamentals of assembly, but doing projects here and there? I don't think it's worth the time; you just won't use it that much.


C isn't the kind of thing you dabble in.


Anything having to do with binary formats and I/O would qualify.

Let's say for example you're trying to load a binary file and the header format is like this:

a 16-bit magic number, an 8-bit file size, a 3-bit version, and a 5-bit header size

Now, are you confident enough in how structs are laid out in memory that you could use a bit-field to read these values in? Are you sure it would work on 32-bit, 64-bit, big- and little-endian machines? What is all that packed-structure nonsense in GCC?

That's something that C abstracts away from you and can lead to errors.
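
A sketch of the bit-field temptation (GCC syntax; whether this matches the on-disk layout is precisely the implementation-defined part):

    #include <stdint.h>

    struct header {
        uint16_t magic;        /* 16-bit magic number */
        uint8_t  file_size;    /* 8-bit file size */
        uint8_t  version  : 3; /* bit-field allocation order is */
        uint8_t  hdr_size : 5; /* implementation-defined (C99 6.7.2.1) */
    } __attribute__((packed)); /* GCC extension: suppress padding */

The portable route is to read the bytes into a buffer and assemble each field with shifts and masks.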


I once worked with a PNG decoder that crashed on certain inputs. I later found out it was an alignment issue, which takes some basic knowledge of computer architecture, and of course assembly, to track down.


A bug in C caused an alignment issue? That's pretty surprising to me. That should be abstracted from the programmer. If it was written in assembly, it makes sense, though.


> A bug in C caused an alignment issue? That's pretty surprising to me. That should be abstracted from the programmer. If it was written in assembly, it makes sense, though.

Memory alignment in C is most definitely not abstracted from the programmer. The compiler does add some padding to fit alignments, but that can either work for you or against you. 98% of the time it works the way you want, but when it goes wrong, it's going to be a hard problem to debug. So it's absolutely critical for a C programmer to know a thing or two about alignment, especially the cases where the compiler attempts to fix it for you.

Examples of alignments that can have a performance effect, or in some scenarios (multithreading, kernel programming) may affect the correctness of the program: 4 bytes (word size), 8 bytes (64-bit pointer size), 16 bytes (SSE/NEON SIMD register size), 32 bytes (ARM cache line size, AVX SIMD register), 128 bytes (ARM and x86 cache line size), 4k (ARM and x86 page size). Add relevant figures for any other target archs you're dealing with.

When doing GPU work (using glMapBufferRange or something), there are other alignments to care about and they may be GPU-specific. Aligning everything to 2^N can never hurt.
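
A small example of the compiler's padding at work (offsets assume a typical ABI where int is 4-byte aligned):

    #include <stdio.h>
    #include <stddef.h>

    struct s {
        char c;  /* 1 byte, then 3 padding bytes so i is 4-aligned */
        int  i;
    };

    int main(void)
    {
        printf("%zu %zu\n", sizeof(struct s),
               offsetof(struct s, i));  /* typically prints: 8 4 */
        return 0;
    }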


Not that hard. Apply improper member alignment via aligned/packed attributes, or hit a compiler bug, or just do something funny, and then execute code that requires specific alignment, like, say, an SSE code path in a PNG decoder that expects natural alignment.


It can be applied recursively, but with diminishing returns and an increasingly alien (to a software developer) set of rules as you go down. If learning one thing made you 10% better at the thing above it, you would also become 1% better at the thing above that.


With a diminishing amount of knowledge required at each level, you might be right.

A master C programmer knows assembly well, a good chunk about CPU architecture, some things about digital logic, and the physics only superficially.


I mean, it's always useful to know what's happening under the hood, but it's not always realistic to expect that to happen. I don't think the average C programmer needs to know assembly any more than the average Ruby programmer needs to know C.

I disagree. C exposes you to hardware details that are abstracted away when developing in a higher-level language such as Ruby: for instance, having to deal with overflow on signed/unsigned types, or with register variables (I know that compilers will mostly ignore those, but it's still a language feature).

The average Ruby programmer, in turn, can perfectly well live without knowing C programming. A more accurate analogy would be for a Ruby programmer to know about the language implementation being used (MRI, JRuby, Rubinius, etc.).


Signed vs. unsigned data types aren't an artifact of assembly language; they're more an artifact of digital data representation. I think that knowledge is just as useful to a C programmer as to a Ruby programmer. Fire up irb and type 0.1 + 0.2, for instance. You covered what I would have said about the register keyword, but you're right: it's essentially impossible to understand that language "feature" without having a basic understanding of the CPU.
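
The same representation issue shows up identically in C, of course:

    #include <stdio.h>

    int main(void)
    {
        printf("%.17f\n", 0.1 + 0.2);  /* 0.30000000000000004: IEEE 754,
                                          not a property of any language */
        return 0;
    }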


Indeed. My main point was that programming C requires more knowledge about the "layers" underneath its execution than higher level languages such as Ruby.


There's a course addressing this: full-stack knowledge, but from bottom to top. Building a system from 'Nand to Tetris', as they pitch it. They probably left ceramics as an exercise for the reader.

http://www1.idc.ac.il/tecs/


My comp. sci. program stopped at the solid-state physics point in your recursion example. Every step has made me a better programmer.


Same here. But how much digital logic made me a better programmer is on the order of how much my calculus classes made me a better programmer.

I think that my knowledge of architecture has influenced my C skills more than my assembly knowledge.


I would argue that knowledge of assembly and knowledge of architecture go hand in hand. I can't imagine many people who actually 'get' assembly saying 'I don't understand architecture'.

Also, I went through some low-to-mid range math in school (calc 3, diff. eq., linear algebra) and those classes very much made me a better programmer. The analytic thought processes you learn through those classes are invaluable.


My calculus classes actually have made me a better programmer. Depends on the problems you're facing though.


The best undergraduate project I ever did was one where we had to design and simulate a 4-bit microprocessor (this was 1991). While it pulled together strands of our analogue and digital electronics classes, the biggest impact was on my software development. It brought the programmer's model of a CPU completely to life, and stuff that I had merely known, I now Knew. I became a pretty mean C programmer (even if I say so myself) because of this.

How far you go down the rabbit hole depends on how much of this supporting knowledge you need for the level you work at. Given the layers of software abstraction that exist in systems these days, is it really necessary for a programmer to have a detailed knowledge of how a transistor works? Not at all. You're much better off spending the time building a mental model of how your VM or OS layer is working and going to interpret the code you write. Or even looking up and understanding the layers that will use what you write, like users and customers (now there's a real challenge!). For sure, take an interest in all things, but as you go down the layers your thinking can be increasingly abstract and, like any model, plain wrong, and it won't really matter.


I think you should know at least one level of abstraction, in each direction, beyond what you presently know. So if you're writing a C driver, you should know what the assembly will look like and also what the application developer will be writing with the help of your driver. This probably has more to do with communication and understanding of the people problems involved than with performance though.


In a strange way I sort of did this with my EE and math degrees. I did draw the line at the Hartree-Fock approximation for doing ab-initio band structure calculations for silicon though so there's still room at the bottom for me. Since then I've come to realize that programming is more fulfilling to me but I do feel prepared for this particular line of rabbit-hole questions :-)


I don't think that is true. If you're writing C, it's probably because you want to be close to the metal and want many of the guarantees that such proximity provides. If you're focused on that sort of thing, then it's definitely useful to understand what your compiler is actually outputting so you can react accordingly.

(I'm a C programmer, I don't look at assembly that often at all, but it's certainly helpful to be able to do so, and my knowledge of the x86 ISA, calling conventions, etc. has informed many decisions I've made in the past, especially re: performance)


If you want to know what GCC is outputting, you need to learn a hell of a lot more than just x86 assembly; you're going to have to do quite a bit of study of compiler theory. Run objdump on some non-trivial C program and the flow of logic will likely not directly resemble the C you wrote.


You don't need to study compiler theory to understand the kind of code the compiler will output. That's very easy to reason about, and much harder to implement.

Knowing x86 assembly is more than enough to reason backwards from assembly to the C that generated it.


Working with C implies working with hardware and raw code: the instructions the CPU executes. Just as working with JavaScript (in the browser) suggests knowledge of how browsers parse HTML, generate the DOM tree, and subsequently render it.

When things go wrong, it is useful to have knowledge n steps deep, to immediately cut through a ton of output and nail down the few places where things don't seem to work as expected.


There are diminishing returns to going deeper and deeper. But there's a big return for the first level.

I think it's useful to have a basic understanding of assembly so that you can debug better, or understand why some behaviors are undefined. I think that only takes a basic understanding of assembly though.

If you want to be an expert at assembly, you probably want to go deeper, but most people don't have that goal.


If you wish to make an apple pie from scratch, you must first invent the universe.

-Carl Sagan


Welcome to the MIT EECS degree program, quite literally :-D


It worked for me.


You can run gcc with the -S flag, which will dump the assembler output (so you don't have to go through gdb).

You can also use `objdump` from binutils


gcc -S is usually preferable, since you don't get all the code produced/included by the linker to get the thing running.


And using gcc -S allows you to get a few extra annotations (variable names etc.) using the -fverbose-asm flag.
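
Putting those together:

    $ gcc -S -O2 -fverbose-asm simple.c   # writes simple.s with annotations
    $ objdump -d simple                   # disassembles the linked binary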


I sometimes think that an ideal approach to learn programming might be to learn it on several abstraction levels at once to see how things interact.

One good combination would be Assembly + C + Python for instance. Assembly helps understanding C and C helps with understanding Python.


I'm not sure how long this has been the case, but this is how C is taught at my school (Penn), and I'm sure at many others.

In my CIS240 class (currently enrolled) we learn computers from the ground up and eventually learn C. First we do binary arithmetic by hand to get a feel for it, then we design basic circuits (and some complicated ones) that can perform the basic computer instructions (ADD, MULT, SHIFT, AND, XOR, etc.), then we keep building up from these basics until we are finally writing C, at which point we will (hopefully) have a good grasp of what is going on and appreciate our programming language of choice a whole lot more.


Assembly was always daunting to me. Then I took Computer Systems at university. It ended up being seriously fun. The way it made me think and try to solve problems was engaging and different. It required a new way of thinking and made me think about things I had never thought about. Although we only wrote small programs for the class, I found them fun, and I still remember the class nearly 10 years later.

In regards to the original post, I'm not sure I would pit C against assembly, or learn one over or for the other. They both have their uses and their reasons for knowing them. Forcing yourself to learn one before the other does not seem like a logical way to go. I think it's best to start with what makes the most sense for your end goal and/or what you are most interested in and motivated to dig into.


I used to teach assembly, and I spent a lot of time talking about how it connected with C, since assembly is almost always used in conjunction with C. I have a free text on the web that was used in the class. It only covers 32-bit x86, since that's all there was back then. http://www.drpaulcarter.com/pcasm


Fascinating; reading it made me remember how much fun it was to use SoftICE.


I loved SoftICE, especially the NMI button that you could slide into an ISA slot.


I think I lost track because of the lack of support after Windows XP. But I did read (on some dedicated forum) that someone had it running on Vista and onwards; I've just never been able to find it. I think there was a release where the symbol table wouldn't show, and a company buyout. Do you use something with ring-zero access now as a substitute?


No, I moved on to other things, now I'm doing mostly embedded with external debug probes for debugging.


It's been more than a decade since I've worked in C, but don't C compilers have an option to emit assembly directly?


Hmm, interesting..



