Basic Pointers

I saw this question on reddit and decided to help. It’s a little more fundamental than I usually blog about, but I thought it might one day help someone who has only seen the abstractions that high-level languages present to us, never the nitty-gritty underneath.

So as a beginner programmer the line of code:

int *pApples = &apples;

means that I am creating “something” “somewhere” and this “something” points a finger at the memory address of apples. When I output the “somewhere” of pApples by typing in &pApples I get the same address as &apples. To me, it looks like two things are occupying the same space at the same time. I’m just fiddling with pointers at the moment. This is way beyond what I have been taught in class so far so sorry if this is a stupid question.

Literally “where” is an unanswerable question. The compiler, CPU and OS all contribute to that.

Broadly, however, the pointer is, like every other variable, a block of memory, allocated on the stack, on the heap, or in global memory.

The particular CPU (and compiler) you are using determines how big a block of memory is allocated for any particular variable type. For example, an unsigned int on my system (and probably yours) is 32 bits wide, so it takes up four bytes. Let’s say that, in one particular run of a program, it happens to be allocated at 0x40000000 in memory and is equal to 0xaabbccdd.

0x40000000 0x40000001 0x40000002 0x40000003
[0xdd]     [0xcc]     [0xbb]     [0xaa]

This particular CPU is little-endian (hence the least significant byte of the 32-bit unsigned int is stored first in memory); on big-endian CPUs, the opposite would be true. Big-endian is more friendly for humans to read, but it’s often easier for a CPU and whatever algorithm you’re writing to start at the least significant end, so little-endian has ended up more common.

It’s actually better (and one of the advantages of using a higher-level language) to not write code that knows the endian-ness of its target and let the compiler sort out those details. We’ll do the same by just labelling the whole memory block as one entity with our unsigned int in it.

0x40000000
[0xaabbccdd]

The compiler makes this bit of memory available to us, labelled and typed, so we’ll treat this as:

(unsigned int i)
@ 0x40000000
[0xaabbccdd]

With this information the compiler knows enough about i to be able to perform all your integer operations on it. So when we loaded i with something:

unsigned int i;  // as the programmer we don't know that
                 // this is stored at 0x40000000
i = 0xaabbccdd;

The compiler then knows to write 0xdd into 0x40000000, 0xcc into 0x40000001, etc. (actually on a 32-bit system it leaves it to the CPU to write it all in one operation, but imagine it’s done byte-by-byte for now). Similarly if you do

i = i / 10;

It knows to translate this to (making up some assembler language):

mov  [0x40000000], R1   ; put contents of 0x40000000 in register 1
mov  0x000000a, R2      ; put 10 in register 2
call _unsigned_int_long_division  ; divide R1 by R2, answer in R3
mov  R3, [0x40000000]   ; answer goes from register back to memory

Importantly, if i were a signed int, the compiler would know to call a different subroutine, but your i = i / 10 line would be unchanged. The same goes for float, long long, or even uint8_t. Also note that the division operation is really just another function call; it’s simply so fundamental that languages tend to give it a concise symbol.

Now, we’ve already seen pointers in this little bit of pseudo assembly as we copied the number from memory to a CPU register so the CPU could work on it, then back into the memory location representing the variable. We can do similar things in C though.

unsigned int i;   // an integer, stored at say 0x40000000
unsigned int *p;  // a pointer to an integer, stored at say 0x40000004

i = 0xaabbccdd;

// Point p at i
p = 0x40000000;

You would never do this. In fact you can’t: it’s only because this isn’t a real example that we even know the compiler/linker has decided to store i at 0x40000000. Fortunately, we don’t need to know. The compiler comes with a handy operator that lets us query its knowledge of where it chose to store a variable: the “address-of” operator.

p = &i;

Now, note that we’re not doing anything different from when we assigned 0xaabbccdd to i earlier – we’re just putting a number in a variable. It’s just that this time the number has an additional meaning.

Let’s look at the memory:

(unsigned int i)   (unsigned int *p)
@ 0x40000000       @ 0x40000004
[0xaabbccdd] <---- [0x40000000]

Both are 32 bits of storage (on this pretend 32-bit system), but they have different types.

Remember how the compiler knew which subroutines to call when we performed operations with i, because it also knew what type the variable was? Exactly the same applies to pointers: the compiler knows that the number stored in p can have some “pointery” operations performed on it that can’t be done on an unsigned int (and vice versa in fact). In particular, there is the inverse operation of “address-of”: “dereference”, written with the unary operator “*”.

i = i / 10;   // legal, division is defined for `unsigned int`
p = p / 10;   // illegal, division is meaningless for a pointer

i = *p;       // legal, p is a pointer so it can be dereferenced
p = *i;       // illegal, we can't dereference an integer

The compiler is protecting us. Even though p and i are both 32-bit numbers, dereferencing i is not possible because the compiler has been told it doesn’t point at anything. Note that if we were using assembly language there would be no such protection, because the CPU doesn’t know anything about data types (for the most part). So, we can dereference a pointer, but what type should the dereferenced number have? We’ve already told the compiler that.

unsigned int *p;

We’re saying that “*p” will be an unsigned int. Hence while we can’t perform division on a pointer, we can perform it on a dereferenced integer pointer.

i = *p / 10;

Let’s look at the pseudo-assembly:

mov  [0x40000004], R4   ; put the contents of p into a register
mov  [R4], R1           ; put contents of 0x40000000 in register 1
mov  0x000000a, R2      ; put 10 in register 2
call _unsigned_int_long_division  ; divide R1 by R2, answer in R3
mov  R3, [R4]           ; answer goes from register back to memory
                        ; pointed to by R4 (p points at i, so
                        ; this is the same location as i)

The middle of this is the same as before, but we’ve added some indirection around the outside.

Nothing stops you continuing this idea. Since we can point at anything, and we have a way of telling the compiler the type of the thing we’re pointing at, we can go again:

unsigned int i;
unsigned int *p;
unsigned int **pp;

p = &i;
pp = &p;

pp is now a pointer-to-a-pointer-to-an-unsigned int. Same as before: we’ve told the compiler that **pp is an unsigned int. Equivalently, *pp must be something that points to an unsigned int, that is, an unsigned int *.

To go back to your question directly.

When I output the “somewhere” of pApples by typing in &pApples I get the same address as &apples.

Hopefully now you can see that &pApples is not equal to &apples. &pApples is a pointer to a pointer to an integer; and &apples is a pointer to an integer. To modify my last example a little to use your naming:

int apples;          // let's say this is at 0x40000000
int *pApples;        // let's say this is at 0x40000004
int **ppApples;      // let's say this is at 0x40000008

pApples = &apples;   // 0x40000004 now holds 0x40000000
ppApples = &pApples; // 0x40000008 now holds 0x40000004

// Note because the pointers point at apples regardless of what it
// holds, it doesn't matter that we assign this after we assign to
// the pointers
apples = 10;         // 0x40000000 now holds 10

Let’s fill in our last pretend memory map:

(int apples)     (int *pApples)   (int **ppApples)
@ 0x40000000     @ 0x40000004     @ 0x40000008
[0x0000000a] <-- [0x40000000] <-- [0x40000004]

Now we can use those pointers…

// Dereferencing the pointer-to-the-integer gets us an integer
assert(*pApples == apples);
// Dereferencing the pointer-to-the-pointer once gets us a pointer to
// apples, pApples
assert(*ppApples == pApples);
// Since *ppApples equals pApples, dereferencing *ppApples is the same
// as dereferencing pApples... i.e. apples
assert(**ppApples == apples);

The compiler picked the storage locations of these variables; and we fetched those locations with the address-of operator and wrote them to other variables. Those other variables have to be typed such that the compiler will allow assignment of pointers to them, but a pointer is just another variable once you scratch the surface.
