Software Testing Blog

Memory Allocation in C/C++

This week, we had a basic question about memory allocation. Reader Jeff asked:

I’m trying to create an array of structs to count the number of times each word is encountered in an input file. My code looks like this, but it doesn’t work:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct word {
  int count;
  char *word;
};

struct word *words[5000] = {0};

int find_word(char *word, struct word *words[])
{
  int idx;
  for (idx = 0; idx < 5000 && words[idx]; idx++)
  {
    if (!strcmp(word, words[idx]->word))
      return idx;
  }
  return -1;
}

int main()
{
char line[1000];
char *tmp;
int idx;

/* Get input by lines */
while (fgets(line, 1000, stdin))
{
/* Get first word */
char *word = strtok_r(line, " \t\n\r\b", &tmp);

/* Get all words in line */
while (word)
{
int wordlen = strlen(word);
idx = find_word(word, words);
if (idx == -1)
{
while (idx < 5000 && words[++idx] != NULL);
words[idx] = (struct word *)calloc(1, sizeof(int) + wordlen + 1);
strncpy(words[idx]->word, word, wordlen);
words[idx]->count = 1;
} else {
words[idx]->count++;
}

/* Get next word */
word = strtok_r(NULL, " \t\n\r\b", &tmp);
}
}

for (idx = 0; idx < 5000 && words[idx]; idx++)
printf("%s: %d\n", words[idx]->word, words[idx]->count);

return 0;
}

It compiles okay, but when I run it I get a segmentation fault. I ran it in a debugger and see that the crash happens on line 47 [ed. note: line 47 contains the strncpy() call]. I’m allocating space for the int and the string on the line above, so I don’t understand why it is failing.

All help appreciated!

Jeff

Hi Jeff,

First of all, thanks for sending in your question. It’s a great opportunity to explain the interactions between structures, arrays and pointers.

I presume that this is for an educational assignment, so you probably don’t need to worry about the finer points. I also assume that you’re intentionally working at a low level, for educational purposes, so I’ll try to keep my responses in terms similar to your code.  I’ll address your specific question, but there are a number of problems with the code that I won’t dwell on. Those problems relate mostly to the code being ready for the real world, where you rarely have the luxury of assuming that things will have fixed sizes, or reliably work as expected.

If this is an educational exercise, please consider other aspects of how you can improve the code (once you get it working): can you really expect the input file to have no more than 5000 unique words, and lines no longer than 1000 characters? Should words contain punctuation or other special characters? What happens if you get a really large input file (possibly not in a text format)? You get the point.

Overall, the code looks pretty good. Your problem seems to be due to misunderstanding how character pointers work. You are allocating enough space for the word and the count, but not in the way that you think. A char pointer typically requires 4 or 8 bytes of storage, and contains the starting address of the memory where the word itself is stored. You’re actually allocating space for the int and the pointer, plus (potentially) a bunch of storage after the pointer. For small words, you’re not allocating enough space for the pointer.

In the strncpy() call, you are copying the contents of word to the address pointed to by the word member. The problem is that you haven’t initialized that member to point at anything! Since you allocated the memory using calloc(), it is initialized to 0 (i.e. NULL) and you are trying to store the word there – leading to the segmentation fault.

One option is to define the word member as an array instead of a pointer. That way, the memory will automatically be allocated and you can just copy the word into the array.  For example, you could define the structure as something like this:

struct word {
int count;
char word[500];
};

And then use code like this to set up the structure:

words[idx] = (struct word *)calloc(1, sizeof(struct word));
strncpy(words[idx]->word, word, MIN(wordlen,499));

This allocates 500 bytes for each word as part of the structure, so you don’t need to allocate that memory explicitly. Note that there is a 499 byte limit to the length of words in this case, and you’re wasting space for any words less than that length (which will likely be the vast majority of words).

Better, you can use a pointer for the word member and initialize the pointer before you use it. For example, you could do something like this:

words[idx] = (struct word *<strong>)calloc(1, sizeof(struct word</strong>));
words[idx]-&gt;word = (char*)malloc(wordlen + 1);
strncpy(words[idx]-&gt;word, word, wordlen);

In this case, you allocated space for the pointer in the structure, and you allocate additional memory for the word – with that address stored in the word member. You can then safely copy the word.

I actually find it silly to do two allocations for each element in the array, although I realize this may just be an intermediate step to a data structure other than an array.

If you have a fixed number of words, you can base the array on “struct word” and dynamically allocate just the word storage for each element:

struct word words[5000] = {0};
…
if (idx == -1)
{
while (idx &lt; 5000 &amp;&amp; words[++idx].count != 0);
words[idx].count = 1;
words[idx].word = (char*)malloc(wordlen + 1);
strncpy(words[idx].word, word, wordlen);

Instead of using malloc and strncpy, you could alternatively use strdup(). Note that there will be other code changes needed, especially where words[idx] is used (change the “->” to a “.” since it’s no longer a pointer). Hopefully you get the point.

One problem with the current implementation, is that you potentially need to walk over all of the array elements each time you read a word. You actually double that for each new word, as you walk the array to determine that it isn’t present, and then walk the array again to find the end. You also should free the memory that you allocate with calloc(), malloc(), strdup(), and their brethren.

I think a better solution is to use a data structure like a linked list or tree. With those structures, you don’t need to worry about wasting unused space in an array, and they will help you improve performance for large input sets. You need to find the right match for your situation – something that balances ease/speed of implementation, manageability, performance, and so forth for the conditions within which you expect the application to run.

Please share your thoughts in the comments, and send in questions or topics that you’d like to see us cover in the future!

– Jon

  1. I think something closer to the Jeff’s original code would be to use a structure containing a variable-size array and then allocate each one (using essentially his original allocation code) to be just the right size. That fixes the original crash problem, avoids the insufficient allocation for longer-than-500-character words problem, avoids the wasted space for shorter-than-500-character words problem, and avoids the two allocations per word problem.

    By the way, never write code like:

    while (idx < 5000 && words[++idx].count != 0);

    Better to write:
    while (idx < 5000 && words[++idx].count != 0)
    ;

    lest you later (attempt to) add a body to that loop and wind up with:

    while (idx < 5000 && words[++idx].count != 0);
    do_something_else(words[idx]);

    (see the problem?)

    1. Thanks for pointing this out, Roger! I neglected to proofread the code samples sufficiently. I definitely agree that empty loops, when necessary, should stand out. In fact, this may have been a better choice:

      while (idx < 5000 && words[idx].count != 0)
      {
      idx++;
      }

    2. What I’ve seen many coding guides recommend in this situation is to write:
      while (idx < 5000 && words[++idx].count != 0) continue; // or on new line with braces
      in my opinion the best option because it makes it as clear as humanly possible what's going on, even a single ; can be overlooked relatively easy after all.

  2. Agreed that a hashmap or something along the lines would make for a vastly superior data structure than an array, but the question of how to best structure the struct is important anyhow. And I’m surprised you didn’t mention the to me most obvious fix (single allocation of memory, probably better for cache locality although that depends on the usage of the data structure)
    struct word {
    int count;
    char word[];
    };
    struct *word = malloc(offsetof(struct word, word) + wordlen + 1);

    1. I have to admit that I debated mentioning a variable length structure member, as it is very close to the original code. I decided not to, because such syntax requires a C99 compiler (admittedly very common), or utilization of the controversial struct hack in older compilers.

      1. I see. Admittedly having to explain why the given code wouldn’t work for MSVC (does it? They did have some extension I think) would make this longer than it should be.

        And the only interesting thing it’d add otherwise would be why you ought to use offsetof and not sizeof in this case (although for a char array alignment and padding won’t be a problem obviously) so I can see why you cut it out.

Leave a Reply

Your email address will not be published. Required fields are marked *