How to Copy Strings Correctly in C

K&R C, a book every programmer should read, contains a classic implementation of strcpy:

strcpy(s, t)
char *s, *t;
{
    while (*s++ = *t++)
    ;
}

The code is small and compact, but it has serious problems, for example:

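If t (the source) is not NUL-terminated, the loop may read from invalid addresses
If the buffer that s (the destination) points to is too small, the copy may overwrite memory beyond s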

By exploiting the layout of the stack frame during a function call, code injection becomes fairly easy; see:

Strcpy security exploit – How to easily buffer overflow « Pointerless

strncpy

string.h also provides another copy function:

char *strncpy(char *dest, const char *src, size_t n);

It takes a third parameter, the maximum number of bytes to copy. The man page describes it like this:

The strncpy() function is similar, except that at most n bytes of src are copied. Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.

This does not look much better than strcpy, and it has quite a few special cases. Here is roughly how it is implemented:

char *
strncpy(char *dest, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n && src[i] != '\0'; i++)
        dest[i] = src[i];
    for ( ; i < n; i++)
        dest[i] = '\0';

    return dest;
}

Here are three examples with n = 5:

src	W	E	L	C	O	M	E	\0
dest	W	E	L	C	O

dest is not NUL-terminated

src	H	E	L	L	O	\0
dest	H	E	L	L	O

dest is not NUL-terminated

src	H	I	\0
dest	H	I	\0	\0	\0

dest ends with three NULs
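
The three cases above can be checked with a small sketch. The helper copy5_has_nul and the '#' sentinel fill are illustrative names and choices, not part of any library; the sentinel only makes it visible which bytes strncpy actually wrote.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy src into a 5-byte field with strncpy and
 * report whether any NUL byte ended up inside the field. */
static int copy5_has_nul(char field[5], const char *src)
{
    memset(field, '#', 5);      /* sentinel, so written bytes stand out */
    strncpy(field, src, 5);
    return memchr(field, '\0', 5) != NULL;
}

static void strncpy_demo(void)
{
    char field[5];

    /* src longer than n: truncated, no terminator */
    assert(!copy5_has_nul(field, "WELCOME"));
    assert(memcmp(field, "WELCO", 5) == 0);

    /* src exactly n characters long: still no terminator */
    assert(!copy5_has_nul(field, "HELLO"));
    assert(memcmp(field, "HELLO", 5) == 0);

    /* src shorter than n: the rest of the field is NUL-padded */
    assert(copy5_has_nul(field, "HI"));
    assert(memcmp(field, "HI\0\0\0", 5) == 0);
}
```

In the first two cases, passing field to strlen or printf("%s") would read past the end of the array, which is why the result of strncpy cannot be treated as a C string without an extra check.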

At first glance, the behavior of strncpy seems rather arbitrary, especially the third case: why pad with multiple NULs?

The well-known blogger oldnewthing explains the reasoning in this article, roughly as follows:

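In early UNIX, file names were at most 14 characters long, which you can think of as char name[14]. strncpy was originally meant for storing file names in file system directory entries. Because the field has a fixed length, it does not need a terminating NUL. As for the NUL padding: when a name is shorter than 14 characters, the padding makes it possible to compare file names directly with memcmp.
To stay compatible with this behavior, the C standards committee carried it over into strncpy.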

As you can imagine, with this function a single line is enough to fill in the file name in a dirent:

strncpy(dirent->d_name, filename, 14);
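
To make the role of the padding concrete, here is a minimal sketch of a fixed-width name field. The struct old_dirent and the helpers set_name and same_name are hypothetical stand-ins invented for this example, not the historical UNIX definitions.

```c
#include <assert.h>
#include <string.h>

#define NAME_LEN 14   /* fixed field width mentioned in the text */

/* Hypothetical stand-in for an old-style directory entry. */
struct old_dirent {
    unsigned short ino;
    char name[NAME_LEN];   /* NUL-padded, not necessarily NUL-terminated */
};

static void set_name(struct old_dirent *d, const char *filename)
{
    /* Writes exactly NAME_LEN bytes: the name, then NUL padding. */
    strncpy(d->name, filename, NAME_LEN);
}

static int same_name(const struct old_dirent *a, const struct old_dirent *b)
{
    /* Whole-field comparison; only valid because of the padding. */
    return memcmp(a->name, b->name, NAME_LEN) == 0;
}

static void dirent_demo(void)
{
    struct old_dirent a, b;

    memset(&a, 0xAA, sizeof a);   /* different stale garbage in each entry */
    memset(&b, 0x55, sizeof b);
    set_name(&a, "a.out");
    set_name(&b, "a.out");
    assert(same_name(&a, &b));    /* padding overwrote the garbage */

    set_name(&b, "fourteen_chars");   /* exactly 14 bytes: no NUL stored */
    assert(!same_name(&a, &b));
    assert(memchr(b.name, '\0', NAME_LEN) == NULL);
}
```

Without the padding, the bytes after a short name would keep whatever was in the field before, and the memcmp over the whole field could report a mismatch for identical names.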

Since strncpy was designed specifically for that case, what other options do we have? The answer is strlcpy, described next.

strlcpy

strlcpy was first introduced on BSD systems, precisely to address the problems with strncpy.

size_t strlcpy(char *dst, const char* src, size_t dstsize);

Its man page describes it as follows:

strlcpy take the full size of the destination buffer and guarantee NUL-termination if there is room. Note that room for the NUL should be included in dstsize.
strlcpy copies up to dstsize-1 characters from the string src to dst, NUL-terminating the result if dstsize is not 0.
It returns the length of src. If the return value is >= dstsize, the output string has been truncated. It is the caller's responsibility to handle this.

An example of checking whether the result was truncated:

char *dir, *file, pname[MAXPATHLEN];
if (strlcpy(pname, dir, sizeof(pname)) >= sizeof(pname))
  goto toolong;
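
Because strlcpy is not available in every C library, the following sketch uses a local stand-in to try out the truncation check. Both my_strlcpy and set_path are hypothetical names introduced here; my_strlcpy only follows the semantics quoted from the man page and is not the BSD source.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in with the semantics quoted above: copy at most
 * dstsize-1 bytes, always NUL-terminate if dstsize != 0, and return
 * strlen(src). */
static size_t my_strlcpy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen = strlen(src);

    if (dstsize != 0) {
        size_t n = srclen < dstsize ? srclen : dstsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}

/* Returns 0 on success, -1 if dir did not fit and was truncated. */
static int set_path(char *pname, size_t pnamesize, const char *dir)
{
    if (my_strlcpy(pname, dir, pnamesize) >= pnamesize)
        return -1;
    return 0;
}

static void strlcpy_demo(void)
{
    char pname[8];

    assert(set_path(pname, sizeof pname, "/tmp") == 0);
    assert(strcmp(pname, "/tmp") == 0);

    /* Too long: reported as an error, but pname is still a valid string. */
    assert(set_path(pname, sizeof pname, "/very/long/path") == -1);
    assert(strcmp(pname, "/very/l") == 0);
}
```

The return value is the length the copy would have needed, so a single comparison against the buffer size is enough to detect truncation.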

As you can see, strlcpy makes it simple to check whether the result was truncated. Its BSD authors even wrote a paper to introduce it:

strlcpy and strlcat - consistent, safe, string copy and concatenation.

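For a long time, however, strlcpy did not make it into glibc. Those familiar with glibc's development history may know that its maintainer, Ulrich Drepper, strongly opposed it: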

This is horribly inefficient BSD crap. Using these function only leads to other errors. Correct string handling means that you always know how long your strings are and therefore you can you memcpy (instead of strcpy). Beside, those who are using strcat or variants deserved to be punished.

In short, he considers these functions inefficient and error-prone, and argues that correct string handling always tracks string lengths, which makes memcpy the right tool.

Why would he say that? Let's look at the implementation in the Linux kernel:

size_t strlcpy(char *dest, const char *src, size_t size)
{
  size_t ret = strlen(src);

  if (size) {
    size_t len = (ret >= size) ? size - 1 : ret;
    memcpy(dest, src, len);
    dest[len] = '\0';
  }
  return ret;
}

As you can see, this function is also wrong if src is not NUL-terminated. Moreover, because it calls strlen, it has to read all of src, even when size is very small.

For this reason, the Linux kernel grew yet another variant, strscpy, which avoids the problems above. Naturally, this also makes the implementation more complex.

ssize_t strscpy(char *dest, const char *src, size_t count)
{
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
	size_t max = count;
	long res = 0;

	if (count == 0 || WARN_ON_ONCE(count > INT_MAX))
		return -E2BIG;

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	/*
	 * If src is unaligned, don't cross a page boundary,
	 * since we don't know if the next page is mapped.
	 */
	if ((long)src & (sizeof(long) - 1)) {
		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
		if (limit < max)
			max = limit;
	}
#else
	/* If src or dest is unaligned, don't do word-at-a-time. */
	if (((long) dest | (long) src) & (sizeof(long) - 1))
		max = 0;
#endif

	while (max >= sizeof(unsigned long)) {
		unsigned long c, data;

		c = read_word_at_a_time(src+res);
		if (has_zero(c, &data, &constants)) {
			data = prep_zero_mask(c, data, &constants);
			data = create_zero_mask(data);
			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
			return res + find_zero(data);
		}
		*(unsigned long *)(dest+res) = c;
		res += sizeof(unsigned long);
		count -= sizeof(unsigned long);
		max -= sizeof(unsigned long);
	}

	while (count) {
		char c;

		c = src[res];
		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}

	/* Hit buffer length without finding a NUL; force NUL-termination. */
	if (res)
		dest[res-1] = '\0';

	return -E2BIG;
}
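
Most of the complexity above comes from the word-at-a-time optimization. Stripped down to the byte loop, the contract can be sketched in user space as follows. The name my_strscpy is hypothetical, and this is a simplified illustration under that assumption, not the kernel's code.

```c
#include <assert.h>
#include <errno.h>      /* E2BIG */
#include <limits.h>     /* INT_MAX */
#include <string.h>
#include <sys/types.h>  /* ssize_t (POSIX) */

/* Byte-at-a-time sketch of the strscpy contract: read at most count
 * bytes of src, always NUL-terminate dest when count > 0, and return
 * the number of characters copied or -E2BIG on truncation. */
static ssize_t my_strscpy(char *dest, const char *src, size_t count)
{
    size_t i;

    if (count == 0 || count > INT_MAX)
        return -E2BIG;

    for (i = 0; i < count; i++) {
        dest[i] = src[i];
        if (src[i] == '\0')
            return (ssize_t)i;
    }

    /* Buffer full without seeing a NUL: force termination. */
    dest[count - 1] = '\0';
    return -E2BIG;
}

static void strscpy_demo(void)
{
    char buf[4];
    const char raw[4] = { 'w', 'x', 'y', 'z' };   /* not NUL-terminated */

    assert(my_strscpy(buf, "abc", sizeof buf) == 3);
    assert(my_strscpy(buf, "abcdef", sizeof buf) == -E2BIG);
    assert(strcmp(buf, "abc") == 0);

    /* Never reads past count bytes, so an unterminated src is safe here. */
    assert(my_strscpy(buf, raw, sizeof buf) == -E2BIG);
    assert(strcmp(buf, "wxy") == 0);
}
```

Unlike the strlcpy implementation shown earlier, this never calls strlen on src, so the amount of work is bounded by count rather than by the length of the source.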

As an aside: on 2023-06-14, strlcpy was added to glibc 2.38, ending a controversy that had lasted 25 years. Source: https://news.ycombinator.com/item?id=36765747

Best practices

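The well-known blogger Chris Wellons sums things up in the article strcpy: a niche function you don't need. His view is somewhat similar to Ulrich Drepper's: since you have to know the length of src anyway, the most direct replacement is memcpy: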

char *my_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        strcpy(c, s);  // BAD
    }
    return c;
}

char *my_strdup_v2(const char *s)
{
    size_t len = strlen(s) + 1;
    char *c = malloc(len);
    if (c) {
        memcpy(c, s, len);  // GOOD
    }
    return c;
}

Another example scenario is a fixed-size string:

struct err {
    char message[16];
};

void set_oom(struct err *err)
{
    strcpy(err->message, "out of memory");  // BAD
}

void set_oom_v2(struct err *err)
{
    static const char oom[] = "out of memory";
    static_assert(sizeof(err->message) >= sizeof(oom));
    memcpy(err->message, oom, sizeof(oom)); // GOOD
}

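From the analysis above, readers can also see that the root of the problem is that C has no dedicated string type and instead relies on a convention: a NUL-terminated character array is a string. Problems caused by this show up online from time to time, for example: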

How I cut GTA Online loading times by 70%

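The author found that the biggest time sink was, surprisingly, JSON parsing, which kept calling strlen on an input of about 10 MB! A more thorough solution, therefore, is to define your own string type: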

#define s8(s) (s8){(u8 *)s, lengthof(s)}
typedef struct {
    u8  *data;
    size len;
} s8;

The s8 macro turns a string literal directly into an s8 struct. After that, you only need to define a set of matching helper functions:

static s8   s8span(u8 *, u8 *);
static b32  s8equals(s8, s8);
static size s8compare(s8, s8);
static u64  s8hash(s8);
static s8   s8trim(s8);
static s8   s8clone(s8, arena *);

Usage looks like this:

if (s8equals(tagname, s8("body"))) {
    // ...
}
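
The snippets above leave out the supporting typedefs and macros. The following is a minimal compilable sketch; the definitions of u8, b32, size, countof and lengthof, and the bodies of s8equals and s8span, are assumptions filled in for illustration rather than quotes from the original article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed supporting definitions (not shown in the text above). */
typedef uint8_t   u8;
typedef int32_t   b32;
typedef ptrdiff_t size;

#define countof(a)  (size)(sizeof(a) / sizeof(*(a)))
#define lengthof(s) (countof(s) - 1)   /* literal length without its NUL */
#define s8(s)       (s8){(u8 *)s, lengthof(s)}

typedef struct {
    u8  *data;
    size len;
} s8;

/* Build a string from a [beg, end) pointer pair; no NUL is required. */
static s8 s8span(u8 *beg, u8 *end)
{
    s8 s = {beg, end - beg};
    return s;
}

/* Lengths are compared first, so no scan for a terminator is needed. */
static b32 s8equals(s8 a, s8 b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, (size_t)a.len) == 0);
}

static void s8_demo(void)
{
    u8 html[] = "<body>";
    s8 tagname = s8span(html + 1, html + 5);   /* view of "body" */

    assert(tagname.len == 4);
    assert(s8equals(tagname, s8("body")));
    assert(!s8equals(tagname, s8("bod")));
}
```

Because every s8 carries its length, taking a substring is just pointer arithmetic, and neither comparison nor copying ever needs strlen.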
