# String

UCX strings store character arrays together with a length and come in two variants: immutable (`cxstring`) and mutable (`cxmutstr`).

In general, UCX strings are *not* necessarily zero-terminated.
If a function guarantees to return a zero-terminated string, it is explicitly mentioned in the documentation.
As a rule of thumb, you _should not_ pass a character array of a UCX string structure to another API without explicitly
ensuring that the string is zero-terminated.

## Basics

The following listing shows basic string functions.

> To simplify documentation, we introduce the pseudo-type `AnyStr` with the meaning that
> any UCX string and any C string are supported.
> The implementation is actually hidden behind a macro which uses `cx_strcast()` to guarantee compatibility.
{style="note"}

```C
#include <cx/string.h>

struct cx_string_s {const char *ptr; size_t length;};

struct cx_mutstr_s {char *ptr; size_t length;};

typedef struct cx_string_s cxstring;

typedef struct cx_mutstr_s cxmutstr;

cxstring cx_str(const char *cstring);

cxstring cx_strn(const char *cstring, size_t length);

cxmutstr cx_mutstr(char *cstring);

cxmutstr cx_mutstrn(char *cstring, size_t length);

cxmutstr cx_strdup(AnyStr string);

cxmutstr cx_strdup_a(const CxAllocator *allocator, AnyStr string);

int cx_strcpy(cxmutstr *dest, cxstring source);

int cx_strcpy_a(const CxAllocator *allocator,
        cxmutstr *dest, cxstring source);

void cx_strfree(cxmutstr *str);

void cx_strfree_a(const CxAllocator *alloc, cxmutstr *str);


#define CX_SFMT(s) (int) (s).length, (s).ptr
#define CX_PRIstr ".*s"
#define cx_strcast(s)  // converts any string to cxstring
```

The functions `cx_str()` and `cx_mutstr()` create a UCX string from a `const char*` or a `char*`
and compute the length with a call to stdlib `strlen()` (except for `NULL` in which case the length
is set to zero).
In case you already know the length, or the string is not zero-terminated, you can use `cx_strn()` or `cx_mutstrn()`.

The function `cx_strdup_a()` allocates new memory with the given `allocator` and copies the given `string`
and guarantees that the result string is zero-terminated.
The function `cx_strdup()` is equivalent to `cx_strdup_a()`, except that it uses the [default allocator](allocator.h.md#default-allocator).

The functions `cx_strcpy_a()` and `cx_strcpy()` copy the contents of the `source` string to the `dest` string,
and also guarantee zero-termination of the resulting string.
The memory in `dest` is either freshly allocated or re-allocated to fit the size of the string plus the terminator.

Allocated strings are always of type `cxmutstr` and can be deallocated by a call to `cx_strfree()` or `cx_strfree_a()`.
The caller must make sure to use the correct allocator for deallocating a string.
It is safe to call these functions multiple times on a given string, as the pointer will be nulled and the length set to zero.
It is also safe to call the functions with a `NULL` pointer, just like any other `free()`-like function.

When you want to use a UCX string in a `printf`-like function, you can use the macro `CX_PRIstr` for the format specifier,
and the `CX_SFMT(s)` macro to expand the arguments.

> When you want to convert a string _literal_ into a UCX string, you can also use the `CX_STR(lit)` macro.
> This macro uses the fact that `sizeof(lit)` for a string literal `lit` is always the string length plus one,
> effectively saving an invocation of `strlen()`.
> However, this only works for literals - in all other cases you must use `cx_str()` or `cx_strn()`.
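
The formatting macros and the literal trick described above can be tried out without UCX. The following is only a sketch: `view_t`, `VIEW_LIT`, `VIEW_SFMT`, `VIEW_PRIstr`, and `view_format()` are hypothetical stand-ins that mirror the definitions in the listing, not part of the UCX API.

```C
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for cxstring, so the sketch compiles without UCX. */
typedef struct { const char *ptr; size_t length; } view_t;

/* Same idea as CX_STR(lit): sizeof of a literal includes the terminator. */
#define VIEW_LIT(lit) ((view_t){ (lit), sizeof(lit) - 1 })

/* Mirror the CX_SFMT and CX_PRIstr definitions from the listing. */
#define VIEW_SFMT(s) (int) (s).length, (s).ptr
#define VIEW_PRIstr ".*s"

/* Writes "[<string>]" into buf; returns what snprintf() returns. */
static int view_format(char *buf, size_t bufsize, view_t v) {
    /* "%.*s" prints at most `length` bytes, so no terminator is needed */
    return snprintf(buf, bufsize, "[%" VIEW_PRIstr "]", VIEW_SFMT(v));
}
```

Because the precision limits how many bytes `printf` reads, a view such as `(view_t){ text, 5 }` into a longer buffer is formatted correctly even though no zero byte follows its fifth character.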

## Comparison

```C
#include <cx/string.h>

int cx_strcmp(AnyStr s1, AnyStr s2);

int cx_strcmp_p(const void *s1, const void *s2);

int cx_strcasecmp_p(const void *s1, const void *s2);

bool cx_strprefix(AnyStr string, AnyStr prefix);

bool cx_strsuffix(AnyStr string, AnyStr suffix);

int cx_strcasecmp(AnyStr s1, AnyStr s2);

bool cx_strcaseprefix(AnyStr string, AnyStr prefix);

bool cx_strcasesuffix(AnyStr string, AnyStr suffix);
```

The `cx_strcmp()` function compares two strings lexicographically
and returns an integer greater than, equal to, or less than 0, if `s1` is greater than, equal to, or less than `s2`, respectively.

The `cx_strcmp_p()` function takes pointers to UCX strings (i.e., only to `cxstring` and `cxmutstr`) and the signature is compatible with `cx_compare_func`.
Use this as a compare function for lists or other data structures.

The functions `cx_strprefix()` and `cx_strsuffix()` check if `string` starts with `prefix` or ends with `suffix`, respectively.

The functions `cx_strcasecmp()`, `cx_strcasecmp_p()`, `cx_strcaseprefix()`, and `cx_strcasesuffix()` are equivalent,
except that they compare the strings case-insensitively.

> In the current version of UCX, case-insensitive comparisons are only guaranteed to work with ASCII characters.
{style="note"}

## Concatenation

```C
#include <cx/string.h>

cxmutstr cx_strcat(size_t count, ... );

cxmutstr cx_strcat_a(const CxAllocator *alloc, size_t count, ... );

cxmutstr cx_strcat_m(cxmutstr str, size_t count, ... );

cxmutstr cx_strcat_ma(const CxAllocator *alloc,
        cxmutstr str, size_t count, ...
);

size_t cx_strlen(size_t count, ...);
```

The `cx_strcat_a()` function takes `count` UCX strings,
allocates memory for a concatenation of those strings _with a single allocation_,
and copies the contents of the strings to the new memory.
`cx_strcat()` is equivalent, except that it uses the [default allocator](allocator.h.md#default-allocator).

The functions `cx_strcat_ma()` and `cx_strcat_m()` append the `count` strings to the specified string `str` and,
instead of allocating new memory, reallocate the existing memory in `str`.
If the pointer in `str` is `NULL`, there is no difference to `cx_strcat_a()`.
Note that `count` always denotes the number of variadic arguments in _both_ variants.

The function `cx_strlen()` sums the lengths of the specified strings.

> There is no reason to use `cx_strlen()` for a single UCX string.
> You can access the `length` field of the structure directly.

> You can mix `cxstring` and `cxmutstr` in the variadic arguments without the need of `cx_strcast()`.

## Find Characters and Substrings

```C
#include <cx/string.h>

cxstring cx_strchr(cxstring string, int chr);

cxstring cx_strrchr(cxstring string, int chr);

cxstring cx_strstr(cxstring string, cxstring search);

cxstring cx_strsubs(cxstring string, size_t start);

cxstring cx_strsubsl(cxstring string, size_t start, size_t length);

cxstring cx_strtrim(cxstring string);

cxmutstr cx_strchr_m(cxmutstr string, int chr);

cxmutstr cx_strrchr_m(cxmutstr string, int chr);

cxmutstr cx_strstr_m(cxmutstr string, cxstring search);

cxmutstr cx_strsubs_m(cxmutstr string, size_t start);

cxmutstr cx_strsubsl_m(cxmutstr string, size_t start, size_t length);

cxmutstr cx_strtrim_m(cxmutstr string);
```

The functions `cx_strchr()`, `cx_strrchr()`, and `cx_strstr()` behave like their stdlib counterparts.
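
To illustrate what a length-aware counterpart of `strchr()` looks like, here is a minimal sketch in plain C. The names `view_t` and `view_chr()` are hypothetical and this is not the UCX implementation. In particular, returning a `NULL` pointer with length zero when nothing is found is an assumption made for this sketch only.

```C
#include <string.h>

/* Hypothetical stand-in for cxstring, so the sketch compiles without UCX. */
typedef struct { const char *ptr; size_t length; } view_t;

/* Searches only the first s.length bytes; never relies on a terminator. */
static view_t view_chr(view_t s, int chr) {
    const char *hit = s.length > 0 ? memchr(s.ptr, chr, s.length) : NULL;
    if (hit == NULL) {
        /* assumed "not found" value for this sketch */
        return (view_t){ NULL, 0 };
    }
    /* the result is a view into s: same memory, shorter length */
    return (view_t){ hit, s.length - (size_t)(hit - s.ptr) };
}
```

The result shares its memory with the input, so nothing is allocated and nothing needs to be freed.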

The function `cx_strsubs()` returns the substring starting at the specified `start` index,
and `cx_strsubsl()` returns a substring with at most `length` bytes.

The function `cx_strtrim()` returns the substring that results when removing all leading and trailing
whitespace characters.

All functions with the `_m` suffix behave exactly the same as their counterparts without `_m` suffix,
except that they operate on a `cxmutstr`.
In _both_ variants the functions return a view into the given `string`
and thus the returned strings must never be passed to `cx_strfree()`.

## Replace Substrings

```C
#include <cx/string.h>

cxmutstr cx_strreplace(cxstring str,
        cxstring search, cxstring replacement);

cxmutstr cx_strreplace_a(const CxAllocator *allocator, cxstring str,
        cxstring search, cxstring replacement);

cxmutstr cx_strreplacen(cxstring str,
        cxstring search, cxstring replacement, size_t replmax);

cxmutstr cx_strreplacen_a(const CxAllocator *allocator, cxstring str,
        cxstring search, cxstring replacement, size_t replmax);
```

The function `cx_strreplace()` allocates a new string which will contain a copy of `str`
where every occurrence of `search` is replaced with `replacement`.
The new string is guaranteed to be zero-terminated even if `str` is not.

The function `cx_strreplace_a()` uses the specified `allocator` to allocate the new string.

The functions `cx_strreplacen()` and `cx_strreplacen_a()` are equivalent, except that they stop
after `replmax` number of replacements.
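
The semantics described above (a bounded number of replacements, a freshly allocated result, and guaranteed zero-termination) can be sketched in plain C. The helper `replace_n()` is hypothetical and uses `malloc()` directly. It is not the UCX implementation, which may use a different algorithm.

```C
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, NOT the UCX implementation: replaces at most
 * `replmax` occurrences of `search` within the first `len` bytes of `str`.
 * Returns a malloc'ed, zero-terminated buffer, or NULL on failure.
 */
static char *replace_n(const char *str, size_t len,
                       const char *search, size_t slen,
                       const char *repl, size_t rlen,
                       size_t replmax) {
    /* first pass: count the matches that will be replaced */
    size_t count = 0;
    if (slen > 0) {
        for (size_t i = 0; i + slen <= len && count < replmax; ) {
            if (memcmp(str + i, search, slen) == 0) {
                count++;
                i += slen;
            } else {
                i++;
            }
        }
    }

    /* one allocation for the result, plus one byte for the terminator */
    size_t newlen = len - count * slen + count * rlen;
    char *out = malloc(newlen + 1);
    if (out == NULL) return NULL;

    /* second pass: copy bytes, substituting the counted matches */
    size_t o = 0, done = 0;
    for (size_t i = 0; i < len; ) {
        if (done < count && i + slen <= len
                && memcmp(str + i, search, slen) == 0) {
            memcpy(out + o, repl, rlen);
            o += rlen;
            i += slen;
            done++;
        } else {
            out[o++] = str[i++];
        }
    }
    out[o] = '\0';
    return out;
}
```

Passing a `len` shorter than the underlying buffer mimics a UCX string that is not zero-terminated: only the first `len` bytes are read, yet the result always carries a terminator.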

## Basic Splitting

```C
#include <cx/string.h>

size_t cx_strsplit(cxstring string, cxstring delim,
        size_t limit, cxstring *output);

size_t cx_strsplit_a(const CxAllocator *allocator,
        cxstring string, cxstring delim,
        size_t limit, cxstring **output);

size_t cx_strsplit_m(cxmutstr string, cxstring delim,
        size_t limit, cxmutstr *output);

size_t cx_strsplit_ma(const CxAllocator *allocator,
        cxmutstr string, cxstring delim,
        size_t limit, cxmutstr **output);
```

The `cx_strsplit()` function splits the input `string` using the specified delimiter `delim`
and writes the substrings into the pre-allocated `output` array.
The maximum number of resulting strings can be specified with `limit`.
That means at most `limit-1` splits are performed.
The function returns the actual number of items written to `output`.

On the other hand, `cx_strsplit_a()` uses the specified `allocator` to allocate the output array,
and writes the pointer to the allocated memory to `output`.

The functions `cx_strsplit_m()` and `cx_strsplit_ma()` are equivalent to `cx_strsplit()` and `cx_strsplit_a()`,
except that they work on `cxmutstr` instead of `cxstring`.

> The `allocator` in `cx_strsplit_a()` and `cx_strsplit_ma()` is _only_ used to allocate the output array.
> The strings will always point into the original `string`
> and you need to use `cx_strdup()` or `cx_strdup_a()` if you want copies or zero-terminated strings after performing the split.
{style="note"}

## Complex Tokenization

```C
#include <cx/string.h>

CxStrtokCtx cx_strtok(AnyStr str, AnyStr delim, size_t limit);

void cx_strtok_delim(CxStrtokCtx *ctx,
        const cxstring *delim, size_t count);

bool cx_strtok_next(CxStrtokCtx *ctx, cxstring *token);

bool cx_strtok_next_m(CxStrtokCtx *ctx, cxmutstr *token);
```

You can tokenize a string by creating a _tokenization_ context with `cx_strtok()`,
and calling `cx_strtok_next()` or `cx_strtok_next_m()` as long as they return `true`.

The tokenization context is initialized with the string `str` to tokenize,
one delimiter `delim`, and a `limit` for the maximum number of tokens.
When `limit` is reached, the remaining part of `str` is returned as one single token.

You can add additional delimiters to the context by calling `cx_strtok_delim()`, and
specifying an array of delimiters to use.

> Regardless of how the context was initialized, you can use either `cx_strtok_next()`
> or `cx_strtok_next_m()` to retrieve the tokens. However, keep in mind that modifying
> characters in a token returned by `cx_strtok_next_m()` has defined behavior only when the
> underlying `str` is a `cxmutstr`.

### Example

```C
#include <cx/string.h>

cxstring str = cx_str("an,arbitrarily;||separated;string");

// create the context
CxStrtokCtx ctx = cx_strtok(str, CX_STR(","), 10);

// add two more delimiters
cxstring delim_more[2] = {CX_STR("||"), CX_STR(";")};
cx_strtok_delim(&ctx, delim_more, 2);

// iterate over the tokens
cxstring tok;
while(cx_strtok_next(&ctx, &tok)) {
    // do something with the tokens
    // be aware that tok is NOT zero-terminated!
}
```

## Conversion to Numbers

For each integer type, as well as `float` and `double`, there are functions to convert a UCX string to a value of those types.

Integer conversion comes in two flavors:
```C
int cx_strtoi(AnyStr str, int *output, int base);

int cx_strtoi_lc(AnyStr str, int *output, int base,
        const char *groupsep);
```

The basic variant takes a string of any UCX string type, a pointer to the `output` integer, and the `base` (one of 2, 8, 10, or 16).
Conversion is attempted with respect to the specified `base` and respects possible special notations for that base.
Hexadecimal numbers may be prefixed with `0x`, `x`, or `#`, and binary numbers may be prefixed with `0b` or `b`.

The `_lc` versions of the integer conversion functions are equivalent, except that they allow the specification of an
array of group separator chars, each of which is simply ignored during conversion.
The default group separator for the basic version is a comma `,`.

The signature for the floating-point conversions is quite similar:
```C
int cx_strtof(AnyStr str, float *output);

int cx_strtof_lc(AnyStr str, float *output,
        char decsep, const char *groupsep);
```

The two differences are that the floating-point versions do not support different bases,
and the `_lc` variant allows specifying not only an array of group separators,
but also the character used for the decimal separator.

In the basic variant, the group separator is again a comma `,`, and the decimal separator is a dot `.`.

> The floating-point conversions of UCX 3.1 do not achieve the same precision as standard library implementations
> which usually use more sophisticated algorithms.
> The precision might increase in future UCX releases,
> but until then be aware of slight inaccuracies, in particular when working with `double`.
{style="warning"}

> The UCX string to number conversions are intentionally not considering any locale settings
> and are therefore independent of any global state.
{style="note"}

<seealso>
<category ref="apidoc">
<a href="https://ucx.sourceforge.io/api/string_8h.html">string.h</a>
</category>
</seealso>