# Getting work done {#work}
This chapter covers common analytical tasks (without any discussion of their statistical justification).
Unlike earlier chapters, this one leans more towards saying what the right function is called rather than giving examples. R has weird function names but good per-function examples. Remember that typing `example(function_name)` will run them in the console.
## Sets {#sets}
### Unary operators
To treat a vector or data.table as a set (in the limited sense of dropping duplicates), use `unique(x)`. I guess the function is called "unique" since in the resulting vector (or table), every element (or row) is unique.
For the data.table version, there is a `by=` option, so we can drop duplicates in terms of some of the table's columns. By default, it will keep the first row within the `by=` group, but `fromLast = TRUE` can switch it to keep the last row instead.
A complementary function `duplicated` also exists, but is rarely needed.
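As a small illustration of these options, here is a sketch on a made-up table (the table `dupDT` and its columns are assumptions for this example only):

```{r unique-by-sketch}
library(data.table)
dupDT = data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# keep the first row within each g group
first_rows = unique(dupDT, by = "g")
# keep the last row within each g group
last_rows = unique(dupDT, by = "g", fromLast = TRUE)
# flag rows whose g value has already appeared
dup_flag = duplicated(dupDT, by = "g")

first_rows
last_rows
dup_flag
```

Here `duplicated` returns one true/false value per row, which is why it can be used for filtering.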
### Testing membership
Besides comparisons, `x == y`, we can test membership like `x %in% y`:
```{r in}
c(1, 3, NA) %in% c(1, 2)
"A" %in% c("A", "A", "B")
```
We get one true/false value for each element on the left-hand side. Duplicate values on the right-hand side are ignored. That is, the right-hand side is treated like a *set*.
Unlike `==`, here we always get a true/false result, even when missing values are present.
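To see the difference in how missing values are handled, here is a minimal comparison (the vector `x` is made up for the example):

```{r in-vs-eq-sketch}
x = c(1, NA)

eq_res = x == 1            # elementwise comparison; NA propagates
in_res = x %in% 1          # membership test; never returns NA
na_in  = NA %in% c(1, NA)  # NA counts as a member if it appears on the right

eq_res  # TRUE NA
in_res  # TRUE FALSE
na_in   # TRUE
```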
### Other binary operators
As documented at `?sets`, for two vectors `x` and `y` we can construct various statements:
- $x\cup y$ as `union(x,y)`,
- $x\cap y$ as `intersect(x,y)`,
- $x\ \backslash\ y$ as `setdiff(x,y)` and
- $x = y$ as `setequal(x,y)`.
As mentioned in \@ref(classes), R does not actually support an unordered set data type. However, these functions treat `x` and `y` as sets in the sense that they ignore duplicated values in them. So `union(c(1,1,2), c(3,4))` is `c(1,2,3,4)`, with the repeating 1 ignored because it has no meaning in this context.
Corresponding functions exist for data.tables, prefixed like `f*`: `funion`, `fintersect`, `fsetdiff` and `fsetequal`. Two tables can be combined in this way only if their columns line up (in terms of class, etc.).
Looking at the source code for these functions, we'll see a lot of them use the `match` function.
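As a rough sketch of how `match` can underlie these operations (this mirrors the idea, and is not meant as a copy of the base R source):

```{r match-sketch}
x = c(3, 1, 1, 2)
y = c(2, 4)

# position in y of each element of x (NA where there is no match)
pos = match(x, y)

# membership, which is how ?match describes %in%
is_in = match(x, y, nomatch = 0L) > 0L

# elements of x not found in y, with duplicates dropped
only_x = unique(x[match(x, y, nomatch = 0L) == 0L])

pos      # NA NA NA 1
is_in    # FALSE FALSE FALSE TRUE
only_x   # 3 1
```

The last two results agree with `x %in% y` and `setdiff(x, y)`, respectively.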
### Operators on lists of sets
R only includes binary versions of these operators. However, the `Reduce` function can help for finite unions or intersections. It collapses a list of arguments using a binary operator successively, so the union of four sets
$$
\bigcup_{i=1}^4 x_i = ((x_1\cup x_2) \cup x_3) \cup x_4
$$
is written as
```{r setreduce}
xList = list(1:2, 2:3, 11L, 4:6)
Reduce(union, xList)
```
Another alternative for the union of a set of vectors is `unique(unlist(xList))`. The `unlist` function just combines the four vectors into a single vector. `Reduce(intersect, xList)` makes sense for taking the intersection of a bunch of sets.
For a list of data.tables, the same can be done (with `funion` and `fintersect`). However, again, the alternative for finite unions `unique(rbindlist(dList))` may be better in terms of performance in some cases.
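A minimal sketch of both approaches, assuming a small made-up list `dList` of one-column tables:

```{r dt-setreduce-sketch}
library(data.table)
dList = list(data.table(a = 1:3), data.table(a = 2:4), data.table(a = 3:5))

u1 = Reduce(funion, dList)        # union, built up pairwise
u2 = unique(rbindlist(dList))     # union, stacking once and then dropping duplicates
i1 = Reduce(fintersect, dList)    # rows common to all tables

fsetequal(u1, u2)  # TRUE
i1                 # a single row, a = 3
```

The `rbindlist` version stacks the tables once, rather than building an intermediate table at every step, which is one reason it may be faster.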
### Exercises
1. Define a "subset or equal" function for two vectors, `subsetequal(x,y)`, which returns `TRUE` iff `x` is a (nonstrict) subset of `y`, $x \subseteq y$. How would the function change for testing a strict subset, $x \subset y$?
2. Define a "symmetric difference" function for two vectors, `symdiff(x,y)`, which returns unique items appearing in `x` or in `y` but not in both.
3. With `DT = data.table(g = c(1,1,2,2,2,3), v = 1:6)`, we can select the first row of each `g` group with `unique(DT, by="g")` or `DT[!duplicated(DT, by="g")]` or `DT[, .SD[1L], by=g]` or `DT[.(unique(g)), on=.(g), mult="first"]` or `DT[DT[, first(.I), by=g]$V1]`. Find several ways to select the *last* row in each `g` group. (Hint: one uses `order(-(1:.N))`.)
## Combinations {#combos}
```{r cj, eval=FALSE}
CJ(x, y, z, unique = TRUE)
```
returns a table with all combinations:
$$
\{(x_i,y_j,z_k): x_i \in x, y_j \in y, z_k \in z\}
$$
The `unique = TRUE` clause means we are treating the vectors like sets (\@ref(sets)), ignoring their duplicated values.
So to take all pairs of values for a vector...
```{r CJ, echo=-c(3,7)}
x = LETTERS[1:4]
xcomb = CJ(x,x)
xcomb
# limit to distinct pairs
xpairs <- xcomb[ V1 < V2 ]
xpairs
# create a single categorical var
xpairs[, pair_code := sprintf("%s_%s", V1, V2)][]
```
`CJ` is essentially a Cartesian product, or a "Cross Join."
In base, `combn(x, n, f)` finds all `x`'s subvectors of length `n` and optionally applies a function to them. And `outer(x, y, f)` applies a function to all pairs of elements of `x` and `y`, returning a matrix.
All of these options can get out of hand, in terms of memory consumption, but that's normal for combinatorial operations.
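For a quick look at the two base functions, here is a small sketch (the inputs are arbitrary):

```{r combn-outer-sketch}
x = LETTERS[1:4]

# all length-2 subvectors, one per column
pairs_mat = combn(x, 2)
# the same, applying a function to each subvector
pair_codes = combn(x, 2, paste, collapse = "_")
# a function applied to all pairs of elements; multiplication is the default
times_table = outer(1:3, 1:3)

dim(pairs_mat)   # 2 6
pair_codes       # "A_B" "A_C" "A_D" "B_C" "B_D" "C_D"
times_table
```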
### Exercises
1. Define a function, `triples`, which takes a vector `x` and returns all length-3 subsets, one on each row. For `x = 1:5`, there are ten `= choose(5,3)` such subsets, so it should look something like...

        #     V1 V2 V3
        #  1:  1  2  3
        #  2:  1  2  4
        #  3:  1  2  5
        #  4:  1  3  4
        #  5:  1  3  5
        #  6:  1  4  5
        #  7:  2  3  4
        #  8:  2  3  5
        #  9:  2  4  5
        # 10:  3  4  5

## Randomization
To make a draw reproducible, use `set.seed` before it:
```{r set-seed}
set.seed(1)
rnorm(10) # 10 random draws from the standard normal distribution
```
### Distribution draws
To draw `n` random numbers from the uniform distribution on the unit interval, use `runif(n)`. All random-draw functions follow this naming convention (`rnorm`, `rpois`, `rbinom`); for a full list of the built-in ones, type `?distributions`. Many more are offered in packages that can be found by searching online.
To get a quick look at the density being drawn from, use `curve`:
```{r dist-glance}
# glance at chi-squared with 10 degrees of freedom
df = 10
curve(dchisq(x, df), xlim = qchisq(c(.01, .99), df))
```
`d*` is the density and `q*` is the quantile function. We're using the latter to find good x-axis bounds for the graph. The remaining function is `p*`, for the cumulative distribution function.
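The relationship between these prefixes can be checked numerically. A small sketch using the standard normal:

```{r dpq-sketch}
p = 0.975
q = qnorm(p)   # quantile: the point with probability p to its left
pnorm(q)       # the distribution function undoes the quantile function, returning p
dnorm(0)       # density at zero, equal to 1/sqrt(2*pi)

# the distribution function is the integral of the density
area = integrate(dnorm, -Inf, q)$value
area           # roughly 0.975
```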
### Urn draws
To take `n` samples from `x` with replacement, where `p[i]` is the probability of drawing `x[i]`:
```{r sample, eval=FALSE}
sample(x, prob = p, size = n, replace = TRUE)
```
The help page `?sample` covers many other use cases. There is one to be very wary of, however:
```{r sample-bad}
set.seed(1)
x = c(6.17, 5.16, 4.15, 3.14)
sample(x, 2, replace = TRUE)
y = c(6.17)
sample(y, size = 2, replace = TRUE)
```
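Where did the whole numbers in the second result come from? Those aren't elements of `y`! When its first argument is a single number `x >= 1`, `sample` draws from `1:x` rather than from the one-element vector.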
```{block2 sampling, type='rmd-caution'}
**Sampling from a numeric vector.** Vanilla `sample` cannot be trusted on numeric or integer vectors unless we are sure the vector has a length greater than 1. Read `?sample` for details.
```
I should have instead sampled from `y`'s indices like
```{r sample-better, eval=FALSE}
y[sample.int(length(y), size = 2, replace = TRUE)]
# or
y[sample(seq_along(y), size = 2, replace = TRUE)]
```
To take draws without replacement, just use the option `replace = FALSE`.
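If this comes up often, the index-based approach can be wrapped in a helper. The name `safe_sample` below is hypothetical; a similar helper appears in the examples of `?sample`.

```{r safe-sample-sketch}
# hypothetical helper: always sample from the elements of x, whatever its length
safe_sample = function(x, ...) x[sample.int(length(x), ...)]

set.seed(1)
y = c(6.17)
draws = safe_sample(y, size = 2, replace = TRUE)
draws   # 6.17 6.17
```

Extra arguments such as `size`, `replace` and `prob` are passed along to `sample.int` through the dots.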
### Permutations
Taking a random permutation of a vector is just a special case of drawing from an urn. We're drawing *all* balls from the urn without replacement:
```{r permute-danger, eval=FALSE}
y = c(1, 2, 3, 4)
sample(y) # using default replace = FALSE
```
The same warning for length-zero or length-one vectors applies here. The safer way is:
```{r permute-better, eval=FALSE}
y = c(1, 2, 3, 4)
y[sample.int(length(y))]
# or
y[sample(seq_along(y))]
```
### Simulations
For a random process to repeat `nsim` times, the `replicate` function is the answer. Suppose we want to take 10 standard normal draws and return the top three:
```{r replicate}
nsim = 2
n = 10
set.seed(1)
replicate(nsim, rnorm(n) %>% sort %>% tail(3))
```
The `set.seed` line is added so that the results are reproducible.
#### Complicated output
By default, `replicate` simplifies the result to a matrix. For more complicated results, it is often better to have a list. Suppose we want to grab the top three values as well as the mean:
```{r replicate-list}
nsim = 2
n = 10
set.seed(1)
replicate(nsim, rnorm(n) %>% { list(mu = mean(.), top3 = sort(.) %>% tail(3)) }, simplify = FALSE)
```
Thanks to use of the same seed (with `set.seed`), the simulated values here are the same as in the last section.
#### Speed and pipes
As a reminder, magrittr pipes -- `%>%` -- are slow. Here's a benchmark (with results measured in seconds):
```{r magrittr-perf, cache=TRUE}
nsim = 1e4
n = 1e2
system.time(replicate(nsim, simplify = FALSE,
  rnorm(n) %>% { list(mu = mean(.), top3 = sort(.) %>% tail(3))}
))
system.time(replicate(nsim, simplify = FALSE, {
  x = rnorm(n)
  list(mu = mean(x), top3 = tail(sort(x),3))
}))
```
So while it may be convenient to write a quick simulation using pipes, it's better to switch to vanilla R when speed is important.
```{block2 sim-speed, type='rmd-details'}
**Other options for speed.** Beyond writing in vanilla R instead of pipes, further improvements might be found by writing as much of the simulation as possible in matrix algebra (possibly even combining "separate" simulations into a single computation). Translating to C++ or parallelizing may also be worth considering. For more details, see the [High Performance Computing Task View on CRAN](https://cran.r-project.org/view=HighPerformanceComputing).
```
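As one sketch of combining "separate" simulations into a single computation, all draws can be made at once and arranged in a matrix with one simulation per column. This assumes R's default random number settings, under which the two versions should consume the same stream of draws:

```{r sim-matrix-sketch}
nsim = 1000
n = 100

# one rnorm call per simulation
set.seed(1)
mu_loop = replicate(nsim, mean(rnorm(n)))

# a single rnorm call, with one simulation per column
set.seed(1)
X = matrix(rnorm(nsim * n), nrow = n)
mu_mat = colMeans(X)

all.equal(mu_loop, mu_mat)
```

This only covers the mean; statistics without a matrix-friendly form (like the top three values) would still need something like `apply`.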
#### Multiple parameters
If running multiple simulations with different parameters, I recommend starting with the parameters in a data.table with one row per parameter set, and then storing the results in a data.table with one row per run:
```{r sim-fancy-pre, echo=-2}
simparmDT = data.table(set_id = LETTERS[1:2], nsim = 3, n = 10, mu = c(0, 10), sd = c(1, 2))
simparmDT
```
```{r sim-fancy, echo=-3}
set.seed(1)
simDT = simparmDT[, .(
  # assign a unique ID to each run
  run_id = sprintf("%s%03d", set_id, seq.int(nsim)),
  # run simulations, storing results in a list column
  res = replicate(nsim, simplify = FALSE, {
    x = rnorm(n, mu, sd)
    list(mu = mean(x), top3 = tail(sort(x),3))
  })
), by=set_id]
simDT
```
For cleaner code and easier debugging, it's probably best to write the core simulation code as a separate function:
```{r sim-fn}
sim_fun = function(n, mu, sd){
  x = rnorm(n, mu, sd)
  list(mu = mean(x), top3 = tail(sort(x),3))
}
```
Then the code to call it can be agnostic regarding the details of the function:
```{r sim-fn-use}
set.seed(1)
simDT = simparmDT[, .(
run_id = sprintf("%s%03d", set_id, seq.int(nsim)),
res = replicate(nsim, do.call(sim_fun, .SD), simplify = FALSE)
), by=set_id, .SDcols = intersect(names(simparmDT), names(formals(sim_fun)))]
```
We are setting `.SDcols` to be those names that are used as arguments to `sim_fun` and also appear as columns in `simparmDT`. That is, we are taking the intersection of these two vectors of names.
The `do.call` and `intersect` functions are covered in \@ref(do-call) and \@ref(sets), respectively.
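To make the pieces concrete, here is a sketch outside of any table. It repeats the definition of `sim_fun` so that it runs on its own, and `col_names` is typed in by hand to match the columns of the parameter table:

```{r formals-sketch}
sim_fun = function(n, mu, sd){
  x = rnorm(n, mu, sd)
  list(mu = mean(x), top3 = tail(sort(x), 3))
}

# argument names of the function
arg_names = names(formals(sim_fun))
# column names of the parameter table
col_names = c("set_id", "nsim", "n", "mu", "sd")
# the columns that can be passed along as arguments
use_cols = intersect(col_names, arg_names)

# do.call passes a named list as the function's arguments
set.seed(1)
res = do.call(sim_fun, list(n = 10, mu = 0, sd = 1))

arg_names  # "n" "mu" "sd"
use_cols   # "n" "mu" "sd"
```

Within each `by=` group, `.SD` is itself a named list with one element per column in `.SDcols`, so it can play the role of the list passed to `do.call`.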
Next we can extract all components as new columns (still sticking with one row per simulation run):
```{r sim-fancy-fin}
simDT[, (names(simDT$res[[1]])) := res %>%
  # put non-scalar components into lists
  lapply(function(run) lapply(run, function(x)
    if (is.atomic(x) && length(x) == 1L) x
    else list(x)
  )) %>%
  # combine rows into one table
  rbindlist
][]
```
## Working with sequences {#sequences}
Often we care not only about the value in the current row, but also in the previous `k` rows or all previous rows.
### Lag operators {#lag}
Data.table provides a convenient tool for the lag operator and its inverse, a "lead operator":
```{r shift}
x = c(1, 3, 7, 10)
shift(x) # lag
shift(x, type = "lead") # lead
shift(x, 0:3) # multiple lags
```
`shift` always returns vectors of the same length as `x`, which works well with data.tables:
```{r shift-dt, echo=-7}
# add to an existing table
DT = data.table(id = seq_along(x), x)
DT[, sprintf("V%02d", 0:3) := shift(x, 0:3)][]
# make a new table
shiftDT = setDT(c(list(x = x), shift(x, 0:3)))
shiftDT
```
The `embed` function from base R may also be useful when handling lags. It extracts all sequences of a given size:
```{r embed}
embed(x, 3)
```
### Taking differences
A simple common use for `shift` is taking differences across rows by group:
```{r shift-diff}
DT = data.table(id = c(1L, 1L, 2L, 2L), v = c(1, 3, 7, 10))
DT[, dv := v - shift(v), by=id]
# fill in the missing lag for the first row
DT[, dv_fill := v - shift(v, fill = 0), by=id][]
```
There is also a `diff` function in base R, but it is less flexible than computing differences with `shift`; and it does not return a vector of the same length as its input.
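The length mismatch is easy to see side by side (a small sketch on an arbitrary vector):

```{r diff-vs-shift-sketch}
library(data.table)
v = c(1, 3, 7, 10)

d1 = diff(v)          # 2 4 3
d2 = v - shift(v)     # NA 2 4 3

length(d1)   # 3, one shorter than v
length(d2)   # 4, the same as v
```

Padding with a leading `NA`, as in `c(NA, diff(v))`, recovers the `shift` result, but that gets awkward once groups or longer lags are involved.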
### Rolling computations {#rolling}
If we want a rolling sum, the naive approach is to take the sum of elements 1:2, then the sum of 1:3, then the sum of 1:4. In the context of data.tables, this would look like a self non-equi join (\@ref(joins-nonequi)):
```{r badroll-pre, echo=-4}
# example data
set.seed(1)
DT = data.table(id = rep(LETTERS[1:3], each = 3))[, `:=`(
  sid = rowid(id),
  v = sample(0:2, .N, replace = TRUE)
)][]
DT
```
```{r badroll}
DT[, vsum :=
  DT[.SD, on=.(id, sid <= sid), sum(v), by=.EACHI]$V1
, .SDcols = c("id", "sid")]
```
(As a side-note, `rowid` conveniently builds a within-group counter, as seen here. For more counters, see \@ref(gids).)
The number of calculations performed with this non-equi join approach scales at around `N^2` as the number of rows `N` increases, which is pretty poor. Nonetheless, for any rolling computation, this will always be an option.
In this case (and many others), there's a better way, though:
```{r goodroll}
DT[, vsum2 := cumsum(v), by=id][]
```
Besides `cumsum` for the cumulative sum, there's also `cumprod` for the cumulative product; and `cummin` and `cummax`, for the cumulative min and max. For more rollers and to roll new ones, see the [zoo](https://CRAN.R-project.org/package=zoo) and [RcppRoll](https://CRAN.R-project.org/package=RcppRoll) packages.
To "roll" missing values forward, it is common to use `na.locf` ("last observation carried forward") from [the widely-used zoo package](https://cran.r-project.org/web/packages/zoo/index.html). Alternatively, `DT[, v := v[1L], by=cumsum(!is.na(v))]` can do the trick.
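Here is a sketch of the grouping trick on a made-up column, writing to new columns so the original stays visible. The second line uses data.table's `nafill`, which I am assuming is available (it was added in more recent versions of the package and only handles numeric columns):

```{r locf-sketch}
library(data.table)
locfDT = data.table(v = c(1, NA, NA, 4, NA))

# each non-missing value starts a new group; copy the group's first value down
locfDT[, v_fill := v[1L], by = cumsum(!is.na(v))]

# data.table's own helper, for numeric columns
locfDT[, v_fill2 := nafill(v, type = "locf")]

locfDT[]
```

Any missing values at the very start fall in group zero and stay missing, since there is nothing to carry forward.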
### Run-length encoding
In time series, it is common to have repeated values, called "runs" or "spells." To work with these in a vector, use `rle` ("run length encoding") and `inverse.rle`:
```{r rle}
x = c(1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1)
rle(x)
```
Because the result is a list, we can use `with` on it (as discussed in the last section). For example, if we just want the final three runs:
```{r with-rle}
r = rle(x)
new_r = with(r, {
  new_lengths = tail(lengths, 3)
  new_values = tail(values, 3)
  list(lengths = new_lengths, values = new_values)
})
inverse.rle(new_r)
```
As we saw in \@ref(function-environment), `with(L, expr)` attaches objects in a list or environment `L` temporarily for the purpose of evaluating the expression `expr`.
### Grouping on runs {#gids}
`cut` and `findInterval` are good ways to construct groups from intervals of a continuous variable.
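A small sketch of the two, on made-up values and breakpoints. Note that their defaults differ in which end of each interval is closed:

```{r cut-sketch}
v = c(0.5, 2, 7.5, 10)
breaks = c(0, 1, 5, 10)

# integer codes; intervals are closed on the left by default
fi = findInterval(v, breaks)
# factor labels; intervals are closed on the right by default
ct = cut(v, breaks)

fi                 # 1 2 3 4
as.character(ct)   # "(0,1]" "(1,5]" "(5,10]" "(5,10]"
```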
To group by sequences, it is often useful to take advantage of...
- *Cumulative sums*: `cumsum`
- *Differencing*: `diff` or `x - shift(x)`, explained in \@ref(lag)
- *Runs*: `rleid` and `rle`
To label rows *within* a group, use data.table's `rowid`. To label groups, use data.table's `.GRP`, like `DT[, g_id := .GRP, by=g]`.
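These tools combine naturally. A sketch, reusing the vector from the run-length encoding example in a made-up table `runDT`:

```{r rleid-sketch}
library(data.table)
runDT = data.table(x = c(1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1))

# label each run of repeated values
runDT[, run_id := rleid(x)]
# the same labels, built from differencing and a cumulative sum
runDT[, run_id2 := cumsum(c(TRUE, diff(x) != 0))]
# position within each run
runDT[, pos := rowid(run_id)]

# one row per run, mirroring rle(x)
runs = runDT[, .(value = x[1L], len = .N), by = run_id]
runs
```

The `cumsum` version is more typing here, but the same pattern extends to other rules for when a new group should start.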
Remember that while it's often tempting in R to run computations by row, like `DT[, ..., by=1:nrow(DT)]`, this is almost always a bad idea, both in terms of clarity and speed.
## Combining table columns
As noted in \@ref(dt-lapply), it is generally a bad sign if many calculations are made across columns. Nonetheless, some occasionally make sense.
For example, if each row represents a vector and columns represent dimensions of the vectors, we might want to take the Euclidean norm rowwise:
```{r combcol-norm}
DT = data.table(id = 1:2, x = 1:2, y = 3:4, z = 5:6)
DT[, norm := Reduce(`+`, lapply(.SD, `^`, 2)) %>% sqrt, .SDcols = x:z][]
```
Or we might have several categorical string variables that we want to collapse into a single code (essentially reversing `tstrsplit` from \@ref(char-input)):
```{r paste-pre}
# example data
set.seed(1)
DT = data.table(city = c("A", "B"))[,
  state := sample(state.abb, .N)][,
  country := c("US", "CA")]
```
```{r paste}
DT[, code := do.call(paste, c(.SD, sep = "_"))][]
```
Or we might want to take the minimum over the last `n` rows, a rolling-window operation (\@ref(rolling)):
```{r shiftroll-pre}
set.seed(1)
DT = data.table(id = rep(LETTERS[1:20], each=10))[,
  v := sample(-10:10, .N, replace = TRUE)]
```
```{r shiftroll}
DT[, lag_min := do.call(pmin, shift(v, 0:2)), by=id][]
```
This is an instance of combining columns, since `shift(v, 0:2)` is essentially three columns. Try `DT[, shift(v, 0:2), by=id]` and review \@ref(lag) to see.
While all of these examples are doing rowwise computations, none of them actually condition on each row, with `by=1:nrow(DT)` or similar. As a result, they are much more efficient.
All of these functions (`pmin`, `pmax`, `paste`) take a variable number of arguments. Looking at the functions' arguments with `args(function_name)`, we can see that they do this via dots `...` (\@ref(fundots)).
## Regression {#reg}
Most basic regression models have functions to "run" them, as a search online will reveal.
### Extracting results
The return value from a regression function is a special object designed to hold the results of the run. In the case of a linear regression, the `lm` function returns an `lm` object.
To see how this object can be used, look for the object's attributes and for methods associated with its class:
```{r lm}
set.seed(1)
n = 20
DT = data.table(
  y = rnorm(n), x = rnorm(n),
  z = sample(LETTERS[1:3], n, replace = TRUE)
)
res = DT[, lm(y ~ x)]
attributes(res)
class(res)
methods(class = "lm")
```
Attributes can always be accessed like `attr(obj, attr_name)`. Here, the `names` attribute lists the components stored in the object, which can be extracted like `res$coefficients`; some also have their own accessor functions, like `coefficients(res)`. Methods are functions that can be applied directly to the object, like `plot(res, 1)`. For documentation on methods, add a dot and the class, like `?plot.lm` and `?predict.lm`.
Besides usage like `DT[, lm(y ~ x)]`, most regression functions also offer a `data` argument, like `lm(y ~ x, data = DT)`.
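A few of the most common extractions, sketched on simulated data (the names `regDT` and `fit` are just for this example):

```{r lm-extract-sketch}
library(data.table)
set.seed(1)
n = 20
regDT = data.table(y = rnorm(n), x = rnorm(n))
fit = lm(y ~ x, data = regDT)

names(fit)                    # components stored in the object
coef(fit)                     # accessor for the estimated coefficients
coef_table = summary(fit)$coefficients   # estimates, standard errors, t and p values
coef_table

# predictions at new values of x
pred = predict(fit, newdata = data.table(x = c(-1, 0, 1)))
pred
```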
### Formulas
The `y ~ x` expression above is called a formula. These can greatly simplify the specification of a regression equation.
For example, if we want the categorical variable `z` on the right-hand side, there is no need to construct and list a set of dummies (as noted in \@ref(relational-updatejoin)):
```{r reg-factor}
lm(y ~ z, data = DT)
```
And there are many other simple extensions:
```{r reg-formulas}
# including all levels of z by dropping the intercept
lm(y ~ z - 1, data = DT)
# including an interaction term
lm(y ~ z:x, data = DT)
# including all of z's levels by dropping x and the intercept
lm(y ~ z*x - 1 - x, data = DT)
# including a polynomial
lm(y ~ poly(x, 2), data = DT)
```
The doc at `?formula` explains all the ways to write formulas concisely. Many functions besides `lm` allow the same sorts of formulas.
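To check how a formula will be expanded into columns, `model.matrix` can be applied to it directly. A sketch on a tiny made-up data set:

```{r model-matrix-sketch}
d = data.frame(z = c("A", "B", "C", "A"), x = 1:4)

with_intercept = colnames(model.matrix(~ z, data = d))
no_intercept   = colnames(model.matrix(~ z - 1, data = d))

with_intercept  # "(Intercept)" "zB" "zC"
no_intercept    # "zA" "zB" "zC"
```

With the intercept included, the first level of `z` is dropped as the reference category.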
We saw `dcast` (\@ref(dcast-browse)) also uses formulas. These only come in very simple forms, as documented in `?dcast`.
```{r todo-learn-formula, eval=FALSE, echo=FALSE}
# i'll just leave this collection here until i learn more
#
# needed for regression, whether with `model.matrix` or a regression function like `lm`
#
# also needed for reshaping, whether with `aggregate` or `dcast`
#
# read the docs carefully for the syntax, which is quite nice. however, it's not so easy to change a formula
#
# do what you can to minimize its scope
http://stackoverflow.com/documentation/r/1061/formula#t=201703291535130876176
http://stackoverflow.com/q/4951442
http://stackoverflow.com/q/18017765
?formula
http://ww2.coastal.edu/kingw/statistics/R-tutorials/formulae.html
http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf
http://stackoverflow.com/q/4392042/
http://stackoverflow.com/questions/9585890 -- a very nice usage, i think
http://stackoverflow.com/q/21330633/
http://stackoverflow.com/q/29563622/
http://stackoverflow.com/q/2427279/
http://stackoverflow.com/q/1300575/
http://stackoverflow.com/q/32616762/ -- very helpful
```
### Custom regression
If no one has implemented the regression procedure we want, it is possible to program up a basic version of it (assuming we understand it). Writing up a version that works nicely with formulas and has all the attributes and methods seen with `lm` would be a tall order, of course.
It's best for speed and simplicity to lean on matrix algebra as much as possible. If speed becomes a real issue, translating to C++ with [the Rcpp package](https://CRAN.R-project.org/package=Rcpp) may help.
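As a minimal sketch of what "leaning on matrix algebra" might look like, here is ordinary least squares written by hand on simulated data. It is only an illustration (in practice `lm` already does this, with better numerical safeguards):

```{r ols-sketch}
set.seed(1)
n = 50
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)

# design matrix with an intercept column
X = cbind(intercept = 1, x = x)

# solve the normal equations (X'X) b = X'y
b = solve(crossprod(X), crossprod(X, y))

# residual variance and standard errors
e = y - X %*% b
s2 = sum(e^2) / (n - ncol(X))
se = sqrt(diag(s2 * solve(crossprod(X))))

cbind(estimate = drop(b), se = se)
```

The estimates and standard errors can be compared against `summary(lm(y ~ x))$coefficients` as a sanity check.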
## Numerical tasks
### Optimization
To find an extremum of a function of a single real number, use `optimize`. For functions of several variables (passed as one vector), there's `optim`. (R is agnostic on British vs. American English, so `optimise` does the same thing as `optimize`.)
The documentation covers everything (optimization methods available, numerical issues to be aware of, stopping rules, "trace" options).
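A quick sketch of both functions on toy objectives (the functions `f` and `g` are made up for the example, and `optim` is left at its default method):

```{r optim-sketch}
# one-dimensional: minimize over an interval
f = function(x) (x - 2)^2
opt1 = optimize(f, interval = c(-10, 10))
opt1$minimum   # close to 2

# several dimensions: the function takes a single vector of parameters
g = function(p) (p[1] - 1)^2 + (p[2] + 3)^2
opt2 = optim(par = c(0, 0), fn = g)
opt2$par           # close to c(1, -3)
opt2$convergence   # 0 indicates successful convergence
```

Both functions minimize by default; see their help pages for how to maximize instead.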
### Integration
Given a vectorized function (\@ref(vectorization)) mapping to and from the real line and an interval, `integrate` will approximate the associated definite integral. The documentation explains the method used and the numerical issues.
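For example (a sketch with arbitrary integrands):

```{r integrate-sketch}
# area under the standard normal density between -1.96 and 1.96
res = integrate(dnorm, lower = -1.96, upper = 1.96)
res$value       # roughly 0.95
res$abs.error   # estimate of the absolute error

# a custom integrand must accept and return a vector
f = function(x) x^2
val = integrate(f, 0, 3)$value
val             # the exact answer is 9
```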
## String operations {#strings}
Here are various string-related functions:
- `nchar(x)` tells how long a string is, and `nzchar(x)` checks if it is non-empty.
- `substr(x, start, end)` extracts a contiguous substring, as does `substring`.
- `sprintf` creates a string according to a specified format (with, e.g., leading zeros).
- `format` prints a string representation of an object (see, e.g., `?format.Date`).
- `paste` combines strings according to some simple rules, with `paste0` and `toString` as special cases.
- `grep`, `grepl`, `regexpr` and `gregexpr` search strings, with `regmatches` for extracting matches.
- `sub` and `gsub` rewrite strings.
- `strsplit` and `tstrsplit` can split strings based on a pattern.
- `chartr` changes single characters.
- `tolower` and `toupper` change cases.
Most of these are vectorized (\@ref(vectorization)) whenever it makes sense. Examples for a couple of them can be found in \@ref(char-input).
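A handful of these in action, on a made-up character vector:

```{r string-sketch}
x = c("apple pie", "Banana", "")

nchar(x)                     # 9 6 0
nzchar(x)                    # TRUE TRUE FALSE
substr(x, 1, 3)              # "app" "Ban" ""
sprintf("%03d", 7L)          # "007"
toupper(x)                   # "APPLE PIE" "BANANA" ""
grepl("an", x)               # FALSE TRUE FALSE
sub("a", "A", x)             # first match only: "Apple pie" "BAnana" ""
gsub("a", "A", x)            # all matches: "Apple pie" "BAnAnA" ""
chartr("ab", "AB", "abba")   # "ABBA"
strsplit("a_b_c", "_")[[1]]  # "a" "b" "c"
```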
For searching, rewriting or splitting strings, some familiarity with regular expressions will often be helpful. See `?regex`. Because the syntax of the functions can also get cumbersome, it may be worth it to use a helper like [the stringi package](https://CRAN.R-project.org/package=stringi).
As mentioned in \@ref(format-cols), the number of digits printed on numbers can be tweaked via the scipen option (see `?options`), though I think this is only likely to be useful when printing out results.
`print` and `cat` can both be used to print output to the screen, with various options. Of the two, only `cat` writes the string's actual contents rather than its quoted, escaped representation, so `cat('Egad!\nI dare say, "It\'s you!"\n')` will print the newlines and quotes.
```{block2 message-programming, type='rmd-details'}
**Writing to console.** While I have simply used `cat`, `print` and `stopifnot` throughout this document, the proper way to handle messages is more complicated, as [explained by coatless on Stack Overflow](http://stackoverflow.com/a/36700294/). I still haven't learned it myself, but it is likely to be important when writing a package.
```
In my opinion, very few string operations (and no regex operations) are necessary outside of input-output processing (\@ref(input-output)) and printing messages.
The input and output functions have their own string processing tools, so plenty of manual string tweaking can often be avoided. For example, `na.strings` in `fread` can recognize missing values as such when a file is read in; and `dateTimeAs` in `fwrite` can format date and time columns when printing a table out.
```{block2 qw, type='rmd-details'}
**Saving keystrokes.**
Finally, for the sake of my sanity, I borrow [a Perl function](https://www.google.com/#q=perl%20qw) for creating character vectors.

    qw <- function(s="") unlist(strsplit(s,'[[:space:]]+'))
    my_cols <- qw("id city_name state_code country_code")
    setkeyv(DT, my_cols)

Those coming from Stata, where columns are always barewords, may appreciate this style, and I guess such quality-of-life functions are fairly harmless so long as no function using them is shared as if it's standalone.
```