Overview
kit provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection.
Key features include:
- Parallel statistical functions: Row-wise operations (psum, pmean, pfirst) using OpenMP.
- Vectorized conditionals: Fast if-else logic (iif, nif, vswitch) that preserves attributes.
- Efficient set operations: Faster unique, duplicated, and count for vectors and data frames.
- Partial sorting: Retrieve the top N elements without sorting the entire vector (topn).
- Factor utilities: Fast character-to-factor conversion (charToFact) and level manipulation (setlevels).
Most functions are implemented in C and support multi-threading where applicable, making them significantly faster than their base R equivalents on large datasets.
Parallel Statistical Functions
Computing row-wise statistics across multiple vectors or data frame
columns is a common task. While base R has pmin() and
pmax(), it lacks efficient equivalents for sum, mean, or
product. kit fills this gap.
Row-wise Arithmetic
psum(), pmean(), and pprod()
compute parallel sum, mean, and product respectively. They accept
multiple vectors or a single list/data frame.
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)
# Parallel sum
psum(x, y, z, na.rm = TRUE)
#> [1] 6 7 8 7
# Parallel mean
pmean(x, y, z, na.rm = TRUE)
#> [1] 2.000000 3.500000 4.000000 2.333333
They are particularly useful for data frames:
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
psum(df)
#> [1] 12 15 18
Row-wise Min, Max, and Range
fpmin(), fpmax(), and prange()
compute parallel minimum, maximum, and range (max - min) respectively.
They complement base R’s pmin() and pmax(),
providing greater performance and the ability to work efficiently with
data frames.
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)
# Parallel minimum
fpmin(x, y, z, na.rm = TRUE)
#> [1] 1 3 4 1
# Parallel maximum
fpmax(x, y, z, na.rm = TRUE)
#> [1] 3 4 4 5
# Parallel range (max - min)
prange(x, y, z, na.rm = TRUE)
#> [1] 2 1 0 4
Like psum() and pmean(), these functions
preserve the input type when all inputs have the same type, and
automatically promote to the highest type when inputs are mixed (logical
< integer < double). prange() always returns double
to avoid integer overflow.
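The promotion rule can be checked with typeof(). The following is only a small sketch: it assumes kit is installed and that the functions behave as described above, and the input vectors are made up for illustration.

```r
library(kit)

# Illustrative inputs (not from the original text)
i1 <- c(1L, 5L, 3L)
i2 <- c(4L, 2L, 6L)
d1 <- c(0.5, 9.5, 3.5)

type_same  <- typeof(fpmin(i1, i2))   # all-integer inputs: expected to stay integer
type_mixed <- typeof(fpmax(i1, d1))   # integer mixed with double: expected double
type_range <- typeof(prange(i1, i2))  # prange() is described as always double

c(same = type_same, mixed = type_mixed, range = type_range)
```

If an integer result is required downstream, checking typeof() like this before relying on it is cheaper than debugging a silent promotion later.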
Coalescing Values
pfirst() and plast() return the first or
last non-missing value across a set of vectors. pfirst() behaves like
the SQL COALESCE function.
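A short sketch of coalescing, assuming kit is installed. The vectors are illustrative, and the nested ifelse() call is only a base R reference for the first-non-NA case, not how kit implements it.

```r
library(kit)

# Illustrative vectors; every row has at least one non-missing value
x <- c(NA, 2, NA, 4)
y <- c(1, NA, NA, 5)
z <- c(9, 9, 3, NA)

first_val <- pfirst(x, y, z)  # first non-NA per row: 1 2 3 4
last_val  <- plast(x, y, z)   # last non-NA per row:  9 9 3 5

# Base R reference for the "first non-NA" case, for comparison
ref_first <- ifelse(!is.na(x), x, ifelse(!is.na(y), y, z))
```

The base R version needs one more level of nesting for every extra vector, whereas the pfirst() call only grows by one argument.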
Logical and Count Operations
You can check for conditions or count values row-wise with
pall(), pany(), pcount(), and pcountNA().
a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)
# Any TRUE per row?
pany(a, b, c, na.rm = TRUE)
#> [1] TRUE TRUE TRUE TRUE
# Count NAs per row
pcountNA(a, b, c)
#> [1] 1 1 1 0
# Count specific value (e.g., TRUE) per row
pcount(a, b, c, value = TRUE)
#> [1] 2 1 1 2
Vectorized Conditionals
Fast If-Else (iif)
Base R’s ifelse() is known to be slow and often strips
attributes (like Date class or factor levels).
iif() is a faster, more robust alternative that preserves
attributes from the yes argument.
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
# Base ifelse strips class
class(ifelse(dates > "2024-01-01", dates, dates - 1))
#> [1] "numeric"
# iif preserves class
class(iif(dates > "2024-01-01", dates, dates - 1))
#> [1] "Date"
It also supports explicit NA handling:
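A minimal sketch of NA handling, assuming kit is installed and that iif() accepts an na argument supplying the value used where the test is NA. The score vector is illustrative.

```r
library(kit)

score <- c(72L, NA, 45L, 90L)

# Default: an NA in the test yields NA in the result
plain <- iif(score >= 50L, "pass", "fail")
plain
#> [1] "pass" NA     "fail" "pass"

# With `na`, missing tests get an explicit value instead
labelled <- iif(score >= 50L, "pass", "fail", na = "missing")
labelled
#> [1] "pass"    "missing" "fail"    "pass"
```

With ifelse(), the same result would need a second pass such as replacing is.na() positions afterwards.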
Nested Conditionals (nif)
For multiple conditions, nif() offers a cleaner, more
efficient syntax than nested ifelse() calls, similar to
SQL’s CASE WHEN.
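As a sketch of the condition/value pairing, assuming kit is installed: the grading thresholds below are illustrative, and the nested ifelse() call is shown only as the base R form being replaced.

```r
library(kit)

score <- c(95, 82, 67, 40)

# Conditions are checked in order; the first TRUE one determines the result
grade <- nif(
  score >= 90, "A",
  score >= 80, "B",
  score >= 60, "C",
  default = "F"
)
grade
#> [1] "A" "B" "C" "F"

# The nested ifelse() this replaces
grade_base <- ifelse(score >= 90, "A",
                ifelse(score >= 80, "B",
                  ifelse(score >= 60, "C", "F")))
```

Because evaluation stops at the first matching condition, the thresholds must be listed from most to least restrictive, exactly as in a CASE WHEN expression.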
Vectorized Switch (vswitch, nswitch)
vswitch() maps input values to outputs efficiently.
status_code <- c(1L, 2L, 3L, 1L, 4L)
vswitch(
  x = status_code,
  values = c(1L, 2L, 3L),
  outputs = c("pending", "approved", "rejected"),
  default = "unknown"
)
#> [1] "pending" "approved" "rejected" "pending" "unknown"
For pairwise syntax, nswitch() pairs values and outputs
directly.
nswitch(status_code,
  1L, "pending",
  2L, "approved",
  3L, "rejected",
  default = "unknown"
)
#> [1] "pending" "approved" "rejected" "pending" "unknown"
nswitch() can also take its outputs from other vectors (columns), mixing scalars and vectors:
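The following sketch assumes kit is installed and that nswitch() accepts an output that is either a scalar or a vector of the same length as the input, as the sentence above states. The code, price, and sale vectors are hypothetical, and the ifelse() chain is a base R reference for the intended element-wise selection.

```r
library(kit)

code  <- c(1L, 2L, 3L, 1L)
price <- c(10, 20, 30, 40)   # hypothetical column used where code == 1
sale  <- c(8, 15, 25, 35)    # hypothetical column used where code == 2

# Vector outputs are taken element-wise; the scalar 0 is used where code == 3
res <- nswitch(code,
  1L, price,
  2L, sale,
  3L, 0,
  default = NA_real_
)

# Base R reference computing the same selection
ref <- ifelse(code == 1L, price,
         ifelse(code == 2L, sale,
           ifelse(code == 3L, 0, NA_real_)))

res
ref
```

All outputs here are doubles on purpose, since the outputs and the default presumably need to share a type.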
Fast Unique and Duplicates
kit provides optimized versions of
unique() and duplicated() that are
significantly faster for vectors and data frames.
Unique Values and Duplicates
vec <- c("a", "b", "a", "c", "b")
# Get unique values
funique(vec)
#> [1] "a" "b" "c"
# Check for duplicates
fduplicated(vec)
#> [1] FALSE FALSE TRUE FALSE TRUE
uniqLen() efficiently counts the number of unique
elements without allocating the unique vector itself:
df <- data.frame(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "b")
)
uniqLen(df)
#> [1] 2
funique(df)
#>   x y
#> 1 1 a
#> 2 2 b
Counting Occurrences
countOccur() produces a frequency table (similar to
table() or dplyr::count()) but returns a
standard data frame.
countOccur(c("apple", "banana", "apple", "cherry"))
#>   Variable Count
#> 1    apple     2
#> 2   banana     1
#> 3   cherry     1
Sorting and Utilities
Partial Sorting (topn)
Sorting a large vector just to get the top few elements is
inefficient. topn() uses a partial sorting algorithm to
retrieve the top (or bottom) N indices
or values.
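A small sketch of partial sorting, assuming kit is installed and that topn() returns indices by default, with a decreasing argument selecting the top or bottom end. The input vector is illustrative, and the order() call is the full-sort base R reference.

```r
library(kit)

x <- c(5, 2, 9, 1, 7, 3)

# Indices of the 3 largest values, then the values themselves (9, 7, 5)
top_idx  <- topn(x, 3L)
top_vals <- x[top_idx]

# Indices of the 3 smallest values, then the values themselves (1, 2, 3)
bottom_idx  <- topn(x, 3L, decreasing = FALSE)
bottom_vals <- x[bottom_idx]

# Full-sort reference that topn() avoids
order(x, decreasing = TRUE)[1:3]
#> [1] 3 5 1
```

On a six-element vector the difference is invisible; the benefit should appear when n is small relative to the length of the input.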
Factor Manipulation
charToFact() is a fast alternative to
as.factor() for character vectors, with control over
NA levels.
charToFact(c("a", "b", NA, "a"))
#> [1] a b <NA> a
#> Levels: a b <NA>
setlevels() allows you to change factor levels by
reference (in-place), avoiding object copying.
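A minimal sketch of renaming a level in place, assuming kit is installed and that setlevels() takes old and new arguments naming the levels to replace. The factor is illustrative.

```r
library(kit)

f <- factor(c("lo", "hi", "lo", "mid"))
levels(f)
#> [1] "hi"  "lo"  "mid"

# Rename one level in place; note there is no assignment back to f
setlevels(f, old = "lo", new = "low")
levels(f)
#> [1] "hi"  "low" "mid"
```

Because the change is made by reference, any other variable still pointing at the same underlying object would presumably see the new levels too, so take an explicit copy first if the original must be kept.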
Summary
| Task | kit function | Base R equivalent |
|---|---|---|
| Row-wise sum | psum() | rowSums(cbind(...)) |
| Row-wise mean | pmean() | rowMeans(cbind(...)) |
| Row-wise min | fpmin() | pmin(...) |
| Row-wise max | fpmax() | pmax(...) |
| Row-wise range | prange() | pmax(...) - pmin(...) |
| First non-NA | pfirst() | apply(..., 1, function(x) x[!is.na(x)][1]) |
| Fast if-else | iif() | ifelse() |
| Nested if-else | nif() | Nested ifelse() |
| Switch | vswitch() | match() + indexing |
| Unique values | funique() | unique() |
| Top N indices | topn() | order()[1:n] |
| Char to Factor | charToFact() | as.factor() |
For comprehensive details and performance benchmarks, please refer to the individual function documentation.