Why do we sleep on factors?

And how I wish things would behave

Author

Josiah Parry

Published

February 15, 2024

Factors are R’s version of an enum(eration) (related post). They’re quite handy, and I think we can probably rely on them a bit more for enumerations like c("a", "b", "c"). Today I’ve been helping test a possible new feature of extendr involving factors, and it has me thinking about some of their behaviors. Here are my extemporaneous thoughts:

When we have a factor, how can we take new values and associate them with that existing factor?

For example, we can create a factor of the alphabet.

f <- as.factor(letters)
f

 [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Say we have new values that match the level names and want to extend the vector or create a new one based on the levels.

It would be nice if we could subset a factor by a level's name:

f["a"]

[1] <NA>
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

but this gives us an NA because there is no element named "a". If we gave the elements names, we could access them that way:

setNames(f, letters)["a"]

a 
a 
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

but this would be antithetical to the efficiency of a factor.

The key selling point of a factor is that we define the levels only once and each element refers to a level by its integer position. This is far far far faster and more memory efficient than repeating a value a sh!t ton of times.
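To make the integer-backed storage concrete, here is a small base R sketch. The vector `x`, the fruit names, and the size comparison are my own illustration and not part of the extendr work; exact sizes will depend on your platform.

```r
f <- as.factor(letters)

# A factor is an integer vector carrying a "levels" attribute
typeof(f)
head(as.integer(f))
head(levels(f))

# Illustrative data (an assumption for this sketch):
# one million values drawn from three strings
x <- sample(c("apple", "banana", "cherry"), 1e6, replace = TRUE)
fx <- factor(x)

# Each factor element is an integer code, while each character element
# is a pointer to a cached string, which is larger on 64-bit builds
object.size(x)
object.size(fx)
```

Adding names with setNames() attaches a full character vector alongside the codes, which is why it works against this saving.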

To create a new factor we have to pass in the levels accordingly:

factor("d", levels(f))

[1] d
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

This is actually pretty nice! But I feel like there could be an even better experience, though I don’t know what it would be…
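One possible shape for that better experience is a small wrapper that reuses the levels of an existing factor but refuses values it does not know about. This is only a sketch: `factor_like()` is a hypothetical helper name, not something in base R, and it ignores ordered factors.

```r
f <- as.factor(letters)

# factor() silently turns values that are not in the levels into NA
factor("123", levels(f))

# Hypothetical helper (not part of base R): reuse the levels of `like`,
# but fail loudly when a value is not one of them
factor_like <- function(x, like) {
  x <- as.character(x)
  unknown <- setdiff(x[!is.na(x)], levels(like))
  if (length(unknown) > 0) {
    stop("Unknown level(s): ", paste(unknown, collapse = ", "))
  }
  factor(x, levels = levels(like))
}

factor_like(c("d", "z"), f)
```

Calling `factor_like("123", f)` stops with an error instead of quietly producing an NA.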

If we try to extend the vector by combining the existing factor with level names, the result is coerced to a character vector, but instead of the level labels we get the underlying integer codes.

c(f, "a")

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "a"

To combine them we would need to ensure that they are both factors.

c(f, factor("d", levels(f)))

 [1] a b c d e f g h i j k l m n o p q r s t u v w x y z d
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Using `vctrs`

Upon further thinking, vctrs tends to have the type-safe behavior that I wish R had (and aspects of it should probably be adopted into base R).

I think vctrs actually gets me the behavior that I want. If I have a value, call vctrs::vec_cast() on it, and provide the existing factor vector f as the `to` argument, it will use the levels of f.

vctrs::vec_cast("z", f)

[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

But this will not succeed if we pass it a value that is unknown. The error message is a bit cryptic and frankly feels a little pythonic in the verbosity of the traceback! But this is type safe! And I LIKE IT!

vctrs::vec_cast("123", f)

Error:
! Can't convert from `"123"` <character> to <factor<754f0>> due to loss of generality.
• Locations: 1
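If you would rather handle a failed cast than let it halt your script, you can wrap the call in tryCatch(). A minimal sketch, assuming vctrs is installed; `try_cast()` is a hypothetical wrapper name, and it catches any error rather than a specific vctrs condition class.

```r
f <- as.factor(letters)

# Hypothetical wrapper: return NULL instead of erroring on a failed cast
try_cast <- function(x, to) {
  tryCatch(
    vctrs::vec_cast(x, to),
    error = function(e) NULL
  )
}

try_cast("z", f)    # a known level: the cast succeeds
try_cast("123", f)  # an unknown value: the error is caught
```

Returning NULL is just one choice; you could equally re-signal the error with a friendlier message.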
