
Phantom Types and Phantom Data 👻
The first time I encountered PhantomData in Rust, I remember being thoroughly confused… What the heck was the point of it? Why the strange name?
If that’s your case too, this blog post will hopefully shed some light on what it is and show some practical applications.
What is it?
Looking at the documentation, it says:
Zero-sized type used to mark things that “act like” they own a T.
Adding a PhantomData<T> field to your type tells the compiler that your type acts as though it stores a value of type T, even though it doesn’t really. This information is used when computing certain safety properties.
Hmm. Ok. So it is some kind of struct that is generic over a type T, but it only pretends to hold a value of type T.
And it also says it is zero-sized, which makes sense if it doesn’t actually store anything. Looking at the available methods, there isn’t much. In particular, it has no constructor methods: you create one either by writing the unit-struct value PhantomData itself or with the Default trait.
use std::marker::PhantomData;
let boo = PhantomData::<u8>::default();
// Now what??
There’s not much else we can do with it, other than via the few traits it implements (Clone, Eq, etc). We can print its debug representation for instance, with println!("{boo:?}"); but all we get is PhantomData<u8>. Again, not particularly useful.
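Since PhantomData is declared as a unit struct, its bare name is also a value, so Default is only one of several ways to spell it. A minimal sketch of the equivalent forms (the variable names are just for illustration):

```rust
use std::marker::PhantomData;

fn main() {
    // The bare name is a value; the type parameter comes from the annotation.
    let inferred: PhantomData<u8> = PhantomData;
    // The type parameter can also be spelled out with a turbofish.
    let explicit = PhantomData::<u8>;
    // `Default` builds the same (and only) value of the type.
    let defaulted = PhantomData::<u8>::default();

    assert_eq!(inferred, explicit);
    assert_eq!(explicit, defaulted);
}
```

In struct literals the first form is the one you will see most often, because the field's declared type lets the compiler infer the parameter.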
And if you don’t quite believe that it’s zero-sized, we can verify that easily:
println!("boo takes {} bytes", std::mem::size_of_val(&boo));
// Prints: "boo takes 0 bytes"
println!(
    "Type {} takes {} bytes",
    std::any::type_name::<PhantomData<u8>>(),
    std::mem::size_of::<PhantomData<u8>>()
);
// Prints something like: "Type core::marker::PhantomData<u8> takes 0 bytes"
A practical use case: typed IDs
At IDVerse, we store a lot of different entities in DynamoDB, and as is common, most of them have some kind of unique ID, like a UUID, to identify them.
These IDs are used in various APIs (e.g. “fetch the user with this ID”), or are used in one entity to refer to another entity (e.g. “This session belongs to this user”). It then becomes very important to make sure these IDs do not accidentally get mixed up. For instance, trying to fetch a User from the database but passing in a “session ID” instead would be hard to detect at runtime. It would just say the user is “not found”. It would be ideal if we could somehow prevent this at compile time.
But what’s a way to constrain values at compile time? Types! We could create different newtype wrappers for every single entity, like so:
pub struct UserId(UUID);
pub struct SessionId(UUID);
// and so on...
But that would quickly become unwieldy and lead to a lot of code duplication, especially as each ID might want to derive a bunch of traits (Debug, Copy, Clone and so on), have a common way to be serialized/deserialized, methods to convert to/from UUID, methods to parse a string into a valid ID, etc. All that code would be duplicated. You could use macros to alleviate some of the pain, but still, there has to be a better way.
Rust already has a tool for writing code once for many types: generics. Suppose we could have something like this:
pub struct Id<T>(String);
Then we might be able to use it like so:
pub struct User {
    id: Id<User>,
    name: String,
    // ...
}
pub struct Session {
    id: Id<Session>,
    user_id: Id<User>,
    // ...
}
But does it work? If we try, we get this error:
error[E0392]: type parameter `T` is never used
Damn. This is true, we don’t really use it, as in we don’t store any T (an Id<User> doesn’t store a User). But if we read the rest of the error message, we see this:
= help: consider removing `T`, referring to it in a field, or using a marker such as `PhantomData`
= help: if you intended `T` to be a const parameter, use `const T: /* Type */` instead
Ah ha! It mentions our new friend PhantomData! Now the pieces are starting to fall into place: we can use PhantomData to “pretend” that we use our type parameter T even though we don’t really. Let’s give it a try:
use std::marker::PhantomData;

pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}
impl<T> Id<T> {
    pub fn new(id: impl Into<String>) -> Self {
        Self {
            id: id.into(),
            _phantom: Default::default(),
        }
    }
}
Great, it now compiles! Let’s try to use that:
// The type parameter belongs to `Id`, not to `new`, so the turbofish goes on the type.
let user1_id = Id::<User>::new("123");
let user2_id = Id::<User>::new("456");
let session_id = Id::<Session>::new("abc");
let user = User {
    id: user1_id,
    name: "Jane Doe".to_string(),
};
let session = Session {
    id: session_id,
    user_id: user2_id,
};
// Note: the following now doesn't compile!
// let user = User {
//     id: session_id,
//     name: "John Doe".to_string(),
// };
This is great: we can now tell IDs apart by tying them to the entities they identify, and that distinction exists only at compile time. This is literally a zero-cost abstraction.
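To make both claims concrete, here is a minimal sketch that checks the size of Id<T> and shows the type doing its job at a function boundary. The describe_user function and the as_str accessor are hypothetical helpers invented for this example, and User and Session are trimmed down to empty structs.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}

impl<T> Id<T> {
    pub fn new(id: impl Into<String>) -> Self {
        Self { id: id.into(), _phantom: PhantomData }
    }

    // Hypothetical accessor, not part of the original type.
    pub fn as_str(&self) -> &str {
        &self.id
    }
}

// Entity types reduced to empty structs for the example.
pub struct User;
pub struct Session;

// Hypothetical lookup helper: only an `Id<User>` is accepted here.
pub fn describe_user(id: &Id<User>) -> String {
    format!("user#{}", id.as_str())
}

fn main() {
    let user_id = Id::<User>::new("123");
    let session_id = Id::<Session>::new("abc");

    println!("{}", describe_user(&user_id));
    // describe_user(&session_id); // error[E0308]: mismatched types
    println!("session {}", session_id.as_str());

    // The marker adds no bytes: an `Id<T>` is exactly as large as its `String`.
    assert_eq!(size_of::<Id<User>>(), size_of::<String>());
    assert_eq!(size_of::<Id<Session>>(), size_of::<String>());
}
```

Uncommenting the describe_user(&session_id) line turns the mixup into a compile error instead of a “not found” at runtime.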
Implied bounds and perfect derive
Now that we have our fancy Id<T> type, we are going to want to make it a bit more useful, for instance by deriving some useful traits for it:
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}
That seems reasonable, but if we try to use it with our previous code by doing something like println!("{user1_id:?}"), we get this error:
error[E0277]: `User` doesn’t implement `Debug`
What gives? I mean, this is true, we didn’t implement Debug for our User struct (and in this case maybe we should), but why should we? We want Id to implement Debug, not T. On top of that, PhantomData<T> already implements Debug for any T!
This is due to the way the #[derive()] macro works. Currently, if you derive some traits on a struct that has generic parameters, the generated impls require every generic parameter to implement those traits as well, whether or not the fields actually need it.
In a lot of cases, it makes sense. For instance, if you have:
#[derive(Clone)]
struct Foo<T> {
    inner: T,
}
It is clear in this case that to be able to implement Clone for Foo, we need T to implement Clone as well. But there are many cases where the bound is unnecessary. For our Id<T> above, we don’t really care if T implements anything. Cloning an Id for instance only requires cloning the underlying String. Another common case you might have run into is deriving Clone on a struct that contains an Arc<T>.
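The Arc<T> case can be sketched in a few lines. The Shared and NotClone types below are made up for the example; the point is only that a hand-written impl needs no bound on T, while the derived one would add T: Clone.

```rust
use std::sync::Arc;

// A type that deliberately does not implement `Clone`.
struct NotClone(u32);

// `#[derive(Clone)]` here would generate `impl<T: Clone> Clone for Shared<T>`,
// so `Shared<NotClone>` would not be `Clone`.
struct Shared<T> {
    inner: Arc<T>,
}

// Written by hand, the impl needs no bound on `T`: cloning an `Arc`
// only increments a reference count.
impl<T> Clone for Shared<T> {
    fn clone(&self) -> Self {
        Self { inner: Arc::clone(&self.inner) }
    }
}

fn main() {
    let a = Shared { inner: Arc::new(NotClone(7)) };
    let b = a.clone();

    // Both handles point at the same, never-cloned value.
    assert_eq!(b.inner.0, 7);
    assert_eq!(Arc::strong_count(&a.inner), 2);
}
```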
This is a pretty annoying limitation of the way #[derive()] works, and hopefully this will be fixed in a future version of Rust. To learn more about this issue and potential future solutions, I recommend reading this blog post by Niko Matsakis.
So… what can we do in the meantime? Well, the only thing we can do at the moment is to manually implement all those traits, like so:
impl<T> Clone for Id<T> {
    fn clone(&self) -> Self {
        Self {
            id: self.id.clone(),
            _phantom: self._phantom.clone(),
        }
    }
}
Not ideal, but at least this is code we have to write only once for all our IDs.
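The other traits follow the same pattern. As a rough sketch, here is what bound-free Debug, PartialEq, Eq and Hash impls might look like. The Debug output format is an arbitrary choice made for this example, not a description of our internal implementation.

```rust
use std::collections::HashSet;
use std::fmt;
use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}

impl<T> Id<T> {
    pub fn new(id: impl Into<String>) -> Self {
        Self { id: id.into(), _phantom: PhantomData }
    }
}

impl<T> Clone for Id<T> {
    fn clone(&self) -> Self {
        Self { id: self.id.clone(), _phantom: PhantomData }
    }
}

// Only the inner string is formatted; `T` needs no `Debug` impl.
impl<T> fmt::Debug for Id<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_tuple("Id").field(&self.id).finish()
    }
}

// Two IDs of the same entity type are equal when their strings are equal.
impl<T> PartialEq for Id<T> {
    fn eq(&self, other: &Self) -> bool {
        self.id == other.id
    }
}

impl<T> Eq for Id<T> {}

// Hash only the string, consistent with `PartialEq` above.
impl<T> Hash for Id<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.id.hash(state);
    }
}

// No derives on purpose: `Id<User>` must not depend on them.
pub struct User;

fn main() {
    let a = Id::<User>::new("123");
    let b = a.clone();
    assert_eq!(a, b);
    println!("{a:?}"); // prints: Id("123")

    let mut seen = HashSet::new();
    assert!(seen.insert(a));
    assert!(!seen.insert(b)); // same ID, already present
}
```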
Our internal implementation based on the above pattern has a few extra features (like prepending a prefix to the string representation), as well as supporting both UUIDv4 and ULID IDs. It is unfortunately not open-source (yet?), but turned out to be very similar to what is described in this blog post, so I recommend you check it out as well as the kind crate it describes.
Other use cases
The above example is far from the only use case for PhantomData. Another popular one is implementing the typestate pattern. It is a way to use a generic parameter (and associated PhantomData field) with a number of marker types representing the states a particular struct can be in. In this manner, you can control which methods are available in which state, and how to transition from one state to another. I won’t go into more detail about this, but you can read this blog post for instance. A good application of this pattern is to implement “smart” builders. See for example the (awesome) bon crate for such an application.
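To give a flavour of the typestate pattern, here is a minimal sketch with an invented Door type and two marker states. It is not taken from bon or from any particular crate.

```rust
use std::marker::PhantomData;

// Marker types for the two states; they are never instantiated.
pub struct Closed;
pub struct Open;

pub struct Door<State> {
    name: String,
    _state: PhantomData<State>,
}

// Methods that exist only while the door is closed.
impl Door<Closed> {
    pub fn new(name: impl Into<String>) -> Self {
        Door { name: name.into(), _state: PhantomData }
    }

    // Consumes the closed door and returns an open one.
    pub fn open(self) -> Door<Open> {
        Door { name: self.name, _state: PhantomData }
    }
}

// Methods that exist only while the door is open.
impl Door<Open> {
    pub fn walk_through(&self) -> String {
        format!("walked through the {}", self.name)
    }

    pub fn close(self) -> Door<Closed> {
        Door { name: self.name, _state: PhantomData }
    }
}

fn main() {
    let door = Door::<Closed>::new("front door");
    // door.walk_through(); // error[E0599]: no such method on `Door<Closed>`
    let door = door.open();
    println!("{}", door.walk_through());
    let _door: Door<Closed> = door.close();
}
```

Because each transition takes self by value, the old state cannot be used again after the call, so invalid sequences are rejected by the compiler rather than checked at runtime.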
PhantomData is also used when dealing with FFI, as mentioned in its rustdoc. There are many examples of this in the wild, but here’s a random one from the git2 crate.
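A common shape in that kind of code is a raw pointer paired with a PhantomData that carries a lifetime. The sketch below is only an illustration: RawBytes is an invented type, and no foreign function is actually called. The pointer merely stands in for one that a C API might hand back.

```rust
use std::marker::PhantomData;

/// A borrowed view over bytes, stored as a raw pointer plus a length.
pub struct RawBytes<'a> {
    ptr: *const u8,
    len: usize,
    // Without this field `'a` would be unused and the struct would not compile.
    // With it, the view is tied to the lifetime of the buffer it points into.
    _borrow: PhantomData<&'a [u8]>,
}

impl<'a> RawBytes<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { ptr: data.as_ptr(), len: data.len(), _borrow: PhantomData }
    }

    pub fn as_slice(&self) -> &'a [u8] {
        // SAFETY: `ptr` and `len` come from a valid `&'a [u8]`, and the
        // lifetime on the struct keeps that borrow alive for `'a`.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

fn main() {
    let buffer = vec![1u8, 2, 3];
    let view = RawBytes::new(&buffer);
    // drop(buffer); // rejected: `buffer` is still borrowed by `view`, used below
    assert_eq!(view.as_slice(), &buffer[..]);
    println!("{} bytes", view.as_slice().len());
}
```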
Conclusion
I hope this blog post helped demystify PhantomData and showed why, despite looking a bit strange the first time you encounter it, it is useful.