ForgeStream

Phantom Types and Phantom Data 👻


The first time I encountered PhantomData in Rust, I remember being thoroughly confused… What the heck was the point of it? Why the strange name?

If that’s the case for you too, this blog post will hopefully shed some light on what it is and show some practical applications.

What is it?

Looking at the documentation, it says:

Zero-sized type used to mark things that “act like” they own a T.

Adding a PhantomData<T> field to your type tells the compiler that your type acts as though it stores a value of type T, even though it doesn’t really. This information is used when computing certain safety properties.

Hmm. Ok. So it is some kind of struct that is generic over a type T, but it only pretends to hold a value of type T. And it also says it is zero-sized, which makes sense if it doesn’t actually store anything. Looking at the available methods, there isn’t much. In particular, there is no constructor method: you get a value either by writing the unit struct PhantomData directly, or via the Default trait.

use std::marker::PhantomData;

let boo = PhantomData::<u8>::default();
// Now what??

There’s not much else we can do with it, other than via the few traits it implements (Clone, Eq, etc). We can print its debug representation for instance, with println!("{boo:?}"); but all we get is PhantomData<u8>. Again, not particularly useful.

And if you don’t quite believe that it’s zero-sized, we can verify that easily:


println!("boo takes {} bytes", std::mem::size_of_val(&boo));
// Prints: "boo takes 0 bytes"
println!(
    "Type {} takes {} bytes",
    std::any::type_name::<PhantomData<u8>>(),
    std::mem::size_of::<PhantomData<u8>>()
);
// Prints something like: "Type core::marker::PhantomData<u8> takes 0 bytes"

A practical use case: typed IDs

At IDVerse, we store a lot of different entities in DynamoDB, and as is common, most of them have some kind of unique ID, like a UUID, to identify them. These IDs are used in various APIs (e.g. “fetch the user with this ID”), or are used in one entity to refer to another entity (e.g. “This session belongs to this user”). It therefore becomes very important to make sure these IDs do not accidentally get mixed up. For instance, trying to fetch a User from the database but passing in a “session ID” instead would be hard to detect at runtime. It would just say the user is “not found”. It would be ideal if we could somehow prevent this at compile time.

But what’s a way to constrain values at compile time? Types! We could create different newtype wrappers for every single entity, like so:

pub struct UserId(UUID);
pub struct SessionId(UUID);
// and so on...

But that would quickly become unwieldy and lead to a lot of code duplication, especially as each ID might want to derive a bunch of traits (Debug, Copy, Clone and so on), have a common way to be serialized/deserialized, methods to convert to/from UUID, methods to parse a string into a valid ID, etc. All that code would be duplicated. You could use macros to alleviate some of the pain, but still, there has to be a better way.
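For reference, the macro route might look roughly like the sketch below. The define_id! macro and its methods are made-up names for illustration, and the IDs wrap a String instead of a UUID so the snippet has no dependencies.

```rust
// Hypothetical helper macro: stamps out one newtype per entity,
// together with the derives and methods every ID needs.
macro_rules! define_id {
    ($name:ident) => {
        #[derive(Debug, Clone, PartialEq, Eq, Hash)]
        pub struct $name(String);

        impl $name {
            pub fn new(id: impl Into<String>) -> Self {
                Self(id.into())
            }

            pub fn as_str(&self) -> &str {
                &self.0
            }
        }
    };
}

// One invocation per entity; UserId and SessionId are distinct types.
define_id!(UserId);
define_id!(SessionId);
```

It works, but every new method or trait impl has to be written inside the macro body, which is harder to read and to maintain than ordinary code.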

There is a way to write code once in a generic way: using generics. Let’s see, if we could have something like this:

pub struct Id<T>(String);

Then we might be able to use it like so:

pub struct User {
  id: Id<User>,
  name: String,
  // ...
}

pub struct Session {
  id: Id<Session>,
  user_id: Id<User>,
  // ...
}

But does it work? If we try, we get this error:

error[E0392]: type parameter T is never used

Damn. This is true, we don’t really use it, as in we don’t store any T (an Id<User> doesn’t store a User). But if we read the rest of the error message, we see this:

= help: consider removing T, referring to it in a field, or using a marker such as PhantomData
= help: if you intended T to be a const parameter, use const T: /* Type */ instead

Ah ha! It mentions our new friend PhantomData! Now the pieces are starting to fall together: we can use PhantomData to “pretend” that we use our type parameter T even though we aren’t really. Let’s give it a try:

pub struct Id<T> {
  id: String,
  _phantom: PhantomData<T>,
}

impl<T> Id<T> {
  pub fn new(id: impl Into<String>) -> Self {
    Self {
      id: id.into(),
      _phantom: Default::default(),
    }
  }
}

Great, it now compiles! Let’s try to use that:

// The type parameter belongs to Id, not to new(), so the turbofish goes on Id.
let user1_id = Id::<User>::new("123");
let user2_id = Id::<User>::new("456");
let session_id = Id::<Session>::new("abc");

let user = User {
  id: user1_id,
  name: "Jane Doe".to_string(),
};

let session = Session {
  id: session_id,
  user_id: user2_id,
};

// Note: the following now doesn't compile!
// let user = User {
//   id: session_id,
//   name: "John Doe".to_string(),
// };

This is great: we now have a way to tell IDs apart by tying them to the entities they identify, while having that distinction exist only at compile time. This is literally a zero-cost abstraction: an Id<User> is no bigger than the String it wraps.
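To make both claims a bit more tangible, here is a small self-contained sketch. The unit structs User and Session are stand-ins for the real entities, and user_key, its "USER#" prefix, and id_has_no_overhead are hypothetical names invented for this example.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}

impl<T> Id<T> {
    pub fn new(id: impl Into<String>) -> Self {
        Self { id: id.into(), _phantom: PhantomData }
    }

    pub fn as_str(&self) -> &str {
        &self.id
    }
}

// Stand-ins for the real entities (assumption: fields omitted for brevity).
pub struct User;
pub struct Session;

// Hypothetical lookup helper: the signature alone rejects an Id<Session>.
pub fn user_key(id: &Id<User>) -> String {
    format!("USER#{}", id.as_str())
}

// The phantom field adds no bytes on top of the String.
pub fn id_has_no_overhead() -> bool {
    size_of::<Id<User>>() == size_of::<String>()
        && size_of::<Id<Session>>() == size_of::<String>()
}
```

Calling user_key(&Id::<Session>::new("abc")) is a type error, so the mix-up described earlier never makes it to the database.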

Implied bounds and perfect derive

Now that we have our fancy Id<T> type, we are going to want to make it a bit more useful, for instance by deriving some useful traits for it:

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Id<T> {
  id: String,
  _phantom: PhantomData<T>,
}

That seems reasonable, but if we try to use it with our previous code by doing something like println!("{user_id:?}");, we get this error:

error[E0277]: User doesn’t implement Debug

What gives? I mean, this is true, we didn’t implement Debug for our User struct (and in this case maybe we should), but why should we have to? We want Id to implement Debug, not T. On top of that, PhantomData<T> already implements Debug for any T!

This is due to the way the #[derive()] macro works. Currently, if you derive a trait on a struct that has generic type parameters, the generated impl requires every one of those parameters to implement that trait too, whether or not any field actually needs it.

In a lot of cases, it makes sense. For instance, if you have:

#[derive(Clone)]
struct Foo<T> {
  inner: T,
}

It is clear in this case that to be able to implement Clone for Foo, we need T to implement Clone as well. But there are many situations where the bound is unnecessary. For our Id<T> above, we don’t really care if T implements anything. Cloning an Id, for instance, only requires cloning the underlying String. Another common case you might have run into is deriving Clone on a struct that contains an Arc<T>.
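To see where the extra requirement comes from, here is approximately what #[derive(Clone)] generates for Id<T>, written out by hand. This is a sketch rather than the exact macro output, and CloneableEntity, PlainEntity and clone_id are hypothetical names used only here.

```rust
use std::marker::PhantomData;

pub struct Id<T> {
    pub id: String,
    pub _phantom: PhantomData<T>,
}

// Roughly the impl that `#[derive(Clone)]` writes for us. Note the
// `T: Clone` bound: it is added because `T` is a type parameter of the
// struct, even though no value of type `T` is ever cloned.
impl<T: Clone> Clone for Id<T> {
    fn clone(&self) -> Self {
        Id {
            id: self.id.clone(),
            _phantom: self._phantom.clone(),
        }
    }
}

#[derive(Clone)]
pub struct CloneableEntity;

// No Clone impl here, so `Id<PlainEntity>` is not Clone under the impl above.
pub struct PlainEntity;

// Compiles only because CloneableEntity is Clone; swapping in
// `Id<PlainEntity>` would be rejected by the compiler.
pub fn clone_id(id: &Id<CloneableEntity>) -> Id<CloneableEntity> {
    id.clone()
}
```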

This is a pretty annoying limitation of the way #[derive()] works, and hopefully this will be fixed in a future version of Rust. To learn more about this issue and potential future solutions, I recommend reading this blog post by Niko Matsakis.

So… what can we do in the meantime? Well, the only thing we can do at the moment is to manually implement all those traits, like so:

impl<T> Clone for Id<T> {
  fn clone(&self) -> Self {
    Self {
      id: self.id.clone(),
      _phantom: self._phantom.clone(),
    }
  }
}

Not ideal, but at least this is code we have to write only once for all our IDs.
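The other traits follow the same recipe: write the impl by hand and leave T unbounded. Below is a minimal sketch for Debug, PartialEq and Eq; the Debug output format is my own choice for the example, and User is again a stand-in with no derives at all.

```rust
use std::fmt;
use std::marker::PhantomData;

pub struct Id<T> {
    id: String,
    _phantom: PhantomData<T>,
}

impl<T> Id<T> {
    pub fn new(id: impl Into<String>) -> Self {
        Self { id: id.into(), _phantom: PhantomData }
    }
}

// No `T: Debug` bound: only the inner string is printed.
impl<T> fmt::Debug for Id<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_tuple("Id").field(&self.id).finish()
    }
}

// No `T: PartialEq` bound: two IDs are equal when their strings are.
impl<T> PartialEq for Id<T> {
    fn eq(&self, other: &Self) -> bool {
        self.id == other.id
    }
}

impl<T> Eq for Id<T> {}

// Stand-in entity that implements none of the traits above.
pub struct User;
```

Comparing an Id<User> with an Id<Session> still fails to compile, because PartialEq is only implemented between IDs of the same T.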

Our internal implementation based on the above pattern has a few extra features (like prepending a prefix to the string representation), as well as supporting both UUIDv4 and ULID IDs. It is unfortunately not open-source (yet?), but it turned out to be very similar to what is described in this blog post, so I recommend you check it out as well as the kind crate it describes.

Other use cases

The above example is far from the only use case for PhantomData. Another popular one is implementing the typestate pattern: you use a generic parameter (and the associated PhantomData field) together with a number of marker types, each representing a state a particular struct can be in. In this manner, you can control which methods are available in which state, and how to transition from one state to another. I won’t be going into more details about this, but you can read this blog post for instance. A good application of this pattern is to implement “smart” builders. See for example the (awesome) bon crate for such an application.
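As a quick taste, a typestate can be sketched in a few lines. Post, Draft and Published are hypothetical names picked for this example and are not taken from any of the crates mentioned.

```rust
use std::marker::PhantomData;

// Marker types: never instantiated, they only label a state.
pub struct Draft;
pub struct Published;

pub struct Post<State> {
    body: String,
    _state: PhantomData<State>,
}

// Methods that only exist while the post is a draft.
impl Post<Draft> {
    pub fn new() -> Self {
        Post { body: String::new(), _state: PhantomData }
    }

    pub fn write(mut self, text: &str) -> Self {
        self.body.push_str(text);
        self
    }

    // Consumes the draft and returns the same data in a new state.
    pub fn publish(self) -> Post<Published> {
        Post { body: self.body, _state: PhantomData }
    }
}

// Reading the body is only possible once published.
impl Post<Published> {
    pub fn body(&self) -> &str {
        &self.body
    }
}
```

Calling body() on a Post<Draft>, or write() on a Post<Published>, is a compile error, and the state itself takes up no space at runtime.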

PhantomData is also used when dealing with FFI, as mentioned in its rustdoc. There are many examples of this in the wild, but here’s a random one from the git2 crate.
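One pattern that comes up around raw pointers is using PhantomData to attach a lifetime to a pointer that would otherwise carry none. The sketch below is my own illustration, not code from git2: RawView is a hypothetical type, and the pointer comes from a Rust slice rather than a real C API so that the snippet stays self-contained.

```rust
use std::marker::PhantomData;

// Hypothetical view over a buffer described by a raw pointer and a length,
// the way a C API might hand it to us.
pub struct RawView<'a, T> {
    ptr: *const T,
    len: usize,
    // Tells the compiler this view borrows a `T` for `'a`, so the view
    // cannot outlive the buffer it points into.
    _borrow: PhantomData<&'a T>,
}

impl<'a, T> RawView<'a, T> {
    pub fn from_slice(slice: &'a [T]) -> Self {
        RawView {
            ptr: slice.as_ptr(),
            len: slice.len(),
            _borrow: PhantomData,
        }
    }

    pub fn get(&self, index: usize) -> Option<&'a T> {
        if index < self.len {
            // SAFETY: `ptr` came from a slice of `len` elements that is
            // borrowed for `'a`, and `index` is in bounds.
            Some(unsafe { &*self.ptr.add(index) })
        } else {
            None
        }
    }
}
```

Without the PhantomData field, the lifetime 'a would be unused and the struct would not compile, for the same reason our first Id<T> attempt failed.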

Conclusion

I hope this blog post helped demystify PhantomData and showed why, despite looking a bit strange the first time you encounter it, it is useful.