16 February 2022

The Ultimate Guide on Client-Generated IDs in JPA Entities

In the previous article, we discussed server-generated IDs for JPA entities. All the ID generation strategies described there rest on one fundamental principle: a single point, the database, is responsible for generating IDs. This principle can become a challenge. We depend on a particular storage system, so switching to another one (e.g., from PostgreSQL to Cassandra) might be a problem. Also, this approach does not work for distributed applications where several DB instances are deployed in several data centers in several time zones.

Those are the cases where client-based (or, rather, non-DB-based) ID generation comes into play. This strategy gives us more flexibility in terms of the ID generation algorithm and format, and it allows batch operations by its nature: ID values are known before they are stored in a DB. In this article, we will discuss two fundamental topics for the client-generated ID strategy: how to generate a unique ID value and when to assign it.

Generation Algorithms

When it comes to ID generation in distributed applications, we need to decide which algorithm to use to guarantee uniqueness and sound generation performance. Let’s have a look at some options here.

Random IDs and Timestamps – bad idea

This is a straightforward and naïve implementation of decentralized ID generation. Let every application instance generate an ID using a random number generator, and that’s it! To make it better, we might think of using a composite structure: let’s prepend a timestamp (in milliseconds) to the random number to make our IDs sortable. For example, to create a 64-bit ID, we can use 32 bits of the timestamp as the first half and 32 bits of the random number as the second half.

The problem with this approach is that it does not guarantee uniqueness. We can only hope that our generated IDs won’t clash. For big, distributed data-intensive systems, this approach is not acceptable. We cannot rely on probability laws unless we’re a casino.
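
A minimal sketch of the composite scheme described above can make the weakness easier to see. The class and method names are hypothetical, and taking the low 32 bits of the millisecond timestamp is an assumption, since the text does not say which bits to keep.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper illustrating the naive "timestamp + random" ID layout.
final class NaiveIdSketch {

    // Upper 32 bits: low 32 bits of the timestamp (assumption).
    // Lower 32 bits: the random number, treated as unsigned.
    static long compose(long timestampMillis, int random) {
        return (timestampMillis << 32) | (random & 0xFFFFFFFFL);
    }

    // Generates an ID from the current time and a fresh random number.
    static long nextId() {
        return compose(System.currentTimeMillis(),
                ThreadLocalRandom.current().nextInt());
    }
}
```

Nothing in `compose` coordinates application instances: two of them that draw the same random number within the same millisecond produce the same ID, which is exactly the clash we can only hope to avoid.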

Conclusion: we should not reinvent the wheel for globally unique ID generation algorithms. It will take a lot of time, effort, and a couple of PhDs. Some existing solutions solve this problem and can be utilized in our applications.

UUIDs: Globally Unique

UUID generation is a well-known and widely used approach to ID generation in distributed applications. This datatype is supported by standard libraries in almost all programming languages. We can generate an ID value right in the application code, and this value will be globally unique (by the design of the generation algorithm). UUIDs have some advantages over “traditional” numeric IDs:

  • Uniqueness does not depend on a data table. We can move data with primary keys of UUID type between tables or databases, and there will be no problems.
  • Data hiding. Let’s assume that we develop a web application, and a user sees the following fragment in their browser’s address bar after login: userId=100. It suggests that users with IDs 99 or 101 might exist, and knowing this might lead to a security breach.

UUIDs are not sortable; however, sorting data by a surrogate ID value is usually not required, and we should use a business key for that. But if we absolutely need sorting, we can use ULID, a 128-bit UUID-compatible format whose name stands for “universally unique lexicographically sortable identifier”.

The performance of the random UUID generator in Java is also sufficient for most cases. On my computer (Apple M1 Max), it took about 500 ns per operation, which gives us about two million UUIDs per second.
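
For reference, the random UUID generator discussed here ships with the Java standard library. The wrapper class below is hypothetical and only shows the call.

```java
import java.util.UUID;

// Hypothetical wrapper around the standard random (version 4) UUID generator.
final class UuidSketch {

    static UUID newId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        UUID id = newId();
        // The value differs on every run, so no fixed output is shown here.
        System.out.println(id + " (version " + id.version() + ")");
    }
}
```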

UUIDs: Drawbacks

UUID is almost the perfect choice for the ID value, but a few things might prevent you from using it.

First, UUID values consume more storage space than 64-bit long IDs: twice the space, to be exact. An extra 64 bits might not look like a significant addition, but it adds up when we talk about billions of records. We should also remember foreign keys, where ID values are duplicated, so ID storage consumption doubles there as well.

The second issue is performance. Two factors affect it:

  1. UUIDs do not increase monotonically
  2. Some RDBMSes store tables or indexes as B-trees

As a result, when we insert a new record into a table, the RDBMS writes its ID value into a random B-tree node of an index or a table structure. Since most of the index or table data is stored on disk, the probability of random disk reads increases, which further slows down the data storage process. You can find more on this topic in this article.

And finally, some databases simply do not support UUID as a datatype, so we’ll have to store the ID value as a varchar or a byte array, which may not be great for query performance and will require some extra encoding on the ORM side.
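
As an illustration of the extra encoding mentioned above, here is a sketch of converting a UUID to a 16-byte array and back. The class name is hypothetical, and how the conversion is wired into a particular ORM is left out.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for storing a UUID in a 16-byte binary column.
final class UuidBytes {

    // Writes the two 64-bit halves of the UUID in big-endian order.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Reads the two halves back in the same order.
    static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }
}
```

A byte array takes 16 bytes, while the canonical text form stored as varchar takes 36 characters, so the binary form is the more compact of the two fallbacks.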

Conclusion: UUID is a good choice for surrogate IDs if we don’t want to or cannot use a database for ID generation. It is a well-known, reliable way of getting unique values. On the other hand, using UUIDs might cause performance issues in some databases. In addition, this datatype needs more storage space, which may be an issue for large datasets.

Dedicated ID Generation Servers

When we start developing a distributed application, we might ask ourselves: why don’t we create a special facility for ID generation, independent of a database? It is a valid point. Twitter Snowflake is a good (though archived) example of such a facility. We can set up multiple dedicated ID generation servers in our network and fetch IDs from them. The algorithm used in Snowflake guarantees global ID uniqueness, and the IDs are “roughly time ordered”. Performance is also good: minimum 10k ids per second per process, response rate 2ms (plus network latency).
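
To show why such IDs come out unique and roughly time ordered, here is a simplified in-process sketch of the bit layout commonly described for Snowflake IDs: a 41-bit millisecond timestamp, a 10-bit worker ID, and a 12-bit per-millisecond sequence. This is not Twitter’s code; the class name and the custom epoch are assumptions, and a real deployment would serve these values over the network.

```java
// Simplified, hypothetical Snowflake-style generator (not Twitter's implementation).
final class SnowflakeLikeGenerator {

    private static final long CUSTOM_EPOCH = 1_600_000_000_000L; // arbitrary epoch (assumption)
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeLikeGenerator(long workerId) {
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId must fit into 10 bits");
        }
        this.workerId = workerId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            throw new IllegalStateException("Clock moved backwards");
        }
        if (now == lastTimestamp) {
            // Same millisecond: advance the sequence.
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted: spin until the next millisecond.
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - CUSTOM_EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

Uniqueness comes from the worker ID (no two generators share one) combined with the per-millisecond sequence, while the leading timestamp bits give the rough time ordering.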

On the other hand, we need to set up and support additional servers. We also need to make a network call to fetch an ID, and to do this, write some extra code in our application. For Hibernate, it will be a custom ID generation strategy. As we all know, any code that we write once, we need to support forever or delete, so in most cases adding custom ID generation strategy code means additional work.

Conclusion: we might need to set up one or more dedicated ID generation servers if we need an independent high-performance ID generation facility. But to use a separate ID generation server, we should be ready to invest some additional effort in supporting dedicated servers (containers) in our infrastructure and in the application code for fetching IDs.

When to Assign an ID Value?

This question, though simple, might affect your application code when you use client-based ID generation. When deciding on this topic, we need to consider:

  • The JPA entity comparison algorithm
  • Unit testing code complexity

For ID value generation and assignment, we have the following options:

  • Initialize the ID field on entity creation.
  • Use Hibernate’s generators.
  • Implement our own factory for creating new entities.

We will discuss these options using the UUID datatype as an example, but the principles apply to all ID generation algorithms and datatypes discussed above.

Field Initialization

The most straightforward way to generate the value is to use the field initializer directly:

@Id 
@Column(name = "id", nullable = false) 
private UUID id = UUID.randomUUID(); 

This guarantees a non-null ID value and allows us to define equals() and hashCode() methods for entities easily – we can compare IDs and calculate their hash codes.

Are there any problems with this approach?

First, when we define ID generation like this, it becomes hard to check whether an entity is newly created or already persisted. It is not a problem for Hibernate. If we invoke the EntityManager#persist() method and pass an entity with an existing ID, the insert will fail with a unique constraint violation if such a PK already exists. If we invoke EntityManager#merge(), Hibernate will perform a SELECT from the database and, based on its results, will set the entity state. But figuring out an entity state becomes a bit harder for developers, who might check whether the ID is null and, finding it non-null, assume that the entity is not new; we can find such code samples on the Internet. This assumption may cause unexpected application errors for detached entities, such as trying to store references to non-existing entities. So, we need to agree on the algorithm for figuring out an entity state. For example, we can use the @Version field if it is present.

The second problem is query by example (QBE). We should never forget that we have a non-null globally unique ID in every entity. Therefore, we must always remove the ID manually when creating a new entity for the query.

The third problem is unit tests. In our mocks, it will be hard to guarantee consistent test data; each time, an entity’s ID will be different. To override it, we should add a setter method, but that makes the @Id field mutable, so we’ll need to somehow prevent ID changes in the main codebase.

Finally, every time we fetch an entity, we generate a value for the new entity instance, and then the ORM overwrites it with the ID value selected from the database. In this case, ID value generation is just a waste of time and resources.

Conclusion: ID initialization using a field initializer is simple, but we need to take care of some additional tasks:

  1. Agree on the entity state check algorithm for non-null IDs
  2. Ensure that we set null for ID when using the QBE feature
  3. Decide how to provide consistent data for our unit tests.
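
The first task can be sketched with the @Version-based check mentioned earlier. The class below is a plain Java stand-in with hypothetical names; the JPA annotations appear only as comments to keep the sketch free of dependencies, and it assumes the version uses a wrapper type that stays null until the ORM assigns a value on the first save.

```java
import java.util.UUID;

// Hypothetical stand-in for a JPA entity with an eagerly initialized ID.
class VersionedPet {

    // @Id in a real entity
    private UUID id = UUID.randomUUID();

    // @Version in a real entity; remains null until the entity is first saved (assumption)
    private Integer version;

    UUID getId() {
        return id;
    }

    Integer getVersion() {
        return version;
    }

    // Present only so the sketch can simulate what the ORM does on save.
    void setVersion(Integer version) {
        this.version = version;
    }

    // The ID is never null here, so the version decides whether the entity is new.
    boolean isNew() {
        return version == null;
    }
}
```

The check works only if every entity that uses this ID strategy also declares a version field, which is part of what the team has to agree on.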

Hibernate Generator

Hibernate uses generators to assign IDs to JPA entities. We talked about sequence generators in the previous article, and Hibernate provides us with more than that. For example, it handles UUID primary keys in a special way. If we define the ID field as in the code below, Hibernate will automatically use its UUIDGenerator to generate a UUID value and assign it to the field.

@Id 
@Column(name = "id", nullable = false) 
@GeneratedValue 
private UUID id; 

There are more standard generators in Hibernate; we can use them by specifying the corresponding class in the @GenericGenerator annotation. You can find more on generators in the documentation.

If we want to generate an ID value in a way not supported by Hibernate, we need to develop a custom ID generator. To do this, we implement the IdentifierGenerator interface or one of its subinterfaces and specify this implementation in the @GenericGenerator annotation parameter. The generator code may look like this:

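import java.io.Serializable;
import org.hibernate.HibernateException;
import org.hibernate.engine.spi.SharedSessionContractImplementor;
import org.hibernate.id.IdentifierGenerator;

public class CustomIdGenerator implements IdentifierGenerator {

   @Override
   public Serializable generate(
              SharedSessionContractImplementor session,
              Object object)
              throws HibernateException {
      // Generate and return the ID value here.
      // The placeholder below only keeps the class compilable.
      throw new UnsupportedOperationException("ID generation is not implemented yet");
   }
}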

And in a JPA entity, we need to declare the field in this way to use the generator defined above:

@Id 
@GenericGenerator(name = "custom_gen", 
   strategy = "org.sample.id.CustomIdGenerator") 
@GeneratedValue(generator = "custom_gen") 
private Integer id; 

When we use Hibernate’s generators, we won’t have problems with entity state definition; we rely on the ORM. (In fact, Hibernate’s way is a bit trickier than just an ID value check: it also involves the version field, the L1 cache, the Persistable interface, etc.) We won’t have any problems with unit tests either. For a detached entity, we can safely assume that an entity with a null ID has not been saved yet.

But we need to define proper equals() and hashCode() methods. As we can see, ID is mutable; other entity fields are mutable too. And mutable fields cause “unstable” equals() and hashCode() methods. You can find an example of a “disappearing” entity with mutable fields in our blog post about Lombok usage. We will discuss equals() and hashCode() implementations later in this article; this topic is relevant to the case described in the next section.

Conclusion: using Hibernate generator liberates us from guessing an entity’s state. Also, Hibernate takes the burden of assigning the value before inserting it. But for this case, we need to implement equals() and hashCode() appropriately for newly created entities with null IDs.

Custom Factory

When we need complete control over the JPA entity creation process, we might consider creating a special factory for entities. This factory might provide an API to assign a specific ID on entity creation, set a creation date for audit purposes, specify a version, etc. In Java code, it might look like this:

@Autowired
private JpaEntityFactory jpaEntityFactory;

public Pet createNewPet(String name) {
   return jpaEntityFactory.builder(Pet.class)
      .withId(100)
      .withVersion(0)
      .withName(name)
      .build();
}

Such a factory makes JPA entity creation consistent and manageable: there is only one API for doing this, and we are the only ones responsible for it. Hence, we won’t have problems generating pre-defined entities in mocks for unit tests.
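
The testing benefit can be shown with a much smaller factory than the builder API above. Everything in this sketch is hypothetical: the entity is a plain class without JPA annotations, and the factory receives the ID source as a constructor argument so that tests can swap it.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical plain stand-in for a JPA entity (annotations omitted).
class Owner {
    private final UUID id;
    private final String name;

    Owner(UUID id, String name) {
        this.id = id;
        this.name = name;
    }

    UUID getId() {
        return id;
    }

    String getName() {
        return name;
    }
}

// Hypothetical factory: the only place where IDs are assigned.
final class OwnerFactory {
    private final Supplier<UUID> idSupplier;

    OwnerFactory(Supplier<UUID> idSupplier) {
        this.idSupplier = idSupplier;
    }

    Owner create(String name) {
        return new Owner(idSupplier.get(), name);
    }
}
```

Production code could construct the factory as `new OwnerFactory(UUID::randomUUID)`, while a unit test could pass a supplier that returns a fixed value and get the same entity ID on every run.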

But there is also a flaw here: we must make all developers use our factory for entity creation, and this task might be a bit challenging. We’ll need to set up code checks in our CI pipelines and probably even fail a build if we detect an “illegal” entity creation. To help developers, we should introduce custom IDE checks that detect such cases during development.

Conclusion: the custom factory is the most flexible way to create and initialize JPA entities, but it requires some effort to support. The amount of effort depends on how complex the factory’s functionality is.

Equals() and hashCode() Implementation

The implementation of the equals() and hashCode() methods in JPA entities usually sparks heated debate. There are various articles on this topic, for example, from Baeldung, Vlad Mihalcea, or Thorben Janssen.

We can use @Id fields or @NaturalId to compare entities, but the problem remains - entities are mutable by their nature. We talked about various approaches for ID assignment above, and we can see that even for the “assign ID in the field initializer” we still have to make the ID field mutable.

In the code below, we use a single ID field as the entity identifier, but we can replace it with a natural ID field (or fields); the approach will be the same. JPA Buddy provides code generation for both methods. Let’s have a look at our solution. First, the equals() method for a Pet entity.

@Override 
public boolean equals(Object o) { 
   if (this == o) return true; 
   if (o == null || Hibernate.getClass(this) != Hibernate.getClass(o)) return false; 
   Pet pet = (Pet) o; 
   return getId() != null && Objects.equals(getId(), pet.getId()); 
} 

As you can see, we assume that two entities without IDs are not equal unless they are the same object. Period. It satisfies all requirements for the ‘equals()’ method, is easy to follow in the code, and does not cause anomalies.

The hashCode() method implementation is even simpler. We return a constant for all entities of the same class. It does not break the “equals and hashCode convention” and works for new and stored entities.

@Override 
public int hashCode() { 
   return getClass().hashCode(); 
} 
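
The effect of this pair of methods on hash-based collections can be checked with a dependency-free sketch. The class below is a hypothetical stand-in for the entity: it uses plain getClass() where the article’s code calls Hibernate.getClass(), and the setter simulates an ID being assigned on save.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;
import java.util.UUID;

// Hypothetical stand-in for a JPA entity whose ID is assigned after creation.
class PetEntity {
    private UUID id;

    UUID getId() {
        return id;
    }

    void setId(UUID id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        PetEntity other = (PetEntity) o;
        // Entities without an ID are equal only to themselves.
        return getId() != null && Objects.equals(getId(), other.getId());
    }

    @Override
    public int hashCode() {
        // Constant per class, so it does not change when the ID is assigned.
        return getClass().hashCode();
    }

    public static void main(String[] args) {
        Set<PetEntity> pets = new HashSet<>();
        PetEntity pet = new PetEntity();
        pets.add(pet);                  // added while the ID is still null
        pet.setId(UUID.randomUUID());   // simulates ID assignment on save
        System.out.println(pets.contains(pet)); // prints "true"
    }
}
```

Because the hash code never changes, the entity stays in the same bucket before and after the ID is assigned, so it does not “disappear” from the set.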

The usual question here is, “What about the terrible performance in HashMap and HashSet?” Here we can quote Vlad Mihalcea: “You should never fetch thousands of entities in a @OneToMany Set because the performance penalty on the database side is multiple orders of magnitude higher than using a single hashed bucket.”

Conclusion

Entity ID generation in the application is the only option for distributed systems with several application and database instances deployed worldwide. We can use either separate ID generation servers or in-app ID generation (usually UUID generators). Both options have their pros and cons, but general advice would be:

  1. In most cases, UUID works fine and provides a good balance between ID length, value generation speed, and DB performance.
  2. If we need to fulfill special requirements regarding ID format (length, datatype, etc.) or value generation performance, then we have to consider specialized ID generation servers.

As for the ID assignment algorithm, Hibernate generators do their job well. Using a standard generator or a custom one simplifies codebase support and debugging of the ID generation process. But we need to remember to implement equals() and hashCode() properly because we have mutable IDs here. As for the other options, we can add the following:

  1. Direct ID field initialization is straightforward to implement. Still, we need to remember corner cases such as JPA entity state definition (new or saved), query by example, and unit testing when we mock a repository. In addition, we waste some resources on ID overwrite on an entity fetch.
  2. An entity generation factory is the most flexible option; we control everything in our code. But we need to make all developers use this API for entity creation, and to do this, we need to enforce specific static code checks across all the teams that work with our codebase.

In the next article in the series, we will discuss composite IDs: why we need them, how to implement and use them, and the pros and cons of different approaches for composite IDs implementation.