4 July 2022

How to store text in PostgreSQL: tips, tricks, and traps

DDL generation based on JPA entities definition is a daily task for many developers. In most cases, we use tools like Hibernate's built-in generator or JPA Buddy plugin. They make the job easier, but there are exceptions. When it comes to storing big chunks of data in the database, things get a bit complicated.

Use case: storing documents

Let’s assume we need to store a document object with its content in the PostgreSQL database. The JPA entity code for this might look like the code below:

  
@Entity   
@Table(name = "document")   
public class Document {   
   @Id   
   @GeneratedValue(strategy = GenerationType.IDENTITY)   
   @Column(name = "id", nullable = false)   
   private Long id;   
  
   @Column(name = "date_created", nullable = false)   
   private LocalDateTime dateCreated;   
  
   @Column(name = "doc_txt")   
   private String docText;   
    
   //Getters and setters omitted for brevity   
}     

The question is: what if we need to store really long document text? In Java, string datatype can hold about 2Gb of text data, but the table column size will be limited to 255 characters by default for the model above. So, what should we change?

Option One: use LOB storage

In relational databases, a particular data type exists to store big amounts of data: LOB (Large OBject). Once we need to store large text in the database, we can start with defining a LOB column. All we need to do is mark the docText attribute with the @Lob annotation.

@Lob   
@Column(name = "doc_txt")   
private String docText;   

Let’s use Hibernate to generate a DDL for the table to map the Document entity. The SQL will be:

create table document ( 
    id int8 generated by default as identity, 
    date_created timestamp not null, 
    doc_txt oid, 
    primary key (id) 
);   

As we can see, the doc_text column datatype is oid. What is it? According to the documentation: PostgreSQL provides two distinct ways to store binary data. Binary data can be stored in a table using the data type BYTEA or by using the Large Object feature, which stores the binary data in a separate table in a special format and refers to that table by storing a value of type OID in your table. In our case, the second option is in effect. This separate table’s name is pg_largeobject, and it stores data split into “pages”, usually 2 kB each, as stated in the docs.

So, Hibernate stores large text in a separate table as binary data. Does it mean we should do an additional join when selecting data or an additional insert when saving it? Let’s enable SQL logging, create the Document entity and save it to the database using Spring Data JPA.

Document doc = new Document();   
doc.setDateCreated(LocalDateTime.of(2020, 1, 1, 10, 10));   
doc.setDocText("This is the doc text");   
Document saved = documentRepository.save(doc);   

Hibernate will show an ordinary SQL in the console:

    insert  
    into 
        document 
        (date_created, doc_txt)  
    values 
        (?, ?) 

Now we can check if the data was correctly stored by executing the following SQL in a console:

select * from document   

The result that we will see should be similar to this:

id date_created doc_txt
1 2020-01-01 10:10:00 76388

We don’t see the document text in this table, just a reference to the object in the large object storage. Let’s check the pg_largeobject table:

select * from pg_largeobject where loid=76338  

Now we can see the document text.

loid pageno data
76388 0 This is the doc text

So, Hibernate saves data into two tables automatically under the hood. Now we can try to fetch the document data using Spring Data JPA:

documentRepository.findById(1L).ifPresent(d -> System.out.println(d.getDocText()));   

We can see the following SQL in the console:

   select   
       document0_.id as id1_0_0_,   
       document0_.date_created as date_cre2_0_0_,   
       document0_.doc_txt as doc_txt3_0_0_    
   from   
       document document0_    
   where   
       document0_.id=?   

And the output should be as expected:

This is the doc text   

Hibernate selects the data from the pg_largeobject table transparently. Let’s try to use JPQL to execute the same query. To do this, we create an additional Spring Data JPA repository method and invoke it:

//repository   
@Query("select d from Document d where d.id = ?1")   
Optional<Document> findByIdIs(Long id); 
... 
//invocation 
        documentRepository.findByIdIs(1L).ifPresent(d -> System.out.println(d.getDocText())); 

The method will fail:

org.springframework.orm.jpa.JpaSystemException: Unable to access lob stream   
…   
Caused by: org.hibernate.HibernateException: Unable to access lob stream   
…   
Caused by: org.postgresql.util.PSQLException: Large Objects may not be used in auto-commit mode.   
…   

Hibernate performs additional database reading to fetch LOB data. In auto-commit mode, this reading is executed in a separate transaction. PostgreSQL driver explicitly prohibits it, as shown in the error message above. To fix this, we need to execute such queries in one transaction or disable auto-commit mode.

Spring Data JPA methods from CrudRepository like findById() and findAll() are executed in one transaction by default. That is why everything worked fine in the first example. When we use Spring Data JPA query methods or JPQL queries, we must use @Transactional explicitly as in the example below.

@Transactional 
@Query("select d from Document d where d.id = ?1")   
Optional<Document> findByIdIs(Long id); 


@Transactional 
List<Document> findByDateCreatedIsBefore(LocalDateTime dateCreated); 

If disabling auto-commit mode seems preferable to using the @Transactional annotation, we should look into the documentation for the app libraries. For example, to do it for the default connection pool implementation (HikariCP) in Spring Boot, we need to set the spring.datasource.hikari.auto-commit property to false.

Storing text in a separate table might cause other issues. Let’s add a repository method to select documents using the LIKE clause for the docText field:

@Transactional   
List<Document> findByDocTextLike(String text);   

This method will generate the following query:

   select   
       document0_.id as id1_0_,   
       document0_.date_created as date_cre2_0_,   
       document0_.doc_txt as doc_txt3_0_    
   from   
       document document0_    
   where   
       document0_.doc_txt like ? escape ?   

...and this query will fail with the following error:

org.springframework.dao.InvalidDataAccessResourceUsageException: could not extract ResultSet; SQL [n/a];    
…   
Caused by: org.postgresql.util.PSQLException: ERROR: function pg_catalog.like_escape(bigint, character varying) does not exist   
Hint: No function matches the given name and argument types. You might need to add explicit type casts.   

Hibernate cannot generate proper SQL to handle the LIKE clause for LOB text columns. For this case, we can use the native query. In this query, we have to fetch text data from the LOB storage and convert it to string format. After that, we can use it in the LIKE clause (and don’t forget about @Transactional):

@Query(value = "select * from document d " +   
       "where convert_from(lo_get(doc_txt), 'UTF-8') like '%'||?1||'%'", nativeQuery = true)   
@Transactional   
List<Document> findByDocTextLike(String text);   

Now everything works fine. Please remember that native queries may not be compatible with other RDBMSes and are not validated in runtime. Use them only when absolutely necessary.

Conclusion: storing text as LOB

So, what are the pros and cons of storing large texts as LOB objects in PostgreSQL?
Pro:

  • PostgreSQL uses optimized storage for LOB objects
  • We can store up to 4Gb of text there

Con:

  • Some functions (LIKE, SUBSTRING, etc.) in the WHERE clause do not work in Hibernate for LOB text columns. We need to use native queries for this.
  • To fetch text with JPQL or Spring Data JPA repository query methods, we must use the @Transactional annotation for repository methods or disable the auto-commit mode

The question here is: why don’t we store text data right in the table? Let’s discuss this option too.

Option Two: column re-definition

PostgreSQL allows us to store long text data in a column of a particular datatype - TEXT. We can specify the column definition right in the annotation.

@Column(name = "doc_txt", columnDefinition = "text")  
private String docText;  

This allows us to work with long text in a "usual" way. No transactions, native queries, and JPQL works as expected. Compared to LOB type there is a limitation: “..., the longest possible character string that can be stored is about 1 GB.” It is smaller than 4Gb, allowed by LOB storage, but still long enough for most use cases.

The only problem here is the hardcoded column definition. To overcome it, we can use annotation @Type and converter org.hibernate.type.TextType in Hibernate 5. It has an advantage over the previous column definition: it is not vendor-specific.

@Type(type = "org.hibernate.type.TextType")  
@Column(name = "doc_txt")  
private String docText;  

In Hibernate 6, the org.hibernate.type.TextType class was removed. To define a column to store a long text, we can define the attribute in the following way:

@Column(name = "doc_txt", length = Length.LONG32)
private String docText;

So, storing long text in the TEXT/VARCHAR column brings fewer issues. No problems with transactions, LIKE conditions, etc. Are there any caveats with it?

There might be a problem if we use the TEXT column type in the database and the @Lob annotation. Let’s see how it works, using Hibernate 5 API. First, let’s create a table document and insert some data into it:

create table document (
    id int8 generated by default as identity,
    date_created timestamp not null,
    doc_txt text,
    primary key (id)
);  
 
insert into document (id, date_created, doc_txt) values (1, '2021-10-10', 'This is the document text number 1');  

We will use the document entity definition with the @Lob column:

@Entity  
@Table(name = "document")  
public class Document {  
   @Id  
   @GeneratedValue(strategy = GenerationType.IDENTITY)  
   @Column(name = "id", nullable = false)  
   private Long id;  
 
   @Column(name = "date_created", nullable = false)  
   private LocalDateTime dateCreated;  
 
   @Lob  
   @Column(name = "doc_txt")  
   private String docText;  
 
   //Getters and setters omitted for brevity  
}  

The code for document fetching will be the same:

documentRepository.findById(1L).ifPresent(d -> System.out.println(d.getDocText()));  

If we try to execute the repository method, we will see the following:

java.lang.IllegalStateException: Failed to execute Application  
… 
Caused by: org.hibernate.exception.DataException: could not execute query  
…  
Caused by: org.postgresql.util.PSQLException: Bad value for type long: This is the document text number 1  
…  

As we can see, Hibernate handles @Lob attribute values as a reference to LOB object data. The TEXT column type in the database table does not affect this behavior.

And what about saving data? Let’s clean the table, try to save the document entity with the @Lob field, and fetch it using Spring Data JPA. Here is the code to do this:

//Saving
Document doc = new Document();  
doc.setDateCreated(LocalDateTime.now());  
doc.setDocText("This is another text document");  
documentRepository.save(doc); 
...
//Fetching
documentRepository.findAll().forEach(d -> System.out.println(d.getDocText));  
...
//Result
This is another text document

So, it looks as if our entity with the @Lob attribute can work with TEXT columns. In the database tables we will see the familiar picture:

id date_created doc_txt
1 2022-06-16 15:28:26.751041 76388
loid pageno data
76388 0 This is another text document

If we insert document data into the table using SQL and then select the data, we will get the following:

insert into document (date_created, doc_txt) values ('2021-10-10', 'This is the document text'); 
select * from document;
id date_created doc_txt
1 2022-06-16 15:28:26.751041 76388
2 2021-10-10 00:00:00 This is the document text

Now we won’t be able to select data from the database using Spring Data JPA. The application will crash with the type conversion error while selecting the second row.

Let’s try to fix it (using Hibernate 5 API) and add the @Type annotation to the attribute...

@Lob  
@Type(type = "org.hibernate.type.TextType")  
@Column(name = "doc_txt")  
private String docText;  

...and try to print documents' text data to the app console.

documentRepository.findAll().forEach(d -> System.out.println(d.getDocText));  

We’ll see the following:

Hibernate: select document0_.id as id1_0_, document0_.date_created as date_cre2_0_, document0_.text as text3_0_ from document document0_  
 
76388  
This is the document text  

With the @Type annotation, we can select data, but the OID reference is translated into text, so we “lose” the text stored in the LOB storage.

Conclusion: Storing Long Text in Table

So what are the pros and cons of storing long text as a text column in the database:

Pro:

  • Queries work as expected; no need for separate transactions or native queries

Con:

  • Storage size is limited
  • Mixing the @Lob attribute definition and TEXT column datatype may cause unexpected results.

Using Spring Data JPA, Hibernate or EclipseLink and code in IntelliJ IDEA? Make sure you are ultimately productive with the JPA Buddy plugin!

It will always give you a valuable hint and even generate the desired piece of code for you: JPA entities and Spring Data repositories, Liquibase changelogs and Flyway migrations, DTOs and MapStruct mappers and even more!

Final words: how to store long text in PostgreSQL

  1. For most cases, storing long text data in the same table along with other entity data should work fine. It will allow you to manipulate data using both Hibernate and direct SQL.
    • In Hibernate 5, use `@Type(type = "org.hibernate.type.TextType")` annotation for the JPA entity attribute.
    • If you use Hibernate 6, define column like this: `@Column(name = ..., length = Length.LONG32)`.
  2. If you plan to store significant amounts of character data (more than single Java string class can contain), use the `@Lob` annotation for the JPA entity attribute. Hibernate will use PostgreSQL’s dedicated storage optimized for large amounts of data. There are several things that we should consider when we use LOBs.
    • We must execute JPQL queries and Spring Data JPA query methods in one transaction or disable auto-commit mode explicitly.
    • To use the LOB column in the `WHERE` condition, we might need to use native queries.
  3. There is great advice in Hibernate documentation: ` Please don’t (ab)use JPA’s @Lob annotation just because you want a TEXT column. The purpose of the @Lob annotation is not to control DDL generation!`. Hence, do not use the `@Lob` entity attribute definition with the `TEXT` column datatype.

Hope those simple rules will help you avoid issues while storing text data in PostgreSQL with Hibernate.