Advanced faceted search in SQL database using MyBatis

Faceted search is a popular technique used by online retailers. Users search for products using multiple filters, called facets, narrowing the result list at each step. In addition, each filter can show the number of products matching each of its values under the currently selected conditions. Facets without results can be hidden.

Most facets use product information such as type, brand, price, color and weight. There can also be facets specific to a product type. For example, you can search for laptops by processor, RAM, disk size or screen size.

Faceted search is also useful in business applications. Besides simple attributes, we can create facets based on complex business rules, such as a calculated customer segment, an exceeded payment deadline, or customers qualified for vindication (debt collection).

In this blog post I show how to create a simple faceted search application and how to handle complex business conditions. The examples use the MyBatis persistence framework, but they translate easily to other SQL query frameworks.

SQL faceted search

Let’s create a simple faceted search for an online shop. Consider a simple database schema with a single PRODUCTS table that has the columns id, name, type, brand, price and is_available.

In this simple example, we can extract several facets:

type – the product type filter can be presented as a selectable list of types with the number of products for each type
brand – the brand can be presented as a select box with the number of products for each brand
price – a vertical range widget; the minimum and maximum prices should reflect the actual values in the PRODUCTS table
is_available – a single selectable facet with the number of products
not_available – a single selectable facet with the number of products

Then we create a criteria class for our facets.

import java.math.BigDecimal;
import java.util.Collection;

public class ProductCriteria {

  private Collection<String> types;
  private Collection<String> brands;
  private BigDecimal minPrice;
  private BigDecimal maxPrice;
  private boolean available;
  private boolean notAvailable;

  …
}

We can select multiple types, so the criteria object takes a collection of types. For range conditions I’m using a pair of fields: minPrice and maxPrice.

The next step is to create a MyBatis mapper file that executes the query for the list of products. I’m using a separate productCriteria block so that it can be reused later.

<sql id="productCriteria">
  <if test="criteria.types != null and !criteria.types.isEmpty()">
    and type in <foreach collection="criteria.types" open="(" close=")" separator=", " item="item">#{item}</foreach>
  </if>
  <if test="criteria.brands != null and !criteria.brands.isEmpty()">
    and brand in <foreach collection="criteria.brands" open="(" close=")" separator=", " item="item">#{item}</foreach>
  </if>
  <if test="criteria.minPrice != null">
    and price >= #{criteria.minPrice}
  </if>
  <if test="criteria.maxPrice != null">
    <!-- "<" must be escaped as &lt; inside mapper XML -->
    and price &lt;= #{criteria.maxPrice}
  </if>
  <if test="criteria.available">
    and is_available = 1
  </if>
  <if test="criteria.notAvailable">
    and is_available = 0
  </if>
</sql>

<select id="searchProducts" parameterType="map" resultType="Product">
  select id, name, type, brand, price, is_available
  from products
  <trim prefix="WHERE" prefixOverrides="and |or ">
    <include refid="productCriteria"/>
  </trim>
</select>

In addition to the product list, we must also calculate the values for our facets.

Unfortunately, several select queries are needed. Each query uses the conditions from ProductCriteria with one condition cleared, so that the facet still lists the values the user has not selected. For example, typesCount would have criteria.types == null.
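The "one condition cleared" step can be sketched in plain Java as follows. This is only an illustration under assumptions: FacetCriteria is a trimmed-down stand-in for ProductCriteria with just two fields, and the withoutTypes and withoutBrands helpers are hypothetical names that do not appear in the original class.

```java
import java.util.Collection;

// Trimmed-down stand-in for ProductCriteria; the helper methods are hypothetical.
class FacetCriteria {
    Collection<String> types;
    Collection<String> brands;

    FacetCriteria(Collection<String> types, Collection<String> brands) {
        this.types = types;
        this.brands = brands;
    }

    // Copy for the typesCount query: every condition except "types" is kept.
    FacetCriteria withoutTypes() {
        return new FacetCriteria(null, brands);
    }

    // Copy for the brandsCount query: every condition except "brands" is kept.
    FacetCriteria withoutBrands() {
        return new FacetCriteria(types, null);
    }
}
```

Working on a copy keeps the user's original criteria untouched, so the same object can still be passed to searchProducts.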

The type facet uses a group by clause to select the product count for each type:

<select id="typesCount" parameterType="map" resultType="GroupCount">
  select type, count(*)
  from products
  <trim prefix="WHERE" prefixOverrides="and |or ">
    <include refid="productCriteria"/>
  </trim>
  group by type
</select>

The brand facet works analogously to the type facet:

<select id="brandsCount" parameterType="map" resultType="GroupCount">
  select brand, count(*)
  from products
  <trim prefix="WHERE" prefixOverrides="and |or ">
    <include refid="productCriteria"/>
  </trim>
  group by brand
</select>

The price facet uses the min and max functions:

<select id="priceMinMax" parameterType="map" resultType="MinMax">
  select min(price), max(price)
  from products
  <trim prefix="WHERE" prefixOverrides="and |or ">
    <include refid="productCriteria"/>
  </trim>
</select>

The is_available and not_available facets can share a single group by query:

<select id="availableCount" parameterType="map" resultType="GroupCount">
  select is_available, count(*)
  from products
  <trim prefix="WHERE" prefixOverrides="and |or ">
    <include refid="productCriteria"/>
  </trim>
  group by is_available
</select>

The UI uses the returned information to display each facet properly.
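As an illustration of that last step, the rows of the availableCount query could be split into the two availability facets like this. It is a sketch under assumptions: the post does not show the GroupCount class, so the rows are represented here as a plain map from the is_available value to its count, and AvailabilityFacets is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the rows are assumed to arrive as is_available -> count(*).
class AvailabilityFacets {
    static Map<String, Long> toFacetCounts(Map<Integer, Long> rows) {
        Map<String, Long> facets = new LinkedHashMap<>();
        // A group missing from the query result simply has no matching products.
        facets.put("is_available", rows.getOrDefault(1, 0L));
        facets.put("not_available", rows.getOrDefault(0, 0L));
        return facets;
    }
}
```

A facet whose count comes back as 0 is a candidate for hiding in the UI.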

In business applications facets are not always that simple. Let’s take a look at a more complicated example:

A customer can have multiple contracts. For each contract there can be multiple payments, and each payment has several fields:

price – how much the customer should pay us
payed_price – how much the customer has paid
deadline – the due date for the payment

There could be many search facets for customers, but let’s focus on the more interesting ones, in order of complexity:

customers with pending payments – tells us which customers have payments that have not been fully paid
customers with outstanding payments – tells us which customers have pending payments past their deadline
customers for vindication – tells us which customers have outstanding payments with a cumulative value of more than 500 that are more than 2 months past their deadline

As shown in the previous example, we need two things to implement a facet: an SQL condition that narrows the list of customers, and the customer count for the facet. The second is easy once we have the first, so let’s focus on the SQL condition. For customers with pending payments we need to select all customers that have contracts with payments where payed_price < price. This can be written as the following SQL query:

select id, name
from customers cust
where exists(
  select * from contracts c
  join payments p on (p.contract_id = c.id)
  where p.price > p.payed_price
    and cust.id = c.customer_id
)

Instead of exists we could use an in clause, but the problem remains that we have to deal with a sub-query. More complex facets may need many more joins per facet condition, and the conditions of all selected facets go into the where clause. So query complexity can grow really fast.

Data denormalization

Fortunately, there is a way to deal with complex facets without sacrificing query performance. The idea is to create an additional table with data for the complex facets. This data already exists in the database, spread over many different tables, so we are duplicating information to get easier access. The main problem with data denormalization is keeping all duplicates in sync, but sometimes it is the only way to get satisfactory performance from select queries.

Here’s an example of an additional table, which keeps data for our complex facets:

The CUSTOMERS_AGG (aggregates) table contains data aggregated from other tables for our facets. The pending_payments facet is just a boolean value. I will deal with the other facets later, so their column types are left unspecified.

With this new table our select query is simpler:

select cust.id, cust.name
from customers cust
join customers_agg ca on (cust.id = ca.id)
where ca.pending_payments = 1

Complexity does not change when we add conditions for other facets. We still have only one join; we only add simple conditions to the where clause.

As I said earlier, we should keep our duplicated data in sync, so we need an additional query that selects the current data for our aggregate table. To get the value of the pending_payments facet for one customer, we can use the following query:

select count(*) from contracts c
join payments p on (p.contract_id = c.id)
where p.price > p.payed_price
  and c.customer_id = v_customer_id

For v_customer_id, when count(*) is greater than 0, we can set the pending_payments value in CUSTOMERS_AGG to 1, and to 0 otherwise.

How and when to update the aggregates table?

The aggregate table should be updated whenever the source data for our facets changes. In our case we should update the aggregates in the following situations:

insert, delete on CONTRACTS table
insert, update, delete on PAYMENTS table

We have several ways to update the aggregates table.

Database triggers

Database triggers are a very powerful way of updating aggregate data. When there is a change in the CONTRACTS or PAYMENTS table, the customer aggregate procedure is called. This way the data is always in sync, no matter whether the update is executed from the database or from the application. The downside is slower updates, because of the complexity of the aggregate procedure. Triggers may also generate multiple updates for the same customer when multiple rows are updated in the same transaction.

Synchronous application updates

Synchronous application updates can be done at the end of a transaction. This way we can collect all of the changes made in the whole transaction and recalculate the aggregates for each customer only once. Unfortunately, updates made outside of the application do not execute the aggregate procedure.
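The "collect, then recalculate once" idea can be sketched as below. Everything here is hypothetical: AggregateRefreshQueue, markChanged and flush are illustrative names, and the callback passed to flush stands in for whatever aggregate procedure the application actually uses.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.LongConsumer;

// Collects the customers touched in one transaction; all names are hypothetical.
class AggregateRefreshQueue {
    private final Set<Long> changedCustomerIds = new LinkedHashSet<>();

    // Call whenever a contract or payment of this customer is inserted, updated or deleted.
    void markChanged(long customerId) {
        changedCustomerIds.add(customerId); // a Set keeps each customer only once
    }

    // Call just before commit: runs the aggregate procedure once per changed customer.
    int flush(LongConsumer recalculateAggregates) {
        int refreshed = 0;
        for (long customerId : changedCustomerIds) {
            recalculateAggregates.accept(customerId);
            refreshed++;
        }
        changedCustomerIds.clear();
        return refreshed;
    }
}
```

Ten changed payments of the same customer therefore lead to a single recalculation, which is the advantage over per-row triggers.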

Asynchronous application updates

When the aggregate update is very slow, we can make the application updates asynchronous. One way to do it is to have database triggers mark the rows that need updating. Later, the application asynchronously updates the invalid aggregate rows. Unfortunately, there is some delay between the source data update and the aggregate update, and during this time we are showing old data. The upside is that the aggregate procedure does not impact other transactions. And because database triggers mark the obsolete rows, updates from outside of the application also eventually cause aggregate updates.
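A minimal sketch of the application side of this approach might look as follows. It assumes that the triggers set some kind of "stale" flag on the aggregate rows; that flag, the StaleAggregateWorker class and both callbacks are assumptions made for illustration, not something defined in this post.

```java
import java.util.List;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Background refresher for aggregate rows that triggers have flagged as stale.
// The "stale" flag and both callbacks are assumptions made for this sketch.
class StaleAggregateWorker {
    private final Supplier<List<Long>> findStaleCustomerIds;
    private final LongConsumer recalculateAndClearFlag;

    StaleAggregateWorker(Supplier<List<Long>> findStaleCustomerIds,
                         LongConsumer recalculateAndClearFlag) {
        this.findStaleCustomerIds = findStaleCustomerIds;
        this.recalculateAndClearFlag = recalculateAndClearFlag;
    }

    // One polling pass; meant to be run periodically, outside of user transactions.
    int runOnce() {
        int refreshed = 0;
        for (long customerId : findStaleCustomerIds.get()) {
            recalculateAndClearFlag.accept(customerId);
            refreshed++;
        }
        return refreshed;
    }
}
```

How often runOnce is scheduled determines how long users may see old facet data.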

Time-dependent conditions

The other two facets, outstanding_payments and vindication, fall into the category of temporal facets. The problem with temporal facets is that it is more difficult to create aggregates for them.

Let’s say that a customer does not have any outstanding payments. This information can become false the next day. Recalculating facets every day is rather wasteful.

Fortunately, instead of storing the current facet state, we can calculate the date from which the facet will be active. So in our CUSTOMERS_AGG table, temporal facets will have columns of type date.

Now we can check current facet value based on current date:

select cust.id, cust.name
from customers cust
join customers_agg ca on (cust.id = ca.id)
where ca.outstanding_payments <= sysdate
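The same check can be expressed in application code, for example when a single aggregate row has already been loaded. The sketch below assumes that a null date means the facet never becomes active, which mirrors how the SQL comparison above excludes rows with a null outstanding_payments value; TemporalFacet and isActive are hypothetical names.

```java
import java.time.LocalDate;

// Hypothetical helper mirroring "outstanding_payments <= sysdate" at day precision.
class TemporalFacet {
    // activeFrom is the date stored in CUSTOMERS_AGG; null is assumed to mean "never active".
    static boolean isActive(LocalDate activeFrom, LocalDate today) {
        return activeFrom != null && !activeFrom.isAfter(today);
    }
}
```

Because only the activation date is stored, the facet flips to active on its own when the date arrives, without any nightly recalculation.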

Updating a temporal facet aggregate is a little more difficult. We must find the earliest date from which the facet is active:

select min(p.deadline) from contracts c
join payments p on (p.contract_id = c.id)
where p.price > p.payed_price
  and c.customer_id = v_customer_id

And what about the vindication facet? We should find the earliest date that is 2 months later than the deadline of the payments whose cumulative unpaid value exceeds 500. I leave this query as an exercise to the reader.
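Without giving away the SQL, the rule can be sketched in plain Java. This is one possible reading of it, not a definitive one: pending payments are ordered by deadline, their unpaid amounts are summed, and the facet becomes active 2 months after the deadline at which the running sum first exceeds 500. The Payment and VindicationDate classes are hypothetical and exist only for this illustration.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory view of one PAYMENTS row.
class Payment {
    final BigDecimal price;
    final BigDecimal payedPrice;
    final LocalDate deadline;

    Payment(String price, String payedPrice, String deadline) {
        this.price = new BigDecimal(price);
        this.payedPrice = new BigDecimal(payedPrice);
        this.deadline = LocalDate.parse(deadline);
    }

    BigDecimal unpaid() {
        return price.subtract(payedPrice);
    }
}

class VindicationDate {
    static final BigDecimal LIMIT = new BigDecimal("500");

    // Returns the first date on which the facet becomes active, or null if it never does.
    static LocalDate activeFrom(List<Payment> payments) {
        List<Payment> pending = new ArrayList<>();
        for (Payment p : payments) {
            if (p.unpaid().signum() > 0) {
                pending.add(p); // same condition as p.price > p.payed_price
            }
        }
        pending.sort(Comparator.comparing((Payment p) -> p.deadline));

        BigDecimal runningUnpaid = BigDecimal.ZERO;
        for (Payment p : pending) {
            runningUnpaid = runningUnpaid.add(p.unpaid());
            if (runningUnpaid.compareTo(LIMIT) > 0) {
                return p.deadline.plusMonths(2);
            }
        }
        return null;
    }
}
```

If the business rule is meant differently, for example counting the 2 months from the oldest deadline only, the loop has to change accordingly.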

Summary

Faceted search is a powerful way to filter data by the categories that interest users. Besides the basic conditions used in online retail applications, we can create facets based on complex business rules. In this article I have shown how to create a faceted search in an SQL (relational) database using the MyBatis persistence framework. I’ve also shown how to optimize faceted search performance by creating data aggregates, and how to implement temporal facets whose values change over time.