Apache Solr: How to efficiently index large amounts of data?

Contents

Apache Solr lets you quickly search through indexed resources. However, the indexing process itself can take a lot of time. This article will explain how a few simple steps had made ours – and can make yours – Solr indexing process much more efficient.

Apache Solr provides means to search through indexed resources extremely quickly. The indexing process itself, however, can take a lot of time. When one googles “solr indexing efficiency” or “solr efficient indexing” one can find a lot of information on RAM and tuning JVM parameters or details about specific Solr options. There are some great solr efficiency considerations to be found, but they cover very specific cases.

When we approached our indexing efficiency problem, the resources we’ve found weren’t of much help. It turned out that what we needed was to adapt our SQL queries to work well with Solr indexing mechanisms. To understand what and why Solr does with the data we supply it with.

This article will explain how a few simple steps had made ours – and can make yours – Solr indexing process much more efficient.

Solr data-config.xml files and SQL scripts used when writing this article can be found here: Effective Solr files.

How does Solr index data?

There are two types of imports that can be executed by Solr: a full import and a delta import. The full import will read all (*not necessarily, but we’ll get to that:) rows from the specified datasource and index them. A delta import will first get a list of changed elements, then query the datasource for the details of each such element. If there is a sub-entity with query specified in another entity, it will be executed separately for each obtained row of data, regardless of whether it was a delta import or a full import.

In other words, a full-import will execute exactly 1 query for each defined entity + N queries for each sub-entity, while a delta-import will execute 1 query to get given entity’s changed elements list + N queries for each changed element + another N queries for each defined sub-entity. Some of you may already see what I’m getting at…

A real-life scenario

Consider this table:

<br /><br /><br /><br />
CREATE TABLE “EPG”.”P_PROGRAMS” (<br /><br /><br /><br />
    “PPROG_ID” NUMBER NOT NULL,<br /><br /><br /><br />
    “PPROG_NAME” VARCHAR2(200 BYTE) NOT NULL,<br /><br /><br /><br />
    “PPROG_CAST” VARCHAR2(400 BYTE),<br /><br /><br /><br />
    “PPROG_CHNG_TIME” TIMESTAMP (6) DEFAULT systimestamp, </p><br /><br /><br />
<p>     PRIMARY KEY (“PPROG_ID”)<br /><br /><br /><br />
);

CREATE TABLE “EPG“.“P_PROGRAMS” (

“PPROG_ID” NUMBER NOT NULL,

“PPROG_NAME” VARCHAR2(200 BYTE) NOT NULL,

“PPROG_CAST” VARCHAR2(400 BYTE),

“PPROG_CHNG_TIME” TIMESTAMP (6) DEFAULT systimestamp,

PRIMARY KEY (“PPROG_ID“)

);

It stores information about movies: one field containing their titles (PPROG_NAME) and another containing comma separated list of at most 5 actors starring in the movie (PPROG_CAST). A field with record’s last modification date is added for tracking changes (PPROG_CHNG_TIME).

In solr we want to store two entities: Program with information about which actors star in it, and Actor, with a list of movies he/she stars in. Some of you may have already noticed that the way the actors are stored is not particularly processing-friendly. How could we index them efficiently…?

For now, let’s take a look at the Program entity for indexing programs in solr:

<entity dataSource=”p_program” name=”Program” query=”SELECT ‘PROGRAM’ AS DOCUMENT_TYPE, ‘P_’|| P.PPROG_ID as DOCUMENT_ID, p.PPROG_ID, p.PPROG_NAME, p.PPROG_CAST FROM P_PROGRAMS p” pk=”DOCUMENT_ID” deltaImportQuery=”SELECT ‘PROGRAM’ AS DOCUMENT_TYPE, ‘P_’|| P.PPROG_ID as DOCUMENT_ID, p.PPROG_ID, p.PPROG_NAME, p.PPROG_CAST FROM P_PROGRAMS p WHERE PPROG_ID =’${dih.delta.DOCUMENT_ID}'” deltaQuery=”select PPROG_ID as DOCUMENT_ID from p_programs where PPROG_CHNG_TIME > to_timestamp_tz(‘${dataimporter.last_index_time}’,’YYYY-MM-DD hh24:mi:ss’)” transformer=”RegexTransformer”> <field column=”DOCUMENT_TYPE” name=”DOCUMENT_TYPE” /> <field column=”PPROG_ID” name=”PPROG_ID” /> <field column=”PPROG_NAME” name=”PPROG_NAME” /> <field column=”PPROG_CAST” name=”PPROG_CAST” splitBy=”, +?”/> </entity>

<entity dataSource=“p_program” name=“Program”

query=“SELECT

‘PROGRAM’ AS DOCUMENT_TYPE,

‘P_’|| P.PPROG_ID as DOCUMENT_ID,

p.PPROG_ID,

p.PPROG_NAME,

p.PPROG_CAST

FROM P_PROGRAMS p”

pk=“DOCUMENT_ID”

deltaImportQuery=“SELECT

‘PROGRAM’ AS DOCUMENT_TYPE,

‘P_’|| P.PPROG_ID as DOCUMENT_ID,

p.PPROG_ID,

p.PPROG_NAME,

p.PPROG_CAST

FROM P_PROGRAMS p

WHERE PPROG_ID =’${dih.delta.DOCUMENT_ID}'”

deltaQuery=“select PPROG_ID as DOCUMENT_ID

from p_programs

where PPROG_CHNG_TIME > to_timestamp_tz(‘${dataimporter.last_index_time}’,’YYYY-MM-DD hh24:mi:ss’)”

transformer=“RegexTransformer”>

</entity>

It’s fairly straightforward. Only PPROG_CAST field handling is sligtly more compliacted, as we are using splitBy attribute to separate each comma separated actor from another, thus creating 5 PPROG_CAST values for each Program.

Actor entity is more complicated due to how we store Actors in the database – ale due to the requirement of indexing PPROG_ID for each Actor:

<br /><br /><br /><br />
<entity dataSource=”p_program” name=”Actor”<br /><br /><br /><br />
    transformer=”RegexTransformer”<br /><br /><br /><br />
    query=”<br /><br /><br /><br />
        SELECT DOCUMENT_TYPE, PPROG_CAST, ‘A_’||PPROG_CAST as DOCUMENT_ID from (<br /><br /><br /><br />
            SELECT DISTINCT<br /><br /><br /><br />
                ‘ACTOR’ AS DOCUMENT_TYPE,<br /><br /><br /><br />
                TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST<br /><br /><br /><br />
            FROM P_PROGRAMS<br /><br /><br /><br />
            WHERE PPROG_CAST IS NOT NULL<br /><br /><br /><br />
            CONNECT BY level &lt;= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)<br /><br /><br /><br />
            AND PRIOR pprog_id = pprog_id<br /><br /><br /><br />
            AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL<br /><br /><br /><br />
        )<br /><br /><br /><br />
    “<br /><br /><br /><br />
    pk=”DOCUMENT_ID”</p><br /><br /><br />
<p>    deltaImportQuery=”<br /><br /><br /><br />
        SELECT DISTINCT<br /><br /><br /><br />
            ‘ACTOR’ AS DOCUMENT_TYPE,<br /><br /><br /><br />
            ‘A_’||TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS DOCUMENT_ID,<br /><br /><br /><br />
            TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST<br /><br /><br /><br />
        FROM P_PROGRAMS<br /><br /><br /><br />
        WHERE PPROG_ID=’${dih.delta.DOCUMENT_ID}’ AND PPROG_CAST IS NOT NULL<br /><br /><br /><br />
        CONNECT BY level &lt;= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)<br /><br /><br /><br />
            AND PRIOR pprog_id = pprog_id<br /><br /><br /><br />
            AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL<br /><br /><br /><br />
    “</p><br /><br /><br />
<p>    deltaQuery=”<br /><br /><br /><br />
        select DISTINCT PPROG_ID as DOCUMENT_ID<br /><br /><br /><br />
        from P_PROGRAMS<br /><br /><br /><br />
        where PPROG_CHNG_TIME &gt; to_timestamp_tz(‘${dataimporter.last_index_time}’,’YYYY-MM-DD hh24:mi:ss’)<br /><br /><br /><br />
    “<br /><br /><br /><br />
    ><br /><br /><br /><br />
    <field column=”PPROG_CAST” name=”ACTOR” /></p><br /><br /><br />
<p>    <field column=”DOCUMENT_TYPE” name=”DOCUMENT_TYPE” /></p><br /><br /><br />
<p>    <entity dataSource=”p_program” name=”ProgramId” query=”<br /><br /><br /><br />
        SELECT PPROG_ID FROM (<br /><br /><br /><br />
            SELECT DISTINCT<br /><br /><br /><br />
                PPROG_ID,<br /><br /><br /><br />
                TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST<br /><br /><br /><br />
            FROM P_PROGRAMS<br /><br /><br /><br />
            WHERE PPROG_CAST IS NOT NULL<br /><br /><br /><br />
            CONNECT BY level &lt;= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)<br /><br /><br /><br />
                AND PRIOR pprog_id = pprog_id<br /><br /><br /><br />
                AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL<br /><br /><br /><br />
         ) WHERE PPROG_CAST = ‘${dataimporter.functions.escapeSql(Actor.PPROG_CAST)}'<br /><br /><br /><br />
         “><br /><br /><br /><br />
        <field column=”PPROG_ID” name=”ProgramId” /><br /><br /><br /><br />
    </entity><br /><br /><br /><br />
</entity>

<entity dataSource=“p_program” name=“Actor”

transformer=“RegexTransformer”

query=“

SELECT DOCUMENT_TYPE, PPROG_CAST, ‘A_’||PPROG_CAST as DOCUMENT_ID from (

SELECT DISTINCT

‘ACTOR’ AS DOCUMENT_TYPE,

TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST

FROM P_PROGRAMS

WHERE PPROG_CAST IS NOT NULL

CONNECT BY level <= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)

AND PRIOR pprog_id = pprog_id

AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL

)

“

pk=“DOCUMENT_ID”

deltaImportQuery=“

SELECT DISTINCT

‘ACTOR’ AS DOCUMENT_TYPE,

‘A_’||TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS DOCUMENT_ID,

TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST

FROM P_PROGRAMS

WHERE PPROG_ID=’${dih.delta.DOCUMENT_ID}’ AND PPROG_CAST IS NOT NULL

CONNECT BY level <= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)

AND PRIOR pprog_id = pprog_id

AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL

“

deltaQuery=“

select DISTINCT PPROG_ID as DOCUMENT_ID

from P_PROGRAMS

where PPROG_CHNG_TIME > to_timestamp_tz(‘${dataimporter.last_index_time}’,’YYYY-MM-DD hh24:mi:ss’)

“

<entity dataSource=“p_program” name=“ProgramId” query=“

SELECT PPROG_ID FROM (

SELECT DISTINCT

PPROG_ID,

TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST

FROM P_PROGRAMS

WHERE PPROG_CAST IS NOT NULL

CONNECT BY level <= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)

AND PRIOR pprog_id = pprog_id

AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL

) WHERE PPROG_CAST = ‘${dataimporter.functions.escapeSql(Actor.PPROG_CAST)}’

“>

</entity>

With solr configured like this, the full-import of a mere 1000 programs with 5 actors (generated by populate.sql script downloadable here) each took 20 minutes on our test vm machine (Linux 64bit, 3 cores@3GHz, 6GB RAM, HDD drive). A delta for all 1000 programs took 24 minutes. That’s very slow and very inefficient.

Why are these results so poor? It’s not the program entity – its indexing is fast. It performs only a single query. It’s indexing Actors that’s much, much slower. Not only are we querying the database with a complex query for each actor, we are querying it with a where-clause relying on a non-indexed, dynamically built column. It has to be slow.

A production environment is 10 times as powerful. Still, indexing and updating tens of thousands of programs daily would take almost a whole day. And we’d like to to that twice a day. So how do we improve these entity definitions? First of all – we can get rid of the sub-entity for Actor. And the first rule for quick indexing:

One entity – one query (aka don’t use sub-entities)

It’s the slow sub-entity query that takes the most time while indexing actors. It’s fairly complex and time consuming, and we call it once for each Actor… It would be best if we were able to include the program id’s in a single query in the same way the actors are included for Program entity.

And we can do it. We can transform the query for Actor to include a csv column with program ids. It’s not basic SQL, but it’s not rocket science either. There was a problem with most popular actors, for which the ProgramId csv colum was long enough to force us to uise a clob instead of varchar – it turns out that you can only fit so many 7 digit ids in Oracle’s 4000 byte varchar2…

A redesigned Actor entity query looks like this:

<br /><br /><br /><br />
<entity dataSource=”p_program” name=”Actor”<br /><br /><br /><br />
    transformer=”ClobTransformer, RegexTransformer”<br /><br /><br /><br />
    query=”<br /><br /><br /><br />
        select DOCUMENT_TYPE, PPROG_CAST,<br /><br /><br /><br />
            ‘A_’ || PPROG_CAST as DOCUMENT_ID,<br /><br /><br /><br />
            rtrim(xmlagg(xmlelement(e,PPROG_ID,’,’).extract(‘//text()’) order by PPROG_ID).GetClobVal(),’,’) as PPROG_IDS<br /><br /><br /><br />
        from (<br /><br /><br /><br />
            SELECT<br /><br /><br /><br />
                ‘ACTOR’ AS DOCUMENT_TYPE,<br /><br /><br /><br />
                TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST,<br /><br /><br /><br />
                PPROG_ID<br /><br /><br /><br />
            FROM P_PROGRAMS<br /><br /><br /><br />
            WHERE PPROG_CAST IS NOT NULL<br /><br /><br /><br />
            CONNECT BY level <= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)<br /><br /><br /><br />
                AND PRIOR pprog_id = pprog_id<br /><br /><br /><br />
                AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL<br /><br /><br /><br />
            )<br /><br /><br /><br />
        group by DOCUMENT_TYPE, PPROG_CAST<br /><br /><br /><br />
        “<br /><br /><br /><br />
    pk=”PPROG_CAST”<br /><br /><br /><br />
><br /><br /><br /><br />
    <field column=”PPROG_CAST” name=”ACTOR” /></p><br /><br /><br />
<p>    <field column=”DOCUMENT_TYPE” name=”DOCUMENT_TYPE” /><br /><br /><br /><br />
    <field column=”PPROG_IDS” splitBy=”,” clob=”true” name=”ProgramId” /><br /><br /><br /><br />
</entity>

<entity dataSource=“p_program” name=“Actor”

transformer=“ClobTransformer, RegexTransformer”

query=“

select DOCUMENT_TYPE, PPROG_CAST,

‘A_’ || PPROG_CAST as DOCUMENT_ID,

rtrim(xmlagg(xmlelement(e,PPROG_ID,’,’).extract(‘//text()’) order by PPROG_ID).GetClobVal(),’,’) as PPROG_IDS

from (

SELECT

‘ACTOR’ AS DOCUMENT_TYPE,

TRIM(REGEXP_SUBSTR(PPROG_CAST, ‘[^,]+’, 1, level)) AS PPROG_CAST,

PPROG_ID

FROM P_PROGRAMS

WHERE PPROG_CAST IS NOT NULL

CONNECT BY level <= REGEXP_COUNT(PPROG_CAST, ‘[^,]+’)

AND PRIOR pprog_id = pprog_id

AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL

)

group by DOCUMENT_TYPE, PPROG_CAST

“

pk=“PPROG_CAST”

</entity>

With this change a full import for 1000 programs with 5 actors each takes only 6 seconds instead of the 20 minutes it took previously.

This query, however, improves the delta speed by a factor of only 2, since we will be making 1 query instead of 2. It means that a delta would still take a lot of time. How can we address that? We can…

Use full import without clean instead of delta

When choosing full-import command, we can specify whether we want to remove all previously indexed data or not. It’s the clean flag passed as a parameter. If we set it to true, then a full import will re-parse and re-index all data – which was not bad, since full reindex in our case took minutes and delta took hours. If we set it to false, previous data won’t be removed. We can combine this with another feature of Solr: we can include the last index time and state of clean flag in our query. Therefore, we are able to determine which data to read – all, or just the changed rows. If the clean flag is set to true, we’ll want to get all data, since the previous data will be removed. If clean flag is set to false, we can use last index time to get only modified results. In SQL it boils down to including something like this in a where clause:

<br /><br /><br /><br />
AND ( ‘${dataimporter.request.clean}’ = ‘true'<br /><br /><br /><br />
  OR PPROG_CHNG_TIME &gt; to_timestamp_tz(‘${dataimporter.last_index_time}’,’YYYY-MM-DD hh24:mi:ss’) )

1 2	AND ( ‘${dataimporter.request.clean}’ = ‘true’ OR PPROG_CHNG_TIME > to_timestamp_tz(‘${dataimporter.last_index_time}’,‘YYYY-MM-DD hh24:mi:ss’) )

And now, in Solr, if we want to index changes only, we can do it by invoking full-import?clean=false. It’ll be as fast as a full import, since it will only execute a single time-bounded query. When we want a full re-index, we set clean=true, the query matches all data regardless of last modification time.

Now, instead of delta querying for each row, we are only calling one query to update all data. That’s much more efficient, isn’t it?

There’s another advantage of not using deltas. We have less SQL queries to manage – in case we have to modify our entities, there is less space for making errors, and no chance of forgetting to update one of the queries.

Impact on a real-life production server

The changes to solr data configuration explained in this post have reduced the time necessary to index program and actor (and other, with similar changes) data from 8 hours to 12 minutes for a full import, and from 14 hours to 4-6 minutes for delta replaced with clean=false full import. Our import’s efficiency has been increased by more than 1 order of magnitude for full import and by 2 orders of maginitude for delta.

Initially, a delta import was scheduled to occur two times a day, just after synching data from external provders. The unoptimized delta import didn’t actually run fast enough to be used twice a day. The optimized import does, and could be easily run 100 times per day.