Ways to load data from source to target

Sagadevan K
7 min read · Mar 21, 2023

In an ETL process, there are multiple ways to load data from a source system to a target system. Let’s have a look at some of them.

Image source: Vecteezy

Full load

Full load means fetching all records from the source and inserting them into the target. The target needs to be emptied before every load. This is the simplest way to move data from one place to another.

Incremental load

Incremental load means fetching only new/updated records from the source and pushing those changes to the target. To do an incremental load, we must be able to identify the new/changed records, and this can get tricky.

Throughout the article, we’ll assume the source and target tables have the same structure (DDL), to keep things simple.

Let’s start with full load, followed by the different types of incremental loads.

Full load implementation

Let’s say we have a source table named emp_details which has the following records:

We want to load these records into our target. So the steps are: truncate the target table, then insert all records from the source table. SQL commands for the same:

TRUNCATE TABLE target.emp_details;

INSERT INTO target.emp_details
(
    id,
    emp_name,
    dob,
    department
)
SELECT
    id,
    emp_name,
    dob,
    department
FROM source.emp_details;

Incremental Load Implementation

To do an incremental load, we should know the new/updated records in the source. This can be done in the following ways:

  1. Using date columns to identify the changes

Let’s say our source table initially looks like:

Our target table is in sync with the source table. Now there are some inserts and updates in the source table. The new state of the source table is:

One existing row has been modified and one new row has been inserted.

How will we know that only these 2 records should be picked from the source? We know that the target was in sync with our initial source state.

If we find max(created_date) from the target table, we get the last_inserted_date; similarly, max(modified_date) from the target table gives us the last_updated_date. If a record in the source table has modified_date > last_updated_date, it has been updated since the last load and needs to be modified in the target. If a record in the source table has created_date > last_inserted_date, it is a newly created record and needs to be inserted.

The UPDATE statement will look like:

UPDATE T
SET emp_name = S.emp_name,
    dob = S.dob,
    department = S.department,
    modified_date = S.modified_date
FROM target.emp_details T
INNER JOIN source.emp_details S
    ON S.id = T.id
WHERE S.modified_date > (SELECT MAX(modified_date) FROM target.emp_details);

The INSERT statement will look like:

INSERT INTO target.emp_details
(
    id,
    emp_name,
    dob,
    department,
    created_date,
    modified_date
)
SELECT
    S.id,
    S.emp_name,
    S.dob,
    S.department,
    S.created_date,
    S.modified_date
FROM source.emp_details S
WHERE S.created_date > (SELECT MAX(created_date) FROM target.emp_details);

The WHERE clause does the magic in both statements: it only returns the records which were not processed previously. Note that both statements also carry the date columns over to the target; otherwise MAX(created_date)/MAX(modified_date) on the target would never advance past the initial sync. Both statements (UPDATE and INSERT) should be executed inside a single transaction, as sketched below.
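A minimal sketch of that transaction, assuming SQL Server’s T-SQL (other databases have their own BEGIN/COMMIT syntax):

BEGIN TRY
    BEGIN TRANSACTION;

    -- 1. Apply updates (the UPDATE statement above)
    UPDATE T
    SET emp_name = S.emp_name,
        dob = S.dob,
        department = S.department,
        modified_date = S.modified_date
    FROM target.emp_details T
    INNER JOIN source.emp_details S
        ON S.id = T.id
    WHERE S.modified_date > (SELECT MAX(modified_date) FROM target.emp_details);

    -- 2. Apply inserts (the INSERT statement above)
    INSERT INTO target.emp_details (id, emp_name, dob, department, created_date, modified_date)
    SELECT S.id, S.emp_name, S.dob, S.department, S.created_date, S.modified_date
    FROM source.emp_details S
    WHERE S.created_date > (SELECT MAX(created_date) FROM target.emp_details);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION; -- undo the partial load on any error
    THROW;
END CATCH;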

To improve the performance of these queries, we can create indexes on the created_date and modified_date columns in both the source and the target.
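For example (the index names here are just illustrative):

CREATE INDEX ix_src_created_date ON source.emp_details (created_date);
CREATE INDEX ix_src_modified_date ON source.emp_details (modified_date);
CREATE INDEX ix_tgt_created_date ON target.emp_details (created_date);
CREATE INDEX ix_tgt_modified_date ON target.emp_details (modified_date);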

If we want to remove records from the target table that have been deleted in the source table, we would have to do a record-by-record comparison between the source and the target, because there is no column on the basis of which we can identify that a record has been deleted. To solve this, we can add a deleted_flag to the source table (a soft delete). Where deleted_flag is set, we delete the corresponding records from the target table, as sketched below.
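A minimal sketch of propagating soft deletes, assuming a BIT/boolean deleted_flag column (compare against 1 in SQL Server, TRUE in PostgreSQL):

DELETE FROM target.emp_details
WHERE id IN (
    SELECT id
    FROM source.emp_details
    WHERE deleted_flag = 1 -- rows soft-deleted in the source
);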

This method will work only if the tables have created_date, modified_date and deleted_flag columns. Setting this up (adding and backfilling the columns) for already existing tables is an additional task.

  2. Using an Upsert operation

If the source and target tables do not have these identifier columns, we’ll have to do an upsert operation by comparing the source and target tables directly.

We want to compare the two tables, and we can do that by joining them.

  • Rows which exist in both tables need to be updated; we get these rows with an INNER JOIN.
  • Rows which exist in the source table but not in the target table need to be inserted; we get these rows with source LEFT JOIN target ON source.id = target.id WHERE target.id IS NULL. The WHERE target.id IS NULL filter keeps those source records for which there is no matching record in the target table.
  • Rows which exist in the target table but not in the source table need to be deleted; we get these rows with target LEFT JOIN source ON source.id = target.id WHERE source.id IS NULL. The WHERE source.id IS NULL filter keeps those target records for which there is no matching record in the source table.

Initially, both tables are in sync.

After changes in the source, the full outer join of the tables will look like this:

The modified record needs to be updated, records missing from the source need to be deleted from the target, and new source records need to be inserted. SQL commands for the same:

UPDATE T
SET emp_name = S.emp_name,
    dob = S.dob,
    department = S.department
FROM target.emp_details T
INNER JOIN source.emp_details S
    ON S.id = T.id;

DELETE T
FROM target.emp_details T
LEFT JOIN source.emp_details S
    ON S.id = T.id
WHERE S.id IS NULL;

INSERT INTO target.emp_details
(
    id,
    emp_name,
    dob,
    department
)
SELECT
    S.id,
    S.emp_name,
    S.dob,
    S.department
FROM source.emp_details S
LEFT JOIN target.emp_details T
    ON S.id = T.id
WHERE T.id IS NULL;

Some databases, like SQL Server, provide a MERGE statement to do all of this in one statement, as sketched below.
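A minimal sketch using SQL Server’s MERGE (the exact syntax varies across databases):

MERGE target.emp_details AS T
USING source.emp_details AS S
    ON S.id = T.id
WHEN MATCHED THEN -- row exists in both: update
    UPDATE SET emp_name = S.emp_name,
               dob = S.dob,
               department = S.department
WHEN NOT MATCHED BY TARGET THEN -- row only in source: insert
    INSERT (id, emp_name, dob, department)
    VALUES (S.id, S.emp_name, S.dob, S.department)
WHEN NOT MATCHED BY SOURCE THEN -- row only in target: delete
    DELETE;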

We can see in the UPDATE statement that even though there is only one updated record in the source, every record returned by the INNER JOIN will be rewritten in the target. This can be very costly if the number of records is huge.

To solve this, we can add a new column called row_hash_code to both the source and target tables. This column will be a hash of all the other columns and needs to be computed (additional work). If a record gets updated, its row_hash_code will change. So in the UPDATE statement we can add the condition WHERE S.row_hash_code != T.row_hash_code: if the row_hash_code for a particular id does not match between the source table and the target table, there is some update in the source table for that id.
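A sketch of the filtered UPDATE, assuming row_hash_code is maintained in both tables (it could be computed with something like SQL Server’s HASHBYTES over the concatenated columns):

UPDATE T
SET emp_name = S.emp_name,
    dob = S.dob,
    department = S.department,
    row_hash_code = S.row_hash_code -- keep the target hash current
FROM target.emp_details T
INNER JOIN source.emp_details S
    ON S.id = T.id
WHERE S.row_hash_code != T.row_hash_code; -- only rows that actually changed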

The above-mentioned SQL commands will only work where both the source and target reside in the same SQL environment. If the source and target do not reside in the same place (e.g. the source is file-based and the target is relational, or the source is SQL Server and the target is AWS Redshift), then we will have to load both the source and target into an ETL process (Spark, Pandas, etc.) and then compare them.

Some cloud providers have offerings for incremental load. For example, AWS Glue has an option called a job bookmark, which keeps a record of the last processed file’s timestamp and uses it to process only the newer files in the next run.

Change Data Capture (CDC) is also an option to get only the changes from the source system. These changes can be very frequent, hence we’ll have to stream them into a message broker like Kafka and then consume them from there.

Depending on the problem statement, we will have to choose a load type that will be efficient:

  • If the table size is small, we can go for a full load, but we won’t be able to maintain a history of deleted/updated records.
  • Generally, for large tables, it is good practice to go for an incremental load which reads only the changed records. The tables should have a unique key on which they can be compared. Deleted records can be handled as required (keep them, mark them as deleted, or actually delete them).
  • Sometimes tables are append-only: we only get inserts in the source system. In this case, we can insert the new records into the target table and truncate the source table so that in the next run we don’t insert the same records again (see the sketch below).
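A minimal sketch of the append-only pattern, assuming the source acts as a staging table that can safely be emptied after each load, and a database where TRUNCATE is transactional (e.g. SQL Server or PostgreSQL; in MySQL, use DELETE instead):

BEGIN TRANSACTION;

    -- Copy everything currently staged in the source
    INSERT INTO target.emp_details (id, emp_name, dob, department)
    SELECT id, emp_name, dob, department
    FROM source.emp_details;

    -- Clear the source so the next run does not re-insert the same records
    TRUNCATE TABLE source.emp_details;

COMMIT;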
