Tuesday, June 14, 2016

SQL Restoring historical data from the changelog

There is a table A. One row from table A looks like the following:

+----+---------+---------+---------+------------+------------+------------------+------------------+
| id | value_a | value_b | value_c | created_on | created_by | last_modified_on | last_modified_by |
+----+---------+---------+---------+------------+------------+------------------+------------------+
| 42 | x       | y       | z       | 2016-04-01 | Maria      | 2016-05-01       | Jim              |
+----+---------+---------+---------+------------+------------+------------------+------------------+

So, table A contains only the latest values.

There is also a table called changelog, which stores all the changes/updates made to table A. Its records for table A look like the following:

+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+
| id  | object_id | action |  field  | old_value |                new_value                | created_on | created_by |
+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+
| 234 |        42 | insert | NULL    | NULL      | {value_a: xx, value_b: yy, value_c: zz} | 2016-04-01 | Maria      |
| 456 |        42 | update | value_a | xx        | x                                       | 2016-04-05 | Bob        |
| 467 |        42 | update | value_b | yy        | y                                       | 2016-05-01 | Jim        |
| 678 |        42 | update | value_c | zz        | z                                       | 2016-05-01 | Jim        |
+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+

I need to create a historical_A table, which for this specific record will look as follows:

+----+---------+---------+---------+------------+------------+------------+--------------+
| id | value_a | value_b | value_c | valid_from | created_by |  valid_to  | modified_by  |
+----+---------+---------+---------+------------+------------+------------+--------------+
| 42 | xx      | yy      | zz      | 2016-04-01 | Maria      | 2016-04-05 | Bob          |
| 42 | x       | yy      | zz      | 2016-04-05 | Bob        | 2016-05-01 | Jim          |
| 42 | x       | y       | z       | 2016-05-01 | Jim        |            |              |
+----+---------+---------+---------+------------+------------+------------+--------------+

Table A has about 1,500,000 rows; the changelog table has about 27,000,000 rows for table A.

Currently I am doing an initial transformation (load) using both SQL and Python scripting. Basically, I generate an insert statement for the initial row (by parsing the JSON in the insert record), and then generate all subsequent insert statements, grouping by the created_on column of the changelog table. It currently takes around 3 minutes to process 1,000 rows of table A, so I am parallelising my scripts (x10) to get a result in a more timely manner.
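To make the current approach concrete, here is a minimal sketch of the replay logic described above, in plain Python with hypothetical names (`changelog`, `replay`): parse the JSON snapshot from the insert row, then apply updates grouped by created_on, closing the previous historical row each time a new version appears.

```python
import json
from itertools import groupby

# Changelog rows for object 42, already ordered by created_on:
# (action, field, new_value, created_on, created_by)
changelog = [
    ("insert", None, '{"value_a": "xx", "value_b": "yy", "value_c": "zz"}', "2016-04-01", "Maria"),
    ("update", "value_a", "x", "2016-04-05", "Bob"),
    ("update", "value_b", "y", "2016-05-01", "Jim"),
    ("update", "value_c", "z", "2016-05-01", "Jim"),
]

def replay(object_id, rows):
    history = []
    state = {}
    # Group changes that happened at the same time by the same author,
    # so several field updates collapse into one historical version.
    for (created_on, created_by), group in groupby(rows, key=lambda r: (r[3], r[4])):
        for action, field, new_value, _, _ in group:
            if action == "insert":
                state = json.loads(new_value)   # initial snapshot from JSON
            else:
                state[field] = new_value        # apply one field update
        if history:                             # close the previous version
            history[-1]["valid_to"] = created_on
            history[-1]["modified_by"] = created_by
        history.append({"id": object_id, **state,
                        "valid_from": created_on, "created_by": created_by,
                        "valid_to": None, "modified_by": None})
    return history

rows = replay(42, changelog)
# rows now matches the three historical_A versions shown in the table above.
```

Each dict in `rows` would then be turned into one insert statement for historical_A; in the real scripts this loop runs once per object_id, which is where the per-row cost comes from.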

I suspect that SQL + Python scripting is not the best solution to this problem. Is there a purely SQL solution? Are there any established best practices for such problems?
