There is a table A. One row from table A looks like the following:
+----+---------+---------+---------+------------+------------+------------------+------------------+
| id | value_a | value_b | value_c | created_on | created_by | last_modified_on | last_modified_by |
+----+---------+---------+---------+------------+------------+------------------+------------------+
| 42 | x       | y       | z       | 2016-04-01 | Maria      | 2016-05-01       | Jim              |
+----+---------+---------+---------+------------+------------+------------------+------------------+
So, table A contains only the latest values.
There is also a table called changelog, which stores all the changes/updates made to table A. The changelog records for table A look like the following:
+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+
| id  | object_id | action | field   | old_value | new_value                               | created_on | created_by |
+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+
| 234 | 42        | insert | NULL    | NULL      | {value_a: xx, value_b: yy, value_c: zz} | 2016-04-01 | Maria      |
| 456 | 42        | update | value_a | xx        | x                                       | 2016-04-05 | Bob        |
| 467 | 42        | update | value_b | yy        | y                                       | 2016-05-01 | Jim        |
| 678 | 42        | update | value_c | zz        | z                                       | 2016-05-01 | Jim        |
+-----+-----------+--------+---------+-----------+-----------------------------------------+------------+------------+
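For reference, here is a simplified sketch of how I picture the two source tables; the exact column types are my assumptions, not the real DDL:

    -- Simplified source schemas; types are assumptions.
    CREATE TABLE A (
        id               integer PRIMARY KEY,
        value_a          text,
        value_b          text,
        value_c          text,
        created_on       date,
        created_by       text,
        last_modified_on date,
        last_modified_by text
    );

    CREATE TABLE changelog (
        id         integer PRIMARY KEY,
        object_id  integer REFERENCES A (id),
        action     text,  -- 'insert' or 'update'
        field      text,  -- NULL for inserts, the changed column for updates
        old_value  text,
        new_value  text,  -- a JSON blob of all initial values for 'insert' rows
        created_on date,
        created_by text
    );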
I need to create a historical_A table, which for this specific record should look as follows (the last row is the currently valid version, so valid_to and modified_by are empty):
+----+---------+---------+---------+------------+------------+------------+--------------+
| id | value_a | value_b | value_c | valid_from | created_by | valid_to   | modified_by  |
+----+---------+---------+---------+------------+------------+------------+--------------+
| 42 | xx      | yy      | zz      | 2016-04-01 | Maria      | 2016-04-05 | Bob          |
| 42 | x       | yy      | zz      | 2016-04-05 | Bob        | 2016-05-01 | Jim          |
| 42 | x       | y       | z       | 2016-05-01 | Jim        |            |              |
+----+---------+---------+---------+------------+------------+------------+--------------+
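The DDL I have in mind for historical_A is roughly the following; again, the types are assumptions chosen to match the sketch above:

    -- Target table: one row per version of each record in A.
    CREATE TABLE historical_A (
        id          integer,  -- A.id; deliberately not unique here
        value_a     text,
        value_b     text,
        value_c     text,
        valid_from  date NOT NULL,
        created_by  text,
        valid_to    date,     -- NULL for the currently valid version
        modified_by text
    );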
Table A has about 1,500,000 rows; the changelog for table A has about 27,000,000 rows.
Currently I am doing the initial transformation (load) using both SQL and Python scripting. Basically, I generate an INSERT statement for the initial row (by parsing the JSON from the changelog's insert record), and then generate all subsequent INSERT statements by grouping on the created_on column of the changelog table.
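To illustrate, for the example record above the scripts end up emitting statements along these lines (values taken from the tables above):

    -- Generated for id 42: one INSERT per version.
    INSERT INTO historical_A (id, value_a, value_b, value_c, valid_from, created_by, valid_to, modified_by)
    VALUES (42, 'xx', 'yy', 'zz', '2016-04-01', 'Maria', '2016-04-05', 'Bob');

    INSERT INTO historical_A (id, value_a, value_b, value_c, valid_from, created_by, valid_to, modified_by)
    VALUES (42, 'x', 'yy', 'zz', '2016-04-05', 'Bob', '2016-05-01', 'Jim');

    INSERT INTO historical_A (id, value_a, value_b, value_c, valid_from, created_by, valid_to, modified_by)
    VALUES (42, 'x', 'y', 'z', '2016-05-01', 'Jim', NULL, NULL);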
It currently takes around 3 minutes to process 1,000 rows of table A, so I am parallelising my scripts (10 executions at a time) to get a result in a more timely manner.
I suspect that SQL + Python scripting is not the best solution to this problem. Is there a pure-SQL solution? Are there any established best practices for such problems?
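For what it is worth, the kind of set-based query I imagine could replace the scripting is sketched below. It is untested, PostgreSQL-flavoured SQL, and it assumes the JSON of each 'insert' row has already been exploded into one (object_id, field, new_value) row per column in a staging step:

    WITH field_changes AS (
        -- one row per (object, field, timestamp); 'insert' rows are
        -- assumed to be exploded into per-field rows beforehand
        SELECT object_id, field, new_value, created_on, created_by
        FROM changelog
        WHERE action = 'update'
        -- UNION ALL SELECT ... FROM exploded_insert_rows
    ),
    versions AS (
        -- every distinct change timestamp starts a new version
        SELECT DISTINCT object_id, created_on AS valid_from, created_by
        FROM field_changes
    )
    SELECT v.object_id AS id,
           (SELECT fc.new_value
              FROM field_changes fc
             WHERE fc.object_id = v.object_id
               AND fc.field = 'value_a'
               AND fc.created_on <= v.valid_from
             ORDER BY fc.created_on DESC
             LIMIT 1) AS value_a,
           -- ... the same correlated subquery for value_b and value_c ...
           v.valid_from,
           v.created_by,
           LEAD(v.valid_from) OVER w AS valid_to,
           LEAD(v.created_by) OVER w AS modified_by
      FROM versions v
    WINDOW w AS (PARTITION BY v.object_id ORDER BY v.valid_from);

I have no idea how the correlated subqueries would behave against 27,000,000 changelog rows, which is part of why I am asking.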