I'm working with a data pipeline similar to Airflow and want a daily task that checks whether a new partition has landed in a table and then creates a new table with all duplicate records removed.
The dataset is quite large, so I'm struggling to come up with an efficient HiveQL query to dedupe it. Simply using a GROUP BY over all the columns is certainly too expensive.
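To make the goal concrete, here is a rough sketch of the kind of query I have in mind, rendered from Python since the pipeline is Airflow-like. Everything named here is hypothetical: the tables `events_raw` and `events_deduped`, the partition column `ds`, the key column `record_id`, and the ordering column `event_ts`. It assumes each record has a key that identifies duplicates, and it only reads the newly landed partition, so it removes duplicates within that day rather than across the whole table.

```python
def build_dedupe_query(source, target, columns, key_columns, order_column, ds):
    """Render a HiveQL statement that dedupes a single daily partition.

    All table and column names are supplied by the caller; none of them
    come from a real schema. `columns` should not include the partition
    column `ds`, because the INSERT uses a static partition spec.
    """
    select_list = ", ".join(columns)
    key_list = ", ".join(key_columns)
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION (ds='{ds}')\n"
        f"SELECT {select_list}\n"
        f"FROM (\n"
        f"  SELECT {select_list},\n"
        # Number the rows that share a key; the newest row gets rn = 1.
        f"         ROW_NUMBER() OVER (PARTITION BY {key_list} "
        f"ORDER BY {order_column} DESC) AS rn\n"
        f"  FROM {source}\n"
        # Read only the new partition so the full table is never scanned.
        f"  WHERE ds = '{ds}'\n"
        f") ranked\n"
        f"WHERE rn = 1"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_dedupe_query(
        source="events_raw",
        target="events_deduped",
        columns=["record_id", "payload", "event_ts"],
        key_columns=["record_id"],
        order_column="event_ts",
        ds="2024-01-01",
    ))
```

The daily task would run the rendered statement once the partition check succeeds. If duplicates have no usable key, every column would have to go into `key_columns`, which shuffles about as much data as the GROUP BY, so in that case the saving comes only from restricting the scan to one partition.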