Saturday, July 23, 2016

Efficient way to dedupe Hive table?

I'm working with a data pipeline similar to Airflow and want a daily task that checks whether a new partition has landed in a table and, if so, creates a new table with all duplicate records removed.

The dataset is quite large, so I'm struggling to come up with an efficient HiveQL query to dedupe it. Simply using a GROUP BY over all the columns is certainly too expensive.
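
To make the question concrete, here is a rough sketch of the naive query described above, followed by one possible cheaper variant. All names are hypothetical and not taken from the actual pipeline: `events` is the source table partitioned by `ds`, `events_deduped` is an existing target table with the same partitioning, and `id` and `payload` stand in for the real columns. The second query also assumes that duplicates never span partitions and that `id` identifies a record, which may not hold for the real data.

```sql
-- Hypothetical schema: events(id, payload) PARTITIONED BY (ds STRING).
-- events_deduped is assumed to exist with the same columns and partitioning.

-- Naive baseline: GROUP BY over every column, rescanning the whole table each day.
CREATE TABLE events_deduped_full AS
SELECT id, payload, ds
FROM events
GROUP BY id, payload, ds;

-- Possible cheaper variant: dedupe only the partition that just landed.
-- Assumes duplicates never cross partitions and that id identifies a record.
INSERT OVERWRITE TABLE events_deduped PARTITION (ds = '2016-07-23')
SELECT id, payload
FROM (
  SELECT id,
         payload,
         -- Number the rows within each id; rn = 1 keeps one arbitrary copy.
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM events
  WHERE ds = '2016-07-23'
) t
WHERE rn = 1;
```

The partition filter is what limits the scan to one day of data; if duplicates can appear across partitions, this variant would miss them and a different approach would be needed.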
