Thursday, June 30, 2016

How to select entire rows based on distinct columns

I am doing this in Spark.

cityId  PhysicalAddress      EmailAddress         ..many other columns of other meta info...   
1       b st                 something@email.com   
1       b st                 something@email.com   <- some rows can be entirely duplicates
1       a avenue             random@gmail.com
2       c square             anything@yahoo.com
2       d blvd               d@d.com

There is no primary key on this table, and I want to grab one arbitrary row for each distinct cityId.

For example, this is a correct answer:

cityId  PhysicalAddress      EmailAddress        ..many other columns 
1       b st                 something@email.com   
2       c square             anything@yahoo.com

This is also a correct answer:

cityId  PhysicalAddress      EmailAddress       ..many other columns 
1       a avenue             random@gmail.com
2       c square             anything@yahoo.com

One way that comes to mind is to use a group by. However, that requires me to apply an aggregate function (such as min()) to every other column, whereas I just want to pull out one entire row per cityId (it doesn't matter which one).
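
To make the requirement concrete, here is a minimal plain-Python sketch of "keep one whole row per key", using the sample rows from the table above. The helper name `one_row_per_key` and the dict-based row representation are assumptions made for illustration, not part of any Spark API. In Spark itself, `DataFrame.dropDuplicates` accepts a subset of columns (for example `df.dropDuplicates(["cityId"])` in PySpark) and appears to fit this case, though which row it keeps for each key is not guaranteed.

```python
from typing import Any, Dict, Hashable, Iterable, List

Row = Dict[str, Any]


def one_row_per_key(rows: Iterable[Row], key: str) -> List[Row]:
    """Keep the first row seen for each distinct value of `key` (hypothetical helper)."""
    kept: Dict[Hashable, Row] = {}
    for row in rows:
        # setdefault stores the row only when this key value is new,
        # so later rows with the same key are ignored as whole rows.
        kept.setdefault(row[key], row)
    return list(kept.values())


# Sample data copied from the question, including the fully duplicated row.
rows = [
    {"cityId": 1, "PhysicalAddress": "b st", "EmailAddress": "something@email.com"},
    {"cityId": 1, "PhysicalAddress": "b st", "EmailAddress": "something@email.com"},
    {"cityId": 1, "PhysicalAddress": "a avenue", "EmailAddress": "random@gmail.com"},
    {"cityId": 2, "PhysicalAddress": "c square", "EmailAddress": "anything@yahoo.com"},
    {"cityId": 2, "PhysicalAddress": "d blvd", "EmailAddress": "d@d.com"},
]

result = one_row_per_key(rows, "cityId")
for row in result:
    print(row)
```

Unlike a per-column aggregate, this never combines values from different rows: every output row is one of the input rows, unchanged.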
