Import modules
import pandas as pd
Create dataframe with duplicates
raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 36, 24, 73],
            'preTestScore': [4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
|   | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
Identify which observations are duplicates
df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
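By default the mask above flags only the later repeats, so row 0 reads False even though it has a twin. A minimal sketch, assuming the same sample data, of how `keep=False` marks every member of a duplicated set and how the mask can be used to inspect or count duplicates (the names `all_dupes`, `dupe_rows`, and `n_extra` are illustrative only):

```python
import pandas as pd

# Same sample data as above, rebuilt so this sketch runs on its own
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 42, 36, 24, 73],
    'preTestScore': [4, 4, 31, 2, 3],
    'postTestScore': [25, 25, 57, 62, 70],
})

# keep=False marks every row in a duplicated set, not only the later repeats
all_dupes = df.duplicated(keep=False)

# Boolean indexing shows the full duplicated group for inspection
dupe_rows = df[all_dupes]

# Summing the default mask counts how many rows a plain drop would remove
n_extra = int(df.duplicated().sum())

print(dupe_rows)
print(n_extra)
```

Looking at the whole group before dropping anything is a cheap way to confirm the rows really are identical and not just similar.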
Drop duplicates
df.drop_duplicates()
|   | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
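Note that the result keeps the original index labels, so there is a gap where row 1 used to be. A minimal sketch, assuming the same sample data, of renumbering the rows after the drop with `reset_index(drop=True)` (the name `deduped` is illustrative only):

```python
import pandas as pd

# Same sample data as above, rebuilt so this sketch runs on its own
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 42, 36, 24, 73],
    'preTestScore': [4, 4, 31, 2, 3],
    'postTestScore': [25, 25, 57, 62, 70],
})

# drop_duplicates returns a new frame; df itself is left unchanged
deduped = df.drop_duplicates()

# drop=True discards the old labels instead of keeping them as a column
deduped = deduped.reset_index(drop=True)

print(deduped)
```

Whether to renumber is a judgment call: keeping the original labels makes it easy to trace rows back to the source data.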
Drop duplicates in the first_name column, keeping the last observation in each duplicated set
df.drop_duplicates(['first_name'], keep='last')
|   | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
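Because the two Jason rows are identical, the only visible effect of `keep='last'` here is that index 1 survives in place of index 0. A minimal sketch, assuming the same sample data, comparing the three `keep` options on the `first_name` column (the names `kept_first`, `kept_last`, and `kept_none` are illustrative only):

```python
import pandas as pd

# Same sample data as above, rebuilt so this sketch runs on its own
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 42, 36, 24, 73],
    'preTestScore': [4, 4, 31, 2, 3],
    'postTestScore': [25, 25, 57, 62, 70],
})

# keep='first' (the default) retains the earliest row of each set
kept_first = df.drop_duplicates(['first_name'], keep='first')

# keep='last' retains the latest row of each set
kept_last = df.drop_duplicates(['first_name'], keep='last')

# keep=False discards every row whose first_name appears more than once
kept_none = df.drop_duplicates(['first_name'], keep=False)

print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

The choice matters more when the rows in a set differ in other columns, for example when the last observation holds the most recent score.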