Delete duplicate rows in a pandas DataFrame

Import modules

import pandas as pd

Create a DataFrame with duplicates (rows 0 and 1 are identical)

raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 42, 36, 24, 73],
        'preTestScore': [4, 4, 31, 2, 3],
        'postTestScore': [25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70

Identify which observations are duplicates

df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool
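
By default, duplicated() marks only the repeats and leaves the first occurrence as False. As a minimal sketch using the same data as above, passing keep=False flags every member of a duplicated group, which can help when you want to inspect the rows before deleting anything. The variable names all_dupes, dup_rows and n_repeats are illustrative and not part of the original post.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 42, 36, 24, 73],
    'preTestScore': [4, 4, 31, 2, 3],
    'postTestScore': [25, 25, 57, 62, 70],
})

# keep=False marks every row that belongs to a duplicated group,
# including the first occurrence
all_dupes = df.duplicated(keep=False)

# Boolean indexing selects the flagged rows for inspection
dup_rows = df[all_dupes]

# With the default keep='first', summing the mask counts the repeated rows
n_repeats = int(df.duplicated().sum())
```

Here dup_rows contains both Jason Miller rows (index 0 and 1), while n_repeats counts only the one repeated row.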

Drop duplicates

df.drop_duplicates()
  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
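
Note that the result keeps the original index labels (0, 2, 3, 4) and that drop_duplicates() returns a new DataFrame rather than changing df. A short sketch, assuming you want a clean 0..n-1 index afterwards, is to chain reset_index(drop=True). The name deduped is illustrative.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
    'age': [42, 42, 36, 24, 73],
    'preTestScore': [4, 4, 31, 2, 3],
    'postTestScore': [25, 25, 57, 62, 70],
})

# drop_duplicates() returns a new DataFrame; df itself is left untouched.
# reset_index(drop=True) renumbers the rows and discards the old labels.
deduped = df.drop_duplicates().reset_index(drop=True)
```

To keep the result, assign it to a variable as shown here; calling df.drop_duplicates() on its own only displays it.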

Drop duplicates based on the first_name column only, keeping the last observation in each duplicated set

df.drop_duplicates(['first_name'], keep='last')
  first_name last_name  age  preTestScore  postTestScore
1      Jason    Miller   42             4             25
2       Tina       Ali   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70
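
In the data above the two Jason rows are identical, so keep='last' changes only which index label survives (1 instead of 0). The following sketch uses a small hypothetical DataFrame, not from the original post, in which the duplicated rows differ in postTestScore, to show how the choice of keep affects which values are retained.

```python
import pandas as pd

# Hypothetical data: the two Jason rows have different postTestScore values
scores = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina'],
    'postTestScore': [25, 30, 57],
})

# keep='first' (the default) retains the first row of each duplicated set
first = scores.drop_duplicates(subset=['first_name'], keep='first')

# keep='last' retains the last row of each duplicated set
last = scores.drop_duplicates(subset=['first_name'], keep='last')

# keep=False drops every row that has a duplicate in the subset columns
unique_only = scores.drop_duplicates(subset=['first_name'], keep=False)
```

With this data, first keeps Jason's score of 25, last keeps 30, and unique_only contains only Tina.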