- Creating DataFrames
- Cleaning Data
- Exploring Data
- Indexing and Selecting Data
- Grouping and Transforming Data
Creating DataFrames
Specify header while reading a tab-delimited file:
cols = ["col1", "col2", "col3" ]
df = pd.read_csv("data.tab", sep="\t", names=cols)
Cleaning Data
Rename several DataFrame columns:
df = df.rename(columns = {
'col1 old name':'col1 new name',
'col2 old name':'col2 new name',
'col3 old name':'col3 new name',
})
Get a report of all duplicate records in a DataFrame, based on specific columns:
dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)]
Remove duplicates by column, keeping first entry:
df = df.drop_duplicates(subset='col', keep='first')
Reset a DataFrame’s index to continuous integers, without saving old index (eg. after deleting records).
df.reset_index(drop=True, inplace=True)
Clean up missing values in multiple DataFrame columns:
df = df.fillna({
'col1': 'missing',
'col2': 999,
'col3': 0,
'col4': 'missing',
})
Rename several DataFrame columns:
df = df.rename(columns = {
'col1 old name':'col1 new name',
'col2 old name':'col2 new name',
'col3 old name':'col3 new name',
})
Exploring Data
Sort DataFrame on a column:
df.sort_values(by='col1', ascending=False)
Indexing and Selecting Data
Accessing an index from a multi-index DataFrame:
pandas.MultiIndex.get_level_values
Grouping and Transforming Data
Group by two columns and count total in groups:
df.groupby(['col1', 'col2']).count()
Group and then aggregate all columns with a function:
df.groupby('col').agg(function)
Group and aggregate specific columns with specific functions:
df.groupby('col').agg({"col1": np.sum, "col2": pd.Series.nunique})
Spreadsheet-style pivot with aggregation:
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)