Pandas essentials for Machine Learning
Starting your ML journey can be quite overwhelming, and a solid grasp of the pandas library makes it much easier to sail through.
Important resources for Pandas
Introduction to Pandas
- A column-oriented data analysis API.
- It is supported by many ML libraries in the Python ecosystem.
Pandas has two main structures to be discussed, one is the DataFrame and the other is the Series.
Pandas Series
- Series in pandas is a single column.
Pandas DataFrame
- A collection of Series is called a DataFrame.
- Think of it as a relational data table that has rows and named columns.
- Each Series in a dataframe has a name.
Usage of Pandas
Installing Pandas
The first step for working with any library in Python is installing it.
%pip install pandas
Now let us import it.
import pandas as pd
Creating a DataFrame from Series
courses = pd.Series(['MLF', 'MLT', 'MLP'])
students = pd.Series([100, 200, 150])
register_df = pd.DataFrame({ "course_name": courses,
"student_count": students})
Here we create two Series, namely courses and students, and combine them into a DataFrame named register_df.
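As a quick check of the construction above, the following sketch rebuilds register_df and confirms that each column is itself a named Series. The printed total is only there to illustrate that a column behaves like any other Series.

```python
import pandas as pd

courses = pd.Series(["MLF", "MLT", "MLP"])
students = pd.Series([100, 200, 150])

# Each dict key becomes a column name; the Series are aligned on their index.
register_df = pd.DataFrame({"course_name": courses, "student_count": students})

print(register_df)

# Selecting one column gives back a Series whose name is the column name.
course_col = register_df["course_name"]
print(type(course_col).__name__, course_col.name)

# Series methods work directly on a column.
total_students = register_df["student_count"].sum()
print(total_students)  # 450
```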
Loading a built-in sklearn dataset
Let us now load a dataset into a DataFrame. I'll be using one of sklearn's built-in datasets, namely the diabetes dataset.
- Let us first import it.
from sklearn.datasets import load_diabetes
- Now let us load the dataset
# as_frame=True returns the data as a pandas DataFrame
# under the attribute 'data'
diabetes = load_diabetes(as_frame=True)
df = diabetes['data']
Functions/Attributes to explore data in Pandas
df.shape
- Note: Not a function, but an attribute
- Returns a tuple of (number of rows, number of columns).
df.columns
- Note: Not a function, but an attribute
- Returns the column names of the DataFrame as a list-like Index object.
df.head(n)
- Parameters:
- n (optional): The number of rows to return.
- Type: integer
- Default: 5
- Returns
- The first n rows of the DataFrame
df.tail(n)
- Parameters:
- n (optional): The number of rows to return.
- Type: integer
- Default: 5
- Returns
- The last n rows of the DataFrame
df.info()
- Prints a concise summary of the dataset, mainly the number of non-null values in each column and the datatypes of the columns.
df.describe()
- Returns summary statistics (count, mean, standard deviation, min, percentiles, max) for all the numerical columns.
- You can also pass in some optional parameters
percentiles
- A list of the required percentiles, given as fractions between 0 and 1.
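The exploration helpers above can be tried out on a small frame. This is a minimal sketch that uses a hypothetical toy_df (made-up values, not the diabetes dataset) so the results are easy to verify by eye.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 38],
    "amount": [1200.0, 8500.0, 430.0, 15000.0, 9900.0, 2750.0],
})

print(toy_df.shape)          # (6, 2)
print(list(toy_df.columns))  # ['age', 'amount']

first_two = toy_df.head(2)   # first 2 rows
last_two = toy_df.tail(2)    # last 2 rows
print(first_two)
print(last_two)

# info() prints its summary directly instead of returning it.
toy_df.info()

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = toy_df.describe(percentiles=[0.1, 0.9])
print(summary)
```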
df.select_dtypes(include, exclude)
- Returns a subset of the DataFrame that keeps only the columns whose datatypes match include, or drops the columns whose datatypes match exclude.
- Example
df.select_dtypes(include="int32")
df.select_dtypes(exclude="float32")
df.Series.dtype
- Returns the datatype of the particular column
- Example:
df.amount.dtype
df.Series.astype(new_type)
- Returns a copy of the particular column converted to the new datatype. Assign the result back to the column to keep the change.
- Example:
df.amount.astype("float64")
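The datatype helpers can be combined as in the sketch below. The toy_df frame and its columns are hypothetical, with amount deliberately stored as strings so that the conversion has a visible effect.

```python
import pandas as pd

# Hypothetical toy data: 'amount' holds numbers stored as strings.
toy_df = pd.DataFrame({
    "age": [25, 32, 47],
    "amount": ["1200.5", "8500", "430.25"],
    "city": ["Pune", "Delhi", "Chennai"],
})

# Before conversion only 'age' counts as numeric.
numeric_before = toy_df.select_dtypes(include="number")
print(list(numeric_before.columns))  # ['age']

# astype returns a new Series, so assign it back to keep the change.
toy_df["amount"] = toy_df["amount"].astype("float64")
print(toy_df["amount"].dtype)  # float64

numeric_after = toy_df.select_dtypes(include="number")
print(list(numeric_after.columns))  # ['age', 'amount']
```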
Selecting data in Pandas
Selecting Columns
- 1 column
df['column_name']
This returns a Pandas Series.
- More than 1 column
df[['col1', 'col2', ..., 'coln']]
This returns a Pandas Dataframe.
A selected column can be sliced just like a Python list.
Example:
- First 100 elements
df['col'][:100]
- Last 100 elements
df['col'][-100:]
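To make the difference between the two selection forms concrete, here is a small sketch on a hypothetical toy_df. It uses slices of 3 and 2 elements rather than 100 only because the toy frame is short.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "amount": [1200.0, 8500.0, 430.0, 15000.0, 9900.0],
    "city": ["Pune", "Delhi", "Chennai", "Pune", "Delhi"],
})

one_col = toy_df["age"]               # single brackets -> Series
two_cols = toy_df[["age", "amount"]]  # list of names -> DataFrame
print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame

first_three = toy_df["age"][:3]  # first 3 elements
last_two = toy_df["age"][-2:]    # last 2 elements
print(first_three.tolist())  # [25, 32, 47]
print(last_two.tolist())     # [51, 62]
```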
Selecting Rows
loc
- Return the row at that specific index label
df.loc['index']
- One more implementation is
# This returns the specific value
df.loc[index, 'column_name']
- One more implementation to select more than 1 column is
df.loc[index, ['col1', 'col2']]
iloc
- Return the row at that specific integer position in the DataFrame
df.iloc[int_pos]
- One more implementation to select more than 1 column is
df.iloc[int_pos, column_slice]
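The difference between label-based and position-based row selection is easiest to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical toy_df with string labels "a", "b", "c" as its index.

```python
import pandas as pd

# Hypothetical toy data with string labels as the index.
toy_df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "amount": [1200.0, 8500.0, 430.0],
        "city": ["Pune", "Delhi", "Chennai"],
    },
    index=["a", "b", "c"],
)

# loc works with labels.
row_b = toy_df.loc["b"]                        # whole row as a Series
amount_b = toy_df.loc["b", "amount"]           # single value
subset_b = toy_df.loc["b", ["age", "amount"]]  # several columns of one row

# iloc works with integer positions.
row_pos1 = toy_df.iloc[1]             # same row as label "b"
first_two_cols = toy_df.iloc[1, 0:2]  # row 1, columns at positions 0 and 1

print(amount_b)          # 8500.0
print(row_pos1["city"])  # Delhi
```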
Selecting based on Conditions
Pandas also provides an easy way to select values based on conditions such as comparisons (greater than, less than, equal to, and so on).
The following block of code returns a boolean Series that is True for every row whose value in the age column is greater than 10.
df.age > 10
To keep only the matching rows, we can combine the loc utility and the above block of code as follows:
df.loc[df.age > 10]
This code will return a dataframe that contains the filtered rows.
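A minimal sketch of this two-step pattern, using a hypothetical toy_df, shows what the intermediate comparison actually produces before it is passed to loc.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({
    "age": [8, 15, 42, 10],
    "amount": [50.0, 300.0, 9000.0, 120.0],
})

# The comparison yields one True/False value per row.
mask = toy_df.age > 10
print(mask.tolist())  # [False, True, True, False]

# Passing the mask to loc keeps only the rows marked True.
filtered = toy_df.loc[mask]
print(filtered.index.tolist())  # [1, 2]
```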
Combining multiple selections
- AND operator
&
- Combine two conditions with AND, i.e., both the conditions should be met.
- OR operator
|
- Combine two conditions with OR, i.e., at least one of the conditions should be met.
- NOT operator
~
- Negates (or inverts) the output of the current condition.
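The three operators can be sketched on a hypothetical toy_df as follows. Each comparison is wrapped in parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({
    "age": [25, 55, 61, 70],
    "amount": [500.0, 20000.0, 7000.0, 300.0],
})

# AND: both conditions must hold.
both = toy_df[(toy_df.age > 50) & (toy_df.amount < 10000)]
print(both["age"].tolist())  # [61, 70]

# OR: at least one condition must hold.
either = toy_df[(toy_df.age > 50) | (toy_df.amount < 10000)]
print(either["age"].tolist())  # [25, 55, 61, 70]

# NOT: invert a condition.
not_over_50 = toy_df[~(toy_df.age > 50)]
print(not_over_50["age"].tolist())  # [25]
```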
Using lambda functions as selectors
To make the code look a bit better, we can use lambda functions as selectors. The syntax is fairly simple as shown in the block below
condition = lambda df: (df.age > 50) & (df.amount < 10000)
df[condition]
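One practical benefit of a lambda selector is that the same condition can be reused on different DataFrames, since it is evaluated against whichever frame it is applied to. The sketch below assumes two hypothetical frames, train_df and test_df, that share the age and amount columns.

```python
import pandas as pd

# The selector is not tied to any particular DataFrame.
condition = lambda df: (df.age > 50) & (df.amount < 10000)

# Hypothetical frames used only for illustration.
train_df = pd.DataFrame({"age": [45, 61], "amount": [100.0, 7000.0]})
test_df = pd.DataFrame({"age": [70, 80], "amount": [300.0, 50000.0]})

# A callable works with plain brackets and with loc.
train_selected = train_df[condition]
test_selected = test_df.loc[condition]

print(train_selected)
print(test_selected)
```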
Data Manipulation in Pandas
Adding a column
You can add new columns that are derived from other columns to the DataFrame as follows:
df['column_name'] = Operation of Pandas Series
For example, if you have a date column with values of the format YYYY-MM-DD and you want to extract the month, you can use the following operation:
df['month'] = df['date'].str.slice(5, 7).astype("int64")
The slice method from the .str accessor is used to cut the month out of the date string, and the cast to int64 turns that string into an integer.
- One more way to add new columns is as a list or a Pandas Series:
df['new_col'] = [1, 2, ..., n]
df['new_col2'] = pd.Series([2, 3, 5, ..., 5])
Just note that a list you add must be of the same size as the dataset, i.e., have one value per row. A Series is aligned on the index instead, so rows without a matching index label get NaN.
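The ways of adding a column described above can be sketched on a hypothetical toy_df with three date strings. The pd.to_datetime variant is an alternative that is not used in the text above; it is shown only for comparison and assumes the strings are valid dates.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({"date": ["2023-01-15", "2023-07-04", "2024-12-31"]})

# Derived column: characters 5..6 of 'YYYY-MM-DD' hold the month.
toy_df["month"] = toy_df["date"].str.slice(5, 7).astype("int64")

# Alternative (not from the text above): parse the dates, then read the month.
toy_df["month_dt"] = pd.to_datetime(toy_df["date"]).dt.month

# Column from a plain list: needs exactly one value per row.
toy_df["row_id"] = [1, 2, 3]

print(toy_df)
```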
Manipulation of existing data
Let's say I have a column of float values. I now want the values under a certain threshold to be 0. To accomplish this, we can do the following:
threshold = 0.4
criteria = df.col < threshold # Boolean mask: True for the rows with values less than the threshold
df.loc[criteria, 'col'] = 0
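Run on a hypothetical toy_df, the thresholding pattern behaves as sketched below. Note that a value exactly equal to the threshold is left unchanged because the comparison is strict.

```python
import pandas as pd

# Hypothetical toy data used only for illustration.
toy_df = pd.DataFrame({"col": [0.1, 0.5, 0.39, 0.4, 0.9]})

threshold = 0.4
criteria = toy_df.col < threshold  # boolean mask, True where value < 0.4

# Overwrite only the masked rows of 'col'.
toy_df.loc[criteria, "col"] = 0

print(toy_df["col"].tolist())  # [0.0, 0.5, 0.0, 0.4, 0.9]
```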