Pandas essentials for Machine Learning


Starting your ML journey can feel overwhelming, and a solid grasp of the pandas library makes it much easier to sail through.

Important resources to Pandas

Introduction to Pandas

Pandas is a column-oriented data analysis API.
It is supported by many ML libraries in the Python ecosystem.

Pandas has two main data structures: the DataFrame and the Series.

Pandas Series

A Series in pandas is a single column of data.

Pandas DataFrame

A collection of Series is called a DataFrame.
Think of it as a relational data table with rows and named columns.
Each Series in a DataFrame has a name.

Usage of Pandas

Installing Pandas

The first step for working with any library in Python is installing it.

%pip install pandas

Now let us import it.

import pandas as pd

Creating Dataframe from a Series

courses = pd.Series(['MLF', 'MLT', 'MLP'])
students = pd.Series([100, 200, 150])

# Each dictionary key becomes a column name
register_df = pd.DataFrame({"course_name": courses,
                            "student_count": students})

Here we create two Series, courses and students, and combine them into a DataFrame named register_df.
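As a quick check of the idea above, here is a minimal sketch that rebuilds the same frame and pulls one column back out as a Series. The names student_column and total_students are only illustrative.

```python
import pandas as pd

courses = pd.Series(["MLF", "MLT", "MLP"])
students = pd.Series([100, 200, 150])

# Each dictionary key becomes a column name; each Series becomes that column.
register_df = pd.DataFrame({"course_name": courses, "student_count": students})

# Selecting a single column gives back a Series.
student_column = register_df["student_count"]
total_students = int(student_column.sum())

print(register_df)
print(total_students)  # 450
```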

Loading a built-in sklearn dataset

Let us now load a dataset into a DataFrame. I’ll be using one of sklearn’s built-in datasets, namely the diabetes dataset.

Let us first import it.

from sklearn.datasets import load_diabetes

Now let us load the dataset

# as_frame = True returns the data as a pandas dataframe
# under the attribute 'data'
diabetes = load_diabetes(as_frame=True)
df = diabetes['data']

Functions/Attributes to explore data in Pandas

`df.shape`

Note: Not a function, but an attribute.
Returns a tuple of (rows, columns).

`df.columns`

Note: Not a function, but an attribute.
Returns the column names of the DataFrame as a pandas Index, which can be converted to a plain list with list(df.columns).
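A minimal sketch of both attributes on a small, made-up frame. toy_df and its values are hypothetical and stand in for the df used in this article.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({"age": [25, 40, 61], "amount": [1200.0, 560.5, 9800.0]})

# Both are attributes, so there are no parentheses.
n_rows, n_cols = toy_df.shape
column_names = list(toy_df.columns)  # convert the Index to a plain list

print(n_rows, n_cols)  # 3 2
print(column_names)    # ['age', 'amount']
```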

`df.head(n)`

Parameters:
- n (optional): The number of rows.
  - Integer
  - Default: $n=5$
Returns:
- The first $n$ rows of the DataFrame

`df.tail(n)`

Parameters:
- n (optional): The number of rows.
  - Integer
  - Default: $n=5$
Returns:
- The last $n$ rows of the DataFrame
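A short sketch of head and tail, assuming a hypothetical ten-row frame named toy_df:

```python
import pandas as pd

# Hypothetical data: a single column holding 0..9.
toy_df = pd.DataFrame({"value": range(10)})

first_five = toy_df.head()    # n defaults to 5
last_three = toy_df.tail(3)   # explicit n

print(first_five["value"].tolist())  # [0, 1, 2, 3, 4]
print(last_three["value"].tolist())  # [7, 8, 9]
```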

`df.info()`

Prints a concise summary of the dataset, mainly the number of non-null values and the datatype of each column.

`df.describe()`

Returns descriptive summary statistics for all the numerical columns.
You can also pass some optional parameters:
- percentiles: A list of the required percentiles, given as fractions between 0 and 1.
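To illustrate the percentiles parameter, here is a small sketch on hypothetical data. The printed table is omitted because its exact rows can differ between pandas versions.

```python
import pandas as pd

# Hypothetical numeric column, used only for illustration.
toy_df = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Percentiles are passed as fractions between 0 and 1.
summary = toy_df.describe(percentiles=[0.1, 0.9])

# The result is itself a DataFrame, indexed by statistic name.
mean_amount = summary.loc["mean", "amount"]
print(summary)
```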

`df.select_dtypes(include, exclude)`

Returns a DataFrame that keeps only the columns whose datatypes match include, or that drops the columns whose datatypes match exclude.

Example

df.select_dtypes(include="int32")
df.select_dtypes(exclude="float32")
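The sketch below applies both parameters to a hypothetical frame with mixed column types. It uses "number" and "float64" as selectors, which is an assumption about your data; use whichever dtypes your own columns have.

```python
import pandas as pd

# Hypothetical frame with a text column, an integer column and a float column.
toy_df = pd.DataFrame({
    "name": ["a", "b"],
    "age": [30, 41],
    "score": [0.5, 0.75],
})

# Keep only numeric columns (integers and floats).
numeric_only = toy_df.select_dtypes(include="number")

# Drop the float64 columns and keep everything else.
without_floats = toy_df.select_dtypes(exclude="float64")

print(list(numeric_only.columns))    # ['age', 'score']
print(list(without_floats.columns))  # ['name', 'age']
```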

`df.Series.dtype`

Returns the datatype of a particular column (here Series stands for any column of df).
Example:
```
df.amount.dtype
```

`df.Series.astype(new_type)`

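Returns a copy of the particular column converted to the new datatype. The DataFrame itself is not modified, so assign the result back to keep the change.
Example:
```
df['amount'] = df.amount.astype("float64")
```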
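A minimal sketch of dtype and astype together, assuming a hypothetical integer column named amount. It shows that the conversion only takes effect in the frame once the result is assigned back.

```python
import pandas as pd

# Hypothetical integer column, used only for illustration.
toy_df = pd.DataFrame({"amount": [1, 2, 3]})

# astype returns a new Series; toy_df is untouched at this point.
converted = toy_df["amount"].astype("float64")
dtype_before_assignment = toy_df["amount"].dtype

# Assign the converted column back to make the change stick.
toy_df["amount"] = converted
dtype_after_assignment = toy_df["amount"].dtype

print(dtype_before_assignment, dtype_after_assignment)
```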

Selecting data in Pandas

Selecting Columns

One column
```
df['column_name']
```
This returns a pandas Series.
More than one column
```
df[['col1', 'col2', ..., 'coln']]
```
This returns a pandas DataFrame.

We can now slice the result just like we slice Python lists.

Example:

First 100 elements
```
df['col'][:100]
```
Last 100 elements
```
df['col'][-100:]
```
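The column selections and slices above can be sketched on a small hypothetical frame as follows. The column names and the slice size of 2 are illustrative only.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
toy_df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [10, 20, 30, 40],
    "col3": ["a", "b", "c", "d"],
})

single = toy_df["col1"]              # one column: a Series
several = toy_df[["col1", "col2"]]   # list of columns: a DataFrame

first_two = toy_df["col1"][:2]       # first 2 elements
last_two = toy_df["col1"][-2:]       # last 2 elements

print(first_two.tolist())  # [1, 2]
print(last_two.tolist())   # [3, 4]
```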

Selecting Rows

loc - Returns the row with that specific index label
```
df.loc['index']
```
- Another form also takes a column name:
```
# This returns the specific value
df.loc[index, 'column_name']
```
- To select more than one column, pass a list of column names:
```
df.loc[index, ['col1', 'col2']]
```
iloc - Returns the row at that specific integer position in the DataFrame
```
df.iloc[int_pos]
```
- To select more than one column, pass column positions or slices as well: df.iloc[int_pos, column_slices]
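To make the difference between label-based and position-based selection concrete, here is a sketch on a hypothetical frame with string index labels.

```python
import pandas as pd

# Hypothetical frame with string labels, so labels and positions differ.
toy_df = pd.DataFrame(
    {"age": [25, 40, 61], "amount": [1200.0, 560.5, 9800.0]},
    index=["a", "b", "c"],
)

row_by_label = toy_df.loc["b"]              # row whose label is "b"
value_by_label = toy_df.loc["b", "amount"]  # a single value
row_by_position = toy_df.iloc[1]            # second row, whatever its label

# Rows 0 and 1, first column only, all selected by position.
slice_by_position = toy_df.iloc[0:2, 0:1]

print(value_by_label)  # 560.5
```

Here loc["b"] and iloc[1] return the same row because "b" happens to be the second label.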

Selecting based on Conditions

Pandas also provides an easy way to select values based on conditions like $<, >, \leq, \geq$ and so on.

The following block of code returns a boolean Series that is True for every row whose value in the age column is greater than 10.

df.age > 10

To keep only the matching rows, we can combine loc with the above block of code as follows:

df.loc[df.age > 10]

This code will return a dataframe that contains the filtered rows.
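A small sketch of the two steps, assuming a hypothetical age column: the comparison builds a boolean mask first, and loc then keeps the rows where the mask is True.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
toy_df = pd.DataFrame({"age": [5, 12, 30]})

mask = toy_df.age > 10        # boolean Series, one entry per row
filtered = toy_df.loc[mask]   # rows where the mask is True

print(mask.tolist())             # [False, True, True]
print(filtered["age"].tolist())  # [12, 30]
```

The filtered frame keeps the original index labels of the rows that survived.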

Combining multiple selections

AND operator & - Combines two conditions with an AND, i.e., both conditions must be met.
OR operator | - Combines two conditions with an OR, i.e., at least one of the conditions must be met.
NOT operator ~ - Negates (or inverts) the output of the current condition.
Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as > and <.
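The three operators can be sketched on hypothetical age and amount columns like this. Each condition sits in its own parentheses.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
toy_df = pd.DataFrame({"age": [20, 55, 70], "amount": [500, 20000, 3000]})

# AND: both conditions must hold.
both = toy_df[(toy_df.age > 50) & (toy_df.amount < 10000)]

# OR: at least one condition must hold.
either = toy_df[(toy_df.age > 50) | (toy_df.amount < 10000)]

# NOT: invert a condition.
negated = toy_df[~(toy_df.age > 50)]

print(both["age"].tolist())     # [70]
print(either["age"].tolist())   # [20, 55, 70]
print(negated["age"].tolist())  # [20]
```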

Using lambda functions as selectors

To make the code more readable, we can use lambda functions as selectors. The syntax is fairly simple, as shown in the block below:

condition = lambda df: (df.age > 50) & (df.amount < 10000)
df[condition]

Data Manipulation in Pandas

Adding a column

You can add new columns to the DataFrame, that are derived from other columns as follows:

df['column_name'] = Operation of Pandas Series

For example, if you have a date column with values in the format YYYY-MM-DD and want to extract the month, you can use the following operation:

df['month'] = df['date'].str.slice(5, 7).astype("int64")

The slice method of the str accessor extracts the two month characters from the date string, and astype("int64") converts that string to an integer.
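Here is the string-slicing approach on hypothetical dates, next to an alternative that is not part of the original walkthrough: parsing the column with pd.to_datetime and reading the month through the dt accessor. Both column names are illustrative.

```python
import pandas as pd

# Hypothetical date strings in YYYY-MM-DD format.
toy_df = pd.DataFrame({"date": ["2023-01-15", "2023-11-02"]})

# Approach from the text: characters 5 and 6 hold the month.
toy_df["month_from_slice"] = toy_df["date"].str.slice(5, 7).astype("int64")

# Alternative (an assumption, not the article's method): parse, then use .dt.month.
parsed = pd.to_datetime(toy_df["date"], format="%Y-%m-%d")
toy_df["month_from_datetime"] = parsed.dt.month

print(toy_df["month_from_slice"].tolist())     # [1, 11]
print(toy_df["month_from_datetime"].tolist())  # [1, 11]
```

The datetime route also validates the values, so a malformed date raises an error rather than silently producing a wrong month.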

One more way to add new columns is as a list or a Pandas Series:
```
df['new_col'] = [1, 2, ..., n]
df['new_col2'] = pd.Series([2, 3, 5, ..., 5])
```
Just note that a list must have the same length as the dataset, i.e., one value per row. A Series is aligned on the index instead, so any row whose label is missing from the Series gets NaN.
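The sketch below, on a hypothetical three-row frame, shows how the size requirement plays out: a list of the wrong length is rejected, while a shorter Series is matched on index labels and leaves a gap.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration.
toy_df = pd.DataFrame({"value": [10, 20, 30]})

# A list with one entry per row works.
toy_df["from_list"] = [1, 2, 3]

# A Series is matched on index labels; label 2 is missing, so that row is NaN.
toy_df["from_series"] = pd.Series([7, 8], index=[0, 1])

# A list of the wrong length raises a ValueError.
length_error = None
try:
    toy_df["too_short"] = [1, 2]
except ValueError as error:
    length_error = str(error)

print(toy_df)
print(length_error)
```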

Manipulation of existing data

Let's say we have a column of float values and want every value under a certain threshold set to $0$. To accomplish this, we can do the following:

threshold = 0.4
criteria = df.col < threshold # Boolean mask: True for rows with values less than the threshold
df.loc[criteria, 'col'] = 0
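The same steps run end to end on a hypothetical column, so the effect is visible:

```python
import pandas as pd

# Hypothetical float column, used only for illustration.
toy_df = pd.DataFrame({"col": [0.1, 0.5, 0.39, 0.9]})

threshold = 0.4
criteria = toy_df.col < threshold   # boolean mask, True where value < 0.4

# Overwrite only the rows selected by the mask, in the "col" column.
toy_df.loc[criteria, "col"] = 0

print(toy_df["col"].tolist())  # [0.0, 0.5, 0.0, 0.9]
```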
