Pandas essentials for Machine Learning

Starting your ML journey can be quite overwhelming, and a solid grasp of the pandas library makes it much easier to sail through.


Important resources to Pandas

Introduction to Pandas

  • Pandas is a column-oriented data analysis API.
  • It is supported by many ML libraries in the Python ecosystem.

Pandas has two main data structures: the DataFrame and the Series.

Pandas Series

  • A Series in pandas is a single column of data.

Pandas DataFrame

  • A collection of Series is called a DataFrame.
  • Think of it as a relational data table that has rows and named columns.
  • Each Series in a DataFrame has a name.
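To make the relationship between the two structures concrete, here is a minimal sketch. The names `ages`, `people`, and the sample values are made up for illustration and do not come from any dataset used later.

```python
import pandas as pd

# A Series is a single labelled column of values (hypothetical sample data).
ages = pd.Series([25, 32, 47], name="age")

# A DataFrame groups several columns under column names.
people = pd.DataFrame({"age": ages, "city": ["Pune", "Delhi", "Chennai"]})

# Selecting one column from a DataFrame gives back a Series with that name.
age_column = people["age"]
print(type(age_column).__name__)  # Series
print(age_column.name)            # age
```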

Usage of Pandas

Installing Pandas

The first step for working with any library in Python is installing it.

%pip install pandas

Now let us import it.

import pandas as pd

Creating a DataFrame from Series

courses = pd.Series(['MLF', 'MLT', 'MLP'])
students = pd.Series([100, 200, 150])

register_df = pd.DataFrame({"course_name": courses,
                            "student_count": students})

Here we create two Series, courses and students, and combine them into a DataFrame named register_df.

Loading a built-in sklearn dataset

Let us now load a dataset into pandas. I’ll be using one of sklearn’s built-in datasets, namely the diabetes dataset.

  • Let us first import it.

    from sklearn.datasets import load_diabetes
  • Now let us load the dataset

    # as_frame = True returns the data as a pandas dataframe
    # under the attribute 'data'
    diabetes = load_diabetes(as_frame=True)
    df = diabetes['data']

Functions/Attributes to explore data in Pandas

df.shape

  • Note: Not a function, but an attribute
  • Returns a tuple of the form (rows, columns).

df.columns

  • Note: Not a function, but an attribute
  • Returns an Index object holding the column names of the DataFrame; wrap it in list() if you need a plain list.
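As a quick illustration of these two attributes, the sketch below uses a small hand-made DataFrame (`toy_df` and its values are hypothetical, not the diabetes data) so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy_df = pd.DataFrame({"age": [25, 32, 47], "amount": [100.0, 250.5, 80.0]})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)   # 3 2

# columns is an Index object; list() turns it into a plain Python list.
column_names = list(toy_df.columns)
print(column_names)     # ['age', 'amount']
```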

df.head(n)

  • Parameters:
    • n (optional): The number of rows.
      • Integer
      • Defaults to n = 5
  • Returns
    • The first n rows of the DataFrame

df.tail(n)

  • Parameters:
    • n (optional): The number of rows.
      • Integer
      • Defaults to n = 5
  • Returns
    • The last n rows of the DataFrame
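A short sketch of both calls, assuming a hypothetical ten-row DataFrame called `toy_df`:

```python
import pandas as pd

# Hypothetical toy data: a single column holding the values 0..9.
toy_df = pd.DataFrame({"value": range(10)})

first_rows = toy_df.head()    # no argument, so the default n=5 applies
last_three = toy_df.tail(3)   # explicit n

print(len(first_rows))               # 5
print(last_three["value"].tolist())  # [7, 8, 9]
```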

df.info()

  • Prints a basic summary of the dataset, mainly the non-null count and the datatype of each column.

df.describe()

  • Returns descriptive summary statistics for all the numerical columns.
  • You can also pass in some optional parameters
    • percentiles - A list of the required percentiles, given as fractions between 0 and 1.
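The sketch below shows how the percentiles parameter changes the summary. `toy_df` and its values are hypothetical; the exact set of rows in the result can vary slightly between pandas versions, so only the requested percentiles are checked here.

```python
import pandas as pd

# Hypothetical toy data with one numeric column.
toy_df = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Ask for the 10th and 90th percentiles (given as fractions).
summary = toy_df.describe(percentiles=[0.1, 0.9])

# The summary is itself a DataFrame indexed by statistic name.
print(summary.loc["mean", "amount"])  # 30.0
print("10%" in summary.index, "90%" in summary.index)  # True True
```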

df.select_dtypes(include, exclude)

  • Returns a DataFrame that keeps only the columns of the included datatypes, or drops the columns of the excluded datatypes.
  • Example
    df.select_dtypes(include="int32")
    df.select_dtypes(exclude="float32")

df.Series.dtype

  • Returns the datatype of a particular column, where Series stands for the column's name
  • Example:
    df.amount.dtype

df.Series.astype(new_type)

  • Returns a copy of the particular column converted to the new datatype; assign the result back to the column to keep the change
  • Example:
    df.amount.astype("float64")
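Because astype returns a new Series rather than modifying the column in place, a small sketch may help. The `toy_df` DataFrame and its `amount` column are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical toy data: an integer column.
toy_df = pd.DataFrame({"amount": [100, 250, 80]})

# astype returns a converted copy; the DataFrame itself is untouched so far.
converted = toy_df["amount"].astype("float64")
still_integer = pd.api.types.is_integer_dtype(toy_df["amount"].dtype)
print(still_integer)    # True

# Assign the result back to make the change stick.
toy_df["amount"] = converted
print(toy_df["amount"].dtype)  # float64
```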

Selecting data in Pandas

Selecting Columns

  • 1 column

    df['column_name']

    This returns a Pandas Series.

  • More than one column

    df[['col1', 'col2', ..., 'coln']]

    This returns a Pandas Dataframe.

We can now slice the selected column just like we slice Python lists.

Example:

  • First 100 elements
    df['col'][:100]
  • Last 100 elements
    df['col'][-100:]
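Here is a runnable sketch of the selections above on a hypothetical four-row `toy_df`, using slices of two elements instead of 100 so the output stays readable.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "amount": [100.0, 250.5, 80.0, 40.0],
    "city": ["Pune", "Delhi", "Chennai", "Kochi"],
})

one_col = toy_df["age"]               # single brackets -> Series
two_cols = toy_df[["age", "amount"]]  # list of names -> DataFrame

first_two = toy_df["age"][:2]         # list-style slice: first 2 elements
last_two = toy_df["age"][-2:]         # last 2 elements

print(first_two.tolist())  # [25, 32]
print(last_two.tolist())   # [47, 51]
```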

Selecting Rows

  • loc - Returns the row with that specific index label
    df.loc['index']
    • Adding a column name narrows the selection to a single value
      # This returns the specific value
      df.loc[index, 'column_name']
    • To select more than one column, pass a list of column names
      df.loc[index, ['col1', 'col2']]
  • iloc - Returns the row at that specific integer position in the DataFrame
    df.iloc[int_pos]
    • To select more than one column, pass integer positions or slices for the columns
      df.iloc[int_pos, column_slices]
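The difference between label-based loc and position-based iloc is easiest to see when the index is not just 0, 1, 2. The sketch below assumes a hypothetical `toy_df` with string labels as its index.

```python
import pandas as pd

# Hypothetical toy data with string index labels.
toy_df = pd.DataFrame(
    {"age": [25, 32, 47], "amount": [100.0, 250.5, 80.0]},
    index=["a", "b", "c"],
)

row_b = toy_df.loc["b"]                        # row by label -> Series
amount_b = toy_df.loc["b", "amount"]           # single value
subset_b = toy_df.loc["b", ["age", "amount"]]  # several columns of one row

second_row = toy_df.iloc[1]                    # row by integer position
top_left = toy_df.iloc[:2, :1]                 # first 2 rows, first column

print(amount_b)                   # 250.5
print(second_row.equals(row_b))   # True
print(top_left.shape)             # (2, 1)
```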

Selecting based on Conditions

Pandas also provides an easy way to select values based on conditions such as <, >, <=, >= and so on.

The following block of code returns a boolean Series that is True for every row whose value in the age column is greater than 10.

df.age > 10

To keep only the matching rows, we can combine the loc utility and the above block of code as follows:

df.loc[df.age > 10]

This code will return a dataframe that contains the filtered rows.
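The two steps can be sketched on a hypothetical three-row `toy_df`, which makes it visible that the comparison yields a mask and loc applies it.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"age": [5, 12, 30]})

# The comparison produces one True/False value per row.
mask = toy_df.age > 10
print(mask.tolist())             # [False, True, True]

# loc keeps only the rows where the mask is True.
filtered = toy_df.loc[mask]
print(filtered["age"].tolist())  # [12, 30]
```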

Combining multiple selections

  • AND operator & - Combines two conditions with an AND, i.e., both conditions must be met.
  • OR operator | - Combines two conditions with an OR, i.e., at least one of the conditions must be met.
  • NOT operator ~ - Negates (or inverts) the output of the current condition.
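A minimal sketch of the three operators on a hypothetical `toy_df`. Note that each condition is wrapped in parentheses, because &, | and ~ bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    "age": [25, 55, 60, 40],
    "amount": [5000, 8000, 20000, 12000],
})

# AND: older than 50 and amount below 10000.
both = toy_df.loc[(toy_df.age > 50) & (toy_df.amount < 10000)]

# OR: at least one of the two conditions holds.
either = toy_df.loc[(toy_df.age > 50) | (toy_df.amount < 10000)]

# NOT: invert a condition.
not_over_50 = toy_df.loc[~(toy_df.age > 50)]

print(both.index.tolist())         # [1]
print(either.index.tolist())       # [0, 1, 2]
print(not_over_50.index.tolist())  # [0, 3]
```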

Using lambda functions as selectors

To make the code more readable, we can use lambda functions as selectors. The syntax is fairly simple, as shown in the block below:

condition = lambda df: (df.age > 50) & (df.amount < 10000)
df[condition]

Data Manipulation in Pandas

Adding a column

You can add new columns to the DataFrame, that are derived from other columns as follows:

df['column_name'] = Operation of Pandas Series

For example, if you have a date column with values of the format YYYY-MM-DD and want to extract the month, you can use the following operation:

df['month'] = df['date'].str.slice(5, 7).astype("int64")

The slice method of the .str accessor cuts the two month characters out of the date string, and astype("int64") converts that string to an integer.

  • One more way to add new columns is as a list or a Pandas Series:

    df['new_col'] = [1, 2, ..., n]
    df['new_col2'] = pd.Series([2, 3, 5, ..., 5])

    Just note that a new list must have the same length as the dataset, i.e., one value per row. A new Series is matched to the rows by its index, so it should share the DataFrame's index.
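Both ways of adding a column can be tried on a hypothetical two-row `toy_df`. The column names `date`, `month`, and `row_id` are placeholders chosen for this sketch.

```python
import pandas as pd

# Hypothetical toy data with dates stored as YYYY-MM-DD strings.
toy_df = pd.DataFrame({"date": ["2023-01-15", "2023-11-02"]})

# Derived column: characters 5 and 6 hold the month, converted to an integer.
toy_df["month"] = toy_df["date"].str.slice(5, 7).astype("int64")

# Column from a plain list: one value per row.
toy_df["row_id"] = [1, 2]

print(toy_df["month"].tolist())  # [1, 11]
print(list(toy_df.columns))      # ['date', 'month', 'row_id']
```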

Manipulation of existing data

Let's say I have a column of float values and want every value under a certain threshold to be 0. To accomplish this, we can do the following:

threshold = 0.4
criteria = df.col < threshold # Boolean mask: True for the rows with values less than the threshold
df.loc[criteria, 'col'] = 0
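To see the effect end to end, here is the same pattern applied to a hypothetical `toy_df` with four float values.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"col": [0.1, 0.5, 0.35, 0.9]})

threshold = 0.4
criteria = toy_df["col"] < threshold  # boolean mask, True below the threshold

# Overwrite only the masked rows of the 'col' column.
toy_df.loc[criteria, "col"] = 0

print(toy_df["col"].tolist())  # [0.0, 0.5, 0.0, 0.9]
```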