
Pandas Library for Machine Learning in Python

Pandas is an open-source library built on NumPy that provides easy-to-use data structures and data analysis tools for machine learning. It is the backbone of most ML projects. In machine learning, we use pandas to import, clean, transform, and analyze data. This article covers essential pandas functions with examples in Python.

Features of Pandas Library in Python

Here are the main features of the pandas library in Python.

  • User-friendly.
  • Easy-to-use data structures.
  • Includes tools to work with time-series and numerical data.
  • Efficiently works with large data sets and other libraries.

Getting started with Pandas library in python

How to Install Pandas?

conda install pandas

pip install pandas

How to import Pandas Library in Python?

import pandas as pd

Series and DataFrame in Pandas

A Series is a single column, whereas a DataFrame is a multi-dimensional table, i.e. a collection of Series. Most pandas operations are performed on DataFrames.
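As a quick sketch (with made-up scores), selecting one column of a DataFrame hands back a Series:

```python
import pandas as pd

# A Series is a single labeled column of data
scores = pd.Series([80, 93, 49], name="math-score")

# A DataFrame is a collection of Series that share the same index
df = pd.DataFrame({"math-score": [80, 93, 49], "english-score": [83, 83, 42]})

# Selecting one column of a DataFrame returns a Series
column = df["math-score"]
print(type(column))  # <class 'pandas.core.series.Series'>
```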

Uses of Pandas Library in Machine Learning

Pandas is built on NumPy and has similar functions. We use pandas for the following tasks.

  • Import, read and create data.
  • Exploratory Data Analysis: View & plot the data, Get insight on data.
  • Data Cleaning.
  • Feed data to machine learning algorithms.

Create DataFrame: Create, read and show data

Create a dictionary in Python

# Create a Python dictionary
data = {"math-score": [80, 93, 49, 62], "english-score": [83, 83, 42, 61]}
print(data)

{'math-score': [80, 93, 49, 62], 'english-score': [83, 83, 42, 61]}

# Add an index to a Python dictionary
index_values = ["john", "hari", "mack", "ram"]
data_with_index = {index: scores for index, scores in zip(index_values, zip(*data.values()))}
print(data_with_index)

{'john': (80, 83), 'hari': (93, 83), 'mack': (49, 42), 'ram': (62, 61)}
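As an aside, such a name-keyed dictionary can itself be turned into a DataFrame using from_dict with orient='index'; the column names below are assumptions for illustration:

```python
import pandas as pd

data_with_index = {"john": (80, 83), "hari": (93, 83), "mack": (49, 42), "ram": (62, 61)}

# orient='index' treats each dictionary key as a row label
df = pd.DataFrame.from_dict(data_with_index, orient="index",
                            columns=["math-score", "english-score"])
print(df)
```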

Create DataFrame from Dictionary

# Create a DataFrame from a Python dictionary
data_frame = pd.DataFrame(data)
print(data_frame)
   math-score  english-score
0          80             83
1          93             83
2          49             42
3          62             61

Create DataFrame from dictionary and add index

# This will convert a Python dictionary into a DataFrame and add the index
df_dict_index = pd.DataFrame(data, index=["john", "hari", "mack", "ram"])
print(df_dict_index)
  math-score  english-score
john          80             83
hari          93             83
mack          49             42
ram           62             61

Create a DataFrame in Python from input data

df = pd.DataFrame({"math-score": [80, 93, 49, 62], "english-score": [83, 83, 42, 61]})

# Display the DataFrame
print(df)

Create a DataFrame from a NumPy array

import numpy as np

# Create a NumPy array
numpy_array = np.array([[80, 83], [93, 83], [49, 42], [62, 61]])

# Create a DataFrame from the NumPy array
df = pd.DataFrame(numpy_array, columns=["math-score", "english-score"])

# Display the DataFrame
print(df)

Add index to an existing dataframe

#Add index to existing dataframe
df = df.set_index(pd.Index(["john", "hari", 'mack', "ram"]))

# Display the DataFrame
print(df)

Import data in Pandas from a .csv file

# Import data from a csv file

file_path = "example.csv" # Specify the path to your CSV file

# Read data from CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the DataFrame
print(df)
# Read data from CSV file into a DataFrame with an index
df = pd.read_csv(file_path, index_col="index_col_name")

Import data in Pandas from a .json file

import pandas as pd

# Specify the path to JSON file
file_path = "example.json"

# Read data from JSON file into a DataFrame
df = pd.read_json(file_path, orient='records')

# Display the DataFrame
print(df)

Read data in pandas from SQL database

We first need to create a connection with the SQL database before reading data from it. We can create the connection in the following steps:

Step-1: No installation is needed for SQLite, since Python's standard library already includes the sqlite3 module.
Step-2: Make a connection with SQL server
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('your_database.db')
#Replace 'your_database.db' with the actual database file.
Step-3: Select the data and convert it into a dataframe
# Write your SQL query
query = 'SELECT * FROM your_table'
#Replace 'your_table' with the actual table name

# Read data from SQL database into a DataFrame
df = pd.read_sql(query, conn)

# Close the database connection
conn.close()

# Display the DataFrame
print(df)

Convert a DataFrame into a CSV file

# Specify the path to save the CSV file
csv_file_path = "output.csv"

# Save the DataFrame to CSV file
df.to_csv(csv_file_path, index=False)

Convert a DataFrame into a JSON file

# Specify the path to save the JSON file
json_file_path = "output.json"

# Save the DataFrame to a JSON file
df.to_json(json_file_path, orient="records")

Combine Dataset

Let's first create two DataFrames, df1 and df2. Each DataFrame has two columns. We will join these DataFrames in various ways, which is very useful during machine learning data wrangling.

import pandas as pd

# Create a DataFrame with two columns ('A' and 'B') and three rows
data_1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df1 = pd.DataFrame(data_1)

# Create a DataFrame with two columns ('C' and 'D') and three rows
data_2 = {'C': ["k", "l", "m"], 'D': ["n", "o", "p"]}
df2 = pd.DataFrame(data_2)

# Display the DataFrames
print(df1)
print(df2)

df1

   A  B
0  1  4
1  2  5
2  3  6

df2

   C  D
0  k  n
1  l  o
2  m  p

Join two DataFrames on their index

# Join two DataFrames on their shared index
merged_df = pd.merge(df1, df2, left_index=True, right_index=True)
print(merged_df)
   A  B  C  D
0  1  4  k  n
1  2  5  l  o
2  3  6  m  p

Concatenate DataFrames by rows

We can concatenate two datasets by rows or columns.

# Concatenate DataFrames by rows
concat_df = pd.concat([df1, df2], ignore_index=True)
# Since df1 and df2 have different columns, missing entries are filled with NaN
print(concat_df)
     A    B    C    D
0  1.0  4.0  NaN  NaN
1  2.0  5.0  NaN  NaN
2  3.0  6.0  NaN  NaN
3  NaN  NaN    k    n
4  NaN  NaN    l    o
5  NaN  NaN    m    p
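Concatenating along columns (axis=1) instead lines the two DataFrames up side by side, aligning rows on the index (a minimal sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': ["k", "l", "m"], 'D': ["n", "o", "p"]})

# axis=1 concatenates column-wise, aligning rows on the shared index
concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)
```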

Reshape a DataFrame in Pandas

Melt a DataFrame: Convert Columns into rows

Pandas’s pd.melt() function is used to reshape a DataFrame. It “melts” or unpivots the DataFrame. In other words, it converts columns into rows. As a result, it helps with analysis and visualization of data.

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Melt the DataFrame
melted_df = pd.melt(df, id_vars=['ID', 'Name'], var_name='Subject', value_name='Score')

# Display the original and melted DataFrames
print("Original DataFrame:")
print(df)

print("\nMelted DataFrame:")
print(melted_df)
Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95

Melted DataFrame:
   ID     Name        Subject  Score
0   1    alice     math_score     80
1   2      bob     math_score     90
2   3  charlie     math_score     70
3   1    alice  english_score     75
4   2      bob  english_score     85
5   3  charlie  english_score     95

Pivot a DataFrame: Convert Rows into Columns

We can use the pivot method on a DataFrame to specify the index, columns, and values.

import pandas as pd

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Pivot the DataFrame using pivot_table
pivot_df = pd.pivot_table(df, index='ID', columns='Name', values=['math_score', 'english_score'])

# Display the original and pivoted DataFrames
print("Original DataFrame:")
print(df)

print("\nPivoted DataFrame:")
print(pivot_df)
Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95

Pivoted DataFrame:
     english_score               math_score              
Name         alice   bob charlie      alice   bob charlie
ID                                                       
1             75.0   NaN     NaN       80.0   NaN     NaN
2              NaN  85.0     NaN        NaN  90.0     NaN
3              NaN   NaN    95.0        NaN   NaN    70.0

Note that in a pivot table, the resulting DataFrame will have a hierarchical column index. We can use reset_index to flatten it.
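A minimal sketch of that flattening; joining the two column levels with an underscore is just one possible convention:

```python
import pandas as pd

data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

pivot_df = pd.pivot_table(df, index='ID', columns='Name',
                          values=['math_score', 'english_score'])

# Join the two column levels into flat names, then turn 'ID' back into a column
pivot_df.columns = ['_'.join(col) for col in pivot_df.columns]
flat_df = pivot_df.reset_index()
print(flat_df.columns.tolist())
```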

Method chaining in Pandas

Most pandas functions and methods return a DataFrame, so the output of one pandas method can serve as the input to another.

df = df1.rename(columns={'A': 'a', 'B': 'b'}).query('a > 1')

This code first renames the columns of the original DataFrame (df1) and then filters the rows where column 'a' has values greater than 1.

View data in DataFrame

Viewing the required data in a DataFrame is essential for understanding the data.

Display Initial 5 Rows of a DataFrame

# Display Initial 5 Rows of a DataFrame
df.head()

Display Initial 2 Rows of a DataFrame

# Display Initial 2 Rows of a DataFrame
df.head(2)

Display Last 5 Rows of a DataFrame

# Display Last 5 Rows of a DataFrame
df.tail()

Sort Data in a DataFrame

Sort Data in Ascending Order

import pandas as pd

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Sort the DataFrame by 'math_score' in ascending order
df_sorted = df.sort_values(by='math_score')

# Display the sorted DataFrame
print(df_sorted)

Sort Data in Descending Order

# Sort data in descending order
df_sorted_desc = df.sort_values(by='math_score', ascending=False)
print(df_sorted_desc)

Sort Data in DataFrame Index

# Sort data by index
df_sorted_index = df.sort_index()
print(df_sorted_index)

Reset the index of a DataFrame

# Reset the index
df_reset_index = df.reset_index(drop=True)
print(df_reset_index)
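With drop=False (the default), reset_index keeps the old index as a regular column instead of discarding it, which is handy when names were used as the index (a sketch):

```python
import pandas as pd

df = pd.DataFrame({"math_score": [80, 90]}, index=["john", "hari"])

# The old index becomes a new column named 'index'
df_kept = df.reset_index()
print(df_kept)
```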

Know about your DataFrame or Data

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

Information about the Data

df.info()

info() outputs various information about the data, such as the number of rows and columns, the column names, the count of non-null values in each column, and the data types.

This is one of the first commands we run during data analysis in machine learning.

Output the shape of the dataframe

df.shape
#Output the number of unique values in a column
df["math_score"].nunique()
#Output length or number of rows in a dataframe
len(df)
#Output columns names
df.columns

Get the number of missing values in a DataFrame

#This will output a Boolean table indicating whether each element is null
df.isnull()
#This will output number of null values in a column
df.isnull().sum()

Understand Variables in a DataFrame

# Output a summary table for all columns with numerical data
df.describe()
# Outputs count, mean, standard deviation, min, max, etc. for each variable
# Output information only about the required column
df.math_score.describe()
#Output the frequency of all unique values in a column
df.math_score.value_counts().head()
#Output Pearson correlation between continuous variables
#(numeric_only=True skips non-numeric columns such as names)
df.corr(numeric_only=True)
#Output the sum of each variable/column
df.sum(numeric_only=True)
#Output non-null value counts
df.count()
#Output the median of each continuous variable
df.median(numeric_only=True)
#Output mean values for each continuous column
df.mean(numeric_only=True)
#Output minimum values for each column, including categorical variables
df.min()
#Output maximum values for each column, including categorical variables
df.max()
#Output variance for each continuous column
df.var(numeric_only=True)
#Output standard deviation for each continuous column
df.std(numeric_only=True)

Getting Data from a DataFrame

Get Data by the name of row in a dataframe

#Get data by the name of row in the dataframe
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Set 'name' as the index
df = df.set_index('name')

# Get data for a specific row by name
row_data = df.loc['bob']

# Display the data for the specified row
print(row_data)

Get row data by index position in a DataFrame

#Get row data by index position in a DataFrame
df.iloc[1]   # this will output the row at index position 1
df.iloc[1:3] # this will output the rows at index positions 1 and 2
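iloc also accepts a column position, so a single cell or a sub-block can be selected purely by position (a sketch with a small scores DataFrame):

```python
import pandas as pd

data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70]}
df = pd.DataFrame(data)

# Row position 1, column position 2: bob's math score
cell = df.iloc[1, 2]
print(cell)  # 90

# Rows 0-1 and columns 1-2 as a sub-DataFrame
block = df.iloc[0:2, 1:3]
print(block)
```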

Data Cleaning

Remove Duplicate Rows

# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 3, 2, 1, 4],
        'Name': ['alice', 'bob', 'charlie', 'bob', 'alice', "mike"],
        'math_score': [80, 90, 70, 90, 80, 90],
        'english_score': [75, 85, 95, 85, 75, 85]}
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

# Display the DataFrame after removing duplicates
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

One point to note here: drop_duplicates() removes only those rows where all column values are identical.

Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
3   2      bob          90             85
4   1    alice          80             75
5   4     mike          90             85

DataFrame after removing duplicates:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
5   4     mike          90             85

Remove Duplicates by specifying a column

# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 3, 2, 1, 4],
        'name': ['alice', 'bob', 'charlie', 'bob', 'alice', "alice"],
        'math_score': [80, 90, 70, 90, 80, 90],
        'english_score': [75, 85, 95, 85, 75, 85]}
df = pd.DataFrame(data)

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
df_no_duplicates_name = df.drop_duplicates(subset='name')
# Display the DataFrame after removing duplicates
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
print("\nDataFrame after removing duplicates from name column:")
print(df_no_duplicates_name)
DataFrame after removing duplicates:
   ID     name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
5   4    alice          90             85

DataFrame after removing duplicates from name column:
   ID     name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95

If you look at the above DataFrames, the first one displays the name alice twice, whereas the second one displays alice only once.

#This will remove duplicates from DataFrame df in place
df.drop_duplicates(inplace=True)
#Remove duplicates from DataFrame df while keeping the last occurrence of each duplicate
df_second = df.drop_duplicates(keep="last")

Renaming Columns

#Renaming Columns
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Rename the columns
df_rename = df.rename(columns={"name": "NAME", "math_score": "MATH"})

# Display the DataFrames before and after renaming
print("\nDataFrame before renaming the columns:")
print(df)
print("\nDataFrame after renaming the columns:")
print(df_rename)

Handling null values

# Handling null values
# Create a DataFrame with null values
data = {'ID': [1, 2, 3, 4],
        'name': ['alice', 'bob', 'charlie', "mike"],
        'math_score': [80, 90, 70, 90],
        'english_score': [75, 85,None , 85]}
df = pd.DataFrame(data)

# Drop rows with null values
df_without_null = df.dropna()
# We can always use inplace=True to drop values in the same DataFrame

# Display the DataFrames before and after dropping null values
print("\nDataFrame before removing the null values:")
print(df)
print("\nDataFrame after removing all rows with null values:")
print(df_without_null)

Instead of removing the null values we can also replace null values with a number or string.

# Replace null values with the mean value of the column
df_replace_null = df.fillna(df["english_score"].mean())
# We can always use inplace=True to replace values in the same DataFrame

# Display the DataFrames before and after replacing null values
print("\nDataFrame before replacing the null values:")
print(df)
print("\nDataFrame after replacing null values with the column mean:")
print(df_replace_null)

Convert a DataFrame into a List

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

# Convert the whole DataFrame and a single column into Python lists
X = df.values.tolist()
Y = df["math_score"].tolist()
type_x = type(X)
type_y = type(Y)
print(f"The newly created X is a {type_x} and Y is a {type_y}")
The newly created X is a <class 'list'> and Y is a <class 'list'>
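For feeding data to a machine learning algorithm, converting to a NumPy array with to_numpy() is also common; a sketch with hypothetical feature and target choices:

```python
import pandas as pd

df = pd.DataFrame({'math_score': [80, 90, 70],
                   'english_score': [75, 85, 95]})

# Feature matrix as a NumPy array, target column as a flat Python list
X = df[['math_score', 'english_score']].to_numpy()
y = df['math_score'].tolist()
print(X.shape)  # (3, 2)
print(y)        # [80, 90, 70]
```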

Conditional Data Selection

#This will check, for each row, whether the value in the column fulfills the given condition.
#The output will be a Series of True or False values.

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

condition = (df["english_score"] == 75)
print(condition)

The above code evaluates the condition for each row. The output is a Boolean Series of True or False values.

# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)

df1 = df[df["english_score"] <= 90]

# Display the DataFrame after applying the conditions
print("\nRaw DataFrame:")
print(df)
print("\nDataFrame after applying the required conditions:")
print(df1)

The above code will create a new DataFrame containing all rows that satisfy the given condition.

Group Data in a DataFrame

Group data by values in a column

# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'grade': ['A', 'B', 'C', 'C', 'A', 'B'],
        'age': [12, 14, 13, 10, 10, 12]}
df = pd.DataFrame(data)

# Group data by the 'grade' column
grouped_df = df.groupby('grade')

# Display the groups
for name, group in grouped_df:
    print(f"Group: {name}")
    print(group)
    print()
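Grouping is usually followed by an aggregation. For instance, the mean age per grade (a sketch over the same data):

```python
import pandas as pd

data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'grade': ['A', 'B', 'C', 'C', 'A', 'B'],
        'age': [12, 14, 13, 10, 10, 12]}
df = pd.DataFrame(data)

# Mean age within each grade
mean_age = df.groupby('grade')['age'].mean()
print(mean_age)
```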

Apply a Function to a DataFrame In Pandas

# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'percentage_marks': [92, 94, 73, 82, 62, 51]}
df = pd.DataFrame(data)

# Define a function to assign grades
def assign_grade(mark):
    if mark > 90:
        return 'A'
    elif 80 <= mark <= 90:
        return 'B'
    else:
        return 'C'

# Apply the function to create a new 'grade' column
df['grade'] = df['percentage_marks'].apply(assign_grade)

# Display the updated DataFrame
print(df)

The above code adds a 'grade' column to the DataFrame df, assigning a grade to each student.

Query Function in Pandas

# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'percentage_marks': [92, 94, 73, 82, 62, 51]}
df = pd.DataFrame(data)

# Use the query function to filter rows
filtered_df = df.query('percentage_marks > 90')

# Display the filtered DataFrame
print(filtered_df)

The above code will output a DataFrame containing only the rows where percentage marks are greater than 90.
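query can also reference local Python variables with the @ prefix, which keeps thresholds out of the query string (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['alice', 'bob', 'charlie'],
                   'percentage_marks': [92, 94, 73]})

threshold = 90
# '@threshold' refers to the local Python variable
top_students = df.query('percentage_marks > @threshold')
print(top_students)
```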

Key Takeaways

The pandas library in Python is an important library that we can use to apply various functions to data. It helps us get data ready for machine learning. We suggest you also read this article on NumPy in Python.
