Pandas is an open-source library built on NumPy that provides easy-to-use data structures and data analysis tools for machine learning. It is the backbone of most ML projects. We use the pandas library in machine learning to import, clean, transform, and analyze data. This article covers essential pandas functions with examples in Python.
Features of Pandas Library in Python
Here are the main features of the pandas library in Python.
- User-friendly.
- Easy-to-use data structures.
- Includes tools to work with time-series and numerical data.
- Works efficiently with large data sets and integrates with other libraries.
Getting Started with the Pandas Library in Python
How to Install Pandas?
conda install pandas
pip install pandas
How to Import the Pandas Library in Python?
import pandas as pd
Series and DataFrame in Pandas
A Series is a single column, whereas a DataFrame is a multi-dimensional table, i.e., a collection of Series. Most pandas operations are performed on DataFrames.
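The relationship can be sketched in a few lines: each column of a DataFrame is itself a Series.

```python
import pandas as pd

# A Series is a single labeled column of values
scores = pd.Series([80, 93, 49], name="math-score")

# A DataFrame is a collection of Series sharing one index
df = pd.DataFrame({"math-score": scores, "english-score": [83, 83, 42]})

# Selecting one column of a DataFrame returns a Series
print(type(df["math-score"]))
```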
Uses of Pandas Library in Machine Learning
Pandas is built on NumPy and offers similar functionality. We use pandas for the following tasks.
- Import, read and create data.
- Exploratory Data Analysis: View & plot the data, Get insight on data.
- Data Cleaning.
- Feed data to machine learning algorithms.
Create DataFrame: Create, read and show data
Create a dictionary in Python
# Create a Python dictionary
data = {"math-score": [80, 93, 49, 62], "english-score": [83, 83, 42, 61]}
print(data)
{'math-score': [80, 93, 49, 62], 'english-score': [83, 83, 42, 61]}
# Add index to a Python dictionary
index_values = ["john", "hari", "mack", "ram"]
data_with_index = {index: scores for index, scores in zip(index_values, zip(*data.values()))}
print(data_with_index)
{'john': (80, 83), 'hari': (93, 83), 'mack': (49, 42), 'ram': (62, 61)}
Create DataFrame from Dictionary
# Create a DataFrame from a Python dictionary
data_frame = pd.DataFrame(data)
print(data_frame)
   math-score  english-score
0          80             83
1          93             83
2          49             42
3          62             61
Create DataFrame from dictionary and add index
# This will convert a Python dictionary into a DataFrame and add the index
df_dire_index = pd.DataFrame(data, index = ["john", "hari", 'mack', "ram"])
print(df_dire_index)
      math-score  english-score
john          80             83
hari          93             83
mack          49             42
ram           62             61
Create a DataFrame in Python from input data
df = pd.DataFrame({"math-score": [80, 93, 49, 62], "english-score": [83, 83, 42, 61]})
# Display the DataFrame
print(df)
Create a DataFrame from a NumPy array
import numpy as np
# Create a NumPy array
numpy_array = np.array([[80, 83], [93, 83], [49, 42], [62, 61]])
# Create a DataFrame from the NumPy array
df = pd.DataFrame(numpy_array, columns=["math-score", "english-score"])
# Display the DataFrame
print(df)
Add index to an existing dataframe
#Add index to existing dataframe
df = df.set_index(pd.Index(["john", "hari", 'mack', "ram"]))
# Display the DataFrame
print(df)
Import data in Pandas from a .csv file
# Import data from a csv file
file_path = "example.csv" # Specify the path to your CSV file
# Read data from CSV file into a DataFrame
df = pd.read_csv(file_path)
# Display the DataFrame
print(df)
# Read data from CSV file into a DataFrame with an index
df = pd.read_csv(file_path, index_col="index_col_name")
Import data in Pandas from a .json file
import pandas as pd
# Specify the path to JSON file
file_path = "example.json"
# Read data from JSON file into a DataFrame
df = pd.read_json(file_path, orient='records')
# Display the DataFrame
print(df)
Read data in pandas from SQL database
To read data from a SQL database, we first need to create a database connection. With SQLite, this takes the following steps:
Step-1: Install SQLite
SQLite support (the sqlite3 module) ships with Python's standard library, so no separate installation is usually needed.
Step-2: Make a connection with the SQLite database
import sqlite3
# Connect to the SQLite database
conn = sqlite3.connect('your_database.db')
#Replace 'your_database.db' with the actual database file.
Step-3: Select the data and convert it into a dataframe
# Write your SQL query
query = 'SELECT * FROM your_table'
#Replace 'your_table' with the actual table name
# Read data from SQL database into a DataFrame
df = pd.read_sql(query, conn)
# Close the database connection
conn.close()
# Display the DataFrame
print(df)
Convert a DataFrame into a CSV file
# Specify the path to save the CSV file
csv_file_path = "output.csv"
# Save the DataFrame to CSV file
df.to_csv(csv_file_path, index=False)
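As a quick sanity check, the saved file can be read straight back. A minimal round-trip sketch (writing a hypothetical output.csv in the working directory):

```python
import pandas as pd

df = pd.DataFrame({"math-score": [80, 93], "english-score": [83, 83]})
# index=False avoids writing the row labels as an extra column
df.to_csv("output.csv", index=False)

# Reading the file back reproduces the original DataFrame
df_back = pd.read_csv("output.csv")
print(df_back.equals(df))  # True
```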
Convert a DataFrame into a JSON file
# Specify the path to save the JSON file
json_file_path = "output.json"
# Save the DataFrame to a JSON file ('records' writes a list of row objects)
df.to_json(json_file_path, orient="records")
Combine Datasets
Let's first create two DataFrames, df1 and df2, each with two columns. We will join these DataFrames in various ways, which is very useful during data wrangling for machine learning.
import pandas as pd
# Create a DataFrame with two columns ('A' and 'B') and three rows
data_1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df1 = pd.DataFrame(data_1)
# Create a DataFrame with two columns ('C' and 'D') and three rows
data_2 = {'C': ["k", "l", "m"], 'D': ["n", "o", "p"]}
df2 = pd.DataFrame(data_2)
# Display the DataFrame
print(df1)
print(df2)
df1
   A  B
0  1  4
1  2  5
2  3  6
df2
   C  D
0  k  n
1  l  o
2  m  p
Join two DataFrames by index
# Join two DataFrames on their index
merged_df = pd.merge(df1, df2, left_index=True, right_index=True)
print(merged_df)
Merged DataFrame:
   A  B  C  D
0  1  4  k  n
1  2  5  l  o
2  3  6  m  p
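Merging on the index is a special case; more often, DataFrames are merged on a shared key column. A small sketch (the student_id column here is illustrative):

```python
import pandas as pd

# Two DataFrames sharing a key column
scores = pd.DataFrame({"student_id": [1, 2, 3], "math": [80, 93, 49]})
grades = pd.DataFrame({"student_id": [1, 2, 4], "grade": ["A", "A", "C"]})

# An inner join keeps only the student_ids present in both DataFrames
merged = pd.merge(scores, grades, on="student_id", how="inner")
print(merged)
```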
Concatenate DataFrames by rows
We can concatenate two datasets by rows or columns.
# Concatenate DataFrames by rows
concat_df = pd.concat([df1, df2], ignore_index=True)
# Since df1 and df2 have different columns, missing entries are filled with NaN
print(concat_df)
     A    B    C    D
0  1.0  4.0  NaN  NaN
1  2.0  5.0  NaN  NaN
2  3.0  6.0  NaN  NaN
3  NaN  NaN    k    n
4  NaN  NaN    l    o
5  NaN  NaN    m    p
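To concatenate side by side instead, pass axis=1. Since df1 and df2 share the same row index here, no NaN values appear:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': ["k", "l", "m"], 'D': ["n", "o", "p"]})

# axis=1 concatenates column-wise, aligning rows on the index
concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)
```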
Reshape a DataFrame in Pandas
Melt a DataFrame: Convert Columns into rows
Pandas' pd.melt() function reshapes a DataFrame. It "melts" or unpivots the DataFrame; in other words, it converts columns into rows. As a result, it helps with data analysis and visualization.
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Melt the DataFrame
melted_df = pd.melt(df, id_vars=['ID', 'Name'], var_name='Subject', value_name='Score')
# Display the original and melted DataFrames
print("Original DataFrame:")
print(df)
print("\nMelted DataFrame:")
print(melted_df)
Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95

Melted DataFrame:
   ID     Name        Subject  Score
0   1    alice     math_score     80
1   2      bob     math_score     90
2   3  charlie     math_score     70
3   1    alice  english_score     75
4   2      bob  english_score     85
5   3  charlie  english_score     95
Pivot a DataFrame: Convert Rows into Columns
We can use the pivot_table function on a DataFrame to specify the index, columns, and values.
import pandas as pd
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Pivot the DataFrame using pivot_table
pivot_df = pd.pivot_table(df, index='ID', columns='Name', values=['math_score', 'english_score'])
# Display the original and pivoted DataFrames
print("Original DataFrame:")
print(df)
print("\nPivoted DataFrame:")
print(pivot_df)
Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95

Pivoted DataFrame:
     english_score              math_score
Name         alice   bob charlie      alice   bob charlie
ID
1             75.0   NaN     NaN       80.0   NaN     NaN
2              NaN  85.0     NaN        NaN  90.0     NaN
3              NaN   NaN    95.0       NaN   NaN    70.0
Note that the resulting pivot table has a hierarchical column index. We can flatten it by joining the column levels and calling reset_index.
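One common way to flatten the hierarchical columns, sketched on the pivot result above, is to join the two column levels into single strings and then call reset_index:

```python
import pandas as pd

data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
pivot_df = pd.pivot_table(df, index='ID', columns='Name',
                          values=['math_score', 'english_score'])

# Join the two column levels into single strings, then reset the index
pivot_df.columns = ['_'.join(col) for col in pivot_df.columns]
flat_df = pivot_df.reset_index()
print(flat_df.columns.tolist())
```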
Method Chaining in Pandas
Most pandas functions and methods return a DataFrame, so the output of one method can serve as the input to the next.
df = df1.rename(columns={'A': 'a', 'B': 'b'}).query('a > 1')
This code first renames the columns of the original DataFrame (df1) and then filters the rows where column a is greater than 1.
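A slightly longer chain on the df1 DataFrame from the earlier section, with each step feeding the next (the sort step is added here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# rename -> filter -> sort, all in one expression
result = (df1.rename(columns={'A': 'a', 'B': 'b'})
             .query('a > 1')
             .sort_values('b', ascending=False))
print(result)
```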
View data in DataFrame
Viewing the required data in a DataFrame is essential for understanding the data.
Display Initial 5 Rows of a DataFrame
# Display Initial 5 Rows of a DataFrame
df.head()
Display Initial 2 Rows of a DataFrame
# Display Initial 2 Rows of a DataFrame
df.head(2)
Display Last 5 Rows of a DataFrame
# Display Last 5 Rows of a DataFrame
df.tail()
Sort Data in a DataFrame
Sort Data in Ascending Order
import pandas as pd
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Sort the DataFrame by 'math_score' in ascending order
df_sorted = df.sort_values(by='math_score')
# Display the sorted DataFrame
print(df_sorted)
Sort Data in Descending Order
# Sort data in descending order
df_sorted_desc = df.sort_values(by='math_score', ascending=False)
print(df_sorted_desc)
Sort Data by DataFrame Index
# Sort data by index
df_sorted_index = df.sort_index()
print(df_sorted_index)
Reset the index of a DataFrame
# Reset the index
df_reset_index = df.reset_index(drop=True)
print(df_reset_index)
Know about your DataFrame or Data
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'Name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
Information about the Data
df.info()
info() outputs various information about the data, such as the number of columns, their names, the number of non-null values in each column, and their data types.
This is one of the first commands we run during data analysis in machine learning.
Output the shape of the DataFrame
df.shape  # shape is an attribute, not a method
#Output the number of unique values in a column
df["math_score"].nunique()
#Output the length (number of rows) of a DataFrame
len(df)
#Output the column names
df.columns
Get the number of missing values in a DataFrame
#This will output a Boolean table indicating for each element whether it is null
df.isnull()
#This will output number of null values in a column
df.isnull().sum()
Understand Variables in a DataFrame
# Output a summary table of all columns with numerical data
df.describe()
# Outputs count, mean, standard deviation, min, max, and quartiles per variable
# Output information only about a required column
df.math_score.describe()
#Output the frequency of all unique values in a column
df.math_score.value_counts().head()
#Output Pearson correlation between continuous variables
df.corr(numeric_only=True)
#Output sum of each variable/column
df.sum(numeric_only=True)
#Output count of non-null values
df.count()
#Output median of each continuous variable
df.median(numeric_only=True)
#Output mean values for each continuous column
df.mean(numeric_only=True)
#Output minimum values for each column, including categorical variables
df.min()
#Output maximum values for each column, including categorical variables
df.max()
#Output variance for each continuous column
df.var(numeric_only=True)
#Output standard deviation for each continuous column
df.std(numeric_only=True)
Getting Data from a DataFrame
Get Data by Row Name in a DataFrame
#Get data by the row name in the DataFrame
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Set 'name' as the index
df = df.set_index('name')
# Get data for a specific row by name
row_data = df.loc['bob']
# Display the data for the specified row
print(row_data)
Get Row Data by Index Position in a DataFrame
#Get row data by index position in a DataFrame
df.iloc[1] #this will output data at index position 1
df.iloc[1:3] #this will output data at index position 1 and 2
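Both loc and iloc also accept a column selector, so a single cell or a sub-table can be pulled in one call; a small sketch using the same score data:

```python
import pandas as pd

df = pd.DataFrame({'math_score': [80, 90, 70],
                   'english_score': [75, 85, 95]},
                  index=['alice', 'bob', 'charlie'])

# Row label + column label -> a single value
print(df.loc['bob', 'math_score'])  # 90

# Positional slice of rows and columns -> a sub-table
print(df.iloc[0:2, 0:1])  # first two rows, first column
```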
Data Cleaning
Remove Duplicate Rows
# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 3, 2, 1, 4],
        'Name': ['alice', 'bob', 'charlie', 'bob', 'alice', 'mike'],
        'math_score': [80, 90, 70, 90, 80, 90],
        'english_score': [75, 85, 95, 85, 75, 85]}
df = pd.DataFrame(data)
# Display the original DataFrame
print("Original DataFrame:")
print(df)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
# Display the DataFrame after removing duplicates
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
Note that drop_duplicates will only remove rows where all values are identical.
Original DataFrame:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
3   2      bob          90             85
4   1    alice          80             75
5   4     mike          90             85

DataFrame after removing duplicates:
   ID     Name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
5   4     mike          90             85
Remove Duplicates by specifying a column
# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 3, 2, 1, 4],
        'name': ['alice', 'bob', 'charlie', 'bob', 'alice', 'alice'],
        'math_score': [80, 90, 70, 90, 80, 90],
        'english_score': [75, 85, 95, 85, 75, 85]}
df = pd.DataFrame(data)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
# Remove duplicates based only on the 'name' column
df_no_duplicates_name = df.drop_duplicates(subset='name')
# Display the DataFrame after removing duplicates
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
print("\nDataFrame after removing duplicates from name column:")
print(df_no_duplicates_name)
DataFrame after removing duplicates:
   ID     name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
5   4    alice          90             85

DataFrame after removing duplicates from name column:
   ID     name  math_score  english_score
0   1    alice          80             75
1   2      bob          90             85
2   3  charlie          70             95
If you look at the above DataFrames, the first one displays the name alice twice, whereas the second one displays alice only once.
#This will remove duplicates from DataFrame df in place
df.drop_duplicates(inplace=True)
#Remove duplicates while keeping the last occurrence of each duplicate
df_second = df.drop_duplicates(keep="last")
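The keep parameter controls which copy of a duplicate survives; a quick sketch of keep='first' versus keep='last':

```python
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob', 'alice'],
                   'score': [80, 90, 85]})

# keep='first' retains the earliest occurrence of each name
first = df.drop_duplicates(subset='name', keep='first')
# keep='last' retains the latest occurrence instead
last = df.drop_duplicates(subset='name', keep='last')
print(first['score'].tolist())  # [80, 90]
print(last['score'].tolist())   # [90, 85]
```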
Renaming Columns
#Renaming Columns
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Rename the columns
df_rename = df.rename(columns={"name": "NAME", "math_score": "MATH"})
# Display the DataFrames before and after renaming
print("\nDataFrame before renaming the columns:")
print(df)
print("\nDataFrame after renaming the columns:")
print(df_rename)
Handling null values
# Handling null values
# Create a DataFrame with null values
data = {'ID': [1, 2, 3, 4],
        'name': ['alice', 'bob', 'charlie', 'mike'],
        'math_score': [80, 90, 70, 90],
        'english_score': [75, 85, None, 85]}
df = pd.DataFrame(data)
# Drop rows with null values
df_without_null = df.dropna()
# We can always use inplace=True to drop values in the same DataFrame
# Display the DataFrames before and after removing null values
print("\nDataFrame before removing the null values:")
print(df)
print("\nDataFrame after removing all rows with null values:")
print(df_without_null)
Instead of removing the null values we can also replace null values with a number or string.
# Replace null values with the mean value of the column
df_replace_null = df.fillna(df["english_score"].mean())
# We can always use inplace=True to fill values in the same DataFrame
# Display the DataFrames before and after replacing null values
print("\nDataFrame before replacing the null values:")
print(df)
print("\nDataFrame after replacing null values with the column mean:")
print(df_replace_null)
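To restrict the fill to a single column, fillna also accepts a dict mapping column names to fill values; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob', 'charlie'],
                   'english_score': [75, None, 85]})

# Only english_score is filled; other columns are left untouched
df_filled = df.fillna({'english_score': df['english_score'].mean()})
print(df_filled)
```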
Convert a DataFrame into a List
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
# Convert the whole DataFrame into a list of rows
X = df.values.tolist()
# Convert a single column into a list
Y = df["math_score"].tolist()
type_x = type(X)
type_y = type(Y)
print(f"The newly created X is a {type_x} and Y is a {type_y}")
The newly created X is a <class 'list'> and Y is a <class 'list'>
Conditional Data Selection
#This will select all the rows where the values in a column fulfill the given condition.
#The output will be a Series of True or False
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
condition = (df["english_score"] == 75)
print(condition)
The above code will select all the rows where the values in the column fulfill the given condition. The output will be a Series of True or False values.
# Create a DataFrame
data = {'ID': [1, 2, 3],
        'name': ['alice', 'bob', 'charlie'],
        'math_score': [80, 90, 70],
        'english_score': [75, 85, 95]}
df = pd.DataFrame(data)
df1 = df[df["english_score"] <= 90]
# Display the DataFrame after applying the conditions
print("\nRaw DataFrame:")
print(df)
print("\nDataFrame after applying the required conditions:")
print(df1)
The above code will create a new DataFrame with all rows satisfying the given condition.
Group Data in a DataFrame
Group data by values in a column
# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'grade': ['A', 'B', 'C', 'C', 'A', 'B'],
        'age': [12, 14, 13, 10, 10, 12]}
df = pd.DataFrame(data)
# Group data by the 'grade' column
grouped_df = df.groupby('grade')
# Display the groups
for name, group in grouped_df:
    print(f"Group: {name}")
    print(group)
    print()
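Grouping is usually followed by an aggregation. For example, computing the mean age within each grade group:

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'C', 'C', 'A', 'B'],
                   'age': [12, 14, 13, 10, 10, 12]})

# Mean age within each grade group
mean_age = df.groupby('grade')['age'].mean()
print(mean_age)
```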
Apply a Function to a DataFrame In Pandas
# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'percentage_marks': [92, 94, 73, 82, 62, 51]}
df = pd.DataFrame(data)
# Define a function to assign grades
def assign_grade(mark):
    if mark > 90:
        return 'A'
    elif 80 <= mark <= 90:
        return 'B'
    else:
        return 'C'
# Apply the function to create a new 'grade' column
df['grade'] = df['percentage_marks'].apply(assign_grade)
# Display the updated DataFrame
print(df)
The above code adds a grade column to DataFrame df, assigning a grade to each student.
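For a one-off transformation, an inline lambda works just as well as a named function; this sketch scales marks to a 0-1 fraction:

```python
import pandas as pd

df = pd.DataFrame({'percentage_marks': [92, 94, 73]})

# Scale marks to a 0-1 range with a lambda
df['fraction'] = df['percentage_marks'].apply(lambda m: m / 100)
print(df)
```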
Query Function in Pandas
# Create a DataFrame
data = {'student_name': ['alice', 'bob', 'charlie', 'ramesh', 'yana', 'john'],
        'percentage_marks': [92, 94, 73, 82, 62, 51]}
df = pd.DataFrame(data)
# Use the query function to filter rows
filtered_df = df.query('percentage_marks > 90')
# Display the filtered DataFrame
print(filtered_df)
The above code will output a DataFrame containing only the rows where percentage marks are greater than 90.
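The query function can also reference ordinary Python variables by prefixing them with @:

```python
import pandas as pd

df = pd.DataFrame({'student_name': ['alice', 'bob', 'yana'],
                   'percentage_marks': [92, 94, 62]})

threshold = 90
# @threshold refers to the local Python variable
top = df.query('percentage_marks > @threshold')
print(top)
```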
Key Takeaways
The pandas library in Python is an important library that we can use to apply various functions to data. It helps us get data ready for machine learning. We suggest you also read this article on NumPy in Python.