In this tutorial, we will look at ten Pandas functions that data scientists reach for most often. Data analysis is one of the fastest-growing fields in the data world, and data engineers, data scientists and data analysts all rely on Python's Pandas library to carry out most data operations and convert data into a usable format. We will explore the library and its power by looking at some of its most important methods.
What is Pandas?
Pandas is a powerful open-source data analysis and manipulation library for Python. It provides data structures such as Series (1-dimensional) and DataFrame (2-dimensional) that are designed to handle and manipulate structured data. The upcoming sections cover 10 of the most popular and commonly used Pandas functions. So let's get started.
Python Pandas Functions for Data Scientists [Top 10 Functions]
The Pandas methods discussed in the rest of this tutorial are fundamental to data manipulation and analysis. They offer powerful tools for reading, processing and transforming tabular data efficiently. Let us look at each method one by one.
1. read_csv()
In Pandas, the read_csv() method reads data from a comma-separated values (CSV) file and loads it into a DataFrame. We will reuse the same base code throughout this tutorial to demonstrate each method.
In the example below, we fetch the dataset from a URL using the requests module. The requests.get() method fetches the data from the URL and returns a response object (stored in the 'response' variable in this case). We then check the response code using response.status_code. If the response code indicates success (200), we take the dataset as text, wrap it in a StringIO buffer and read it using the pd.read_csv() method. Create a file and save the code below.
Syntax
read_csv(<filepath>, sep=',', delimiter=None, header='infer', names=None)
read_csv() method
import pandas as pd
import requests
from io import StringIO

# URL of the sample CSV dataset used throughout this tutorial
url = "https://github.com/datablist/sample-csv-files/raw/main/files/organizations/organizations-100.csv"
response = requests.get(url)
response_status = response.status_code

if response_status == 200:
    # Wrap the downloaded text in a file-like buffer so read_csv() can parse it
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    print(df)
else:
    print(f"Failed to fetch data from given url. Status code: {response_status}")
    Index  Organization Id                     Name  ...  Founded                     Industry  Number of employees
0       1  FAB0d41d5b5d22c              Ferrell LLC  ...     1990                     Plastics                 3498
1       2  6A7EdDEA9FaDC52  Mckinney, Riley and Day  ...     2015  Glass / Ceramics / Concrete                 4952
2       3  0bFED1ADAE4bcC1               Hester Ltd  ...     1971                Public Safety                 5287
3       4  2bFC1Be8a4ce42f           Holder-Sellers  ...     2004                   Automotive                  921
4       5  9eE8A6a4Eb96C24              Mayer Group  ...     1991               Transportation                 7870
..    ...              ...                      ...  ...      ...                          ...                  ...
95     96  0a0bfFbBbB8eC7c             Holmes Group  ...     1975                  Photography                 2988
96     97  BA6Cd9Dae2Efd62                 Good Ltd  ...     1971            Consumer Services                 4292
97     98  E7df80C60Abd7f9        Clements-Espinoza  ...     1991              Broadcast Media                  236
98     99  AFc285dbE2fEd24               Mendez Inc  ...     1993         Education Management                  339
99    100  e9eB5A60Cef8354           Watkins-Kaiser  ...     2009           Financial Services                 2785

[100 rows x 9 columns]
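To get a feel for the sep, header and names parameters listed in the syntax, here is a minimal sketch that reads a small in-memory CSV instead of a file. The semicolon-separated sample text and the column names are assumptions made up for illustration; they are not part of the tutorial's dataset.

```python
from io import StringIO

import pandas as pd

# Hypothetical semicolon-separated data without a header row
raw = "1;Ferrell LLC;1990\n2;Hester Ltd;1971\n"

df = pd.read_csv(
    StringIO(raw),
    sep=";",                             # fields are separated by ';' instead of ','
    header=None,                         # the data has no header line
    names=["Index", "Name", "Founded"],  # so we supply the column names ourselves
)
print(df)
```

Passing header=None together with names is a common way to load files that ship without a header line.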
2. head()
In Pandas, the head() method displays the first n rows of a DataFrame (5 by default). It is useful for quickly checking the structure and content of the DataFrame. In the example below, we fetch the first 10 rows of the dataset. Make the modification below in the previous example file.
Syntax
head(n)
head() method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    first10 = df.head(10)
    print(first10)
   Index  Organization Id                     Name  ...  Founded                       Industry  Number of employees
0      1  FAB0d41d5b5d22c              Ferrell LLC  ...     1990                       Plastics                 3498
1      2  6A7EdDEA9FaDC52  Mckinney, Riley and Day  ...     2015    Glass / Ceramics / Concrete                 4952
2      3  0bFED1ADAE4bcC1               Hester Ltd  ...     1971                  Public Safety                 5287
3      4  2bFC1Be8a4ce42f           Holder-Sellers  ...     2004                     Automotive                  921
4      5  9eE8A6a4Eb96C24              Mayer Group  ...     1991                 Transportation                 7870
5      6  cC757116fe1C085           Henry-Thompson  ...     1992  Primary / Secondary Education                 4914
6      7  219233e8aFF1BC3           Hansen-Everett  ...     2018            Publishing Industry                 7832
7      8  ccc93DCF81a31CD            Mcintosh-Mora  ...     1970                Import / Export                 4389
8      9  0B4F93aA06ED03e                 Carr Inc  ...     1996                       Plastics                 8167
9     10  738b5aDe6B1C6A5               Gaines Inc  ...     1997       Outsourcing / Offshoring                 9698
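The argument n is optional, and head() has a counterpart, tail(), for the end of the DataFrame. The following sketch uses a tiny made-up DataFrame (an assumption, not the tutorial's dataset) to show the default, a negative n and tail().

```python
import pandas as pd

# Hypothetical 10-row DataFrame used only for illustration
df = pd.DataFrame({"Index": range(1, 11)})

first_five = df.head()            # n defaults to 5
all_but_last_three = df.head(-3)  # a negative n drops the last |n| rows
last_two = df.tail(2)             # tail() is the mirror image of head()

print(len(first_five), len(all_but_last_three), last_two["Index"].tolist())
```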
3. describe()
In Pandas, the describe() method generates descriptive statistics of the DataFrame, including measures of central tendency, dispersion and shape of the distribution. By default, it summarizes only the numeric columns. Here we fetch the statistical summary of the dataset using describe(). Make the code change below in the previous example file.
Syntax
describe()
describe() method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    # Use pandas describe() function to display summary statistics
    print(df.describe())
            Index      Founded  Number of employees
count  100.000000   100.000000           100.000000
mean    50.500000  1995.410000          4964.860000
std     29.011492    15.744228          2850.859799
min      1.000000  1970.000000           236.000000
25%     25.750000  1983.500000          2741.250000
50%     50.500000  1995.000000          4941.500000
75%     75.250000  2010.250000          7558.000000
max    100.000000  2021.000000          9995.000000
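The output above covers only the numeric columns. As a small sketch of how text columns can be included as well, the snippet below passes include="all" on a made-up three-row DataFrame (the data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical data mixing a text column and a numeric column
df = pd.DataFrame({
    "Industry": ["Plastics", "Plastics", "Automotive"],
    "Founded": [1990, 1996, 2004],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique/top/freq for text columns

print(full_summary)
```

For text columns the summary reports the number of unique values, the most frequent value (top) and how often it occurs (freq), while the numeric statistics are left as NaN.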
4. drop()
In Pandas, the drop() method removes the specified labels from the rows or columns of a DataFrame. By default it returns a new DataFrame with the specified labels dropped and leaves the original unchanged. Here we use drop() to remove the 'Founded' and 'Number of employees' columns from the dataset. Make the code changes below in the previous example file.
Syntax
drop(labels=None, axis=0, index=None, columns=None, inplace=False)
drop() method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    print(f"These are the columns in Dataset = {df.columns}")
    dropColumn = df.drop(['Founded', 'Number of employees'], axis=1)
    print(f"Dataset after deleting requested Columns:\n {dropColumn}")
These are the columns in Dataset = Index(['Index', 'Organization Id', 'Name', 'Website', 'Country',
       'Description', 'Founded', 'Industry', 'Number of employees'],
      dtype='object')
Dataset after deleting requested Columns:
     Index  Organization Id  ...                                     Description                     Industry
0       1  FAB0d41d5b5d22c  ...             Horizontal empowering knowledgebase                     Plastics
1       2  6A7EdDEA9FaDC52  ...             User-centric system-worthy leverage  Glass / Ceramics / Concrete
2       3  0bFED1ADAE4bcC1  ...                  Switchable scalable moratorium                Public Safety
3       4  2bFC1Be8a4ce42f  ...  De-engineered systemic artificial intelligence                   Automotive
4       5  9eE8A6a4Eb96C24  ...              Synchronized needs-based challenge               Transportation
..    ...              ...  ...                                             ...                          ...
95     96  0a0bfFbBbB8eC7c  ...          Right-sized zero tolerance focus group                  Photography
96     97  BA6Cd9Dae2Efd62  ...         Reverse-engineered composite moratorium            Consumer Services
97     98  E7df80C60Abd7f9  ...                         Progressive modular hub              Broadcast Media
98     99  AFc285dbE2fEd24  ...                 User-friendly exuding migration         Education Management
99    100  e9eB5A60Cef8354  ...                   Synergistic background access           Financial Services

[100 rows x 7 columns]
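The syntax line also lists the index and columns keywords, which can be used instead of labels plus axis. Here is a brief sketch on a made-up three-row DataFrame (the data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical data used only for illustration
df = pd.DataFrame({
    "Name": ["Ferrell LLC", "Hester Ltd", "Mayer Group"],
    "Founded": [1990, 1971, 1991],
    "Industry": ["Plastics", "Public Safety", "Transportation"],
})

without_founded = df.drop(columns=["Founded"])  # same as drop(["Founded"], axis=1)
without_first_row = df.drop(index=[0])          # drop a row by its index label

# The original DataFrame is untouched because inplace defaults to False
print(list(df.columns))
print(list(without_founded.columns))
```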
5. loc[]
In Pandas, loc[] is used for label-based indexing. It accesses a group of rows and columns by labels or a Boolean array. Here we fetch the record at index label 2 using 'df.loc[2]'. To print the value of Industry for the record at index 2, we use 'df.loc[2, 'Industry']'. Make the code changes below in the previous example file.
Syntax
loc[row_indexer, column_indexer]
loc[] method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    label = df.loc[2]
    print(f"Fetching record at index 2: {label}")
    label = df.loc[2, 'Industry']
    print(f"\nFetching type of Industry for record at index 2: {label}")
Fetching record at index 2: Index                                               3
Organization Id                       0bFED1ADAE4bcC1
Name                                       Hester Ltd
Website                     http://sullivan-reed.com/
Country                                         China
Description            Switchable scalable moratorium
Founded                                          1971
Industry                                Public Safety
Number of employees                              5287
Name: 2, dtype: object

Fetching type of Industry for record at index 2: Public Safety
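Besides single labels, loc[] also accepts a Boolean array and label slices. The sketch below illustrates both on a small made-up DataFrame (an assumption; the values are borrowed from the first rows of the output shown earlier).

```python
import pandas as pd

# Hypothetical three-row DataFrame used only for illustration
df = pd.DataFrame({
    "Name": ["Ferrell LLC", "Mckinney, Riley and Day", "Hester Ltd"],
    "Founded": [1990, 2015, 1971],
    "Industry": ["Plastics", "Glass / Ceramics / Concrete", "Public Safety"],
})

# Boolean array: keep rows founded after 1980, and only two of the columns
recent = df.loc[df["Founded"] > 1980, ["Name", "Founded"]]

# Label slice: unlike normal Python slices, both endpoints are included
first_two = df.loc[0:1, "Name"]

print(recent)
print(first_two)
```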
6. iloc[]
In Pandas, iloc[] is used for integer-location based indexing. It accesses a group of rows and columns by integer positions, counted from 0. Here we fetch the record at position 2 using 'df.iloc[2]'. To fetch the data at row 2, column 4 (the 'Country' column in this dataset), we use 'df.iloc[2, 4]'.
Syntax
iloc[row_indexer, column_indexer]
iloc[] method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    label = df.iloc[2]
    print(f"Fetching record at index 2: {label}")
    # Column position 4 is 'Country' (positions start at 0)
    label = df.iloc[2, 4]
    print(f"\nFetching Country for record at index 2: {label}")
Fetching record at index 2: Index                                               3
Organization Id                       0bFED1ADAE4bcC1
Name                                       Hester Ltd
Website                     http://sullivan-reed.com/
Country                                         China
Description            Switchable scalable moratorium
Founded                                          1971
Industry                                Public Safety
Number of employees                              5287
Name: 2, dtype: object

Fetching Country for record at index 2: China
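Because iloc[] works purely with positions, it is easy to pick the wrong column by counting. One possible way to avoid that, sketched below on a made-up DataFrame (an assumption for illustration), is to look up the position from the column name with df.columns.get_loc().

```python
import pandas as pd

# Hypothetical three-row DataFrame used only for illustration
df = pd.DataFrame({
    "Name": ["Ferrell LLC", "Mckinney, Riley and Day", "Hester Ltd"],
    "Founded": [1990, 2015, 1971],
    "Industry": ["Plastics", "Glass / Ceramics / Concrete", "Public Safety"],
})

# Translate a column name into its integer position
industry_pos = df.columns.get_loc("Industry")
value = df.iloc[2, industry_pos]

# Positional slices exclude the end, like normal Python slices
block = df.iloc[0:2, 0:2]

print(industry_pos, value)
print(block)
```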
7. groupby()
In Pandas, the groupby() method splits the data into groups based on some criteria. It is often followed by an aggregation function that performs a computation on each group. Here we use groupby() to group the dataset by the 'Name' and 'Number of employees' columns and count the rows in each group. Make the modification below in the previous example file.
Syntax
groupby(by=None, axis=0)
groupby() method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    grouped_data = df.groupby(['Name', 'Number of employees']).size().reset_index(name='Count')
    print(grouped_data)
                             Name  Number of employees  Count
0                      Arroyo Inc                 9067      1
1                       Ayala LLC                 7664      1
2     Baker, Mccann and Macdonald                 1638      1
3                 Bartlett-Arroyo                 3987      1
4     Beasley, Greene and Mahoney                  869      1
..                            ...                  ...    ...
95                      Walls LLC                 1678      1
96                 Walton-Barnett                 1746      1
97                 Watkins-Kaiser                 2785      1
98                 Weiss and Sons                 5984      1
99  Wilkinson, Charles and Arroyo                  602      1

[100 rows x 3 columns]
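In the output above every group has a count of 1 because each organization name appears only once. Grouping becomes more informative when a key repeats. The sketch below groups a small made-up DataFrame (an assumption for illustration) by 'Industry' and aggregates the employee counts.

```python
import pandas as pd

# Hypothetical data in which one industry appears twice
df = pd.DataFrame({
    "Industry": ["Plastics", "Automotive", "Plastics"],
    "Number of employees": [3498, 921, 8167],
})

# Total employees per industry
totals = df.groupby("Industry")["Number of employees"].sum()

# Several aggregations at once
stats = df.groupby("Industry")["Number of employees"].agg(["count", "mean"])

print(totals)
print(stats)
```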
8. merge()
In Pandas, the merge() method combines two DataFrames based on a common column or index. It performs database-like join operations. We use a different example to demonstrate merge(): we declare two datasets, dataset1 and dataset2, convert them to DataFrames and merge them on the shared 'NumID' column, as shown below.
Syntax
merge(left, right, how='inner', on=None)
merge() method
import pandas as pd

dataset1 = {
    'NumID': [1, 2, 3],
    'Name': ['One', 'Two', 'Three']
}
dataset2 = {
    'NumID': [1, 2, 3],
    'Value': [100, 200, 300]
}

df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

mergedDF = pd.merge(df1, df2, on='NumID', how='outer')
print(f"Merged Dataframe:\n {mergedDF}")
OUTPUT
Merged Dataframe:
    NumID   Name  Value
0      1    One    100
1      2    Two    200
2      3  Three    300
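In the example above both DataFrames contain the same NumID values, so every value of how gives the same rows. The sketch below changes dataset2 so that the keys only partly overlap (this variation is an assumption for illustration) and compares an inner join with an outer join.

```python
import pandas as pd

df1 = pd.DataFrame({"NumID": [1, 2, 3], "Name": ["One", "Two", "Three"]})
# Hypothetical variation: NumID 4 has no match in df1, and NumID 1 is missing here
df2 = pd.DataFrame({"NumID": [2, 3, 4], "Value": [200, 300, 400]})

# inner: keep only the keys present in both DataFrames
inner = pd.merge(df1, df2, on="NumID", how="inner")

# outer: keep all keys; indicator=True adds a '_merge' column showing the origin of each row
outer = pd.merge(df1, df2, on="NumID", how="outer", indicator=True)

print(inner)
print(outer)
```

Rows that exist on only one side get NaN in the columns coming from the other side.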
9. isnull()
In Pandas, the isnull() method detects missing or null values in a DataFrame. It returns a DataFrame of the same shape as the input, where each element is a Boolean value indicating whether the corresponding element in the original DataFrame is null. Here we use isnull() to check whether our dataset has any missing or null values. If it does, the code prints the records that contain them; otherwise it prints the message 'No records with null values found.'
Syntax
isnull()
isnull() method
if response_status == 200:
    data = response.text
    organization_data = StringIO(data)
    df = pd.read_csv(organization_data)
    # Check for null values in the DataFrame
    if df.isnull().values.any():
        print(df[df.isnull().any(axis=1)])
    else:
        print("No records with null values found.")
No records with null values found.
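Since the tutorial's dataset has no missing values, here is a small sketch on a made-up DataFrame that does (the data is an assumption for illustration). It counts the nulls per column and selects the affected rows.

```python
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "Name": ["One", None, "Three"],
    "Value": [100.0, None, None],
})

# True counts as 1, so summing the Boolean mask gives the nulls per column
per_column = df.isnull().sum()

# Rows that contain at least one null value
rows_with_nulls = df[df.isnull().any(axis=1)]

print(per_column)
print(rows_with_nulls)
```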
10. fillna()
In Pandas, the fillna() method fills missing or NaN (null) values in a DataFrame with a specified value, or by using a strategy such as forward fill or backward fill. In the example below, we fill all the None values in the 'Name' column of df1 with 'Unknown'.
Syntax
fillna(value=None, method=None, axis=None, inplace=False)
fillna() method
import pandas as pd

dataset1 = {
    'NumID': [1, 2, 3],
    'Name': ['One', None, None]
}
dataset2 = {
    'NumID': [1, 2, 3],
    'Value': [100, 200, 300]
}

df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

# Fill missing values in the 'Name' column with 'Unknown'.
# Assigning the result back avoids the chained inplace call, which
# recent pandas versions warn about.
df1['Name'] = df1['Name'].fillna('Unknown')

mergedDF = pd.merge(df1, df2, on='NumID', how='outer')
print(f"Merged Dataframe with Filled Missing Values:\n {mergedDF}")
Merged Dataframe with Filled Missing Values:
    NumID     Name  Value
0      1      One    100
1      2  Unknown    200
2      3  Unknown    300
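fillna() can also take a dictionary to use a different fill value per column, and forward fill is available as the separate ffill() method, which recent pandas versions appear to prefer over the method argument. The sketch below shows both on a made-up DataFrame (the data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical data with missing values in two columns
df = pd.DataFrame({
    "Name": ["One", None, None],
    "Value": [100.0, None, 300.0],
})

# A dictionary maps each column to its own fill value
filled = df.fillna({"Name": "Unknown", "Value": 0})

# Forward fill: propagate the last valid value downwards
forward = df.ffill()

print(filled)
print(forward)
```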
Summary
Reference pandas.pydata.org