10.2 Pandas Explained
Key Concepts
Pandas is a powerful Python library for data manipulation and analysis. It provides data structures and functions needed to work with structured data efficiently. Key concepts include:
- DataFrame
- Series
- Indexing and Selection
- Data Cleaning
- Grouping and Aggregation
- Merging and Joining
1. DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used data structure in Pandas.
Example:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Analogy: Think of a DataFrame as a spreadsheet where each column can contain different types of data.
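To make "heterogeneous" and "size-mutable" concrete, the following minimal sketch inspects the same sample data and then adds a column. The column name `Senior` and the threshold of 30 are illustrative assumptions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# shape is (number of rows, number of columns)
print(df.shape)

# Each column carries its own dtype: text for Name/City, integers for Age
print(df.dtypes)

# Size-mutable: a new column can be added in place.
# "Senior" is a hypothetical column used only for illustration.
df["Senior"] = df["Age"] >= 30
print(df)
```

After the assignment the DataFrame has four columns, and the new one holds booleans alongside the text and integer columns.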
2. Series
A Series is a one-dimensional labeled array capable of holding any data type. It is similar to a single column in a spreadsheet, and each column of a DataFrame is itself a Series.
Example:
import pandas as pd

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
Analogy: Think of a Series as a single column in a spreadsheet.
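The "labeled" part of the definition is easier to see with an explicit index. The sketch below is one possible illustration; the labels `a`, `b`, `c` and the name `price` are made up for the example.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0, 1, 2, ...
prices = pd.Series([10, 20, 30], index=["a", "b", "c"], name="price")

print(prices["b"])     # access by label
print(prices.iloc[0])  # access by integer position

# Selecting one column of a DataFrame returns a Series
df = pd.DataFrame({"Age": [25, 30, 35]})
ages = df["Age"]
print(type(ages))
```

The Series taken from the DataFrame keeps the column name as its `name` attribute and shares the DataFrame's row index.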
3. Indexing and Selection
Indexing and selection in Pandas allow you to access specific rows and columns of a DataFrame. You can use labels, integer positions, or boolean arrays to select data.
Example:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Selecting a single column (returns a Series)
print(df['Name'])

# Selecting multiple columns (returns a DataFrame)
print(df[['Name', 'City']])

# Selecting a row by index label (the default labels are 0, 1, 2)
print(df.loc[0])

# Selecting a row by integer position
print(df.iloc[1])
Analogy: Think of indexing and selection as navigating through a spreadsheet to find specific cells or ranges of cells.
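Selection with a boolean array, mentioned above, could look like the following minimal sketch on the same sample data. The age threshold of 28 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# A boolean Series with one True/False value per row
mask = df["Age"] > 28

# Keep only the rows where the mask is True
older = df[mask]
print(older)

# .loc accepts a row selector and a column selector together
names_of_older = df.loc[mask, "Name"]
print(names_of_older)

# .iloc does the same by position: row 0, column 2
cell = df.iloc[0, 2]
print(cell)
```

Combining a row mask with a column label in a single `.loc` call is generally preferable to chaining two selections, since it reads as one operation and avoids ambiguity when assigning values.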
4. Data Cleaning
Data cleaning involves handling missing data, removing duplicates, and correcting inconsistencies in the dataset. Pandas provides various functions to clean and preprocess data.
Example:
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', np.nan, 'Charlie'],
    'Age': [25, 30, np.nan, 35],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Chicago']
}
df = pd.DataFrame(data)

# Dropping rows that contain any missing value
df_cleaned = df.dropna()
print(df_cleaned)

# Filling missing values (note: 0 is also written into the 'Name' column)
df_filled = df.fillna(0)
print(df_filled)

# Removing duplicate rows (this sample has none, so the result is unchanged)
df_unique = df.drop_duplicates()
print(df_unique)
Analogy: Think of data cleaning as tidying up a messy room by removing unnecessary items and organizing the rest.
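In practice it often helps to count missing values first and to fill each column with a value that suits its type. The sketch below is one way to do that; the extra duplicate "Bob" row, the placeholder `"Unknown"`, and the choice of the mean age as a fill value are assumptions added for illustration.

```python
import pandas as pd
import numpy as np

# Same shape of data as above, plus one exact duplicate row (hypothetical)
df = pd.DataFrame({
    "Name": ["Alice", "Bob", np.nan, "Charlie", "Bob"],
    "Age": [25, 30, np.nan, 35, 30],
    "City": ["New York", "Los Angeles", "Chicago", "Chicago", "Los Angeles"],
})

# Count missing values per column before deciding how to handle them
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill each column with its own replacement value
filled = df.fillna({"Name": "Unknown", "Age": df["Age"].mean()})
print(filled)

# Drop rows that are exact duplicates of an earlier row
deduped = filled.drop_duplicates()
print(deduped)
```

Passing a dictionary to `fillna` avoids writing a numeric placeholder into a text column, and the duplicate row gives `drop_duplicates` something to remove.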
5. Grouping and Aggregation
Grouping and aggregation allow you to split the data into groups based on the values in one or more columns and then apply an aggregate function such as sum, mean, or count to each group.
Example:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}
df = pd.DataFrame(data)

# Grouping by Name and calculating the mean salary per person
grouped = df.groupby('Name')['Salary'].mean()
print(grouped)
Analogy: Think of grouping and aggregation as organizing items in a store by category and calculating the total sales for each category.
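Several aggregate functions can be applied in one pass with `agg`. The following is a minimal sketch on the same sample data; the output column names `avg_salary` and `n_records` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob"],
    "Age": [25, 30, 35, 25, 30],
    "Salary": [50000, 60000, 70000, 55000, 65000],
})

# Apply several aggregate functions to one column at once
summary = df.groupby("Name")["Salary"].agg(["mean", "sum", "count"])
print(summary)

# Named aggregation: choose the output column names explicitly
named = df.groupby("Name").agg(
    avg_salary=("Salary", "mean"),
    n_records=("Salary", "count"),
)
print(named)
```

The result is indexed by the grouping key, so a single group's figures can be looked up with `summary.loc["Alice"]`.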
6. Merging and Joining
Merging and joining allow you to combine multiple DataFrames based on common columns or indices. This is useful when working with relational data.
Example:
import pandas as pd

data1 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
data2 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merging DataFrames on 'Name'
merged = pd.merge(df1, df2, on='Name')
print(merged)
Analogy: Think of merging and joining as combining two spreadsheets based on a common column, like merging customer information from two different sources.
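When the two tables do not contain exactly the same keys, the `how` parameter of `pd.merge` decides which rows survive. The sketch below assumes a second table that is missing "Charlie" and contains a hypothetical extra person, "Dana", to show the difference.

```python
import pandas as pd

ages = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})
# Hypothetical second source: no Charlie, but an extra person, Dana
salaries = pd.DataFrame({
    "Name": ["Alice", "Bob", "Dana"],
    "Salary": [50000, 60000, 80000],
})

# Default (inner): only names present in both tables
inner = pd.merge(ages, salaries, on="Name")
print(inner)

# Left: every row of `ages`; missing salaries become NaN
left = pd.merge(ages, salaries, on="Name", how="left")
print(left)

# Outer: all names from both tables; `indicator` adds a `_merge`
# column recording where each row came from
outer = pd.merge(ages, salaries, on="Name", how="outer", indicator=True)
print(outer)
```

An inner merge silently drops unmatched rows, so checking row counts or the `_merge` column after a join is a cheap way to notice keys that failed to match.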