Pandas is a popular open-source data analysis and manipulation library for Python. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for data cleaning, merging, and reshaping. Pandas is built on top of NumPy, another popular Python library for numerical computing.
The name "Pandas" is derived from "panel data", a term used in statistics and econometrics to refer to multidimensional structured datasets.
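Pandas provides two main data structures for storing and manipulating data: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table whose columns can hold different data types.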
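Some of the key features of Pandas include tools for data cleaning, merging, and reshaping, flexible label-based selection, and grouping and aggregation for summarizing data.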
Pandas can be installed using pip, the Python package manager. To install Pandas, open a terminal or command prompt and type:
pip install pandas
Once Pandas is installed, you can import it into your Python code, conventionally under the alias pd, using the following statement:
import pandas as pd
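To confirm that the installation worked, one quick check is to print the installed version. This is only a minimal sketch: the version string you see depends on your environment, and the variable name version is chosen here for illustration.

```python
import pandas as pd

# Read the installed pandas version; the exact value depends on your environment
version = pd.__version__
print(version)
```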
Let's take a look at some basic examples of working with Pandas.
To create a Series object in Pandas, you can pass a list of values to the Series constructor:
import pandas as pd
# create a Series object
s = pd.Series([1, 3, 5, 7, 9])
print(s)
This will output:
0    1
1    3
2    5
3    7
4    9
dtype: int64
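The left column of the output shows the index of each value in the Series (0 to 4, assigned automatically), the right column shows the corresponding value, and the last line reports the data type of the values.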
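The index does not have to be the default run of integers. As a small sketch, the example below passes custom labels through the index parameter of the Series constructor; the labels 'a', 'b', and 'c' are made up for illustration.

```python
import pandas as pd

# Illustrative labels; the index parameter accepts any list of hashable values
s = pd.Series([1, 3, 5], index=['a', 'b', 'c'])

# Look up a value by its label
print(s['b'])          # 3

# The labels and the underlying NumPy values are available separately
print(list(s.index))   # ['a', 'b', 'c']
print(s.to_numpy())    # [1 3 5]
```

Because Pandas is built on NumPy, to_numpy() hands back the values as a NumPy array.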
To create a DataFrame object in Pandas, you can pass a dictionary of lists to the DataFrame constructor:
import pandas as pd
# create a DataFrame object
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)
print(df)
This will output:
      name  age gender
0    Alice   25      F
1      Bob   32      M
2  Charlie   18      M
3    David   47      M
The output shows a table-like structure with three columns (name, age, and gender) and four rows of data.
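It can help to inspect a DataFrame right after creating it. The sketch below rebuilds the same table and checks its shape and column labels, then shows that selecting a single column returns a Series. The variable name ages is chosen here for illustration.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)

# shape is a (rows, columns) tuple
print(df.shape)              # (4, 3)

# Column labels, in the order the dictionary keys were given
print(list(df.columns))      # ['name', 'age', 'gender']

# Selecting a single column returns a Series
ages = df['age']
print(type(ages).__name__)   # Series
print(ages.max())            # 47
```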
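You can select data from a Pandas DataFrame using various methods, such as label-based indexing with loc, position-based indexing with iloc, and boolean indexing with a condition.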
Here's an example of selecting data using loc:
import pandas as pd
# create a DataFrame object
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)
# select data using loc
print(df.loc[1:2, ['name', 'age']])
This will output:
      name  age
1      Bob   32
2  Charlie   18
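The output shows the rows with index 1 and 2, and the columns with names "name" and "age". Note that loc slices by label and includes both endpoints, so 1:2 returns two rows rather than one.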
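Two other common ways to select data are by integer position and by condition. As a sketch on the same table, iloc selects by position and, like ordinary Python slicing, excludes the end of the slice, while a boolean condition keeps only the rows where it is True. The variable names by_position and over_30 are illustrative.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)

# iloc selects by integer position: rows 1 and 2, columns 0 and 1
by_position = df.iloc[1:3, 0:2]
print(by_position)

# Boolean indexing keeps the rows where the condition is True
over_30 = df[df['age'] > 30]
print(over_30['name'].tolist())   # ['Bob', 'David']
```

Here df.iloc[1:3, 0:2] picks the same rows and columns as df.loc[1:2, ['name', 'age']], because the default index labels happen to match the row positions.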
Pandas provides powerful grouping and aggregation functions for summarizing data. Here's an example of grouping data by a categorical variable:
import pandas as pd
# create a DataFrame object
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)
# group data by gender and calculate the mean age
grouped = df.groupby('gender')['age'].mean()
print(grouped)
This will output:
gender
F    25.000000
M    32.333333
Name: age, dtype: float64
The output shows the mean age for each gender (F and M).
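A group can be summarized with more than one statistic at a time. The sketch below passes a list of aggregation names to agg, which returns a DataFrame with one row per group and one column per statistic. The variable name summary is chosen here for illustration.

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M']}
df = pd.DataFrame(data)

# Compute several statistics of age for each gender in one call
summary = df.groupby('gender')['age'].agg(['count', 'min', 'max', 'mean'])
print(summary)

# Individual cells can be read with loc: (group label, statistic name)
print(summary.loc['M', 'max'])   # 47
```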
Pandas is a powerful and flexible library for data analysis and manipulation in Python. It provides efficient data structures and tools for cleaning, merging, and reshaping datasets, as well as powerful grouping and aggregation functions for summarizing data. With Pandas, you can easily handle large and complex datasets and extract valuable insights from your data.