Merge Pandas DataFrame based on index
In the world of data science and machine learning, it is imperative to master operations that organize, maintain, and clean data for further analysis. Merging two DataFrames is an example of such an operation. It turns out that merging two DataFrames is very easy using the Pandas library in Python.
Pandas provides us with two useful functions, merge() and join(), to merge two DataFrames. The two methods are very similar, but merge() is considered more general and flexible. It also provides many parameters to change the behavior of the final DataFrame. join() merges two DataFrames on their indexes, while merge() allows us to specify columns that can be used as keys to merge two DataFrames.
A common parameter for both functions is how, which defines the type of join. By default, the how parameter is inner for merge() and left for join(), but for both functions it can be changed to left, right, inner, or outer. It is useful to understand the difference between them.
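As a minimal sketch of these defaults, the snippet below calls both functions without passing how. The frames left and right are made up for illustration and are not part of the examples used in the rest of this article.

```python
import pandas as pd

# Hypothetical frames for illustration; they share index labels 1 and 2 only.
left = pd.DataFrame({"C1": ["a", "b", "d"]}, index=[1, 2, 4])
right = pd.DataFrame({"C2": ["AA", "BB", "CC"]}, index=[1, 2, 3])

# merge() has to be told to use the indexes; how defaults to "inner".
merged_default = left.merge(right, left_index=True, right_index=True)

# join() uses the indexes by default; how defaults to "left".
joined_default = left.join(right)

print(merged_default.index.tolist())  # labels present in both frames
print(joined_default.index.tolist())  # every label of the left frame
```

With these inputs, merge() keeps only labels 1 and 2, while join() also keeps label 4 from the left frame and fills its C2 value with NaN.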
When merging two Pandas DataFrames, we assume that one is the left DataFrame and the other is the right DataFrame. Both merge() and join() match records on a key, which can be a column or the index. An inner join returns a DataFrame consisting of the matching records from both DataFrames. An outer join produces a merged DataFrame containing all elements from both DataFrames, filling missing values with NaN on both sides. A left join contains all elements of the left DataFrame, but only matching records from the right DataFrame. Conversely, a right join contains all elements of the right DataFrame, and only matching records from the left DataFrame. All of this will be clearer in the following example code, where we will combine two DataFrames, df1 and df2.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
print(df1)
print(df2)
Output:
C1
1 a
2 b
4 d
5 e
7 h
C2
1 AA
2 BB
3 CC
5 EE
6 FF
Merge two Pandas DataFrames on an index using merge()
When merging two DataFrames by their indices, the left_index and right_index parameters of the merge() function should both be set to True. The following code example merges the two DataFrames with a join of type inner.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
df_inner = df1.merge(df2, how="inner", left_index=True, right_index=True)
print(df_inner)
Output:
C1 C2
1 a AA
2 b BB
5 e EE
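One way to sanity-check an inner join on the index is to compare the result against the intersection of the two indexes. The sketch below rebuilds df1 and df2 from this article; the variable name common is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
    ["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)

df_inner = df1.merge(df2, how="inner", left_index=True, right_index=True)

# Index labels that appear in both DataFrames.
common = df1.index.intersection(df2.index)

print(sorted(common.tolist()))
print(df_inner.index.tolist())
```

Both lines should list the same labels, since an inner join keeps exactly the keys that exist on both sides.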
The following code merges the DataFrames with a join of type outer.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
df_outer = df1.merge(df2, how="outer", left_index=True, right_index=True)
print(df_outer)
Output:
C1 C2
1 a AA
2 b BB
3 NaN CC
4 d NaN
5 e EE
6 NaN FF
7 h NaN
As you can see, the merged DataFrame with join type inner has only the matching records from both DataFrames, while the DataFrame with join type outer has all the elements, filling the missing records with NaN. Now let us use a left join.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
df_left = df1.merge(df2, how="left", left_index=True, right_index=True)
print(df_left)
Output:
C1 C2
1 a AA
2 b BB
4 d NaN
5 e EE
7 h NaN
The merged DataFrame above has all the elements of the left DataFrame and only the matching records from the right DataFrame. The exact opposite is the right join, as shown in the code below.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
df_right = df1.merge(df2, how="right", left_index=True, right_index=True)
print(df_right)
Output:
C1 C2
1 a AA
2 b BB
3 NaN CC
5 e EE
6 NaN FF
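To see which side each row of a merged result comes from, merge() also accepts an indicator parameter, which this article has not used so far. The sketch below is one possible way to inspect an outer join with it; the name df_flagged is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
    ["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both" for every row.
df_flagged = df1.merge(
    df2, how="outer", left_index=True, right_index=True, indicator=True
)
print(df_flagged)

# Dropping the rows found only on the right leaves the labels of a left join.
left_like = df_flagged[df_flagged["_merge"] != "right_only"]
print(left_like.index.tolist())
```

Rows marked both are the ones an inner join would keep, while left_only and right_only mark the rows that a left or right join adds on top of them.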
Merge two Pandas DataFrames on an index using join()
The join() method merges two DataFrames based on their indices. By default, the join type is left. It always uses the index of the right DataFrame, but we can provide the keys for the left DataFrame through the on parameter. We can specify the join type for the join() function just like we did for the merge() function.
The following example shows the merged DataFrame with join type outer.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)
df_outer = df1.join(df2, how="outer")
print(df_outer)
Output:
C1 C2
1 a AA
2 b BB
3 NaN CC
4 d NaN
5 e EE
6 NaN FF
7 h NaN
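The following sketch illustrates two points from this section under stated assumptions: that join() on the indexes gives the same result as merge() with both index flags set, and that the on parameter lets a column of the left DataFrame act as the key. The frame left_with_key and its key column are hypothetical and exist only for this example.

```python
import pandas as pd

df1 = pd.DataFrame(["a", "b", "d", "e", "h"], index=[1, 2, 4, 5, 7], columns=["C1"])
df2 = pd.DataFrame(
    ["AA", "BB", "CC", "EE", "FF"], index=[1, 2, 3, 5, 6], columns=["C2"]
)

# join() on the indexes and merge() with both index flags should agree.
via_join = df1.join(df2, how="outer")
via_merge = df1.merge(df2, how="outer", left_index=True, right_index=True)
print(via_join.equals(via_merge))

# Hypothetical frame whose key lives in a column instead of the index.
left_with_key = pd.DataFrame({"key": [1, 2, 4], "C1": ["a", "b", "d"]})

# on="key" matches the left column against the index of df2.
# The join type is still "left" by default, so key 4 gets NaN in C2.
joined_on_key = left_with_key.join(df2, on="key")
print(joined_on_key)
```

When the key is a column on the left side, the result keeps the original index of the left DataFrame rather than the key values.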