ENH: Improve assertion message for assert_frame_equal #39967

Open
mansenfranzen opened this issue Feb 22, 2021 · 7 comments
Labels
Enhancement, Error Reporting, Testing

Comments

@mansenfranzen

Problem description

For testing data pipelines using pandas I usually use assert_frame_equal to compare expected and resulting dataframes. However, in some circumstances (e.g. test dataframes with more than 20 rows/columns and timestamps) the resulting assertion message may not provide enough information to easily identify the difference between expected and resulting dataframe.

Consider the following example with timestamp columns:

import pandas as pd

df_expected = pd.DataFrame(
    {
        "timestamp_1": pd.date_range(start="2020-01-01 10:00:01", periods=10, freq="d"),
        "timestamp_2": pd.date_range(start="2020-05-05 10:00:01", periods=10, freq="d"),
    }
)

df_resulting = df_expected.copy()
df_resulting.iloc[7, 1] = pd.Timestamp("2020-01-07 15:00:00")

pd.testing.assert_frame_equal(df_resulting, df_expected)

The resulting assertion message is the following:

AssertionError: numpy array are different

numpy array values are different (10.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[left]:  [1588672801000000000, 1588759201000000000, 1588845601000000000, 1588932001000000000, 1589018401000000000, 1589104801000000000, 1589191201000000000, 1578409200000000000, 1589364001000000000, 1589450401000000000]
[right]: [1588672801000000000, 1588759201000000000, 1588845601000000000, 1588932001000000000, 1589018401000000000, 1589104801000000000, 1589191201000000000, 1589277601000000000, 1589364001000000000, 1589450401000000000]

It is already hard to spot the actual difference even though there are only 2 columns and 10 rows. The message reports neither the column name nor the affected indices.

Proposal

The assertion message could include the name of the column and differing indices, like:

AssertionError: numpy array are different

numpy array values are different (10.0 %)
[column]: timestamp_2
[affected_indices]: [7]
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[left]:  [1588672801000000000, 1588759201000000000, 1588845601000000000, 1588932001000000000, 1589018401000000000, 1589104801000000000, 1589191201000000000, 1578409200000000000, 1589364001000000000, 1589450401000000000]
[right]: [1588672801000000000, 1588759201000000000, 1588845601000000000, 1588932001000000000, 1589018401000000000, 1589104801000000000, 1589191201000000000, 1589277601000000000, 1589364001000000000, 1589450401000000000]

I can find the differences with some additional boilerplate code; however, it would be more convenient to have this information in the assertion message right away.
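The proposed message can be approximated today with a thin wrapper around `pd.testing.assert_frame_equal`. This is only a sketch: `describe_differences` and `assert_frame_equal_verbose` are hypothetical names, not part of pandas, and the sketch assumes unique column names and uses exact comparison (no tolerances) when locating the differing cells.

```python
import pandas as pd


def describe_differences(left: pd.DataFrame, right: pd.DataFrame) -> str:
    """Hypothetical helper: name each differing column and its affected index labels."""
    lines = []
    for col in left.columns:
        # Treat missing values in the same position on both sides as equal.
        both_null = left[col].isna() & right[col].isna()
        mask = left[col].ne(right[col]) & ~both_null
        if mask.any():
            lines.append(f"[column]: {col}")
            lines.append(f"[affected_indices]: {mask[mask].index.tolist()}")
    return "\n".join(lines)


def assert_frame_equal_verbose(left: pd.DataFrame, right: pd.DataFrame, **kwargs) -> None:
    """Hypothetical wrapper: re-raise the pandas assertion with extra location details."""
    try:
        pd.testing.assert_frame_equal(left, right, **kwargs)
    except AssertionError as exc:
        # Cell-wise details only make sense when both frames share the same labels.
        same_labels = left.index.equals(right.index) and left.columns.equals(right.columns)
        details = describe_differences(left, right) if same_labels else ""
        raise AssertionError(f"{exc}\n{details}") from exc
```

Called as `assert_frame_equal_verbose(df_resulting, df_expected)`, the wrapper keeps the original pandas message and appends the `[column]` and `[affected_indices]` lines after it.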

API breaking implications

There should be no breaking API changes. However, assert_extension_array_equal and assert_numpy_array_equal may need an additional keyword argument in order to pass the column name.
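Until such a keyword exists, the column name can be attached from the outside by comparing column by column through the public `pd.testing.assert_series_equal`. The following is a minimal sketch, not the proposed pandas change: `assert_frame_equal_by_column` is a hypothetical name, it assumes unique column names, and any keyword arguments passed through must be ones that `assert_series_equal` accepts.

```python
import pandas as pd


def assert_frame_equal_by_column(left: pd.DataFrame, right: pd.DataFrame, **kwargs) -> None:
    """Hypothetical helper: compare per column and prefix failures with the column name."""
    # Column labels must match before a per-column comparison makes sense.
    pd.testing.assert_index_equal(left.columns, right.columns)
    for col in left.columns:
        try:
            pd.testing.assert_series_equal(left[col], right[col], **kwargs)
        except AssertionError as exc:
            # Keep the original pandas message and add the column it belongs to.
            raise AssertionError(f"[column]: {col}\n{exc}") from exc
```

The comparison stops at the first differing column, which mirrors how `assert_frame_equal` reports a single failure.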

Describe alternatives you've considered

Currently, I use the following helper function to see the actual differences:

import pandas as pd
import numpy as np

def identify_differences(df_left: pd.DataFrame, df_right: pd.DataFrame) -> pd.DataFrame:
    """Provide indices, column names and differing values for left and
    right dataframe. Assumes that dataframes have same shape, indices
    and columns.
    
    Parameters
    ----------
    df_left: pd.DataFrame
        Left dataframe used for comparison.
    df_right: pd.DataFrame
        Right dataframe used for comparison.
    
    Returns
    -------
    df_differences: pd.DataFrame
        DataFrame containing indices, column names and differing values
        of left and right dataframe.

    """
    
    mask = df_left.ne(df_right)
    indices, columns = np.where(mask)

    differences = [
        {
            "index": df.index[idx],
            "column": df.columns[col],
            "left": df_left.iloc[idx, col],
            "right": df_right.iloc[idx, col],
        }
        for idx, col in zip(indices, columns)
    ]
    
    return pd.DataFrame(differences)

This works nicely but I'd rather have the information in the assertion message already.


If agreed, I could provide a PR myself. Thanks for looking at it anyway.

@mansenfranzen added the Enhancement and Needs Triage labels Feb 22, 2021
@phofl added the Testing label Feb 22, 2021
@ruijpbastos

Why not show only the differing values instead of the whole columns?

@mansenfranzen
Author

Why not show only the differing values instead of the whole columns?

Agreed - this would be even more convenient.

@jbrockmendel added the Error Reporting label and removed the Needs Triage label Jun 6, 2021
@jbrockmendel
Member

PR would be welcome.

@mansenfranzen
Author

take

@benhammondmusic

I'm trying to use your helper function, but I think one of the variable names is incorrect. You call

            "index": df.index[idx],
            "column": df.columns[col],

but there is no df here, only df_left and df_right. Which is it supposed to be?

@dvfariaf-bops

dvfariaf-bops commented Sep 13, 2023

This improvement would be great. Any chance the PR exists somewhere as a draft? I could help complete it.

I don't know what the original intention was for the index/column values, but one way to address @benhammondmusic's question would be to use the left dataframe as the reference for the index and columns:

            "index": df_left.index[idx],
            "column": df_left.columns[col],
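Putting that suggestion together, a runnable version of the helper might look like the sketch below. It assumes, as the original docstring does, that both dataframes share the same shape, index and columns. Note that `DataFrame.ne` treats NaN as unequal to NaN, so missing values in the same position are reported as differences. The explicit `columns=` argument is an addition so that the result has the same columns even when nothing differs.

```python
import numpy as np
import pandas as pd


def identify_differences(df_left: pd.DataFrame, df_right: pd.DataFrame) -> pd.DataFrame:
    """Return index label, column name and both values for every differing cell."""
    # Boolean array marking cells where left and right differ.
    mask = df_left.ne(df_right).to_numpy()
    rows, cols = np.where(mask)

    differences = [
        {
            # The left dataframe serves as the reference for labels.
            "index": df_left.index[row],
            "column": df_left.columns[col],
            "left": df_left.iloc[row, col],
            "right": df_right.iloc[row, col],
        }
        for row, col in zip(rows, cols)
    ]

    return pd.DataFrame(differences, columns=["index", "column", "left", "right"])
```

For the example at the top of this issue, `identify_differences(df_resulting, df_expected)` yields a single row for index 7 of `timestamp_2`.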

@aeisenbarth

Wouldn't pd.DataFrame.compare be a good fit for the assertion message?
(Note, however, that it currently does not support tolerances; see #48488.)

Also, it can be very helpful to know which tolerance was violated (at least for the whole dataframe, not per difference).
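For reference, here is what `DataFrame.compare` returns for the example from the issue description. This is a sketch of the existing method's output only, not of how it would be wired into `assert_frame_equal`; it requires identically labeled dataframes and compares exactly, without tolerances.

```python
import pandas as pd

df_expected = pd.DataFrame(
    {
        "timestamp_1": pd.date_range(start="2020-01-01 10:00:01", periods=10, freq="D"),
        "timestamp_2": pd.date_range(start="2020-05-05 10:00:01", periods=10, freq="D"),
    }
)

df_resulting = df_expected.copy()
df_resulting.iloc[7, 1] = pd.Timestamp("2020-01-07 15:00:00")

# Only rows and columns containing at least one difference are kept.
# Each remaining column gets a second level: "self" (caller) and "other" (argument).
diff = df_resulting.compare(df_expected)
print(diff)
```

The result contains only row 7 and only `timestamp_2`, which is the information the current assertion message lacks.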
