skalwaghe-56
Contributor

@skalwaghe-56 skalwaghe-56 commented Sep 8, 2025


This PR fixes a regression in the CSV parsers when using on_bad_lines as a callable.

Thanks!

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 3 times, most recently from f579800 to d77afef Compare September 10, 2025 10:43
@skalwaghe-56
Contributor Author

@jbrockmendel @rhshadrach Could you please guide me on the next steps?

@simonjayhawkins simonjayhawkins added Bug IO CSV read_csv, to_csv labels Sep 10, 2025
@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 4 times, most recently from 7009e84 to 0729267 Compare September 12, 2025 12:14
@skalwaghe-56
Contributor Author

@rhshadrach @jorisvandenbossche When I ran the tests locally with these changes, one test XPASSed. I think it is related to #10153.
It's this test:

@pytest.mark.parametrize("dtype", [{"b": "category"}, {1: "category"}])
def test_categorical_dtype_single(all_parsers, dtype, request):
    # see gh-10153
    parser = all_parsers
    data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""
    expected = DataFrame(
        {"a": [1, 1, 2], "b": Categorical(["a", "a", "b"]), "c": [3.4, 3.4, 4.5]}
    )
    if parser.engine == "pyarrow":
        mark = pytest.mark.xfail(
            strict=False,
            reason="Flaky test sometimes gives object dtype instead of Categorical",
        )
        request.applymarker(mark)

    actual = parser.read_csv(StringIO(data), dtype=dtype)
    tm.assert_frame_equal(actual, expected)

Could you please take a look at this and review the PR as well?

Thanks!

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 4 times, most recently from 02e9bd2 to 7f303f7 Compare September 16, 2025 16:40
Contributor Author

@skalwaghe-56 skalwaghe-56 left a comment

I have fixed the tests as well, so CI should pass now.

@skalwaghe-56 skalwaghe-56 force-pushed the fix-issue-61837 branch 2 times, most recently from e1f405e to 2fa7f70 Compare September 20, 2025 09:00
Member

@rhshadrach rhshadrach left a comment

Looking good!

- Always emit ParserWarning and drop extra fields when an on_bad_lines
  callable returns more elements than expected, regardless of index_col,
  in PythonParser._rows_to_cols. [GH#61837]

- Ensure non-bad rows are appended in the outer else branch so good lines
  are preserved.

- Add regression test
  pandas/tests/io/parser/test_python_parser_only.py::test_on_bad_lines_callable_warns_and_truncates_with_index_col
  covering index_col in [None, 0].

Closes pandas-dev#61837.
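
For readers following along, the row handling described in the bullets above can be modeled roughly as below. This is a dependency-free sketch, not the actual body of PythonParser._rows_to_cols: the function name rows_to_fixed_width, the local ParserWarning stand-in class, and the warning text are all assumptions made for illustration.

```python
import warnings


class ParserWarning(Warning):
    """Local stand-in for pandas.errors.ParserWarning (assumption for this sketch)."""


def rows_to_fixed_width(rows, col_len, on_bad_lines):
    """Hypothetical, simplified model of the bad-line handling described above."""
    kept = []
    for row in rows:
        if len(row) > col_len:
            # Bad line: hand it to the user-supplied callable.
            new_row = on_bad_lines(row)
            if new_row is None:
                # The callable asked for the line to be skipped.
                continue
            if len(new_row) > col_len:
                # The callable still returned too many fields: warn and
                # truncate, without consulting any index_col setting.
                warnings.warn(
                    f"Expected {col_len} fields, callable returned "
                    f"{len(new_row)}; dropping extra fields.",
                    ParserWarning,
                    stacklevel=2,
                )
                new_row = new_row[:col_len]
            kept.append(list(new_row))
        else:
            # Good line: always preserved (the outer else branch).
            kept.append(list(row))
    return kept


rows = [["1", "2"], ["2", "3", "4", "5"], ["3", "4"]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = rows_to_fixed_width(rows, col_len=2, on_bad_lines=lambda r: r)
```

Here the identity callable returns the four-field row unchanged, so the sketch warns once and keeps only the first two fields, while the two well-formed rows pass through untouched.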
Development

Successfully merging this pull request may close these issues.

BUG: read_csv() on_bad_lines callable does not raise ParserWarning when index_col is set
3 participants