Skip to content

Conversation

@slobodan-ilic
Copy link
Contributor

@slobodan-ilic slobodan-ilic commented Oct 13, 2025

File Handle Support for SAV Files

Summary

Adds support for reading SAV files directly from file-like objects (e.g., zip archives, BytesIO) without extracting to disk.

Use Case

Read large SAV files from zip archives without temporary file extraction:

import zipfile
import pyreadstat

with zipfile.ZipFile('data.zip', 'r') as zf:
with zf.open('survey.sav') as f:
df, meta = pyreadstat.read_sav(f) # No extraction needed!

Implementation

  • Leverages ReadStat's built-in I/O handler system
  • Custom Cython handlers bridge Python file objects to C callbacks
  • Unified parser function handles both paths and file objects
  • Zero breaking changes - existing code works unchanged

Testing

  • All existing tests pass (92 tests)
  • New integration test validates zip archive reading
  • Data integrity verified (identical output to path-based reading)

Changes

  • pyreadstat.read_sav() now accepts file-like objects with read() and seek() methods
  • +132 net lines (after refactoring to eliminate duplication)
  • Extensible to other formats (DTA, SAS7BDAT, etc.)

@slobodan-ilic slobodan-ilic changed the title [Issue-308 support for file handles [Issue-308]: Add support for reading directly from file handles Oct 13, 2025
@slobodan-ilic slobodan-ilic force-pushed the ISS-308-support-for-file-handles branch 2 times, most recently from bc17fd2 to 1c9535f Compare October 13, 2025 17:56
@slobodan-ilic slobodan-ilic force-pushed the ISS-308-support-for-file-handles branch from 1c9535f to c2060eb Compare October 13, 2025 17:58
@ofajardo
Copy link
Collaborator

hi, thanks a lot for the PR! It is really great!

Is this ready or WIP?

I read your PR quickly, and I have a couple of questions, a couple of asks and one suggestion:

questions:

  • Why is the _read_fileobj implemented only for sav? Would it benefit also all the other reading functions?
  • in read_sav, if a file object you route it to _read_fileobj, which skips apply_value_formats ... why? I think that is still needed.

suggestions:

  • Would it be possible to just use the regular run_conversion instead of having a separate _read_fileobj? It looks to me as if run_conversion could be used out of the box with the changes you did. Maybe it is something you did for testing purposes but now can be discarded? This would solve my previous questions.

asks:

  • Could you add in the README file in the section Usage -> more reading options, a new sub-section describing briefly this new feature, and a couple of examples on how it could be used? I am thinking on your zip usecase ,and also what is described in read_sas7bdat when user_missing=True not populating meta.missing_user_values #279 if it applies (reading from a remote file)
  • The test_read_sav_from_zip_file_handle, could you put it in test_narwhalified.py and make sure it works for both pandas and polars?

@slobodan-ilic slobodan-ilic marked this pull request as draft October 14, 2025 14:28
@slobodan-ilic
Copy link
Contributor Author

Hey @ofajardo , thanks for your comment. I think all of your observations are valid. I should've created the PR as a draft (I just converted it to one just now). I first wanna make sure our team can use it for what they need, and if it proves to be usable, then I'll get back to the PR and make it much tighter, and also address all of your concerns.

@ofajardo
Copy link
Collaborator

Awesome! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants