Skip to content

Support deleted rows in SAS data files#366

Open
hpoettker wants to merge 1 commit into
WizardMac:devfrom
hpoettker:deleted-rows
Open

Support deleted rows in SAS data files#366
hpoettker wants to merge 1 commit into
WizardMac:devfrom
hpoettker:deleted-rows

Conversation

@hpoettker

Copy link
Copy Markdown
Contributor

Resolves #284.

Introduction

This PR adds support for deleted rows in SAS data files, which may be compressed or uncompressed.

The handling of deleted uncompressed rows follows the description of the sas7bdat file format here: https://github.com/FredHutch/sas7bdat-specification/blob/master/sas7bdat.rst

The handling of deleted compressed rows is not described in the document linked above. However, it is much simpler as it doesn't involve a dedicated bitmap but only the compression type 0x05, which marks compressed data rows as deleted. The compression type 0x05 seems to be the combination of 0x01 (meaning the data can be skipped) and 0x04 (indicating a compressed data row), which fits well with my theory put forth in #365 that the compression type is actually a bitmap.

Row limits and counters

Currently, the row_limit is not only an upper limit on the number of rows but also the exact number of rows that are expected to be parsed. As the row_limit also includes the deleted rows, the corresponding counter parsed_row_count is now also increased when encountering a deleted row. This allows for the validation checks against row_limit to remain unchanged.

The two new variables deleted_row_limit and parsed_deleted_row_count serve a corresponding purpose but count only deleted rows.

The row count in the meta data is computed as row_limit - deleted_row_limit. Accordingly, the row id that is passed to the value handler and is used in error messages is now computed as parsed_row_count - parsed_deleted_row_count.

Implementation alternative

To the user of the library, deleted rows are transparent with the proposed change, which is a bit different in SAS itself. There, the number of deleted rows is visible in the metadata and the GUI viewer indicates positions of deleted rows by non-subsequent row ids.

One could consider adding the number of deleted rows to the meta data. And one could also consider marking rows as deleted either by explicitly passing the information to the value handler or by implicitly not passing data for deleted row ids to it. This would require a breaking change in the API, I think. It's mostly a discussion on what the information row_count in the meta data should exactly refer to in the case of deleted rows.

Validation

The code works as expected with the example file from #284, which uses a single page and contains only 5 rows (of which 2 are deleted).

I've also tested it successfully against data files written on both Windows and Linux that contain multiple pages and have deleted rows across them.

As generating generic test data is quite easy in this case, I can generate test files if needed.

@hpoettker

Copy link
Copy Markdown
Contributor Author

I've rebased on the latest commits in dev, and I've made some minor changes:

  • the type unsigned char is now used instead of uint8_t for the bitmap
  • the code style is more aligned with the existing code (indentation and placement of the asterisk in pointer declarations)

If there is anything else that should be changed, please let me know.

The existing unit tests for the reading of SAS data files all rely on the write feature of ReadStat to write the files first. Would it be okay to add a unit test that uses the example file from #284? It's quite small so commiting it to the repository seems okay to me. Or should I add a feature to the write API to write deleted rows?

@hpoettker

Copy link
Copy Markdown
Contributor Author

I've tested this PR with SAS data files from SAS 9.4 and the file from the linked issue, which has been created with SAS 8.0202M0. And it works fine for those as far as I can tell.

However, the PR doesn't work for the SAS file from #379, which has been generated with SAS 7.

The SAS 7 file is also treated differently by SAS 9.4 clients. Usually, the number of deleted rows is visible in the GUI of a local SAS installation but not for this one. While the incremental row number in the data view is presented correctly for the (only) non-deleted row, the total number of rows in the meta-data is presented as just 1.

There are two reasons why the PR doesn't work for the SAS 7 file:

  • The row size subheader doesn't provide the total and deleted row count as 178 and 177, respectively, but as 1 and 0. This also explains the different display in the SAS GUI.
  • The calculation of the offset of the deleted row bitmap is off by 14 for all 5 pages.

Both problems are fixable:

  • The deleted_row_limit would need to be removed and the parsed_row_count would not be incremented for deleted rows.
  • Instead of calculating the bitmap position by starting from the beginning of the page area for the uncompressed rows, the position could be calculated from the position of the first subheader. In all files I've analyzed, the bitmap is directly before the first subheader with no bytes in between.

If there is interest in an adaptation of this PR for SAS 7 files, please let me know.

@gdementen

Copy link
Copy Markdown

As far as I am concerned (but I am just a biased user) support for SAS8+ files with deleted rows would already be very nice. I mean, support for SAS7 files would be even nicer but not if that delays merging the original PR.

@sasutils

sasutils commented Jul 1, 2026

Copy link
Copy Markdown

I would not worry about dealing with SAS datasets generated by SAS version 7. That version was not very stable and very few people ever used it.

@evanmiller

Copy link
Copy Markdown
Contributor

Hi, I asked Fable to review, can you look over the comments and make adjustments if necessary? (Sorry it somewhat annoyingly decided to use "maintainer voice", but here goes:)


Bugs: interaction with user-set row limit / row offset

The premise that row_limit is "the exact number of rows expected to be parsed" only holds when the user hasn't set a limit. sas7bdat_parse_row_size_subheader() takes row_limit = min(user limit, total_row_count − row_offset), and the page loop breaks as soon as parsed_row_count == row_limit. That leads to three problems:

  1. uint32 underflow in the metadata. With

    .row_count = ctx->row_limit - ctx->deleted_row_limit,

    a caller using readstat_set_row_limit(10) on a file with 30 deleted rows gets row_count ≈ 4.29 billion in the metadata handler. Also reachable with a large row_offset on a file with many deleted rows.

  2. Spurious hard error at end of parse. The new check

    if (ctx->handle.value && ctx->parsed_deleted_row_count != ctx->deleted_row_limit)

    fails whenever a user limit stops parsing before all deleted rows have been seen — a valid limit on a valid file now returns READSTAT_ERROR_ROW_COUNT_MISMATCH, after values have already been streamed to the handler.

  3. Wrong bitmap location. In sas7bdat_parse_deleted_row_bitmap():

    uint32_t row_count = ctx->page_row_count < ctx->row_limit ? ctx->page_row_count : ctx->row_limit;
    uint64_t deleted_row_bitmap_offset = row_count * ctx->row_length + page_unused_bytes;

    The bitmap's on-disk position depends only on how many rows are physically on the page, so row_limit shouldn't enter into it. If the user's limit is smaller than the page's row count, the computed pointer lands inside row data and the code reads garbage flags — wrong rows delivered, or a spurious parse error via sas7bdat_register_deleted_row(). I'd drop the min() entirely: ctx->page_row_count is already per-page for DATA pages (read from the page header) and comes from the row-size subheader for MIX pages, and the existing bounds check guards corrupt values.

A related semantic wrinkle even once those are fixed: deleted rows consume the user's row budget (parsed_row_count++), so readstat_set_row_limit(N) can deliver fewer than N rows even when N live rows exist. That may be an acceptable trade-off, but it should be a deliberate decision rather than a side effect.

Design: hard failure on count mismatch

Today, files with deleted rows parse "successfully" (deleted rows come through as live). With this PR, any file where the bitmap and counters don't reconcile — the SAS 7 file in #379, or sas7bdat_register_deleted_row() hitting more flagged rows than the subheader promised — fails hard, sometimes after values were already delivered. Since the counters are evidently unreliable across SAS versions, I'd prefer to degrade gracefully: report the discrepancy through the error handler but don't fail the parse, or fall back to pre-PR behavior when the counts don't reconcile. I agree with the comments above that SAS 7 support shouldn't block this PR — but the failure mode for such files should be "stale rows included, like today," not "file unreadable."

Your SAS 7 analysis also suggests the more robust bitmap location is "immediately before the first subheader" rather than computed forward from the row area — worth considering here even independent of SAS 7 support.

Minor

  • deleted_row_limit is a misnomer — it's the file's deleted-row count from metadata, not a limit. The overloaded meaning next to row_limit is part of what makes the arithmetic above easy to get wrong; deleted_row_count would be clearer.
  • The new error message uses %d for uint32_t values; given Fix incorrect snprintf format specifier and undefined bsearch comparator #377 just fixed a format-specifier bug, %u (or PRIu32) would be better.
  • On your metadata question: reporting row_count as live rows is the right default for existing consumers. If the deleted count is worth exposing, an additive readstat_get_deleted_row_count()-style accessor would avoid any breaking change.
  • On tests: committing the small fixture file from Skip deleted observations in SAS7BDAT files #284 seems more honest than adding deleted-row write support just to round-trip a test — a write feature would exercise the writer's assumptions rather than real SAS output.

Summary

To get this merged: drop row_limit from the bitmap-offset calculation, rework the metadata subtraction so it can't underflow, skip (or soften) the deleted-count equality check when parsing stops early at a user limit, and decide whether count mismatches are warnings or hard errors. Everything else looks merge-ready to me.

@hpoettker

Copy link
Copy Markdown
Contributor Author

Thanks for the review!

The observations about row_limit are very much on point. When I implemented the PR, I wasn't aware that a row limit could be passed to ReadStat. I was a bit irritated by the name of the variable but failed to draw the conclusion and just treated it as the number of rows in the file. I then named the variable deleted_row_limit for consistency.

I'll fix this but it will take some time as I won't be able to work on it next week.

There are two aspects in the review that I'd like to have your short personal take on:

  • Is the test strategy to commit the SAS 8 sample file the way to go for you? As stated above, I'd consider it fine. But it would be the first such test and the round-trip test approach seems to have been the way to go in the past.
  • The design section of the review is critical about the validation whether the expected number of deleted rows was actually encountered and that it leads to an abort if it fails. Do you share that sentiment? My impression so far was that ReadStat contains many such validations. And that's a good thing because it gives me as a user the confidence that if ReadStat reads a file without error that it then also does so correctly. In my view, that is a key differentiator of ReadStat compared with other libraries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Skip deleted observations in SAS7BDAT files

4 participants