[Feat] 스크립트 파일 생성 (#38)#99
Conversation
📝 WalkthroughWalkthroughAdds ChangesCorpus XLSX-to-Postgres Import Script
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/import_corpus.py`:
- Line 502: The load_workbook function call on line 502 is loading the entire
XLSX file into memory, which causes excessive memory consumption for large
corpus files. Add the read_only parameter set to True in the load_workbook
function call to enable read-only mode, which will reduce memory pressure and
improve performance when importing large XLSX files.
- Around line 109-114: The to_int() function is silently truncating decimal
values by converting through float first and then to int, which corrupts data
like char_limit. Modify the to_int() function to validate that the numeric value
is actually an integer before conversion: after converting text to float, check
if the float value equals its integer conversion (using modulo or equality
check), and if not, raise an error or return None instead of silently truncating
with int(float(text)).
- Around line 512-522: The current code silently converts missing required
sheets to None when they are not found in the workbook (checking JD_SHEET_NAME
and QUESTION_SHEET_NAME in workbook.sheetnames), allowing the import functions
to skip processing without any error indication. Replace the conditional sheet
access with explicit validation that raises an error if the required sheets are
missing, ensuring the import fails fast instead of reporting success when
nothing was actually imported. Check for sheet existence before calling
import_job_posting_sheet and import_question_sheet functions, and raise an
appropriate exception if either required sheet is absent from the workbook.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 71f6a4d1-a711-4972-9506-9340f61250b6
📒 Files selected for processing (3)
README.mdscripts/import_corpus.pyscripts/requirements-corpus-import.txt
| def to_int(value: Any) -> int | None: | ||
| text = to_string(value) | ||
| if text is None: | ||
| return None | ||
| return int(float(text)) | ||
|
|
There was a problem hiding this comment.
Reject non-integer numeric inputs in to_int() instead of truncating.
Line 113 currently does int(float(text)), so values like "12.9" become 12 silently. That can corrupt persisted char_limit values.
Proposed fix
def to_int(value: Any) -> int | None:
- text = to_string(value)
- if text is None:
- return None
- return int(float(text))
+ text = to_string(value)
+ if text is None:
+ return None
+ number = float(text)
+ if not number.is_integer():
+ raise ValueError(f"Expected integer-compatible value, got: {value}")
+ return int(number)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/import_corpus.py` around lines 109 - 114, The to_int() function is
silently truncating decimal values by converting through float first and then to
int, which corrupts data like char_limit. Modify the to_int() function to
validate that the numeric value is actually an integer before conversion: after
converting text to float, check if the float value equals its integer conversion
(using modulo or equality check), and if not, raise an error or return None
instead of silently truncating with int(float(text)).
| raise SystemExit(f"XLSX file not found: {xlsx_path}") | ||
|
|
||
| load_env_file(Path(args.env_file)) | ||
| workbook = load_workbook(filename=xlsx_path, data_only=True) |
There was a problem hiding this comment.
Open the workbook in read-only mode for large corpus files.
Line 502 loads the entire workbook into memory. For large XLSX imports, this can cause unnecessary memory pressure and slower startup.
Proposed fix
- workbook = load_workbook(filename=xlsx_path, data_only=True)
+ workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| workbook = load_workbook(filename=xlsx_path, data_only=True) | |
| workbook = load_workbook(filename=xlsx_path, data_only=True, read_only=True) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/import_corpus.py` at line 502, The load_workbook function call on
line 502 is loading the entire XLSX file into memory, which causes excessive
memory consumption for large corpus files. Add the read_only parameter set to
True in the load_workbook function call to enable read-only mode, which will
reduce memory pressure and improve performance when importing large XLSX files.
| import_job_posting_sheet( | ||
| cur, | ||
| workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None, | ||
| stats, | ||
| company_cache, | ||
| classification_cache, | ||
| ) | ||
| import_question_sheet( | ||
| cur, | ||
| workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None, | ||
| stats, |
There was a problem hiding this comment.
Fail fast when required sheets are missing instead of silently skipping.
At Line 514 and Line 521, missing sheets are converted to None, and the import functions return early. This can report successful completion while importing nothing.
Proposed fix
load_env_file(Path(args.env_file))
workbook = load_workbook(filename=xlsx_path, data_only=True)
+ missing_sheets = [
+ name for name in (JD_SHEET_NAME, QUESTION_SHEET_NAME)
+ if name not in workbook.sheetnames
+ ]
+ if missing_sheets:
+ raise SystemExit(f"Missing required sheet(s): {', '.join(missing_sheets)}")
@@
import_job_posting_sheet(
cur,
- workbook[JD_SHEET_NAME] if JD_SHEET_NAME in workbook.sheetnames else None,
+ workbook[JD_SHEET_NAME],
stats,
company_cache,
classification_cache,
)
import_question_sheet(
cur,
- workbook[QUESTION_SHEET_NAME] if QUESTION_SHEET_NAME in workbook.sheetnames else None,
+ workbook[QUESTION_SHEET_NAME],
stats,
company_cache,
classification_cache,
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/import_corpus.py` around lines 512 - 522, The current code silently
converts missing required sheets to None when they are not found in the workbook
(checking JD_SHEET_NAME and QUESTION_SHEET_NAME in workbook.sheetnames),
allowing the import functions to skip processing without any error indication.
Replace the conditional sheet access with explicit validation that raises an
error if the required sheets are missing, ensuring the import fails fast instead
of reporting success when nothing was actually imported. Check for sheet
existence before calling import_job_posting_sheet and import_question_sheet
functions, and raise an appropriate exception if either required sheet is absent
from the workbook.
✨ 어떤 이유로 PR를 하셨나요?
📋 세부 내용 - 왜 해당 PR이 필요한지 작업 내용을 자세하게 설명해주세요
📸 작업 화면 스크린샷
🚨 관련 이슈 번호 [ ]
Summary by CodeRabbit
Release Notes
New Features
Documentation