Job Board Aggregator

Screenshot: table view with salary and ATS data

Automated job board aggregating 1,000,000+ positions from 20,000+ companies across seven major ATS platforms. Updated daily via GitHub Actions.

Live Site

View Job Board

Features

  • Multi-platform scraping: Greenhouse, Lever, Ashby, BambooHR, iCIMS, Paylocity, and Workday APIs scraped in parallel using concurrent.futures
  • Progressive loading: Chunked gzip data loaded via Web Workers for fast initial render
  • Advanced filtering: Filter by title, company, location, ATS platform, experience level, and exclude keywords. Toggle remote-only, hide recruiter postings, or hide already-applied jobs
  • Job tier classification: Automatic skill-level tagging (intern/entry/mid/senior) using weighted keyword scoring on job titles
  • Application tracking: Mark jobs as saved, applied, or ignored, with batch updates supported; tracking state is stored in localStorage
  • URL state sync: Filter/sort/page state persisted in the URL for shareable/bookmarkable searches
  • Responsive design: Desktop table view with card-based mobile layout
  • Automated pipeline: Daily GitHub Actions workflow (fetch existing data → scrape → merge → push chunks to the data-live branch → create release)
  • Interactive heatmap: Map view showing job density by location
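
As an illustration of the job tier classification feature, here is a minimal sketch of weighted keyword scoring on a job title. The keyword lists, the weights, the `classify_tier` name, and the fallback to "mid" are all assumptions made for this example; the actual scraper may use different keywords and rules.

```python
import re

# Hypothetical keyword weights per tier; not taken from scraper.py.
TIER_KEYWORDS = {
    "intern": {"intern": 3, "internship": 3, "co-op": 2},
    "entry": {"junior": 3, "graduate": 2, "entry": 2, "associate": 1},
    "senior": {"senior": 3, "staff": 3, "principal": 3, "director": 3, "lead": 2},
}
DEFAULT_TIER = "mid"  # assumed fallback when no keyword matches


def classify_tier(title: str) -> str:
    """Return the tier whose keywords score highest for this title."""
    # Split into whole words so "internal" does not match "intern".
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower())
    scores = {
        tier: sum(weights.get(word, 0) for word in words)
        for tier, weights in TIER_KEYWORDS.items()
    }
    best_tier, best_score = max(scores.items(), key=lambda item: item[1])
    return best_tier if best_score > 0 else DEFAULT_TIER
```

Matching on whole words rather than substrings is a deliberate choice in this sketch: a title such as "Internal Tools Engineer" scores zero for every tier and falls back to "mid".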

Screenshot: map view with job density heatmap and filtering

Tech Stack

| Layer    | Tools                                                  |
|----------|--------------------------------------------------------|
| Frontend | Vanilla JavaScript (ES Modules), Bootstrap 5, HTML/CSS |
| Scraping | Python 3.12, requests, concurrent.futures, gzip        |
| Data     | Chunked gzip JSON, Web Workers for decompression       |
| CI/CD    | GitHub Actions (daily cron + manual dispatch)          |
| Hosting  | GitHub Pages                                           |

Architecture

scripts/
├── scraper.py          # Multi-ATS scraper with parallel fetching
└── merge_data.py       # Deduplicates and prunes stale jobs (>30 days)

js/
├── app.js              # Main app class and initialization
├── jobs_loader.js      # Progressive chunk loading + Web Worker orchestration
├── chunk_worker.js     # Web Worker for gzip decompression
├── filters.js          # Filter logic with regex matching
├── sort_logic.js       # Client-side sort with alpha/numeric handling
├── renderer.js         # Table/card rendering with pagination
├── storage.js          # localStorage wrapper for application tracking
├── columns.js          # Column definitions and custom renderers
├── events.js           # Event listener setup
├── url_state.js        # URL query string sync
└── ui_utils.js         # Toast notifications, HTML escaping, utilities

data/
├── *_companies.json    # Company lists per ATS platform (tracked on main)
├── salary/             # Salary lookup table, sharded a-z (static input)
├── locations.json      # Geolocation lookup
└── trends/daily.jsonl  # Append-only daily trend history

# Chunked job data (jobs_chunk_*.json.gz + jobs_manifest.json) is NOT on main.
# It is force-pushed to the orphan `data-live` branch each run and served from there.

Data Pipeline

  1. Scrape: scraper.py fetches jobs from all seven ATS APIs concurrently (30 workers per platform, 10 for BambooHR to respect rate limits)
  2. Classify: Each job is tagged with a skill level based on title keywords and flagged if posted by a recruiting agency
  3. Clean: Jobs missing titles, URLs, or company info are dropped
  4. Chunk: Results are split into ~25k-job gzipped chunks with a manifest file
  5. Merge: merge_data.py deduplicates against existing data and prunes jobs older than 30 days
  6. Deploy: GitHub Actions commits the trend snapshot to main, force-pushes regenerated chunks to the data-live branch, and creates a tagged release. The frontend fetches chunks from data-live via raw.githubusercontent, keeping main code-only.
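
The merge and chunk steps above could look roughly like the sketch below. It is not the code in merge_data.py: the `url` deduplication key, the `first_seen` timestamp field, the manifest layout, and the function names are assumptions for illustration. Only the 30-day cutoff, the ~25k chunk size, and the `jobs_chunk_*.json.gz` / `jobs_manifest.json` file names come from this README. The sketch also merges before chunking, which is a simplification.

```python
import gzip
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

MAX_AGE_DAYS = 30    # pruning threshold stated in the pipeline description
CHUNK_SIZE = 25_000  # approximate chunk size stated in the pipeline description


def merge_jobs(existing, scraped, now=None):
    """Deduplicate by URL (assumed key) and prune jobs older than MAX_AGE_DAYS."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=MAX_AGE_DAYS)
    merged = {}
    # Existing jobs come first, so a re-scraped job keeps its original first_seen.
    for job in existing + scraped:
        merged.setdefault(job["url"], job)
    # first_seen is a hypothetical timezone-aware ISO 8601 timestamp field.
    return [
        job for job in merged.values()
        if datetime.fromisoformat(job["first_seen"]) >= cutoff
    ]


def write_chunks(jobs, out_dir, chunk_size=CHUNK_SIZE):
    """Write gzipped JSON chunks plus a manifest listing them (layout assumed)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_names = []
    for index, start in enumerate(range(0, len(jobs), chunk_size)):
        name = f"jobs_chunk_{index}.json.gz"
        with gzip.open(out / name, "wt", encoding="utf-8") as fh:
            json.dump(jobs[start:start + chunk_size], fh)
        chunk_names.append(name)
    manifest = {"total_jobs": len(jobs), "chunks": chunk_names}
    (out / "jobs_manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    return manifest
```

A manifest of this kind lets the frontend learn how many chunks exist before requesting any of them, which is one plausible way to support progressive loading.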

Company Discovery

Company lists are built from Common Crawl index data using a separate harvesting pipeline. The harvester scans CDX archives for URLs matching 20+ ATS domain patterns, extracts company slugs via regex, and deduplicates across multiple crawl snapshots. This currently yields roughly 95,000 unique company identifiers.
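
The slug extraction step can be sketched as follows. The three URL patterns, the `extract_slugs` name, and the lowercasing used for deduplication are assumptions for this example; the harvester's real patterns (20+ per the description above) are not shown in this README.

```python
import re

# Illustrative patterns for three ATS domains; assumed, not the harvester's list.
SLUG_PATTERNS = {
    "greenhouse": re.compile(r"https?://boards\.greenhouse\.io/([A-Za-z0-9_-]+)"),
    "lever": re.compile(r"https?://jobs\.lever\.co/([A-Za-z0-9_-]+)"),
    "ashby": re.compile(r"https?://jobs\.ashbyhq\.com/([A-Za-z0-9_.-]+)"),
}


def extract_slugs(urls):
    """Map each platform to the set of unique company slugs found in the URLs."""
    slugs = {}
    for url in urls:
        for platform, pattern in SLUG_PATTERNS.items():
            match = pattern.match(url)
            if match:
                # Lowercase so the same company seen in several crawl
                # snapshots collapses into one identifier (assumed rule).
                slugs.setdefault(platform, set()).add(match.group(1).lower())
                break
    return slugs
```

Collecting slugs into sets means repeated URLs from multiple crawl snapshots deduplicate without any extra bookkeeping.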

Local Development

git clone https://github.com/Feashliaa/job-board-aggregator.git
cd job-board-aggregator
python -m http.server 8000
# Visit http://localhost:8000

To run the scraper locally:

cd scripts
pip install -r requirements.txt
python scraper.py --source manual

License

Code in this repository is licensed under the MIT License - see the LICENSE file for details.

The curated company datasets in data/ are licensed under CC BY-NC 4.0. You're free to use, modify, and share the data for non-commercial purposes. Commercial use of the datasets requires permission - reach out via GitHub Issues or email.


Built by Riley Dorrington
