
Gopala Krishna Abba

Data & AI Systems Engineer

Data Engineering · Applied ML · Search & Ranking · Backend Systems

M.S. Computer Engineering, NYU Tandon — May 2026 · 4.0 GPA

View Portfolio · LinkedIn · Repositories · Résumé · Email

Recruiter Quick View

  • Best aligned with early-career Data Engineering, Applied Data Science, Search & Ranking, ML/Data Platform, and backend data systems roles.
  • Current research: Graduate Research Assistant at NYU Chunara Lab since January 2026, building reproducible news-data ingestion and structured LLM-labeling workflows.
  • Most recent industry experience: Data Science Intern — Emerging Technology at FOX Corporation / FOX Tech, February–April 2026.
  • Core differentiator: connecting data pipelines, retrieval and ranking, model evaluation, typed APIs, testing, and deployment into measurable end-to-end systems.

Selected Impact

  • 4.0 / 4.0: NYU Tandon M.S. GPA
  • 12,168 profiles: hybrid retrieval at <350 ms p95
  • ~13M records: historical MTA analysis
  • 92.7% fewer bytes: scanned in SignalLake benchmark

Systems flow from batch and streaming ingestion through transformation, models, retrieval, ranking, APIs, and products.

Experience

NYU Chunara Lab — Graduate Research Assistant · Jan 2026–present
Building reproducible pipelines that process 7,000+ articles weekly from 10 U.S. newspapers, plus validated LLM-labeling workflows and external research datasets. Experience details

FOX Corporation / FOX Tech — Data Science Intern — Emerging Technology · Feb–Apr 2026
Built a 9-stage Databricks pipeline and hybrid semantic clustering/ranking workflow for editorial content; validated duplicate-safe runs across 14 dates in 32–42 seconds. Experience details

Global Futures Group — Software Engineer Intern, AI/ML & Data Infrastructure · Sep–Dec 2025
Built hybrid FAISS and BM25 retrieval, ranking, FastAPI, and PostgreSQL infrastructure for 12,168 expert profiles with p95 search latency below 350 ms. Case study

NYU DICE Lab — Graduate Research Assistant · May–Dec 2025
Implemented LoRA/PEFT experiments across DistilGPT-2 and Pythia models to measure Dynamic Tanh versus LayerNorm quality and inference trade-offs. Research project
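
For readers unfamiliar with the comparison, here is a minimal pure-Python sketch of the two operations on a single activation vector. It assumes the usual DyT form, gamma * tanh(alpha * x) + beta, and the input values and alpha = 0.5 are illustrative only; it is not code from the research project.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's activations using their mean and variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def dynamic_tanh(x, alpha, gamma, beta):
    """Element-wise DyT: gamma * tanh(alpha * x) + beta, no statistics needed."""
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]

# Illustrative values only (not taken from the experiments).
x = [-4.0, 0.0, 1.0, 9.0]
gamma, beta = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, gamma, beta)
dyt_out = dynamic_tanh(x, alpha=0.5, gamma=gamma, beta=beta)
```

LayerNorm needs a mean and variance per token, while DyT is a fixed element-wise squashing function, which is where the inference trade-off comes from.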

Featured Engineering Work

SignalLake — Operational log analytics

Local-first ingestion and analytics platform that preserves raw JSONL events, transforms them into Hive-partitioned Parquet, and queries operational metrics directly with DuckDB.

FastAPI · DuckDB · Parquet · Pydantic · Docker

Evidence: benchmarked 1M events across 192 files; partition pruning reduced data scanned from 86.7 MB to 6.3 MB. Validated with 21 tests across four pytest files and GitHub Actions. Case study
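
A minimal sketch of the idea behind partition pruning, using only the standard library. The `date=`/`hour=` layout, file names, and byte sizes are hypothetical and not taken from SignalLake; only the 86.7 MB and 6.3 MB figures come from the benchmark above.

```python
from pathlib import PurePosixPath

def parse_partitions(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    parts = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts

def prune(files: dict[str, int], **wanted: str) -> dict[str, int]:
    """Keep only files whose partition values match every requested key."""
    return {
        path: size
        for path, size in files.items()
        if all(parse_partitions(path).get(k) == v for k, v in wanted.items())
    }

def reduction_pct(total: float, scanned: float) -> float:
    """Percentage of data skipped, rounded to one decimal place."""
    return round(100 * (total - scanned) / total, 1)

# Hypothetical file listing: path -> size in bytes.
files = {
    "events/date=2026-01-01/hour=00/part-0.parquet": 400,
    "events/date=2026-01-01/hour=01/part-0.parquet": 500,
    "events/date=2026-01-02/hour=00/part-0.parquet": 450,
}
selected = prune(files, date="2026-01-02")
print(sorted(selected))
print(reduction_pct(86.7, 6.3))  # 92.7
```

A query engine applies the same filter before opening any file, so files in non-matching partitions are never read.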

ExpertMatchAI — Hybrid expert search

Internship project combining FAISS vector retrieval, BM25 lexical search, structured geo filters, tunable ranking weights, and fallback logic behind FastAPI and PostgreSQL services.

FAISS · BM25 · FastAPI · PostgreSQL · Next.js

Evidence: indexed 12,168 profiles; average response time below 200 ms, p95 below 350 ms, and full index rebuilds below 60 seconds. Validated with pytest, Vitest, and Playwright. Proprietary source code is not public.
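
Because the source is not public, the following is only a generic sketch of weighted score fusion with a fallback, not the ExpertMatchAI implementation. The function names, default weights, and example scores are all hypothetical, and min-max normalization is one assumed way to put vector and lexical scores on a common scale.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(vector_scores, lexical_scores,
                w_vector=0.6, w_lexical=0.4, top_k=10):
    """Fuse two score maps with tunable weights; returns (id, score) pairs."""
    vec, lex = min_max(vector_scores), min_max(lexical_scores)
    # Fallback: if one retriever returns nothing, rank by the other alone.
    if not vec:
        w_vector, w_lexical = 0.0, 1.0
    elif not lex:
        w_vector, w_lexical = 1.0, 0.0
    fused = {
        doc_id: w_vector * vec.get(doc_id, 0.0) + w_lexical * lex.get(doc_id, 0.0)
        for doc_id in set(vec) | set(lex)
    }
    # Sort by descending score, breaking ties by id for stable output.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Hypothetical scores: similarity from a vector index, BM25 from a lexical index.
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
lexical_scores = {"b": 12.0, "c": 3.0, "d": 7.5}
ranked_ids = [doc_id for doc_id, _ in hybrid_rank(vector_scores, lexical_scores)]
fallback_ids = [doc_id for doc_id, _ in hybrid_rank({}, lexical_scores)]
```

Here profile "b" ranks first because both retrievers score it well, and the fallback call ranks purely by lexical score when the vector side returns nothing.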

Combined analysis and model development over approximately 13M historical MTA records with a separate live self-hosted pipeline using simulated turnstile events.

Kafka · Spark Structured Streaming · MongoDB · Random Forest · Docker

Evidence: regression models reached approximately 2,700 RMSE and below 4.8% MAE; a separate traffic-level classifier reached 93.36% accuracy. Case study

Public-healthcare-data MLOps prototype covering feature selection, Gradient Boosting training, experiment tracking, Kubeflow orchestration, and Flask model serving; not a clinically validated system.

scikit-learn · MLflow · DAGsHub · Kubeflow · Flask

Evidence: processed 167,497 records, reduced 28 inputs to five through chi-square selection, and reached 92.9% accuracy with 0.89 ROC-AUC. Case study
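
To illustrate chi-square feature selection in general terms, this sketch scores categorical columns against a label with a contingency-table chi-square statistic and keeps the top k. It is a standard-library approximation of the technique, not the project's pipeline; the column names and data are hypothetical.

```python
from collections import Counter

def chi_square(feature: list, label: list) -> float:
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    feature_totals, label_totals = Counter(feature), Counter(label)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

def select_top_k(columns: dict[str, list], label: list, k: int) -> list[str]:
    """Return the k column names with the highest chi-square score."""
    scores = {name: chi_square(values, label) for name, values in columns.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]

# Hypothetical toy data: eight rows, three binary inputs, one binary label.
label = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "smoker":  [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly associated with the label
    "partial": [1, 1, 1, 0, 0, 0, 0, 1],  # weakly associated
    "noise":   [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}
kept = select_top_k(columns, label, k=2)
```

Columns independent of the label score near zero and are dropped first, which is how 28 inputs can be narrowed to a handful.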

Where My Experience Fits

Role family, followed by the strongest public evidence:

  • Data Engineering: FOX Databricks pipeline, current NYU ingestion, SignalLake, Kafka/Spark streaming
  • Applied Data Science / ML: FOX clustering and ranking, NYU research, forecasting, reproducible MLOps
  • Search & Ranking: Global Futures / ExpertMatchAI hybrid retrieval, geo filtering, tunable ranking, fallbacks
  • Backend Data Systems: FastAPI, PostgreSQL, DuckDB, REST APIs, Docker, CI, structured logging, automated tests

Technical Toolkit

  • Data & Distributed Systems: Python, SQL, PySpark, Spark Structured Streaming, Kafka, Databricks, Delta Lake, Parquet, DuckDB
  • ML, Retrieval & Evaluation: scikit-learn, PyTorch, Hugging Face, FAISS, BM25, embeddings, MLflow, relevance and model evaluation
  • Backend & Databases: FastAPI, Flask, PostgreSQL, MongoDB, Prisma, Next.js, TypeScript, REST APIs
  • Cloud, Testing & Delivery: Docker, Kubernetes, Kubeflow, GitHub Actions, Vercel, Railway, pytest, Vitest, Playwright

How I Work

Start with the simplest correct system, define measurable behavior, and make failures visible before adding complexity.

  • Evaluate retrieval and ML systems with explicit quality, latency, and failure criteria.
  • Make data and model workflows reproducible through versioned artifacts, rerunnable pipelines, tests, and CI.
  • Document architectural trade-offs so another engineer can understand both the result and its limits.

Current Focus

I am currently building reproducible text-data and LLM-labeling systems at NYU Chunara Lab and am open to early-career roles in Data Engineering, Applied Data Science, Search & Ranking, ML/Data Platforms, and backend systems for data and AI.

View the portfolio · Explore repositories · Connect on LinkedIn · Email me

Pinned

  1. signallake · Public

    Cloud-native log analytics platform. Ingests events, stores as Hive-partitioned Parquet, queries with DuckDB. Local-first MVP with AWS S3 + Athena deployment path.

    Python

  2. sbsim · Public

    Forked from google/sbsim

    Google Open-Source Project: Stochastic building simulator and real-world dataset for training and benchmarking reinforcement learning agents in energy-efficient smart control environments. Built wi…

    Python 1

  3. DyT-NoNorm-LLMs-REWILD · Public

    Replacing LayerNorm with Dynamic Tanh (DyT) in DistilGPT2 + LoRA, evaluated on RE-WILD, Alpaca, and ShareGPT.

    Jupyter Notebook

  4. VeriWire · Public

    VeriWire: voice banking fraud-confirmation agent (Deepgram + Twilio + LangGraph + FastAPI)

    Python 6 1

  5. FuseRank · Public

    Scalable Hybrid Recommender (TensorFlow + TF-IDF) - Trained on 5M/70M interactions to MAE 0.1863 / MSE 0.0727 (epoch 16, ~20 min) and shipped Flask on GKE (Docker, CI/CD GitHub Actions, HPA), −30% …

    Jupyter Notebook

  6. StreamVault · Public

    StreamVault is a web-based streaming platform management system. It extends the database schema with authentication capabilities and provides a full CRUD interface for managing web series, episodes…

    HTML 1