ECE · University of Waterloo

Vid Sharda

Hardware and optical systems engineer with experience at the Canadian Space Agency, Teledyne DALSA, and Ford Motor Company, building everything from RTL and VLSI up to FPGA-accelerated AI inference pipelines.

Verilog / SystemVerilog · FPGA / ASIC · LLM Pipelines · 65nm TSMC CMOS · AI Acceleration · Cadence / Synopsys
Overview
About Me

I am an Electrical and Computer Engineering Master's graduate from the University of Waterloo. My expertise lies in hardware design, RTL, and FPGA and ASIC development. My projects range from CMOS chip design in 65nm TSMC to building a 5-stage pipelined RISC-V CPU in Verilog and accelerating the BERT LLM on hardware. I am looking to apply my skills in hardware design and AI acceleration to build the next generation of efficient computing systems.

Professionally, I have completed 7 internships across the semiconductor, automotive, and aerospace industries, including at the Canadian Space Agency, Teledyne DALSA, and Ford Motor Company. These experiences have given me a broad perspective on hardware design challenges across different domains, from radiation-hardened space systems to high-throughput automotive applications. I am now interested in carrying the skills I built in graduate school into a full-time role in hardware design and AI acceleration.

My current focus and interest areas are hardware-software codesign, RTL design and verification, and FPGA engineering. I am just interested in getting my hands dirty building cool hardware projects (hopefully something meaningful), and learning as much as I can along the way!

3 Industry Co-ops
18% Power Reduction @ Ford
65nm TSMC CMOS
MEng, UWaterloo ECE
Background
Education & Experience

Education

MEng — Electrical & Computer Engineering
University of Waterloo
2024 – 2026
Graduate program focused on hardware acceleration, digital design, and AI systems. Current project: FPGA-accelerated OpenVLA transformer inference on Intel Agilex-5.
FPGA · AI Acceleration · RTL
BASc — Nanotechnology Engineering
University of Waterloo
2019 – 2024
Honours degree spanning quantum devices, CMOS fabrication, analog & digital design, VLSI, and signal processing. Specialised in hardware-software co-design and electro-optical systems.
Nanotechnology · VLSI · 65nm CMOS · Signal Processing

Experience

Hardware & Optical Systems Engineer
Canadian Space Agency
May 2025 – Aug 2025
Engineered a photon detection and signal processing chain (photodetectors, shutters, lock-in amplifiers, Raspberry Pi digitisation) for satellite link analysis. Integrated orbit propagation data into an atmospheric characterisation system driven by Arduino-controlled stepper motors, improving end-to-end performance by 16%.
Photonics · Signal Processing · Raspberry Pi · Arduino · Satellite Systems
Aerospace Image Technologist
Teledyne DALSA
May 2023 – Aug 2023
Performed electro-optical testing (light saturation, dynamic range, noise) on CMOS image sensors for space missions. Led radiation and environmental testing for aerospace compliance. Designed image defect detection scanners and automation tools in LabVIEW, MATLAB, and Python, eliminating 30% of redundant workflows.
CMOS Sensors · LabVIEW · MATLAB · Python · Aerospace Testing
Robotics Process Automation Engineer
Day5Analytics
Jan 2023 – Apr 2023
Worked with clients to identify automation opportunities, then designed and implemented RPA solutions using Python, Neo4j, and KNIME. Delivered automation solutions for 4 major projects in retail, finance, and healthcare, reducing manual processing time by 40% and improving data accuracy.
Python · SQL · KNIME · Process Mining
Integrations Analyst
PointClickCare
Jan 2022 – Apr 2022
Worked with the Integration and Enterprise Team to integrate data between several platforms, including Salesforce, NetSuite, KnowBe4, and Boomi. Used Python and SQL to generate a golden dataset for the sales team, improving lead conversion, performance tracking, and client targeting.
Python · SQL · Salesforce · NetSuite · Data Analytics
Power Optimisation & Software Developer
Ford Motor Company
Sep 2021 – Dec 2021
Reduced power consumption and resource utilisation of onboard infotainment hardware by 18% through application call optimisation. Automated HIL testing validation via priority scheduling algorithms. Refactored the legacy codebase, resolving recurring bugs and improving long-term maintainability.
Embedded Software · HIL Testing · C++ · Power Optimization
IT Specialist
University of Waterloo
Jan 2021 – Apr 2021
Worked with the IT department to provide technical support and troubleshooting for students and faculty. Managed software installations, network issues, and hardware maintenance. Helped researchers create websites and survey platforms for their projects and improved data collection processes using Python scripts.
Python · JavaScript · MATLAB · SQL · Web Development
Technical Stack
Skills
Languages
  • Verilog / SystemVerilog
  • Python
  • C++
  • JavaScript
  • MATLAB
EDA & Design Tools
  • Cadence Virtuoso
  • Synopsys DC / PrimeTime
  • Xilinx Vivado / Intel Quartus Prime
  • ModelSim / XSim / Verilator
  • Altium Designer
Technologies
  • 65nm TSMC CMOS
  • FPGA & ASIC Design
  • Digital & Analog Design
  • RTL Verification
  • SPICE / COMSOL
Systems & AI
  • LLM / Transformer RTL
  • FPGA AI Acceleration
  • Intel AI Tensor Blocks
  • Electro-Optical Systems
  • Linux / Git
Portfolio
Projects
Hardware Architecture · Verilog

5-Stage Pipelined RISC-V CPU

A fully functional 5-stage pipelined processor implementing the RISC-V ISA, written in Verilog. Supports the RV32I base integer instruction set with full pipeline hazard handling.

Designed for correctness and performance: data hazards are resolved via forwarding paths from the EX and MEM stages, minimizing pipeline stalls. Control hazards use static predict-not-taken handling: branches resolve in the ID stage, with a one-cycle flush when a branch is taken.

  • IF → ID → EX → MEM → WB pipeline with full data forwarding
  • Hazard detection unit with stall and flush control
  • Register file with dual async read, synchronous write
  • ALU supporting R-type, I-type, S-type, B-type, U-type, J-type
  • Simulated and verified in ModelSim, followed by on-board testing
Pipeline Stages
IF → ID → EX → MEM → WB
Forwarding paths: EX→EX and MEM→EX reduce stall cycles from data hazards. Branch resolution at ID stage with 1-cycle flush on taken branches.
CPI Breakdown (approx.)
R-type: 1.0
Loads: 1.5×
Branches: +1 flush
Forwarded: 0 stalls
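The per-class figures above can be folded into a rough average CPI for a given instruction mix. The Python sketch below is a back-of-the-envelope model, not a measurement: it assumes one stall cycle per load-use hazard and one flushed cycle per taken branch, and the function name and the instruction-mix fractions in the usage lines are hypothetical.

```python
def average_cpi(load_use_frac: float, taken_branch_frac: float,
                load_use_stall: int = 1, branch_flush: int = 1) -> float:
    """Estimate average CPI for a scalar in-order pipeline.

    load_use_frac:     fraction of instructions that stall on a load-use hazard
    taken_branch_frac: fraction of instructions that are taken branches
    """
    base_cpi = 1.0  # ideal pipeline: one instruction completes per cycle
    return (base_cpi
            + load_use_frac * load_use_stall
            + taken_branch_frac * branch_flush)


# Hypothetical mix: 10% load-use stalls, 8% taken branches
print(round(average_cpi(0.10, 0.08), 2))
```

Fully forwarded ALU-to-ALU dependencies add nothing to the total, which is why they do not appear as a term.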
hazard_unit.v Verilog
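// Hazard detection + forwarding unit
module hazard_unit (
  input  [4:0] IF_ID_Rs1, IF_ID_Rs2,   // sources of the instruction in ID
  input  [4:0] ID_EX_Rs1, ID_EX_Rs2,   // sources of the instruction in EX
  input  [4:0] ID_EX_Rd,               // destination of the instruction in EX
  input  [4:0] EX_MEM_Rd, MEM_WB_Rd,
  input         EX_MEM_RegWrite, MEM_WB_RegWrite,
  input         ID_EX_MemRead,
  output reg [1:0] ForwardA, ForwardB,
  output reg       Stall
);
  always @(*) begin
    // Load-use hazard: the load in EX writes a register that the
    // instruction in ID reads, so stall one cycle (x0 never hazards)
    Stall = ID_EX_MemRead && (ID_EX_Rd != 5'd0) &&
            ((ID_EX_Rd == IF_ID_Rs1) || (ID_EX_Rd == IF_ID_Rs2));
    // Forwarding: EX/MEM (most recent result) takes priority over MEM/WB;
    // x0 is hard-wired to zero and is never forwarded
    ForwardA = (EX_MEM_RegWrite && (EX_MEM_Rd != 5'd0) && (EX_MEM_Rd == ID_EX_Rs1)) ? 2'b10 :
               (MEM_WB_RegWrite && (MEM_WB_Rd != 5'd0) && (MEM_WB_Rd == ID_EX_Rs1)) ? 2'b01 : 2'b00;
    ForwardB = (EX_MEM_RegWrite && (EX_MEM_Rd != 5'd0) && (EX_MEM_Rd == ID_EX_Rs2)) ? 2'b10 :
               (MEM_WB_RegWrite && (MEM_WB_Rd != 5'd0) && (MEM_WB_Rd == ID_EX_Rs2)) ? 2'b01 : 2'b00;
  end
endmodule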
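When checking forwarding logic like this in simulation, a small software reference model can help. The Python sketch below mirrors the 2-bit select encoding used for ForwardA and ForwardB (0b10 from EX/MEM, 0b01 from MEM/WB, 0b00 from the register file). The function name is hypothetical, and the guard on register 0 is included because x0 is hard-wired to zero in RISC-V and should never be forwarded.

```python
def forward_select(rs: int,
                   ex_mem_rd: int, ex_mem_regwrite: bool,
                   mem_wb_rd: int, mem_wb_regwrite: bool) -> int:
    """Return the forwarding mux select for one ALU source operand."""
    # EX/MEM holds the most recent result, so it takes priority
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return 0b10
    # Otherwise fall back to the older result sitting in MEM/WB
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return 0b01
    # No hazard: use the value read from the register file
    return 0b00


# Both stages write x5: the newer EX/MEM value wins
print(forward_select(5, ex_mem_rd=5, ex_mem_regwrite=True,
                     mem_wb_rd=5, mem_wb_regwrite=True))
```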
View on GitHub — github.com/nanovid
LLM Systems · SystemVerilog

iBERT LLM Inference Pipeline

A hardware-accelerated inference pipeline for BERT-based language models, implemented in SystemVerilog. Designed to execute transformer attention blocks and feed-forward layers with minimal latency.

iBERT leverages integer-only arithmetic throughout the pipeline, eliminating floating-point operations from the critical path. Attention score computation, softmax approximation, and matrix multiplications are all mapped to fixed-point integer datapaths.

  • Integer-only attention and computation blocks (iBERT quantization)
  • Systolic array structure for matrix multiply acceleration
  • Layer normalization approximation in integer arithmetic
  • Parallelised multi-head attention processing
  • End-to-end pipeline functional on a PYNQ FPGA board
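As a rough illustration of the integer-only attention score step, the Python sketch below computes QKᵀ with plain integers and replaces the division by √d_k with an arithmetic right shift, which is exact (up to rounding down) only when d_k is an even power of two, such as 64, where the shift is 3. It also estimates how wide an accumulator a worst-case INT8 dot product needs. The function names are hypothetical; this is a software model under those assumptions, not the project's RTL.

```python
def signed_bits_needed(value: int) -> int:
    """Minimum two's-complement width that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


def int_attention_scores(q, k, d_k):
    """Integer QK^T scaled by 1/sqrt(d_k) using an arithmetic right shift."""
    log2_dk = d_k.bit_length() - 1
    if d_k != 1 << log2_dk or log2_dk % 2:
        raise ValueError("d_k must be an even power of two")
    shift = log2_dk // 2  # sqrt(d_k) == 2**shift
    return [[sum(a * b for a, b in zip(q_row, k_row)) >> shift
             for k_row in k]
            for q_row in q]


# Worst case for INT8 operands: every product is (-128) * (-128)
worst_case = 64 * (-128) * (-128)
print(signed_bits_needed(worst_case))  # 22

q = [[127] * 64, [-128] * 64]
k = [[127] * 64, [1] * 64]
print(int_attention_scores(q, k, 64))  # [[129032, 1016], [-130048, -1024]]
```

With these extreme inputs the scaled scores fall outside the signed 16-bit range, so a hardware datapath needs either a wider score output or saturation.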
Transformer Pipeline Flow
Token Embeddings → Q, K, V → INT Attention Scores (QKᵀ) → Softmax Approx + ×V → FFN → LayerNorm → Output
Integer vs FP Efficiency
Throughput: ↑ INT8
Area: ↓ 4×
Power: ↓ 3×
attention_core.sv SystemVerilog
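// Integer attention score computation: QK^T / sqrt(d_k)
module int_attention_core #(
  parameter D_K = 64,
  parameter SEQ = 128
)(
  input  logic clk, rst_n,
  input  logic signed [7:0] Q [0:SEQ-1][0:D_K-1],
  input  logic signed [7:0] K [0:SEQ-1][0:D_K-1],
  output logic signed [15:0] scores [0:SEQ-1][0:SEQ-1]
);
  // sqrt(D_K) as a shift; exact only when D_K is an even power of two (64 -> 3)
  localparam int SHIFT = $clog2(D_K) / 2;
  // An INT8 x INT8 product needs 16 bits; summing D_K of them adds log2(D_K) bits
  localparam int ACC_W = 16 + $clog2(D_K);

  logic signed [ACC_W-1:0] acc, scaled;
  integer i, j, k;
  always_ff @(posedge clk) begin
    for (i=0; i<SEQ; i++)
      for (j=0; j<SEQ; j++) begin
        acc = 0;
        for (k=0; k<D_K; k++)
          acc += Q[i][k] * K[j][k];  // INT8 dot product
        scaled = acc >>> SHIFT;      // divide by sqrt(D_K) via arithmetic right shift
        // Saturate to the 16-bit output range instead of wrapping
        if      (scaled >  32767) scores[i][j] <= 16'sh7FFF;
        else if (scaled < -32768) scores[i][j] <= 16'sh8000;
        else                      scores[i][j] <= scaled[15:0];
      end
  end
endmodule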
View on GitHub — github.com/nanovid
VLSI · Cadence Virtuoso · DFT

256-Bit Scan Chain

A 256-bit serial scan chain implemented and laid out in Cadence Virtuoso, designed as a Design-for-Test (DFT) structure to enable full controllability and observability of internal flip-flop states.

Each scan flip-flop is a modified D-FF with an additional scan input and a mode-select mux. In scan mode, all 256 FFs form a serial shift register, allowing test vectors to be shifted in and captured results shifted out.

  • 256 scan-enabled D flip-flops in series
  • Designed, simulated, and laid out in Cadence Virtuoso
  • DRC and LVS clean layout with full connectivity verification
  • Mode-select logic: functional mode vs. scan shift mode
  • Supports shift-in, capture, and shift-out test sequences
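The shift-in, capture, and shift-out sequence can be modelled in a few lines of Python. The sketch below is a behavioural model only: the class and method names are hypothetical, the usage example uses an 8-bit chain for readability (the actual chain is 256 bits), and the capture function stands in for whatever combinational logic drives the flip-flops' D inputs.

```python
class ScanChain:
    """Behavioural model of a serial scan chain (hypothetical helper)."""

    def __init__(self, length):
        # ff[0] sits next to scan-in (SI); ff[-1] drives scan-out (SO)
        self.ff = [0] * length

    def shift(self, si):
        """One clock in scan mode: shift by one and return the bit on SO."""
        so = self.ff[-1]
        self.ff = [si] + self.ff[:-1]
        return so

    def capture(self, logic):
        """One clock in functional mode: every FF loads its D input."""
        self.ff = list(logic(self.ff))

    def load(self, pattern):
        """Shift in a pattern so that ff[i] == pattern[i] afterwards."""
        for bit in reversed(pattern):
            self.shift(bit)

    def unload(self):
        """Shift the whole chain out; return it in ff[0]..ff[n-1] order."""
        out = [self.shift(0) for _ in range(len(self.ff))]
        return out[::-1]


chain = ScanChain(8)                         # 8 bits here; the real chain has 256
chain.load([1, 0, 1, 1, 0, 0, 1, 0])         # shift-in
chain.capture(lambda q: [b ^ 1 for b in q])  # stand-in logic: invert every bit
print(chain.unload())                        # shift-out
```

Loading or unloading an N-bit chain takes N scan clocks, so a full load, capture, unload pass on the 256-bit chain costs 256 + 1 + 256 cycles in this model.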
Scan Chain Topology (256 FFs)
SI → FF 0 (bit 0) → FF 1 (bit 1) → FF 2 (bit 2) → ··· → FF 254 (bit 254) → FF 255 (bit 255) → SO
Functional Mode: D input from circuit logic; normal FF operation
Scan Mode: D input from prev. FF; serial shift register
Layout Coverage
DRC Clean: ✓ Pass
LVS Match: ✓ Pass
Test Coverage: 256 / 256
Virtuoso Schematic and Layout
Virtuoso Schematic: schematic and layout of the transmission-gate flip-flop.
65nm Layout: schematic and DRC- and LVS-clean physical layout of the clock tree.
scan_ff.v Verilog
// Scan-enabled D flip-flop — core cell of the 256-bit chain
module scan_dff (
  input  wire clk,       // clock
  input  wire scan_en,  // 1 = scan mode, 0 = functional
  input  wire d,        // functional data input
  input  wire si,       // scan input (from prev FF)
  output reg  q,        // FF output
  output wire so        // scan output (feeds next FF)
);
  wire d_mux = scan_en ? si : d;  // select input based on mode

  always @(posedge clk)
    q <= d_mux;

  assign so = q;  // pass Q to next cell in chain
endmodule

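// Instantiation across 256 cells (fragment of the parent module, which
// is assumed to declare clk, scan_en, d_in, SI, and SO)
wire [256:0] chain;    // chain[i] feeds cell i; chain[256] is the scan-out
wire [255:0] chain_q;  // functional Q outputs
assign chain[0] = SI;
assign SO       = chain[256];
genvar i;
generate
  for (i = 0; i < 256; i = i + 1) begin : gen_scan
    scan_dff ff_inst (.clk(clk), .scan_en(scan_en), .d(d_in[i]),
                      .si(chain[i]), .q(chain_q[i]), .so(chain[i+1]));
  end
endgenerate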
View on GitHub — github.com/nanovid
AI Acceleration · FPGA · SystemVerilog

OpenVLA FPGA Acceleration

Accelerating the OpenVLA vision-language-action model by mapping transformer MLP layers onto an Intel Agilex-5 FPGA using SystemVerilog, Quartus Prime, and Intel's FPGA AI Suite with AI Tensor blocks.

Characterised matrix dimensions, precision requirements, and data reuse patterns of each transformer layer to guide FPGA mapping and accelerator architecture decisions. Targeting 10% lower power and 5% higher throughput versus CPU baselines.

  • Intel Agilex-5 FPGA target — AI Tensor block utilisation
  • Transformer MLP layer mapping: dimension analysis and dataflow optimisation
  • Mixed-precision strategy per layer to balance accuracy and resource usage
  • Quartus Prime + Intel FPGA AI Suite for synthesis and placement
  • Target: −10% power, +5% throughput vs CPU baseline
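The kind of dimension analysis described above can be sketched as simple arithmetic. The Python snippet below uses the D_MODEL = 4096 and D_FF = 11008 values from the mlp_accelerator parameters and assumes INT8 weights. It counts a single D_MODEL × D_FF projection for one token only, and the function name is hypothetical.

```python
def projection_cost(d_model: int, d_ff: int, weight_bits: int = 8):
    """Return (MACs per token, weight storage in bytes) for one projection."""
    macs = d_model * d_ff                   # one multiply-accumulate per weight
    weight_bytes = macs * weight_bits // 8  # each weight stored once
    return macs, weight_bytes


macs, weight_bytes = projection_cost(4096, 11008)
print(macs)                  # 45088768
print(weight_bytes / 2**20)  # 43.0 (MiB)
```

At roughly 43 MiB of INT8 weights for a single projection, this estimate suggests the weights would need to be streamed from off-chip memory rather than held on chip, which is where the data reuse pattern of each layer matters.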
FPGA Mapping Strategy
OpenVLA Transformer Input → Matrix Dim Analysis → Precision Profiling → Intel Agilex-5 AI Tensor Blocks → Quartus Prime P&R → Bitstream
FPGA vs CPU Performance Targets
Throughput: +5% target
Power: −10% target
DSP util: AI Tensors
mlp_accelerator.sv SystemVerilog
// MLP layer accelerator for OpenVLA on Intel Agilex-5
// Maps transformer feed-forward blocks onto AI Tensor blocks
module mlp_accelerator #(
  parameter D_MODEL = 4096,
  parameter D_FF    = 11008,  // LLaMA-style FFN hidden dim
  parameter PREC    = 8       // INT8 weights
)(
  input  logic                   clk, rst_n, valid_in,
  input  logic signed [PREC-1:0] x   [0:D_MODEL-1],
  input  logic signed [PREC-1:0] W1  [0:D_FF-1][0:D_MODEL-1],
  output logic signed [15:0]     out [0:D_FF-1],
  output logic                   valid_out
);
  // One valid flag per unit, so valid_out has a single driver
  logic [D_FF-1:0] dp_valid;
  assign valid_out = dp_valid[0];  // all units share the same latency

  // Systolic tile: maps to Agilex-5 AI Tensor block primitive.
  // dot_product_unit is defined elsewhere (not shown) and is assumed to
  // requantise its wider accumulator down to the 16-bit output.
  genvar i;
  generate
    for (i = 0; i < D_FF; i++) begin : gen_dp
      dot_product_unit #(D_MODEL, PREC) dp (
        clk, valid_in, x, W1[i], out[i], dp_valid[i]
      );
    end
  endgenerate
endmodule
View on GitHub — github.com/nanovid