Vid Sharda, Hardware & AI Engineer

Background

Education & Experience

Education

MEng — Electrical & Computer Engineering

University of Waterloo

2024 – 2026

Graduate program focused on hardware acceleration, digital design, and AI systems. Current project: FPGA-accelerated OpenVLA transformer inference on Intel Agilex-5.

FPGA · AI Acceleration · RTL

BASc — Nanotechnology Engineering

University of Waterloo

2019 – 2024

Honours degree spanning quantum devices, CMOS fabrication, analog & digital design, VLSI, and signal processing. Specialised in hardware-software co-design and electro-optical systems.

Nanotechnology · VLSI · 65nm CMOS · Signal Processing

Experience

Hardware & Optical Systems Engineer

Canadian Space Agency

May 2025 – Aug 2025

Engineered a photon detection and signal processing chain (photodetectors, shutters, lock-in amplifiers, Raspberry Pi digitisation) for satellite link analysis. Integrated orbit propagation data into an atmospheric characterisation system using an Arduino and stepper motors, improving end-to-end performance by 16%.

Photonics · Signal Processing · Raspberry Pi · Arduino · Satellite Systems

Aerospace Image Technologist

Teledyne DALSA

May 2023 – Aug 2023

Performed electro-optical testing (light saturation, dynamic range, noise) on CMOS image sensors for space missions. Led radiation and environmental testing for aerospace compliance. Designed image defect detection scanners and automation tools in LabVIEW, MATLAB, and Python, eliminating 30% of redundant workflows.

CMOS Sensors · LabVIEW · MATLAB · Python · Aerospace Testing

Robotics Process Automation Engineer

Day5Analytics

Jan 2023 – Apr 2023

Worked with clients to identify automation opportunities, then designed and implemented RPA solutions using Python, Neo4j, and KNIME. Delivered automation solutions for 4 major projects in retail, finance, and healthcare, reducing manual processing time by 40% and improving data accuracy.

Python · SQL · KNIME · Process Mining

Integrations Analyst

PointClickCare

Jan 2022 – Apr 2022

Worked with the Integration and Enterprise Team to integrate data between several platforms including Salesforce, NetSuite, KnowBe4, and Boomi. Used Python and SQL to generate a golden dataset for the sales team, improving lead conversion, performance tracking, and client targeting.

Python · SQL · Salesforce · NetSuite · Data Analytics

Power Optimisation & Software Developer

Ford Motor Company

Sep 2021 – Dec 2021

Reduced power consumption and resource utilisation of onboard infotainment hardware by 18% through application call optimisation. Automated HIL testing validation via priority scheduling algorithms. Refactored legacy codebase — resolving recurring bugs and significantly improving long-term maintainability.

Embedded Software · HIL Testing · C++ · Power Optimization

IT Specialist

University of Waterloo

Jan 2021 – Apr 2021

Worked with the IT department to provide technical support and troubleshooting for students and faculty. Managed software installations, network issues, and hardware maintenance. Helped researchers create websites and survey platforms for their projects and improved data collection processes using Python scripts.

Python · JavaScript · MATLAB · SQL · Web Development

Portfolio

Projects

Hardware Architecture · Verilog

5-Stage Pipelined RISC-V CPU

A fully functional 5-stage pipelined processor implementing the RISC-V ISA, written in Verilog. Supports the RV32I base integer instruction set with full pipeline hazard handling.

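Designed for correctness and performance: data hazards are resolved via forwarding paths from the EX and MEM stages, minimizing pipeline stalls. Control hazards are handled by resolving branches in the ID stage (effectively static predict-not-taken) and flushing one instruction when a branch is taken.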

IF → ID → EX → MEM → WB pipeline with full data forwarding
Hazard detection unit with stall and flush control
Register file with dual async read, synchronous write
ALU supporting R-type, I-type, S-type, B-type, U-type, J-type
Simulated and verified in ModelSim, then tested on board

Pipeline Stages

IF → ID → EX → MEM → WB

Forwarding paths: EX→EX and MEM→EX reduce stall cycles from data hazards. Branch resolution at ID stage with 1-cycle flush on taken branches.

CPI Breakdown (approx.)

R-type: 1.0
Loads: 1.5×
Branches: +1 flush
Forwarded: 0 stalls
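
As a rough worked example of how these penalties combine, the Python sketch below estimates average CPI from a base of 1.0, a one-cycle load-use stall, and a one-cycle flush on taken branches. The instruction mix is hypothetical and only there to show the arithmetic; it is not a measurement of this design.

```python
def estimate_cpi(load_frac, load_use_frac, branch_frac, taken_frac,
                 load_use_penalty=1, branch_penalty=1):
    """Average cycles per instruction for an in-order pipeline with base CPI 1.

    load_frac      fraction of instructions that are loads
    load_use_frac  fraction of loads whose result is used by the next instruction
    branch_frac    fraction of instructions that are branches
    taken_frac     fraction of branches that are taken (and therefore flush)
    """
    stall_cycles = load_frac * load_use_frac * load_use_penalty
    flush_cycles = branch_frac * taken_frac * branch_penalty
    return 1.0 + stall_cycles + flush_cycles


# Hypothetical instruction mix, chosen only to illustrate the formula.
cpi = estimate_cpi(load_frac=0.25, load_use_frac=0.40,
                   branch_frac=0.15, taken_frac=0.60)
print(f"estimated CPI = {cpi:.2f}")
```

With this assumed mix the stall term contributes 0.10 and the flush term 0.09, so the estimate comes out at about 1.19.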

hazard_unit.v Verilog

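// Hazard detection + forwarding unit
module hazard_unit (
  input  [4:0] IF_ID_Rs1, IF_ID_Rs2,   // sources of the instruction in ID
  input  [4:0] ID_EX_Rs1, ID_EX_Rs2,   // sources of the instruction in EX
  input  [4:0] ID_EX_Rd,               // destination of the instruction in EX
  input  [4:0] EX_MEM_Rd, MEM_WB_Rd,
  input         EX_MEM_RegWrite, MEM_WB_RegWrite,
  input         ID_EX_MemRead,
  output reg [1:0] ForwardA, ForwardB,
  output reg       Stall
);
  always @(*) begin
    // Load-use hazard: a load in EX feeds the instruction in ID, so stall one cycle
    Stall = ID_EX_MemRead && (ID_EX_Rd != 5'd0) &&
            ((ID_EX_Rd == IF_ID_Rs1) || (ID_EX_Rd == IF_ID_Rs2));
    // Forwarding: EX/MEM holds the most recent result, so it wins over MEM/WB.
    // Writes to x0 are never forwarded because x0 always reads as zero.
    ForwardA = (EX_MEM_RegWrite && EX_MEM_Rd != 5'd0 && EX_MEM_Rd == ID_EX_Rs1) ? 2'b10 :
               (MEM_WB_RegWrite && MEM_WB_Rd != 5'd0 && MEM_WB_Rd == ID_EX_Rs1) ? 2'b01 : 2'b00;
    ForwardB = (EX_MEM_RegWrite && EX_MEM_Rd != 5'd0 && EX_MEM_Rd == ID_EX_Rs2) ? 2'b10 :
               (MEM_WB_RegWrite && MEM_WB_Rd != 5'd0 && MEM_WB_Rd == ID_EX_Rs2) ? 2'b01 : 2'b00;
  end
endmodule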
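
For simulation, a software reference model of the forwarding and stall decisions is handy to cross-check the RTL against. The Python sketch below follows the textbook five-stage formulation: the load sits in EX, its consumer sits in ID, and x0 is never forwarded. The function name and the keyword arguments for the IF/ID sources and the ID/EX destination are assumptions for illustration, not taken from the project's testbench.

```python
def hazard_unit(*, if_id_rs1=0, if_id_rs2=0, id_ex_rs1=0, id_ex_rs2=0,
                id_ex_rd=0, ex_mem_rd=0, mem_wb_rd=0,
                ex_mem_regwrite=False, mem_wb_regwrite=False,
                id_ex_memread=False):
    """Return (forward_a, forward_b, stall) for one cycle.

    Forward encoding: 0b10 = take the EX/MEM result, 0b01 = take the
    MEM/WB result, 0b00 = use the register file value.
    """
    def forward(rs):
        # EX/MEM is checked first because it holds the most recent result.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
            return 0b10
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
            return 0b01
        return 0b00

    # Load-use hazard: the load is in EX and the instruction in ID reads
    # its destination register, so the pipeline has to stall one cycle.
    stall = bool(id_ex_memread and id_ex_rd != 0
                 and id_ex_rd in (if_id_rs1, if_id_rs2))
    return forward(id_ex_rs1), forward(id_ex_rs2), stall


# Example: the instruction in EX reads x5, which the instruction now in
# MEM is about to write, so operand A comes from the EX/MEM register.
print(hazard_unit(id_ex_rs1=5, ex_mem_rd=5, ex_mem_regwrite=True))
```

Driving the same stimulus into the Verilog module and this model, then comparing the three outputs cycle by cycle, is one simple way to catch priority or x0 mistakes early.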

View on GitHub — github.com/nanovid

LLM Systems · SystemVerilog

iBERT LLM Inference Pipeline

A hardware-accelerated inference pipeline for BERT-based language models, implemented in SystemVerilog. Designed to execute transformer attention blocks and feed-forward layers with minimal latency.

iBERT leverages integer-only arithmetic throughout the pipeline, eliminating floating-point operations from the critical path. Attention score computation, softmax approximation, and matrix multiplications are all mapped to fixed-point integer datapaths.

Integer-only attention and computation blocks (iBERT quantization)
Systolic array structure for matrix multiply acceleration
Layer normalization approximation in integer arithmetic
Parallelised multi-head attention processing
End-to-end pipeline functional on Pynq FPGA
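
One way to approximate softmax with integers only is the second-order polynomial scheme published for I-BERT: split the input into a multiple of ln 2 plus a remainder, approximate exp on the remainder with a quadratic, and apply the multiple as a right shift. The Python sketch below is a minimal software model of that idea; the function names, the `scale` of 1/16 and the 8-bit output are assumptions for illustration, and the page does not state whether the project's softmax block uses exactly this scheme.

```python
import math

# Quadratic fit exp(p) ~ A * (p + B)**2 + C, valid for p in (-ln 2, 0].
A, B, C = 0.3585, 1.353, 0.344


def int_exp(q, scale):
    """Approximate exp(q * scale) for integer q <= 0.

    Returns (q_out, scale_out) such that exp(q * scale) ~ q_out * scale_out.
    The three constants below depend only on `scale`, so in hardware they
    would be precomputed; the per-element work is integer-only.
    Assumes scale < ln 2 so that q_ln2 is at least 1.
    """
    q_ln2 = int(math.floor(math.log(2) / scale))   # ln 2 in input units
    q_b = int(math.floor(B / scale))
    q_c = int(math.floor(C / (A * scale * scale)))
    z = (-q) // q_ln2              # how many times the result is halved
    q_p = q + z * q_ln2            # remainder, in (-q_ln2, 0]
    q_out = ((q_p + q_b) ** 2 + q_c) >> z
    return q_out, A * scale * scale


def int_softmax(q_scores, scale, out_bits=8):
    """Integer softmax; outputs are probabilities scaled by 2**out_bits."""
    q_max = max(q_scores)          # subtract the max so every input is <= 0
    exps = [int_exp(q - q_max, scale)[0] for q in q_scores]
    total = sum(exps)
    return [(e << out_bits) // total for e in exps]


# Scores 2.5, 1.5 and 0.5 quantised with scale 1/16.
print(int_softmax([40, 24, 8], scale=1 / 16))
```

Because the output scale cancels in the final division, only the integer numerators and their sum are needed to normalise, which is what makes the approach attractive for a fixed-point datapath.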

Transformer Pipeline Flow

Token Embeddings → INT Attention Scores QKᵀ → Softmax Approx + ×V → FFN → LayerNorm → Output

Integer vs FP Efficiency

Throughput: ↑ INT8
Area: ↓ 4×
Power: ↓ 3×

attention_core.sv SystemVerilog

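// Integer attention score computation: QK^T / sqrt(d_k)
module int_attention_core #(
  parameter D_K = 64,
  parameter SEQ = 128
)(
  input  logic clk, rst_n,  // rst_n is unused here: scores are overwritten every cycle
  input  logic signed [7:0] Q [0:SEQ-1][0:D_K-1],
  input  logic signed [7:0] K [0:SEQ-1][0:D_K-1],
  output logic signed [15:0] scores [0:SEQ-1][0:SEQ-1]
);
  // INT8 x INT8 products need 16 bits; summing D_K of them adds $clog2(D_K) bits
  localparam int ACC_W = 16 + $clog2(D_K);
  // Right shift by log2(sqrt(D_K)); exact when D_K is a power of 4 (3 for D_K = 64)
  localparam int SHIFT = $clog2(D_K) / 2;
  always_ff @(posedge clk) begin
    for (int i = 0; i < SEQ; i++)
      for (int j = 0; j < SEQ; j++) begin
        logic signed [ACC_W-1:0] acc;   // local temporary, wide enough not to overflow
        acc = '0;
        for (int k = 0; k < D_K; k++)
          acc += Q[i][k] * K[j][k];     // INT8 dot product
        acc = acc >>> SHIFT;            // divide by sqrt(D_K)
        // Saturate to the 16-bit output range instead of wrapping
        if      (acc >  32767) scores[i][j] <= 16'sh7FFF;
        else if (acc < -32768) scores[i][j] <= 16'sh8000;
        else                   scores[i][j] <= acc[15:0];
      end
  end
endmodule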
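
A bit-accurate software model is a convenient reference for the integer score computation, and it also makes the accumulator sizing explicit: with 8-bit operands and D_K = 64 the worst-case sum is 2^20, which does not fit in 16 bits even after the shift by 3. The Python sketch below is an illustrative golden model; the function names and the choice to saturate the output are assumptions, not a description of the project's testbench.

```python
def int_attention_scores(Q, K, shift=3, out_bits=16):
    """Integer QK^T followed by an arithmetic right shift and saturation.

    Q and K are lists of rows of signed 8-bit integers. Python integers do
    not overflow, so `acc` plays the role of an ideal wide accumulator.
    """
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    scores = []
    for q_row in Q:
        row = []
        for k_row in K:
            acc = sum(q * k for q, k in zip(q_row, k_row))
            acc >>= shift                      # floors, like >>> on a signed value
            row.append(max(lo, min(hi, acc)))  # clamp to the output range
        scores.append(row)
    return scores


def accumulator_bits(d_k, operand_bits=8):
    """Signed bits needed for the worst-case dot product of d_k terms."""
    worst = d_k * (1 << (operand_bits - 1)) ** 2   # every product is (-128) * (-128)
    return worst.bit_length() + 1                  # plus one for the sign bit


print(accumulator_bits(64))
print(int_attention_scores([[1] * 64], [[2] * 64]))
```

Feeding the all -128 corner case through both the model and the RTL is a quick way to confirm that the hardware accumulator is wide enough and that the output clamps rather than wraps.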

View on GitHub — github.com/nanovid

VLSI · Cadence Virtuoso · DFT

256-Bit Scan Chain

A 256-bit serial scan chain implemented and laid out in Cadence Virtuoso, designed as a Design-for-Test (DFT) structure to enable full controllability and observability of internal flip-flop states.

Each scan flip-flop is a modified D-FF with an additional scan input and a mode-select mux. In scan mode, all 256 FFs form a serial shift register, allowing test vectors to be shifted in and captured results shifted out.

256 scan-enabled D flip-flops in series
Designed, simulated, and laid out in Cadence Virtuoso
DRC and LVS clean layout with full connectivity verification
Mode-select logic: functional mode vs. scan shift mode
Supports shift-in, capture, and shift-out test sequences
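
The shift-in, capture, shift-out sequence can be illustrated with a small behavioural model. The Python sketch below treats the chain as a list of flip-flop values; the class name, the method names and the inverting "circuit logic" used in the example are hypothetical and exist only to show the ordering of bits through the chain.

```python
class ScanChain:
    """Behavioural model of a serial scan chain of D flip-flops."""

    def __init__(self, length=256):
        self.ffs = [0] * length          # ffs[0] is next to SI, ffs[-1] drives SO

    def clock(self, scan_en, si=0, d=None):
        """Apply one rising clock edge and return SO as seen before the edge."""
        so = self.ffs[-1]
        if scan_en:
            # Scan mode: every FF takes the value of its predecessor.
            self.ffs = [si] + self.ffs[:-1]
        else:
            # Functional mode: every FF captures its own D input.
            if d is None or len(d) != len(self.ffs):
                raise ValueError("functional mode needs one D value per FF")
            self.ffs = list(d)
        return so

    def shift_in(self, bits):
        return [self.clock(1, si=b) for b in bits]

    def shift_out(self):
        return [self.clock(1, si=0) for _ in range(len(self.ffs))]


chain = ScanChain(8)
pattern = [1, 0, 1, 1, 0, 0, 1, 0]
chain.shift_in(pattern)                           # load the test vector
chain.clock(0, d=[b ^ 1 for b in chain.ffs])      # capture (hypothetical inverting logic)
print(chain.shift_out())                          # unload the response
```

The first bit shifted in ends up in the last flip-flop, so the response comes out in the same order the vector went in. That ordering is the detail most worth checking against the layout's SI and SO pins.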

Scan Chain Topology (256 FFs)

SI → FF0 (bit 0) → FF1 (bit 1) → FF2 (bit 2) → ··· → FF254 (bit 254) → FF255 (bit 255) → SO

Functional Mode

D input from circuit logic
Normal FF operation

Scan Mode

D input from prev. FF
Serial shift register

Layout Coverage

DRC Clean: ✓ Pass
LVS Match: ✓ Pass
Test Coverage: 256 / 256

Virtuoso Schematic and Layout

Schematic and layout of the Transmission Gate flip-flop.

65nm Schematic and Layout View

Schematic and DRC- and LVS-clean physical layout of the clock tree.

scan_ff.v Verilog

// Scan-enabled D flip-flop — core cell of the 256-bit chain
module scan_dff (
  input  wire clk,       // clock
  input  wire scan_en,  // 1 = scan mode, 0 = functional
  input  wire d,        // functional data input
  input  wire si,       // scan input (from prev FF)
  output reg  q,        // FF output
  output wire so        // scan output (feeds next FF)
);
  wire d_mux = scan_en ? si : d;  // select input based on mode

  always @(posedge clk)
    q <= d_mux;

  assign so = q;  // pass Q to next cell in chain
endmodule

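// Instantiation across 256 cells (fragment from the enclosing chain module)
wire [256:0] chain;          // chain[i] is the scan input of cell i
assign chain[0] = SI;        // chain entry
genvar i;
generate
  for (i = 0; i < 256; i = i + 1) begin : g_scan
    scan_dff ff_inst (
      .clk(clk), .scan_en(scan_en), .d(d_in[i]),
      .si(chain[i]), .q(chain_q[i]), .so(chain[i+1])
    );
  end
endgenerate
assign SO = chain[256];      // chain exit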

View on GitHub — github.com/nanovid

AI Acceleration · FPGA · SystemVerilog

OpenVLA FPGA Acceleration

Accelerating the open-source OpenVLA vision-language-action model, which is built on a Llama 2 language backbone, by mapping transformer MLP layers onto an Intel Agilex-5 FPGA using SystemVerilog, Quartus Prime, and Intel's FPGA AI Suite with AI Tensor blocks.

Characterised matrix dimensions, precision requirements, and data reuse patterns of each transformer layer to guide FPGA mapping and accelerator architecture decisions. Targeting 10% lower power and 5% higher throughput versus CPU baselines.

Intel Agilex-5 FPGA target — AI Tensor block utilisation
Transformer MLP layer mapping: dimension analysis and dataflow optimisation
Mixed-precision strategy per layer to balance accuracy and resource usage
Quartus Prime + Intel FPGA AI Suite for synthesis and placement
Target: −10% power, +5% throughput vs CPU baseline
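
The dimension analysis step can be made concrete with a few lines of arithmetic. The Python sketch below counts multiply-accumulates and weight storage for a single 4096 × 11008 projection at INT8 and shows how weight reuse grows with the number of tokens processed per weight fetch. The helper name and the token counts are assumptions for illustration; the numbers are derived from the layer dimensions only, not measured on hardware.

```python
def projection_stats(d_in, d_out, tokens=1, weight_bits=8):
    """MAC count, weight storage and weight reuse for one dense projection."""
    weights = d_in * d_out
    macs = tokens * weights                    # one MAC per weight per token
    weight_bytes = weights * weight_bits // 8
    return {
        "macs": macs,
        "weight_bytes": weight_bytes,
        "weight_mib": weight_bytes / 2**20,
        "macs_per_weight_byte": macs / weight_bytes,
    }


single = projection_stats(4096, 11008)             # one token at a time
batched = projection_stats(4096, 11008, tokens=16)
print(single)
print(batched["macs_per_weight_byte"])
```

For one token the projection performs about 45 million MACs over 43 MiB of INT8 weights, which is one MAC per weight byte. That ratio is why on-chip weight reuse, or processing several tokens per weight fetch, tends to dominate the mapping decisions for layers of this shape.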

FPGA Mapping Strategy

OpenVLA Transformer Input → Matrix Dim Analysis + Precision Profiling → Intel Agilex-5 AI Tensor Blocks → Quartus Prime P&R → Bitstream

FPGA vs CPU Performance Targets

Throughput: +5% target
Power: −10% target
DSP util: AI Tensors

mlp_accelerator.sv SystemVerilog

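// MLP layer accelerator for OpenVLA on Intel Agilex-5
// Maps transformer feed-forward blocks onto AI Tensor blocks
module mlp_accelerator #(
  parameter D_MODEL = 4096,
  parameter D_FF    = 11008,  // LLaMA-style FFN hidden dim
  parameter PREC    = 8       // INT8 weights
)(
  input  logic                   clk, rst_n, valid_in,
  input  logic signed [PREC-1:0] x   [0:D_MODEL-1],
  input  logic signed [PREC-1:0] W1  [0:D_FF-1][0:D_MODEL-1],
  output logic signed [15:0]     out [0:D_FF-1],
  output logic                   valid_out
);
  // One dot-product unit per output element; each is intended to map to
  // Agilex-5 AI Tensor block primitives. dot_product_unit is defined elsewhere
  // and is assumed to rescale its wide accumulator down to 16 bits.
  logic [D_FF-1:0] unit_valid;  // per-unit valid, avoids multiple drivers on valid_out
  genvar i;
  generate
    for (i = 0; i < D_FF; i++) begin : g_dp
      dot_product_unit #(D_MODEL, PREC) dp (
        clk, valid_in, x, W1[i], out[i], unit_valid[i]
      );
    end
  endgenerate
  assign valid_out = &unit_valid;  // all units are identical, so they finish together
endmodule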

View on GitHub — github.com/nanovid
