Operating Systems Parallel Programming Assignment
Objective
This assignment aims to deepen your understanding of CUDA programming by having you explore CUDA's architecture and theoretical performance benefits without needing GPU access. You will select a real-world computational problem, propose a CUDA-based solution, analyze its theoretical performance, and reflect on your findings. The goal is to synthesize the knowledge you have gained about parallel programming frameworks and apply it to GPU programming concepts.
Assignment Overview
You will:
Select a computational problem suitable for CUDA parallelization.
Research CUDA-specific techniques for solving the problem and justify your approach.
Design a CUDA kernel for the problem, focusing on thread and block organization as well as memory optimization strategies (a minimal indexing sketch follows this list).
Theoretically evaluate the kernel's performance, including execution time, scalability, and bottlenecks.
Reflect on your work, challenges faced, and lessons learned.
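
As a starting point for the kernel-design step above, here is a minimal sketch of the indexing pattern most CUDA kernels share. The kernel name scale_kernel, the element-wise scaling operation, and the block size of 256 are illustrative assumptions, not part of the assignment; adapt them to whatever problem you choose.

#include <cuda_runtime.h>

// Minimal sketch: one thread per output element of a simple element-wise operation.
// The global index is built from the block index, the block size, and the thread index.
__global__ void scale_kernel(const float *in, float *out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end of the data
        out[i] = alpha * in[i];
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
void launch_scale(const float *d_in, float *d_out, float alpha, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, alpha, n);
}
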
Deliverables
Your submission will consist of a detailed report with the following structured sections (titles required):
Problem Selection and Justification (20%)
What to Include:
Select a computational problem that benefits from parallelism (e.g., image convolution, matrix multiplication, scientific simulation).
Justify your selection by explaining:
Why the problem is parallelizable.
Why CUDA is a suitable framework for solving it.
Provide at least two references (see acceptable types below) supporting your problem choice and its relevance to CUDA.
Tips for Depth:
Discuss specific aspects of the problem that align with GPU parallelism, such as repetitive computations or large datasets.
Compare the potential benefits of CUDA with other frameworks (e.g., MPI, OpenMP) for the selected problem.
Kernel Design and Memory Optimization (30%)
What to Include:
Provide detailed pseudocode for your CUDA kernel.
Clearly annotate how threads and blocks are indexed.
Explain how the kernel distributes work across threads and blocks.
Propose at least two memory optimization strategies (e.g., using shared memory, minimizing global memory accesses). Justify your strategies with references to CUDA documentation or technical resources; an illustrative shared-memory sketch appears at the end of this section.
Tips for Depth:
Highlight how the kernel design maximizes GPU utilization (e.g., balancing threads, minimizing memory contention).
Discuss how the memory hierarchy (global, shared, constant) influences your design choices.
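
To make the shared-memory discussion concrete, the sketch below shows a tiled matrix-multiply kernel, one of the example problems listed in the previous section. The tile width of 16 and the assumption of square, row-major N x N matrices are illustrative choices; if you select a different problem, the same pattern (stage a tile in shared memory, synchronize, compute, synchronize) still applies.

#include <cuda_runtime.h>

#define TILE 16  // tile width; one thread per output element of the tile

// Sketch of a tiled matrix-multiply kernel (C = A * B, all N x N, row-major).
// Each block stages a TILE x TILE tile of A and B in shared memory, so each
// element of global memory is read roughly N/TILE times instead of N times.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row handled by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // output column handled by this thread
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B (zero-pad the edges).
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                          // tile fully loaded before it is used

        for (int k = 0; k < TILE; ++k)            // partial dot product from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                          // finish with this tile before reloading
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
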
Theoretical Performance Analysis (30%)
What to Include:
Estimate the execution time of your kernel on a hypothetical GPU (e.g., assume a GPU with 2048 cores and 256 KB of shared memory).
Calculate metrics such as throughput (operations per second) or speedup relative to a serial CPU implementation; a worked estimate follows this list.
Identify potential bottlenecks (e.g., warp divergence, memory bandwidth).
Analyze the scalability of your kernel for larger datasets or increased computational complexity.
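
As a worked example of the kind of estimate asked for above, the small host-side program below models the hypothetical 2048-core GPU from this section. The clock speed, operations per core per cycle, achieved-efficiency fraction, problem size, and CPU rate are all assumptions chosen only to illustrate the arithmetic; substitute and justify your own figures in the report.

#include <cstdio>

// Back-of-the-envelope estimate for the hypothetical GPU described in this section.
// Every constant below is an assumption; replace it with the figure you justify in your report.
int main()
{
    const double cores = 2048;          // core count taken from the assignment's hypothetical GPU
    const double clock_ghz = 1.5;       // assumed clock speed
    const double ops_per_cycle = 2;     // assumed fused multiply-add = 2 FLOPs per core per cycle
    const double efficiency = 0.30;     // assumed fraction of peak throughput actually achieved

    const double n = 4096;              // example problem size (N x N matrix multiply)
    const double work = 2.0 * n * n * n;                           // roughly 2*N^3 floating-point operations

    const double peak = cores * clock_ghz * 1e9 * ops_per_cycle;   // peak FLOP/s
    const double gpu_seconds = work / (peak * efficiency);

    const double cpu_gflops = 10.0;     // assumed sustained rate of a serial CPU implementation
    const double cpu_seconds = work / (cpu_gflops * 1e9);

    printf("Estimated GPU time: %.4f s\n", gpu_seconds);
    printf("Estimated CPU time: %.2f s\n", cpu_seconds);
    printf("Estimated speedup : %.0fx\n", cpu_seconds / gpu_seconds);
    return 0;
}
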
Tips for Depth:
Use references to support your performance assumptions (e.g., published benchmarks for similar tasks).
Include hypothetical scenarios to illustrate how increasing thread or block counts impacts performance.
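
One way to frame those hypothetical scenarios is as a launch-configuration sweep. The sketch below times the same toy kernel under a range of block sizes; the kernel, problem size, and block-size range are illustrative assumptions. The assignment does not require GPU access, so treat this as the structure of the experiment you are reasoning about rather than something you must run.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel used only to illustrate a launch-configuration sweep.
__global__ void saxpy(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;                        // example problem size (assumed)
    float *x, *y;                                 // contents left uninitialized; only timing matters here
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep the block size; the grid size is derived so every element is covered.
    for (int threads = 64; threads <= 1024; threads *= 2) {
        int blocks = (n + threads - 1) / threads;
        cudaEventRecord(start);
        saxpy<<<blocks, threads>>>(x, y, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("threads/block = %4d, blocks = %7d, time = %.3f ms\n", threads, blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
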
Reflection and Lessons Learned (20%)
What to Include:
Reflect on the challenges you faced while designing the kernel or analyzing performance.
Discuss any trade-offs you made in kernel design or memory usage.
Compare your experience with insights from at least one external reference that addresses similar challenges.