42/100 of GPU Grind: ran into some code with CUDA graphs today, and as i wasn't familiar enough with them i looked for some resources and found a lecture given for an Oak Ridge training series by NVIDIA employees. straight to the point, it clearly explains why you would use CUDA graphs (to reduce CPU launch overhead) and how to create one, either with stream capture or by describing it manually (or mixing both!)
@MainzOnX which use cases did people give you? i've heard of people using fp128 but never fp256
41/100 of GPU Grind: yesterday i watched GPU MODE's lecture on PTX/SASS from the Gestwell founders; some insights on PTX and SASS behaviors, and how to read them. i think one of the speakers even said he looks more at the generated PTX than at the profiler when writing a kernel. i found that surprising at first, but i guess it makes sense once you have enough expertise with PTX: you directly understand how things are going to behave from the PTX. they presented a tool they created to analyze generated PTX and compare it to what you would expect for a given algorithm on a given compute capability; it flags unexpected behaviors so you can review them manually. it didn't make much sense to me on their first example, since i failed to see how you couldn't have drawn the same conclusion from looking at the C++ source code, but for analyzing a lot of compiled kernels at once in larger libraries it looked super cool! they ran it on cuBLAS (3553 kernels) and got 41k signals organized by priority etc. i guess such a tool would be useful for cuBLAS developers, for example (if the signals are actually interesting)
40/100 of GPU Grind: worked a little on a routines library i'm making. after implementing the benchmark part and profiling the application with nsight systems, i realized most of the application's run time was spent in the vector initializations (i reset them to random numbers between runs). this also let me discover nvtx, which i used to instrument my code and find everything in the profiler. i don't time the initialization part, so it doesn't matter for the result itself, but it's still time i spend waiting in front of the screen, and modal credits i'm burning for no reason. so i replaced my manual method with calls to cuRAND, and it cut the total execution time of the benchmark by something like 20x!! it's much better: now the main "bottleneck" on the problem sizes i can benchmark is the kernel itself or the available memory (i still need to allocate the vectors on the cpu for result checking etc)
39/100 of GPU Grind: continuing the cs149 parallel programming course, watching lecture 2! it's about multi-core processors, SIMD concepts and examples with AVX intrinsics, the cache hierarchy etc. it's quite interesting to see how there are bridges to how a gpu works everywhere in those explanations, for example SIMD divergence issues / warp divergence issues
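to make the divergence point concrete, here is a minimal sketch in plain C++ (no intrinsics, so it builds anywhere). everything in it is made up for illustration: the 8-lane `Vec` type stands in for an AVX register or a slice of a warp, and `passes` is just a counter, not something the hardware exposes. the idea it tries to show: lanes that run in lockstep cannot each take their own side of a branch, so both sides get evaluated and a mask selects the result per lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 8-lane vector type (assumption for illustration only).
constexpr std::size_t kLanes = 8;
using Vec = std::array<float, kLanes>;

// Scalar reference: each element takes exactly one side of the branch.
inline float scalar_op(float x) {
    return (x > 0.0f) ? x * 2.0f : -x;
}

// Lockstep emulation: compute both sides for every lane, then blend by mask.
// If `passes` is non-null it receives 1 when all lanes agree on the branch
// and 2 when they diverge (roughly the number of serialized branch passes).
inline Vec simd_style_op(const Vec& x, int* passes = nullptr) {
    std::array<bool, kLanes> mask{};
    bool any_true = false;
    bool any_false = false;
    for (std::size_t i = 0; i < kLanes; ++i) {
        mask[i] = x[i] > 0.0f;
        any_true = any_true || mask[i];
        any_false = any_false || !mask[i];
    }

    Vec then_side{};
    Vec else_side{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        then_side[i] = x[i] * 2.0f;  // work done even for lanes that discard it
        else_side[i] = -x[i];
    }

    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = mask[i] ? then_side[i] : else_side[i];  // per-lane select
    }

    if (passes != nullptr) {
        *passes = (any_true ? 1 : 0) + (any_false ? 1 : 0);
    }
    return out;
}

// Sanity check: the lockstep version must agree with the scalar one lane by lane.
inline void check_matches_scalar(const Vec& x) {
    const Vec y = simd_style_op(x);
    for (std::size_t i = 0; i < kLanes; ++i) {
        assert(y[i] == scalar_op(x[i]));
    }
}
```

with an input that alternates signs, `passes` comes back as 2 (both sides are needed), while an all-positive input gives 1. real AVX code would typically use a compare plus a blend instruction, and a GPU handles this per warp in hardware, so take this only as a model of the cost, not of the actual mechanism.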
38/100 of GPU Grind: i've seen screenshots of it on the tl multiple times and it got my interest, so i'm starting cs149, stanford's course on parallel computing; the lectures are available on youtube. i expect to already know most of the concepts since i did a semester of high performance computing a few months ago, but i think it'll still be interesting to have some refreshers when i'm tired, and also to see if stanford professors approach the topic differently. the first lecture was really an introduction: making students perform parallel computing tasks (like counting the number of students in class) with different methods, introducing moore's law etc. quite funny how most of the figures were the exact same ones we saw in class, i guess every professor uses those
@maharshii what do you do to prevent power throttling during the benchmark?
@shpostx ah, in his thread he talks about PTX, but yes, having it write SASS directly is bold
@tekbog you underestimate how important nicotine is in France
@shpostx meaning? it's pretty common in CUDA to write inline PTX, and some people even write entire routines in PTX
37/100 of GPU Grind: i spent a lot of time setting up my development environment for CUB, and while my tests were building (multiple hours actually 🫠) i watched the GPU MODE lecture on consumer gpus from Jake Cannel (vast ai), which was actually a lot of background on pre-CUDA gpu programming in the graphics ecosystem of the early 2000s and how it evolved. to stick with the topic of consumer gpu performance: the benchmarks in the talk show that you get more flops per dollar with a 4090 than with an H100. that might be common knowledge but i was actually quite surprised. however, the H100 obviously gets more flops overall, and also has more memory
building CUB takes so long, i need to find a lecture to watch in the meantime
i asked gpt to rewrite a matmul kernel in cute
35-36/100 of GPU Grind: trying to find a project in which i can apply my cuda skills to actually do something useful, instead of just writing kernels for the sake of learning (which i love to do, but i think it'd be cool to do both!). familiarized myself with the CUB repository. i like the idea: a header-only library providing building blocks for high performance kernels. the CCCL repo (which contains thrust, cub, libcu++) is very active, but there seem to be things i could contribute to. it's not easy to understand how things work in the project just by myself though; there are a lot of concepts (13k commits, so obviously a lot has been done on the project and there's a lot of depth), but i'm going to stick with this!
33-34/100 of GPU Grind: been super busy lately but back to kernels! these last few days i finished writing a blog post about gpu sorting algorithms, especially the Onesweep algorithm, explaining how it works as visually as i could. i'm still checking it and i'll publish it soon
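for context, Onesweep is a least-significant-digit (LSD) radix sort for GPUs. the sketch below is not Onesweep and not what the blog post contains; it is only a sequential CPU version of the histogram / prefix-sum / scatter structure that LSD radix sorts build on, written as a hedged illustration. the function name `lsd_radix_sort` and the choice of 8 bits per pass are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential LSD radix sort on 32-bit keys, 8 bits per pass (4 passes total).
// Illustrative only: a GPU implementation parallelizes each of the three steps.
inline void lsd_radix_sort(std::vector<std::uint32_t>& keys) {
    constexpr int kRadixBits = 8;
    constexpr std::size_t kBuckets = std::size_t{1} << kRadixBits;
    std::vector<std::uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kRadixBits) {
        // 1. Histogram: count how many keys fall into each digit bucket.
        std::array<std::size_t, kBuckets> offset{};
        for (std::uint32_t k : keys) {
            ++offset[(k >> shift) & (kBuckets - 1)];
        }

        // 2. Exclusive prefix sum: turn counts into starting offsets.
        std::size_t running = 0;
        for (std::size_t b = 0; b < kBuckets; ++b) {
            const std::size_t count = offset[b];
            offset[b] = running;
            running += count;
        }
        assert(running == keys.size());  // every key landed in some bucket

        // 3. Stable scatter: keys with equal digits keep their relative order,
        //    which is what lets later passes refine earlier ones.
        for (std::uint32_t k : keys) {
            scratch[offset[(k >> shift) & (kBuckets - 1)]++] = k;
        }
        keys.swap(scratch);
    }
}
```

calling `lsd_radix_sort(v)` on a `std::vector<std::uint32_t>` sorts it in ascending order. each pass here reads the keys twice (once for the histogram, once for the scatter); reducing that kind of memory traffic is the sort of thing GPU radix sorts spend most of their design effort on.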
Idk how my matrix multiplication skills will fare in a nuclear apocalypse