thom✨ @gpuwaster

highly performative computing thom.gg 🇫🇷/🇨🇭 Joined December 2017

Tweets 1K · Followers 183 · Following 304 · Likes 570

thom✨ @gpuwaster

11 hours ago

42/100 of GPU Grind ran into some code using cuda graphs today, and since i wasn't familiar enough with them i looked for resources and found a lecture from an oak ridge training series, given by nvidia employees. straight to the point, it clearly explains why you'd use cuda graphs (to reduce cpu launch overhead) and how to create one, either with stream capture or by describing the graph manually (or mixing both!)
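a minimal sketch of the stream capture route mentioned above, not taken from the lecture itself. the kernel `add_one` and the helper `run_graph_demo` are hypothetical names made up for illustration, and error checking is left out to keep it short. the idea: launches issued between begin/end capture are recorded into a graph instead of running, and the instantiated graph can then be replayed with a single launch call per replay.

```cuda
#include <cuda_runtime.h>

// Hypothetical tiny kernel: adds 1.0f to every element.
__global__ void add_one(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Captures `kernels_per_graph` launches into one graph, replays the graph
// `replays` times, and returns element 0 (the buffer starts at zero).
float run_graph_demo(int n, int kernels_per_graph, int replays) {
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stream capture: work submitted to `stream` between Begin and End
    // is recorded as graph nodes, not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < kernels_per_graph; ++k) {
        add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    }
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay: each replay is a single launch call
    // on the CPU side, whatever the number of kernels inside the graph.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    for (int r = 0; r < replays; ++r) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    float first = 0.0f;
    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return first;
}
```

calling `run_graph_demo(1024, 10, 5)` runs 50 increments in total while the host only issues 5 graph launches. the manual route would build the same graph node by node (for example with `cudaGraphAddKernelNode`) instead of capturing it.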

0 0 0 31 0

thom✨ @gpuwaster

3 days ago

@MainzOnX which use cases did people give you? i've heard of people using fp128 but never fp256

1 0 0 311 0


thom✨ @gpuwaster

3 days ago

the duality of a man (getting carried)

0 0 0 11 0


thom✨ @gpuwaster

3 days ago

41/100 of GPU Grind yesterday i watched GPU mode's lecture on PTX/SASS from the Gestwell founders: some insights on ptx and sass behaviors, and how to read them. i think one of the speakers even said he looks more at the generated ptx than at the profiler when writing a kernel. i found that surprising at first, but i guess with enough ptx expertise it makes sense: you can tell directly from the ptx how things are going to run. they presented a tool they built to analyze generated PTX and compare it to what you would expect for a given algorithm on a given compute capability; it flags unexpected behaviors and you can then review them manually. it didn't make much sense to me on their first example (i failed to see why you couldn't draw the same conclusion from the cpp source code), but for larger libraries, where you analyze a lot of compiled kernels at once, it looked super cool! they ran it on cuBLAS (3553 kernels) and got 41k signals organized by priority etc. i guess such a tool would be useful for cuBLAS developers, for example (if the signals are actually interesting)

0 0 1 88 0

Paul Kuruvilla @RohitPaulK

4 days ago

@BullTheoryio

27 464 19K 572K 616


thom✨ @gpuwaster

4 days ago

40/100 of GPU Grind worked a little on a routines library i'm making. after implementing the benchmark part and profiling the application with nsight systems, i realized most of the run time was spent in the vector initializations (i reset them to random numbers between runs). this also led me to discover nvtx, to instrument my code and find everything in the profiler. i don't time this initialization part, so it doesn't affect the results themselves, but it's still time i spend waiting in front of the screen, and modal credits i'm burning for no reason. so i replaced my manual method with calls to cuRAND, and it cut the total execution time of the benchmark by about 20x!! much better: now the main "bottleneck" on the problem sizes i can benchmark is the kernel itself or the available memory (i still need to allocate the vectors on cpu for result checking etc)
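a rough sketch of what that kind of initialization can look like, assuming the cuRAND host API and the header-only NVTX v3. this is not the actual library code: `fill_random`, `demo_fill_in_range` and the range name `"init_random"` are hypothetical, and the original "manual method" isn't shown in the post so nothing here compares against it.

```cuda
#include <cuda_runtime.h>
#include <curand.h>
#include <nvtx3/nvToolsExt.h>
#include <vector>

// Fills a device buffer with uniform floats using the cuRAND host API.
// The NVTX push/pop pair makes the call show up as a named range in the
// Nsight Systems timeline. Returns true if every call succeeded.
bool fill_random(float* d_data, size_t n, unsigned long long seed) {
    nvtxRangePushA("init_random");  // arbitrary label, assumed for this sketch

    curandGenerator_t gen = nullptr;
    bool ok = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT) == CURAND_STATUS_SUCCESS;
    ok = ok && curandSetPseudoRandomGeneratorSeed(gen, seed) == CURAND_STATUS_SUCCESS;
    // Values are generated directly in device memory, so no host-side
    // generation or host-to-device copy is needed.
    ok = ok && curandGenerateUniform(gen, d_data, n) == CURAND_STATUS_SUCCESS;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    if (gen) curandDestroyGenerator(gen);

    nvtxRangePop();
    return ok;
}

// Small host-side check: fill n floats and verify they all lie in (0, 1],
// which is the range documented for curandGenerateUniform.
bool demo_fill_in_range(size_t n) {
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return false;
    bool ok = fill_random(d_data, n, 1234ULL);

    std::vector<float> h(n, -1.0f);
    ok = ok && cudaMemcpy(h.data(), d_data, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_data);

    for (float v : h) {
        if (!(v > 0.0f && v <= 1.0f)) return false;
    }
    return ok;
}
```

this needs to be linked against cuRAND (`-lcurand` with nvcc); depending on the platform, NVTX v3 may also need `-ldl`. in a benchmark loop you would create the generator once and reuse it between runs rather than recreating it on every call as this sketch does.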

thom✨ @gpuwaster

a week ago

39/100 of GPU Grind continuing the cs149 parallel programming course, watching lecture 2! it's about multi-core processors, SIMD concepts and examples with avx intrinsics, cache hierarchies etc. it's quite interesting to see how there are bridges with how a gpu works everywhere in

0 0 1 109 0

0 0 0 97 0


thom✨ @gpuwaster

4 days ago

paris boutta blow up

0 0 0 53 0


thom✨ @gpuwaster

5 days ago

@shreyansj yeah i think they're done starting up

0 0 0 57 0



thom✨ @gpuwaster

a week ago

38/100 of GPU Grind i've seen screenshots of it on the tl multiple times and it caught my interest, so i'm starting stanford's cs149 course on parallel computing; the lectures are available on youtube. i expect to already know most of the concepts since i did a semester of high performance computing a few months ago, but i think it'll still be interesting to have some refreshers when i'm tired, and to see whether stanford professors approach the topic differently. the first lecture was really an introduction: making students perform parallel computing tasks (like counting the number of students in class) with different methods, introducing moore's law etc. quite funny how most of the figures were the exact same ones we saw in class, i guess every professor uses those

0 0 3 387 1

thom✨ @gpuwaster

2 weeks ago

@maharshii what do you do to prevent power throttling during the benchmark?

0 0 0 141 0


thom✨ @gpuwaster

2 weeks ago

@shpostx ah, in his thread he talks about ptx, but yeah, having it write SASS directly is bold

0 0 0 88 0


thom✨ @gpuwaster

2 weeks ago

@tekbog you underestimate how important nicotine is in France

0 0 1 99 0


thom✨ @gpuwaster

2 weeks ago

@shpostx meaning? it's pretty common in CUDA to write inline ptx, and some people even write entire routines in ptx
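for readers who haven't seen it, a small sketch of what inline ptx looks like inside a CUDA kernel. the example reads the `%laneid` special register with an `asm` statement; the function names (`lane_id`, `write_lanes`, `lanes_match`) are made up for illustration, and the host-side check assumes a warp size of 32 and a one-dimensional block.

```cuda
#include <cuda_runtime.h>

// Reads the %laneid special register through inline PTX.
// "=r" binds the output operand %0 to a 32-bit register; the doubled
// percent sign escapes the register name inside the asm string.
__device__ unsigned lane_id() {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Each thread stores its lane index at its thread index.
__global__ void write_lanes(unsigned* out) {
    out[threadIdx.x] = lane_id();
}

// Launches one block of 64 threads and checks on the host that the lane
// read via PTX equals threadIdx.x % 32 (assumes warp size 32).
bool lanes_match() {
    const int n = 64;
    unsigned* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned)) != cudaSuccess) return false;

    write_lanes<<<1, n>>>(d_out);

    unsigned h_out[n] = {};
    bool ok = cudaMemcpy(h_out, d_out, sizeof(h_out),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != static_cast<unsigned>(i % 32)) return false;
    }
    return ok;
}
```

the same mechanism is what people use to reach instructions that have no direct CUDA C++ equivalent, which is where inline ptx tends to show up in practice.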

1 0 0 274 0


thom✨ @gpuwaster

2 weeks ago

37/100 of GPU Grind i spent a lot of time setting up my development environment for CUB, and while my tests were building (multiple hours actually 🫠) i watched the gpu mode lecture on consumer gpus from Jake Cannell (vast ai), which was actually a lot of background on pre-cuda gpu programming in the early-2000s graphics ecosystem, and its evolution. to stick with the topic of consumer gpu performance: the benchmarks in the talk show that you get more flops per dollar with a 4090 than with an H100. that might be common knowledge but i was actually quite surprised. however, the h100 obviously delivers more flops overall, and also has more memory

0 0 1 421 0

thom✨ @gpuwaster

2 weeks ago

@mil000 they just showed a bookshelf

0 0 0 28 0


thom✨ @gpuwaster

2 weeks ago

building cub takes so long, i need to find a lecture to watch in the meantime

0 0 0 57 0

Anne Ouyang @anneouyang

8 months ago

i asked gpt to rewrite a matmul kernel in cute

50 88 2K 128K 286


thom✨ @gpuwaster

2 weeks ago

35-36/100 of GPU Grind trying to find a project where i can apply my cuda skills to actually do something useful, instead of just writing kernels for the sake of learning (which i love to do, but i think it'd be cool to do both!). i familiarized myself with the CUB repository and i like the idea: a header-only library providing building blocks for high performance kernels. the CCCL repo (which contains thrust, cub, libcu++) is very active, but there seem to be things i could contribute to. it's not easy to understand how things work in the project just by myself, though: there are a lot of concepts (13k commits, so obviously a lot has been done and there's a lot of depth), but i'm going to stick with this!
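to give an idea of what "building blocks" means here, a minimal sketch using `cub::BlockReduce`, one of CUB's block-level primitives. the kernel `block_sum` and the helper `block_sums_are_correct` are hypothetical names for this example, and error checking is mostly skipped.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Each block sums BLOCK_THREADS ints with cub::BlockReduce and writes
// one result per block. The shared-memory scratch space the primitive
// needs is declared through its TempStorage type.
template <int BLOCK_THREADS>
__global__ void block_sum(const int* in, int* out) {
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int value = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    int sum = BlockReduce(temp_storage).Sum(value);  // result valid in thread 0 only
    if (threadIdx.x == 0) out[blockIdx.x] = sum;
}

// Feeds an array of ones through the kernel; every block should then
// report a sum equal to its thread count.
bool block_sums_are_correct() {
    constexpr int kThreads = 128;
    constexpr int kBlocks = 4;
    std::vector<int> h_in(kThreads * kBlocks, 1);
    std::vector<int> h_out(kBlocks, 0);

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, h_out.size() * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<kThreads><<<kBlocks, kThreads>>>(d_in, d_out);

    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int s : h_out) {
        if (s != kThreads) return false;
    }
    return true;
}
```

since CUB is header-only and ships with the CUDA toolkit, the include is all that's needed; there is nothing extra to link.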

thom✨ @gpuwaster

2 weeks ago

33-34/100 of GPU Grind been super busy lately but back to kernels! these last few days i finished writing a blog post about gpu sorting algorithms, especially the Onesweep algorithm, explaining how it works as visually as i could. i'm still checking it and i'll publish it soon