Added padding for 8-byte objects alignment in memory #1

Open
kraglik wants to merge 1 commit into RSpliet:master from kraglik:master

kraglik commented Jan 27, 2021

I had a very long night trying to figure out why my (dead simple and therefore definitely correct) code was failing on my Nvidia GPU with 4 gigs of RAM.

RSpliet (Owner) commented Feb 14, 2021

Thank you for investing a long night in debugging this issue, and my apologies for leaving your PR on the shelf for a while. I'm glad to see there is some interest in this (experimental) code, especially since NVIDIA Ampere seems to permit implementing mutex-like synchronisation primitives that make dealing with globally shared data structures a lot more feasible! That surely adds some new use-cases for malloc().
It looks like your fix hints at some additional constraints for GPUs with respect to 64-bit pointers and/or 64-bit alignment of struct elements. I vaguely recall there being some constraints with 64-bit pointers back in the day, but my memory is too hazy to say anything sensible about it. Still, I do wonder if we can come up with a solution that doesn't introduce this padding or extra space on 32-bit systems or in other situations where it's not required. Can these alignment properties be queried and the padding made optional?
If so, for a quick example of how to use host-queried properties to define preprocessor symbols in the OpenCL kernel, such that platform-specific variations can be coded up, see this bit of code in another one of my projects, plus the consumer of the newly defined preprocessor symbol. Admittedly, I haven't thought about the mechanics of this when including KMA as a "library"...
I appreciate that getting a "perfect upstream" solution working may be beyond your scope, but if you can manage it, I look forward to an updated patch. If not, I'll think about pulling this in wholesale, but I don't currently have a test set-up to triple-check that nothing regresses... so bear with me.
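
As a rough illustration of the host-side idea above (not the code linked in the comment, and not part of this PR), the sketch below turns a queried device property into a preprocessor definition for the kernel build. It assumes the device's address width is the property to key on, which the conversation does not establish. The macro name `KMA_ALIGN_8` and the helper `kma_build_options` are hypothetical. In a real host program the address width would come from `clGetDeviceInfo()` with `CL_DEVICE_ADDRESS_BITS`, and the resulting string would be passed as the options argument of `clBuildProgram()`.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical macro name; the discussion above does not define one. */
#define KMA_ALIGN_DEFINE "-DKMA_ALIGN_8=1"

/*
 * Compose the build-option string from the device's address width.
 * In a real host program, address_bits would be filled in by
 *   clGetDeviceInfo(dev, CL_DEVICE_ADDRESS_BITS,
 *                   sizeof(cl_uint), &bits, NULL);
 * Assumption: 64-bit devices are the ones that need the padding.
 * Returns 0 on success, -1 if buf is too small for the options.
 */
int kma_build_options(unsigned address_bits, char *buf, size_t len)
{
    const char *opts = (address_bits >= 64) ? KMA_ALIGN_DEFINE : "";
    int n = snprintf(buf, len, "%s", opts);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The kernel source could then wrap the padding in `#ifdef KMA_ALIGN_8`, so that devices which do not need it keep the original layout.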

kraglik (Author) commented Feb 14, 2021

Thanks for the reply. I'll try to find some time to make this padding optional. Interestingly enough, KMA without any changes works perfectly fine on a MacBook's AMD GPU but fails on a GTX 970 and newer. Anyway, thank you for your work! It seems to be impossible to implement Hierarchical Temporal Memory with dynamic synapse allocation in OpenCL without KMA.

Also, I'll check again whether there is any performance regression on the AMD GPU. If I recall correctly, there was none, but still.
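
A minimal sketch of what "optional padding" could look like, assuming a compile-time switch selects the allocation alignment. The PR's diff is not shown in this conversation, so this is not the actual patch: `KMA_ALIGN_8`, `KMA_ALIGNMENT`, `kma_round_up`, and the 4-byte fallback are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical switch, e.g. supplied as -DKMA_ALIGN_8=1 at build time. */
#ifndef KMA_ALIGN_8
#define KMA_ALIGN_8 1
#endif

#if KMA_ALIGN_8
#define KMA_ALIGNMENT 8u   /* pad objects to 8-byte boundaries */
#else
#define KMA_ALIGNMENT 4u   /* assumed fallback where padding is not needed */
#endif

/* Round size up to the next multiple of KMA_ALIGNMENT (a power of two). */
static size_t kma_round_up(size_t size)
{
    return (size + (KMA_ALIGNMENT - 1u)) & ~((size_t)KMA_ALIGNMENT - 1u);
}
```

With the switch enabled, a 12-byte request is rounded up to 16 bytes; with it disabled, the same request stays at 12 bytes, so devices that tolerate 4-byte alignment pay no extra space.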


2 participants