C is not a low-level programming language
Eric Raymond blogged about a new article which points out, correctly I think, that C (and its derivatives) are no longer well matched to the design of current computer hardware. In essence, C is no longer a low-level language. The cited article also advocates, again correctly I think, that we must move to massive parallelism and redesign CPUs to eliminate speculative execution pipelines.
It’s also important to note that speculative execution obtains performance by wasting computation and generating heat. Remember, @keean, that I recently argued to you that we must abandon speculative execution because (given the revelation of Spectre) it can never be secure; it can leak information in potentially numerous subtle ways. Edward Cree argues there are solutions for the cache-line state leakage, but I and others think such complexity harbors a potentially unbounded number of statistically detectable vulnerabilities. And again, speculative pipelining is wasteful, which is an important factor these days given the difficulties with waste heat in data centers and battery life on mobile.
I wrote a comment on Eric’s blog (though my comment appears to have been censored or spam filtered):
“to one that seriously jacks up the percentage of your processor clocks doing productive work by using concurrency”
Parallelism Is Not Concurrency. I’m not claiming Robert Harper is correct to reject using the term concurrency to cover both deterministic parallelism and nondeterministic asynchronous programming. Though the conflation does seem to add to the confusion; for example, Wikipedia’s derelict article on concurrency doesn’t even mention asynchronous programming. Asynchronous programming can execute in a single thread with no parallelism, which for example is the enforced case in JavaScript (and as you know, WebWorkers have nothing to do with asynchronous programming).
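To illustrate the distinction, here is a minimal Go sketch of my own (not from Eric’s post): with the scheduler pinned to one OS thread, the goroutines still interleave asynchronously (concurrency), but nothing ever runs in parallel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Restrict the Go scheduler to a single OS thread: the goroutines
	// below still interleave (concurrency), but never run in parallel.
	runtime.GOMAXPROCS(1)

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for step := 0; step < 3; step++ {
				fmt.Printf("goroutine %d, step %d\n", id, step)
				runtime.Gosched() // yield, letting another goroutine interleave
			}
		}(i)
	}
	wg.Wait()
}
```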
“Do we need something much more divergent from C (Erlang? Ocaml? Even perhaps Haskell?)”
AFAICT, immutability and pure functions are invariants that help with reasoning about concurrency and parallelism. Should we also look into Clojure’s persistent data structures?
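As a rough illustration of the persistent-data-structure idea, here is my own Go sketch (not Clojure’s actual implementation): an immutable singly linked list where “adding” an element returns a new head and structurally shares the entire existing tail instead of copying it.

```go
package main

import "fmt"

// Node is an immutable cons cell; once built it is never mutated.
type Node struct {
	value int
	next  *Node // shared structurally by every list that extends it
}

// Push returns a new list head without modifying the old list.
func Push(list *Node, v int) *Node {
	return &Node{value: v, next: list}
}

func main() {
	base := Push(Push(nil, 1), 2) // list: 2 -> 1
	a := Push(base, 3)            // list: 3 -> 2 -> 1
	b := Push(base, 4)            // list: 4 -> 2 -> 1 (shares base with a)

	for n := a; n != nil; n = n.next {
		fmt.Print(n.value, " ")
	}
	fmt.Println()
	_ = b
}
```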
“It might be the case that (say) Erlang or Rust or Haskell imply a better toolkit for massively concurrent programming, if one but knew how to use it.”
Indeed. Preferably we need a language that doesn’t pride itself on being maximally abstruse for legacy C programmers. Although I also really like some facets of Python, such as the readability and regularity of block indenting instead of curly braces. (Note also that C’s optional curly braces can cause catastrophic bugs, such as Apple’s infamous “goto fail” SSL vulnerability.) You apparently also like some facets of Python, and I’m aware of your oft-stated point that readability and regularity are also a very high priority in PL design.
“Some of reposurgeon’s core operations can’t be factored into things done by a flock of threads at all, because for each commit you might want to vectorize over they do lookups unpredictably far back in time in the repository metadata.”
Could you essentially do what speculative pipelines are doing at a higher level by doing speculative parallelism?
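A hedged sketch of what I mean by speculation at a higher level (my own Go illustration, not reposurgeon’s actual code): launch the work for each plausible outcome of a slow lookup in its own goroutine, then keep only the result the real lookup selects and discard the rest, trading wasted computation for latency exactly as a speculative pipeline does.

```go
package main

import "fmt"

// speculate runs work(candidate) for every plausible candidate in parallel,
// then keeps only the result chosen once the slow lookup resolves.
func speculate(candidates []int, work func(int) string, lookup func() int) string {
	results := make([]chan string, len(candidates))
	for i, c := range candidates {
		results[i] = make(chan string, 1)
		go func(i, c int) { results[i] <- work(c) }(i, c)
	}
	chosen := lookup() // the slow, serializing lookup
	for i, c := range candidates {
		if c == chosen {
			return <-results[i] // the other goroutines' work is simply discarded
		}
	}
	return ""
}

func main() {
	out := speculate(
		[]int{1, 2, 3},
		func(c int) string { return fmt.Sprintf("processed candidate %d", c) },
		func() int { return 2 },
	)
	fmt.Println(out)
}
```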
“(a) is not going away even partially without a lot of very hard work changing our algorithms”
We need to find parallelizable replacements for SICK algorithms, and/or paradigm-shift their holistic context anyway, in order to scale our networked existence where potentially massive user counts multiplex on a hub algorithm. The epoch of the desktop as an island is departing.
Steve Jobs emphasized scalability.
“computing the nth term in a crypto hash chain”
Verification of blockchains is the relevant scalability hurdle of your stated context. Computing the initial instance of the hash chain is an irrelevant cost. Thus, segment (vectorize) the hash chain and verify it in parallel.
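A minimal sketch of the segmented verification I have in mind (my own Go illustration with a toy hash chain; it assumes all intermediate hashes are already published): each segment’s links can be checked independently, and thus in parallel.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// verifySegment checks that chain[i+1] == SHA-256(chain[i]) for i in [start, end).
func verifySegment(chain [][32]byte, start, end int) bool {
	for i := start; i < end; i++ {
		if sha256.Sum256(chain[i][:]) != chain[i+1] {
			return false
		}
	}
	return true
}

func main() {
	// Build a toy hash chain serially (creation is the irrelevant cost).
	const n = 1 << 16
	chain := make([][32]byte, n)
	for i := 1; i < n; i++ {
		chain[i] = sha256.Sum256(chain[i-1][:])
	}

	// Verify it in parallel: each worker independently checks one segment.
	const workers = 8
	seg := (n - 1) / workers
	ok := true
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start, end := w*seg, (w+1)*seg
		if w == workers-1 {
			end = n - 1
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			if !verifySegment(chain, start, end) {
				mu.Lock()
				ok = false
				mu.Unlock()
			}
		}(start, end)
	}
	wg.Wait()
	fmt.Println("chain valid:", ok)
}
```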
The other algorithms you cited are serialized because they presume a total order. But the real world is not a total order; e.g. the data being searched may be changing while it is being searched. Thus I think we’ll find suitable approximations or relativistic algorithms which are parallelizable. IOW, we should paradigm-shift our perspective.
And thus yes this article you cited seems to be very important. And thank you for recognizing it and bringing it to my attention.
“we need the target language’s threading overhead to be sufficiently low on the target architecture that it doesn’t swamp the gains.”
Amdahl’s law. Thus Rust’s zero-overhead synchronization, or any lockless design. Goroutines (green threads) have more efficient context switches and lower creation overhead than OS threads, because OS threads have many more concerns to juggle.
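For reference, a hedged sketch of the arithmetic (the overhead term o(n) is my own addition to the classic formula, to model the threading cost Eric is worried about): if a fraction p of the work parallelizes over n threads, the speedup is

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n} + o(n)} \le \frac{1}{1 - p}
```

so if the per-thread overhead o(n) grows with n, it can swamp the p/n gains exactly as he says.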
“Are we supposed to annotate our programs with cache-priority properties attached to each storage allocation?”
Holistic PL design and algorithm + data structure choices can help without such complexity. For example, boxed data and lists are much worse than unboxed arrays for cache coherency because of the extra pointer indirection. Copying and immutability instead of pointer indirection also seem to be faster. Avoiding dynamic dispatch and enabling monomorphisation argues for typeclasses instead of virtual methods. These are the features we’ve been contemplating for a mainstream PL intended to hit a sweet spot on both the client and the server.
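A rough Go illustration of the boxed-versus-unboxed point (my own sketch, not from the article): a slice of structs stores the values contiguously, whereas a slice of pointers scatters them across the heap and forces an extra indirection on every access.

```go
package main

import "fmt"

type Point struct{ X, Y float64 }

// Contiguous (unboxed) layout: all Points are packed into one array,
// so a linear scan walks sequential cache lines.
func sumUnboxed(points []Point) float64 {
	total := 0.0
	for _, p := range points {
		total += p.X + p.Y
	}
	return total
}

// Boxed layout: the slice holds pointers, so every element access
// chases a pointer to a potentially distant heap location.
func sumBoxed(points []*Point) float64 {
	total := 0.0
	for _, p := range points {
		total += p.X + p.Y
	}
	return total
}

func main() {
	unboxed := make([]Point, 1000)
	boxed := make([]*Point, 1000)
	for i := range boxed {
		unboxed[i] = Point{X: 1, Y: 2}
		boxed[i] = &Point{X: 1, Y: 2}
	}
	fmt.Println(sumUnboxed(unboxed), sumBoxed(boxed))
}
```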
“I think the answer is really “No, because there are very good reasons we’ll never have a ‘low-level’ language again.””
I’ve also bought into this theme you expressed in your recent blogs about Go versus Rust et al.
BTW, after studying the extant GC variants, I have devised a new GC scheme that combines the best of generational GC with the best of reference counting (RC). I posit this might eliminate the worst facets of each, but experimentation and more design work are needed.
“the reality of buzzing threads and lots of test-and-set instructions that has to be going on underneath. Naked, that reality is unmanageable by mere human brains”
Alternatives to that hair shredder of dynamic synchronization, race conditions, and dead/livelocks include Rust’s zero-overhead lockless model, which however carries significant tsuris and inflexibility for the programmer. My new GC scheme aims to gain much of Rust’s zero-overhead stack reclamation performance and cache coherency benefits. We’re also contemplating immutability and single-threaded asynchronicity. For lockless design, restricting nondeterministic asynchronicity to a single thread eliminates many of the cases that would otherwise require Rust’s global total order of exclusive mutable access. There’s more to explore down that rabbit hole, but I digress from the main topic…
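A minimal sketch of the lockless single-owner idea (my own Go illustration, not my actual GC or language design): confine all mutation of a piece of state to one goroutine and have other code send it messages, so no locks and no Rust-style exclusivity proofs are needed for that state.

```go
package main

import "fmt"

// counter is owned by exactly one goroutine; all mutation is serialized
// through its channel, so no locks are needed anywhere.
func counter(requests <-chan chan int) {
	state := 0 // mutable state confined to this goroutine
	for reply := range requests {
		state++
		reply <- state
	}
}

func main() {
	requests := make(chan chan int)
	go counter(requests)

	for i := 0; i < 5; i++ {
		reply := make(chan int)
		requests <- reply // a message instead of shared mutation
		fmt.Println("count is now", <-reply)
	}
	close(requests)
}
```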
“My hexacore Great Beast already has more concurrency capacity than reposurgeon’s graph-surgery language can use effectively”
Someone of your intellect can, I presume, multitask the conversion of more than one repo simultaneously, though presumably that would be more challenging.
Or multitask other activities, such as editing and iterative (or just batch) compiling, while doing reposurgery. Or sell your idle CPU power into a market, running hardware virtualization for security.
“Chisnall seems to me to be suffering from a bit of a streetlight effect here.”
I’m less than a decade younger than you. Perhaps the brains of us elders here are collectively ossifying because of our legacy attachments. Let’s think outside the box. Surely the future is not as pessimistic as the presumption that serial algorithms are an insoluble problem due to the limitations of physics. The young CS dudes may actually be correct on this one.
@mlwmohawk wrote at the linked article:
This is an excellent troll and strawman argument. The "C" programming language enforces nothing in the way of memory model or threads or processor layout. The language itself can be applied to almost any hypothetical processor system. All that is required is some sort of sane operational characteristic that can be programmed. I will grant that most developers and generic libraries assume thing like malloc, stacks, cache, serial execution but not C. C takes an expression and translate it to a set of instructions, nothing less and nothing more.
The pot calling the kettle black. This troll misses the point that C’s pointer arithmetic and packed, ordered fields in data structures presume a flat memory model (i.e. sequential memory buckets), which makes numerous cache coherency optimizations unattainable by the compiler. Of course the programmer could choose not to use those features of C and hand-code such optimizations, but why do that when the compiler could do it if the language were designed differently? And when operating in a C ecosystem where one must interface with third-party code, how can one dictate that all third parties abstain from using prominent C features? Also, C provides no language features for aiding parallelism.
What would be really interesting is if you could describe a programming model that would be better than C and not be implemented in C.
Perhaps he has a valid argument that C is so low-level that you can do whatever you want. However, the author’s point is also that C encourages programming which requires complex hardware caches, so in that respect C makes performance difficult to control, because the model of C encourages a hardware model which is difficult to control.
The point is that C, by not optimizing for parallelism and low-latency memory, forces hardware to adopt designs incorporating wasteful speculative pipelines and highly complex caches (e.g. high set associativity and data sharing between threads) which are difficult to control at a low level.
@bill wrote at the linked article:
I am made dumber by reading this tripe. A low level language is one that targets - to some degree - the underlying processor architecture. Meltdown et al, and many of the hardware features that you discuss, are the domain of micro-architecture. The breakdown was within the contract between the microarchitecture and the architecture. C remains a "low level" language - regardless of whether the microarchitecture changes. And regardless of whether the architecture include anything remotely like PDP-11 semantics.
Again this person seems not to grok the author’s point that the large gap between the architecture the programmer can control and the microarchitecture he cannot is being driven by C: because C encourages serial algorithms and a flat memory model, the hardware is forced into designs employing wasteful speculative pipelines and complex caches.
@andrew wrote at the linked article:
The C language is the standard by which every other language is measured. In terms of speed and readability and learning curve and maintainability it is number one. As a result, programs written in C will have fewer patches/bugs than those written in other languages. It is a good language but it is said to have some vulnerabilities (such as buffer overflows) which upon examination tend to be problems with the C-runtime and not C itself. With a few minor modifications, it could be made more mature, but there is no money in it like there are with the hyped up faddish languages.
It is as if this guy didn’t even comprehend the article he is commenting on! I experience this often when I write on a complex issue: ostensibly low-IQ Dunning-Krugers make statements indicating they really don’t understand what they just “read”. C is so semantically low-level (differentiate this from being low-level w.r.t. the hardware, which C is not, as explained above) that it is not the most readable and maintainable language, because there’s too much cruft hiding the actual algorithm. I get the point that C is a simple language that is easy to grasp and its code thus easy to read, but that doesn’t mean the algorithms are easy to extract from the code being read. And given that C encourages serial algorithms and a flat memory model, the hardware is forced into wasteful speculative pipelines and complex caches, so C is not even the fastest (regardless of what contrived benchmarks say, because they’re not measuring what a more optimal language + hardware paradigm would produce).
The buffer-overflow solution doesn’t necessarily have to be incorporated into a runtime. It is possible to statically type the access to arrays such that buffer overflows are not possible. For example, one of the hindrances to such in C is that C allows pointer arithmetic, so that instead of employing map, fold, and such over the array, one is able to point anywhere into memory as an unrestrained operation.
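For instance, a hedged Go sketch of my own (not from the article) of what fold-style, bounds-checked array access looks like: the iteration never produces an out-of-bounds address, so a buffer overflow of the C pointer-arithmetic kind simply cannot be expressed.

```go
package main

import "fmt"

// fold applies f over every element of the slice; the range construct is
// bounds-checked by construction, so no index can stray outside the buffer.
func fold(xs []int, init int, f func(acc, x int) int) int {
	acc := init
	for _, x := range xs {
		acc = f(acc, x)
	}
	return acc
}

func main() {
	data := []int{3, 1, 4, 1, 5}
	sum := fold(data, 0, func(acc, x int) int { return acc + x })
	fmt.Println("sum:", sum)
	// There is no way to write the C idiom *(data + 9000) here;
	// data[9000] would panic at the bounds check rather than silently overflow.
}
```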
His remark about “faddish” languages clearly marks him as a C and serial-algorithms fanboy.
@Keith wrote at the linked article:
I do realize that we find ourselves in a "let's tear down C" environment lately. I think it threatens millennials who think of it as low-level when compared to the training-wheel languages being used to produce web apps - the ones that allow "programmers" to keep one eye on their phones while "working".
Ditto my prior rebuttal to @andrew, especially including my reaction to the “faddish” comment. These guys don’t seem to understand that C no longer suffices, because it has significant deficiencies in the current epoch of computing.
@Hans Jorgensen wrote at the linked article:
GPU code, even with much better parallelism and memory usage, is still remarkably C-like (or even explicitly C in the case of Nvidia's CUDA), so I imagine that C (which is still a high-level language) could be adapted to use these alternate paradigms.
The C aspect of these GPU platforms does little or nothing to help extract parallelism out of general algorithms. There are certain hardware features that can be accessed, but the algorithm must be specifically designed for them. Only a narrow class of highly vectorizable algorithms qualifies.
@Victor Yodaiken wrote at the linked article:
- Most GPU programming is in C, no matter what you imagine.
- Erlang coding style does not solve the "too few threads" problem magically. That's an algorithmic problem, not a programming language problem. Some problems are suited to vectorization or GPU style architectures and some don't.
Ditto my prior reaction to @Hans Jorgensen. These guys don’t seem to know about or grok the author’s point that certain language paradigms promote extracting parallelism even from algorithms that are not highly vectorizable, and/or can facilitate vectorizing portions of algorithms. Ditto the points about promoting cache coherency and reducing reliance on complex caches. The GPU may not be the optimum hardware model for the sweet spot. The author discusses architectures that leverage a high volume of threads, not necessarily all with non-shared data, and/or with a GC model that optimizes caching. Again I refer readers to my design of a new GC paradigm combining generational collection and RC. There may be a sweet spot which isn’t just a GPU and GPU coding platforms.
It is interesting so see sort of endemic, and not terribly informed, hostility towards C on the part of LLVM developers.
Ditto my prior rebuttal to @andrew, especially including my reaction to the “faddish” comment.
Specifically, the reason a Transputer-like GPU might not hit the sweet spot for general-purpose computing, even in a parallelism-focused regime, is large working sets, as explained by @Jeff Read in a comment on Eric’s blog:
I was watching some marketing videos for the Transputer last night. They claimed that the Transputer always scales linearly in the number of CPUs and demonstrated this with–er, ray tracers, Mandelbrot set renderers, and programs that drew Mandelbrot sets into ray-traced scenes.
The Transputer was once hailed as the next big thing, arguably the greatest thing to happen to computing since the microprocessor itself. It failed in part because the system architecture falls down hard on SICK algorithms, particularly those with large working sets. Not only would runtime of such an algorithm be constrained by the speed of a single Transputer, but Transputers were inherently NUMA — each CPU came with a sliver of RAM on board, and system memory was the sum of all the RAM slivers on each CPU die. Accordingly, to work with a large dataset, Transputers had to divide the working set across all CPUs, and then coordinate with one another to either fetch needed data to a CPU that had it, or hand off control to another CPU, feeding it the data it needs.
So the author of the article explained that instead of complex caches (and to accommodate the large working sets the Transputer couldn’t), we perhaps need an architecture with so many threads running that the latency of shared memory is offset by the computation those threads do between accesses to main memory, such that the CPU is always fully utilized. IOW, while some threads are waiting on main memory, other threads are running computation.
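A rough analogy in Go (my own sketch; Go’s scheduler hides blocking waits rather than real DRAM stalls, so treat this only as an illustration of the principle): oversubscribe the cores with far more workers than hardware threads, so that while some workers are stalled waiting on “memory”, others keep the CPU busy computing.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker alternates a simulated main-memory stall with a burst of computation.
// With many more workers than cores, the stalls of some overlap the compute
// of others, keeping the CPU busy (the latency-hiding idea).
func worker(wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < 10; i++ {
		time.Sleep(time.Millisecond) // stand-in for a memory-latency stall
		x := 0
		for j := 0; j < 100000; j++ { // stand-in for useful computation
			x += j
		}
		_ = x
	}
}

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	const workers = 256 // far more workers than cores
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go worker(&wg)
	}
	wg.Wait()
	fmt.Println("elapsed:", time.Since(start))
}
```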
Note that AFAIK most GPU cards include a lot of shared memory with a very fast, wide bus. So perhaps they’re already close to optimal even for large working sets.
Here’s another asinine comment found at Eric’s blog where @Ian Bruene wrote:
How does someone get to the level where they can write an article on this – that gets noticed no less – without having had their faces rubbed in Amdahl’s Law already?
This imbecilic troll doesn’t seem to appreciate that serial algorithms are stuck in the mud and Moore’s law has ceased to be true for them. The hardware for serial algorithms is not getting any faster.
Thus striving for maximum parallelism is the only way to continue Moore’s law. Amdahl’s law doesn’t apply if there’s no serial component. To the extent a serial component is unavoidable, extracting the parallelism is still the only way to go faster.
Duh!
@James P wrote at Eric’s blog:
[…] but the slow down is not due to the “cores”, it was RAM.
That’s correct. Latency doesn’t scale with Moore’s law. Latency improvements can’t keep up with the rate at which computational capacity increases due to Moore’s law (which has not ceased for parallelized computation), nor with Nielsen’s Law of Internet Bandwidth.
An important follow-up to this blog post continued in a new issues thread, Parallelism.
@NodixBlockchain wrote:
I dont think C was ever a low level language in the sense "close To cpu", or maybe very long time ago.
On 286, memory was segmented, and C compilers wrapped the whole program To a single segment, IIRC, you could never manipulate segment register directly in C.
Then 386, added paged memory, multi tasking, etc, you couldnt touch that with C.
Then pentium with pipelines etc, then mmx, simd, still nothing To manipulate this in C.
But if low level is taken as in "low abstraction level" then its more true, and it means C program are built with very simple abstraction that allow to define every small step of the program, without any "implicit logic".
Hi @NodixBlockchain. As you know we’ve been discussing this with you.
Yeah C was closer to the hardware model 40 years ago.
Ah yeah I had forgotten about the Intel 80286 having segmented memory space.
That’s a reasonable point to make.