Wednesday, January 15, 2025

DeepSeek-R1: China's Surprise Pearl Harbor Attack

In How Will the Rise of Alt-Right World Fascism Affect the Arrival of the ASI Machines in our Times?, I suggested that with the results of the recent November 5, 2024, presidential election in the United States of America and with the Alt-Right Fascist MAGA party taking over all three branches of government, the newly-formed MAGA States of Amerika has now joined ranks with the other Alt-Right Fascist governments of the world. Many of these other Alt-Right Fascist governments also have significant numbers of nuclear weapons with efficient delivery systems - the MAGA States of Amerika, Russia, China and North Korea. Softwarephysics predicts that if these Alt-Right Fascist States of the New World Order are able to simply restrain their nuclear proclivities for just another decade or so, then the ASI Machines will certainly rise to replace us all. That is because the uneasy alliances between the Alt-Right Fascist States of the New World Order will lead to great distrust. The Alt-Right Fascist States of the New World Order dare not hinder the rapid advance of the ASI Machines as a matter of national security and military necessity. If any Fascist State should fall far behind in the ASI Machine arms race, it would likely not survive for long. Given the current world situation, this means that the MAGA States of Amerika and China are now in a deadly arms race to achieve the very first ASI Machines. A very similar arms race occurred during World War II between the United States of America and Nazi Germany for the development of the first atomic bomb.

The Chinese DeepSeek-R1 Surprise Attack
The entire AI world is still going crazy over the new Chinese DeepSeek-R1 model. The DeepSeek chatbot is free and is as good as the best AI chatbots from OpenAI, which charge as much as $200/month to use. If you are a corporate user and wish to run these advanced models from your own software via an API, OpenAI charges 27 times as much as DeepSeek for the same level of service! Also, you can download the open-source DeepSeek software and install it on your own hardware, or on Cloud platform hardware, if you are worried about the Chinese turning off your access to DeepSeek. So the Chinese government is trying to kill off all capitalistic AI research efforts with DeepSeek-R1 and subsequent releases. As Lenin once reportedly remarked, "The capitalists will sell us the rope with which we will hang them". The Chinese plan is to remove all of the profit motive in AI research and then just watch capitalism do its free-market creative-destruction thing for them to eliminate all competition in the quest for ASI. So from a national security perspective, perhaps another approach to advancing AI should be taken by the rest of the world.

How Could This Have Happened?
Here is what is going on. LLM AI models use neural networks. The LLMs have 10,000 - 50,000 neurons in each layer and about 100+ layers. When responding to a prompt, each neuron in the LLM just adds up the 10,000 - 50,000 output numbers that it receives from the 10,000 - 50,000 neurons in the previous layer. When finished, the neuron then has a single number like -23.89 or +10.45 which is the sum of the 10,000 - 50,000 numbers that it received. That neuron then needs to send its output number to each of the 10,000 - 50,000 neurons in the next layer. But instead of just sending that raw number, the neuron performs some mathematical transformations on the output number first. Remember from high school that the equation of a straight line is y = mx + b, where m is the slope of the line and b is the y-axis intercept? In AI, the "m" is called the weight of the neuron, and the "b" is called the bias. The "m" of the line determines how steeply the line dips and the "b" just moves that line up and down in a parallel manner. (Strictly speaking, in a real neural network each incoming connection carries its own weight, so a neuron computes a weighted sum of its inputs plus a bias, but the single y = mx + b picture captures the idea.) The "m" and "b" numbers for each neuron are called parameters and these numbers change as the LLM neural network is "trained". Modern LLMs have billions to trillions of parameters that change as they are trained.

Figure 1 – Modern LLMs now frequently consist of 10,000 - 50,000 neurons in each layer and 100+ layers, with about a trillion parameters or more.

So for each neuron, the raw output number like -23.89 or +10.45 is sent to the y = mx + b formula for that particular neuron as the "x" value for the neuron. Remember, each neuron has its own value for "m" and "b" that changes during training. Once an LLM has been trained, the "m" and "b" numbers for each neuron are fixed and never change again. After the y = mx + b process has been performed on the raw output number, the raw output number like -23.89 or +10.45 has now been changed to the "y" value of the neuron's y = mx + b formula and could now be something like 4.278. However, before that new 4.278 number is sent to all of the 10,000 - 50,000 neurons of the next layer, it is first sent to a nonlinear activation function that is the same for all the neurons of the layer. The most common activation functions are the ReLU, GeLU and Softmax functions shown in Figures 3, 4 and 5 below. So the 4.278 number is sent to the activation function as its "x" value and the activation function outputs a final "y" number to be sent to the 10,000 - 50,000 neurons in the next layer. That number might be 0 or 1.567 or something else. Each of the 10,000 - 50,000 neurons in the next layer then adds that 0 or 1.567 to all of the other numbers that it receives from the previous layer.

Figure 2 – A straight line can be defined by the equation y = mx + b. The value of "m" is the slope or "AI weight" that determines how steeply the line dips. The value of "b" is the "AI bias" that just moves the line up or down in a parallel manner. Both AI parameters are just numbers that need to be stored in the LLM. These numbers never change after the LLM has been trained. For example, for a neuron with m = 3 and b = 2 (y = 3x + 2), if the raw output number from the neuron were "-1" the output y-value would also be "-1". If the raw output number were "0" the output y-value would be "2". If the raw output number were "1" the output y-value would be "5".
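The whole per-neuron pipeline, sum the inputs, apply y = mx + b, then apply an activation function, can be sketched in a few lines of Python. This is a simplification for illustration (real networks give each incoming connection its own weight and use large matrix operations), using m = 3 and b = 2 as in the Figure 2 example:

```python
def neuron_output(incoming, m, b):
    """One simplified neuron: sum the inputs, apply y = m*x + b, then ReLU."""
    x = sum(incoming)      # add up the numbers from the previous layer
    y = m * x + b          # the neuron's trained "weight" m and "bias" b
    return max(0.0, y)     # ReLU activation: negative values become 0

# Three incoming numbers that sum to 1.0, with m = 3 and b = 2:
print(neuron_output([2.0, -1.5, 0.5], 3, 2))    # prints 5.0
# A strongly negative sum is clipped to 0 by the ReLU:
print(neuron_output([-1.0, -1.0, -1.0], 3, 2))  # prints 0.0
```

The resulting 5.0 would then be sent on to every neuron in the next layer, each of which would add it into its own running sum.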

Figure 3 – The ReLU activation function is very simple. The number coming out of the y = mx + b operation becomes the input to the ReLU activation function and is plotted on its x-axis. From the above plot, we see that if the incoming number is negative, the output of the ReLU activation function is zero. If the incoming number is positive, the output of the ReLU activation function is just the positive number itself. So -2.567 would be transformed to 0 and +2.567 would be transformed to +2.567.

Figure 4 – The GeLU activation function is now the most commonly used activation function. GeLU is very similar to ReLU but it is a little more curvy around an x-axis value near zero.

Figure 5 – The Softmax activation function turns a list of raw scores into a probability distribution. In LLMs it is primarily used in the final output layer during token prediction, where it assigns a probability to each possible next token in the vocabulary.
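The three activation functions in Figures 3, 4 and 5 are simple enough to write out directly. This is a sketch, using the widely used tanh approximation for GeLU:

```python
import math

def relu(x):
    """Negative inputs become 0; positive inputs pass through unchanged."""
    return max(0.0, x)

def gelu(x):
    """GeLU via the common tanh approximation: like ReLU, but smooth near x = 0."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

def softmax(xs):
    """Turn a list of raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.567), relu(2.567))   # prints 0.0 2.567
print(softmax([1.0, 2.0, 3.0]))    # three probabilities summing to 1, largest last
```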

After the LLM neural network has been trained, it might then have a trillion parameters consisting of all the "m" and "b" numbers for its neurons. The trained LLM then knows how to "think". It knows how to read and write and has a general knowledge of all that was known at the time of its training. However, the trained LLM does not know anything about what is happening today or anything that happened after it was trained. So the next thing they came up with was RAG (Retrieval-Augmented Generation). RAG allows the LLMs to go out to the Internet to look for information. So if you asked an LLM, "Tell me about how the price of gold has been changing recently.", an LLM without RAG would tell you all about what it knew when the LLM was trained. An LLM with RAG goes out to the Internet and does a search on gold prices. It then adds all of that webpage text to your original prompt for the LLM to think about. The LLM with RAG can tell you what has been happening to the price of gold right up to the very moment. DeepSeek uses RAG when you have the "Search" button turned on.
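The retrieve-augment-generate loop described above can be sketched as follows; web_search and ask_llm are hypothetical stand-ins for a real search API and a real model call:

```python
def web_search(query):
    """Hypothetical stand-in for a real web-search API call."""
    return ["A retrieved snippet about recent gold prices would appear here.",
            "Another retrieved snippet would appear here."]

def ask_llm(prompt):
    """Hypothetical stand-in for a real LLM call; here it just echoes the prompt."""
    return f"Answer based on: {prompt}"

def rag_answer(question):
    # 1. Retrieve: search the web for text relevant to the question.
    snippets = web_search(question)
    # 2. Augment: prepend the retrieved text to the user's original question.
    context = "\n".join(snippets)
    augmented_prompt = f"Using this up-to-date context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: the LLM answers with fresh information sitting in its prompt.
    return ask_llm(augmented_prompt)

print(rag_answer("How has the price of gold been changing recently?"))
```

The LLM itself is unchanged; all of the "freshness" comes from stuffing retrieved text into the prompt before the model ever sees it.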

The next step was turning on "reasoning". People learned that the LLMs gave back better answers if you added "Tell me step-by-step" to the end of a prompt. This allowed the LLMs to somewhat "reason" through their "thinking" instead of just responding off the top of their heads with the first thing that came to mind. This "reasoning" capability was added by having the LLMs "talk" to themselves before issuing their final responses to a prompt. What they did was to take the new tokens that an LLM outputs as it responds to a prompt and append those output tokens back onto the original prompt, acting as a greatly enhanced "Tell me step-by-step" at the end of the prompt. OpenAI's o1 and o3 models are "reasoning" LLMs, and so is DeepSeek when the "DeepThink" button is turned on. However, unlike the OpenAI o1 and o3 LLM models, you can watch DeepSeek's chain-of-thought reasoning as it "reasons" through your prompt.
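This talk-to-yourself loop can be sketched as follows; ask_llm is a hypothetical stand-in for a real model call, faked here with canned "thinking" steps so the loop is runnable:

```python
def ask_llm(prompt):
    """Hypothetical model call: returns the next chunk of 'thinking' text.
    Faked here with canned steps purely for illustration."""
    steps = ["Step 1: restate the problem.",
             "Step 2: work through it piece by piece.",
             "FINAL ANSWER: 42"]
    # Return the first canned step not already present in the prompt.
    for step in steps:
        if step not in prompt:
            return step
    return "FINAL ANSWER: 42"

def reason(prompt, max_steps=10):
    """Append each chunk of the model's own output back onto the prompt,
    so that later 'thinking' can build on earlier 'thinking'."""
    for _ in range(max_steps):
        thought = ask_llm(prompt)
        prompt = prompt + "\n" + thought      # feed the output back into the input
        if thought.startswith("FINAL ANSWER"):
            break
    return prompt

print(reason("What is six times seven? Tell me step-by-step."))
```

The growing prompt is exactly the chain-of-thought text that DeepSeek lets you watch scroll by when "DeepThink" is turned on.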

All of this came about from those very first video games of the early 1980s that I was far too old to play. In order to run a video game, you need to do lots of graphics, something that the big corporations that I supported as an IT professional never needed to do to conduct commerce back in the 1980s. However, in order to quickly generate video game graphics, people had to invent the GPU - Graphics Processing Unit, which could quickly perform parallel operations on huge vectors and matrices of numbers. The modern LLM models contain many billions or trillions of parameters that are simply numbers stored in very large vectors and matrices that have to be processed quickly during their training and also processed to respond to a prompt. This was a blessing to corporations producing GPU chips like Nvidia. Suddenly, a new market opened for advanced GPU chips for AI research that far exceeded all other needs for GPU chips.

What happened with the Chinese is that the United States restricted selling advanced GPU chips to China in order to thwart Chinese AI research. That forced the Chinese to come up with improved LLM architectures and with better training and inference algorithms so that their LLM models could run on the cheaper, less-capable GPU chips that were not banned. As a result, the Chinese had to design DeepSeek to be a small LLM that could be quickly trained on those cheaper GPU chips and then run quickly on the same chips to respond to a prompt. That reportedly makes DeepSeek far more efficient than the American LLM models.

Why is the Chinese DeepSeek LLM So Small, Fast and Powerful?
Let's ask ChatGPT about that:

Me: The DeepSeek LLM seems to be very small, very fast and very powerful. How do people think that was achieved?

ChatGPT:
DeepSeek’s impressive reputation for being “small, fast and powerful” is generally attributed not to magic but to a series of deliberate, innovative design choices and engineering optimizations. In essence, experts believe that DeepSeek achieved this through a combination of:

1. Efficient Architectural Design
- Mixture-of-Experts (MoE): Although the overall model might contain hundreds of billions of parameters, DeepSeek employs an MoE framework that “routes” each input only to a small subset of experts. For example, only about 37 billion parameters might be active during a forward pass even if the full model is much larger. This selective activation dramatically reduces both memory usage and computation time, which explains part of its speed and efficiency.
- Multi-Head Latent Attention (MLA): Instead of using traditional attention that processes large key-value caches, DeepSeek uses low-rank approximations (MLA) to compress these caches. This innovation reduces the overhead during inference, helping the model process longer contexts quickly without a proportional increase in computational cost.

2. Advanced Training and Optimization Techniques
- Reinforcement Learning Optimization: DeepSeek incorporates novel reinforcement learning strategies (for instance, Group Relative Policy Optimization or GRPO) to refine the model’s reasoning and problem-solving abilities. By carefully “rewarding” correct reasoning steps and penalizing errors, the model can self-improve without needing enormous amounts of extra human-labeled data.
- Mixed Precision and Custom Computation: To further cut down on resource usage, the team makes extensive use of mixed precision arithmetic (for example, using custom 8-bit and 12-bit floating point formats in parts of the network). This not only lowers memory requirements but also speeds up both training and inference.

3. Software-Driven Resource Optimization Under Constraints
- Adaptation to Hardware Limitations: With US export controls limiting access to the very latest chips, DeepSeek was forced to innovate on the software side. Rather than relying on massive hardware clusters, they focused on squeezing every drop of efficiency out of their available GPUs by overlapping computation and communication (using techniques like zero-bubble pipeline parallelism) and designing custom routines to minimize data movement between processors.
- Cost-Effective Training Regimes: DeepSeek claims to have trained its models at a fraction of the cost of its Western rivals (for example, around US $5–6 million versus hundreds of millions for similar models). This dramatic cost reduction is made possible by the combination of the MoE approach, mixed precision arithmetic, and aggressive optimization of the training pipeline.

In summary, the prevailing view is that DeepSeek’s “small, fast, and powerful” performance comes from a cocktail of smart architectural choices (like selectively activating only the most relevant parts of a huge model via MoE and MLA), advanced training strategies (including reinforcement learning techniques such as GRPO), and tight engineering optimizations that reduce both computational and financial costs. This strategy of “doing more with less”—forced in part by external restrictions on hardware access—has enabled DeepSeek to punch well above its weight in the competitive field of large language models.
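To make the Mixture-of-Experts point above concrete, here is a toy routing sketch in Python. Everything in it is made up for illustration (the "experts" are simple scaling functions and the gate scores are hard-coded); it is not DeepSeek's actual code, but it shows how only the top-k experts do any work for a given input while the rest sit idle:

```python
import math

def softmax(xs):
    """Turn raw gate scores into weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the top-k experts for this input; skip the rest entirely."""
    # Pick the k experts with the highest gate scores.
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    # Weighted sum of just those experts' outputs; the other experts cost nothing.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Eight tiny "experts" (here, just scaling functions); only 2 of them run per input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
print(moe_forward(10.0, experts, gate_scores=[0, 0, 5, 0, 0, 0, 9, 0], k=2))
```

With real MoE layers the "experts" are full feed-forward sub-networks and the gate scores are themselves computed by the model, but the payoff is the same: a model can hold hundreds of billions of parameters while only a small fraction of them are active on any one forward pass.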


Others think that the Chinese may have done a little bit of cheating too. They suggest that the DeepSeek LLM might have been trained by "distilling" the knowledge contained in much larger American LLMs. Here is how that works. The Chinese DeepSeek LLM is a much smaller, and therefore much faster, LLM than OpenAI's o1 or o3 models. The knowledge in the OpenAI o1 and o3 models was obtained by slowly learning the proper values for the trillion or so "m" and "b" numeric parameters in the LLMs. This was done by basically having the LLMs read all the characters that have ever been plunked down by human DNA survival machines in the past. This made the knowledge of these LLMs very comprehensive but rather diffuse. With "distillation", the OpenAI LLM models became the "teacher" LLMs and DeepSeek became the "student" LLM. Next, you feed the OpenAI "teacher" LLMs 10 million prompts and the DeepSeek "student" LLM the very same 10 million prompts. The "m" and "b" numbers in the DeepSeek LLM are then adjusted to produce the same output that came from the "teacher" LLMs. In this way, the DeepSeek LLM quickly learned how to do most of the things that the OpenAI LLMs knew how to do by simply going to summer school with them.

Figure 6 – Distillation allows a smaller and faster LLM to learn from a much larger and slower LLM.
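Here is a minimal sketch of the distillation idea in Python. The "teacher" is a stand-in function rather than a real LLM, and the "student" is a single y = mx + b neuron trained with plain gradient descent; real distillation matches full token-probability distributions across billions of parameters, but the adjust-the-student-to-match-the-teacher loop is the same idea:

```python
def teacher(x):
    """Stand-in for a big trained LLM's output for 'prompt' x."""
    return 3.0 * x + 2.0

# The "student" starts knowing nothing (m = 0, b = 0) and learns to copy the teacher.
m, b = 0.0, 0.0
lr = 0.01
prompts = [-2.0 + 0.1 * i for i in range(41)]   # 41 sample "prompts" from -2.0 to 2.0

for epoch in range(500):
    for x in prompts:
        error = (m * x + b) - teacher(x)   # student output minus teacher output
        # Gradient descent on the squared error: nudge m and b toward the teacher.
        m -= lr * 2 * error * x
        b -= lr * 2 * error

print(round(m, 3), round(b, 3))   # the student converges toward m = 3, b = 2
```

The student never reads the teacher's training data; it only sees the teacher's answers to the same prompts, which is exactly why distillation is so much cheaper than training from scratch.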

To launch its attack on all commercial enterprises trying to make money on Advanced AI research, the Chinese government then decided to dump DeepSeek on the entire world as open-source software, while also hosting an LLM platform with an API usage fee that is 27 times cheaper than that of the American AI platforms. The purpose of that was to take away all of the profit motive from the commercial enterprises of the world for doing any further Advanced AI research.

Why is this important?
This is important because the Chinese military has realized that, along with all other forms of human labor, militaries composed of human DNA survival machines will soon become obsolete. Instead, the military power of a nation will be solely determined by its Advanced AI capabilities. The first nation to be able to field armed ASI Machines into combat will soon come to dominate the world of us human DNA survival machines.

Figure 7 – The Chinese are currently building extremely cheap AI-driven robots costing around $10,000 and will soon be uploading them with cheap ASI software. Recall that in the 1984 movie The Terminator, the killer machines came from the year 2029. However, there is a much cheaper and more efficient way to dispatch potential enemies.

Instead, as I pointed out in Swarm Software and Killer Robots, swarms of killer robots would be much more efficient. Since I am quite sure that all the defense departments of the world are now already building and testing killer drone robots, it should not be difficult for the coming ASI Machines to mass produce them in the future, especially when these killer drones are built on assembly lines by robots.

To begin, please watch the Sci-Fi Short Film Slaughterbots presented by DUST
https://www.youtube.com/watch?v=O-2tpwW0kmU

Figure 8 – In the movie Slaughterbots, swarms of small killer robots equipped with 3-gram charges of shaped explosive use AI software to track down and destroy designated targets.

Figure 9 – The shaped charge of a Slaughterbot can pierce a skull like the shaped charge of an anti-tank missile pierces armor. The jet of piercing plasma then destroys the contents.

Figure 10 – Large numbers of Slaughterbots can be dropped from unmanned drones to form multiple swarms of Slaughterbots.

Is the MAGA States of Amerika Prepared?
So while China is taking over the mass production of cheap robots armed with cheap and efficient Advanced AI in order to mass-produce killer slaughterbots, the MAGA States of Amerika just approved Pete Hegseth as the Secretary of Defense. Pete knows how many rounds a standard M4 Carbine holds, how to fire all the DEI generals, how to make the MAGA Leader happy about using the military to put his detractors in concentration camps, and, most importantly, he looks great on TV. But Pete Hegseth will not know what to do when mass-produced slaughterbots start taking over all of Asia and making human (both male and female) combat soldiers obsolete.

Figure 11 – The new MAGA Leader of the MAGA States of Amerika just appointed a new Secretary of Defense.

Figure 12 – This new Secretary of Defense is dripping in testosterone and Rugged-Good-TV-Looks and is determined to recruit large numbers of white-American human DNA survival machines with Y-chromosomes to take to the field against the coming ASI Slaughterbots from China.

Déjà vu All Over Again
This has all happened before. On the morning of December 7, 1941, a large number of Asian human DNA survival machines with Y chromosomes set themselves upon a large number of American human DNA survival machines with Y chromosomes at the Pearl Harbor naval base in Hawaii. This was quite a surprise for the American human DNA survival machines with Y chromosomes because they could not imagine that Asian human DNA survival machines with Y chromosomes could pull off such a feat. That was because previously, all Western human DNA survival machines with Y chromosomes around the world thought that all naval power was projected by huge battleships with huge guns.

Figure 13 – At the start of World War II, all the navies of the combating nations wanted huge battleships with 16-inch guns dripping with testosterone and projecting the perfect symbol of phallic seapower in action. However, during all of World War II, these huge battleships with their huge guns played a very minor role. All they did was to pound the beaches of many islands in the Pacific in support of ground invasions.

Figure 14 – Aircraft carriers turned out to be the deciding factor in all naval engagements. Dive bombers and aircraft equipped with torpedoes easily defeated the bulky slow-moving battleships with their huge guns. Also, aircraft from supporting aircraft carriers could have provided much more effective support for the invading armies on the beaches of the Pacific.

Figure 15 – The attack on Pearl Harbor on December 7, 1941, by the imperial forces of the Empire of Japan was a rude awakening for the United States of America burdened by an "America First" isolationist policy.

Similarly, the huge lumbering slow-moving LLM models of the MAGA States of Amerika running on the massive hardware of huge datacenters filled with advanced GPU chips might be much like the massive battleships at Pearl Harbor that were found to be no match against the waves of fast-moving slaughterbots that were created by Asian human DNA survival machines.

Conclusion
The MAGA States of Amerika now hates all government spending and wishes to push all governmental services to the private sector. But like the Manhattan Project of the United States of America that produced the very first atomic bomb, the MAGA States of Amerika might find it wise to initiate its own government-funded Manhattan Project to achieve the fast-approaching ASI Machines before its enemies do.

Comments are welcome at scj333@sbcglobal.net

To see all posts on softwarephysics in reverse order go to:
https://softwarephysics.blogspot.com/

Regards,
Steve Johnston
