
Tuesday

The most complex model we actually understand

The most complex model we actually understand - YouTube

Transcript:
No one understands modern AI. Each new little piece of text, known as a token, produced by ChatGPT is the result of hundreds of billions of separate calculations. The parameters used in these calculations are learned from data by training ChatGPT to predict a single token at a time. But somehow, from just learning to predict the next little piece of text again and again across trillions of examples, what feels like real intelligence emerges. What pathways through the network's billions of computations are responsible for specific knowledge or abilities? Why do certain skills only emerge from models of a certain size, or after training for a certain duration? Are these giant models just memorizing, or are they actually learning? Today we have many compelling clues but no definitive answers to these questions. One interesting question we can ask is: how much complexity do we have to strip away before we can really, truly understand a model? We know how the individual artificial neurons that make up these models work, although this did take some time to sort out back in the 1960s. As we connect more and more of these neurons together, when exactly does our understanding really start to break down? In this video, I'm going to claim that one specific example, grokking modular arithmetic with a single-layer transformer, is the most complex AI model that we fully understand.
This is obviously highly subjective. If you have a different example that you think fits, please share it in the comments. Your answers could make for a fun follow-up video. Like many scientific discoveries, we stumbled onto grokking completely by accident. The initial discovery led to some remarkable follow-up work that allows us to rigorously understand what the model's parameters are actually learning and why certain behaviors emerge later in training.
And incredibly, we can even watch the model progress from just memorizing training examples to learning a robust Fourier-space solution to the modular arithmetic problem. This example is a few years old at this point, but it's an amazing and still very relevant way to look under the hood of modern transformers. At the end of this video, we'll also look at some fascinating more recent results from a team at Anthropic, where the team found a six-dimensional manifold in the activations of Claude Haiku that appears to be responsible for handling the
arithmetic required for the model to figure out when to create new lines as Claude writes. In 2021, a research team at OpenAI was training small models to perform modular arithmetic. If we take a mathematical operation like x + y, we can turn this operation into a dataset by creating a table with various x values as our columns and various y values as our rows.
From here, we can fill in each cell with the sum of x and y: 0 + 0 is 0, 0 + 1 is 1, and so on. The team was studying modular arithmetic, meaning we need to pick a largest number, or modulus. When our number reaches or exceeds the modulus, we divide by the modulus and take the remainder. If we choose a modulus of 5, when we reach 1 + 4 on our table, the answer is actually 5 modulo 5, which equals 0.
4 + 2 equals 6, which modulo 5 gives a final answer of 1, and so on. The modulo operation gives our model some interesting structure to learn and nicely bounds the number of individual tokens our model needs. We know that in this case our answer will always be 0, 1, 2, 3, or 4. From here we set aside a portion of our data for testing and train on the remaining examples.
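As a concrete sketch, here is how such a dataset could be assembled. This is a minimal version under my own choices: the variable names and the 70/30 split fraction are illustrative, not from the paper.

```python
import numpy as np

p = 5  # modulus; the toy example here uses 5, the Nanda model later uses 113
rng = np.random.default_rng(0)

# Every cell of the addition table: (x, y) -> (x + y) mod p
pairs = [(x, y) for x in range(p) for y in range(p)]
labels = [(x + y) % p for x, y in pairs]

# Hold out a portion of the table for testing, train on the rest
idx = rng.permutation(len(pairs))
split = int(0.7 * len(pairs))  # 70/30 split, chosen arbitrarily here
train_idx, test_idx = idx[:split], idx[split:]

print(pairs[0], "->", labels[0])   # (0, 0) -> 0
print(pairs[9], "->", labels[9])   # (1, 4) -> 0, since 5 mod 5 = 0
```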
It's worth taking a moment to consider what this dataset really looks like from our model's perspective. Our model has one input and one output for each token in its vocabulary. We need five tokens to represent our numbers 0 through 4, and we'll add one more token to represent our equals sign. We could also add a token for the plus sign, but since we'll only be training our model on addition, it's not needed.
Having a token for the equals sign is helpful, however, as we'll see. This effectively gives our model a placeholder for its final answer. So our model has six total inputs, one for each token. For comparison, GPT-5 has 200,000 inputs, again one for each token in its vocabulary. To input a math problem into our model, for example 1 + 2, we pass the first token in our math problem, 1, into the model by switching on the 1 position and switching off all the other positions.
This is known as one-hot encoding and is how the model sees our first token. Our second token, 2, is passed into our model by switching on the second input and switching off the rest. Finally, our equals sign tells us to switch on only the final input to our model. So the math problem 1 + 2 from the perspective of our model looks like its first input switched on, then its second input, and then its sixth input.
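Here is a minimal sketch of that one-hot view of "1 + 2 =" in code. The six-token vocabulary (digits 0 through 4 plus an equals token in the final slot) is exactly as described above; the function name is my own.

```python
import numpy as np

vocab_size = 6          # tokens 0..4 plus one token for "="
EQUALS = 5              # the equals sign gets the sixth input

def one_hot(token, size=vocab_size):
    v = np.zeros(size)
    v[token] = 1.0
    return v

# "1 + 2 =" as three one-hot columns (the plus sign needs no token)
tokens = [1, 2, EQUALS]
X = np.stack([one_hot(t) for t in tokens], axis=1)  # shape (6, 3)
print(X)
```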
Transformers like these are generally configured to return outputs of the same dimension that they're given. So our model's final output will also be 6x3. In this case, we're only going to look at the final column of the model's output. This is where we want the right answer to show up.
And in this case, we want the 3 output to be switched on, since 1 + 2 is 3. So what our model is really learning is to map this pattern of 18 values, mostly zeros, to this new pattern of six values. Now imagine someone just handed you a bunch of different target input and output patterns. Here are the input and output patterns for 1 + 3 = 4.
Here's 2 + 3 = 0, and so on. After you saw enough of these examples, do you think you could figure out the underlying structure of the problem? This is precisely how large language models work. When we pass the text "the capital of France is" into Llama, for example, the token for "the" tells us to switch on input 791.
The token for "capital" tells us to switch on input 6864, and so on. Moving to Llama's output, the final column is maximized at an index of 12366, which corresponds to the token for Paris. It's easy to forget that the symbols we assign to our model's inputs and outputs have this extra meaning that we attach to them.
 But to the model, they're just patterns of inputs and outputs. Now, when the OpenAI team trained their model on modular arithmetic, their initial results were pretty underwhelming. The model was able to quickly learn to match the patterns in the training data, giving the correct output on all training examples. However, the model performed very poorly on the test set.
It appeared that the model had simply memorized the training data without actually learning modular addition. But then something interesting happened. One of the researchers went on vacation but accidentally left a model training. Returning from vacation, the researcher was shocked to discover that after a very large number of training steps, the model had suddenly generalized, performing perfectly on both training and test sets.
What mechanism could possibly be causing the model to perfectly fit the training examples after just a couple hundred steps, appear to lie dormant for a couple thousand steps, and then suddenly actually learn? And could similar dynamics happen in full-size models? In Robert A. Heinlein's 1961 novel Stranger in a Strange Land, he coins the term grokking.
The book's main character, a human who was raised on Mars and returns to Earth, uses the Martian word grok throughout the book. Grok has no direct translation from the far more complex Martian language, but one meaning is to understand something so thoroughly that you merge with it and it merges with you. The OpenAI team was able to replicate the sudden generalization phenomenon across a range of arithmetic operations and model configurations, and in January 2022 published a paper where they called the phenomenon grokking.
Grokking is a provocative name, but the phenomenon itself is shocking. What could be causing the model to suddenly perform perfectly on the test set? A year after the publication of the OpenAI grokking paper, a team led by researcher Neel Nanda published an incredibly detailed analysis of the phenomenon.
Their paper digs deep into the model's parameters and activations to produce a very satisfying and elegant explanation. Nanda and his collaborators studied a single-layer transformer. This is the same architecture used in most large language models, just with fewer layers. A transformer layer is composed of an attention block and a multi-layer perceptron compute block.
As we saw with our toy example earlier, our data is fed into our model using one-hot vectors. Nanda used a modulus of 113, so the model's input vectors are of length 114, with 113 positions for the digits 0 through 112 and a final position for the equals sign. So to ask our model what 1 + 2 is, we pass in this 114x3 matrix made up of all zeros except for a one in the 1 spot of our first column, a one in the 2 spot of our second column, and a one in the equals spot of our final column.
From here, our 114x3 matrix is multiplied by a matrix of learned weights known as an embedding matrix, producing three new vectors of length 128 each. These resulting embedding vectors are no longer sparse and, as we'll see, contain some interesting structure. From here, our embedding vectors are passed into our attention block and then our multi-layer perceptron compute block.
The output of our multi-layer perceptron is of length 128. We multiply this output by an unembedding matrix to compute a final vector of length 114. The model's answer is given by the largest value in this final vector. So if our model is working well, its maximum output value should occur in the 3 position, corresponding to the correct answer 1 + 2 = 3.
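For readers who want to make this concrete, here is a rough PyTorch sketch of a one-layer transformer with these dimensions. The vocabulary size (114), embedding width (128), and MLP width (512) come straight from the description above; the head count, the learned positional embedding, and the residual wiring are my simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class OneLayerTransformer(nn.Module):
    def __init__(self, vocab=114, d_model=128, d_mlp=512, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)          # embedding matrix
        self.pos = nn.Parameter(torch.zeros(3, d_model))   # one position per token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(                          # MLP compute block
            nn.Linear(d_model, d_mlp), nn.ReLU(), nn.Linear(d_mlp, d_model)
        )
        self.unembed = nn.Linear(d_model, vocab, bias=False)  # unembedding matrix

    def forward(self, tokens):                 # tokens: (batch, 3), e.g. [x, y, "="]
        h = self.embed(tokens) + self.pos      # (batch, 3, 128)
        a, _ = self.attn(h, h, h)              # attention block
        h = h + a
        h = h + self.mlp(h)
        return self.unembed(h[:, -1, :])       # read the answer at the "=" position

model = OneLayerTransformer()
logits = model(torch.tensor([[1, 2, 113]]))    # "1 + 2 =", where index 113 is "="
print(logits.shape)                            # torch.Size([1, 114])
```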
Training this model on modular addition, we see the same grokking behavior observed by the OpenAI team, with the model first memorizing the training data after around 140 steps and then generalizing after 7,000 training steps. Let's explore the model's intermediate outputs, better known as activations. Specifically, let's have a close look at the outputs of some of the neurons in the second layer of our multi-layer perceptron block.
This layer has 512 total neurons. If we pass the problem 0 + 0 into our network, the first neuron of this layer returns an output value of 1.17. Our second neuron returns an output of 0.6, and so on. Now let's visualize how these values change as we change the input math problem. Let's fix the value of x to 0 and explore a range of y values, starting with 0 + 0,
then 0 + 1, then 0 + 2, and so on. Sweeping through all 113 possible values for y, we see some interesting structure, with the outputs of some of our neurons looking like sine waves. Digging deeper, let's explore the correlation between all the different pairs of these neurons. Let's color our points using the input y value to our model.
So our neuron outputs given the input 0 + 0 are colored purple, and outputs given the input 0 + 112 are colored yellow. From here we'll create a 7x7 grid of scatter plots for each pair of neurons. So on the second scatter plot on our first row, for example, we'll plot the output of our first neuron as the y value and the output of our second neuron as the x value.
Bringing our two waves together like this results in a nice loop shape. Creating the same plots for each pair of neuron outputs, we see more interesting structures. So our model has clearly learned some type of structure. But could this structure be related to grokking? If we move backwards in our training process and visualize these structures as we go, we see that by the time we reach our model that just memorizes our training set, these structures completely disappear.
So while this early model performs perfectly on the training set, we don't see any evidence of the waves and loops that we see after grokking. So perhaps these structures are related to why the model groks. This video is sponsored by me. The Welch Labs team and I have written a whole new book on AI. It's beautifully illustrated and is a great way to dig deeper into the topics we cover in these videos.
Each chapter includes thought-provoking exercises and supporting code. Our first print run is totally sold out, but we have another batch coming quickly in January. And if you order now, I'll send you a discount code for a free download of the ebook version. Books and education are really near and dear to my heart, and we've poured a ton of effort into this book.
I really think you're going to like it. Now, back to grokking modular arithmetic. The wave shapes and loops we see inside our model as it groks suggest that the model is potentially computing and making use of the sines and cosines of our inputs x and y. If we take a discrete Fourier transform of our activation pattern, we can compute the frequencies of the waves learned by our model.
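Given a length-113 sweep of a neuron's output, the dominant frequency can be read directly off the discrete Fourier transform. Since we don't have the trained model in hand, the sketch below uses a synthetic wave as a stand-in for the real activations:

```python
import numpy as np

p = 113
y = np.arange(p)
# Synthetic stand-in for a neuron's sweep: a wave at frequency 8*pi/113,
# i.e. k = 4 cycles across the 113 inputs (2*pi*4/113 = 8*pi/113)
activation = np.cos(8 * np.pi * y / p + 0.3)

spectrum = np.abs(np.fft.rfft(activation))
k = int(np.argmax(spectrum[1:]) + 1)   # skip the constant (DC) term
print(k, "cycles, i.e. frequency", f"2*pi*{k}/113 = {2*k}*pi/113")  # 4 -> 8*pi/113
```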
This first wave yields a largest frequency component of 8 pi over 113, and our third wave shows a largest frequency component of 6 pi over 113. If we plot these waves on top of our model's outputs, we see nice alignment. Let's look for these frequencies in other places in our model. Let's visualize a single value in our first embedding vector.
Just as we did with the neurons in our multi-layer perceptron, let's plot this value as we sweep through a range of input values. Note that our first embedding vector only depends on our first input x. So here we'll sweep from x = 0 to x = 112 while keeping y fixed at zero. We don't see quite the same smooth plots that we saw earlier.
But if we compare our curve to a cosine wave with a frequency of 8 pi over 113, we do see reasonably good alignment. Part of the challenge here is that this early signal in our network also appears to contain higher-frequency information, which makes sense given that we found evidence of multiple frequencies later in our model.
We could analyze the frequency content of our full embedding vectors at this stage of the model, but for now, let's build what's known as a sparse linear probe. If we sample the values at a few more positions of our embedding vector, we see similar semi-periodic curves. Now it turns out that if we take a weighted sum of these eight curves, we end up with a curve that looks very close to a cosine curve with a frequency of 8 pi over 113.
The weighted sum is very relevant here, because taking weighted sums like this is a big part of what our attention and multi-layer perceptron blocks do, meaning that these compute blocks have access to a very clean cosine wave. The signal is just spread across a few different locations in our model. At this stage, we can compute a similar sparse linear probe for the sine wave at the same 8 pi over 113 frequency.
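A sparse linear probe like this can be fit with ordinary least squares. Since we don't have the trained embedding matrix in hand, the sketch below fabricates eight noisy, phase-shifted stand-ins for the sampled embedding coordinates and solves for the weights that best reconstruct the clean cosine:

```python
import numpy as np

p, k = 113, 4
x = np.arange(p)
target = np.cos(2 * np.pi * k * x / p)        # the clean cos(8*pi*x/113) wave

# Stand-ins for 8 sampled embedding coordinates: phase-shifted, noisy copies
rng = np.random.default_rng(0)
phases = rng.uniform(0, 2 * np.pi, size=8)
coords = np.stack([np.cos(2 * np.pi * k * x / p + ph) for ph in phases], axis=1)
coords += 0.1 * rng.standard_normal(coords.shape)   # extra high-frequency junk

w, *_ = np.linalg.lstsq(coords, target, rcond=None)  # the probe's weights
probe = coords @ w                                    # the weighted sum
print("max reconstruction error:", np.abs(probe - target).max())
```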
Now, our first embedding vector only depends on our first input x, and our second embedding vector only depends on our second input y. These inputs are combined in our attention block. Since the same embedding matrix is used to process our three inputs independently, we can use the same sparse linear probe on our second embedding vector.
And we'll see the same nice cosine and sine curves, but now as a function of y. So very early in the model, our model learns to compute the sines and cosines of our inputs. But why? What do these functions from trigonometry have to do with learning modular addition? The modular addition problem may seem a bit foreign or contrived, but we actually do it all the time.
A 2-hour meeting that starts at 11 a.m. will end at 11 + 2 modulo 12 equals 1 p.m. Analog clocks implement modular addition physically. Each hour that ticks by adds one to the hour hand, and the circular motion of the hands perfectly matches the modular arithmetic problem, starting over when reaching 12. Now, as we saw when probing the neurons in our multi-layer perceptron, our network learns to form circular patterns in its activations.
Could these circular structures be solving the modular arithmetic problem in the same way that an analog clock does? The sines and cosines we see computed by our model in its first layer could be part of this puzzle. If we put the output of our sparse cosine probe on the x-axis and the output of our sparse sine probe on the y-axis of a scatter plot, we get a nice circle when we sweep through our input values.
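The clock analogy can be made mathematically exact with complex numbers. A point on the circle at angle w times x is e to the i w x; multiplying two such points adds their angles, and because w is a whole-number multiple of 2 pi over 113, wrapping around the circle is exactly reduction modulo 113. A small sketch, using the 8 pi over 113 frequency from above (my own illustrative demo, not the model's actual computation):

```python
import numpy as np

p, k = 113, 4
w = 2 * np.pi * k / p               # 8*pi/113

def to_circle(n):
    return np.exp(1j * w * n)       # (cos(wn), sin(wn)) packed as one complex number

x, y = 40, 90
point = to_circle(x) * to_circle(y)       # multiplying points adds their angles
target = to_circle((x + y) % p)           # the clock position for (x + y) mod 113
print(np.allclose(point, target))         # True: 130 and 17 land on the same spot
```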
However, it's not enough to learn a circular structure for x and y independently. Our network has to figure out how to actually add x and y together. Adding x and y may seem trivial for our model to learn. After all, neural networks are literally built from a bunch of adds and multiplies. But remember that we aren't actually passing in, for example, the number two or a direct representation of it.
Instead, we're switching on the input to our model that we have labeled two. The network can't just use one of the additions in one of its neurons to add x and y together. What happens instead turns out to be way more interesting. It is straightforward for our attention layer to add together the various sines and cosines computed by our first layer.
Our attention layer could easily compute the cosine of x plus the cosine of y. However, that's still not what we need to solve the problem. We need to add together x and y themselves. In our clock analogy, we need to add the angles of the clock hands, not the sines and cosines of these angles. Let's return to the second layer of neurons in our multi-layer perceptron compute block.
Earlier, we explored how these neuron outputs changed as we varied a single input. Let's now explore how these outputs change as we vary both x and y, to see if we can figure out how our network is bringing these variables together. Again visualizing the output of a single neuron, if we keep y fixed at zero and sweep through all possible x values, we get a familiar wave shape.
Now let's add another axis to our visualization and plot our neuron's output as we vary y. Let's explore all combinations of values for x and y. With this many points, it's easier to visualize our neuron's outputs as the height of a surface, where the color of the surface corresponds to our neuron's output value.
Like many of the outputs we've seen so far, our surface is approximately wavelike. What combinations of sines and cosines best capture this wave structure that our network has learned? As we did earlier, we can take a Fourier transform, but this time with respect to both x and y. Extracting our top frequencies, we can decompose our surface into a few key components.
One component is the cosine of x, and another is the cosine of y. The top component is the strongest and the most interesting: it's equal to the cosine of x times the cosine of y. So the strongest frequency component of our surface is equal to the product of the cosine of x and cosine of y functions that we saw computed earlier in our network.
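The same Fourier trick works in two dimensions. In the sketch below, a synthetic cosine-of-x-times-cosine-of-y surface stands in for the neuron's real output surface, and the 2D transform recovers its frequency content with respect to both x and y:

```python
import numpy as np

p, k = 113, 4
x = np.arange(p)
X, Y = np.meshgrid(x, x, indexing="ij")
w = 2 * np.pi * k / p
surface = np.cos(w * X) * np.cos(w * Y)   # stand-in for the neuron's surface

spectrum = np.abs(np.fft.fft2(surface))
kx, ky = np.unravel_index(np.argmax(spectrum), spectrum.shape)
print(kx, ky)   # 4 4 (plus mirrored peaks at index 109): cos(wx) * cos(wy)
```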
Now, it turns out that it's more natural for our network to take a sum of sines and cosines than a product. I'll put a note about this in the description. So why are we finding a strong product like this in the middle of our network? And does this get us any closer to actually computing the sum of x and y? Remarkably, it does.
Let me show you one more thing. Let's go one layer of neurons deeper into our multi-layer perceptron and plot the outputs of a neuron in this layer as a function of x and y. We see similar wavelike shapes here, but the wave is less regular, and it moves diagonally across our surface. This orientation of the wave is really important.
Consider these top two crests, where the output of our neuron is maximized. Let's move to an overhead view and look at the combinations of our input values that fall on these wave crests. The first crest starts at x = 0 and y = 65. Moving along our crest, we find intermediate values at x = 20 and y = 45, x = 40 and y = 25, x = 60 and y = 5, and finally x = 65 and y = 0.
All of these pairs of inputs add to the same value of 65. So this neuron fires maximally when x + y equals 65. In its own specialized way, this neuron has learned to add, or more precisely, this neuron fires for any pair of inputs that add to 65. Our second wave crest starts at x = 66, y = 112. From there it moves through values like x = 91 and y = 87, and ends on x = 112 and y = 66.
Adding these pairs together, we get 178 in each case. Recall that our model is trained on modular addition with a modulus of 113, and 178 modulo 113 is 65. So this second crest also finds pairs of inputs that add to 65. But how, in just one layer of neurons, do we go from products like the cosine of x times the cosine of y to actually adding together x and y themselves?
Here's the output of another neuron in the second layer of our multi-layer perceptron. The strongest frequency component here is the sine of x times the sine of y. Now, each neuron in our following layer takes a weighted sum of the outputs of the neurons in our current layer. Let's consider how this weighted sum causes our surfaces to interact.
We saw earlier that our first neuron's output has a strongest frequency component of the cosine of x times the cosine of y, and our new second-layer neuron has a strongest frequency component of the sine of x times the sine of y. Let's assume for a moment that the weight assigned to our cosine-times-cosine neuron is 1 and the weight assigned to our sine-times-sine neuron is negative 1.
Visually, this negative weight flips our second surface vertically. Now, when we add these weighted surfaces together, the sines and cosines remarkably interfere in just the right way to create the diagonal symmetry that we saw in our neuron in the following layer, which allowed that neuron to fire on combinations of inputs that add to 65.
As you may remember from trigonometry class, the cosine of x times the cosine of y minus the sine of x times the sine of y is actually a trigonometric identity, specifically a sum-of-angles identity that exactly equals the cosine of x plus y. This identity allows us to convert products of sines and cosines into a function of the sum of x and y, which is exactly what our network needs to compute.
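We can verify this interference numerically. With the weights of 1 and negative 1 from above, the combined surface collapses to the cosine of x plus y, which is constant along diagonals where x + y is constant and, because the frequency is a multiple of 2 pi over 113, wraps around modulo 113:

```python
import numpy as np

p, k = 113, 4
w = 2 * np.pi * k / p
x = np.arange(p)
X, Y = np.meshgrid(x, x, indexing="ij")

combined = np.cos(w * X) * np.cos(w * Y) - np.sin(w * X) * np.sin(w * Y)
identity = np.cos(w * (X + Y))            # cos(x)cos(y) - sin(x)sin(y) = cos(x+y)
print(np.allclose(combined, identity))    # True

# The surface depends only on (x + y) mod 113, e.g. the crest points (0, 65)
# and (66, 112) from above take the same value, since 178 mod 113 = 65:
print(np.isclose(combined[0, 65], combined[66, 112]))  # True
```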
And remarkably, the network appears to have learned to effectively use this trigonometric identity to solve the modular addition problem. And remember that our training data is just these sparse patterns that have nothing to do with sines, cosines, or trigonometric identities. The final unembedding portion of our model takes one more weighted sum, this time of the outputs of the final layer of neurons in our multi-layer perceptron. Visualizing the outputs of a few more of these neurons, we see the same types of diagonal symmetries with various shifts and scales. Our unembedding layer takes different combinations of these outputs for each possible token that the network could return.
Here's the resulting surface for the 7 output. As we saw with our multi-layer perceptron neuron that detected all combinations of numbers that added to 65, this surface reaches a maximum for all the combinations of x and y that add to 7. Here's 7 + 0, here's 0 + 7, and here's 3 + 4. So, remarkably, to solve this modular arithmetic problem, our network learns to numerically estimate the sines and cosines of our inputs, computes the products of these functions, uses a clever trig identity to create the diagonal symmetry needed to solve the modular addition problem, and then brings multiple versions of these resulting patterns together to compute a final answer.
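A compact numerical sketch of this final readout: the logit for each candidate answer c behaves like a sum of cosine waves in (x + y minus c) across the model's key frequencies, so every term peaks simultaneously exactly when c equals (x + y) mod 113. The particular set of frequencies below is my illustrative choice (the 8 pi over 113 and 6 pi over 113 waves correspond to k = 4 and k = 3; the trained model picks its own handful):

```python
import numpy as np

p = 113
freqs = [4, 3, 5]   # illustrative key frequencies; k = 4 is the 8*pi/113 wave

def logits(x, y):
    c = np.arange(p)
    # Each term peaks (equals 1) only when x + y - c is a multiple of 113
    return sum(np.cos(2 * np.pi * k * (x + y - c) / p) for k in freqs)

x, y = 3, 4
print(np.argmax(logits(x, y)), (x + y) % p)      # 7 7
x, y = 60, 90
print(np.argmax(logits(x, y)), (x + y) % p)      # 37 37
```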
Now, can this detailed understanding of how the model solves modular addition help us understand why it groks? Let's watch the training process again, but this time while visualizing the evolution of the various structures learned by our model.
After a few hundred steps, our model perfectly fits the training data, but we don't yet see any hints of sines or cosines. As our model continues to learn, its performance stays flat, giving the appearance that nothing is happening. However, as we can now clearly see under the hood, the model is starting to piece together the relevant structures needed to solve the modular arithmetic problem.
This is such a wild phenomenon. It's very common to visualize training and test performance as a model learns, and when both metrics are flat for this long, the typical assumption is that the model is done learning and has settled into a stable solution. Neel Nanda and his co-authors propose a clever new metric in their paper called excluded loss.
Note that thus far we've been plotting the model's accuracy as it learns, and here we'll switch to plotting the model's cross-entropy loss, so lower values are better. See my gradient descent video or chapter 2 of my new AI book for more on cross-entropy loss. Now that we know that our model is operating in the frequency domain at a few key frequencies, what happens when we remove the information at these frequencies from the model's final output before measuring performance? Removing the 8 pi over 113 frequency that we found and plotting this excluded loss as the model learns, we see our new metric dip down quickly with training loss, but then slowly climb as our model builds the sine and cosine representations. This excluded loss increases because we've taken away the model's ability to use this key frequency. And importantly, during this long period of flat training and testing performance, our excluded loss slowly climbs, showing that our model is making more and more use of patterns at this frequency.
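Here is one way the excluded-loss idea could be realized in code, assuming we have the model's logits for every (x, y) pair stacked into one array. The variable names and the exact projection are mine; the paper's construction may differ in detail. We project out the cosine and sine directions at the key frequency over the dataset, then measure cross-entropy on what remains:

```python
import numpy as np

p, k = 113, 4
w = 2 * np.pi * k / p
x = np.arange(p)
X, Y = np.meshgrid(x, x, indexing="ij")

# Directions over the dataset corresponding to the key frequency in (x + y)
u = np.cos(w * (X + Y)).ravel(); u /= np.linalg.norm(u)
v = np.sin(w * (X + Y)).ravel(); v /= np.linalg.norm(v)

def excluded_loss(logits):
    """logits: array of shape (113, 113, 113) indexed [x, y, answer]."""
    flat = logits.reshape(p * p, p)
    for d in (u, v):                       # remove the frequency-k component
        flat = flat - np.outer(d, d @ flat)
    flat = flat - flat.max(axis=1, keepdims=True)   # stable log-softmax
    logprobs = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    answers = ((X + Y) % p).ravel()
    return -logprobs[np.arange(p * p), answers].mean()   # cross-entropy

# Demo: synthetic logits built purely from the key frequency. Projecting that
# frequency out wipes out all structure, so the loss rises to about log(113).
demo = np.cos(w * (X[..., None] + Y[..., None] - np.arange(p)))
print(excluded_loss(demo))
```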
Interestingly, Nanda and his collaborators show that grokking occurs not necessarily when the sine and cosine structures are completed, but just after, during a phase they call the cleanup phase, where the model actually removes the memorized examples that it relied on early in training. These dynamics are fascinating and explain very nicely why this model groks on this problem.
It's so satisfying to me that we can take apart this model, understand the actual mechanisms that it learns, and then use these mechanisms to design a new metric that clearly shows the model's slow progression from memorization to learning and that nicely explains the surprising grokking behavior. This level of clarity is a beautiful and rare exception in modern AI, a transparent box in a world of black boxes.
The approach Nanda and his collaborators use to perform this analysis is generally known as mechanistic interpretability. Since Nanda's paper came out in early 2023, we've seen some really interesting progress in this field, but we are still very far away from this level of understanding of full large language models.
There's some recent work from a research team at Anthropic that gives a nice feel for the current edge of our understanding using this type of bottom-up mechanistic interpretability approach. The team studies how a full-sized model, Claude 3.5 Haiku, figures out when to create line breaks when writing. The team finds that the Haiku model represents the number of characters that it's written on a given line on a manifold in six-dimensional space.
This structure is somewhat analogous to the loops that we saw in the multi-layer perceptron of our model. To figure out when to insert a line break, Haiku needs to know both how many characters it's written on the current line and how long the lines of the text it's currently writing need to be.
Using linear probes similar to the ones we used here to find the sines and cosines early in our model, the Anthropic team mapped character count and line length onto this six-dimensional manifold and found that Haiku represents these concepts in this space in a very similar way. The 70 character-count probe lines up right next to the line-length-of-70 probe, and so on.
Now, this gets really wild when these representations are passed into Haiku's attention blocks. We see what the team calls a QK twist, where these helix-like geometries are rotated relative to each other in this six-dimensional space. After rotation, the probe for a character count of 70 is now closest to a line width of 75.
And we see a similar offset of four to five characters across the length of our curve. The proximity of these points in the model's attention heads leads to a high dot product when the model is about five characters away from the end of a line. The team goes on to show that there are multiple attention heads that specialize in detecting various distances from the end of the current line of text.
And this mechanism allows Haiku to precisely estimate how much more room it has before the end of the line. Now, compared to Claude Haiku's full range of capabilities, deciding when to create a new line is very simple. However, it is exciting to see that the Anthropic team found such a clean mechanism that controls this behavior in a full-size model.
The story of grokking is such a nice arc of scientific discovery and progress. We accidentally discovered a new phenomenon, and the search for an explanation genuinely helped push forward our understanding of model training dynamics and the inner workings of transformers. The names we give our discoveries matter, and I like the name grokking.
It feels alien and originates from the complex Martian language in Heinlein's novel. The AI researcher Andrej Karpathy recently commented that training large language models is less like building animal intelligence and more like summoning ghosts. You can think of a ghost as a fundamentally different kind of point in the space of possible intelligences.
The literal meaning of grok, to understand something profoundly and deeply, is a nice fit for what the model appears to be doing. But what I really appreciate here is the connotation of this thing being alien. I think it's a really nice counterpoint to overly personifying models. We communicate with these models in human language.
But as we've seen, this is a thin veneer. If we go one layer deeper into what these models actually process and return, we find these absurdly complex patterns. As we build more intelligent models and learn more about how they work, it will be fascinating to see whether these artificial intelligences feel more alien, ghostly, or human. I am tired.
So, this has been my first full year working fully on Welch Labs. We made some progress: we did nine videos this year and we did one book. And man, getting that done filled every available second of time that I had. For now, on the business side, I'm trying to keep things simple, really just focusing on making sure that the business and the channel work well enough to support my family and me.
I left my full-time job last year. My goal is to earn as much from Welch Labs as I did from my engineering job. I was hoping to replace my whole income this year; it's probably going to be more like 75%. The book helped a lot, but there are always challenges. The business side is hard. I've tried to do this full-time before, once in 2018.
I just didn't have enough runway and enough focus on the business. So I think we're doing it right this time, but gosh, it takes time, and man, it takes a lot of work. I hope you enjoyed what we've done this year. There's a lot more of it coming next year; I'm working on the focus and direction for next year right now.
But I'm really happy with the book, and I hope you're able to get a copy. I know we're not shipping internationally yet. That will be a focus early next year, I promise. But yeah, what a year. Thank you so much for your support. If you are able to support on Patreon, that helps a ton.
Or just like and share the videos. Thanks for a great year. I'll see you next year.

