AI's Secret Brain: Memory vs. Logic 🤯📈

Researchers at Goodfire.ai have uncovered a fundamental distinction within AI language models: a clear separation between how they handle memorization and how they perform logical reasoning. Their investigation suggests that these functions run through largely separate neural pathways rather than a single shared circuit. The team's research, detailed in a recent preprint, showed a dramatic effect when they targeted the model's memorization pathways: removing them wiped out roughly 97 percent of the model's ability to repeat verbatim text it had previously encountered, yet its logical reasoning abilities remained largely intact.

Within the Allen Institute for AI's OLMo-7B model, for example, the weight components most associated with memorized data showed noticeably higher activation on that data, while a separate set of components responded most strongly to general, non-memorized text. That separation is what allowed the researchers to surgically remove the memorization functions without impairing the model's core reasoning capabilities.

The team's methodology hinged on a concept called the "loss landscape," a representation of how accurate a model's predictions are as its internal weights are adjusted. During training, the model "rolls downhill" in this landscape, seeking settings that minimize its errors, and how sharply the landscape curves in a given direction turns out to be the key signal (a toy sketch of this idea appears at the end of this section).

Even more intriguing, the researchers found a connection between memorization and arithmetic. When they eliminated the memory pathways, the model's mathematical performance plummeted to just 66 percent, while its ability to follow logical rules stayed remarkably strong. This suggests that current AI language models aren't truly solving mathematical problems, but are instead recalling answers from a limited, memorized table, much like a student leaning on memorized times tables rather than understanding multiplication.

The team tested the idea with K-FAC (Kronecker-Factored Approximate Curvature), using its curvature estimates to identify and remove the memorization-linked components from several AI systems, including the Allen Institute's OLMo-2 models and custom Vision Transformers trained on ImageNet with deliberately mislabeled examples; a simplified sketch of this kind of curvature-guided edit also follows below. Removing those components dramatically reduced the models' ability to recall specific facts – recall dropped to just 3.4 percent – while logical reasoning tasks, including Boolean expression evaluation and common-sense reasoning benchmarks, continued to perform at nearly full capacity.

The researchers found that individual memorized facts often create sharp, pronounced peaks in the model's internal landscape, indicating rigid memorization of specific details. However, because each fact's peak points in its own idiosyncratic direction, averaging across many examples flattens them into a much smoother profile. Reasoning abilities, by contrast, maintained consistent, rolling-hill-like curves across the landscape – pathways that stayed relatively stable regardless of the direction from which they were approached. In short, reasoning relies on shared mechanisms that consistently maintain high curvature, while memorization relies on idiosyncratic, sharply focused directions tied to specific examples, which smooth out when averaged across the data.

Despite these findings, the researchers cautioned that because neural networks store information in a distributed way, complete elimination of sensitive data isn't yet guaranteed, and the curvature-based tools they used to map the model's internal landscape can become unreliable when pushed to their limits.
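To make the loss-landscape idea concrete, here is a toy numerical sketch. It is not the team's code: the two-parameter loss function and the directions are made up purely to show how curvature along one direction in parameter space can be sharp while another stays nearly flat, which is the property the researchers used to tell shared reasoning machinery apart from memorization.

```python
# Toy illustration of a loss landscape and directional curvature.
# Hypothetical example: the loss function and directions are invented.
import numpy as np

def loss(w):
    # Pretend w[0] is a "shared reasoning" direction (consistently curved)
    # and w[1] is a "memorization-like" direction (nearly flat on average).
    return 5.0 * w[0] ** 2 + 0.01 * w[1] ** 2

def directional_curvature(f, w, direction, eps=1e-3):
    """Finite-difference estimate of the second derivative of f along a direction."""
    d = direction / np.linalg.norm(direction)
    return (f(w + eps * d) - 2.0 * f(w) + f(w - eps * d)) / eps ** 2

w_star = np.zeros(2)  # the bottom of the valley the model "rolls down" to
print(directional_curvature(loss, w_star, np.array([1.0, 0.0])))  # sharp: ~10.0
print(directional_curvature(loss, w_star, np.array([0.0, 1.0])))  # flat:  ~0.02
```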
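The editing step can be sketched in the same spirit. The snippet below is an illustrative stand-in for the K-FAC procedure, not a reproduction of it: it uses a diagonal Fisher-style proxy (squared gradients averaged over data) on a tiny randomly initialized layer instead of Kronecker-factored curvature, and it zeroes individual weights rather than weight-space directions. The model, data, and threshold are all invented for illustration; the point is the pattern of ranking components by averaged curvature and stripping out the flat, memorization-linked ones while keeping the consistently curved ones.

```python
# Minimal sketch of curvature-guided editing, loosely in the spirit of the
# procedure described above. NOT the team's method: diagonal proxy, toy model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)               # toy stand-in for one weight matrix
data = torch.randn(256, 16)            # toy "training-like" inputs
targets = torch.randint(0, 4, (256,))  # toy labels
loss_fn = nn.CrossEntropyLoss()

# Accumulate squared gradients over the data as a cheap curvature proxy.
fisher = torch.zeros_like(model.weight)
for x, y in zip(data.split(32), targets.split(32)):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    fisher += model.weight.grad ** 2
fisher /= len(data) // 32

# Keep weights whose averaged curvature proxy is high (shared structure, in
# the analogy); zero the low-curvature rest (memorization-like, in the analogy).
threshold = fisher.flatten().quantile(0.5)
with torch.no_grad():
    model.weight *= (fisher >= threshold).float()
```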
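Finally, the headline numbers (97 percent of verbatim recitation lost, recall down to 3.4 percent) rest on checking whether a model can reproduce text it saw during training. A minimal, hypothetical version of such a check is sketched below: it uses GPT-2 from Hugging Face purely because it is small and public, and a placeholder passage instead of actual training-set excerpts, so the number it prints means nothing beyond demonstrating the procedure. Run before and after an edit like the one above, a check of this kind is how a drop in recall would be measured.

```python
# Sketch of a verbatim-recall check: prompt with a prefix of a passage and
# test whether greedy decoding reproduces the continuation token for token.
# Illustrative only; not the paper's evaluation harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

passages = [
    "We hold these truths to be self-evident, that all men are created equal,",
]  # placeholder excerpts; a real study would use known training-set text

def recites(passage, prompt_tokens=8, check_tokens=16):
    ids = tok(passage, return_tensors="pt").input_ids[0]
    prefix = ids[:prompt_tokens]
    truth = ids[prompt_tokens:prompt_tokens + check_tokens]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=len(truth), do_sample=False)
    return torch.equal(out[0, prompt_tokens:], truth)

recall_rate = sum(recites(p) for p in passages) / len(passages)
print(f"verbatim recall: {recall_rate:.1%}")
```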
Interestingly, other research has indicated that current unlearning techniques tend merely to suppress information rather than truly erase it from the model's underlying weights; a few targeted training steps can bring the suppressed material back. Finally, the team was not entirely sure why certain skills, such as arithmetic, proved so vulnerable when memorization was taken away.