INDEX
Explanations
multilingual actions
The neuron fires on self-referential AI identity phrases (e.g. “As a large language model,” “AI,” “model,” etc.).
New Auto-Interp
Negative Logits
computation
0.41
čných
0.41
bins
0.40
খানে
0.40
কম
0.39
ujte
0.38
UL
0.38
差
0.38
TR
0.37
такое
0.37
POSITIVE LOGITS
انجام
0.49
performed
0.48
ನೀವು
0.45
influencia
0.43
্যাগ
0.43
influencing
0.42
handled
0.41
influences
0.41
influência
0.40
извър
0.40
Activations Density 0.034%