INDEX
Explanations
Neuron 4 does not appear to track a specific pattern in the text provided: it activates on many different characters, seemingly at random.
characters from various alphabets and symbols
Negative Logits
anwhile  -0.99
msec     -0.82
theless  -0.81
ftime    -0.78
agre     -0.75
nyder    -0.73
abase    -0.71
dope     -0.70
espie    -0.70
cocaine  -0.69
Positive Logits
¥  1.69
ı  1.58
Į  1.52
İ  1.51
Ł  1.51
»  1.50
Ī  1.49
²  1.49
´  1.48
º  1.47
Activations Density 0.017%
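The density figure can be read as the share of sampled token positions on which the neuron fires. The sketch below is a minimal illustration under that assumption. The page does not say how the value was computed, so the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the fraction of token positions where the
    neuron's activation exceeds the threshold, expressed as a percentage.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Hypothetical sample: 100,000 token positions, 17 of which activate.
sample = [0.0] * 99_983 + [1.2] * 17
print(f"{activation_density(sample):.3f}%")  # prints 0.017%
```

A density this low would mean the neuron is silent on nearly every token, which fits the explanations above: the few activations land on a scattered set of symbols and accented characters.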