INDEX
Explanations
phrases related to political or controversial topics
Neuron Alignment
Index    Value    % of L₁
1589     +0.08    0.2%
62       +0.08    0.2%
1379     +0.07    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
62       +0.08       0.04
1379     +0.08       0.03
382      +0.07       0.04
Negative Logits
awtextra       -0.64
MessageOf      -0.61
bonté          -0.59
Komple         -0.59
dignité        -0.58
notori         -0.57
Preparación    -0.57
intit          -0.57
notor          -0.57
Normdatei      -0.57
Positive Logits
still           1.01
STILL           1.00
nevertheless    0.92
still           0.92
nonetheless     0.87
stills          0.82
tolerably       0.82
ftill           0.81
intersper       0.81
Still           0.80
Activations Density 0.293%