INDEX
Explanations
words related to notable individuals and specific events, potentially from news articles or online forums
Neuron Alignment
Index   Value   % of L₁
50      +0.25   0.9%
1978    +0.16   0.6%
1577    +0.13   0.5%
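The "% of L₁" column can be read as each neuron's share of the vector's L₁ norm (the sum of absolute values). The source does not state the formula, so the following is only a minimal sketch under that assumption; the function name l1_shares and the toy vector are hypothetical and are not taken from this feature.

```python
def l1_shares(weights):
    """Return |w| / sum(|w|) for every entry of `weights`.

    Assumption: "% of L1" means an entry's absolute value divided by
    the L1 norm of the whole vector. The dashboard does not confirm this.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [abs(w) / l1_norm for w in weights]


# Toy vector for illustration only (not the real feature vector).
toy_weights = [0.5, -0.25, 0.25]
shares = l1_shares(toy_weights)
for index, (weight, share) in enumerate(zip(toy_weights, shares)):
    print(f"{index}\t{weight:+.2f}\t{share:.1%}")
```

Under this reading, a value of +0.25 at 0.9% would imply an L₁ norm of roughly 28 for the full vector, which is in line with the other two rows once rounding is taken into account.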
Correlated Neurons
Index   P. Corr.   Cos Sim.
50      +0.25      0.18
1919    +0.16      0.14
227     +0.13      0.18
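"P. Corr." and "Cos Sim." presumably stand for Pearson correlation and cosine similarity. The source does not say which vectors are being compared, so the sketch below only illustrates how the two measures differ on toy data; the function names and inputs are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (no mean-centering)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])


# Toy data: y decreases as x increases, but both vectors are all-positive.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
print(pearson_correlation(x, y))  # negative: the trends are opposite
print(cosine_similarity(x, y))    # positive: both vectors point the same general way
```

Because Pearson correlation subtracts the means first, the two columns can rank neurons differently, as in the table above, where index 1919 has a higher correlation than 227 but a lower cosine similarity.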
Negative Logits
<bos>             -0.99
***!              -0.74
__;               -0.73
RectangleBorder   -0.72
HtmlAttribute     -0.70
.                 -0.69
>=",              -0.69
<",               -0.67
;#                -0.65
;                 -0.64
Positive Logits
impra      2.16
increa     2.12
disagre    2.07
maneu      2.03
affor      2.00
emphat     1.98
reluct     1.94
unspeak    1.93
unden      1.93
fuf        1.93
Activations Density: 4.557%