INDEX
Explanations
words related to hostile or aggressive behavior directed at someone
Neuron Alignment
Index   Value   % of L₁
32      +0.12   0.4%
1974    +0.11   0.4%
752     +0.11   0.4%
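The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it gives each neuron's absolute weight as a share of the L₁ norm (the sum of absolute values) of the whole weight vector. A minimal sketch under that assumption follows; the function name `l1_share` and the toy vector are hypothetical and are not taken from the dashboard.

```python
def l1_share(weights):
    """Return each weight's absolute value as a percentage of the L1 norm.

    Assumption: "% of L1" means 100 * |w_i| / sum_j |w_j|.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        # An all-zero vector has no meaningful shares; report zeros.
        return [0.0 for _ in weights]
    return [100.0 * abs(w) / l1 for w in weights]


# Hypothetical toy vector for illustration only (not dashboard data).
toy = [0.5, -0.25, 0.25]
shares = l1_share(toy)
print(shares)
```

Under this reading, a value of about 0.12 accounting for 0.4% would imply an L₁ norm of roughly 30 for the full vector, which would mean the weight is spread thinly across many neurons.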
Correlated Neurons
Index   P. Corr.   Cos Sim.
2016    +0.12      0.04
1974    +0.11      0.03
1053    +0.11      0.03
Negative Logits
boop      -0.53
shenan    -0.52
becau     -0.50
fucker    -0.49
kaos      -0.49
disagre   -0.49
kani      -0.48
cuck      -0.48
pooh      -0.47
excru     -0.47
Positive Logits
at           0.59
AT           0.56
At           0.56
At           0.55
NKC          0.53
dirait       0.53
väh          0.53
at           0.50
UNICIP       0.49
дописавши    0.49
Activations Density 0.127%