INDEX
Explanations
sentences related to political or controversial statements
Neuron Alignment
Index    Value    % of L₁
1445     +0.10    0.3%
605      +0.10    0.3%
1526     +0.09    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1154     +0.10       0.02
1526     +0.10       0.02
2000     +0.09       0.02
Negative Logits
embra      -1.14
immen      -1.09
oner       -1.03
inder      -1.02
incess     -1.01
effe       -0.98
dises      -0.98
„,         -0.98
interse    -0.98
abnorm     -0.97
Positive Logits
no         0.82
NO         0.76
No         0.71
no         0.69
NO         0.67
Nein       0.67
nor        0.67
not        0.66
No         0.65
neither    0.63
Activations Density 0.067%