INDEX
Explanations
phrases related to controversial political events or statements
Neuron Alignment
Index   Value   % of L₁
781     +0.09   0.3%
1823    +0.08   0.2%
883     +0.07   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1259    +0.09      0.04
1001    +0.08      0.04
182     +0.07      0.04
Negative Logits
Keny      -0.97
Sted      -0.95
kram      -0.88
panik     -0.88
Intere    -0.85
abnorm    -0.83
Nö        -0.81
Juf       -0.79
Teks      -0.79
Miscell   -0.79
Positive Logits
😭😭        0.60
velkommen   0.60
vrea        0.57
as          0.56
drept       0.56
vedea       0.56
виправи     0.55
dă          0.54
Ці          0.53
gezet       0.51
Activations Density: 0.484%