INDEX
Explanations
references to nationalities or ethnic identities
Neuron Alignment
Index   Value   % of L₁
50      +0.16   0.6%
1252    +0.10   0.4%
1177    +0.10   0.4%
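The page does not define these columns. One plausible reading, assumed here and not confirmed by the dashboard, is that Value is the feature's decoder weight on that coordinate ("neuron") and % of L₁ is that weight's magnitude as a share of the decoder vector's L1 norm. The sketch below follows that assumption; `neuron_alignment` is a hypothetical helper, and ranking by magnitude is also an assumption.

```python
import numpy as np


def neuron_alignment(decoder_vec, k=3):
    """Hypothetical helper: return (index, value, percent_of_l1) for the k
    coordinates of `decoder_vec` with the largest magnitude.

    Assumes "% of L1" means |value| / sum(|decoder_vec|), expressed in percent.
    """
    v = np.asarray(decoder_vec, dtype=float)
    l1 = np.abs(v).sum()  # L1 norm of the whole decoder vector
    top = np.argsort(-np.abs(v))[:k]  # indices sorted by descending magnitude
    return [(int(i), float(v[i]), float(100.0 * abs(v[i]) / l1)) for i in top]
```

For example, `neuron_alignment([0.5, -0.375, 0.125], k=2)` returns `[(0, 0.5, 50.0), (1, -0.375, 37.5)]`.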
Correlated Neurons
Index   Pearson Corr.   Cos Sim.
1252    +0.16           0.07
227     +0.10           0.07
143     +0.10           0.06
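The page does not say which quantities the correlation and cosine-similarity columns compare. Assuming the correlation is taken between the feature's activations and a neuron's activations over the same tokens, and the cosine similarity between two weight vectors, both reduce to the standard formulas sketched below (the function names are illustrative, not the dashboard's).

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two equal-length samples, e.g. per-token
    activations of a feature and of a neuron (an assumed interpretation)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()  # center each sample
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. two weight directions."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Both functions return NaN for a constant sample or a zero vector, because the denominator is zero.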
Negative Logits
Token           Logit
<bos>           -2.14
EconPapers      -0.87
▼               -0.80
makeText        -0.80
AsUp            -0.79
脚注の使い方    -0.79   (Japanese for "how to use footnotes")
Paglinawan      -0.78
SEDS            -0.77
HasAnnotation   -0.77
ynb             -0.76
Positive Logits
Token           Logit
unspeak         2.06
Juf             1.92
McLaugh         1.92
hentai          1.86
inconce         1.83
reluct          1.81
depic           1.77
perfet          1.77
indescri        1.77
increa          1.76
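Logit lists like these are often produced by projecting the feature's decoder direction through the model's unembedding matrix and reading off the lowest and highest entries. The sketch below assumes exactly that. `decoder_vec`, `W_U`, and `vocab` are hypothetical inputs, and the dashboard's actual procedure (for example, any normalization applied first) is not stated on this page.

```python
import numpy as np


def top_logit_effects(decoder_vec, W_U, vocab, k=10):
    """Hypothetical helper: project a residual-stream direction through an
    unembedding matrix W_U of shape (d_model, vocab_size) and return the k
    most positive and k most negative (token, logit) pairs."""
    logits = np.asarray(decoder_vec, dtype=float) @ np.asarray(W_U, dtype=float)
    order = np.argsort(logits)  # ascending: most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return positive, negative
```

With `k=10`, the two returned lists would correspond to the Positive and Negative Logits lists above under this assumption.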
Activation Density: 0.309%
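Activation density is usually reported as the fraction of tokens on which a feature is active. The minimal sketch below assumes that "active" means an activation strictly above a threshold of zero (`threshold` is a hypothetical parameter) and that `acts` holds the feature's activation on every token of a sample corpus.

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    acts = np.asarray(acts, dtype=float)
    return float((acts > threshold).mean())  # mean of a boolean mask = fraction True
```

Under this definition, a density of 0.309% means the feature fires on roughly 3 of every 1,000 tokens.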