INDEX
Explanations
instances of claims or statements related to societal issues
Negative Logits
ÌĨ  -0.16
elif  -0.15
(strtolower  -0.14
espec  -0.14
ラック  -0.14
لیس  -0.13
ADM  -0.13
uden  -0.13
程度  -0.13
eps  -0.12
Positive Logits
means  0.81
Means  0.72
means  0.68
Means  0.65
meaning  0.63
meaning  0.59
mean  0.55
意味  0.53
Mean  0.52
Meaning  0.52
Activations Density: 0.285%