INDEX
Explanations
assertions that challenge the validity of claims or narratives
Negative Logits
unpredict     -0.16
/fw           -0.15
acht          -0.15
achts         -0.15
ambigu        -0.15
cyn           -0.15
Unexpected    -0.14
åįł           -0.14
odom          -0.14
تÙĦ           -0.14
Positive Logits
fall          0.33
bunk          0.27
fiction       0.27
base          0.27
pat           0.25
hog           0.24
false         0.24
fabrication   0.24
fig           0.23
Fall          0.23
Activations Density: 0.204%