INDEX
Explanations
phrases that express skepticism or critique about common beliefs and notions
Negative Logits
154             -0.15
iards           -0.15
109             -0.15
care            -0.14
320             -0.14
(blank)         -0.14
ARRANT          -0.14
fc              -0.14
909             -0.14
iot             -0.14
Positive Logits
edy             0.18
udy             0.17
ewis            0.16
olini           0.16
erras           0.15
.experimental   0.15
agini           0.14
enthal          0.14
VF              0.14
obvious         0.14
Activations Density 0.122%