INDEX
Explanations
expressions of contradiction or contrast
Negative Logits
rk                  -0.16
arella              -0.15
erk                 -0.15
ļĮ                  -0.14
Ìĥ                  -0.14
eteria              -0.14
eres                -0.14
äng                 -0.14
ello                -0.14
????????????????    -0.13
Positive Logits
it         0.21
they       0.20
there      0.18
fact       0.18
Helm       0.17
wards      0.17
he         0.16
all        0.15
Fact       0.15
knowing    0.15
Activations Density 0.040%