Explanations
Phrases related to making assumptions or holding beliefs about others' intentions
Negative Logits
erman     -0.18
esian     -0.18
essler    -0.17
rome      -0.16
etting    -0.16
udas      -0.15
ة         -0.15
osl       -0.15
Assert    -0.15
ness      -0.15
Positive Logits
ably      0.23
/assert   0.22
ptions    0.19
Worst     0.17
nal       0.17
PTION     0.16
ptive     0.16
267       0.16
ively     0.16
worst     0.16
Activations Density 0.026%
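
To make the density figure concrete, here is a minimal sketch. It assumes that "activations density" means the fraction of sampled tokens on which the feature's activation is above zero; the page itself does not define the term. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density = (number of active tokens) / (total tokens sampled).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 26 active tokens out of 100,000.
sample = [0.0] * 99_974 + [1.5] * 26
density = activation_density(sample)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.026% corresponds to the feature firing on roughly 26 of every 100,000 tokens.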