INDEX
Explanations
expressions of frustration or disbelief
phrases that express confusion or outrage
Negative Logits
ierrez      -0.74
onial       -0.64
orea        -0.61
ridor       -0.61
izont       -0.60
itures      -0.60
ateral      -0.60
utor        -0.59
selves      -0.58
ettlement   -0.57
Positive Logits
hell        1.59
heck        1.48
fuck        1.44
HELL        1.33
Fuck        1.13
FUCK        1.10
Hell        1.05
heavens     1.04
fuck        0.99
gods        0.98
Activations Density: 0.097%