Explanations
phrases related to societal expectations and criticisms of societal norms
Negative Logits
eldorf        -0.15
avra          -0.15
andi          -0.14
undan         -0.14
olla          -0.14
likelihood    -0.13
znik          -0.13
_EXISTS       -0.13
CRET          -0.13
idak          -0.13
Positive Logits
supposed      0.99
suppose       0.75
meant         0.58
supposedly    0.58
purported     0.52
alleged       0.47
allegedly     0.45
intended      0.43
Suppose       0.39
SUP           0.37
Activations Density 0.329%