Explanations
phrases related to expectations and societal norms
Negative Logits
Token       Logit
avra        -0.16
ồi          -0.16
придется    -0.15
kil         -0.14
znik        -0.14
otal        -0.14
aal         -0.14
orial       -0.14
tility      -0.13
_EXISTS     -0.13
Positive Logits
Token       Logit
supposed    1.05
suppose     0.81
meant       0.63
supposedly  0.51
purported   0.44
Suppose     0.43
alleged     0.43
SUP         0.43
intended    0.41
allegedly   0.40
Activations Density 0.259%
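
The density figure can be read as the share of token positions on which the feature fires. The sketch below illustrates that reading only; it assumes "active" means an activation strictly greater than zero, and the function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical rather than taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a position counts as active when its value is strictly
    greater than `threshold` (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations for one feature over eight token positions.
sample = [0.0, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 0.4]
print(f"{activation_density(sample):.3%}")
```

Under this reading, a density of 0.259% would mean the feature is active on roughly 26 of every 10,000 token positions.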