Explanations
words related to expectations or norms
phrases related to expectations or societal norms
Negative Logits
Blaz         -0.56
Ey           -0.55
Kis          -0.54
Flavoring    -0.54
redo         -0.53
Bulgar       -0.51
stru         -0.51
owitz        -0.51
Bohem        -0.49
Quote        -0.49
Positive Logits
to           1.06
to           0.95
TO           0.77
toc          0.71
ered         0.69
entious      0.69
Disclaimer   0.67
"$:/         0.67
ta           0.67
ALLY         0.67
Activations Density: 0.026%