INDEX
Explanations
phrases expressing strong personal preferences or identities
Negative Logits
menacing       -0.80
unrecogn       -0.72
majesty        -0.72
overshadow     -0.71
incrim         -0.70
unheard        -0.70
assassinate    -0.70
menace         -0.69
virtues        -0.68
believable     -0.66
Positive Logits
math        0.78
consumer    0.74
OCD         0.73
betting     0.69
intend      0.68
sucker      0.68
fan         0.66
avid        0.66
chool       0.65
price       0.65
Activations Density: 0.488%