Explanations
phrases expressing permission or approval
expressions of approval or acceptance
Negative Logits
hani       -0.80
marine     -0.79
ulhu       -0.75
raped      -0.70
riber      -0.65
ocrat      -0.64
ulator     -0.62
enegger    -0.61
leneck     -0.61
hunt       -0.60
Positive Logits
AY         0.91
lahoma     0.83
okay       0.69
HOME       0.68
alright    0.66
bye        0.65
terday     0.65
ok         0.65
margin     0.64
margins    0.64
Activations Density 0.018%