Explanations
phrases expressing strong opinions or beliefs
Negative Logits
Token         Logit
oled          -0.77
blem          -0.76
omsky         -0.72
oling         -0.71
=~=~          -0.69
ositories     -0.68
ernels        -0.67
oÄŁan         -0.67
lia           -0.63
Newsletter    -0.62
Positive Logits
Token         Logit
goodbye       1.14
bye           0.98
aloud         0.84
lihood        0.75
amen          0.69
publicly      0.69
hello         0.67
loudly        0.62
YN            0.62
sorry         0.61
Activations Density 0.065%