INDEX
Explanations
phrases expressing strong emotions or opinions
familiar conversational phrases or expressions
Negative Logits
respectively    -0.82
..."            -0.75
thereto         -0.70
.","            -0.69
�               -0.68
"],"            -0.67
incub           -0.65
``(             -0.63
\"              -0.61
predomin        -0.61
Positive Logits
resa            1.33
odore           1.26
xiety           1.13
swers           0.98
notations       0.93
laughter        0.89
romeda          0.88
bye             0.87
chieve          0.84
nir             0.79
Activations Density: 0.629%