INDEX
Explanations
phrases indicating a strong emotional reaction or emphasis
expressions of comparison or similarity
Negative Logits
alogue     -0.75
hiba       -0.73
arers      -0.72
atform     -0.70
tein       -0.70
Versions   -0.69
FL         -0.68
Lower      -0.67
ircraft    -0.67
verning    -0.66
Positive Logits
lihood     0.94
lier       0.80
liest      0.77
crazy      0.77
wow        0.70
liness     0.69
ably       0.68
goddamn    0.65
parity     0.65
crap       0.64
Activations Density: 0.069%