Explanations
descriptions portraying something as impressive or appealing
Negative Logits
mediate      -0.77
pora         -0.73
ugal         -0.72
unfit        -0.68
avored       -0.67
Inquiry      -0.67
ãĥĺãĥ©       -0.65
condemned    -0.65
condemns     -0.65
URRENT       -0.64
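The entry `ãĥĺãĥ©` in the list above looks like a token printed in its raw byte-level BPE form rather than as readable text. The following is a minimal sketch of how such a string could be decoded, assuming the vocabulary uses the GPT-2 byte-to-unicode mapping (the source does not say which tokenizer produced these tokens). The helper names `bytes_to_unicode` and `decode_token` are illustrative only and not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map each displayed character back to its byte, then decode as UTF-8."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A single token may hold a partial UTF-8 sequence, so do not raise on errors.
    return raw.decode("utf-8", errors="replace")


# The token from the list, written with escapes to avoid copy/paste normalization issues.
print(decode_token("\u00e3\u0125\u013a\u00e3\u0125\u00a9"))  # ヘラ
```

Under that assumption the entry corresponds to the katakana string `ヘラ`. The other entries in the list are plain ASCII and decode to themselves.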
Positive Logits
stuff        1.00
ness         0.94
nels         0.94
GIF          0.90
gadgets      0.90
dude         0.89
guy          0.88
ery          0.86
kid          0.85
sounding     0.85
Activations Density 0.157%