INDEX
Explanations
phrases indicating likelihood or probability
phrases that express similarity or comparison
Negative Logits
Token        Logit
arse         -0.86
utical       -0.82
alt          -0.82
ocaust       -0.77
helicop      -0.75
ographies    -0.74
bard         -0.73
isexual      -0.73
itles        -0.72
ategory      -0.72
Positive Logits
Token        Logit
lier         0.89
lihood       0.86
premature    0.73
somebody     0.69
everybody    0.69
liest        0.68
everyone     0.68
they         0.66
fireworks    0.66
someone      0.66
Activations Density: 0.023%