INDEX
Explanations
phrases related to classification and categorization
Negative Logits
undra       -0.18
sth         -0.17
acock       -0.17
rieb        -0.15
verbatim    -0.14
inston      -0.14
cela        -0.14
ohen        -0.14
kü          -0.14
asename     -0.13
Positive Logits
ness        0.23
ifi         0.19
ifying      0.16
-looking    0.15
utter       0.14
izz         0.14
ly          0.14
ewing       0.14
NESS        0.13
ifiable     0.13
Activations Density 0.007%