INDEX
Explanations
items that have been specifically marked or identified with a label
terms related to labeling or categorizing objects or ideas
Negative Logits
ramid    -0.75
=-=-     -0.73
ppa      -0.72
yre      -0.72
perty    -0.70
vati     -0.69
abama    -0.68
hire     -0.67
compr    -0.67
vous     -0.67
Positive Logits
phas      0.85
ging      0.69
unfit     0.68
ged       0.67
own       0.67
labelled  0.67
labeled   0.66
labeling  0.64
loyalty   0.64
branded   0.63
Activations Density 0.034%