INDEX
Explanations
words associated with behavioral concepts
Negative Logits
eon      -0.20
asz      -0.18
893      -0.16
tiles    -0.16
ean      -0.15
enaire   -0.15
uguay    -0.15
erate    -0.15
ties     -0.15
ecko     -0.15
Positive Logits
emoth    0.44
older    0.31
aviors   0.31
old      0.29
emo      0.28
aviour   0.27
aviours  0.27
olders   0.26
emouth   0.26
olding   0.26
Activations Density 0.010%
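
The page does not define the density figure above. One common reading is the fraction of sampled token positions on which the feature has a nonzero activation. Here is a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all. The dashboard itself does not state this.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Made-up data: the feature fires on 1 of 10,000 token positions.
acts = [0.0] * 9999 + [2.5]
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.010%
```

Under this reading, a density of 0.010% would mean the feature fires on roughly one token in ten thousand.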