Explanations
phrases related to social interactions and relationships
Negative Logits
ofire     -0.15
prive     -0.14
ynet      -0.13
gamber    -0.13
Britt     -0.13
_mD       -0.13
ovatel    -0.12
adele     -0.12
ovice     -0.12
exh       -0.12
Positive Logits
get       0.18
done      0.16
rix       0.15
went      0.15
going     0.15
stay      0.15
going     0.14
doing     0.14
gone      0.14
doing     0.14
Activations Density: 0.623%