INDEX
Explanations
words related to behavior and actions
Negative Logits
Token    Logit
Pr       -0.15
quiz     -0.15
pr       -0.15
gers     -0.14
Bowen    -0.14
eum      -0.14
iones    -0.13
igers    -0.13
Right    -0.13
enade    -0.13
Positive Logits
Token    Logit
emoth    0.25
beh      0.25
beh      0.23
aviour   0.23
Beh      0.22
aviors   0.21
Beh      0.20
avior    0.19
avour    0.18
aviours  0.18
Activations Density: 0.010%