INDEX
Explanations
references to social roles and relationships within communities
Negative Logits
clude      -0.16
clud       -0.15
apiro      -0.15
agos       -0.15
present    -0.15
gu         -0.14
present    -0.14
tant       -0.14
embros     -0.14
allows     -0.14
POSITIVE LOGITS
everywhere  0.30
shouldn     0.26
across      0.23
with        0.22
throughout  0.20
today       0.20
without     0.20
around      0.19
anywhere    0.19
Everywhere  0.18
Activations Density: 0.299%