Explanations
references to collective identity or community
Negative Logits
Token        Logit
itself       -0.21
themselves   -0.21
e            -0.17
oad          -0.16
er           -0.15
lectron      -0.15
ton          -0.15
inia         -0.15
(s           -0.15
noon         -0.14
Positive Logits
Token        Logit
/us          0.39
/me          0.31
/her         0.28
/th          0.27
enet         0.27
ury          0.26
urious       0.23
ourselves    0.23
self         0.22
usal         0.22
Activations Density: 0.068%