Explanations
phrases that reference groups or collectives
Negative Logits
Token          Logit
ry             -0.19
hone           -0.19
eri            -0.16
/up            -0.16
chl            -0.15
baum           -0.15
crow           -0.15
итов           -0.14
ifer           -0.14
appropriate    -0.14
Positive Logits
Token          Logit
ings           0.40
think          0.24
usc            0.24
INGS           0.23
/group         0.22
sWith          0.21
mates          0.18
aroo           0.18
ement          0.18
members        0.18
Activations Density 0.057%
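
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number is commonly computed, assuming that density means the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard; the dashboard's exact definition may differ.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: a feature counts as "active" on a token when its activation
    is strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 6 of which activate the feature.
sample = [0.0] * 9994 + [1.3, 0.7, 2.1, 0.2, 0.9, 4.0]
density = activation_density(sample)
print(f"Activation density: {density:.3%}")
```

A density on the order of 0.057% would mean, under this reading, that the feature fires on roughly 6 out of every 10,000 tokens, which is typical of a sparse feature.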