INDEX
Explanations
mentions of "America" and associated phrases
Negative Logits
istica     -0.17
ilver      -0.16
rts        -0.16
ists       -0.16
ippo       -0.15
iform      -0.15
orthand    -0.15
귀         -0.15
lak        -0.15
ToWorld    -0.15
Positive Logits
Ferr       0.21
JR         0.18
Birleşik   0.18
anness     0.17
latina     0.16
indiv      0.15
eturn      0.15
Latina     0.15
KIT        0.15
611        0.15
Activations Density 0.026%