INDEX
Explanations
references to the concept of "nation" and its variations
Negative Logits
s        -0.18
nice     -0.16
orie     -0.15
ive      -0.15
sse      -0.15
otty     -0.15
avian    -0.15
out      -0.15
sko      -0.14
oke      -0.14
Positive Logits
hood     0.40
wide     0.34
-wide    0.30
-state   0.29
ally     0.28
alse     0.28
-states  0.28
aal      0.26
/world   0.26
nal      0.26
Activations Density: 0.017%