INDEX
Explanations
references to popular book or movie series and their characters
Negative Logits
erior    -0.17
鬼       -0.15
culo     -0.15
pong     -0.15
otel     -0.15
inkle    -0.15
ongan    -0.14
serg     -0.14
ardi     -0.14
keh      -0.14
Positive Logits
Kat          0.35
District     0.33
District     0.33
districts    0.32
Mock         0.32
Kat          0.31
district     0.30
district     0.29
Hunger       0.28
trib         0.27
Activations Density: 0.030%