INDEX
Explanations
names and references to historical figures and events
Negative Logits
aeda         -0.17
idunt        -0.16
urum         -0.15
reh          -0.15
gorit        -0.15
bakan        -0.15
aket         -0.14
uellement    -0.14
currently    -0.14
mastur       -0.14
Positive Logits
pole          0.18
152           0.16
propag        0.16
propaganda    0.15
Mgr           0.15
contacts      0.15
favor         0.15
pap           0.15
-CS           0.14
ophile        0.14
Activations Density: 0.037%