INDEX
Explanations
punctuation and formatting cues in the text
Negative Logits
oman       -0.16
Rena       -0.16
oga        -0.15
ilig       -0.13
/*č↵       -0.13
κε         -0.13
_imp       -0.13
homosex    -0.13
oras       -0.13
illard     -0.13
Positive Logits
Category   0.19
#__        0.17
Contents   0.17
automát    0.16
history    0.16
Contents   0.16
HISTORY    0.16
Category   0.16
category   0.15
_contents  0.15
Activations Density 0.029%