Explanations
phrases indicating disapproval or violation of rules
Followed by "(" or "Q" (likely a question)
question tokens
Negative Logits
Token          Logit
'\\;'          -1.17
ſelves         -1.14
་་             -1.12
etheless       -1.12
dafx           -1.12
$_"            -1.10
>\<^           -1.10
―――――          -1.10
olesale        -1.05
BibitemShut    -1.04
Positive Logits
Token          Logit
<eos>           1.19
↵↵              1.05
↵               1.04
..              0.95
↵↵↵             0.94
...             0.89
</em>           0.87
                0.86
                0.86
</h2>           0.86
Activation density: 1.435%