INDEX
Explanations
expressions indicating logical negation or conditional statements
Negative Logits
so              -0.76
of              -0.68
                -0.67
de              -0.67
[toxicity=0]    -0.66
</em>           -0.65
a               -0.64
↵               -0.62
-               -0.60
,               -0.60
Positive Logits
(!              1.36
verwijspagina   1.26
(!              1.15
(!__            1.13
للمعارف         1.11
pleaſure        1.05
HasFactory      1.04
autorytatywna   1.04
nahilalakip     1.04
moustache       1.01
Activations Density 0.018%