Explanations
comparative phrases or constructs that suggest increasing levels or risks related to various factors
the more the more
Negative Logits
Token         Logit
فريبيس        -1.02
للمعارف       -0.97
queſta        -0.93
<unused23>    -0.92
<unused79>    -0.92
<unused52>    -0.92
<unused68>    -0.92
<unused42>    -0.92
<unused3>     -0.92
<unused28>    -0.92
Positive Logits
Token         Logit
!             0.39
chance        0.39
chances       0.36
you           0.34
progress      0.34
the           0.34
.             0.33
it            0.33
↵↵            0.32
success       0.32
Activations Density 0.020%