Explanations
instances of conditional language or terms related to uncertainty and limitations in software guarantees
Negative Logits
neceff  -0.64
ſte  -0.64
purpoſe  -0.62
ſta  -0.62
ſelf  -0.60
ſever  -0.60
tranſ  -0.60
Theſe  -0.60
ſtate  -0.59
ؤلاء  -0.58
Positive Logits
even  4.11
even  3.54
Even  3.01
Even  2.98
sogar  2.96
даже  2.91
persino  2.86
EVEN  2.81
incluso  2.79
навіть  2.78
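The two logit lists can be read as the vocabulary tokens whose output logits this feature most raises or lowers. Below is a minimal sketch of how such lists are commonly produced, assuming each score is the dot product of the feature's decoder direction with a token's unembedding vector. The function name `logit_effects`, the dict-based `unembed` layout, and the toy numbers are hypothetical and are not taken from the dashboard itself; real pipelines may also apply normalization before the projection.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]


def logit_effects(
    decoder_dir: List[float],
    unembed: Dict[str, List[float]],
    k: int = 10,
) -> Tuple[Scored, Scored]:
    """Return the k most positive and k most negative token scores.

    Assumption (hypothetical layout): `unembed` maps each token string to
    its unembedding vector, with the same length as `decoder_dir`.
    """
    # Score each token by projecting the feature direction onto its unembedding.
    scores = {
        tok: sum(d * w for d, w in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }
    # Sort ascending: the front holds the most negative effects.
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    negative = ranked[:k]
    positive = list(reversed(ranked))[:k]
    return positive, negative


# Toy example with a 2-dimensional residual stream (made-up values).
pos, neg = logit_effects(
    [1.0, -0.5],
    {"even": [4.0, -0.2], "ſelf": [-0.6, 0.0], "the": [0.1, 0.2]},
    k=1,
)
print(pos, neg)
```

With these toy inputs, "even" receives the largest positive score and "ſelf" the most negative one, mirroring the shape (not the actual values) of the lists above.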
Activation Density: 0.531%
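Activation density is usually reported as the share of sampled tokens on which the feature fires. Here is a minimal sketch, assuming "fires" means an activation strictly above a threshold of 0. The function name `activation_density` and the threshold default are illustrative assumptions, not the dashboard's documented definition.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed rule)."""
    acts = list(activations)
    if not acts:
        return 0.0
    firing = sum(1 for a in acts if a > threshold)
    return 100.0 * firing / len(acts)


# Worked example: 531 firing tokens out of 100,000 sampled tokens.
sample = [0.0] * 99_469 + [1.0] * 531
print(f"{activation_density(sample):.3f}%")
```

Under this definition, a density of 0.531% corresponds to roughly 531 firing tokens per 100,000, which marks this as a sparse feature.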