Explanations
phrases that express contrast or exception
Negative Logits
ſelf     -0.90
Dorian   -0.83
Galicia  -0.75
Eros     -0.75
Caine    -0.74
yayım    -0.73
Aire     -0.73
PHA      -0.73
Jefus    -0.71
Jeune    -0.71
Positive Logits
but      2.40
But      2.28
but      2.19
BUT      2.05
But      2.05
BUT      1.81
pero     1.68
tetapi   1.49
nhưng    1.47
但       1.43
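The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. The following is a minimal sketch of how such a ranking is often produced. It assumes, though this page does not say so, that each score is the dot product of the feature's decoder direction with a token's unembedding vector. The helper names and the 3-dimensional vectors are hypothetical toy values, not real model weights.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]

def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each token as dot(decoder_dir, token's unembedding vector).

    Assumption: this dot-product definition is illustrative only.
    """
    return {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }

def top_and_bottom(scores: Dict[str, float], k: int) -> Tuple[Ranked, Ranked]:
    """Return the k most positive and the k most negative (token, score) pairs."""
    positive = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    negative = sorted(scores.items(), key=lambda kv: kv[1])[:k]
    return positive, negative

# Hypothetical toy example: a feature direction and four token vectors.
decoder_dir = [1.0, 0.5, -0.5]
unembed = {
    "but": [2.0, 0.5, 0.0],
    "pero": [1.5, 0.5, 0.5],
    "Dorian": [-0.5, -0.5, 0.5],
    "the": [0.0, 0.2, 0.2],
}
pos, neg = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print(pos)  # [('but', 2.25), ('pero', 1.5)]
print(neg)  # [('Dorian', -1.0), ('the', 0.0)]
```

With a real model, the vocabulary would contain tens of thousands of tokens. Only the extremes at each end are shown, which is why a dashboard lists roughly ten tokens per side.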
Activation density: 0.128%
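Activation density is the share of tokens on which the feature is active. Below is a minimal sketch of that calculation. It assumes "active" means an activation strictly above zero, since the page does not state its threshold. The function name and the activation values are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed 0.0)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Hypothetical activations: 2 of 1,000 tokens fire.
acts = [0.0] * 998 + [3.1, 0.7]
print(f"{activation_density(acts):.3f}%")  # 0.200%
```

For scale, a density of 0.128% means the feature fires on roughly 1 of every 781 tokens (100 / 0.128 ≈ 781).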