INDEX
Explanations
phrases or words that express contrast or contradiction
Negative Logits
'           -0.62
hassee      -0.55
יצד         -0.55
WA          -0.54
Chartres    -0.53
Ge          -0.52
Torino      -0.50
jazdu       -0.50
imedes      -0.50
Skinner     -0.49
Positive Logits
ostante      1.82
despite      1.47
Despite      1.41
Despite      1.36
despite      1.36
nonostante   1.34
Malgré       1.34
spite        1.32
Trotz        1.29
Trotz        1.28
Activations Density 0.080%
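
The density figure above can be read as the share of token positions on which the feature fires. The sketch below is a minimal illustration under that assumption; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions with a nonzero
    (above-threshold) activation; the dashboard's exact definition may differ.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 8 of which activate the feature.
sample = [0.0] * 9992 + [1.3, 0.4, 2.1, 0.9, 0.2, 3.0, 0.7, 1.1]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.080%
```

With these made-up numbers, 8 active positions out of 10,000 give the same 0.080% reported for this feature, which shows how sparse the feature is.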