INDEX
Explanations
instances where an unexpected outcome is described
the phrase "even though" indicating contrast or concession
Negative Logits
Eye            -0.78
isible         -0.65
ru             -0.65
ursed          -0.64
umped          -0.63
cycl           -0.62
ricks          -0.61
aven           -0.61
irt            -0.60
Ingredients    -0.60
Positive Logits
acknowledging  0.82
lihood         0.77
deleting       0.74
conced         0.72
itals          0.69
admitting      0.69
admittedly     0.68
clair          0.68
olulu          0.67
anamo          0.67
Activations Density: 0.026%