INDEX
Explanations
phrases describing strong and clear contrasts
references to stark contrasts or inequalities
Negative Logits
hops          -0.83
annis         -0.71
ipop          -0.71
uthor         -0.70
diligently    -0.69
andom         -0.69
aceae         -0.68
onz           -0.68
RAFT          -0.67
ilk           -0.64
Positive Logits
contrasts        1.22
contrast         1.09
ly               1.07
stark            0.95
departure        0.91
difference       0.90
differences      0.86
reminders        0.83
contradiction    0.82
ethy             0.82
Activations Density: 0.069%