Explanations
reasons or explanations for specific situations
mentions of "reasons" in various contexts
Negative Logits
Token       Logit
yss         -0.70
oba         -0.67
enged       -0.67
ibaba       -0.66
ILA         -0.64
boro        -0.62
tatt        -0.60
likeness    -0.59
aila        -0.59
ascus       -0.59
Positive Logits
Token       Logit
reasons     1.06
abl         0.94
unrelated   0.90
why         0.87
Reasons     0.82
¶           0.81
asons       0.76
alone       0.75
why         0.74
Reason      0.74
Activations Density: 0.028%