INDEX
Explanations
sentences that provide explanations or reasons for a given situation or action
Negative Logits
Token       Logit
nces        -0.70
torch       -0.70
readable    -0.69
kun         -0.68
imet        -0.65
uania       -0.65
borg        -0.64
ona         -0.63
adiq        -0.63
istered     -0.62
Positive Logits
Token       Logit
Because     1.22
Reason      1.20
Because     1.19
reasons     1.13
Cause       1.13
cause       1.11
Reasons     1.11
ecause      1.07
WHY         1.00
because     0.97
Activations Density 0.169%
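
The density figure above can be read as the share of tokens on which the feature fires at all. The page does not say how it was computed, so the following is only a minimal sketch under an assumption: density is the percentage of token positions whose activation is strictly above a threshold (zero here). The function name `activation_density` and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature is active. The source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 2 of 1000 tokens.
sample = [0.0] * 998 + [0.8, 1.3]
density = activation_density(sample)
print(f"{density:.3f}%")  # prints "0.200%"
```

Under this reading, a density of 0.169% would correspond to the feature being active on roughly 17 of every 10,000 tokens.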