INDEX
Explanations
phrases related to causality and attribution
Negative Logits
Carbuncle     -0.63
ura           -0.61
ahs           -0.59
iverpool      -0.59
aws           -0.57
ourse         -0.57
esc           -0.57
talk          -0.57
Chains        -0.57
clipboard     -0.56
Positive Logits
partly        1.00
solely        0.98
chiefly       0.90
principally   0.87
primarily     0.84
partially     0.82
mainly        0.81
entirely      0.80
largely       0.78
squarely      0.74
Activations Density: 0.105%