INDEX
Explanations
terms related to effectiveness and impact
Negative Logits
ffects        -0.21
Effects       -0.19
effects       -0.18
affected      -0.18
erable        -0.18
effectively   -0.18
Effects       -0.18
_effects      -0.17
ffect         -0.17
affected      -0.17
Positive Logits
iveness       0.36
ual           0.31
ively         0.29
uate          0.28
ives          0.28
ors           0.28
uated         0.27
ivity         0.26
uating        0.26
ivement       0.22
Activations Density 0.057%
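
The page does not say how the logit lists or the density figure were produced. One common approach, sketched below as an assumption rather than as this dashboard's actual method, is to project a feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting logit change, and to report density as the fraction of token positions where the feature fires. All names here (`decoder_dir`, `unembed`, `vocab`, `top_logit_effects`, `activation_density`) are hypothetical, and the numbers are toy values, not data from this feature.

```python
import numpy as np


def top_logit_effects(decoder_dir, unembed, vocab, k=10):
    """Return (most_negative, most_positive) lists of (token, logit) pairs.

    decoder_dir: (d_model,) feature direction (hypothetical input).
    unembed:     (d_model, n_vocab) unembedding matrix (hypothetical input).
    vocab:       list of n_vocab token strings.
    """
    logits = decoder_dir @ unembed   # one logit effect per vocabulary token
    order = np.argsort(logits)       # indices sorted from lowest to highest logit
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive


def activation_density(activations):
    """Fraction of token positions where the feature is active (> 0)."""
    activations = np.asarray(activations)
    return float(np.mean(activations > 0))


# Toy example: a 2-dimensional model with a 3-token vocabulary.
decoder_dir = np.array([1.0, 0.0])
unembed = np.array([[0.5, -0.25, 0.125],
                    [9.0, 9.0, 9.0]])
vocab = ["iveness", "ffects", "ual"]

negative, positive = top_logit_effects(decoder_dir, unembed, vocab, k=1)
density = activation_density([0.0, 0.0, 0.0, 2.0])
```

In this toy setup the second row of `unembed` has no effect because the feature direction is zero along that dimension. A density of 0.057% under this definition would mean the feature is active on roughly 57 of every 100,000 token positions.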