INDEX
Explanations
reasoning or explanation related to decisions
Negative Logits
Token   Logit
lez     -0.84
wn      -0.77
nin     -0.71
agin    -0.70
Gas     -0.68
iac     -0.67
hal     -0.66
ax      -0.66
gall    -0.65
yan     -0.64
Positive Logits
Token      Logit
they       1.01
otherwise  0.92
nobody     0.87
unlike     0.83
there      0.82
*/(        0.81
it         0.81
THEY       0.79
obviously  0.71
we         0.70
Activations Density: 2.933%