INDEX
Explanations
phrases indicating justification or reasoning
Negative Logits
wn       -0.84
ns       -0.78
robe     -0.78
ety      -0.74
shaw     -0.72
mint     -0.70
emi      -0.70
jet      -0.69
spect    -0.68
ses      -0.68
Positive Logits
unlike       0.79
they         0.78
nobody       0.76
fuck         0.74
obviously    0.74
there        0.74
otherwise    0.73
hey          0.73
evidenced    0.72
frankly      0.70
Activations Density 0.060%