INDEX
Explanations
phrases related to debates, explanations, or arguments
Negative Logits
ãĥ¬       -0.76
breaking  -0.74
hen       -0.71
shit      -0.70
oses      -0.69
Desk      -0.69
ante      -0.66
enge      -0.66
TY        -0.66
hens      -0.66
Positive Logits
cher          0.81
they          0.76
accompanies   0.72
someday       0.72
justifies     0.69
there         0.69
although      0.69
mismatch      0.68
arose         0.67
characterize  0.66
Activations Density: 1.253%