INDEX
Explanations
keeping prejudiced or insulting
Negative Logits
تە  0.41
CYAN  0.39
difíciles  0.37
́t  0.37
প্রতিক  0.36
few  0.36
難しい  0.36
FCA  0.36
bot  0.36
baton  0.36
Positive Logits
Keep  0.92
Keep  0.89
keep  0.89
keeps  0.84
keep  0.80
保持  0.77
Keeps  0.74
KEEP  0.74
simplify  0.72
kept  0.71
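
The page does not say how the logit lists are produced. A common approach in feature dashboards of this kind is to take the dot product of the feature's output direction with each token's unembedding vector, then list the highest and lowest scoring tokens. The sketch below illustrates that idea under this assumption. The function name `top_logit_tokens`, the two-dimensional vectors, and the four-token vocabulary are all hypothetical toy values, not the numbers listed above.

```python
def top_logit_tokens(feature_dir, unembed, vocab, k=2):
    """Rank tokens by the dot product of the feature direction with
    each token's unembedding row (an assumed scoring scheme)."""
    scores = [sum(d * w for d, w in zip(feature_dir, row)) for row in unembed]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]          # lowest scores first
    positive = ranked[::-1][:k]    # highest scores first
    return positive, negative


# Hypothetical toy data: 4 tokens, 2-dimensional unembedding rows.
vocab = ["keep", "kept", "few", "bot"]
unembed = [
    [0.9, 0.1],
    [0.7, 0.2],
    [-0.4, 0.3],
    [-0.3, 0.5],
]
feature_dir = [1.0, 0.0]  # assumed feature output direction

positive, negative = top_logit_tokens(feature_dir, unembed, vocab)
for token, score in positive:
    print("positive", token, round(score, 2))
for token, score in negative:
    print("negative", token, round(score, 2))
```

With a real model, `unembed` would be the full unembedding matrix and `feature_dir` the feature's decoder vector; the ranking logic stays the same.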
Activations Density 0.000%