INDEX
Explanations
hate speech and abuse refusal
Negative Logits
somewhat  0.48
slightly  0.48
কিছুটা  0.45
dependiendo  0.44
약간  0.44
pleasantly  0.43
depending  0.43
mainly  0.42
少し  0.41
optional  0.40
Positive Logits
Даже  0.51
Needless  0.47
مهما  0.46
ऐसे  0.44
навіть  0.44
Even  0.43
Needless  0.43
regardless  0.43
គ្មាន  0.42
rocities  0.41
Activations Density 0.500%