INDEX
Explanations
words and phrases related to unethical or abusive practices
Negative Logits
cket              -0.16
.scalablytyped    -0.15
lamaz             -0.15
owie              -0.15
victim            -0.14
coat              -0.14
umar              -0.14
eyh               -0.14
(日               -0.14
/=                -0.14
Positive Logits
Practices         0.30
ness              0.29
practices         0.29
behaviour         0.27
behavior          0.27
/question         0.25
ities             0.25
/problem          0.23
Behavior          0.22
/il               0.22
Activations Density 0.122%