Explanations
profane or strong language
references to curses or swearing
Negative Logits
姫          -1.00
nington     -0.81
atican      -0.80
parency     -0.78
issance     -0.77
arnaev      -0.76
olitan      -0.76
oulos       -0.76
anooga      -0.75
itutional   -0.74
Positive Logits
curse       0.91
words       0.84
curses      0.84
words       0.78
hammer      0.78
cursing     0.75
cursed      0.73
bones       0.70
word        0.67
vine        0.67
Activations Density 0.036%