INDEX
Explanations
profane words or vulgar language
Negative Logits
lihood      -0.68
NING        -0.67
enegger     -0.67
nces        -0.65
Äĩ          -0.63
risome      -0.63
atical      -0.61
senal       -0.61
manship     -0.61
POL         -0.60
Positive Logits
ogether     1.26
imore       1.20
itude       1.11
reatment    0.96
itudes      0.95
uve         0.92
zman        0.86
itud        0.82
ournament   0.81
ree         0.80
Activations Density 0.015%