INDEX
Explanations
profane and derogatory terms
derogatory terms and phrases related to unflattering behaviors or characteristics
Negative Logits
Corm        -0.64
Source      -0.64
WER         -0.63
Prophe      -0.61
Novel       -0.61
divisions   -0.59
ãģ®ç        -0.59
Feature     -0.59
rics        -0.57
Defender    -0.57
Positive Logits
jer         1.17
jerk        1.14
usalem      1.03
weed        0.86
offs        0.82
boa         0.82
ometer      0.80
ety         0.80
bucks       0.80
itude       0.79
Activations Density 0.021%