INDEX
Explanations
words related to negative or offensive behavior or comments
derogatory terms and references to socially taboo activities
Negative Logits
Plate      -0.67
spring     -0.67
Sources    -0.65
Source     -0.65
Source     -0.62
ensus      -0.61
Oral       -0.60
Celtic     -0.59
Sacrament  -0.59
Novel      -0.59
Positive Logits
jer        1.22
jerk       1.13
etsk       0.97
ithing     0.89
bucks      0.88
boa        0.85
balls      0.84
>>\        0.84
artifacts  0.83
EStream    0.82
Activations Density 0.010%