INDEX
Explanations
occurrences of words related to swearing or vulgar expressions
Negative Logits
Token      Logit
νή         -0.16
idis       -0.16
urer       -0.16
raf        -0.16
apsed      -0.16
κρι        -0.16
inement    -0.15
unte       -0.15
bris       -0.15
echn       -0.15
Positive Logits
Token      Logit
stakes     0.21
ies        0.17
endor      0.16
artz       0.16
ombat      0.16
ollen      0.16
itzer      0.16
enburg     0.15
sock       0.15
enberg     0.15
Activations Density 0.056%