INDEX
Explanations
phrases related to derogatory remarks
instances of the word "sn"
Negative Logits
heid            -0.87
limited         -0.70
maj             -0.64
medium          -0.62
und             -0.61
Cond            -0.60
EMENT           -0.60
respectfully    -0.60
belts           -0.59
Ind             -0.59
Positive Logits
iping           1.44
uggle           1.43
ipe             1.41
ugg             1.41
appy            1.37
atches          1.35
atching         1.35
arling          1.34
ipes            1.33
ickers          1.33
Activations Density: 0.033%