INDEX
Explanations
evaluative and promotional language about products or experiences
Negative Logits
<<<<<<<<<<<<<<  -0.61
fucking         -0.57
fucking         -0.54
UnusedPrivate   -0.52
FUCKING         -0.52
fuck            -0.51
raped           -0.50
Fucking         -0.49
retarded        -0.49
stupid          -0.48
Positive Logits
summertime      0.57
festive         0.54
sizzling        0.52
letoe           0.51
holiday         0.50
frosty          0.50
ruff            0.49
BibitemShut     0.49
roars           0.48
garantiert      0.47
Activations Density: 0.459%