Explanations
inappropriate content such as vulgar language
instances of vulgar language
Negative Logits
Token         Logit
Winged        -0.73
atform        -0.70
ulin          -0.64
eger          -0.63
aah           -0.63
umbledore     -0.62
DOC           -0.61
UL            -0.61
WIND          -0.60
ulation       -0.60
Positive Logits
Token         Logit
folk          0.77
eric          0.72
lists         0.71
Strait        0.69
yang          0.65
pend          0.64
trader        0.63
roth          0.62
cousin        0.62
shit          0.62
Activations Density 0.000%