Explanations
potentially harmful or dangerous
Negative Logits
ޟ            0.74
persön       0.64
textSize     0.63
extravagant  0.62
硕           0.60
volupt       0.58
enorm        0.58
moral        0.57
sprawling    0.57
geweld       0.57
Positive Logits
Cross        0.76
Cross        0.75
cross        0.68
Pass         0.68
lành         0.67
której       0.66
Move         0.66
জরি          0.65
Pasar        0.65
studi        0.64
Activations Density 0.218%