INDEX
Explanations
mentions of specific actions or descriptors that evoke a strong response
Negative Logits
myſelf        -0.98
itſelf        -0.85
Majefty       -0.82
Monfieur      -0.81
ſeveral       -0.79
AttributeSet  -0.76
الحياه        -0.76
himſelf       -0.74
་་            -0.74
fubject       -0.74
Positive Logits
me            0.65
us            0.59
our           0.45
.             0.44
hanem         0.43
Me            0.40
/             0.39
saites        0.39
.             0.37
di            0.36
Activations Density 0.385%