INDEX
Explanations
statements reflecting societal issues and personal experiences related to social behaviors and norms
Negative Logits
itſelf        -1.01
InSection     -1.00
Jefus         -0.93
Majefty       -0.91
themſelves    -0.87
myſelf        -0.87
ſelf          -0.85
himſelf       -0.84
pleaſure      -0.82
ſelves        -0.81
Positive Logits
im            0.48
股            0.40
İstinadlar    0.40
these         0.39
las           0.38
saraba        0.38
рост          0.38
"             0.37
an            0.37
"             0.37
Activations Density 0.438%