INDEX
Explanations
references to personal identity and relationships
Negative Logits
"]').
-0.68
"").
-0.62
})}
-0.62
",(
-0.60
"))
-0.58
目は
-0.57
())).
-0.56
》.
-0.56
",{
-0.56
").
-0.56
Positive Logits
YOURSELF
0.96
Myself
0.90
ourselves
0.89
Yourself
0.88
Myself
0.88
selves
0.87
myself
0.86
Yourself
0.86
comigo
0.80
yourself
0.78
Activations Density 0.221%