INDEX
Explanations
phrases indicating personal agency and choices in the context of social issues
Negative Logits
()]       -0.60
']))      -0.60
()));     -0.57
"]);      -0.55
)}}       -0.55
},        -0.52
discul    -0.51
))}       -0.50
les       -0.50
{}));     -0.50
Positive Logits
themselves    1.32
their         1.15
themselves    1.11
their         0.99
Their         0.94
Their         0.85
THEIR         0.74
ihre          0.73
kanilang      0.72
ihren         0.72
Activations Density 0.503%