Explanations
determiners related to possession or affiliation, particularly "their" and "they"
Negative Logits
Token           Logit
itſelf          -0.67
gears           -0.52
strato          -0.52
extAlignment    -0.51
voltaic         -0.50
                -0.49
Jefus           -0.48
ſelf            -0.48
SECRET          -0.47
akcji           -0.46
Positive Logits
Token           Logit
themselves      0.83
themselves      0.76
they            0.73
Their           0.62
Their           0.61
يكب             0.60
their           0.60
they            0.59
forem           0.59
THEY            0.59
Activations Density 0.413%
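
The source does not define the activation density figure. A common reading is the fraction of sampled token positions on which the feature fires at all. The sketch below computes that ratio under this assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.2, 1.5, 0.7, 3.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.413% would mean the feature is active on roughly 4 of every 1000 sampled tokens.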