INDEX
Explanations
possessive pronouns and expressions of personal ownership
Negative Logits
unanim      -0.17
s           -0.16
ories       -0.15
tero        -0.15
elder       -0.14
pig         -0.14
rů          -0.14
STITUTE     -0.14
aftermath   -0.14
icky        -0.14
Positive Logits
rtle        0.32
riad        0.32
opic        0.29
anmar       0.29
ri          0.26
opia        0.25
rrha        0.25
myself      0.25
ths         0.24
embros      0.23
Activations Density: 0.142%