Explanations
actions or abilities described in a positive light
Negative Logits
Niet        -0.71
Azerb       -0.64
Borders     -0.61
Grail       -0.61
Seym        -0.61
Nare        -0.58
Clarkson    -0.56
Frie        -0.56
Jagu        -0.55
Gaw         -0.55
Positive Logits
][          0.92
]           0.76
redients    0.69
)           0.66
];          0.65
].          0.65
::          0.65
_           0.65
actionDate  0.64
lement      0.64
Activations Density 4.238%