INDEX
Explanations
mentions of controversial topics or discussions surrounding moral or ethical dilemmas
Negative Logits
malink      -0.18
elib        -0.16
Actually    -0.15
actually    -0.15
надо        -0.14
Asked       -0.14
too         -0.14
oka         -0.14
потом       -0.14
Actually    -0.13
Positive Logits
According    0.23
according    0.22
Although     0.22
While        0.21
Though       0.21
According    0.21
Due          0.21
Furthermore  0.20
although     0.20
due          0.20
Activations Density: 0.264%