INDEX
Explanations
phrases related to accountability and consequences for actions
Negative Logits
purpoſe       -0.86
Theſe         -0.83
myſelf        -0.78
Monfieur      -0.75
Inſ           -0.74
Diſ           -0.74
pleaſure      -0.73
ſtate         -0.72
iſt           -0.70
rungsseite    -0.70
Positive Logits
its           0.60
@"/           0.59
their         0.55
своей         0.55
having        0.54
suoi          0.51
vì            0.49
ésia          0.49
因            0.48
née           0.48
Activations Density: 0.289%