INDEX
Explanations
phrases indicating legal judgments or findings of guilt
Negative Logits
Token      Logit
avra       -0.19
ebek       -0.18
ADOS       -0.16
uele       -0.16
ARIABLE    -0.15
oblig      -0.15
ãģŀ        -0.15
롱         -0.15
laÄį       -0.15
ivid       -0.14
Positive Logits
Token      Logit
fit        0.35
Fit        0.28
guilty     0.28
Fit        0.27
-fit       0.26
fit        0.26
fitness    0.23
unfit      0.23
worthy     0.21
.fit       0.20
Activations Density 0.091%