Explanations
violent content or behavior
Negative Logits
Token         Logit
blame         1.44
tink          1.41
been          1.37
              1.36
☺             1.36
aloud         1.34
-\            1.33
worry         1.33
been          1.29
puts          1.29
Positive Logits
Token         Logit
ität          1.73
د             1.59
conformément  1.54
ли            1.43
dagar         1.43
ição          1.42
יות           1.42
combate       1.41
er            1.39
رود           1.38
Activations Density 0.135%