Explanations
self-defense justification or necessity
Negative Logits
banal  0.48
安定  0.44
indiscrimin  0.44
ఆదేశ  0.43
revenue  0.43
zeta  0.41
stabilise  0.41
pasando  0.40
hate  0.40
callous  0.40
Positive Logits
łą  0.45
stranded  0.43
ıları  0.43
nécessité  0.38
ತು  0.38
appropriately  0.38
फं  0.38
拦  0.37
American  0.36
нужда  0.36
Activations Density 0.040%