INDEX
Explanations
identifies common objects, dislikes disrespect
Negative Logits
ərə              0.49
altres           0.47
antioxidants     0.47
bırak            0.47
estratégia       0.46
éve              0.46
ajouter          0.44
príncipe         0.44
antidepressants  0.44
combines         0.44
Positive Logits
3                0.51
TT               0.50
text             0.49
ф                0.49
TH               0.48
FT               0.48
bullet           0.47
pt               0.47
mortem           0.46
頂               0.46
Activations Density 0.001%