Explanations
trusted friend, family, or adult
Negative Logits
Token    Logit
ంగా    0.86
untersucht    0.77
లు    0.75
ના    0.73
ו    0.72
شي    0.71
يكي    0.70
};    0.68
ασ    0.68
هاي    0.68
Positive Logits
Token    Logit
z    1.13
st    0.99
trusted    0.98
л    0.95
is    0.89
Trusted    0.88
ic    0.84
ologist    0.84
ate    0.80
eg    0.80
Activations Density: 0.002%