INDEX
Explanations
model refusals
model refusals
model refusals
model refusal
model speaking
model output
model output
model output
model output
model output
model speaking
model output
model speaking
Negative Logits
KMnO       0.41
顛         0.40
Occ        0.38
occ        0.36
khấu       0.36
निशान      0.35
insinu     0.35
Locks      0.35
vistazo    0.34
રો         0.34
Positive Logits
Кан        0.39
Dan        0.39
Edel       0.39
郏         0.38
weiterhin  0.37
Danh       0.37
ভালো       0.37
anch       0.37
Angelo     0.37
ajax       0.36
Activations Density 0.050%