INDEX
Explanations
safety, emergency, warnings
Negative Logits
lacht     0.53
as        0.52
wert      0.49
lama      0.49
zeta      0.49
have      0.49
pecific   0.48
bias      0.48
zn        0.47
ent       0.47
Positive Logits
开        0.59
juices    0.50
你们      0.49
tổng      0.48
hiệu      0.48
빌        0.48
вас       0.47
ນ         0.47
เต็ม      0.46
người     0.46
Activations Density 0.001%