INDEX
Explanations
AI researchers, Vicuna, scientists, traditional computing
Negative Logits
on      0.60
od      0.54
ER      0.52
ah      0.51
друг    0.50
im      0.48
as      0.48
ot      0.47
Перед   0.47
agal    0.47
Positive Logits
、      0.52
artistes        0.48
entertained     0.47
。(     0.47
፣       0.47
၊       0.45
经常    0.44
胆      0.44
3       0.42
bothered        0.42
Activations Density 0.001%