INDEX
Explanations
phrases indicating transformation or change in state
Negative Logits
erity      -0.89
essim      -0.77
intend     -0.72
ysis       -0.71
dstg       -0.66
vulner     -0.65
isha       -0.65
iency      -0.63
tempted    -0.63
ying       -0.62
Positive Logits
infamous      0.93
famous        0.85
known         0.82
synonymous    0.80
iconic        0.80
illac         0.78
staples       0.76
canon         0.74
famous        0.73
ģĸ            0.72
Activations Density 0.050%