INDEX
Explanations
words related to trustworthiness or credibility
instances of the substring "r"
Negative Logits
eers      -0.83
DRAG      -0.68
eer       -0.68
Loft      -0.68
Rising    -0.67
WAYS      -0.67
Winds     -0.66
staging   -0.64
warr      -0.64
dare      -0.64
Positive Logits
inct      1.15
ilateral  1.01
angle     0.99
usted     0.98
angles    0.98
acy       0.98
acist     0.96
uder      0.96
angled    0.95
ushed     0.95
Activations Density: 0.047%