INDEX
Explanations
words or phrases related to medical conditions or academic titles
words and phrases related to deception or misleading actions
Negative Logits
izational    -0.71
Kenobi       -0.69
unci         -0.69
STER         -0.68
isations     -0.68
rarily       -0.68
eness        -0.68
unciation    -0.67
ested        -0.66
icip         -0.66
Positive Logits
utical       0.97
ce           0.89
les          0.87
pter         0.85
e            0.82
rette        0.81
lled         0.80
re           0.79
llan         0.77
ased         0.76
Activations Density: 0.036%