INDEX
Explanations
text related to attributes and characteristics
references to various attributes related to characters or entities
Negative Logits
analysis    -0.74
stall       -0.71
dos         -0.70
fare        -0.68
NAS         -0.66
corn        -0.66
tic         -0.64
Saunders    -0.63
Schumer     -0.63
zin         -0.62
Positive Logits
attributes     1.01
wcsstore       0.95
attribute      0.94
Attributes     0.81
attribute      0.79
iveness        0.76
mentation      0.75
reys           0.75
Attributes     0.74
identifiers    0.73
Activations Density: 0.006%