I've been kicking the tires on various LLMs lately and, like many, have been quite taken by the pace of new releases, especially of models with weights distributed under open licenses, always with impressive benchmark results. I don't have local GPUs, so trialling different models necessarily requires using an external host. There are various configuration parameters you can set when sending a query that affect generation, and many vendors document recommended settings on the model card or in associated documentation. For my own purposes I wanted to collect these together in one place, and also confirm in which cases common serving software like vLLM will use defaults provided alongside the model.
Although recommended settings could in principle be shipped in generation_config.json, not all models provide that file, or if they do, they may not include their documented recommendations in it. Even if every model did ship its recommendations in generation_config.json (and inference API providers respect this), and/or a standard like model.yaml is adopted containing these parameters, some attention may still be required if a model has different recommendations for different use cases / modes (as Qwen3 does).

The parameters supported by vLLM are documented here, though not all are supported in the HTTP API provided by different vendors. For instance, the subset of parameters supported by models on Parasail (an inference API provider I've been trying out recently) is documented here. I cover just that subset below, with a request sketch after the list:
- temperature: controls the randomness of sampling of tokens. Lower values are more deterministic, higher values are more random. This is one of the parameters you'll see spoken about the most.
- top_p: limits the tokens that are considered. If set to e.g. 0.5, only the most probable tokens whose summed probability doesn't exceed 50% are considered.
- top_k: also limits the tokens that are considered, such that only the top k tokens are considered.
- frequency_penalty: penalises new tokens based on their frequency in the generated text. It's possible to set a negative value to encourage repetition.
- presence_penalty: penalises new tokens if they appear in the generated text so far. It's possible to set a negative value to encourage repetition.
- repetition_penalty: documented as penalising new tokens based on whether they've appeared so far in the generated text or prompt.
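To make that concrete, here's a rough sketch of how these parameters can be passed in an OpenAI-compatible chat completions request, which is what vLLM and most hosted providers expose. The base URL, API key, and model name below are placeholders and the values are just for illustration; top_k and repetition_penalty aren't part of the standard OpenAI schema, so whether a given provider accepts them varies (vLLM's OpenAI-compatible server does).

```python
import os
import requests

# Placeholder endpoint and credentials - substitute your provider's details.
BASE_URL = os.environ.get("LLM_BASE_URL", "http://localhost:8000/v1")
API_KEY = os.environ.get("LLM_API_KEY", "unused-for-local-vllm")

payload = {
    "model": "some-org/some-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Tell me about sampling parameters."}],
    # Parameters from the standard OpenAI API schema.
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    # Extra sampling parameters: accepted by vLLM's OpenAI-compatible server,
    # but support varies between hosted providers.
    "top_k": 20,
    "repetition_penalty": 1.05,
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

If you're using the official openai Python client rather than raw HTTP, non-standard parameters like these generally have to be passed via its extra_body argument rather than as normal keyword arguments.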
The above settings are typically exposed via the API, but what if you don't
explicitly set them? vLLM
documents
that it will by default apply settings from generation_config.json
distributed with the model on HuggingFace if it exists (overriding its own
defaults), but you can ignore generation_config.json
and just use vLLM's own
defaults by setting --generation-config vllm
when launching the server. This
behaviour was introduced in a PR that landed in early March this
year. We'll explore below
which models actually have a generation_config.json
with their recommended
settings, but what about parameters not set in that file, or if that file
isn't present? As far as I can see, that's where
_DEFAULT_SAMPLING_PARAMS
comes in and we get temperature=1.0
and repetition_penalty, top_p, top_k and
min_p set to values that have no effect on the sampler.
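If you want to check what a particular model would contribute, the file is easy to pull straight from the Hugging Face Hub. A quick sketch (the repo name is just an example, and gated repos such as Llama's will need an authentication token):

```python
import json
import sys
import urllib.request

# Example repo - pass a different one as the first argument to inspect it.
repo = sys.argv[1] if len(sys.argv) > 1 else "Qwen/Qwen3-32B"
url = f"https://huggingface.co/{repo}/raw/main/generation_config.json"

try:
    with urllib.request.urlopen(url) as response:
        config = json.load(response)
except Exception as exc:
    print(f"Couldn't fetch generation_config.json for {repo}: {exc}")
    sys.exit(1)

# Report only the sampling-related keys discussed in this post.
sampling_keys = ("temperature", "top_p", "top_k", "min_p", "repetition_penalty")
found = {key: config[key] for key in sampling_keys if key in config}
print(f"{repo}: {found if found else 'no sampling parameters set'}")
```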
Although Parasail use vLLM for serving most (all?) of their hosted models,
it's not clear if they're running with a configuration that allows defaults to
be taken from generation_config.json. I'll update this post if that is
clarified.
As all of these models are distributed with benchmark results front and center, it should be easy to at least find what settings were used for those results, even if there's no explicit recommendation on which parameters to use - right? Let's find out. I've decided to step through models grouped by their level of openness.
Open weight and open dataset models

- generation_config.json with recommended parameters: No.

Open weight models

- Recommended settings: temperature=0.3 (specified on model card). generation_config.json with recommended parameters: No. Notes: the recommendation isn't included in generation_config.json, and the V3 technical report indicates they used temperature=0.7 for some benchmarks. They also state "Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results" (not totally clear if results are averaged, or the best result is taken). There's no recommendation I can see for other generation parameters, and to add some extra confusion the DeepSeek API docs have a page on the temperature parameter with specific recommendations for different use cases, and it's not totally clear if these apply equally to V3 (after its temperature scaling) and R1.
- Recommended settings: temperature=0.6, top_p=0.95 (specified on model card). generation_config.json with recommended parameters: Yes.
- generation_config.json with recommended parameters: File exists, sets no parameters.
- generation_config.json with recommended parameters: No. Notes: temperature=0.5.
- Recommended settings: temperature=0.8, top_k=50, top_p=0.95 (specified on model card). generation_config.json with recommended parameters: Yes.
- Recommended settings: temperature=0.15 (specified on model card). generation_config.json with recommended parameters: Yes. Notes: the model card recommends temperature=0.15, and this is included in generation_config.json. Mistral's API also reports a default_model_temperature for each model. Executing
curl --location "https://api.mistral.ai/v1/models" --header "Authorization: Bearer $MISTRAL_API_KEY" | jq -r '.data[] | "\(.name): \(.default_model_temperature)"' | sort
gives some confusing results. The mistral-small-2506 version isn't yet available on the API. But the older mistral-small-2501 is, with a default temperature of 0.3 (differing from the recommendation on the model card). mistral-small-2503 has null for its default temperature. Go figure.
- generation_config.json with recommended parameters: No. Notes: temperature=0.15 is the documented recommendation. However, this isn't set in generation_config.json (which doesn't set any default parameters) and Mistral's API indicates a default temperature of 0.0.
- Recommended settings: temperature=0.7, top_p=0.95 (specified on model card). generation_config.json with recommended parameters: No (file exists, but parameters missing). Notes: the model card recommends temperature=0.7 and top_p=0.95, and this default temperature is also reflected in Mistral's API mentioned above.
- Recommended settings: temperature=0.6, top_p=0.95, top_k=20, min_p=0 for thinking mode and temperature=0.7, top_p=0.8, top_k=20, min_p=0 for non-thinking mode (specified on model card). generation_config.json with recommended parameters: Yes, e.g. for Qwen3-32B (uses the "thinking mode" recommendations); all the ones I've checked have this at least. Notes: the model card also suggests a presence_penalty between 0 and 2 to reduce endless repetitions. The Qwen 3 technical report notes the same parameters but also states that for the non-thinking mode they set presence_penalty=1.5 and applied the same setting for thinking mode for the Creative Writing v3 and WritingBench benchmarks. (There's a sketch of selecting between these presets just after this list.)
- generation_config.json with recommended parameters: No (file exists, but parameters missing).
- Recommended settings: temperature=0.6, top_p=0.95, top_k=40 and max_new_tokens=30000 (specified on model card). generation_config.json with recommended parameters: No.
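Since Qwen3's recommendations differ between thinking and non-thinking mode, one way to handle this on the client side is to keep both presets around and merge the appropriate one into each request. A minimal sketch, assuming an OpenAI-compatible endpoint; as before, whether top_k and min_p are honoured depends on the serving stack.

```python
# Qwen3 sampling presets as recommended on the model card.
QWEN3_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0},
}

def qwen3_payload(messages, mode="thinking", presence_penalty=None):
    """Build a chat completions payload using the recommended preset for `mode`.

    presence_penalty is optional: the model card suggests a value between
    0 and 2 if you run into endless repetition.
    """
    payload = {
        "model": "Qwen/Qwen3-32B",  # any Qwen3 variant; placeholder choice
        "messages": messages,
        **QWEN3_PRESETS[mode],
    }
    if presence_penalty is not None:
        payload["presence_penalty"] = presence_penalty
    return payload

print(qwen3_payload([{"role": "user", "content": "Hello"}], mode="non_thinking"))
```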
Weight available (non-open) models

- Recommended settings: temperature=1.0, top_k=64, top_p=0.96 (source). generation_config.json with recommended parameters: Yes (temperature=1.0 should be the vLLM default anyway, so it shouldn't matter that it isn't specified). Notes: generation_config.json does set top_k and top_p, and the Unsloth folks apparently had confirmation from the Gemma team on the recommended temperature, though I couldn't find a public comment directly from the Gemma team.
- Recommended settings: temperature=0.6, top_p=0.9 (source: generation_config.json). generation_config.json with recommended parameters: Yes. Notes: I checked generation_config.json via a third-party mirror, as providing name and DoB to view it on HuggingFace (as required by Llama's restrictive access policy) seems ridiculous.

As it happens, while writing this blog post I saw Simon Willison blogged
about model.yaml.
Model.yaml is an initiative from the LM Studio folks
to provide a definition of a model and its sources that can be used with
multiple local inference tools. This includes the ability to specify preset
options for the model. It doesn't appear to be used by anyone else though, and
looking at the LM Studio model catalog, taking
qwen/qwen3-32b as an example:
although the Qwen3 series have very strongly recommended default settings, the
model.yaml only sets top_k and min_p, leaving temperature and top_p unset.