When a consul lock is lost, there is a possibility that the associated
session is still active. Most commonly, the long request to watch the
lock key may error out, while the session is continually refreshed at a
rate of TTL/2.
First have the lock monitor retry the lock internally for at least 10
seconds (5 attempts with the default 2 second wait time). In most cases
this will reconnect on the first try, keeping the lock channel open.
If the consul lock can't recover itself, then cancel the session as soon
as possible (terminating the PreiodicRenew will call Session.Destroy),
and start over. In the worse case, the consul agents were split, and the
session still exists on the leader so we may need to wait for the old
session TTL, plus the LockWait time to renew the lock.
We use a Context for the cancellation channels here, because that
removes the need to worry about double-closes and nil channels. It
requires an awkward adapter goroutine for now to convert the Done()
`<-chan` to a `chan` for PeriodicRenew, but makes the rest of the code
safer in the long run.
Remote state implementations may initialize a lineage when creating a
new named state (i.e. "workspace"). The tests were ignoring that initial
lineage to write a new state to the backend.
A common reason to want to use `terraform plan` is to have a chance to
review and confirm a plan before running it. If in fact that is the
only reason you are running plan, this new `terraform apply -auto-approve=false`
flag provides an easier alternative to
P=$(mktemp -t plan)
terraform refresh
terraform plan -refresh=false -out=$P
terraform apply $P
rm $P
The flag defaults to true for now, but in a future version of Terraform it will
default to false.
Rather than overloading InstanceDiff with a "Stub" attribute that is
going to be largely meaningless, we are just going to skip
pre/post-diff hooks altogether. This is under the notion that we will
eventually not need to "stub" a diff for scale-out, stateless nodes on
refresh at all, so diff behaviour won't be necessary at that point, so
we should not assume that hooks will run at this stage anyway.
Also as part of this removed the CountHook test that is now failing
because CountHook is out of scope of the new behaviour.
We are changing the behaviour of the "stub" diff operation to just have
the pre/post-diff hooks skipped on eval, meaning that the test against
CountHook will ultimately be meaningless and fail, hence we need a
different test here that tests it on a more general level.
The s3.Backend was using it's own code for DeleteState, but the dynamo
entries are only handled through the RemoteClient. Have DeleteState use
a RemoteClient for delete.
During plan and apply, because the provider constraints need to be built
from a plan, they are not checked until the terraform.Context is
created. Since the context is always requested by the backend during the
Operation, the backend needs to be responsible for generating contextual
error messages for the user.
Instead of formatting the ResolveProviders errors during NewContext,
return a special error type, ResourceProviderError to signal that
init will be required. The backend can then extract and format the
errors.
Changed the language of this field to indicate that this diff is not a
"real" diff, in that it should not be acted on, versus a "quiet" mode,
which would indicate just simply to act silently.
This fixes a bug with the new refresh graph behaviour where a resource
was being counted twice in the UI on part of being scaled out:
* We are no longer transforming refresh nodes without state to
plannable resources (the transformer will be removed shortly)
* A Quiet flag has been added to EvalDiff and InstanceDiff - this
allows for the flagging of a diff that should not be treated as real
diff for purposes of planning
* When there is no state for a refresh node now, a new path is taken
that is similar to plan, but flags Quiet, and does nothing with the
diff afterwards.
Tests pending - light testing has confirmed this should fix the double
count issue, but we should have some tests to actually confirm the bug.
Move the Swift State from a legacy remote state to an official backend.
Add `container` and `archive_container` configuration variables, and deprecate `path` and `archive_path` variables.
Future improvements: Add support for locking and environments.
We're shifting terminology from "environment" to "workspace". This takes
care of some of the main internal API surface that was using the old
terminology, though is not intended to be entirely comprehensive and is
mainly just to minimize the amount of confusion for maintainers as we
continue moving towards eliminating the old terminology.
This allows you to run multiple concurrent terraform operations against
different environments from the same source directory.
Fixes#14447.
Also removes some dead code which appears to do the same thing as the function I
modified.
Rather than providing an already-resolved map of plugins to core, we now
provide a "provider resolver" which knows how to resolve a set of provider
dependencies, to be determined later, and produce that map.
This requires the context to be instantiated in a different way, so this
very noisy diff is a mostly-mechanical update of all of the existing
places where contexts get created for testing, using some adapted versions
of the pre-existing utilities for passing in mock providers.
This reverts commit b73d037761.
This commit seems to have introduced a race condition where we can
concurrently keep updating state after we've checked if we need to
increase the serial, and thus end up writing partial changes
to the state backend.
In the case of Terraform Enterprise, this fails altogether because
of the state hash consistency check it does.
Since the DynamoDB table used by the S3 backend is no longer only used
for locks, rename it in the config to remove any confusion about it
being lock-specific.
Consul locks are based on liveness, and may be lost due timeouts,
network issued, etc. If the client determines the lock was lost, attempt
to reacquire the lock immediately.
The client was also not using the `lock` config option. Disable locks if
that is not set.
The S3 client can return (nil, nil) when the remote state doesn't exist.
The caused a nil pointer dereference when checking the payload.MD5
against the expected value.
This can happen if the remote state was manually removed, but the digest
entry was left in the DynamoDB table.
When the backend operation is cancelled, immediately call PersistState.
The is a high likelihood that the user is going to terminate the process
early if the provider doesn't return in a timely manner, so persist as
much state as possible.
Have StateHook periodically call PersistState to flush any cached state
to permanent storage. This uses a minimal 10 second interval between
calls to PersistState.
Updates to objects in S3 are only eventually consistent. If the
RemoteClient has a DynamoDB table available, use that to store a
checksum of the last written state, so the object can be verified by the
next client to call Get.
Terraform currently doesn't have any sort of user feedback around
RefreshState/Get, so we poll only for a short time before returning an
error.
In the old remote state system we had the idea of a local backup, which
is actually still present for the legacy backends but no longer applies
for the new-style backends like the s3 backend.
It's problematic when an apply runs for long enough that someone's
time-limited AWS STS credentials expire and then Terraform fails and can't
persist state to S3.
To reduce the risk of lost state, here we add some extra fallback code
for the local apply operation in particular. If either state writing
or state persisting fail then we attempt to write the state to a special
backup file errored.tfstate, and produce an error message that guides the
user on how to retry uploading this state.
In the unlikely event that we can't write to local disk either (e.g.
permissions problems) we take a last-ditch attempt to dump the JSON onto
stdout and advise the user to manually copy it into a file for import.
If even that doesn't work for some reason, we assume a critical Terraform
bug (JSON-serialization problem with states?) and bail out with an
apologetic error message.
This is implemented for the apply command in particular because this is
the one command where new objects are created in real APIs that we don't
want to lose track of. For other operations it's less bad to just generate
a simple error message and have the user retry.
This fixes#14298.
The backend apply operation doesn't need to output the same text as the
cli itself. Instead notify the user that we are in the process of
stopping the operation.
stringer has changed the boilerplate it generates in a recent version.
We'd previously updated to the new format but accientally rolled back
to the old while merging a long-running feature branch.
This restores us back to the new format again.
The documentation for Refresh indicates that it will always return a
valid state, but that wasn't true in the case of a graph builder error.
While this same concept wasn't documented for Apply, it was still
assumed in the terraform apply code.
Since the helper testing framework relies on the absence of a state to
determine if it can call Destroy, the Context can't can't start
returning a state in all cases. Document this, and use the State method
to fetch the correct state value after Apply.
Add a nil check to the WriteState function, so that writing a nil state
is a noop.
Make sure to init before sorting the state, to make sure we're not
attempting to sort nil values. This isn't technically needed with the
current code, but it's just safer in general.
Prevent extra keys in the s3 envPrefix path from showing up as
listed environments.
Better handle keys containing slashes
Add tests for unexpected keys in s3.
This matches the consul cli behavior, where locks are cleaned up after
use.
Return an error from re-locking the state. This isn't required by the
Locker interface, but it's an added sanity check for state operations.
What was incorrect here was returning an empty ID and error, which would
indicate that Lock/Unlock isn't supported.
Use the aws provider code to create the clients for the s3 backend, so
that all the behavior matches that of the provider.
Remove the fake creds from the test, as the aws provider will attempt to
validate them.
Add fields required to create an appropriate context for all calls to
clistate.Lock.
Add missing checks for Meta.stateLock, where we would attempt to lock,
even if locking should be skipped.
- Have the ui Lock helper use state.LockWithContext.
- Rename the message package to clistate, since that's how it's imported
everywhere.
- Use a more idiomatic placement of the Context in the LockWithContext
args.
golang/tools commit 23ca8a263 changed the format of the leading comment
to comply with some new standards discussed here:
https://golang.org/issue/13560
This is the result of running generate with the latest version of
stringer. Everyone working on Terraform will need to update stringer
after this is merged, to avoid reverting this:
go get -u golang.org/x/tools/cmd/stringer
The backend state tests weren't properly checking for persistence.
Update the test to persist states and fetch them again from the backend,
checking that lineage is preserved.
This adds named state (environment) support to the S3 backend.
A state NAME will prepend the configured s3 key with `env:/NAME/`.
The default state will remain rooted in the bucket for backwards
compatibility.
Locks in DynamoDB use the S3 key as the as the primary key value, so
locking will work as expected for multiple states.
Move the S3 State from a legacy remote state to an official backend.
This increases test coverage, uses a set schema for configuration, and
will allow new backend features to be implemented for the S3 state, e.g.
"environments".
Fixes#12871
We were forgetting to remove the legacy remote state from the actual
state value when migrating. This only causes an issue when saving a plan
since the plan contains the state itself and causes an error where both
a backend + legacy state exist.
If saved plans aren't used this causes no noticable issue.
Due to buggy upgrades already existing in the wild, I also added code to
clear the remote section if it exists in a standard unchanged backend
This allows a refresh on a non-existent or empty state file. We changed
this in 0.9.0 to error which seemed reasonable but it turns out this
complicates automation that runs refresh since it now needed to
determine if the state file was empty before running.
Its easier to just revert this into a warning with exit code zero.
The reason this changed is because in 0.8.x and earlier, the output
would be simply empty with exit code zero which seemed odd.
This adds a "lock" config (default true) to allow users to optionally
disable state locking with Consul. This is necessary if the token given
doesn't have session permission and is necessary for backwards
compatibility.
Add Env and SetEnv methods to command.Meta to retrieve the current
environment name inside any command.
Make sure all calls to Backend.State contain an environment name, and
make the package compile against the update backend package.
What will hopfully be the final version of the Backend interface. This
combines the MultiState interface into Backend since it will be required
to implement, and simplifies the interface because the Backend is no
longer responsible for tracking the current state.
Forgot to remove the currentState field, which was not always set. The
current state should always just be read from the environment file.
Always return the default state name when we can't determine the state.
Split the interface to change environments out from the minimal Backend
interface, to make it optional for backend implementations. If
backend.MultiState isn't implemented, return a "not implemented" from
environment related methods.
Have the Local backend delegate the MultiState methods to the proper
backend.
I made this interface way back with the original backend work and I
guess I forgot to hook it up! This is becoming an issue as I'm working
on our 2nd enhanced backend that requires this information and I
realized it was hardcoded before.
This propertly uses the CLIInit interface allowing any backend to gain
access to this data.
Fixes#12174
You're allowed to refresh with a nil module (no configs) as long as you
have state. However, if `-input=true` (default) then this would crash
since the input attempts to read the configs.
The API contract with `terraform.Context` says that the module tree must
be non-nil and loaded. To do this for other commands we create an empty
module tree. We do that here now.
This prevents Terraform from crashing on apply/destroy with a directory
with no Terraform configuration files. We allow a destroy with no files
but not an apply.
Gove LockInfo a Marshal method for easy serialization, and a String
method for more readable output.
Have the state.Locker implementations use LockError when possible to
return LockInfo and an error.
Use consul locks to implement state locking. The lock path is state path
+ "/.lock" which matches the consul cli default for locks. Lockinfo is
stored at path + "/.lockinfo".
Fixes#11628
This is a simple fix to output warnings. I originally forgot to do this
since the local backend didn't have a CLI UI at the time. It does now so
this is an easy fix.
Have the defer'ed State.Unlock call append any error to the
RunningOperation.Err field. Local error would be rare and
self-correcting, but when the backend.Local is using a remote state the
error may require user intervention.
Fixes#11504
The local backend should error if `terraform plan` is called in a
directory with no Terraform config files (same behavior as 0.8.x).
**New behavior:** We now allow `terraform plan -destroy` with no
configuration files since that seems reasonable.
The local backend implementation is an implementation of
backend.Enhanced that recreates all the behavior of the CLI but through
the backend interface.
Backends are a mechanism that allow abstracting the behavior of
Terraform CLI from the actual core. This allows us to slip in special
behavior such as state loading, remote operations, etc.