What happened?
After upgrading com.spotify:dns from version 3.1.5 to 3.2.2, some of our services started getting intermittent SERVFAIL errors on SRV lookups, even though the service being looked up exists.
What was expected?
As far as we can tell, there is no breaking change in the API of com.spotify:dns between these versions, so we expected the upgrade not to affect functionality.
How to reproduce
We have not found a reliable way to reproduce this, and we have not pinned down the cause. It seems to be related to concurrency, since the problem does not always appear. I am more than glad to show the issue happening in a service.
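Since the failures look concurrency-related, a small stress harness might help narrow this down. The sketch below is only an illustration and is not something we have run against the real resolver: the class name, the `countFailures` helper, and the stand-in lookup are all hypothetical. It assumes that a failed lookup surfaces as an unchecked exception, as the `DnsException` in the trace below does; the real resolver's `resolve` call would be plugged in where the stand-in lookup is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ConcurrentLookupStress {

  /**
   * Calls `lookup` for `fqdn` from `threads` threads, `iterations` times per thread,
   * and returns how many calls threw an unchecked exception.
   */
  public static int countFailures(Function<String, List<String>> lookup, String fqdn,
                                  int threads, int iterations) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger failures = new AtomicInteger();
    try {
      List<Callable<Void>> tasks = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        tasks.add(() -> {
          for (int i = 0; i < iterations; i++) {
            try {
              lookup.apply(fqdn);
            } catch (RuntimeException e) {
              // Assumption: a SERVFAIL shows up as an unchecked exception.
              failures.incrementAndGet();
            }
          }
          return null;
        });
      }
      // invokeAll blocks until every task has finished.
      for (Future<Void> f : pool.invokeAll(tasks)) {
        f.get();
      }
    } finally {
      pool.shutdown();
    }
    return failures.get();
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical stand-in lookup; replace with a call to the real SRV resolver.
    Function<String, List<String>> lookup = name -> List.of("host-a:8080");
    int failures = countFailures(lookup, "example._grpc.services.example.invalid", 8, 100);
    System.out.println("failed lookups: " + failures);
  }
}
```

Running this once per library version (3.1.5 and 3.2.2) against the same name and comparing the failure counts could show whether the problem depends on the number of concurrent lookups.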
Context
We need to upgrade dnsjava:dnsjava from version 2.x to 3.x. We checked that com.spotify:dns made this change in version 3.2.0. We tested the upgrade in some services and they seemed to work fine, so we decided to roll the change out to all of our users. Some of those services, as far as we can see the ones using gRPC, then started getting SERVFAIL intermittently.
Here is an anonymised stack trace:
Jul 15, 2021 4:29:20 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<38>: (${PROTOCOL}://${SERVICE})] Failed to resolve name. status=Status{code=UNAVAILABLE, description=null, cause=java.util.concurrent.CompletionException: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:683)
at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:658)
at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2094)
at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$4(DnsSrvNameResolver.java:160)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL
at com.spotify.dns.XBillDnsSrvResolver.resolve(XBillDnsSrvResolver.java:60)
at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$0(DnsSrvNameResolver.java:162)
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:680)
... 6 more
}
We tried bumping the version of dnsjava:dnsjava from 3.0.2 to 3.4.0 and the problem seemed to go away, but after the service had been running for about 10 minutes it started again. I am not sure whether this was a local problem.
Running dig srv ${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS} returns hosts as expected. Changing the versions back to com.spotify:dns:3.1.5 and dnsjava:dnsjava:2.x makes the problem go away.
Java version used during the test:
$ java -version
> openjdk version "11.0.10" 2021-01-19 LTS
> OpenJDK Runtime Environment Corretto-11.0.10.9.1 (build 11.0.10+9-LTS)
> OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (build 11.0.10+9-LTS, mixed mode)