
5 Hard Lessons from Building Microservices with Go and gRPC

When I first started building microservices, I had one big misconception: REST APIs are all you need. JSON is human-readable, HTTP is universal, what more could you want?

That belief fell apart once we crossed 10 services.

We were spending half our development time aligning API specs. One mismatched field from Service A would bring down Service B entirely. JSON serialization overhead was the cherry on top. That's when gRPC caught my attention.

Protobuf schemas gave us auto-generated docs and client code. Binary protocol made REST look slow by comparison. "Finally, the suffering is over," I thought. I had no idea what production had in store.

Here are five lessons I learned the hard way from running Go and gRPC in real services across my side projects and solo business.

1. No timeout means cascading bankruptcy

Everything is fast in development. Service A calls B and gets a response in 10ms. So you trust the network. Production networks are not that forgiving.

If A calls B without a timeout and B hangs for five minutes, A just waits. Requests pile up behind A. Each one parks a goroutine that holds memory and a connection, until A runs out of resources and dies too. That's a cascading failure.

Set a context with a timeout on every gRPC call.

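// Give the call at most one second. Inside a request handler, derive from the
// incoming ctx instead of context.Background() so upstream deadlines carry over.
ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel() // release the timer even when the call returns early

r, err := c.SayHello(ctx, &pb.HelloRequest{Name: name})
if err != nil {
    // log.Fatalf is fine in a demo client; a long-running service should
    // handle or return the error rather than exit the process.
    log.Fatalf("could not greet: %v", err)
}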

One second might feel aggressive, but inter-service calls should complete in tens of milliseconds. Failing one call fast is what keeps the whole system alive.
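To make the fail-fast behavior concrete, here is a minimal sketch that runs without a gRPC server. The names `callWithTimeout`, `slowCall`, and `timedOut` are hypothetical helpers I made up for illustration; `slowCall` stands in for a generated client method such as `c.SayHello`. The point it tries to show is that the timeout is derived from a parent context, so whichever deadline is shorter wins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs fn with a context derived from parent, so the call
// stops at whichever comes first: parent's deadline or the given timeout.
// (Hypothetical helper; fn stands in for a gRPC client call.)
func callWithTimeout(parent context.Context, timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // always release the timer, even on success
	return fn(ctx)
}

// slowCall simulates a downstream service that needs latency to respond
// but gives up as soon as its context is done.
func slowCall(latency time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(latency):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether a call with the given latency hits the timeout.
func timedOut(latencyMs, timeoutMs int) bool {
	err := callWithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond,
		slowCall(time.Duration(latencyMs)*time.Millisecond))
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("fast call timed out:", timedOut(5, 200))
	fmt.Println("slow call timed out:", timedOut(500, 50))
}
```

With a real gRPC client the error comes back as a status rather than a bare context error, so the equivalent check is `status.Code(err) == codes.DeadlineExceeded`.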

2. Skip graceful shutdown and your users will leave

When you deploy a new version, Kubernetes tears down the old Pod and spins up a new one. If your app terminates immediately on SIGTERM, in-flight requests get dropped. A user clicks "Place Order" and gets an error page. Did the order go through? Nobody knows.

You need graceful shutdown: stop accepting new requests, then let the in-flight ones finish before the process exits.

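s := grpc.NewServer()
// ... register services ...

// Serve in the background so the main goroutine can wait for a signal.
go func() {
    if err := s.Serve(lis); err != nil {
        log.Fatalf("failed to serve: %v", err)
    }
}()

// Block until Ctrl+C or the SIGTERM that Kubernetes sends before killing the Pod.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
<-sigCh

// GracefulStop refuses new RPCs and returns once pending ones have finished.
log.Println("Gracefully shutting down gRPC server...")
s.GracefulStop()
log.Println("Server gracefully stopped")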

It's a few lines of code. The difference in user experience is night and day.
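One caveat worth sketching: `GracefulStop` waits for every pending RPC, so a single long-lived stream can hold shutdown past the Pod's termination grace period. The sketch below bounds the wait and falls back to a hard stop. `stopWithTimeout`, `fakeServer`, and `drainedInTime` are hypothetical names used only for this illustration; `fakeServer` merely imitates the two shutdown methods so the example runs standalone.

```go
package main

import (
	"fmt"
	"time"
)

// stopWithTimeout calls graceful and waits up to timeout for it to return.
// If it has not returned by then, force is called instead.
// It returns true when the shutdown completed gracefully.
func stopWithTimeout(graceful, force func(), timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		graceful()
		close(done)
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		force()
		return false
	}
}

// fakeServer imitates the shutdown methods of a gRPC server for the demo.
type fakeServer struct {
	drain  time.Duration // how long in-flight work needs to finish
	killed chan struct{} // closed by Stop
}

// GracefulStop blocks until in-flight work is drained or Stop is called.
func (f *fakeServer) GracefulStop() {
	select {
	case <-time.After(f.drain):
	case <-f.killed:
	}
}

// Stop aborts immediately and unblocks a pending GracefulStop.
func (f *fakeServer) Stop() { close(f.killed) }

// drainedInTime reports whether a server that needs drainMs to finish its
// in-flight work shuts down gracefully within limitMs.
func drainedInTime(drainMs, limitMs int) bool {
	s := &fakeServer{
		drain:  time.Duration(drainMs) * time.Millisecond,
		killed: make(chan struct{}),
	}
	return stopWithTimeout(s.GracefulStop, s.Stop, time.Duration(limitMs)*time.Millisecond)
}

func main() {
	fmt.Println("quick drain was graceful:", drainedInTime(5, 500))
	fmt.Println("stuck drain was graceful:", drainedInTime(500, 20))
}
```

In a real service you would pass `s.GracefulStop` and `s.Stop` from the `*grpc.Server`, and pick a timeout a little shorter than the Pod's termination grace period.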

3. Retry smart, not hard

A network switch hiccupping or a packet getting dropped is a daily occurrence. Showing users "Please try again later" for transient failures is lazy engineering.

Most transient errors resolve with one or two retries. But you can't retry everything blindly. Retrying an "insufficient balance" error is pointless. Apply retry policies only to specific gRPC status codes like Unavailable that signal network-level issues.

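Exponential backoff helps too: it spaces retries out so a struggling service is not hit with a burst of repeat requests. In gRPC-Go, retry policies are declared in the client's service config (for example through the grpc.WithDefaultServiceConfig dial option), which lists the retryable status codes, the maximum number of attempts, and the backoff parameters.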
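To show the logic such a policy encodes, here is a hand-rolled sketch rather than the library's own mechanism. Everything in it is an assumption for illustration: `retry` and `attemptsFor` are hypothetical helpers, and the two error values stand in for gRPC status codes. It also leaves out jitter for brevity, which a production retry loop would normally add.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical error values standing in for gRPC status codes.
var (
	errUnavailable         = errors.New("unavailable")          // transient: worth retrying
	errInsufficientBalance = errors.New("insufficient balance") // business error: never retry
)

// retry calls fn up to maxAttempts times. Between attempts it sleeps,
// doubling the wait each time up to maxBackoff. It stops immediately on
// success or on an error that retryable rejects.
func retry(maxAttempts int, initial, maxBackoff time.Duration,
	retryable func(error) bool, fn func() error) (attempts int, err error) {
	backoff := initial
	for attempts = 1; ; attempts++ {
		err = fn()
		if err == nil || !retryable(err) || attempts >= maxAttempts {
			return attempts, err
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

// attemptsFor runs a fake call that fails with failWith for its first
// `failures` invocations, and reports how many attempts were made and
// whether the call finally succeeded.
func attemptsFor(failures int, failWith error) (int, bool) {
	calls := 0
	n, err := retry(4, time.Millisecond, 8*time.Millisecond,
		func(err error) bool { return errors.Is(err, errUnavailable) },
		func() error {
			calls++
			if calls <= failures {
				return failWith
			}
			return nil
		})
	return n, err == nil
}

func main() {
	n, ok := attemptsFor(2, errUnavailable)
	fmt.Println("transient failure: attempts =", n, "succeeded =", ok)

	n, ok = attemptsFor(2, errInsufficientBalance)
	fmt.Println("business error:    attempts =", n, "succeeded =", ok)
}
```

In real gRPC code the predicate would inspect the status instead, for example `status.Code(err) == codes.Unavailable`.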

4. Health checks: not just "are you alive?" but "are you ready?"

A Kubernetes liveness probe checks whether your service is alive. A readiness probe checks whether it can handle traffic. Just checking whether a TCP port is open is not enough. Your service might be alive but in a zombie state with a dead database connection.

gRPC provides a standard health check protocol. If the DB connection drops, return NOT_SERVING so the load balancer stops routing traffic to that instance.

$ grpcurl -plaintext localhost:50051 grpc.health.v1.Health/Check
{
  "status": "SERVING"
}

With readiness tied to this check, traffic only reaches instances that can actually serve it, which is what makes zero-downtime deployments dependable.
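In gRPC-Go the standard protocol ships in the `google.golang.org/grpc/health` package: you create a server with `health.NewServer()`, register it, and flip its state with `SetServingStatus`. The part you have to write yourself is the loop that decides when to flip it. Below is one possible sketch of that loop, kept free of gRPC imports so it runs standalone. `watchDependency` and `observe` are hypothetical names; in a real service `ping` might be `db.PingContext` and `setServing` a small closure around `SetServingStatus`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// watchDependency probes a dependency every interval and reports the result
// through setServing until ctx is cancelled.
func watchDependency(ctx context.Context, interval time.Duration,
	ping func(context.Context) error, setServing func(bool)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if ctx.Err() != nil {
			return
		}
		// Bound each probe so a hung dependency counts as a failure.
		probeCtx, cancel := context.WithTimeout(ctx, interval)
		err := ping(probeCtx)
		cancel()
		setServing(err == nil)

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// observe runs the watcher for 100ms against a fake dependency that starts
// failing after failAfter successful probes. It returns the first and the
// last status the watcher reported.
func observe(failAfter int) (first, last bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	probes := 0
	ping := func(context.Context) error {
		probes++
		if probes > failAfter {
			return errors.New("db connection lost")
		}
		return nil
	}

	reports := 0
	setServing := func(ok bool) {
		if reports == 0 {
			first = ok
		}
		last = ok
		reports++
	}

	watchDependency(ctx, 5*time.Millisecond, ping, setServing)
	return first, last
}

func main() {
	first, last := observe(2)
	fmt.Println("dependency dies:    first =", first, "last =", last)

	first, last = observe(1000000)
	fmt.Println("dependency healthy: first =", first, "last =", last)
}
```

A real service would start the watcher in its own goroutine next to the gRPC server and cancel the context during shutdown.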

5. Interceptors: the foundation of observability

When 20 services are wired together and something breaks, finding the culprit is a nightmare. If you can't trace which services a user request touched and what happened at each hop, you're left grep-ing through logs across a dozen systems.

gRPC interceptors are middleware that wrap every request and response. They let you automatically log calls and propagate trace IDs for distributed tracing.

// loggingInterceptor logs the method, duration, and error of every unary RPC.
func loggingInterceptor(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
    startTime := time.Now()
    resp, err := handler(ctx, req) // run the actual RPC handler
    log.Printf("method=%s duration=%s error=%v", info.FullMethod, time.Since(startTime), err)
    return resp, err
}

// Register the interceptor when the server is created.
s := grpc.NewServer(
    grpc.UnaryInterceptor(loggingInterceptor),
)

This one interceptor is the foundation of your observability stack. It tells you where bottlenecks are and where errors originate, at a glance.
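The trace ID half of the story follows the same wrap-the-handler pattern. Here is a rough sketch of it using only the standard library, so it runs without gRPC. The `unaryHandler` type only mirrors the shape of `grpc.UnaryHandler`, and `withTraceID`, `traceIDFrom`, `newTraceID`, and `seenTraceID` are hypothetical names. In a real interceptor the incoming ID would be read from request metadata (`metadata.FromIncomingContext`) and forwarded to downstream calls (`metadata.AppendToOutgoingContext`) rather than stored as a plain context value.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// unaryHandler mirrors the shape of grpc.UnaryHandler without the import.
type unaryHandler func(ctx context.Context, req any) (any, error)

// traceKey is an unexported context key type, which avoids collisions.
type traceKey struct{}

// newTraceID returns 16 random hex characters.
func newTraceID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// traceIDFrom returns the trace ID stored in ctx, or "" if there is none.
func traceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// withTraceID wraps next so every request carries a trace ID: an incoming
// ID is kept as-is, a missing one is generated.
func withTraceID(next unaryHandler) unaryHandler {
	return func(ctx context.Context, req any) (any, error) {
		if traceIDFrom(ctx) == "" {
			ctx = context.WithValue(ctx, traceKey{}, newTraceID())
		}
		return next(ctx, req)
	}
}

// seenTraceID sends one request through the wrapper and returns the trace
// ID the inner handler observed. An empty incoming value means "no ID yet".
func seenTraceID(incoming string) string {
	ctx := context.Background()
	if incoming != "" {
		ctx = context.WithValue(ctx, traceKey{}, incoming)
	}
	h := withTraceID(func(ctx context.Context, req any) (any, error) {
		return traceIDFrom(ctx), nil
	})
	resp, _ := h(ctx, "ping")
	return resp.(string)
}

func main() {
	fmt.Println("existing ID kept:", seenTraceID("abc123"))
	fmt.Println("generated ID:    ", seenTraceID(""))
}
```

Once every hop logs the same ID, the log line from the logging interceptor becomes searchable across services instead of per service.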

Wrapping up

gRPC is a solid tool. But it's not a silver bullet. Understanding the complexity and unpredictability of production, and designing defensively around it, is the real skill.

Timeouts, graceful shutdown, retries, health checks, interceptors. These aren't just technical tips. They're about the attitude of building services that don't wake you up at 3 AM.