Chapter 3:

The Guide to Google Pub/Sub

August 9, 2021

Messaging frameworks are an integral part of the big data stacks of modern applications. These frameworks act as the primary communication medium between all services within the stack while ensuring security and efficiency.

There are two types of messaging frameworks: message queues and distributed publish-subscribe (pub/sub) services. Message queues like RabbitMQ, ActiveMQ, and ZeroMQ are mostly built on a traditional message broker architecture that is generally suited to point-to-point communication. These frameworks have limited support for large-scale streaming models.

In comparison, distributed pub/sub services are designed for data scalability and customization. As a result, these services are ideal for many big data streaming applications. Here, we’ll take a deep dive into one of the most popular frameworks in the distributed pub/sub category.

Google Pub/Sub pros and cons

Google Pub/Sub is an asynchronous, real-time messaging service backed by Google Cloud. Given the inherent workflow of messaging frameworks, the cloud-based model has some pros and cons.

Control

Pub/Sub, like many other cloud services, abstracts away many underlying configuration settings. This reduces complexity, but it often comes at the cost of limited control. Other services, like Kafka, provide options for standalone deployments and offer more configuration control, such as choosing how long messages are retained (a Google Pub/Sub subscription keeps unacknowledged messages for at most seven days).

Operational Cost

Fault tolerance and failure recovery make up a considerable part of operational costs for distributed systems. Fault tolerance is generally achieved by replicating data and services inside data centers across geographic regions. 

One advantage of a cloud-based service is the offloading of operational tasks. With Google Pub/Sub, fault tolerance and failure recovery are managed by Google internally. On the replication side, all Pub/Sub messages are automatically replicated across zones.

Message Retention

The requirement to keep undelivered messages may vary, depending on the use case. In some cases, this decision is governed by business-level demands such as costs rather than technical requirements. Unfortunately, a Google Pub/Sub subscription retains an unacknowledged message for at most seven days after it has been published, and this upper limit cannot be raised.

API

An important factor when working with big data cloud systems is their support for different technologies and languages. Google Pub/Sub provides client libraries for multiple languages, including Python, C++, C#, Go, Java, Node.js, and Ruby. It also provides a graphical console that makes it easier for non-technical users to work with the platform.

Reliability

Google Pub/Sub runs on top of Google data centers distributed around the globe. Synchronous cross-zone message replication and per-message receipt tracking ensure reliable delivery without sacrificing scalability, making Google Pub/Sub an effective message processing solution.

Google Pub/Sub vs. Kafka vs. Amazon SQS vs. Macrometa

Google Pub/Sub is just one option for a distributed messaging framework. Apache Kafka, Amazon SQS, and Macrometa are all alternatives to Google Pub/Sub with their own features and benefits.

| Framework | Deployment | Retention | Replication | Security | API | Ordering | Price | Retrieval | Scalability |
|---|---|---|---|---|---|---|---|---|---|
| Google Pub/Sub | Cloud based | 7 days | Built in | Built in | Multiple | Manual, key based | Medium | Push/pull | High |
| Kafka | Multiple | Configurable | Manual | Kerberos | Java | Automatic | Free | Pull | High |
| Amazon SQS | Queue based | Configurable | None | Built in | REST | Automatic | Low | Pull | Lower |
| Macrometa | Cloud based | Configurable | Built in | Built in | Multiple | Automatic | Low | Push/pull | High |

Getting started with Google Pub/Sub

In addition to the basic concepts behind distributed messaging and queues, there are three core concepts to understand before getting started with Google Pub/Sub: topics, subscriptions, and messages.

Topic

A topic is a named resource to which messages are sent. Messages are published to a topic by publisher applications and read from the topic by subscriber endpoints. The communication can be (1) one publisher to multiple subscribers, (2) multiple publishers to a single subscriber, or (3) multiple publishers to multiple subscribers.

Subscription

Subscriptions, as the name suggests, are the same as “subscribers” in many publish-subscribe services. Essentially, a subscription presents a stream of messages from a given topic. 

In the stream processing context, Google Pub/Sub ensures at-least-once message delivery as explained in this article. However, this guarantee only applies to subscriptions that exist when a message is published. Therefore, a message published before a subscription is created will not be delivered to that subscription.

Another important consideration when working with Google Pub/Sub subscriptions is the requirement to explicitly acknowledge each message once it has been consumed. Pub/Sub treats a message as undelivered until the subscriber acknowledges it. If a message is left unacknowledged, Pub/Sub keeps trying to redeliver it. Unacknowledged messages are kept in the system for up to seven days.

A Google Pub/Sub subscription can be pull-based or push-based. In a pull-based configuration, the Pub/Sub service holds messages for the subscription, and the subscriber must explicitly request them. Messages are served until none are left, at which point a pull request returns an empty response or times out. In a push-based configuration, an active server-side component delivers each message to the subscriber's endpoint as soon as it is received. As with the pull-based method, the subscriber must acknowledge each message; a push endpoint does so by returning a success status code.
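
A pull-based subscriber can be sketched as one request followed by one acknowledgment. The helper below is a minimal illustration rather than code from this guide: `pull_and_ack` and `handle` are hypothetical names, and `subscriber` is assumed to be a `pubsub_v1.SubscriberClient` from google-cloud-pubsub 2.x.

```python
def pull_and_ack(subscriber, subscription_path, handle, max_messages=10):
    """Pull up to `max_messages` messages once, process them, then acknowledge them.

    `subscriber` is assumed to be a google.cloud.pubsub_v1.SubscriberClient;
    `handle` is a hypothetical callback that receives each payload as bytes.
    Returns the number of messages acknowledged.
    """
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": max_messages}
    )
    ack_ids = []
    for received in response.received_messages:
        handle(received.message.data)
        # Collect the ack ID only after the handler returns without raising
        ack_ids.append(received.ack_id)

    if ack_ids:
        # Messages that are never acknowledged are redelivered later
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
    return len(ack_ids)
```

Calling `pull_and_ack(subscriber, subscription_path, print)` in a loop would drain the subscription in batches; if `handle` raises, nothing from that batch is acknowledged and Pub/Sub delivers those messages again.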

Message

A message is the unit of communication in the system. It consists of application-specific data and optional attributes, which are set as key-value pairs when the message is defined. Messages are sent to a topic and read from subscription endpoints. Messages in the Google Pub/Sub platform follow a specific flow from the publisher to the subscriber.

  1. A publisher endpoint outside of the Pub/Sub framework creates a topic. Once the topic is created, the publisher connects to it and publishes a message to it.
  2. Once received by a topic inside the Pub/Sub system, the message is retained for subscribers for up to seven days.
  3. A subscriber endpoint outside the Pub/Sub framework creates a subscription to an existing topic (note: the subscription must exist before a message is published). Google Pub/Sub ensures that each message is delivered to every subscription of the topic.
  4. The message is forwarded to the subscriber endpoint outside of the Pub/Sub framework.
  5. The subscriber endpoint must acknowledge receipt of the message to the subscription. Once acknowledgment is received for a message, the Pub/Sub framework marks the message as delivered and removes it from storage.
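
The flow above can be mimicked with a toy in-memory model. This is only a sketch for building intuition: the `Topic` class and its methods are hypothetical, and it ignores retention limits, ack deadlines, and everything else the real service does.

```python
import itertools


class Topic:
    """Toy in-memory model of a topic; not the real Pub/Sub implementation."""

    def __init__(self):
        self.subscriptions = {}
        self._ids = itertools.count(1)

    def subscribe(self, name):
        # A new subscription starts empty: earlier messages are not replayed
        self.subscriptions[name] = {}
        return name

    def publish(self, data):
        message_id = next(self._ids)
        # Fan out: every existing subscription gets its own copy to acknowledge
        for pending in self.subscriptions.values():
            pending[message_id] = data
        return message_id

    def pull(self, name):
        # Unacknowledged messages stay pending, so they are delivered again
        return list(self.subscriptions[name].items())

    def ack(self, name, message_id):
        # Acknowledging removes the message from this subscription only
        self.subscriptions[name].pop(message_id, None)


topic = Topic()
topic.publish(b"published before any subscription exists")
topic.subscribe("sub-a")
topic.subscribe("sub-b")
first = topic.publish(b"hello")

print(topic.pull("sub-a"))  # only the message published after subscribing
topic.ack("sub-a", first)
print(topic.pull("sub-a"))  # nothing left for sub-a
print(topic.pull("sub-b"))  # still pending for sub-b
```

The model shows the three behaviors described in this section: a message published before a subscription exists never reaches it, each subscription receives its own copy, and a message stays pending until that subscription acknowledges it.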

Google Pub/Sub messaging example

Google Pub/Sub offers support for multiple programming languages. Here is a step-by-step example of how the Google Pub/Sub system works with Python.

  1. Follow the steps at this link to get a JSON credentials file: https://cloud.google.com/pubsub/docs/quickstart-client-libraries. All references to PUBSUB_SERVICE_ACCOUNT_JSON refer to the path of this file.
  2. Install the Google Cloud SDK, available here: https://cloud.google.com/sdk/docs/install
  3. Install the Python client library for Google Pub/Sub using this command:

pip install --upgrade google-cloud-pubsub

  4. Create a topic in publisher.py

import json

from google.auth import jwt
from google.cloud import pubsub_v1

# Path to the JSON credentials file downloaded in step 1 (placeholder)
PUBSUB_SERVICE_ACCOUNT_JSON = "PATH_TO_SERVICE_ACCOUNT_JSON"

project_id = "PROJECT_ID"
topic_id = "TOPIC_NAME"

with open(PUBSUB_SERVICE_ACCOUNT_JSON) as f:
    service_account_info = json.load(f)

# Build JWT credentials for the Publisher API
publisher_audience = "https://pubsub.googleapis.com/google.pubsub.v1.Publisher"
credentials_pub = jwt.Credentials.from_service_account_info(
    service_account_info, audience=publisher_audience
)
publisher = pubsub_v1.PublisherClient(credentials=credentials_pub)

# The `topic_path` method creates a fully qualified identifier
# in the form `projects/{project_id}/topics/{topic_id}`
topic_path = publisher.topic_path(project_id, topic_id)

# Create the topic (assumes google-cloud-pubsub 2.x; fails if it already exists)
publisher.create_topic(request={"name": topic_path})

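  5. Subscribe to the topic in subscriber.py

import json

from google.auth import jwt
from google.cloud import pubsub_v1

# Path to the JSON credentials file downloaded in step 1 (placeholder)
PUBSUB_SERVICE_ACCOUNT_JSON = "PATH_TO_SERVICE_ACCOUNT_JSON"

project_id = "PROJECT_ID"
# Topic to subscribe to
topic_id = "TOPIC_NAME"
# Subscription that receives messages from the topic (placeholder name)
subscription_id = "SUBSCRIPTION_NAME"
# Number of seconds the subscriber should listen for messages
timeout = 300.0

with open(PUBSUB_SERVICE_ACCOUNT_JSON) as f:
    service_account_info = json.load(f)

audience = "https://pubsub.googleapis.com/google.pubsub.v1.Subscriber"
credentials = jwt.Credentials.from_service_account_info(
    service_account_info, audience=audience
)
subscriber = pubsub_v1.SubscriberClient(credentials=credentials)

# The `subscription_path` method creates a fully qualified identifier
# in the form `projects/{project_id}/subscriptions/{subscription_id}`
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Create the subscription on the topic (assumes google-cloud-pubsub 2.x; fails if it already exists)
topic_path = "projects/{}/topics/{}".format(project_id, topic_id)
subscriber.create_subscription(request={"name": subscription_path, "topic": topic_path})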

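  6. Listen for messages in subscriber.py and register a callback to handle each message

import logging
from concurrent.futures import TimeoutError

log = logging.getLogger(__name__)

def callback(message):
    log.debug("Received message: {}".format(message))
    log.debug(message.data.decode("utf-8"))
    # further process received message
    # Acknowledge only after processing, so a failure leads to redelivery
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print("Listening for messages on {}..\n".format(subscription_path))

# The `with` block closes the subscriber client when listening ends
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result(timeout=timeout)
    except TimeoutError:
        streaming_pull_future.cancel()  # Trigger the shutdown
        streaming_pull_future.result()  # Block until the shutdown is complete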


  7. Prepare and send/publish messages in publisher.py

msg_str = "This is a test message published to Google PubSub"
pubsub_data = msg_str.encode("utf-8")
# When you publish a message, the client returns a future.
future = publisher.publish(topic_path, data=pubsub_data)
# result() blocks until the publish completes and returns the message ID
print(future.result())
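
The Message section notes that optional attributes can be set as key-value pairs. A minimal sketch of that, assuming `publisher` is the `pubsub_v1.PublisherClient` created above; the helper name `publish_with_attributes` and the attribute names in the usage note are hypothetical:

```python
def publish_with_attributes(publisher, topic_path, text, **attributes):
    """Publish `text` with optional string attributes and return the message ID.

    `publisher` is assumed to be a google.cloud.pubsub_v1.PublisherClient.
    """
    data = text.encode("utf-8")  # the payload must be bytes
    # Attributes are passed as keyword arguments; keys and values must be strings
    future = publisher.publish(topic_path, data, **attributes)
    # result() blocks until the service returns the message ID
    return future.result()
```

For example, `publish_with_attributes(publisher, topic_path, "order created", origin="checkout", version="1")` attaches two attributes, which a subscriber callback can read from `message.attributes`.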

Google Pub/Sub considerations

Like other cloud platforms, there are tradeoffs and implementation decisions to consider when using the Google Pub/Sub platform. Two of the most important considerations are costs and message routing vs. message processing. 

Costs 

Since Google Pub/Sub is a cloud service, cost is a factor to consider and track over time. For instance, sending 10,000 log records per second, where each record is about 5 KB in size, translates to about 4.3 TB of data per day. This amounts to approximately $4,500 per month in Google Pub/Sub costs alone.
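
As a rough check of this arithmetic, the sketch below recomputes the daily volume from the record rate and size. The price per TiB is an assumed, illustrative parameter rather than a figure from this guide, and only published bytes are counted, so the monthly result is an order-of-magnitude estimate, not a quote.

```python
RECORDS_PER_SECOND = 10_000
RECORD_SIZE_KB = 5
SECONDS_PER_DAY = 24 * 60 * 60
# Assumption: a flat price per TiB of throughput; check the current price list
ASSUMED_PRICE_PER_TIB_USD = 40.0

bytes_per_day = RECORDS_PER_SECOND * RECORD_SIZE_KB * 1_000 * SECONDS_PER_DAY
tb_per_day = bytes_per_day / 1e12
tib_per_month = bytes_per_day * 30 / 2**40
monthly_cost = tib_per_month * ASSUMED_PRICE_PER_TIB_USD

print(f"{tb_per_day:.2f} TB per day")
print(f"{tib_per_month:.1f} TiB per 30-day month")
print(f"~${monthly_cost:,.0f} per month at the assumed price")
```

Under these assumptions the estimate lands in the same range as the figure quoted above; changing the assumed price or counting subscriber traffic as well would move it.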

A micro-batching strategy can help reduce these Google Pub/Sub costs. Instead of writing each record directly to Google Pub/Sub, you can combine records into a file stored on Google Cloud Storage and then publish only the file path to Pub/Sub. Subscribers read the data from the specified path and process it as required. While this strategy may reduce costs, it adds delay, so processing becomes near real-time at best.
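
One way to sketch the micro-batching idea is a small buffer that writes a file per batch and publishes only its path. Everything here is illustrative: `MicroBatcher`, `upload`, and `publish` are hypothetical names, the two callables stand in for a storage client and a Pub/Sub publisher, and the newline-delimited JSON layout is just one possible file format.

```python
import json


class MicroBatcher:
    """Buffer records and publish one file path per batch instead of one message per record."""

    def __init__(self, upload, publish, max_records=1000):
        self.upload = upload      # hypothetical: (name, payload_bytes) -> path of the stored file
        self.publish = publish    # hypothetical: (data_bytes) -> None, e.g. wraps publisher.publish
        self.max_records = max_records
        self.buffer = []
        self.batch_number = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One newline-delimited JSON file per batch
        payload = "\n".join(json.dumps(r) for r in self.buffer).encode("utf-8")
        path = self.upload(f"batch-{self.batch_number:06d}.jsonl", payload)
        # Subscribers receive only the path and read the file themselves
        self.publish(path.encode("utf-8"))
        self.batch_number += 1
        self.buffer = []
```

A record now waits in the buffer until the batch fills or `flush()` is called, which is the latency cost described above. A fuller implementation would probably also flush on a timer so that slow periods do not hold records indefinitely.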

Message Routing vs. Message Processing

In stream processing, it’s important to process incoming messages with minimal latency. At the same time, it’s also important to limit overall system complexity to ensure better management and recovery. While systems like Kafka provide message routing and processing in one package, Google Pub/Sub handles only routing and relies on separate tools, such as Apache Beam, to process the data. Both strategies have their pros and cons, and it’s up to the developer to weigh the tradeoffs based on message volume and application latency requirements.

Conclusion

Google Pub/Sub is a popular choice as a commercial, cloud-based messaging framework. It is scalable and reliable, with built-in replication and security, and offers support for many programming languages. But Google Pub/Sub has its limitations, such as a limited retention time and a lack of configuration control. Google Pub/Sub is also an isolated service, which requires you to integrate and manage multiple Google services (compute, analytics, database, caching) to create a complete stream processing platform for your application. As more teams try to adopt an edge-native approach, consider whether the advantages of the individual services outweigh the complexities of a large toolchain.

Learn more about Macrometa’s edge computing platform with integrated pub/sub messaging and a stream processing engine.