Difference between revisions of "Protobuf notes"
m |
m (Create sub-section "Handling of values of zero".) |
||
(22 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | + | == Overview == | |
+ | |||
+ | There's a lot going on with Protobuf, therefore this local Neela Nurseries page started to capture some links to online docs and tutorials. | ||
* https://protobuf.dev/getting-started/cpptutorial/ | * https://protobuf.dev/getting-started/cpptutorial/ | ||
+ | |||
* https://protobuf.dev/programming-guides/proto3/ | * https://protobuf.dev/programming-guides/proto3/ | ||
− | Also a good more detailed introduction: | + | Also a good more detailed introduction, the following documentation page is one of a full collection of documentations. It contains example C code to "get the size of the message without storing it anywhere": |
* https://chromium.googlesource.com/external/github.com/nanopb/nanopb/+/refs/heads/master/docs/concepts.md | * https://chromium.googlesource.com/external/github.com/nanopb/nanopb/+/refs/heads/master/docs/concepts.md | ||
+ | |||
Another reference with possible good detail: | Another reference with possible good detail: | ||
* https://www.swi-prolog.org/pldoc/man?section=protobufs-tags | * https://www.swi-prolog.org/pldoc/man?section=protobufs-tags | ||
+ | |||
+ | Encoding details of protobuf "on the wire": | ||
+ | |||
+ | * https://protobuf.dev/programming-guides/encoding/ | ||
+ | |||
+ | == [[#top|^]] Terminology and Elements Of == | ||
+ | |||
+ | Protobuf has several features and "moving parts". Two of these, which we'll loop back to and write a bit more about, are field names and field numbers. Both are important identifiers expressed in the .proto message definition files used at build time by senders and receivers of protobuf formatted messages. Because they're fixed at build time, changing these name and numeric field identifiers after software has been built and released can, and likely will, cause message interpretation errors: older software sending and receiving the updated messages does not understand the newer message format. | ||
+ | |||
+ | Protobuf has a means for handling cases where field names and numbers need to be "removed". This means involves changing the message definitions to mark those defunct identifiers as reserved. They cannot be re-used, different names and numbers must be chosen, but they can be taken "out of circulation". Read further at: | ||
+ | |||
+ | * https://protobuf.dev/programming-guides/proto3/#assigning | ||
+ | |||
+ | == [[#top|^]] To Factor Protobuf Message Definitions == | ||
+ | |||
+ | The following online guide speaks to defining multiple related protobuf messages in a single .proto file: | ||
+ | |||
+ | * https://protobuf.dev/programming-guides/proto3/#adding-types | ||
+ | |||
+ | == [[#top|^]] Nanopb == | ||
+ | |||
+ | 2022-01-08 Saturday | ||
+ | |||
+ | * https://github.com/nanopb/nanopb/blob/master/generator/proto/nanopb.proto | ||
+ | * https://jpa.kapsi.fi/nanopb/docs/whats_new.html | ||
+ | * https://jpa.kapsi.fi/nanopb/docs/ | ||
+ | |||
+ | * https://docs.python.org/3/tutorial/modules.html | ||
+ | |||
+ | Cmake script to locate Nanopb headers and sources: | ||
+ | |||
+ | * https://chromium.googlesource.com/external/github.com/nanopb/nanopb/+/nanopb-0.2.9.1/extra/FindNanopb.cmake | ||
+ | |||
+ | A nanopb API reference: | ||
+ | |||
+ | * https://jpa.kapsi.fi/nanopb/docs/reference.html | ||
+ | |||
+ | * https://jpa.kapsi.fi/nanopb/docs/reference.html#pb_encode | ||
+ | |||
+ | * https://jpa.kapsi.fi/nanopb/docs/reference.html#pb_encode_submessage | ||
+ | |||
+ | A further reference from University of Hannover: | ||
+ | |||
+ | * https://gitlab.uni-hannover.de/tci-gateway-module/grpc/-/blob/47a06ace92d0db299e6fa9ecc9a9d26db8d85c62/third_party/nanopb/docs/reference.rst#pb-encode | ||
+ | |||
+ | |||
+ | pb_encode.c has an interesting function . . . | ||
+ | |||
+ | <pre> | ||
+ | 258 /* Encode a field with callback semantics. This means that a user function is | ||
+ | 259 * called to provide and encode the actual data. */ | ||
+ | 260 static bool checkreturn encode_callback_field(pb_ostream_t *stream, | ||
+ | 261 const pb_field_t *field, const void *pData) | ||
+ | 262 { | ||
+ | 263 const pb_callback_t *callback = (const pb_callback_t*)pData; | ||
+ | 264 | ||
+ | 265 #ifdef PB_OLD_CALLBACK_STYLE | ||
+ | 266 const void *arg = callback->arg; | ||
+ | 267 #else | ||
+ | 268 void * const *arg = &(callback->arg); | ||
+ | 269 #endif | ||
+ | 270 | ||
+ | 271 if (callback->funcs.encode != NULL) | ||
+ | 272 { | ||
+ | 273 if (!callback->funcs.encode(stream, field, arg)) | ||
+ | 274 PB_RETURN_ERROR(stream, "callback error"); | ||
+ | 275 } | ||
+ | 276 return true; | ||
+ | 277 } | ||
+ | </pre> | ||
+ | |||
+ | === [[#top|^]] Handling of values of zero === | ||
+ | |||
+ | To save space nanopb (and maybe protobuf by its specification) does not encode values of zero and their fields, and during decoding the protobuf implementation assumes that missing fields carry a value of zero: | ||
+ | |||
+ | * https://github.com/nanopb/nanopb/issues/696 | ||
+ | |||
+ | === [[#top|^]] Set Up Errors === | ||
+ | |||
+ | * https://github.com/zephyrproject-rtos/zephyr/issues/70065 | ||
+ | |||
+ | == [[#top|^]] Protobuf C Code Examples == | ||
When compiling nanopb Protobuf library as part of C language programs, nested Protobuf messages require use of nanopb defined function type `pb_callback_t` in order to encode and to decode those nested messages. Some examples of this on github: | When compiling nanopb Protobuf library as part of C language programs, nested Protobuf messages require use of nanopb defined function type `pb_callback_t` in order to encode and to decode those nested messages. Some examples of this on github: | ||
Line 69: | Line 155: | ||
batched->device_id.funcs.encode = encode_device_id_string; | batched->device_id.funcs.encode = encode_device_id_string; | ||
} | } | ||
+ | |||
+ | A search for calls to `pack_batched_periodic_data()`: | ||
+ | |||
+ | $ grep -nr pack_batched_periodic_data ./* | ||
+ | ./commands.c:1069: pack_batched_periodic_data(&data_batched, &periodicdata); | ||
+ | ./proto_utils.c:158:void pack_batched_periodic_data(batched_periodic_data* batched, periodic_data_to_encode* encode_wrapper) | ||
+ | ./proto_utils.h:29:void pack_batched_periodic_data(batched_periodic_data* batched, periodic_data_to_encode* encode_wrapper); | ||
</pre> | </pre> | ||
+ | |||
+ | Tracing yet further back kitsune project commands.c has following routine which declares and uses a `periodic_data` type: | ||
+ | |||
+ | <pre> | ||
+ | 1038 void thread_tx(void* unused) { | ||
+ | 1039 batched_periodic_data data_batched = {0}; | ||
+ | 1040 #ifdef UPLOAD_AP_INFO | ||
+ | 1041 batched_periodic_data_wifi_access_point ap; | ||
+ | 1042 #endif | ||
+ | 1043 periodic_data forced_data; | ||
+ | 1044 bool got_forced_data = false; | ||
+ | 1045 | ||
+ | 1046 LOGI(" Start polling \n"); | ||
+ | 1047 while (1) { | ||
+ | 1048 if (uxQueueMessagesWaiting(data_queue) >= data_queue_batch_size | ||
+ | 1049 || got_forced_data ) { | ||
+ | 1050 LOGI( "sending data\n" ); | ||
+ | 1051 | ||
+ | 1052 periodic_data_to_encode periodicdata; | ||
+ | 1053 periodicdata.num_data = 0; | ||
+ | 1054 periodicdata.data = (periodic_data*)pvPortMalloc(MAX_BATCH_SIZE*sizeof(periodic_data)); | ||
+ | 1055 | ||
+ | 1056 if( !periodicdata.data ) { | ||
+ | 1057 LOGI( "failed to alloc periodicdata\n" ); | ||
+ | 1058 vTaskDelay(1000); | ||
+ | 1059 continue; | ||
+ | 1060 } | ||
+ | 1061 if( got_forced_data ) { | ||
+ | 1062 memcpy( &periodicdata.data[periodicdata.num_data], &forced_data, sizeof(forced_data) ); | ||
+ | 1063 ++periodicdata.num_data; | ||
+ | 1064 } | ||
+ | 1065 while( periodicdata.num_data < MAX_BATCH_SIZE && xQueueReceive(data_queue, &periodicdata.data[periodicdata.num_data], 1 ) ) { | ||
+ | 1066 ++periodicdata.num_data; | ||
+ | 1067 } | ||
+ | 1068 | ||
+ | 1069 pack_batched_periodic_data(&data_batched, &periodicdata); | ||
+ | 1070 | ||
+ | 1071 data_batched.has_uptime_in_second = true; | ||
+ | 1072 data_batched.uptime_in_second = xTaskGetTickCount() / configTICK_RATE_HZ; | ||
+ | 1073 | ||
+ | 1074 if( !is_test_boot() && provisioning_mode ) { | ||
+ | |||
+ | . . . | ||
+ | </pre> | ||
+ | |||
+ | In this kitsune project see also `kitsune/kitsune/protobuf/provision.pb.h`. | ||
+ | |||
+ | == [[#top|^]] To Predetermine Encoded Data Size == | ||
+ | |||
+ | This section might also be titled "To Determine Encoded Data Size at Build Time". | ||
+ | |||
+ | Often not possible with messages of variable and unknown length, the question of determining maximum possible encoded message size at build time can be useful for projects where messages have fixed field counts in a given message. This section collects what tools and online public discussions cover this protobuf analysis topic. | ||
+ | |||
+ | * https://stackoverflow.com/questions/30915704/maximum-serialized-protobuf-message-size | ||
+ | |||
+ | <!-- | ||
+ | Stack Overflow discussion and helpful answer by Kenton Varda: | ||
+ | |||
+ | " | ||
+ | 33 | ||
+ | |||
+ | In general, any Protobuf message can be any length due to the possibility of unknown fields. | ||
+ | |||
+ | If you are receiving a message, you cannot make any assumptions about the length. | ||
+ | |||
+ | If you are sending a message that you built yourself, then you can perhaps assume that it only contains fields you know about -- but then again, you can also easily compute the exact message size in this case. | ||
+ | |||
+ | Thus it's usually not useful to ask what the maximum size is. | ||
+ | |||
+ | With that said, you could write code that uses the Descriptor interfaces to iterate over the FieldDescriptors for a message type (MyMessageType::descriptor()). | ||
+ | |||
+ | See: https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor | ||
+ | |||
+ | Similar interfaces exist in Java, Python, and probably others. | ||
+ | |||
+ | Here's the rules to implement: | ||
+ | |||
+ | Each field is composed of a tag followed by some data. | ||
+ | |||
+ | For the tag: | ||
+ | |||
+ | Field numbers 1-15 have a 1-byte tag. | ||
+ | Field numbers 16 and up have 2-byte tags. | ||
+ | |||
+ | For the data: | ||
+ | |||
+ | bool is always one byte. | ||
+ | int32, int64, uint64, and sint64 have a maximum data length of 10 bytes (yes, int32 can be 10 bytes if it is negative, unfortunately). | ||
+ | sint32 and uint32 have a maximum data length of 5 bytes. | ||
+ | fixed32, sfixed32, and float are always exactly 4 bytes. | ||
+ | fixed64, sfixed64, and double are always exactly 8 bytes. | ||
+ | Enum-typed fields' maximum length depends on the maximum enum value: | ||
+ | 0-127: 1 byte | ||
+ | 128-16384: 2 bytes | ||
+ | ... it's 7 bits per byte, but hopefully your enum isn't THAT big! | ||
+ | Also note that negative values will be encoded as 10 bytes, but hopefully there aren't any. | ||
+ | Message-typed fields' maximum length is the maximum length of the message type plus bytes for the length prefix. The length prefix is, again, one byte per 7 bits of integer data. | ||
+ | Groups (which you shouldn't be using; they're a decrepit old feature deprecated before protobuf was even released publicly) have a maximum size equal to the maximum size of the contents plus a second field tag (see above). | ||
+ | |||
+ | If your message contains any of the following, then its maximum length is unbounded: | ||
+ | |||
+ | Any field of type string or bytes. (Unless you know their max length, in which case, it's that max length plus a length prefix, like with sub-messages.) | ||
+ | Any repeated field. (Unless you know its max length, in which case, each element of the list has a max length as if it were a free-standing field, including tag. There is NO overall length prefix here. Unless you are using [packed=true], in which case you'll have to look up the details.) | ||
+ | Extensions. | ||
+ | " | ||
+ | --> | ||
+ | |||
+ | == [[#top|^]] Encoding Submessages == | ||
+ | |||
+ | * https://stackoverflow.com/questions/56739667/nanopb-protocol-buffers-library-repeated-sub-messages-encode | ||
+ | |||
+ | * https://groups.google.com/g/nanopb/c/OT4Kw3Siuio | ||
+ | |||
+ | May be necessary in a pb_callback_t function to call `pb_encode_tag()` followed by `pb_encode_submessage()`. | ||
+ | |||
+ | * https://github.com/nanopb/nanopb/issues/331 | ||
+ | |||
+ | Evidently protobuf callbacks for decoding are not supported for `oneof` types: | ||
+ | |||
+ | * https://stackoverflow.com/questions/39854434/nanopb-correctly-encoding-and-decoding-repeated-construct-fields-in-submessage | ||
+ | |||
+ | How to encode strings, hint: use the .arg message structure member: | ||
+ | |||
+ | * https://stackoverflow.com/questions/57569586/how-to-encode-a-string-when-it-is-a-pb-callback-t-type | ||
+ | |||
+ | == [[#top|^]] Length Prefixing == | ||
+ | |||
+ | One way to send large data sets via protobuf is to break them into smaller pieces, and apply protobuf definition to give these pieces a meaning both sender and receiver can understand. See one Mr. Eli's article on this strategy: | ||
+ | |||
+ | * https://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers | ||
+ | |||
+ | == [[#top|^]] Other Protobuf Libraries == | ||
+ | |||
+ | Pigweed protobuf library . . . | ||
+ | |||
+ | * https://pigweed.dev/pw_protobuf/#comparison-with-other-protobuf-libraries | ||
+ | |||
+ | <!-- odne komentar --> | ||
== [[#top|^]] References To Sort == | == [[#top|^]] References To Sort == | ||
Line 103: | Line 334: | ||
. . . It appears that the integer values assigned to message elements are tantamount to key names in JSON. | . . . It appears that the integer values assigned to message elements are tantamount to key names in JSON. | ||
<!-- comentario --> | <!-- comentario --> |
Latest revision as of 23:12, 29 October 2024
Overview
There's a lot going on with Protobuf, so this local Neela Nurseries page was started to capture some links to online docs and tutorials.
A good, more detailed introduction follows. It is one page of a larger documentation collection and contains example C code to "get the size of the message without storing it anywhere":
Another reference with possible good detail:
Encoding details of protobuf "on the wire":
^ Terminology and Elements Of
Protobuf has several features and "moving parts". Two of these, which we'll loop back to and write a bit more about, are field names and field numbers. Both are important identifiers expressed in the .proto message definition files used at build time by senders and receivers of protobuf formatted messages. Because they're fixed at build time, changing these name and numeric field identifiers after software has been built and released can, and likely will, cause message interpretation errors: older software sending and receiving the updated messages does not understand the newer message format.
Protobuf has a means of handling cases where field names and numbers need to be "removed": the message definitions are changed to mark those defunct identifiers as reserved. Reserved names and numbers cannot be re-used, so different ones must be chosen, but the old ones are taken "out of circulation". Read further at:
^ To Factor Protobuf Message Definitions
The following online guide speaks to defining multiple related protobuf messages in a single .proto file:
^ Nanopb
2022-01-08 Saturday
- https://github.com/nanopb/nanopb/blob/master/generator/proto/nanopb.proto
- https://jpa.kapsi.fi/nanopb/docs/whats_new.html
- https://jpa.kapsi.fi/nanopb/docs/
Cmake script to locate Nanopb headers and sources:
A nanopb API reference:
A further reference from University of Hannover:
pb_encode.c has an interesting function . . .
<pre>
/* Encode a field with callback semantics. This means that a user function is
 * called to provide and encode the actual data. */
static bool checkreturn encode_callback_field(pb_ostream_t *stream,
    const pb_field_t *field, const void *pData)
{
    const pb_callback_t *callback = (const pb_callback_t*)pData;

#ifdef PB_OLD_CALLBACK_STYLE
    const void *arg = callback->arg;
#else
    void * const *arg = &(callback->arg);
#endif

    if (callback->funcs.encode != NULL)
    {
        if (!callback->funcs.encode(stream, field, arg))
            PB_RETURN_ERROR(stream, "callback error");
    }
    return true;
}
</pre>
^ Handling of values of zero
To save space, nanopb does not encode fields whose value is zero (proto3 specifies the same for scalar fields without explicit presence), and during decoding the protobuf implementation assumes that missing fields carry a value of zero:
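The zero-omission behaviour can be illustrated with a small hand-rolled encoder. This is a minimal sketch that does not use nanopb: `demo_msg_t`, `demo_encode()` and `demo_decode()` are hypothetical names invented for illustration, and only varint fields are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Write v as a base-128 varint; returns the number of bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80); /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Read a varint; returns bytes consumed, or 0 if truncated or overlong. */
static size_t get_varint(const uint8_t *in, size_t len, uint64_t *v)
{
    uint64_t result = 0;
    size_t n = 0;
    unsigned shift = 0;
    while (n < len && shift < 64) {
        uint8_t b = in[n++];
        result |= (uint64_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) { *v = result; return n; }
        shift += 7;
    }
    return 0;
}

/* Hypothetical two-field message, for illustration only. */
typedef struct {
    uint32_t message_id; /* field number 1 */
    uint32_t count;      /* field number 2 */
} demo_msg_t;

/* Encode with implicit presence: zero-valued fields are skipped entirely. */
static size_t demo_encode(const demo_msg_t *m, uint8_t *out)
{
    size_t n = 0;
    if (m->message_id != 0) {
        n += put_varint(out + n, (1u << 3) | 0u); /* field 1, wire type 0 */
        n += put_varint(out + n, m->message_id);
    }
    if (m->count != 0) {
        n += put_varint(out + n, (2u << 3) | 0u); /* field 2, wire type 0 */
        n += put_varint(out + n, m->count);
    }
    return n;
}

/* Decode: start from all-zero defaults, overwrite only fields that are present. */
static bool demo_decode(const uint8_t *in, size_t len, demo_msg_t *m)
{
    size_t pos = 0;
    m->message_id = 0;
    m->count = 0;
    while (pos < len) {
        uint64_t key, value;
        size_t n = get_varint(in + pos, len - pos, &key);
        if (n == 0) return false;
        pos += n;
        if ((key & 7u) != 0) return false; /* this sketch handles varints only */
        n = get_varint(in + pos, len - pos, &value);
        if (n == 0) return false;
        pos += n;
        if ((key >> 3) == 1) m->message_id = (uint32_t)value;
        else if ((key >> 3) == 2) m->count = (uint32_t)value;
        /* unknown varint fields are ignored */
    }
    return true;
}
```

With `message_id = 7` and `count = 0` the encoded message is just the two bytes `08 07`, and an all-zero message encodes to zero bytes. The decoder cannot tell "field was zero" from "field was never set", which is worth keeping in mind when zero is a meaningful reading.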
^ Set Up Errors
^ Protobuf C Code Examples
When compiling the nanopb Protobuf library as part of C language programs, nested Protobuf messages can require use of the nanopb-defined structure type `pb_callback_t` (an encode or decode function pointer plus a user `arg` pointer) in order to encode and decode those nested messages, typically when a field's size is not bounded at build time. Some examples of this on GitHub:
In the first example, an early instance of `pb_callback_t` in the file occurs on line 56. Looking further, this project has a few dozen protoc-generated files, so these notes switch to a possibly smaller project:
In kitsune project, looking at:
(1) file kitsune/kitsune/audio_features_upload_task.c function setup_protbuf( . . . )
(2) file audio_features_upload_task_helpers.c function encode_repeated_streaming_bytes_and_mark_done(pb_ostream_t *stream, const pb_field_t *field, void * const *arg)
(3) in same file reviewing function write_streams(pb_ostream_t *stream, const pb_field_t *field,hlo_stream_t * hlo_stream)
Here is an excerpt from proto_utils.c which appears to contain an encode callback of the kind a `pb_callback_t` points to (lines 147 to 156 of that file), followed by the routine which references it in a function pointer assignment:
<pre>
bool encode_device_id_string(pb_ostream_t *stream, const pb_field_t *field, void * const *arg) {
    //char are twice the size, extra 1 for null terminator
    char hex_device_id[2*DEVICE_ID_SZ+1] = {0};
    if(!get_device_id(hex_device_id, sizeof(hex_device_id)))
    {
        return false;
    }

    return pb_encode_tag_for_field(stream, field) && pb_encode_string(stream, (uint8_t*)hex_device_id, strlen(hex_device_id));
}

void pack_batched_periodic_data(batched_periodic_data* batched, periodic_data_to_encode* encode_wrapper)
{
    if(NULL == batched || NULL == encode_wrapper)
    {
        LOGE("null param\n");
        return;
    }

    batched->data.funcs.encode = encode_all_periodic_data; // This is smart :D
    batched->data.arg = encode_wrapper;
    batched->firmware_version = KIT_VER;

    batched->device_id.funcs.encode = encode_device_id_string;
}
</pre>
A search for calls to `pack_batched_periodic_data()`:
<pre>
$ grep -nr pack_batched_periodic_data ./*
./commands.c:1069: pack_batched_periodic_data(&data_batched, &periodicdata);
./proto_utils.c:158:void pack_batched_periodic_data(batched_periodic_data* batched, periodic_data_to_encode* encode_wrapper)
./proto_utils.h:29:void pack_batched_periodic_data(batched_periodic_data* batched, periodic_data_to_encode* encode_wrapper);
</pre>
Tracing yet further back, commands.c in the kitsune project has the following routine, which declares and uses a `periodic_data` type (line numbers kept so that the grep hit at commands.c:1069 can be located):
<pre>
1038 void thread_tx(void* unused) {
1039     batched_periodic_data data_batched = {0};
1040 #ifdef UPLOAD_AP_INFO
1041     batched_periodic_data_wifi_access_point ap;
1042 #endif
1043     periodic_data forced_data;
1044     bool got_forced_data = false;
1045
1046     LOGI(" Start polling \n");
1047     while (1) {
1048         if (uxQueueMessagesWaiting(data_queue) >= data_queue_batch_size
1049             || got_forced_data ) {
1050             LOGI( "sending data\n" );
1051
1052             periodic_data_to_encode periodicdata;
1053             periodicdata.num_data = 0;
1054             periodicdata.data = (periodic_data*)pvPortMalloc(MAX_BATCH_SIZE*sizeof(periodic_data));
1055
1056             if( !periodicdata.data ) {
1057                 LOGI( "failed to alloc periodicdata\n" );
1058                 vTaskDelay(1000);
1059                 continue;
1060             }
1061             if( got_forced_data ) {
1062                 memcpy( &periodicdata.data[periodicdata.num_data], &forced_data, sizeof(forced_data) );
1063                 ++periodicdata.num_data;
1064             }
1065             while( periodicdata.num_data < MAX_BATCH_SIZE && xQueueReceive(data_queue, &periodicdata.data[periodicdata.num_data], 1 ) ) {
1066                 ++periodicdata.num_data;
1067             }
1068
1069             pack_batched_periodic_data(&data_batched, &periodicdata);
1070
1071             data_batched.has_uptime_in_second = true;
1072             data_batched.uptime_in_second = xTaskGetTickCount() / configTICK_RATE_HZ;
1073
1074             if( !is_test_boot() && provisioning_mode ) {

. . .
</pre>
In this kitsune project see also `kitsune/kitsune/protobuf/provision.pb.h`.
^ To Predetermine Encoded Data Size
This section might also be titled "To Determine Encoded Data Size at Build Time".
Although often not possible for messages of variable or unknown length, determining the maximum possible encoded message size at build time can be useful for projects whose messages have a fixed number of fields, each of bounded size. This section collects the tools and public online discussions that cover this protobuf analysis topic.
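For a message made only of bounded scalar fields, the worst case can be added up by hand from the wire-format rules: each field costs one tag varint plus its data, a negative `int32` takes 10 bytes because it is sign-extended to 64 bits, and a `float` always takes 4 bytes. The sketch below applies those rules to the two-field `sensorUpdates` message defined near the end of this page; the helper names are invented for illustration and nanopb is not used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode v as a varint (7 payload bits per byte). */
static size_t varint_size(uint64_t v)
{
    size_t n = 1;
    while (v >= 0x80) {
        v >>= 7;
        n++;
    }
    return n;
}

/* Size of a field's tag, i.e. the varint of (field_number << 3 | wire_type).
 * The 3-bit wire type never changes how many bytes the tag needs. */
static size_t tag_size(uint32_t field_number)
{
    assert(field_number >= 1);
    return varint_size((uint64_t)field_number << 3);
}

/* Worst-case data sizes for the scalar types used below. */
enum {
    MAX_INT32_DATA = 10, /* negative values are sign-extended to 64 bits */
    FLOAT_DATA     = 4   /* fixed 32-bit */
};

/* Worst case for: int32 message_id = 1; float vrms = 2; */
static size_t sensor_updates_max_size(void)
{
    return tag_size(1) + MAX_INT32_DATA
         + tag_size(2) + FLOAT_DATA;
}
```

Under these assumptions the worst case is 1 + 10 + 1 + 4 = 16 bytes. Any string, bytes or repeated field without a known maximum length makes the bound disappear, so this approach only works when every field is bounded.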
^ Encoding Submessages
In a `pb_callback_t` encode function it may be necessary to call `pb_encode_tag()` followed by `pb_encode_submessage()`.
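To see why a tag has to come before the submessage, it helps to look at the bytes involved: a submessage field on the wire is a tag with wire type 2 (length-delimited), then the body length as a varint, then the body itself. The following is a hand-rolled sketch of that layout, not nanopb code; `encode_inner()` and `encode_submessage_field()` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write v as a base-128 varint; returns the number of bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Encode a hypothetical inner message holding one varint field (number 1). */
static size_t encode_inner(uint8_t *out, uint32_t value)
{
    size_t n = 0;
    n += put_varint(out + n, (1u << 3) | 0u); /* field 1, wire type 0 */
    n += put_varint(out + n, value);
    return n;
}

/* Emit an already-encoded body as a submessage field:
 * tag (wire type 2), body length, then the body bytes. */
static size_t encode_submessage_field(uint8_t *out, uint32_t field_number,
                                      const uint8_t *body, size_t body_len)
{
    size_t n = 0;
    n += put_varint(out + n, ((uint64_t)field_number << 3) | 2u);
    n += put_varint(out + n, body_len);
    memcpy(out + n, body, body_len);
    return n + body_len;
}
```

Encoding an inner message with value 150 and wrapping it as field 3 yields the five bytes `1A 03 08 96 01`. Note that the length has to be known before the body is written, which is why a streaming encoder needs some way to size the submessage first.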
Evidently protobuf callbacks for decoding are not supported for `oneof` types:
How to encode strings (hint: use the `.arg` member of the `pb_callback_t` structure):
^ Length Prefixing
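One way to send large data sets via protobuf is to break them into smaller pieces and apply a protobuf definition that gives these pieces a meaning both sender and receiver can understand. Each piece is then sent with its length in front, so the receiver knows where one message ends and the next begins. See Eli Bendersky's article on this strategy: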
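A minimal sketch of length-prefix framing follows. It assumes a varint length prefix; the linked article may use a different prefix width, and the function names `frame_write()` and `frame_read()` are invented here. The payload would normally be an encoded protobuf message, but any byte string works for the framing itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write v as a base-128 varint; returns the number of bytes written. */
static size_t put_varint(uint8_t *out, uint64_t v)
{
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Read a varint; returns bytes consumed, or 0 if truncated or overlong. */
static size_t get_varint(const uint8_t *in, size_t len, uint64_t *v)
{
    uint64_t result = 0;
    size_t n = 0;
    unsigned shift = 0;
    while (n < len && shift < 64) {
        uint8_t b = in[n++];
        result |= (uint64_t)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) { *v = result; return n; }
        shift += 7;
    }
    return 0;
}

/* Append one frame (length prefix, then payload) to out.
 * Returns bytes written, or 0 if the frame does not fit in cap bytes. */
static size_t frame_write(uint8_t *out, size_t cap,
                          const uint8_t *payload, size_t len)
{
    uint8_t prefix[10];
    size_t p = put_varint(prefix, len);
    if (len > cap || p > cap - len) return 0;
    memcpy(out, prefix, p);
    if (len > 0) memcpy(out + p, payload, len);
    return p + len;
}

/* Read one frame from in. On success points *payload at the framed bytes,
 * sets *len, and returns bytes consumed; returns 0 if the frame is incomplete. */
static size_t frame_read(const uint8_t *in, size_t avail,
                         const uint8_t **payload, size_t *len)
{
    uint64_t n;
    size_t p = get_varint(in, avail, &n);
    if (p == 0 || n > avail - p) return 0;
    *payload = in + p;
    *len = (size_t)n;
    return p + (size_t)n;
}
```

A receiver would call `frame_read()` repeatedly on its receive buffer, advancing by the returned count and waiting for more bytes whenever it returns 0.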
^ Other Protobuf Libraries
Pigweed protobuf library . . .
^ References To Sort
Protobuf references, somewhat arbitrary starting point yet introduces some key topics of Protobuf standard and use cases:
- https://www.crankuptheamps.com/blog/posts/2017/10/12/protobuf-battle-of-the-syntaxes/
- https://www.educative.io/edpresso/what-is-the-difference-between-protocol-buffers-and-json
JSON supported data types:
First Protobuf .proto file, compiles using `protoc-c`, part of a package available with Ubuntu 20.04:
<pre>
// syntax = "proto3";
syntax = "proto2";

// Notes:
// $ protoc-c --c_out=. ./first.proto

message sensorUpdates {
    required int32 message_id = 1;
    optional float vrms = 2;
}
</pre>
. . . It appears that the integer values assigned to message elements are tantamount to key names in JSON.
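The comparison with JSON keys can be made concrete: in the binary encoding each value is preceded by a key that packs the field number together with a 3-bit wire type, and the field name never appears on the wire. A small sketch, with helper names invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Wire types from the protobuf encoding guide that are used here. */
enum {
    WT_VARINT = 0, /* int32, int64, uint32, bool, enum, ... */
    WT_LEN    = 2, /* string, bytes, embedded messages */
    WT_32BIT  = 5  /* fixed32, sfixed32, float */
};

/* Build the key that precedes a field's value on the wire. */
static uint32_t make_key(uint32_t field_number, uint32_t wire_type)
{
    assert(wire_type <= 7u);
    return (field_number << 3) | wire_type;
}

/* Recover the field number from a key. */
static uint32_t key_field_number(uint32_t key)
{
    return key >> 3;
}

/* Recover the wire type from a key. */
static uint32_t key_wire_type(uint32_t key)
{
    return key & 0x7u;
}
```

For the `sensorUpdates` message above, `message_id = 1` gets key `0x08` and `vrms = 2` gets key `0x15`. Renaming `vrms` in the .proto file would leave those bytes unchanged, while renumbering it would not.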