Hello everyone! After several months of working on this project, I am very pleased to finally be able to share the results.
The Makeup-Comparator project was initially intended to serve as both a backend for a website and a command-line interface (CLI). As of today, September 2023, only the CLI program has been released, offering users a quick way to search for products across different websites and compare them. This is achieved through web scraping techniques applied to various websites to extract their product data.
First and foremost, installing Makeup-Comparator is a straightforward process using Cargo. You can follow the instructions below:
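The installation instructions themselves were not included in this excerpt; for a crate published on crates.io they would presumably look like the following (the crate name makeup-comparator is my assumption, not confirmed by the post):

```shell
# Install the CLI from crates.io (hypothetical crate name)
cargo install makeup-comparator
```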
After all, personal projects aren’t necessarily aimed at becoming a major success or a highly profitable service. In my case, I undertake them to learn and enhance my knowledge of various technologies and programming methods that I may not have the opportunity to explore in my daily work. That’s why I’d like to take this opportunity to explain what I’ve learned during this process and, perhaps more importantly, the challenges I’ve encountered along the way.
Rust
I chose Rust as the language for this project primarily because I wanted to delve deeper into this relatively new programming language, which I had previously used in other projects such as Easy_GA.
Rust allowed me to develop this project quite easily, thanks to its versatility and robustness. It is a language that makes it simple to start a project quickly and get things done efficiently.
Error handling is another powerful aspect of Rust. In this project, it was essential because web scraping relies heavily on whether we can retrieve the desired information or not, and it’s crucial to find solutions when this information is not available.
Testing
One of the most crucial aspects of this project was system testing, even more so than unit testing. This is significant because webpages can change frequently, and in a project that heavily relies on HTML structure, it was common to wake up one day and find that some parts of the web scraping were broken due to recent changes on the websites used to retrieve information.
Thanks to system testing, I was able to quickly identify the sources of these problems and address them promptly. It’s important to note that testing serves not only to ensure a high level of code reliability but also to define specific situations that, when altered, should trigger notifications or alerts.
Testing is indeed crucial, but its effectiveness depends on the coverage of the parts we expect to test. In this project, I paid special attention to the code coverage of my testing efforts. I designed a script to calculate the coverage generated by my tests and utilized various tools to visualize this information. I also set a goal of maintaining test coverage above 90% of the lines of code to ensure thorough testing.
As with testing, this work gave me the idea of writing an article that became quite popular online: Code coverage in Rust.
CI/CD
Continuous Integration/Continuous Deployment (CI/CD) pipelines are common in the industry but not often seen in personal projects. In my case, it was more about exploring this aspect and understanding what it could offer me.
Despite the fact that this project was developed by a single programmer (myself), I structured the repository to follow a Gitflow integration pattern. I prohibited direct pushes to the master branch and enforced passing the tests defined in GitHub Actions before any changes were merged.
Before implementing the CI/CD pipeline, I established Git hooks to ensure that the code being pushed didn’t contain any warnings, didn’t fail static analysis, was well-formatted, and that all tests were passing.
Finally, the deployment process provided by Cargo is very straightforward and easy. I divided my project into two crates: one for web scraping, which can be reused in other projects, and the other for visualizing the results using the first crate.
Small string optimization (SSO) is a mechanism present in many programming languages and modern compilers. Its purpose is to reduce memory allocations on the heap. This optimization is particularly crucial in embedded environments and low-latency systems, where we must be vigilant about both our application’s memory consumption and code performance.
In the CppCon talk “The strange details of std::string at Facebook”, Nicholas Ormrod mentioned that std::string is the most frequently included file in the Facebook codebase, accounting for 18% of the CPU time spent in the STL.
Memory allocations are slow due to the numerous steps involved in the process of allocating memory on the heap. Typically, when working with strings, this memory is swiftly deallocated after usage, such as when printing text or assigning names to local (stack) objects.
The most frequently used string in C++ is the empty string. This string is generated when we perform default initialization on a string, and it occupies 1 byte: the null terminator.
How does it work?
Before GCC 5.0, Small String Optimization (SSO) was not implemented. The std::string object itself had a size of 8 bytes (a single char*), because the size and the capacity were stored at positions -3 and -2 of the char* used to reference the string's data.
Before delving into how Small String Optimization (SSO) is implemented, it’s important to revisit how regular string objects are defined. Here’s a typical std::string implementation using modern C++:
struct string {
    std::unique_ptr<char[]> data; // pointer to the heap data
    std::size_t size;             // actual size
    std::size_t capacity;         // total capacity of the buffer
};
With this implementation, we have a string class with a size of 24 bytes (on 64-bit architectures). If we compare this to the actual std::string in C++ found in the <string> header, we can observe that the C++ implementation has a size of 32 bytes. This size difference is attributable to Small String Optimization (SSO), which consumes some space within the struct. When SSO is applied, the implementation changes to something similar to this:
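The snippet appears to be missing here; based on the sizes discussed below (a 40-byte struct with a 16-byte inline buffer), a naive SSO extension of the previous struct would look something like this (my reconstruction, not an actual standard-library layout):

```cpp
#include <array>
#include <cstddef>
#include <memory>

struct sso_string {
    std::unique_ptr<char[]> data;    // heap data, used for long strings
    std::size_t size;                // actual size
    std::size_t capacity;            // total capacity of the heap buffer
    std::array<char, 16> sso_buffer; // 15 characters + 1 null terminator
};

// 24 bytes of bookkeeping + 16 bytes of inline buffer:
static_assert(sizeof(sso_string) == 40, "expected on 64-bit architectures");
```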
The magic number 16 is determined by the compiler's implementation and typically corresponds to 15 characters plus 1 null terminator.
This implementation has its advantages and disadvantages. If your string is shorter than 16 characters, it will be stored inside the std::array on the stack. However, if it exceeds 15 characters, your string will occupy 40 bytes in memory, with the 16-byte array going unused; conversely, while SSO is active, the 24 bytes taken by capacity, size, and data are not needed until we reach 16 characters.
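The union-based implementation described in the next paragraph can be sketched like this (again my reconstruction; real implementations differ in details such as how the active mode is encoded):

```cpp
#include <array>
#include <cstddef>

struct sso_string {
    std::size_t size;  // actual size; also tells us which union member is active
    struct heap_t {
        char*       data;      // heap allocation, used when size >= 16
        std::size_t capacity;  // total capacity of the heap buffer
    };
    union {
        heap_t heap_buffer;                             // long strings
        std::array<char, sizeof(heap_t)> stack_buffer;  // 15 chars + '\0'
    };
};

static_assert(sizeof(sso_string::heap_t) == 16, "expected on 64-bit architectures");
static_assert(sizeof(sso_string) == 24, "the union overlays both representations");
```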
In this new implementation, we utilize a static array (std::array) that eliminates the need for memory allocation on the heap. The critical element here is the sizeof(heap_buffer), which on a 64-bit architecture would be 16 bytes. This buffer accommodates 15 bytes for the string content and 1 byte for the null terminator for small strings. With this approach, the string size remains at 24 bytes, and the use of a union ensures that the same space is utilized whether the string size is larger or smaller than 16 bytes.
Personally, I prefer the second implementation, even though it lacks the flexibility to set a fixed size for every architecture or operating system, and it is tightly bound to data sizes. There are two potential reasons for using the first approach in compilers: either they consider it more efficient or they want to maintain compatibility with the existing Application Binary Interface (ABI).
Performance
The answer seems obvious: if our data structure is smaller and not allocated on the heap, our operations will be faster, since it is faster to allocate on and read from the stack than from the heap. There is a lot of benchmarking on the internet about this, but I would like to highlight two of the clearest and most thorough examples that I found.
Especially in the last one, we can clearly see how on macOS the SSO size is set to 23 characters, instead of the 15 used on Ubuntu.
FBString by Andrei Alexandrescu
At CppCon 2016, Nicholas Ormrod mentioned in his talk “The strange details of std::string at Facebook” the FBString implementation by Andrei Alexandrescu, which provided SSO prior to GCC 5.0.
In contrast to some SSO optimizations implemented by compilers, FBString offers an SSO buffer size of 23 characters, which is larger than the 15 characters provided by the vast majority of implementations. But why is this significant? As Nicholas mentioned in the talk, one of the most commonly used strings at Facebook consists of UUIDs of 20 characters. With the majority of compiler implementations, these strings exceed the maximum number of characters that fit within the SSO buffer, leading to heap memory allocation. Aside from Facebook's case, 15 characters are not enough space to represent some 64-bit integer values as strings (int64_t, uint64_t, long long).
This is accomplished through the implementation of a three-tiered storage strategy and by collaborating with the memory allocator. Specifically, FBString is designed to identify the use of jemalloc and work in conjunction with it to attain substantial enhancements in both speed and memory utilization.
FBString's SSO uses the last byte to represent the spare capacity of the string (the remaining characters).
This is quite clever because 0 is equivalent to ‘\0’. Consequently, when the spare capacity is 0, our last character will be the null terminator.
With this implementation, we have 3 bits remaining within the 24 available bytes to use as flags. Here's a quick summary of how the values are stored based on the number of characters:

<= 23 characters: allocated on the stack.
24-255 characters: stored in malloc-allocated memory and copied eagerly.
> 255 characters: stored in malloc-allocated memory and copied lazily (copy on write).
Copy-on-write (COW) semantics were removed in C++11 due to multithreading problems, but Andrei’s implementation offers multithreading safety.
Nicholas claims that this new implementation resulted in a 1% performance improvement for Facebook, which may seem substantial for a mere std::string optimization. One of the reasons for this improvement is its high optimization for the cache. Although it generates more assembly instructions, it is more cache-friendly, making the implementation faster. It’s essential to consider whether our target platform has cache available when assessing the impact of this optimization.
Note: Clang and MSVC have a similar implementation.
String_view
std::string_view was introduced in C++17. The primary purpose of this data structure is to optimize operations involving read-only strings. As a result, the std::string_view interface closely resembles that of a constant std::string.
A std::string_view is exceptionally lightweight, consisting of just a pointer to the data (8 bytes on 64-bit architectures) and the size (another 8 bytes), for a total of 16 bytes. This makes it extremely cheap to convert a std::string into a std::string_view via the std::basic_string_view(const CharT* s, size_type count) constructor.
In contrast, transforming a null-terminated string (const char*) into a suitable data structure typically requires iterating through all the data to retrieve the size of the string. This process has a time complexity of O(n), which can be significant when dealing with very large null-terminated strings, although such situations are relatively rare.
It’s crucial to highlight that std::string_view instances are allocated on the stack, eliminating the need for memory heap allocations. This is made possible by their immutability and their role as a “view” of an existing string, rather than a separate data container.
It's a good practice to consider using std::string_view whenever you are working with data that effectively behaves like a const std::string& or a const char* without the need for a null terminator.
Last but not least, std::string_view does not have a null terminator character!!!
It is widely acknowledged that ChatGPT has gained significant recognition in the field, so let us proceed directly to the valuable techniques I have discovered through my extensive experience as a programmer, which I diligently apply in my everyday coding endeavors.
1. Generate code for languages I’m not familiar with
It is a prevalent practice in software development projects to utilize multiple programming languages. When working with C++, for instance, it is often necessary to employ additional languages such as CMake, Bash, Python, and others. However, it is quite common to use these auxiliary languages for specific tasks and subsequently overlook them.
This approach proves particularly advantageous for C++ projects, as the initial setup process in CMake can be intricate and time-consuming. In situations where a simple configuration is all that is needed to initiate a project, employing this technique can save valuable time and effort.
By adopting this approach, I am able to initiate a project promptly without the need to comprehend the intricacies of CMake or its usage. This significantly reduces the time required for project setup, enabling me to dive into coding swiftly.
Another language that has proven immensely valuable is Bash. However, working with Bash sometimes necessitates an extensive understanding of the language itself as well as the Linux operating system. This dependency can be mitigated by leveraging the capabilities of ChatGPT. For instance, consider the following example where I require a quick method to obtain a list of all process IDs (PIDs) that are currently running, and subsequently remove a specific process by providing an input parameter.
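The generated script itself was not included in this excerpt; a sketch of the kind of Bash that ChatGPT produces for this prompt (my reconstruction) would be:

```shell
#!/bin/bash
# Sketch: list all running PIDs, and optionally kill one given as a parameter.

list_pids() {
    # List every running process ID together with its command name
    ps -eo pid,comm
}

kill_pid() {
    # Remove the process whose PID is given as the input parameter
    kill "$1"
}

list_pids
```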
2. Explain code
In the software development realm, it is not uncommon to encounter scenarios where we need to delve into pre-existing legacy code, whether they were generated for us or by other individuals. As time goes by, it becomes increasingly challenging to comprehend code that we ourselves wrote in the distant past. This difficulty amplifies when dealing with external code. However, ChatGPT excels in its ability to provide comprehensive explanations of code. Let’s consider the following code snippet:
int randomNumber = Random::get<int>(0, std::numeric_limits<int>::max());
// current date/time based on current system
std::time_t now = std::time(0);
// convert now to string form
std::string time = std::asctime(std::localtime(&now));
std::string title("Simulations/Simulation" + std::to_string(randomNumber) +".csv");
std::ofstream stream(title);
csv2::Writer<csv2::delimiter<','>> writer(stream);
This is the in-depth explanation:
3. Comment code
Generating meaningful and concise comments for code can often be a tedious and challenging task. It requires careful consideration of how other programmers who read the code will comprehend it accurately. Thankfully, ChatGPT can assist in automating this process. Let’s continue with the previous code snippet and ask ChatGPT to generate meaningful comments for it:
// Generate a random number
int randomNumber = Random::get<int>(0, std::numeric_limits<int>::max());
// Get the current system time
std::time_t now = std::time(0);
// Convert the time to a string
std::string time = std::asctime(std::localtime(&now));
// Create the file name for writing the CSV data
std::string title("Simulations/Simulation" + std::to_string(randomNumber) + ".csv");
// Open the file for writing
std::ofstream stream(title);
// Create a CSV writer for the file
csv2::Writer<csv2::delimiter<','>> writer(stream);
4. Improve code
Setting personal ego aside, it is a common occurrence for programmers to find themselves needing to make quick implementations to meet deadlines or due to the desire for efficiency. In such situations, ChatGPT proves to be highly valuable as it excels in improving code in ways that may not have been initially considered by the programmer.
The following code snippet is extracted from a personal project, TwitchLootClicker, which is a Google Chrome extension written in JavaScript. JavaScript is an easy language to pick up quickly, but I am not very familiar with it, so I tend to code as if I were still using C++:
The first function serves to convert milliseconds to the format of minutes and seconds (MM:SS). The second function converts milliseconds to the format of hours, minutes, and seconds (HH:MM:SS). Lastly, the third function converts minutes to milliseconds.
I introduced the following prompt to ChatGPT: “Refactor this code using idiomatic JavaScript conventions.” The goal of this prompt is to modify the code to adhere to the standard practices followed by JavaScript developers.
/**
* Converts milliseconds to minutes and seconds
* @param {number} millis - The time value in milliseconds
* @returns {string} The formatted time in "minutes:seconds" format
*/
function millisToMinutesAndSeconds(millis) {
if (millis < 0) {
return "Ready to claim";
}
const minutes = Math.floor(millis / 60000);
const seconds = Math.floor((millis % 60000) / 1000).toString().padStart(2, '0');
return `${minutes}:${seconds}`;
}
/**
* Converts milliseconds to hours, minutes, and seconds
* @param {number} millis - The time value in milliseconds
* @returns {string} The formatted time in "hours:h minutes:m seconds:s" format
*/
function millisToHoursMinutesAndSeconds(millis) {
const seconds = Math.floor(millis / 1000);
let minutes = Math.floor(seconds / 60); // `let`, because it is reassigned below
let hours = "";
if (minutes > 59) {
hours = Math.floor(minutes / 60).toString().padStart(2, '0');
minutes %= 60;
}
const formattedMinutes = minutes.toString().padStart(2, '0');
const formattedSeconds = (seconds % 60).toString().padStart(2, '0');
if (hours !== "") {
return `${hours}h ${formattedMinutes}m ${formattedSeconds}s`;
}
return `${formattedMinutes}m ${formattedSeconds}s`;
}
/**
* Convert minutes to milliseconds
* @param {number} minutes - The time value in minutes
* @returns {number} The equivalent time value in milliseconds
*/
function minutesToMilliseconds(minutes) {
return minutes * 60000;
}
Those are the improvements done by ChatGPT to the former code:
Added type annotations to the function parameters for better code documentation.
Used template literals (backticks with `${...}`) for string concatenation, making the code more readable.
Replaced the toFixed() method with Math.floor() and toString().padStart() to ensure consistent formatting and avoid floating-point errors.
Used the const keyword for variables that don’t require reassignment.
Made the code more concise and eliminated unnecessary variable assignments.
Code coverage is a metric that measures the extent to which code is exercised by tests. It is pretty useful when implementing unit testing to ensure that your tests cover the code conditions you really want them to. Code coverage is conventionally represented as a numerical percentage, where a lower value implies that less of the code is protected by tests.
Metrics used for code coverage
While lines of code serve as one metric for assessing code coverage, it is crucial to acknowledge that they are not the sole determinant of comprehensive code protection. Various units of measurement contribute to achieving well-rounded coverage, such as function coverage, statement coverage, branch coverage and condition coverage.
Function coverage: Is a vital metric that quantifies the extent to which the defined functions are actually invoked or called during program execution. By measuring function coverage, developers can assess the effectiveness of their tests in exercising all the functions and identifying any potential gaps in code execution.
Statement coverage: Is a fundamental metric used to evaluate the degree to which statements are executed during program run-time. It measures the proportion of statements that are traversed and executed by the test suite. By examining statement coverage, developers can gain insights into the thoroughness of their tests in terms of exploring different code paths and identifying any unexecuted or potentially problematic statements.
Branch coverage: is a crucial metric that assesses the extent to which different branches of code bifurcations are executed by the test suite. It specifically measures whether all possible branches, such as those within if-else or if-elseif-else conditions, are exercised during program execution. By analyzing branch coverage, developers can determine whether their tests adequately explore various code paths, ensuring comprehensive validation of all possible branch outcomes. This helps identify potential gaps in testing and increases confidence in the reliability and robustness of the code.
Condition coverage: Is a metric used to evaluate the adequacy of tests in terms of covering all possible outcomes of boolean sub-expressions. It measures whether different possibilities, such as true or false evaluations of conditions, are effectively tested. By assessing condition coverage, developers can ensure that all potential combinations and variations within boolean expressions are thoroughly examined, mitigating the risk of undetected issues related to specific condition outcomes.
Given the next code snippet:
Rust
pub fn add(x: usize, y: usize) -> usize {
    let mut z = 0;
    if x > 0 && y > 0 {
        z = x;
    }
    z
}
Function coverage will be achieved when the function add is executed.
Statement coverage will be achieved when the function add is called, such as add(1, 1), ensuring that all the lines within the function are executed.
Branch coverage will be achieved when the function is called with add(1, 0) and add(1, 1): the first call does not enter the if statement, leaving the assignment z = x unexecuted, while the second call enters it.
Condition coverage will be achieved when the function is called with add(1, 0), add(0, 1), and add(1, 1), encompassing all possible conditions within the if statement.
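Putting it together, a minimal test suite reaching all four coverage levels for add could look like this sketch:

```rust
pub fn add(x: usize, y: usize) -> usize {
    let mut z = 0;
    if x > 0 && y > 0 {
        z = x;
    }
    z
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn full_coverage() {
        assert_eq!(add(1, 1), 1); // function + statement coverage, true branch
        assert_eq!(add(1, 0), 0); // false branch of the if
        assert_eq!(add(0, 1), 0); // remaining outcome of the first condition
    }
}
```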
LLVM's instrumentation is called source-based because it operates on the AST (abstract syntax tree) and preprocessor information.
Code coverage relies on 3 basic steps:
Compiling with coverage enabled: Enabling code coverage during compilation in clang requires the inclusion of specific flags: -fprofile-instr-generate and -fcoverage-mapping.
Running the instrumented program: When the instrumented program concludes its execution, it generates a raw profile file. The path for this file is determined by the LLVM_PROFILE_FILE environment variable. If the variable is not defined, the file will be created as default.profraw in the program’s current directory. If the specified folder does not exist, it will be generated accordingly. The program replaces specific pattern strings with the corresponding values:
%p: Process ID.
%h: Hostname of the machine.
%t: The value of TMPDIR env variable.
%Nm: Instrumented binary signature. If N is not specified (run as %m), it is assumed to be N=1.
%c: This string does not expand to any specific value but serves as a marker to indicate that execution is constantly synchronized. Consequently, if the program crashes, the coverage results will be retained.
Creating coverage reports: Raw profile files have to be indexed before generating coverage reports. This indexing process is performed by llvm-profdata.
To generate coverage reports for our tests, we can follow the documentation provided by Rust and LLVM. The first step is to install the LLVM tools.
Bash
rustup component add llvm-tools-preview
After installing the LLVM tools, we can proceed to generate the code coverage. It is highly recommended to delete any previous results to avoid potential issues. To do so, we need to execute the following sequence of commands:
Bash
# Remove possible existing coverages
cargo clean && mkdir -p coverage/ && rm -r coverage/*
CARGO_INCREMENTAL=0 RUSTFLAGS='-Cinstrument-coverage' LLVM_PROFILE_FILE='coverage/cargo-test-%p-%m.profraw' cargo test
This will execute the command cargo test and calculate the coverage for all the tests executed. This process will generate the *.profraw files that contain the coverage information.
Generate HTML reports
To visualize the coverage results more effectively, we can utilize the grcov tool, which generates HTML static pages. Installing grcov is straightforward and can be done using cargo.
Bash
cargo install grcov
Once grcov is installed, we can proceed to generate the HTML files that will display the coverage results.
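The invocation itself does not appear in this excerpt; the one used in the script at the end of this post is:

```shell
grcov . --binary-path ./target/debug/deps/ -s . -t html --branch \
      --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/
```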
--binary-path: Set the path to the compiled binary that will be used.
-s: Specify the root directory of the source files.
-t: Set the desired output type. Options include:
html for an HTML coverage report.
coveralls for the Coveralls-specific format.
lcov for the lcov INFO format.
covdir for the covdir recursive JSON format.
coveralls+ for the Coveralls-specific format with function information.
ade for the ActiveData-ETL specific format.
files to only return a list of files.
markdown for an easy-to-read Markdown summary.
--branch: Enable parsing of branch coverage information.
--ignore-not-existing: Ignore source files that cannot be found on the disk.
--ignore: Ignore files or directories specified as globs.
-o: Specifies the output path.
Upon successful execution of the aforementioned command, an index.html file will be generated inside the target/coverage/ directory. The resulting HTML file will provide a visual representation of the coverage report, presenting the coverage information in a structured and user-friendly manner.
Generate lcov files
Indeed, in addition to HTML reports, generating an lcov file can be beneficial for visualizing the coverage results with external tools like Visual Studio Code. To generate an lcov file using grcov, you can use the same command as before but replace the output type “html” with “lcov.” This will generate an lcov file that can be imported and viewed in various coverage analysis tools, providing a comprehensive overview of the code coverage in a standardized format.
Finally, we can install any extension to interpret this information. In my case, I will use Coverage Gutters.
Our VS Code will then look something like this, where we can see visually and dynamically which lines of our code are being tested, along with the total coverage percentage of the current file.
This is the script I recommend using to generate the code coverage with HTML and LCOV files.
Bash
#!/bin/bash

# Define color variables
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

function cleanup() {
    echo -e "${YELLOW}Cleaning up previous coverages...${NC}"
    cargo clean && mkdir -p coverage/ && rm -r coverage/*
    echo -e "${GREEN}Success: Crate cleaned successfully${NC}"
}

function run_tests() {
    echo -e "${YELLOW}Compiling and running tests with code coverage...${NC}"
    CARGO_INCREMENTAL=0 RUSTFLAGS='-Cinstrument-coverage' LLVM_PROFILE_FILE='coverage/cargo-test-%p-%m.profraw' cargo test --workspace
    if [[ $? -ne 0 ]]; then
        echo -e "${RED}Error: Tests failed to execute${NC}"
        exit 1
    fi
    echo -e "${GREEN}Success: All tests were executed correctly!${NC}"
}

function generate_coverage() {
    echo -e "${YELLOW}Generating code coverage...${NC}"
    grcov . --binary-path ./target/debug/deps/ -s . -t html --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/ && \
    grcov . --binary-path ./target/debug/deps/ -s . -t lcov --branch --ignore-not-existing --ignore '../*' --ignore "/*" -o target/coverage/lcov.info
    if [[ $? -ne 0 ]]; then
        echo -e "${RED}Error: Failed to generate code coverage${NC}"
        exit 1
    fi
    echo -e "${GREEN}Success: Code coverage generated correctly!${NC}"
}

echo -e "${GREEN}========== Running test coverage ==========${NC}"
echo
cleanup
run_tests
generate_coverage
Using Tarpaulin
There is an existing powerful tool called Tarpaulin that can do all this for us. We can install it with cargo:
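The command was not included in this excerpt; Tarpaulin is installed and then run from the crate root like this:

```shell
cargo install cargo-tarpaulin
# then, from the root of the crate:
cargo tarpaulin
```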
Branch and condition coverage are the two main functionalities missed in this project. Apart from that, I do think that Tarpaulin is a good choice if you want to quickly assess your code coverage.
Git hooks are script files that can be executed in response to specific events or operations. They can be classified into two distinct categories: client-side hooks and server-side hooks.
How to create a Git Hook
Git hooks are located within the .git/hooks directory. During the initialization of a repository using git init, a set of example hooks is generated, covering various types of hooks. These examples are suffixed with .sample, and in order to utilize them, the suffix must be removed. The provided examples are predominantly written as shell scripts, incorporating some Perl, but any appropriately named executable scripts can be used—be it Ruby, Python, or any other language you are familiar with.
Each hook possesses its own unique interpretation, necessitating a review of the provided examples or documentation to determine whether specific parameters are required, if a return value is expected, and other relevant specifications.
List of Git Hooks
applypatch-msg: Checks the commit log message taken by applypatch from an e-mail. This script can modify the commit message. Exit with a non-zero status to stop the commit.
post-update: Prepare a packed repository for use over dump transports.
pre-applypatch: Despite its name, this script is executed by git am after the patch is applied but before a commit is made. Exit with a non-zero status to stop the commit.
prepare-commit-msg: This hook is called by git commit and is used to prepare the default commit message. It receives the name of the file that holds the commit message and, depending on the situation, the source of the message and a commit SHA-1. If the script returns a non-zero status, the commit will be aborted.
commit-msg: Called with one argument, the name of the file that has the commit message. This script can modify the commit message. Exit with a non-zero status to stop the commit.
pre-commit: This hook is executed when you use the git commit command. It takes no arguments, but it is useful to abort the commit by returning a non-zero status with a message, e.g., stopping the commit if the code is not properly formatted or if the tests are failing.
post-commit: It runs once git commit has finished. It does not take any arguments and is usually used to provide custom messages.
pre-merge-commit: This hook is executed when you are about to perform a git merge operation. It takes no arguments and should return a non-zero status to stop a merge.
pre-push: This hook is executed when you perform a git push command, after the remote status has been checked but before anything is pushed. It takes two arguments: the name of the remote and the URL to which the push will be done. If this script returns a non-zero status, it will stop the push.
pre-rebase: This hook is executed before git rebase starts. It takes two arguments: the upstream the series was forked from and the branch being rebased (empty when rebasing the current branch). If it returns a non-zero status, the rebase will be aborted.
post-checkout: Executed after a successful git checkout.
A wide range of freely available crates exists to enhance the implementation of git hooks in Rust projects. Personally, I have chosen to utilize the rusty-hook crate.
There are two methods for incorporating this crate into our project. The first approach involves adding it as a dev dependency in our Cargo.toml file. Upon building the project for the first time, it will be automatically initialized:
TOML
[dev-dependencies]
rusty-hook = "^0.11.2"
Or install and initialize it:
Bash
cargo install rusty-hook
rusty-hook init
How to use it
Once the crate is initialized, we will be able to see a new file named .rusty-hook.toml at the root of our project. This file is the only thing we need to work with rusty-hook.
The first time we open the file everything is already set up.
Under the [hooks] section, we have the ability to assign a custom script that will be executed for a specific git hook. The following example, extracted from a personal project, showcases a script that formats the code, runs clippy (a Rust linter), and executes all tests in the project before committing the staged changes.
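The file itself was not included in this excerpt; as a sketch (the exact commands are from my reconstruction and may differ in your project), such a .rusty-hook.toml could look like this:

```toml
[hooks]
pre-commit = "cargo fmt -- --check && cargo clippy -- -D warnings && cargo test"

[logging]
verbose = true
```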
In July 2022, a new initiative was presented by Yoshua Wuyts on the Inside Rust Blog. This initiative aims to make it possible to generalize between async and non-async traits and functions, which means that we no longer have to split traits and functions in two to handle both behavioral contexts.
I am not going to focus on the first released document because a week ago, a second update about the iniciative had been released. This time, including aswell the desire to make generic the const and non-const traits and functions.
At the time I am writting this article, Rust 1.69 is the latest nightly version. Take that in count since this iniciative is already focusing in changes that, at this moment, are not even available in standard Rust. Like async traits.
Async traits in Rust
When it comes to defining async functions in traits in Rust, the only way of doing it is by using the async_trait crate. If you are curious about why async traits are not standard in Rust, I highly recommend reading this article.
If we try to declare a trait function as async, the following error message is shown when compiling our crate:
Rust
error[E0706]: functions in traits cannot be declared `async`
 --> src/main.rs:2:5
  |
2 |     async fn Foo() {
  |     -----^^^^^^^^^
  |     |
  |     `async` because of this
  |
  = note: `async` trait functions are not currently supported
  = note: consider using the `async-trait` crate: https://crates.io/crates/async-trait
  = note: see issue #91611 <https://github.com/rust-lang/rust/issues/91611> for more information
  = help: add `#![feature(async_fn_in_trait)]` to the crate attributes to enable
But even using the recommended crate for asynchronous functions, we still lack generalization. See the next code snippet, which fails to compile:
Rust
#[async_trait]
trait MyTrait {
    async fn bar(&self);
}

struct FooAsync;
struct FooNotAsync;

#[async_trait]
impl MyTrait for FooAsync {
    async fn bar(&self) {}
}

impl MyTrait for FooNotAsync {
    fn bar(&self) {}
}

//////////////////////////////////////

 --> src/main.rs:17:11
   |
5  |     async fn bar(&self);
   |     --------------- lifetimes in `impl` do not match this method in `trait`
   |     |
   |     this bound might be missing in the impl
...
17 |     fn bar(&self) {}
   |           ^ lifetimes do not match method in `trait`
So the only possible way is either to create two different traits or to declare two different functions.
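As a minimal sketch of that duplication (names and bodies are illustrative, not from the proposal):

```rust
// The same logic written twice: once synchronous, once async,
// only so that it can be called from both contexts.
fn read_to_string_sync(data: &[u8]) -> String {
    String::from_utf8_lossy(data).into_owned()
}

async fn read_to_string_async(data: &[u8]) -> String {
    // Identical body, duplicated purely so it can be awaited.
    String::from_utf8_lossy(data).into_owned()
}

fn main() {
    assert_eq!(read_to_string_sync(b"hello"), "hello");
}
```

Any change to the logic now has to be made in two places, which is exactly the problem the initiative wants to solve.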
What is the KGI proposing?
The Keyword Generics Initiative (KGI from now on) is proposing a new syntax to declare traits and functions that are generic over asyncness and constness, the same way we already have generalization over types.
Throughout this article, I will focus only on the asyncness generalization, since the explanation is the same for constness.
This is the proposed syntax at this moment:
Rust
trait ?async Read {
    ?async fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
    ?async fn read_to_string(&mut self, buf: &mut String) -> Result<usize> { ... }
}

/// Read from a reader into a string.
?async fn read_to_string(reader: &mut impl ?async Read) -> std::io::Result<String> {
    let mut string = String::new();
    reader.read_to_string(&mut string).await?;
    Ok(string)
}
Yay! Finally we have our generalization over asyncness. The ?async syntax was chosen to match other examples like ?Sized. If we call read_to_string in an asynchronous context, it will suspend at the .await? call, waiting for completion. If we call it in a non-async context, the .await? will just act as a no-op.
Rust
// `read_to_string` is inferred to be `!async` because
// we didn't `.await` it, nor expected a future of any kind.
#[test]
fn sync_call() {
    let _string = read_to_string("file.txt")?;
}

// `read_to_string` is inferred to be `async` because
// we `.await`ed it.
#[async_std::test]
async fn async_call() {
    let _string = read_to_string("file.txt").await?;
}
In addition to that, the KGI is also proposing a new built-in function called is_async(). This would let the user create branches in the control flow depending on the calling context.
Advantages
The advantages are pretty obvious: this is super powerful for avoiding duplicated code. It is almost like magic: a single function that can be used in different contexts.
This would help a lot of crates, and even the standard library, reduce the existing split implementations for async and non-async contexts. It would even let library authors declare all their traits as “maybe async” and let users choose which variant they want to implement.
Disadvantages
Syntax complexity
The main disadvantage raised by the community is the syntax. I also think Rust's syntax is becoming more and more complex, and with one of the Rust Team's main objectives being to make Rust easier to learn, this will not help at all.
I said I didn't want to include the “maybe const” signature in this article, but it is important to mention that the two can be used together, e.g.:
Rust
trait ?const ?async Read {
    ?const ?async fn read(&mut self, buf: &mut [u8]) -> Result<usize>;
    ?const ?async fn read_to_string(&mut self, buf: &mut String) -> Result<usize> { .. }
}

/// Read from a reader into a string.
?const ?async fn read_to_string(reader: &mut impl ?const ?async Read) -> io::Result<String> {
    let mut string = String::new();
    reader.read_to_string(&mut string).await?;
    Ok(string)
}
The KGI is also considering a combined syntax for this, called ?effect/.do, but I will not go into it because it is not included in the current proposal.
So… hard to understand that code for “just” a “maybe async and maybe const” trait and its functions, right? Let's make it even more fun and add generic types, e.g.:
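The original snippet is not reproduced here, but a rough sketch of how such a signature might look under the proposed syntax (illustrative only; this does not compile in any Rust today) would be something like:

```rust
trait ?const ?async Read<T> {
    ?const ?async fn read(&mut self, buf: &mut [T]) -> Result<usize>;
}

?const ?async fn read_to_string<T, R>(reader: &mut R) -> io::Result<String>
where
    R: ?const ?async Read<T>,
{ ... }
```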
This is getting out of hand… I would spend half of my workday just reading function signatures… For those curious, the proposal's example takes 111 characters just to declare a function with one input parameter.
Low implementation efforts
It is easy to imagine a lot of people using this feature to declare functions vaguely, something like: “I don't know whether this will be async or not, so let's declare it as maybe async.” That would cause a lot of half implementations and smelly code whose only signal is that something is ?async, inviting calls from both async and non-async contexts.
I think this is where the compiler should take action. We should avoid at all costs the possibility of a user calling a function in an async context only to later realize that the implementation is completely synchronous. This is not acceptable: the compiler should check that both a possible async and a non-async branch flow are defined. That could be difficult if we want to keep it completely opaque to the user by simply interpreting .await? as a no-op in a synchronous context, but it would be easier with the proposed is_async() and is_const().
My proposed changes
Since I would really like to see this running in the future, I would also like to propose some of my thoughts on how we could improve it.
Syntax
I really like the ?async syntax, and I prefer it over the async? signature some other users were proposing. The question-mark prefix is already used for trait bounds like ?Sized, and I think it fits the maybe-async context well.
Even though ?effect/.do is not formally proposed, I hope the combination of ?async and ?const will not go through. It would add a lot of syntax complexity, and Rust is aiming for exactly the opposite at this moment.
Implementation restrictions
It is important to ensure the user never faces a situation where they are calling a function in an asynchronous context while the function works entirely synchronously. The compiler must check that the requirements are satisfied for both contexts.
Another option is to split the implementation into two functions, e.g.:
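The original example is not shown here, but a minimal sketch on today's stable Rust of what such a split might look like (all names are illustrative): the trait is declared once, and the async variant is a default method that wraps the synchronous implementation in an already-completed future.

```rust
use std::future::{ready, Future};
use std::pin::Pin;

trait ReadData {
    // The single implementation the user has to provide.
    fn read_data(&mut self) -> String;

    // Async counterpart, provided by default: it runs the sync
    // implementation and wraps the result in a ready future.
    fn read_data_async(&mut self) -> Pin<Box<dyn Future<Output = String>>> {
        Box::pin(ready(self.read_data()))
    }
}

struct File;

impl ReadData for File {
    fn read_data(&mut self) -> String {
        "valuable data".to_string()
    }
}

fn main() {
    let mut f = File;
    // Sync context: call the plain function directly.
    assert_eq!(f.read_data(), "valuable data");
    // Async context: `f.read_data_async().await` would yield the same value.
}
```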
Even though this still produces some duplicated code, it has the advantage of declaring the trait and its functions once, with the correct function being called without user action.
The other day, looking at the std::Vec documentation in Rust, I found this interesting function called swap_remove.
First, we have to talk about the obvious difference between the standard remove, which everybody has used before, and swap_remove, which is a trickier implementation suited to certain use cases:
Remove
Remove, as defined in its documentation, has the following signature: pub fn remove(&mut self, index: usize) -> T. It takes the vector itself (mutably) and the index of the element to be removed, and returns that element.
What is important to mention is how Rust (and almost every other language) handles vectors. In computer science, a vector is defined as a collection of elements laid out contiguously in memory; that means the distance between two adjacent elements of a vector is the size of one element.
Once we delete an element in the middle of the structure (if you need to remove elements from the front, the Rust documentation recommends VecDeque instead), Rust has to shift the remaining elements to avoid empty spots in the collection. This work has O(n) complexity in the worst case, which means that if we have one million elements and we want to remove the second one, the vector has to shift 999,998 elements!
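A tiny sketch of that behaviour (my own example, not from the docs):

```rust
fn main() {
    let mut v = vec![1, 2, 3, 4, 5];

    // `remove` closes the gap by shifting every later element one
    // slot to the left: O(n) in the worst case.
    let removed = v.remove(1);

    assert_eq!(removed, 2);
    assert_eq!(v, [1, 3, 4, 5]); // order is preserved
}
```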
Swap_remove
The function swap_remove is useful to handle this problem. This operation swaps the element to be erased with the last element, and then pops the last element off, all in O(1). E.g., if we remove the value 2 from [1, 2, 3, 4], the vector will look like the following:
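A small sketch of that behaviour:

```rust
fn main() {
    let mut v = vec![1, 2, 3, 4];

    // `swap_remove` overwrites index 1 with the last element and
    // pops the tail: O(1), but the order is not preserved.
    let removed = v.swap_remove(1);

    assert_eq!(removed, 2);
    assert_eq!(v, [1, 4, 3]); // 4 took the place of 2
}
```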
When to use it or not
After that explanation, it seems like we should always use swap_remove instead of remove, right? Wrong!
Vectors are a powerful structure when the values are kept sorted, and that invalidates the whole point of using swap_remove, since we would have to sort the values again afterwards.
We also have to be careful when working with raw pointers in Rust, since they are unsafe, but I will explain some errors we can run into.
Rust
let mut v: Vec<i32> = vec![1, 2, 3, 4];
let points_to_four = &v[3] as *const i32;
let points_to_two = &v[1] as *const i32;
assert_eq!(unsafe { *points_to_four }, 4);
assert_eq!(unsafe { *points_to_two }, 2);
v.swap_remove(1);
assert_eq!(unsafe { *points_to_four }, 4); // Could be false at some point that we cannot control
assert_ne!(unsafe { *points_to_two }, 2);  // Now points to the value 4 at v[1]
This test was very interesting. It seems that Rust copies the element into the vacated slot instead of moving it, which is understandable, since an i32 implements the Copy and Clone traits.
These are unsafe operations, and we have to be very careful about what we are doing, because if we had a Vec<T> where T is !Copy, the assert might or might not hold. By the way, even though it holds now because the value was copied, this could change in the future once that memory is reclaimed. The same applies to vec::remove.
Rust
let mut v: Vec<i32> = vec![1, 2, 3, 4];
let last_ptr = &v[3] as *const i32;
assert_eq!(unsafe { last_ptr.read() }, 4);
v.swap_remove(1);
assert_eq!(unsafe { last_ptr.read() }, 4); // Still true
v.insert(1, 10);
assert_eq!(unsafe { last_ptr.read() }, 4); // FAIL: left: 3, right: 4. The vector has grown, shifting the value 3 to the right.
Conclusion
We should use vec::remove when we care about the order and vec::swap_remove otherwise. Working with raw pointers while removing elements from a vector is highly unsafe, but at worst… we have the same as C++, right?
Testing in software development tends to be the forgotten and underestimated part. It takes time and does not produce real products or concrete features, which is why it usually gets relegated in order to prioritize other aspects of the development cycle.
There is a rule that says you should have one line of testing for every line of code. In practice, this is usually not fulfilled, mostly due to time, money and resources. Companies often underestimate how powerful testing is and then spend a lot of hours finding and fixing bugs that should never have reached production.
Testing in big projects
Testing in big projects is always a must. You may have more or fewer tests, better or worse tests, but there is always some testing in huge projects where a lot of people are involved and one minor change can cascade into a bug. Companies are aware of that and tend to force their developers to write tests, but let's be honest: testing is usually the part developers hate to implement, and they tend to get rid of it as soon as possible, leaving a lot of cases uncovered.
Testing in small and personal projects
But obviously, if there is one place where testing is absolutely forgotten, it is in small and, above all, personal projects. Hey! I am not saying you have to “waste” time testing your small project used by absolutely nobody in the world. But this habit of not implementing tests in small projects tends to extend to libraries and small software products that do have users.
Almost 6 months ago, I started learning Rust and trying to participate heavily, adding my small grain of sand to the ecosystem the community is building around it. That includes libraries, frameworks, external tools and also projects.
When you add someone's library to your project, you are trusting that piece of software. You take for granted that everything will work as intended. And you, as a developer, have the moral responsibility to make that true; otherwise, you are not only shipping bad software but also making other developers waste their time through your fault.
Hidden gems in testing
Programmers usually understand testing as a way to ensure that their code is well constructed for any possible scenario that may happen at execution time.
While this is true, it is only one of the reasons for testing. Apart from covering all the possible edge cases your program can have, there are other ways testing has your back.
Development speed in the future
Adding tests to your software gives you the ability to quickly ensure that any modification you make will not break the existing logic. This is the main reason testing is done in big projects: you want to implement a new feature or fix a bug, and if the testing is properly done, you do not have to worry about breaking other things with your new implementation, because the tests will ensure that for you.
Third party libraries
If we think of ourselves as the only people who can break our software, we are wrong. Usually, in product development we use a bunch of third-party libraries that are constantly updating and changing their logic. Maybe that third-party function you used, getDollars(float euros), changed in its latest release and its new implementation looks more like getDollars(float pounds).
This change in a third-party library will produce a silent bug in our code. Now all of our conversions will return an incorrect value, since we thought we were converting euros to dollars, but the program will not crash in any way unless we wrote the proper tests for that.
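A minimal sketch of such a pinning test (the function name, rate and values are hypothetical, not a real API):

```rust
// Stand-in for the third-party call; the conversion rate is an
// assumption made up for this example.
fn get_dollars(euros: f64) -> f64 {
    euros * 1.08
}

fn main() {
    // Pin the expected behaviour: if the dependency silently starts
    // interpreting the input as pounds, this assertion fails at once.
    let dollars = get_dollars(100.0);
    assert!((dollars - 108.0).abs() < 1e-9);
}
```

A single assertion like this turns a silent semantic change in a dependency into a loud test failure.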
External changes
Sometimes our software depends on public data or values that we take for granted as immutable. This is the particular case that made me write this post.
Recently, I have been working on a Rust project that I will talk more about soon. The project consists of web scraping some websites in order to retrieve information and process it. For this reason, the implementation of my logic depends heavily on how the web page is structured, and any minor change to the website made by its developers can easily break all my code.
Obviously, I cannot anticipate how and when the web programmers and UI/UX teams will update the site. In that case, as well as having tests that try to cover all the possible failures if the page layout changes, I also want my tests to fail the day it happens. For example:
HTML
<!-- Version 1.0 -->
<html>
  <body>
    <p> Valuable data $$ </p>
  </body>
</html>
C++
// logic.cpp
std::string data = get_first_element_in_body();
In this case our logic is simple: we want to get the first element inside the body in order to retrieve our valuable data. But then the webpage makes a change, and this is the new code:
HTML
<!-- Version 1.1 -->
<html>
  <body>
    <h1> Stupid title </h1>
    <p> Valuable data $$ </p>
  </body>
</html>
With this small change, all our code is absolutely and silently broken. The developers decided to add a title before that information for aesthetic reasons, and now our web scraping does not provide the same result as before. This would be very difficult to find without tests for it. Imagine we were storing this data directly in a database or sending it to another service; we wouldn't know anything until a user told us.
This can be prevented with a simple, small test:
C++
// test.cpp
std::string data = get_first_element_in_body();
// A runtime assert (not static_assert, since the data is only known
// at runtime) that fails as soon as the scraped element changes.
assert(is_this_data_valuable(data));
Conclusion
Testing is as important as the real product implementation. We have to start thinking of it as a life saver and not as wasted time. In this article, I have also shown that testing is important not only for breaking changes you may introduce in your own code but also for silent bugs introduced by someone else entirely. I hope that after this read you start seeing testing this way. You cannot imagine how much stress it takes away, while developing, to know that your product works as intended just by running a single command that executes the tests, or a CI flow that ensures nobody can change anything that breaks the existing implementation.