Proper input validation in 4 steps

31 Dec 2021

Proper validation of external (or user) input is one of the basics of building a secure application. Input validation can be linked to the simple secure coding paradigm 'never trust the client'. But however simple this sounds, the implementation often leaves much to be desired.

The why

From a security perspective, the main reason for proper input validation is the prevention of injection vulnerabilities. In the latest OWASP Top 10 (2021), injection is ranked in 3rd place, so it still poses a major and prevalent risk, and proper input validation is an important protection against it. But the risks of missing or improper input validation go beyond basic injection issues like command injection and code execution. Proper validation can also prevent corruption of data, overloading of systems or, in some cases, even denial of service. And, partly outside the security scope, proper validation can also help businesses maintain a high level of data correctness. This makes input validation an absolute must-do for all applications consuming input data, especially if this data comes from an untrusted source.

The 4 steps

Good input validation consists of 4 basic steps, the first step being optional and the other 3 mandatory.

Sanitation
Input checking
Feedback
Logging

Step 1 : Sanitation

As the first step in validation, input may be cleaned up (sanitized). Sanitizing, however, is an optional step in the validation process. Two types of sanitation can be distinguished: removing unnecessary characters and normalization.

Removing unnecessary characters

Sanitizing input by removing parts of that input should never be done without properly assessing the impact of changing the input. Cleaning up input changes the original user input without the user knowing his or her input has been changed. This could lead to messy data or in some cases even legal problems. Use sanitation only when you are very clear about the expected inputs and the possible results after the sanitation. But although you have to be very careful when sanitizing any input, sanitation can make further validation steps easier. As a rule-of-thumb, use sanitation only to remove or replace parts of input values a user did not intentionally add. Common examples of this are:

Trailing and leading whitespace characters
Extremely common user typos
Trailing and leading alphabetic characters for a numeric input
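
The rule of thumb above can be sketched in Java as follows. This is a minimal illustration, not a complete sanitizer: the class and method names are hypothetical, and it assumes Java 11 or later for `String.strip()`. It only removes leading and trailing whitespace, which a user rarely adds on purpose, and leaves everything else for the later validation steps to accept or reject.

```java
final class InputSanitizer {

    private InputSanitizer() {
    }

    // Removes leading and trailing whitespace (as defined by
    // Character.isWhitespace). Whitespace inside the value is kept, so the
    // later format checks can still reject it. The input must not be null.
    static String stripOuterWhitespace(String input) {
        return input.strip();
    }
}
```

Note that `strip()` does not remove non-breaking spaces, so input pasted from a web page or a document may still need to be rejected by the format check.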

Normalization

Many Unicode characters can be encoded in more than one way, especially letters with accents. A Unicode letter with an accent can often be encoded as a single accented letter or as a basic letter and a separate accent symbol (or diacritic). For example, the letter ắ can be encoded as U+1EAF, as U+0103 U+0301 or as U+0061 U+0306 U+0301. All should be displayed as the same character, but differences in how systems or applications process Unicode could lead to issues. When you process Unicode, make sure to use a consistent way of encoding. By normalizing your input before further processing, you can make sure you are using a consistent encoding.

// Normalize the input to use default encoding (type C)
string normalizedString = input.Normalize();
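
For comparison, a roughly equivalent sketch in Java uses `java.text.Normalizer` from the standard library. The wrapper class and method name are hypothetical. The sketch shows that the three encodings of ắ listed above become identical once they are converted to Normalization Form C.

```java
import java.text.Normalizer;

final class UnicodeNormalization {

    private UnicodeNormalization() {
    }

    // Returns the input in Normalization Form C (canonical composition).
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }
}
```

After normalization, a plain `equals` comparison or a regex check treats all three variants the same way, which is why normalization is best done before any other check.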

Step 2 : Input checking

The most common mistake when checking input is to check only for certain risky values. Some applications, for example, only check for < or > symbols to try to prevent script injection. But by blacklisting unwanted input in this way it is almost impossible to catch all forms of unwanted input. Although blacklisting can in some cases add some extra security, it should never be the first step in validation. It is usually much easier and more secure to use a whitelisting approach, stating which input is allowed and marking all other input as invalid. Good validation should always start with whitelisting, which can in some situations be extended with blacklisting. Validation should consist of a combination of one or more of the following checks; sometimes multiple checks can be combined in a single check statement.

Check basic datatype
Check format
Check length/range
Check for risky input

1. Check basic datatype

The most basic check is validating whether the input is of the expected datatype (for example, does the input given for a numeric field consist only of digits). Depending on the development language or framework there are a number of basic datatypes; integer, float, string/text and date are usually among them. Most modern languages or frameworks will automatically throw an exception when (user) input cannot be converted to the expected datatype, so invalid datatypes will not be accepted. But never rely blindly on this. If manual validation of the datatype is required, try to use libraries for this.

Integer parsing in Java

// Throws NumberFormatException when input is not a valid integer
int integerValue = Integer.parseInt(input);

Integer parsing in C#

int.TryParse(input, out integerValue)
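
A slightly fuller Java sketch of the datatype check is shown below. The class and method names are hypothetical. It wraps `Integer.parseInt` so that invalid input results in an empty `OptionalInt` instead of an exception, which resembles the `TryParse` style of the C# example.

```java
import java.util.OptionalInt;

final class IntegerInput {

    private IntegerInput() {
    }

    // Returns the parsed value, or an empty OptionalInt when the input is
    // null, not a base-10 integer, or outside the range of an int.
    static OptionalInt parse(String input) {
        if (input == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(input));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }
}
```

`Integer.parseInt` does not trim whitespace and accepts a leading sign, so `" 42"` is rejected while `"-7"` is accepted. Whether negative values are allowed is a matter for the range check.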

2. Check format

With some simple input types (like integer), no additional format checking is necessary once the datatype is properly checked. In other cases input needs to be checked to adhere to a specific format. A format can be generic (like zip code, phone number, email address) or business specific (like customer ID, order number, etc.). For many common formats standard regular expressions are available, so use those where they exist.

Example generic regex patterns (OWASP)

Usage	Pattern
Email	`^[a-zA-Z0-9_+&*-]+(?:\.[a-zA-Z0-9_+&*-]+)*@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,7}$`
Creditcard number	`^((4\d{3})|(5[1-5]\d{2})|(6011)|(7\d{3}))-?\d{4}-?\d{4}-?\d{4}|3[4,7]\d{13}$`
US ZIP code	`^\d{5}(-\d{4})?$`
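
As an illustration, the US ZIP code pattern from the table above could be applied in Java as sketched below. The class and method names are hypothetical. `Matcher.matches()` requires the whole input to match, so the check does not depend on the anchors alone.

```java
import java.util.regex.Pattern;

final class FormatChecks {

    // US ZIP code: five digits, optionally followed by a hyphen and four digits.
    private static final Pattern US_ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

    private FormatChecks() {
    }

    // Returns true only when the entire input is a US ZIP code.
    static boolean isUsZipCode(String input) {
        return input != null && US_ZIP.matcher(input).matches();
    }
}
```

Compiling the pattern once as a constant avoids recompiling it for every request.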

Some input types are more complex, especially anything containing names. For things like person names, company names and city names, simply whitelisting the basic letters of the alphabet (a-z) would be too restrictive. Often you also need to allow diacritic or accented letters (é, Ꝑ, Ø) or letters in non-Latin scripts, as they will be valid in certain names. In these cases it is often helpful to whitelist Unicode categories. This way you can allow only letters (with or without accents, in any script). Regex has options for checking Unicode categories, and many programming languages also support checking them directly.

Example of regex pattern using Unicode Categories

Pattern	Explanation
`^[\p{L} -]{2,20}$`	Allowing between 2 and 20 characters consisting of Unicode letters (from all scripts/languages), spaces or hyphens.
`^\p{N}{4}$`	Allowing exactly 4 numeric characters in any script (this also includes Roman numeral characters).

Regex Tutorial - Unicode Characters and Properties (regular-expressions.info)

Example of C# check if character is Unicode Letter

// check if character is a letter
char.IsLetter(character)
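
The name pattern `^[\p{L} -]{2,20}$` from the table above can be used in Java roughly as follows. This is a sketch with hypothetical class and method names. It assumes the input is first normalized to Form C, because a letter followed by a separate combining accent is not matched by `\p{L}` alone.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class NameCheck {

    // 2 to 20 characters, each a Unicode letter, a space or a hyphen.
    private static final Pattern NAME = Pattern.compile("^[\\p{L} -]{2,20}$");

    private NameCheck() {
    }

    // Normalizes to NFC first so that decomposed accents are composed
    // into single letters before the category check runs.
    static boolean isValidName(String input) {
        if (input == null) {
            return false;
        }
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        return NAME.matcher(normalized).matches();
    }
}
```

Real names can also contain apostrophes or periods, so the allowed set usually needs to be tuned to the business context.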

3. Check length/range

An important check is to validate that the input does not go below or above an expected minimum or maximum. For example, an amount field in a web shop order form should realistically never have to allow more than 3-digit numbers, let alone values of millions or billions, or negative numbers. Another common issue is limiting dates: some date inputs should never be in the past, and allowing birthdates to be 1000 years in the past could lead to some interesting issues. Always think about what would be a range of acceptable input and check for this range, whether the range is a number of characters, a date limit or a numeric range.

Limiting the number of characters allowed can often be done with the regex you use for format checking. For numerical input, checking if the input falls between a specific minimum and maximum is difficult to do with regex but can be easily done with basic comparisons.

Regex pattern	Explanation
`^\d{1,3}$`	Allowing only whole numbers between 1 and 3 digits.
`^(1[0-9]|2[0-5])$`	Allows whole numbers in the range of 10 through 25.
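
For numeric ranges, a comparison after parsing is usually simpler than a regex. A minimal Java sketch, with hypothetical class and method names and inclusive bounds:

```java
final class RangeCheck {

    private RangeCheck() {
    }

    // Returns true only when the input is a base-10 integer within
    // [min, max]. Anything that cannot be parsed is treated as out of range.
    static boolean isIntInRange(String input, int min, int max) {
        try {
            int value = Integer.parseInt(input);
            return value >= min && value <= max;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Calling `RangeCheck.isIntInRange(input, 10, 25)` covers the same range as the second regex above, and the bounds can be changed without rewriting a pattern.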

4. Check for risky input

For most types of input the previous three checks will suffice. They all use some form of whitelisting (which characters are allowed), which should be the basis of any input validation. In some cases, however, additional measures are needed, and this is where blacklisting (which characters or character combinations are not allowed) comes into play. Additional blacklisting is useful for large (rich) text inputs like feedback forms, for example. Such inputs are often almost impossible to validate with whitelisting alone: there may simply be too many allowed characters, and in some cases certain risky characters would also have to be whitelisted, which defeats the whole purpose of whitelisting. In these cases use a very broad whitelist and add a blacklist to disallow both specific characters and character combinations.

There is no single list of risky input, as the risk depends on where the input values will be processed or displayed (in HTML, in SQL queries, ...). The characters < and > are risky and rarely valid input, so they are good candidates for a blacklist. Other risky input, like quotes, is often valid, so it cannot always be blocked. Be aware that risky characters can also be encoded (for example with HTML encoding), and the encoded forms can sometimes also pose a risk. The character < can, for example, be encoded as &lt;, \u003c or %3C, so consider blocking these encoded character combinations as well. Always keep in mind that you'll never be able to block every possible risky input this way: use blacklisting as an additional defensive measure, but don't rely on it too much.

Regex pattern	Explanation
`^[^<>]+$`	Allowing everything except `<` and `>`.
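
A blacklist that also covers a few encoded forms might look like the Java sketch below. The class name, method name and token list are assumptions chosen for illustration. The list is deliberately incomplete and should be seen as an extra layer on top of whitelisting, not as a replacement for it. It assumes Java 9 or later for `List.of`.

```java
import java.util.List;
import java.util.Locale;

final class RiskyInputCheck {

    // Illustrative, incomplete list: the angle brackets themselves plus
    // their HTML entity, escaped-code-point and percent-encoded forms,
    // all written in lower case.
    private static final List<String> BLOCKED = List.of(
            "<", ">", "&lt;", "&gt;", "\\u003c", "\\u003e", "%3c", "%3e");

    private RiskyInputCheck() {
    }

    // Returns true when the input contains any blocked token, ignoring case.
    static boolean containsRiskyInput(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        for (String token : BLOCKED) {
            if (lower.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```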

Step 3 : Feedback

Every input should be validated, but what to do with invalid input? Should you just deny the input or should you remove or encode any undesired characters and process this cleaned up input?

Denying invalid input

In almost all situations the best solution is to deny any invalid input by responding to the calling system or user with a fault code and a message stating which validation rules were violated for which input. For REST APIs an HTTP response code of 400 (Bad Request) is the recommended return code.
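
The sketch below shows one possible way to collect violations per field and map them to a response code in Java. The order form, its two fields, the messages and all names are hypothetical; a real application would typically rely on the validation support of its web framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OrderValidation {

    private OrderValidation() {
    }

    // Returns a map of field name to violated rule; empty when all is valid.
    static Map<String, String> validate(String quantity, String zipCode) {
        Map<String, String> violations = new LinkedHashMap<>();
        if (quantity == null || !quantity.matches("\\d{1,3}")) {
            violations.put("quantity", "must be a whole number of 1 to 3 digits");
        }
        if (zipCode == null || !zipCode.matches("\\d{5}(-\\d{4})?")) {
            violations.put("zipCode", "must be a US ZIP code such as 12345 or 12345-6789");
        }
        return violations;
    }

    // 400 (Bad Request) when any rule was violated, otherwise 200.
    static int httpStatusFor(Map<String, String> violations) {
        return violations.isEmpty() ? 200 : 400;
    }
}
```

Returning every violation at once, instead of stopping at the first, lets the caller correct all fields in a single round trip.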

Cleaning up invalid input

The risks of invalid input can theoretically be reduced or mitigated by removing invalid parts or by encoding the entire input. Neither method is advisable, however. Removing parts of the input could lead to messy and unpredictable data, and could even introduce new validation violations.

Encoding (for example HTML encoding) of input as a way of sanitation is another difficult matter. Proper encoding is an important security measure, but it is best used when outputting or displaying input to the user, not immediately on the input side. As the right type of encoding depends on the way the data will be output or displayed, it is difficult to be absolutely sure which encoding type to use on the input side.

When there is a good reason to clean up invalid input instead of simply denying it, always revalidate the result of the cleaning action.
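
A small Java sketch shows why revalidation matters. The class and methods are hypothetical, and the cleanup is intentionally naive (it removes only the literal, lower-case text `<script>`). Removing the inner tag from a nested input leaves a new tag behind, which only a second validation pass detects.

```java
final class CleanupThenRevalidate {

    private CleanupThenRevalidate() {
    }

    // Naive cleanup: removes each literal "<script>" found in a single
    // left-to-right pass; text formed by the removal is not rescanned.
    static String removeScriptTag(String input) {
        return input.replace("<script>", "");
    }

    // The validation rule that must be applied again after cleaning.
    static boolean isClean(String input) {
        return !input.contains("<script>");
    }
}
```

For the input `<scr<script>ipt>` the cleanup returns `<script>`, so `isClean` still fails on the cleaned value and the input should be denied after all.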

Step 4 : Logging

An often overlooked step in proper input validation is logging. Insufficient logging in general is still a major application vulnerability (see the OWASP Top 10). Even if someone is trying to attack your application by injecting dangerous values and your input validation blocks these attempts, you still want to know what happened. That's where logging comes into play. In most situations you'll want to log every validation failure event, but be aware that this could lead to a large number of log entries for some applications. Try to predict the expected storage requirements for your logging and always monitor the growth of your logs.

As with all logging, the log entry should include the five W's (What happened, Where did it happen, Why did it happen, When did it happen and Who was involved). So log the time of the validation failure, the location (which input field in which part of the application), the details of why the input was seen as invalid (which validation rules were violated) and any identifying properties of the sender of the input (IP address or user ID). Adding the raw user input to the log entry can also be useful, but be careful with this as it can include personal data (PII), and storing this without proper reason could be a violation of local privacy laws.
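
A log entry along these lines could be built in Java as sketched below, using `java.util.logging` from the standard library. The class name, logger name, field labels and message layout are assumptions. The raw input is deliberately left out of the entry because of the privacy concern mentioned above.

```java
import java.time.Instant;
import java.util.logging.Logger;

final class ValidationLog {

    private static final Logger LOG = Logger.getLogger("validation");

    private ValidationLog() {
    }

    // Builds one log line: what happened, when, where, why and who.
    static String format(Instant when, String where, String rule, String who) {
        return "validation_failure"
                + " when=" + when
                + " where=" + where
                + " why=" + rule
                + " who=" + who;
    }

    // Logs a validation failure at WARNING level with the current time.
    static void logFailure(String where, String rule, String who) {
        LOG.warning(format(Instant.now(), where, rule, who));
    }
}
```

Keeping the formatting in a separate method makes the layout easy to test and keeps entries consistent across the application.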

Logs of validation failures can be used for a number of reasons. For some purposes your logs will need to be inspected (long) after the actual event (e.g. auditing, non-repudiation, lawful inspections). In these cases retention times are important. In many other situations you want to be notified immediately of certain validation failures. In these cases adequate monitoring of the logs needs to be in place.

Additional Tips & Tricks

By following the four basic steps you'll have a consistent and easy way to implement proper validation. But there are some additional tips to make this even easier.

Reuse

First of all, reuse existing components and libraries when possible. This principle is valid for all software development but is extra important for validation. Some input validation checks are pretty complex, so check which functions or libraries are available for your programming language or framework before you start building something yourself.

Build company libraries

Although many generic input validation options are available through language frameworks or public libraries, all companies will need to validate certain inputs that are very specific to the company. When these company-specific validation rules are used by more than one team, put them in a reusable library to be used throughout the company. This way you'll make sure everyone uses the same rules.

Multi layer validation

The security design principle 'defence in depth' is also valid for validation. In a multi-layered system, where data passes through a chain of applications, you should at least have input validation on the server-side of the public facing application. But in a properly secured system, you'll want to validate the input for all applications involved in the chain. Even for applications only receiving input from sources within your network, input validation improves security. Try to add validations in as many locations as possible, not just in outward facing applications.