
Test data best practices

Guidance

Test Data Management (TDM) has evolved into a specialized industry to address key challenges:

  • Generating representative test data that includes important edge cases
  • Creating test data with production-like scenarios
  • Creating and maintaining test accounts to mimic role-based access
  • Resetting test data when APIs add, update, or delete data

Best practices for managing test data include:

  • Generating synthetic data
  • Mocking API responses
  • Anonymizing production data to eliminate sensitive information
  • Maintaining user-role test accounts
  • Creating representative data to reflect production complexities
  • Implementing data reset and refresh processes

Generate synthetic test data

Synthetic test data is best suited for teams that manage their database and the data within it, or that need variety in mocked API responses.

Synthetic test data can be generated to mimic real production data. There are open-source tools and paid products that generate different types of data, such as random names, addresses, and phone numbers. When you need thousands of test data records, using an industry tool is better practice than developing custom tooling. The key is to generate representative data.
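For illustration, here is a minimal sketch using the open-source Faker library for Python, one of many such tools. The field names are hypothetical and not tied to any particular schema.

```python
from faker import Faker

fake = Faker()
fake.seed_instance(1234)  # seed so test runs are reproducible

def make_test_record() -> dict:
    # Field names are illustrative only.
    return {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "address": fake.address(),
        "phone": fake.phone_number(),
    }

# Generating thousands of records is trivial compared to hand-written fixtures.
records = [make_test_record() for _ in range(5000)]
```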

Mock API responses

Mocking API responses is closely related to generating synthetic test data, but in this case you don't manage the database or the data within it and therefore can't control the data. Instead, mocked results are returned to simulate calling the interface. This may still require generating synthetic test data for the mocked responses.

For example, suppose your API must call another API within VA, called the Rx API, to retrieve a Veteran's prescriptions. Instead of your API making the actual call to the Rx API, a mocked interface returns a result from a set of mocked responses you manage, simulating the actual call.

A good implementation would have the mocked response return different types of prescriptions and a variety of list sizes for each Veteran test account. Try to be as representative of the real experience as possible.

Mocked responses can often be simplified to JSON blobs instead of maintaining a replica test database.
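A minimal sketch of this approach follows. The Rx API's client interface, the test account IDs, and the response shapes are assumptions for illustration, not the real interface.

```python
# Canned JSON blobs keyed by hypothetical test account ID.
MOCKED_RESPONSES = {
    "veteran-001": {"prescriptions": []},  # empty list
    "veteran-002": {"prescriptions": [{"id": "rx-1", "drug": "Lisinopril"}]},
    "veteran-003": {
        # Larger list to exercise pagination.
        "prescriptions": [{"id": f"rx-{n}", "drug": "Metformin"} for n in range(40)]
    },
}

class MockRxApiClient:
    """Stands in for the real Rx API client in non-production environments."""

    def get_prescriptions(self, veteran_id: str) -> dict:
        # Return a canned blob instead of calling the live interface.
        return MOCKED_RESPONSES.get(veteran_id, {"prescriptions": []})
```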

Caution

  • Mocked interfaces often respond faster than the actual interface call.
  • Mocked interfaces may not completely simulate the behavior of the actual interface and may therefore give false confidence that production will behave the same way.

Mocking responses can have drawbacks that API teams must consider. Since the underlying code isn't executed when interfaces are mocked, the API may appear to behave correctly in non-production environments but fail in production. Considerations include:

  • Response times not reflecting actual interface calls.
  • Mocked responses not representing actual results from the called interface, whether due to mistakes in generating test data or due to production systems drifting from the mocked responses, such as a change to a property's datatype.
  • Network and security factors with server certificates and credentials going untested.

To ensure API quality, perform comprehensive end-to-end testing in a fully integrated non-production environment that does not rely on simulated interfaces.

Anonymize production data

Anonymizing data is the process of irreversibly transforming all personal and sensitive data.

Requirement

  • Anonymized data must be irreversible.

Irreversible means the resulting data cannot be reverse-engineered to reconstruct the original data. For example, consider the name "John Doe" as a real person. If this is anonymized by shifting each letter one character to the right, the name becomes "Kpio Epf". This anonymized name has two major drawbacks: it is easily reverse-engineered to recover the actual name, and it is unpronounceable. This method and methods like it are discouraged and represent an anti-pattern.

Anonymization is best done with a purpose-built software product. However, customizing such tooling may be necessary, such as reducing the total dataset to fit a smaller test database. The best anonymization tools would take the above example of "John Doe" and replace "John" with a random choice from a large dataset of first names, then replace "Doe" with a random choice from a large dataset of last names. If truly random choices are made, it is impossible to reverse-engineer back to the original name.
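A minimal sketch of that random-replacement approach, using Faker's name pools as a stand-in for the large name datasets described above:

```python
from faker import Faker

fake = Faker()

def anonymize_person(record: dict) -> dict:
    # Replacement names are random picks, not derived from the input,
    # and no mapping is retained, so the original cannot be reconstructed.
    return {
        **record,
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
    }

original = {"id": 42, "first_name": "John", "last_name": "Doe"}
print(anonymize_person(original))  # e.g. {'id': 42, 'first_name': 'Megan', ...}
```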

The anonymization strategy is best suited for teams that manage the database and the data it contains, where the data is relatively simple, and where the production data can be reduced to fit the smaller database footprint of a typical development environment.

Keep in mind that sensitive data includes more than just the names given in the above example. It covers a wide range of data elements, including Social Security numbers (SSNs), Internal Control Numbers (ICNs), sensitive dates (e.g., birth, death, and service dates), email addresses, physical addresses, phone numbers, and medical records. Remember to handle often-overlooked areas such as sensitive data in PDF documents. These items are examples, not an exhaustive list.

For further guidance on de-identification techniques and requirements, see HHS guidance and NIST SP 800-188.

Maintain test accounts

Guidance

  • API teams should maintain and document a set of test accounts to support scenarios where API behavior depends on the identity or role of the requester.
  • Test accounts should be documented with the API documentation.

When API behavior depends on the identity of the requester, API teams should maintain and document a set of test accounts for use with the API. This documentation should describe the specific characteristics that make each test account relevant for a test case. For example, include test accounts representing various account statuses, accounts that have related data such as a list of several claims or prescriptions, and accounts with a mixture of data domains, such as both appeals and claims.

The list of test accounts available to consumers should be published with the API documentation, along with information about how to reset each account to its baseline dataset.
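As a minimal sketch, a team might publish a catalog like the following alongside the API docs. The account IDs, characteristics, and reset endpoint are hypothetical.

```python
# Hypothetical test account catalog, suitable for publishing with API docs.
TEST_ACCOUNTS = [
    {
        "id": "veteran-001",
        "characteristics": "No prescriptions; exercises the empty-list case.",
        "reset": "POST /test-accounts/veteran-001/reset",
    },
    {
        "id": "veteran-002",
        "characteristics": "Several open claims; typical account status.",
        "reset": "POST /test-accounts/veteran-002/reset",
    },
    {
        "id": "veteran-003",
        "characteristics": "Mixed data domains: both appeals and claims.",
        "reset": "POST /test-accounts/veteran-003/reset",
    },
]
```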

Representative test data

Representative test data requires enough variety in the test data to reflect real production experiences.

Strive for test data to include:

  • Optional values that are both empty and populated
  • Lists that are empty, typical size, and extreme size
  • Numeric values that are typical, extremely low, and extremely high
  • Booleans with both true and false example sets
  • A variety of different enumerated values
  • String values that are empty, typical size, and the largest size the field can hold

For example, if an API returns a list of prescriptions for a Veteran, then the data will need to include several different test accounts with various data scenarios in order to test pagination and performance, such as test accounts with:

  • No prescriptions (extreme low)
  • 1 prescription (typical)
  • 7 prescriptions (typical)
  • 40 or more prescriptions (extreme high)
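A minimal sketch of seeding those accounts, with hypothetical account IDs and a hypothetical seeding helper:

```python
from faker import Faker

fake = Faker()

# Prescription counts matching the scenarios listed above.
PRESCRIPTION_COUNTS = {
    "veteran-001": 0,   # extreme low
    "veteran-002": 1,   # typical
    "veteran-003": 7,   # typical
    "veteran-004": 40,  # extreme high
}

def seed_prescriptions(account_id: str, count: int) -> list[dict]:
    # Build synthetic prescription rows for one test account.
    return [
        {"id": f"rx-{account_id}-{n}", "drug": fake.word().title()}
        for n in range(count)
    ]

seeded = {
    account_id: seed_prescriptions(account_id, count)
    for account_id, count in PRESCRIPTION_COUNTS.items()
}
# A real implementation would insert these rows into the test database.
```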

To get the variety of datasets needed, consider synthetic data generator tools and determine whether they can save time over manual test data creation.

Test data reset

If the API allows adding, updating, or deleting data, then automated cleanup is helpful for resetting the test data to baseline values. API teams have had success scheduling ongoing data resets on a nightly or weekly basis. In addition, for APIs maintaining test accounts, provide on-demand resets for individual test accounts.

Since APIs are unique in what data they manage, cleaning strategies vary. Typical strategies include:

  • Reset data for a single test account on demand.
  • Refresh the entire test database to a "golden" set of baseline data on a periodic basis.
  • Run automated cleanup after each automated test run that generated or changed data.
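A minimal sketch of the first two strategies, with hypothetical database helpers and a hypothetical baseline file; a real implementation might run the full refresh as a nightly scheduled job:

```python
import json

BASELINE_PATH = "golden_baseline.json"  # hypothetical "golden" dataset

def load_baseline() -> dict:
    with open(BASELINE_PATH) as f:
        return json.load(f)

def refresh_test_database(db) -> None:
    """Periodic full refresh to the golden baseline (e.g., nightly)."""
    baseline = load_baseline()
    db.truncate("prescriptions")  # hypothetical database helper
    db.bulk_insert("prescriptions", baseline["prescriptions"])

def reset_test_account(db, account_id: str) -> None:
    """On-demand reset for a single test account."""
    baseline = load_baseline()
    db.delete_where("prescriptions", account_id=account_id)  # hypothetical helper
    rows = [r for r in baseline["prescriptions"] if r["account_id"] == account_id]
    db.bulk_insert("prescriptions", rows)
```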