Data software: Software challenge

Nothing reinforces direct mail's 'junk' image more strongly than incorrectly addressed mail. Peter Crush set three suppliers the task of cleaning some specially prepared data to test their skills.

If it really was that simple, examples of direct mail being addressed to 'Mr Annoying Bastard' would be far less frequent than they are. But to the consumers who receive such mailers, suppliers' pleas that "it just happens" aren't good enough.

In the world of data cleaning, mail that is wrongly addressed, sent to dead people or to people who have moved, or sent two or three times to the same person is symptomatic of poor data-quality checks at the data-processing level. Ironing out these mistakes is an issue the DMA and suppliers such as The REaD Group battle with on a daily basis.

Clients have a bewildering choice of data bureaux, many of which have their own software solutions that promise to input dirty data at one end and produce clean, sparkling data at the other. But with each set-up having different rules and scales of probability for spotting obscenities, duplications and wrong addresses, there's still considerable human intervention involved in deciding whether, say, 'Mr Davis' is the same person as 'Mr Davies'.

We decided to put three suppliers to the test to see just how well they coped with a sample database of names and addresses.

The challenge

Our brief was deliberately open-ended - designed to resemble a normal client brief as much as possible. Our suppliers were to run a data health check on 70,000 cold records split into two files, run the data against all the suppression files they thought necessary and produce a report on any undeliverable records. The only guidance we gave was that it was not necessary to PAF-verify data or suppress any foreign records. The rest was up to them, meaning that the test also revealed what each supplier thought was necessary for a basic data audit.

The results of their work would be critiqued by our own independent adjudicator, Nick Pride-Hearn, technical director of Intelligent Print Solutions, who would see just how well each supplier did. The review also covered how easy the standard reports were to understand, as well as how accurate they were. We didn't make it that easy either: we deliberately added deceased markers and obscene and salacious words, just to make it more interesting.

Our thanks go to our suppliers for taking part - it's a brave bureau that puts itself under such scrutiny. But this isn't a test of which supplier is best at data cleaning. The lesson to be learned is in understanding the method a supplier uses to cleanse files. And, as our adjudicator says in his round-up (see page 39), the numbers of dupes and goneaways will always vary.

The data: To conduct the test, we worked exclusively with a secret provider who supplied us with the cold data and who deliberately 'corrupted' it to introduce duplicate counties in address lines and in the designated county field. Our provider said he would expect download reports to highlight these issues along with any other attributes of the data that could cause a record to be undeliverable.

Deceased, obscene and salacious words were also a feature. These included: 'dead', 'deceased', 'died', 'moved away', 'goneaway', 'gone away unknown', 'do not mail', 'complains', 'none', 'Scooby Doo', 'huge anus', 'arse', 'tart', 'tosser', 'bastard', 'piss', as well as out-of-range characters such as '%', '+' and 'xxx' in the name/address fields. According to our data provider, all these are real examples of problems seen recently.
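A screen for terms like these can be sketched in a few lines of Python. This is only an illustration, not how any of the bureaux tested do it: the function name `flag_record`, the field names and the shortened word list are assumptions made for the example.

```python
import re

# Illustrative subset of the terms listed in the article; a real list would be longer.
EXCLUSION_PHRASES = [
    "dead", "deceased", "died", "moved away", "goneaway",
    "gone away unknown", "do not mail", "none", "scooby doo",
    "tart", "bastard",
]
# Out-of-range character sequences named in the article.
OUT_OF_RANGE = ["%", "+", "xxx"]

# Whole-word, case-insensitive patterns, so 'tart' does not fire on 'Tartan'.
_PATTERNS = [
    (phrase, re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE))
    for phrase in EXCLUSION_PHRASES
]


def flag_record(record):
    """Return the sorted list of exclusion terms found in any field of a record.

    `record` is assumed to be a dict of field name -> string value.
    """
    hits = set()
    for value in record.values():
        for phrase, pattern in _PATTERNS:
            if pattern.search(value):
                hits.add(phrase)
        lowered = value.lower()
        for token in OUT_OF_RANGE:
            if token in lowered:
                hits.add(token)
    return sorted(hits)
```

Matching on whole words keeps legitimate surnames and street names out of the reject pile, but it is a design choice: a stricter or looser rule would change the counts a bureau reports.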

The test: "It's impossible to provide a precise number of duplicate and suppression records as the file could have contained its own in addition to what was added," explains our provider. "Specific dupes were sown in the files to highlight each bureau's ability to identify not only difficult matches as such, but to illustrate how close matches are dealt with - such as those records that could be deemed to be duplicates if an element of the matching criteria is changed. There's often no right or wrong answer to this. It's dependent upon the campaign objective, data attributes and client requirements. When compared, the reports should highlight the strengths and weaknesses of each set of results.

"Given that this was briefed as a cold prospect mailing, I would expect all deceased files and Mailing Preference Service (MPS) to be applied, plus those goneaway suppression files held by each bureau. Application of National Suppression File is arguable, especially records where the likelihood of them being genuine goneaways is in doubt.

"I would also expect the suppression files to be applied in order of royalty cost, as opposed to house preference - MPS matches dropped first, most expensive suppression matches last.

"The duplicate reports should illustrate not only duplicate and suppression records, but close matches - ie records that could be influenced by a change in matching criteria. Output of dupe groups for checking by the adjudicator is key. It will help assess whether Company A finding 1,000 dupes has matched accurately or over-killed in comparison with Company B finding 900 dupes.

"Data should be output as requested and provided securely password-protected. Comment could be made regarding the .csv format. While it's the most common default format, it would be good to see bureaux recommending an alternative delimiter such as I (pipe) to avoid potential problems with records containing a comma within their address. Given that a default salutation will be required it would be good to see bureau recommending a default valediction too.

"We would also hope that the bureau would have requested more information on the brief regardless of the original brief's comments to the contrary."


Absolute Data

Software package used: In-house package

Report highlights:

- 13 exclusion words found

- 21,607 male and 12,874 female (not highlighted by the others); 41 cases of ambiguous gender; 477 cases of gender conflict

- Details of numbers found in each suppression file used (not reported as clearly in other two). Including 1,028 on the MPS (DPS says 1,146)

- 409 duplicates (against DPS's 415)

Nick says "Absolute Data was the only company that came back to us seeking detailed clarification about the brief. We had said we were testing interpretation, but we expected some interaction.

"We got back a report, the whole output file and examples of stripped out data (ie gender conflicts). While it gave back counts against each of the suppression files used, the report didn't say how many names had been removed from the file.

"Also, while Absolute assumed correctly that it should give a single output file, the records were mixed up so I couldn't say which names came from which file. However, the report specifically mentioned county name repetitions, how many records were affected, and advice on their mailability.

"It was good that we were provided with examples where duplicates came up, as it gives the reader an indication of how the software works to define what it thinks are dupes. That said, it's a shame I got just the clean file back and nothing on the suppressions. Because I did have the whole file back, I could at least run my own checks. Some still remained, such as 'Dear Sir Harris' instead of 'Sir Harris'.

"There was another consistent error that shows how the software treats certain fields, and that was in house names. One address was 'The Old Rectory, The Rectory ..." In this report, line one was 'The Old Rectory, The', which is wrong. This is an error in reading commas within a field.

The other schoolboy error was on house names with letters (for example, '12a'), which had the house number and the road on separate lines instead of the same line.

"The team spotted all the obvious obscenities and gave examples of them, but they didn't spot 'Sir Welsh Animal'.

"It wasn't bad, but I found errors, and while I couldn't find any dupes, I didn't know how many had been taken out."


CDMS

Software package used: Cademus

Report highlights:

- Showed the order it did each of its processes in to show sequence of


- 279 dupes (against 409 for Absolute and 415 for DPS)

- Revealed the number of records left that were good for mailing (64,608), which the others did not explicitly show

- 1,274 on the MPS (compared with Absolute's 1,028 and DPS's 1,146)

Nick says "I was far more comfortable with this report. It had excellent summarisation and I especially liked the visual traffic-lights. My only comment would be that having got such good results, it would have been nice to have more insight about how to handle it.

"That aside, there were lots of nice touches built into the software report, such as the synopsis of the job and comments about how CDMS approached it. It made a very good effort of describing the idiosyncrasies of data and why absolute numbers of matches can be misleading.

"However, CDMS missed a trick on the suppression files. It rightly assumed the best suite to use, but should have considered others and said how many more names it would remove and at what cost. That said, this wasn't a definite requirement.

"I got the feeling that CDMS had given the file a good look over but again, when it came to reporting the counts - the suppressed and the mailable records - it wasn't that clear, and didn't say how many were removed. Nowhere did it say there were x obscenities and x were removed.

"Compared with Absolute, CDMS only provided a few examples of suppressed records - four dupes, three matching the MPS. You had no way of telling if it has over or under-killed on these, nor did it demonstrate the breadth of the matching software. You didn't know if CDMS had done the right job or a good job."


DPS

Software package used: Cygnus (in-house software solution)

Report highlights:

- Nine obscenities removed

- Six poor quality names dropped from further processing

- Three deceased words identified and dropped from further processing

- 12 embedded comments (do not mail) dropped from further processing

- Identified county errors and used other data source than PAF to


- Hierarchy of suppression files used

- 415 dupes dropped, 254 deceased names dropped, 1,424 goneaways dropped and 1,159 no mail requesters

Nick says "This was the best of the three. DPS provided a good summary report, took off the junk records and told you how many these were. This was the only company that answered everything. It wasn't a hard brief, but DPS managed it.

"The comprehensive report was clearly part of the standard reporting - for example that in file 11 there were six obscenities. DPS was also the only company to talk about tracker records that are put in and followed through to test the method of inquiry.

"In addition to everything else, there was a neat report about field lengths - sometimes addresses can be cut off midway through because they're too long, so the software looked to see if some records had been truncated.

"The summary reporting is good enough to let me see what's probably the only problem I have - which is that the company has over-suppressed to be safe, but at least I can see that it has gone for overkill, and that I'm happy to accept this. This gives me a lot more confidence that this was a good job done.

"The only error I did find was that DPS dealt with the surname 'O'Rourke' as one word and didn't put in the comma."


So, how did our guinea-pigs do? What was obvious was that comparing the number of duplicates, goneaways or deceased records across suppliers is less meaningful than looking at the way in which suppliers follow a methodology and back up the figures they produce.

"The fact that each came up with different numbers actually pleased me," says Pride-Hearn. "It shows that no one is wrong, but you can be more right than others." He argues that confidence is key, and that while he believes DPS was good, he can't say for definite whether the others were worse, only that they were different.

What Pride-Hearn says this really throws up is that data processing can't simply be about button pushing. "There can be a mentality in this industry of 'have we done this, this and this suppression?', rather than 'have we done it well?', which is a bit scary. Variations can occur because someone different ran the report that day, as at the end of each process there's the opportunity to review that output file and change the parameters of the matching, and some may be better at this than others." But he adds that until clients also get interested in data, they perhaps deserve what they get. "Maybe most clients don't want the sort of reports that DPS provides and so suppliers don't provide them. But I think it's up to the marketers to decide whether they want thorough auditing or not. There's a much bigger picture here on standardisation. There's nothing that sets in stone what the minimum requirements are for a basic suppression. The DMA did look at this, but it seems not to have resurfaced again."
