I was interested in generating some dummy data for load testing in MS Azure and came up with a pretty nifty way to produce lots and lots of data using U-SQL. The trick is to create a small U-SQL custom extractor that generates rows, and point it at a dummy input file.
First I created a dummy file… literally. In my input folder on my local machine I created a blank, 0-byte file, just to stop the custom extractor complaining that I'm not actually going to use an input file.
The custom extractor is built around a simple C# for loop:
for (Int64 i = _startVal; i < _endVal; i += _step)
This loop generates a single-column row on each iteration using the output.Set function. The full code for the code-behind file is:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace CustomExtractor
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class GenerateSeries : IExtractor
    {
        private readonly Int64 _startVal;
        private readonly Int64 _endVal;
        private readonly Int64 _step;

        public GenerateSeries(Int64 startVal, Int64 endVal, Int64 step)
        {
            this._startVal = startVal;
            this._endVal = endVal;
            this._step = step;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // Ignore the (empty) input file and emit one single-column row per step of the loop.
            for (Int64 i = _startVal; i < _endVal; i += _step)
            {
                output.Set<Int64>(0, i);
                yield return output.AsReadOnly();
            }
            yield break;
        }
    }
}
Once you have this in place, you can call it from your U-SQL script:
DECLARE @in string = "dummyfile.csv";
DECLARE @out string = "result.csv";
@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(1, 1000000000, 1);

@t =
    SELECT mycol,
           mycol + 2 AS AnotherColumn
    FROM @input;

OUTPUT @t
TO @out
USING Outputters.Csv(outputHeader : true);
This script calls the CustomExtractor.GenerateSeries constructor and passes three arguments, which in turn become the three arguments used in the C# for loop, so the range and step size can be customised pretty easily.
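For instance (the values and the @evens rowset name here are just hypothetical), ten million even numbers would be:

@evens =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(0, 20000000, 2);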
The @t SELECT statement allows you to inject additional columns. This could be where you generate columns for random dates, products, quantities and so on, on a pretty major scale if you wanted.
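As a rough sketch (the column names and formulas below are made up purely for illustration), you can derive pseudo-random-looking values from the generated number itself with a bit of modulo arithmetic, which keeps everything deterministic and cheap to compute:

@t =
    SELECT mycol,
           new DateTime(2017, 1, 1).AddDays(mycol % 365) AS OrderDate,
           "Product_" + (mycol % 100).ToString() AS ProductName,
           (mycol % 10) + 1 AS Quantity
    FROM @input;

Because every value is derived from mycol, the output is repeatable across runs, which is handy if you want to compare load tests.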
I first ran this locally on my machine and filled up my hard drive pretty quickly, so I switched to my Azure Data Lake Store, where space is no issue. With 2 vertices, the query took 20 seconds to prepare, sat in the queue for 7 seconds, and then ran for 16 minutes.
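For reference, when the same script is submitted to Azure Data Lake Analytics, relative paths like the ones above are resolved against the account's default Data Lake Store, so pointing the variables at folders there is the only change I'd expect to need (the folder names below are hypothetical):

DECLARE @in string = "/input/dummyfile.csv";
DECLARE @out string = "/output/result.csv";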
The final result was a pretty easy-to-customise file of 1 billion rows, weighing in at about 20GB in my Azure Data Lake Store account.



