How to generate 1 billion rows using U-SQL

I was interested in generating some dummy data to do some load testing in MS Azure and came up with a pretty nifty way to generate lots and lots of data using U-SQL.  The trick is simply to create a small custom U-SQL generator and use it to "extract" from a dummy file.

First I created a dummy file… literally.  In my input folder on my local machine I just created a blank, 0-byte file, purely to stop the custom extractor complaining that I'm not actually going to use an input file.

The custom extractor uses a C# for loop as follows:

for (Int64 i = _startVal; i < _endVal; i += _step)

and this loop simply generates a single-column row using the output.Set method.  The full code for the code-behind file is:

using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace CustomExtractor
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class GenerateSeries : IExtractor
    {
        private readonly Int64 _startVal;
        private readonly Int64 _endVal;
        private readonly Int64 _step;

        public GenerateSeries(Int64 startVal, Int64 endVal, Int64 step)
        {
            this._startVal = startVal;
            this._endVal = endVal;
            this._step = step;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // Emit one single-column row per value in the series.
            for (Int64 i = _startVal; i < _endVal; i += _step)
            {
                output.Set<Int64>(0, i);
                yield return output.AsReadOnly();
            }
        }
    }
}

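One thing worth noting: the loop's upper bound is exclusive, so GenerateSeries(1, 1000000000, 1) emits 999,999,999 rows, one short of a full billion (which hardly matters for load testing). As a quick sketch, the row count for any set of arguments can be worked out with a helper like this (RowCount is my own illustrative function, not part of the extractor):

```
// Rows emitted by the extractor's loop: one per value of i in
// the half-open range [startVal, endVal), stepping by step.
static Int64 RowCount(Int64 startVal, Int64 endVal, Int64 step) =>
    endVal <= startVal ? 0 : (endVal - startVal + step - 1) / step;

// RowCount(1, 1000000000, 1) == 999999999
// RowCount(1, 1000000001, 1) == 1000000000  (exactly 1 billion rows)
```

So to get a round billion starting at 1, you could pass 1000000001 as the end value.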
Once you have this in place you can call it from your U-SQL script


DECLARE @in string = "dummyfile.csv";
DECLARE @out string = "result.csv";

@input =
    EXTRACT mycol Int64
    FROM @in
    USING new CustomExtractor.GenerateSeries(1, 1000000000, 1);

@t =
    SELECT mycol,
           mycol + 2 AS AnotherColumn
    FROM @input;

OUTPUT @t
TO @out
USING Outputters.Csv(outputHeader : true);

This script calls the CustomExtractor.GenerateSeries extractor and passes three arguments, which in turn become the three arguments used in the C# for loop.  So these can be customised pretty easily.

The @t SELECT statement allows you to inject additional columns.  This could be where you generate columns for random dates, products, quantities etc. on a pretty major scale if you wanted.
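As a sketch of that idea, the projection could derive extra columns from mycol using ordinary C# expressions (the column names and the modulo-based pseudo-randomness below are just illustrative, not part of the original script):

```
@t =
    SELECT mycol,
           mycol + 2 AS AnotherColumn,
           new DateTime(2017, 1, 1).AddDays((double)(mycol % 365)) AS SomeDate,
           "Product " + (mycol % 50).ToString() AS ProductName,
           (mycol % 100) + 1 AS Quantity
    FROM @input;
```

Because these are plain C# expressions, anything deterministic you can compute from the row number works here, which keeps the generated data repeatable run-to-run.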

I first ran this locally on my machine and filled up my hard drive pretty quickly, so I switched to my Azure Data Lake Store, where space is no issue.  With 2 vertices the query took 20 seconds to prepare, sat 7 seconds in the queue, and ran for 16 minutes.

The final result was a pretty easy-to-customise file with 1 billion rows that was about 20GB in my Azure Data Lake Store account.

Philip Seamark
Phil is a Microsoft Data Platform MVP and an experienced database and business intelligence (BI) professional with a deep knowledge of the Microsoft BI stack, along with extensive knowledge of data warehouse (DW) methodologies and enterprise data modelling. He has 25+ years' experience in this field and is an active member of the Power BI community.
