How to dynamically copy specific tables from source systems with Azure Data Factory

When you are working on a data project, you often have a list of tables you want to copy to your DWH or staging database. With Azure Data Factory (ADF), this is easy to set up (kind of). In this blog post, I will show you how. The result will end up looking something like the image below.

The scenario: I have a source system called WideWorldImporters and a DWH database called DWH_DEV. After talking with the business and the developers, I have a list of tables I want to copy to my DWH.

In my test scenario, I have the databases on the same database server, as shown below. In real life that is usually not the case, but the development process is the same.

All my work will be done in the DWH_DEV database. First, I want a schema for the source system, like so:

CREATE SCHEMA WideWorldImporters

Then I want a table listing the tables I want to copy, along with information about where to copy them, like this:

CREATE TABLE [Framework].[TablesToBeCopied](
	[SourceSystem] [NVARCHAR](250) NULL,		--The sourcesystem. We can have multiple
	[SourceSchema] [NVARCHAR](250) NULL,		--The Source schema, if it has one
	[SourceTable] [NVARCHAR](250) NULL,			--The Source table 
	[DestinationSchema] [NVARCHAR](250) NULL,	--The Destination Schema in your DWH
	[DestinationTable] [NVARCHAR](250) NULL		--The Destination table in your DWH
)
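Note that the table above lives in a Framework schema. I haven't shown it here, but if that schema doesn't exist in your DWH yet, create it first:

CREATE SCHEMA Framework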

I want to add the two tables [Sales].[Invoices] and [Sales].[Customers], so my TablesToBeCopied table will look like this:
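The screenshot isn't reproduced here, so as a sketch, the inserts could look something like this (I'm assuming the destination schema is the WideWorldImporters schema created earlier and that the tables keep their names):

INSERT INTO [Framework].[TablesToBeCopied]
	([SourceSystem], [SourceSchema], [SourceTable], [DestinationSchema], [DestinationTable])
VALUES
	('WideWorldImporters', 'Sales', 'Invoices',  'WideWorldImporters', 'Invoices'),
	('WideWorldImporters', 'Sales', 'Customers', 'WideWorldImporters', 'Customers')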

When I do a bulk load from different source systems, I want to drop the tables and recreate them when I copy them. The reason is that I don't want to worry about remapping columns if they change datatypes or are renamed. This may not be what you need, and in that case you can just skip this step. But I always create a stored procedure for that, like so:

-- =============================================
-- Author:		CKC, DataWise
-- Create date: 20230719
-- Description:	Drop all tables in a schema. 
--				Primarily used for extract tables
-- Parameters
--				@Schema - The name of the schema whose tables we want to drop. 
--				It cannot be the dbo schema. 
-- Change history
-- 
--				20230719 CKC, DataWise
--					First version

CREATE PROCEDURE [Framework].[dropTablesBySchema]
    @Schema NVARCHAR(250)
AS
BEGIN
    /*Declare required variables*/
    DECLARE @SQLStatement NVARCHAR(MAX);

    IF @Schema = 'dbo'
    BEGIN
        RAISERROR('This procedure should not be used on the schema dbo', 16, 1);
    END;

    IF @Schema <> 'dbo'
    BEGIN
        /* BUILD TABLE DROP SQL*/
        SELECT @SQLStatement =
        (
            SELECT STUFF(
                   (
                       SELECT '; ' + 'DROP TABLE ' + QUOTENAME([SchemaName]) + '.' + QUOTENAME([TableName]) AS [text()]
                       FROM
                       (
                           SELECT SCHEMA_NAME(schema_id) AS [SchemaName],
                                  name AS [TableName]
                           FROM sys.tables
                           WHERE schema_id = SCHEMA_ID(@Schema)
                       ) x
                       FOR XML PATH('')
                   ),
                   1,
                   1,
                   ''
                        ) + ';'
        );
        /*EXECUTE DROP SQL*/
        EXEC sp_executesql @stmt = @SQLStatement;
    END;
END;
GO

Now we are ready to set up ADF.

We create a new pipeline and name it something meaningful, like pipelineCopyDynamicFromWorldWide.

We start by adding a Script activity to the pipeline that will drop the tables. We add this SQL to it:

EXEC [Framework].[dropTablesBySchema]
		@Schema = N'WideWorldImporters'

After that, we add a Lookup activity to get the tables we want to copy.

In the Lookup activity we add this SQL:

SELECT [SourceSystem]
      ,[SourceSchema]
      ,[SourceTable]
      ,[DestinationSchema]
      ,[DestinationTable]
  FROM [DWH_DEV].[Framework].[TablesToBeCopied]
  WHERE SourceSystem='WideWorldImporters'

The pipeline now looks like this:

We now want to iterate over the tables with the ForEach activity.

We drag it in and connect it. We click Settings, and click Add dynamic content, like below.

Then click the output of the "get tables to copy" lookup, like below.
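Since the screenshots only show the clicks, here is roughly what the Items box on the ForEach ends up containing, assuming the lookup activity is named "Get tables to copy" (use your own activity name) and that "First row only" is unchecked on it:

@activity('Get tables to copy').output.value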

Now we want to add an activity in our ForEach loop. To do that, click the plus sign to add an activity, and select the Copy data activity, like so.

Then we set up the Copy data activity. First we add the source. Start by connecting to a table in your WideWorldImporters database. When you are done, go to the dataset that was created, click the Parameters tab, and add two parameters, like this:

Then click the Connection tab again, tick the Edit checkbox, and use dynamic content to add the table and schema, like so:
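As a sketch, with two dataset parameters named, say, SchemaName and TableName (the names are my choice, yours may differ), the schema and table boxes on the dataset's connection tab get dynamic content like this:

@dataset().SchemaName
@dataset().TableName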

Do the same with the destination dataset. 

Go to the copy activity again, and set up the source and the sink. The setup should look like this:

Here you dynamically set the schema and table properties as shown.
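Concretely, in the copy activity's source, the dataset parameters get their values from the current ForEach item, roughly like this (again assuming the parameter names SchemaName and TableName):

SchemaName: @item().SourceSchema
TableName:  @item().SourceTable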

Do the same with the sink (the destination, but use the destination names from your SQL instead).

Remember to tick the "Auto create table" option.

We can now run the pipeline. We should get a result like so

And you should see the table in your DWH like this

As with most things in ADF, setting this up is a little complicated. But once you are up and running, it is easy to add the tables you want to your daily copy job, and it is very easy to maintain.

You can add multiple source systems to your table of tables to be copied. I normally create a pipeline for each source system.

But this is it. I hope you can use it 🙂

Guide: Set up Data Factory to run locally

Azure Data Factory (ADF) is Microsoft's cloud-based ETL tool. It is in many ways better than their old on-premises ETL tool, Integration Services. Luckily, you can easily set up ADF to move data around on-premises. In this blog post, I will show you how 🙂

I will assume you already have an ADF running, and some SQL Servers running on-premises (or on your laptop, if you just want to test ADF features).

To get started, go to your ADF and

  1. Click the Manage menu on the left, and click Integration runtimes
  2. Click New
  3. Click Azure, Self-Hosted

As shown below.

Afterwards, click New, give it a good name, and click Create.

You will get two keys, as shown above (1). Save them in a secure place, and click the link for the express setup (2).

If you do the manual setup, you will need the keys. If you just do the express setup, you don't need them. After installation, you should see something like this:

We are now ready to use the on-premise resources.

I have a local SQL Server with different test databases on it. See below.

From ADF I want to copy a table from AdventureWorks to the DWH_DEV database.

To do that, I create a pipeline, and

  1. Drag the copy data task in
  2. Under source, I click new dataset
  3. I click new linked service

And fill in the connection information

Click test, and select the table you want to copy.

Do the same for the destination database and table.

Run the pipeline. You should see a confirmation as below.

You can now copy data with ADF between your on-premises databases, and do all the transformations that are possible in ADF.

You should also be able to do all the ADF testing you want, without having to spin up expensive Azure SQL databases for all your test databases.

I hope this was helpful 🙂 Enjoy 🙂

How to scale your Azure SQL database with SQL

An Azure SQL database is super easy and cheap to set up and use. However, if you want to do some serious data manipulation, you will find the cheapest tiers too slow to use, and every upgrade you make to the database doubles the price. If you are a data engineer like me, most of your data work will be in batches, so it is also annoying to pay for an expensive tier if you only use it for one hour a day.

Luckily, there is a solution. You can change the performance tier of the database with SQL.

To check which tier you are running, use this SQL:

SELECT     
  slo.edition,    
  slo.service_objective    
FROM sys.database_service_objectives slo   

And you will get a result like this

You can change the version with this SQL

ALTER DATABASE DSA MODIFY(EDITION='Standard', SERVICE_OBJECTIVE='S4')

Unfortunately, the change is not instant. My experience is that on the Standard tier, the latency is around 5-10 minutes.

What I normally do is have a job that starts 30 minutes before my batch job and scales the database up to a higher tier. That way it is already scaled up when the batch starts. And I let the last step of the batch job scale the database back down.
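As a sketch of that setup (the target tiers here are just examples; pick the ones that fit your workload and budget):

--Run ~30 minutes before the batch: scale up
ALTER DATABASE DSA MODIFY(EDITION='Standard', SERVICE_OBJECTIVE='S4')

--Last step of the batch: scale back down
ALTER DATABASE DSA MODIFY(EDITION='Standard', SERVICE_OBJECTIVE='S1')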

For pricing and DTUs, see the screenshot below.

Pricing for all versions can be found here: https://azure.microsoft.com/en-us/pricing/details/azure-sql-database/single/

Have fun 🙂

How to create a SQL Azure login and user

I use SQL Azure databases a lot, both as backends, data marts, and as a data exchange layer with third parties.

For that, I need to create users fast and easily. There is no interface for doing that in SQL Azure, so you have to do it with SQL.

Step 1: Connect with Management Studio or Azure Data Studio using an admin account, and make sure you are connected to the master database.

Step 2: Create the login like so

CREATE LOGIN Testuser 
   WITH PASSWORD = 'complexpassword' 

You can now see the login.

Step 3: Switch to the database where you want the user to have access and run this code:

CREATE USER Testuser
    FOR LOGIN Testuser

Step 4: Grant the access you want the user to have, for instance reader access:

EXEC sp_addrolemember N'db_datareader', Testuser
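sp_addrolemember still works, but it is marked as deprecated, so on newer versions you may prefer the ALTER ROLE syntax, which does the same thing:

ALTER ROLE db_datareader ADD MEMBER Testuser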

And you can now see the new user in the database under users

The default roles you can give to the users can be found here: https://learn.microsoft.com/en-us/sql/relational-databases/security/authentication-access/database-level-roles?view=sql-server-ver16

And seen below:

How to: Change the length of an attribute in MS MDS

You would think that it would be easy to change the length of an attribute in MDS, but the web interface might make you think otherwise.

The length parameter is greyed out. But fear not.

Instead, open the table with the MDS Excel add-in.


Click on the column you want to change, click attribute properties, and change the length to the desired length.

Publish your changes, and go back to the web interface.


Tada 🙂 The length has been changed 🙂

A quick look at system-versioned tables in SQL Server 2016

SQL Server 2016 has been available for preview for some time now. One of the more exciting features I am looking forward to is system-versioned tables.

As a DWH developer, one of the things you spend a lot of time on is developing dimension tables with columns that store history. The history is typically in the form of Type 1 or Type 2, where Type 1 is no history, and Type 2 is where you store the history by adding another row when a value changes for a given natural key in the source system.
Type 2 history is normally developed by using

  • A DWH framework
  • 3rd Party component
  • Developed specifically for the dimension that needs the Type 2 history

In the 2016 version, SQL Server can help you with this, using system-versioned tables.

Right-click Tables -> New -> and you will see a new option called "System-Versioned Table".

There is no designer for it yet, so we will do it in SQL, which looks like this:

CREATE TABLE dbo.Product
(

SK_Product int IDENTITY(1,1) Not Null,
BK_productid INT NOT NULL
CONSTRAINT PK_product PRIMARY KEY NONCLUSTERED,
ProductName NVARCHAR(50) NOT NULL,
ProductCategory NVARCHAR(50) NOT NULL,

sysstart DATETIME2(0) GENERATED ALWAYS AS ROW START NOT NULL,
sysend DATETIME2(0) GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME (sysstart, sysend)
)
WITH ( SYSTEM_VERSIONING = ON ( HISTORY_TABLE = dbo.ProductHistory ) );

Where

  • SK_Product is the surrogate key
  • BK_productid is the natural key
  • ProductName and ProductCategory are columns we want to track history on (well, preferably only ProductCategory, but this is not on column level yet)
  • Sysstart is the valid from
  • Sysend is the valid to
  • And PERIOD indicates which columns are used for valid from and valid to.

Required are, of course, a primary key, sysstart, sysend and the PERIOD definition.
Below is a screenshot of what it looks like in Management Studio.

As you can see, the ProductHistory table is placed under the Product table, to show the relationship.
Let us try to insert some values:

INSERT INTO [dbo].[Product]
([BK_productid]
,[ProductName]
,[ProductCategory])
VALUES
(1,'Product1','Category1')

So the insert only inserts a row into the current table, and no rows into the history table.
Let us try with an update:

UPDATE Product
SET ProductCategory = 'CategoryUpdated'
WHERE BK_productid=1

So the update updates the current table and inserts a row into the history table. Hence, we will always have the current value in the "current table", and old values in the history table.
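If you want the current and the historical rows in one query, instead of selecting from the two tables separately, you can use the FOR SYSTEM_TIME clause, for example:

SELECT BK_productid, ProductCategory, sysstart, sysend
FROM dbo.Product
FOR SYSTEM_TIME ALL
WHERE BK_productid = 1
ORDER BY sysstart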

Conclusion

When would you use this feature?
It could be helpful in an OLTP system where they know they will need the history for reporting in the future, but do not yet have a DWH to store it.

There will be an overhead in performance and storage.

Would we use it in a DWH, for a dimension with Type 2 history?

Probably not. This is one specific implementation, and we can't change anything about it, so we would have to use it in combination with our specific needs, and then it would be easier not to use it at all.

This could be useful in an archive layer in a DWH, where you want to store all changes of the extracted values over time.

This blog post is based upon a CTP version of SQL server, so much could still change.
Read more about it here: https://msdn.microsoft.com/en-us/library/dn935015.aspx

Error code: 0x80040E21 Reading from SSAS with SSIS

So, I upgraded a big project for a customer from 2008 R2 to 2012. Everything worked fine, except the last part, where some data marts were loaded with SSIS from SSAS. They had worked with no problem before, but suddenly I received the above error code. It made no sense to me, and it was hard to find out what precisely had happened. The worst part was that I was able to see the preview in the data source, but not run the package.

A workaround was to create a linked server to the cube, and then just select from that, like so:

SELECT * FROM OPENQUERY(<ServerName>, '<MDX>')

And that is probably the way I would have implemented it myself. But with hundreds of packages, rewriting them was not the way to go.
Instead, I found an old post on Google where a person had the same problem on an older SQL Server. The solution was to open the connection, click the All button, and type "(Format=Tabular)" into the Extended Properties field. See the image below.

So, yeah…., YIEAKS. But hey, if you find yourself having the same problem, you now know what the solution is 🙂

Partitioning revisited

When building big data warehouses (DWH), partitioning tables is something you want to consider.

There are some good examples/explanations in BOL, but with this blog post I just want to give some easy hands-on examples so you can get started right away.

What
Partitioning a table means splitting up the data in the table, so the data is stored separately and can be managed separately. The SQL user can use the table without knowing it is partitioned.

Why
Why do we want to partition big tables in, for instance, an enterprise DWH?
Well, some benefits are listed on the SQL CAT page (see resources), and they say it all:

  • Partitioning can provide manageability and often performance benefits.
  • Faster, more granular index maintenance.
  • More flexible backup / restore options.
  • Faster data loading and deleting.

How
Partitions can be managed both via Management Studio and with raw SQL. It is actually much easier to understand when using SQL, so I will only use SQL in this blog post.

Let's just start completely from scratch by creating the partition function. It is simply used for defining the ranges of the partitions.

CREATE PARTITION FUNCTION myRangePF1 (datetime)
AS RANGE LEFT FOR VALUES ('2013.01.01', '2013.02.01', '2013.04.01') ;

After that, we want to create the partition scheme, which is used for mapping the partitions to filegroups. For simplicity we just use the primary filegroup in this example, but you don't want to do that in production.

CREATE PARTITION SCHEME myRangePS1
AS PARTITION myRangePF1
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY]) ;

Next, let’s create som partitioned tables

CREATE TABLE FactSomething (DW_TS DATETIME, factValue int)
ON myRangePS1 (DW_TS) ;

--Create an archive table where we want to store old values
CREATE TABLE FactSomethingArchive (DW_TS DATETIME, factValue int)
ON myRangePS1 (DW_TS) ;

Then let us insert some test data, so we have some data to work with.

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2010.01.01',1)

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2010.01.01',1)

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2010.01.01',1)

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2013.01.01',1)

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2013.02.02',1)

INSERT INTO [dbo].[FactSomething] ([DW_TS] ,[factValue])
VALUES ('2014.01.01',1)
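If you want to check which partition a given date maps to, the $PARTITION function comes in handy, for example:

--Returns the partition number the value would be placed in
SELECT $PARTITION.myRangePF1('2013.01.15') AS PartitionNumber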

You can use this SELECT statement to get all the necessary information about the partitions:

--Long SQL with all the information regarding the partition table
SELECT t.name AS TableName,p.rows,
p.partition_number,
p.partition_id,
f.function_id,
f.type_desc,
r.boundary_id,
r.value AS BoundaryValue
FROM sys.tables AS t
JOIN sys.indexes AS i
ON t.object_id = i.object_id
JOIN sys.partitions AS p
ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes AS s
ON i.data_space_id = s.data_space_id
JOIN sys.partition_functions AS f
ON s.function_id = f.function_id
LEFT JOIN sys.partition_range_values AS r
ON f.function_id = r.function_id and r.boundary_id = p.partition_number
WHERE t.name = 'FactSomething' -- AND i.type <= 1
ORDER BY p.partition_number

Which in our case will show the following result.

The rows column shows the number of rows in each partition. Let's say we want to archive the oldest partition. Then we just switch the partition, like so:

ALTER TABLE dbo.FactSomething
SWITCH PARTITION 1
TO dbo.FactSomethingArchive PARTITION 1

If we run the partition info SQL again, we now get the following result

Here you can see that we now have 0 rows in the first partition. If you do a SELECT * from the archive table, you will now have 4 rows there.

Lets say, that we don’t want to have the old partitions in our fact table anymore, we just run this merge stm.

ALTER PARTITION FUNCTION myRangePF1()
MERGE RANGE ('2013-01-01')

The partition info SQL will now return only 3 rows, as shown below.

To sum up 🙂

I have shown how to create partitioned tables, and how to use them. The next step will be managing their indexes, and how to do sliding window partitioning. Maybe there will be a post about those subjects here later 🙂

Resources

http://sqlcat.com/sqlcat/b/top10lists/archive/2008/02/06/top-10-best-practices-for-building-a-large-scale-relational-data-warehouse.aspx

http://msdn.microsoft.com/en-us/library/ms190787.aspx

SQL TIP-> Display seconds as HH:MM:SS

In many BI applications you have a duration stored as seconds. To display that in the format HH:MM:SS, just use the following SQL (replace 100 with your column name):

SELECT
convert(char(8),dateadd(second,100,0),108) as [Duration]

Which yields the result

00:01:40

As in my previous blog post, where I did the same in Analysis Services, this only works if the duration is below 24 hours; otherwise you have to do some additional logic.
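If you do need to handle durations above 24 hours, one possible approach is to build the string yourself from the seconds, as a rough sketch:

DECLARE @Seconds INT = 90061 --25 hours, 1 minute and 1 second

SELECT
CAST(@Seconds / 3600 AS VARCHAR(10)) + ':'
+ RIGHT('0' + CAST((@Seconds % 3600) / 60 AS VARCHAR(2)), 2) + ':'
+ RIGHT('0' + CAST(@Seconds % 60 AS VARCHAR(2)), 2) AS [Duration]

Which yields 25:01:01.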

Fun with SQL: Joining on column name and row field value

Ok, it is that time of the day where I’m starting to get high on caffeine 🙂

So this is the scenario: I have an application that automatically generates a table in my database. It can add columns on the fly (but thankfully not delete them).

It could look like this:

CREATE TABLE Cube.AppGenereatedTable
(
x1value float,
x2value float,
ID_Region INT
)

We can now query information about that table like so:

select column_name,ordinal_position,DATA_TYPE , *  from information_schema.columns
where table_name = 'AppGenereatedTable'
AND TABLE_SCHEMA='cube'

The problem is that right now I do not know what to do with that data, and when I'm no longer at the customer, more fields can be added by the application. What I would like is a place where the customer can add more information about each column and decide if that column should be included in some logic. To do that, I add a "metadata" table.

CREATE TABLE Dim.SomeMetadata
(
ID_SomeMetadata INT,
ColName nvarchar(100),
friendlyName nvarchar(100),
includeInETLLogig bit
)

Now I want to put it all in a fact table where I have one row per column value. It could look like this:

CREATE TABLE Fact.SomeFactTable
(
Value float,
ID_Metadata INT
)

Ehhh, OK, so now I will do what the title says. To load the data, I need to build the SQL automatically, like so:

--Use the colnr to reference all columns
DECLARE @colnr AS INT
DECLARE @sql AS NVARCHAR(MAX)
SET @colnr = 1
--Do this for as many columns as you expect there will ever be. Add some automation here if you like
WHILE (@colnr < 50)
BEGIN
SET  @sql =
'INSERT INTO Fact.SomeFactTable
(Value, ID_Metadata)

--And now we do the SQL with the join. Pay attention 

SELECT '+(select column_name from information_schema.columns
where table_name = 'AppGenereatedTable'
AND TABLE_SCHEMA='cube'
AND ordinal_position=@colnr AND DATA_TYPE = 'float')+' as value
,someMetaData.ID_SomeMetadata
FROM Cube.AppGenereatedTable appGenereatedTable
INNER JOIN Dim.SomeMetadata someMetaData on someMetaData.ColName='''+(select
column_name from information_schema.columns
where table_name = 'AppGenereatedTable'
AND TABLE_SCHEMA='cube'
AND ordinal_position=@colnr AND DATA_TYPE = 'float')+''''

print @colnr
print @sql
BEGIN TRY
exec( @sql )
END TRY
BEGIN CATCH
--Catch error. Do nothing
END CATCH
SET @colnr = @colnr + 1
END
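To make the string building easier to follow: for @colnr = 1 (the x1value column), the statement that gets generated and executed looks roughly like this:

INSERT INTO Fact.SomeFactTable
(Value, ID_Metadata)
SELECT x1value as value
,someMetaData.ID_SomeMetadata
FROM Cube.AppGenereatedTable appGenereatedTable
INNER JOIN Dim.SomeMetadata someMetaData on someMetaData.ColName='x1value'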

Well, I don’t know if anyone other than me can use this for something. The next post will be precisely about what I’m using this for 🙂