Archive for the ‘Uncategorized’ Category
Exporting a RavenDB Index to CSV using Powershell
Part of the system I maintain needed to have a CSV (for viewing in Excel) generated daily of line items and their prices, since the information was already transformed and ready to be consumed from a RavenDB Index I used a little Powershell scripting to dump it to a CSV. Easy, practical and can be wired up to Task Scheduler in a few minutes.
Since CSVs are generally flat we are explicitly in what we fetch (see $fields variable) this causes Raven to pull only from Lucene index (a thus a lot faster) rather than using the index as doc id lookup. To get full compatibility with Excel you have to be a little careful with encoding, otherwise you have to fallback to the import wizard, with ascii it is double click open, fortunately in this instance we live within the ascii charset.
At GistHub: http://gist.github.com/548970
$baseUrl = "http://localhost:8080"
$outputFile = "salesexport.csv"
$fields = "Manufacturer,ProductName,RetailPrice,ProductCode"
$fieldsArray = $fields.Split(',')
$startAt = 0;
$pagingSize = 100
# Build your Index in RavenDB - best to do your hard work there.
# script supports flat indexes so we use fields in query to ensure it pulls
# from Lucene rather than a Lucene to DocID lookup - and again much faster
[System.Reflection.Assembly]::Load("System.Web.Extensions, Version=3.5.0.0," `
+ "Culture=neutral, PublicKeyToken=31bf3856ad364e35")
$serializer = New-Object System.Web.Script.Serialization.JavaScriptSerializer
$serializer.MaxJsonLength = [System.Int32]::MaxValue # probably too laxed
$webClient = New-Object System.Net.WebClient
#clear the file and write the columns
#using ascii since Excel presumes CSV and has problem open .csv directly when BOM
#present. if you need unicode use utf8, save file as .txt and use wizard
[String]::Join(",",($fieldsArray | %{ ('"' + $_ + '"')})) | Out-File -Encoding ascii $outputFile
do {
$indexUrl = "$baseUrl/indexes/Shop/Brochure?start=$startAt&pageSize=$pagingSize&" `
+ [string]::Join("&",($fieldsArray | % { "fetch=$_" } ))
$response = $webClient.DownloadString($indexUrl)
$jsonDic = $serializer.DeserializeObject($response)
$results = $jsonDic["Results"]
$script:totalResults =$results.Length
foreach($result in $results)
{
# ravendb includes the __document_id in the ouput as the last key so
# we select only fields we specify by using a select -first
[string]::Join(",", (% { $result.Keys } | select -first $fieldsArray.Length |`
% { ('"' + $result[$_].Replace(",","\,") + '"') } )) `
| Out-File -Encoding ascii $outputFile -Append
# if you want custom mapping just use $result[column-name] to build your own
}
$startAt = $startAt + $pagingSize
} while ( $script:totalResults -ne 0 )
Again we use the .Net serializer rather than Newtonsoft.Json due to strange errors that I dont have time to troubleshoot, this exports more than 20,000 records in less than a minute (on a modest desktop). Enjoy
Moving data from SQL Server to RavenDB using Powershell
We have a legacy-style product information database in SQL Server that is updated every few days by the vendor (when I say updated I mean delete tables and recreated – yuck) – we have no control over this database design apart from adding our views (which thankfully they don’t delete) and our front end display database is in RavenDb (document database with restful HTTP interface), one of my goals this weekend was to figure out a simple way to bulk move data periodically from the product database to the document database.
(simple google docs drawing)
Initially I played with Rhino ETL but even that was overkill (and a little slow – but a beautiful framework) so inspired from recently looking at psake I decided to see how far I could get with Powershell sanely. The great thing about using a admin-friendly scripting language like powershell if that is easy to modify if your schema changes and it’s extremely easy to deploy, although it might not be the best code. Due to Powershell’s full .Net support and RavenDB’s restful interface it was pretty easy to get this up and running – and a good opportunity to further enhance my Powershell skills.
Here’s the result – a rough and simple ETL powershell script for moving data from a SQL Server table to RavenDB, it tries a PATCH first incase the doc already exists, failing that it will do a PUT, since given other properties are set from different sources an overwrite of the document is not going to work for us.
# requires curl on PATH - if you installed mysysgit with it added to all PATHS then
# you should be okay, otherwise download from http://curl.haxx.se
# presume integrated security edit connection otherwise
$sqlserver = ".\SQLEXPRESS"
$db = "DATAVENDOR_CAR"
$baseUrl = "http://localhost:8080"
# first value selected must be the doc id - ideally create dedicated view instead
$sqlQuery = "SELECT CarId, Manufacturer, Model," `
+ "Title, BodyStyle, Doors, CO2, Trim, DriveTrain, Transmission," `
+ "FuelType, EngineSize, ManufacturerRetailPrice, SourceUpdated FROM CarView"
$connection_string = ("Data Source=$sqlserver; Initial Catalog=$db;Integrated Security=SSPI")
$conn = New-Object System.Data.SqlClient.SqlConnection($connection_string)
# newtonsoft.json created weird errors
[System.Reflection.Assembly]::Load("System.Web.Extensions, Version=3.5.0.0, Culture=neutral,
PublicKeyToken=31bf3856ad364e35")
$serializer = New-Object System.Web.Script.Serialization.JavaScriptSerializer
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = $sqlQuery
$reader = $cmd.ExecuteReader()
$derivativeIds = New-Object 'System.Collections.Generic.List[string]'
while ($reader.Read()) {
$docid = $reader.GetValue(0)
$url = ([System.Uri]("$baseUrl/docs/" + $docid)).AbsoluteUri.ToString();
# build a list of SET operations - either update or create property
# only affects properties existly set in the operations, does not overwrite
# whole document.
$listOfSetOperations = New-Object 'System.Collections.Generic.List
[System.Collections.Generic.Dictionary[string,object]]'
for ($i = 1; $i -lt $reader.FieldCount; $i = $i +1) {
$dic = New-Object 'System.Collections.Generic.Dictionary[string,object]'
$dic.Add("Type","set")
$dic.Add("Name",$reader.GetName($i))
$dic.Add("Value",$reader.GetValue($i))
$listOfSetOperations.Add($dic)
}
$jsonPayLoad = $serializer.Serialize($listOfSetOperations).Replace("`"","'")
$response = curl -w '%{HTTP_CODE}' -X PATCH $url "-d $jsonPayLoad"
# doc does not exist so lets insert it. I go this route rather than a single update-or-insert PUT
# because if the doc already exists on the ravendb it will have additional properties associated with
# that have been added/modified since (i.e. new properties dont exist in this sql source db)
if($response -eq '404') {
$dictionary = New-Object 'System.Collections.Generic.Dictionary[string,object]'
for ($i = 0; $i -lt $reader.FieldCount; $i = $i +1) {
$dictionary.Add($reader.GetName($i),$reader.GetValue($i))
}
$dictionarySerialized = $serializer.Serialize($dictionary).Replace("`"","'")
curl -X PUT $url "-d $dictionarySerialized"
}
}
Example also available on gisthub here, adapt it per your needs, http://gist.github.com/543499.
I had strange errors relating to Value property with Newtonsoft.Json, so I fell back to .Net Javascript serializer. We could additionally clean this up by using WebClient instead of relying on curl. And of course this is void of any real error handling or performance analytics and as junior to Powershell any best-practices, feel free to extend it, adapt it, it isn’t meant to be beautiful code so don’t punish me too hard.
Later in the week I plan to batch the requests into groups of 50 (or a configurable amount), per the documentation at http://ravendb.net/documentation/docs-http-batching-2 this should significantly improve throughput, I’ll update the script on gisthub.
I used PowerGUI script editor as the main IDE (although it does crash a lot – have process explorer / task manager handy), but for a free tool it’s a good start. For debugging Raven communication add –v parameter to the curl command, or use Fiddler, or plugin WebClient with some verbosity. Also note, by default RavenDb requires authentication for anything other than GET requests so you’ll need to either change the behaviour via the config file or ensure your http requests are authenticated.