mardi 21 juin 2016

spark scala xml dataframe get the empty XML tag using SELECT query

I am trying to load an XML file into Scala and then check if the XML tag is empty by running a SELECT query.

Here is the code to load XML file using Spark XML

val df = sqlContext.read
                  .format("com.databricks.spark.xml")
                   .option("rowTag", "Session")
                   .load("test1.xml")

And the actual XML file contents are below

<Aggregator>
    <Session>
        <Data>
            <Log>
                <Event EventTime="2015-01-09T02:13:19">
                    <Start />
                </Event>
                <Event EventTime="2015-01-08T20:27:21">
                    <Power State="PowerOn"/>
                </Event>
                <Event EventTime="2015-01-08T23:54:52">
                    <Play>
                        <Video VideoId="4294967295" VideoType="MP4"/>
                    </Play>
                </Event>
                <Event EventTime="2015-12-27T03:59:20">
                    <Stop/>
                </Event>
                <Event EventTime="2015-05-01T09:37:56">
                    <Pause>
                        <File pauseTime="2015-04-15T19:35:00" key="902"/>
                    </Pause>
                </Event>
            </Log>
        </Data>
    </Session>
    <Session>
        <Data>
            <Log>
                <Event EventTime="2015-01-08T23:54:52">
                    <Play>
                        <Video VideoId="4294967295" VideoType="MP4"/>
                    </Play>
                </Event>
                <Event EventTime="2015-12-27T03:59:20">
                    <Stop/>
                </Event>
                <Event EventTime="2015-05-01T09:37:56">
                    <Pause>
                        <File pauseTime="2015-04-15T19:35:00" key="902"/>
                    </Pause>
                </Event>
            </Log>
        </Data>
    </Session>
    <Session>
        <Data>
            <Log/>
        </Data>
    </Session>
</Aggregator>

After loading this XML into dataframe I am trying to compare get all events which are Start by using below SQL query

df.registerTempTable("eventsData")
sqlContext.sql(""" select * from eventsData where Event.Start is null""").show()

However this is not running successfully and not returning accurate output.

Can someone help me to get those records where Event has Start as its child tag??

Aucun commentaire:

Enregistrer un commentaire