Origin-Destination Data: a prototype and related scenarios

The Public Transportation System and its operation management require the processing of large amounts of data (such as bus routes, user data and bus schedules). In particular, origin-destination data indicates citizens’ travel patterns, providing insights into the dynamics of urban space occupation. Given this scenario, this paper presents a prototype for origin-destination data visualization, based on queries associated with a set of trips (and related attributes), analysis of trips and services for Curitiba, clustering of georeferenced data (for visualization) and a local study of Origin-Destination from the Public Transportation of Curitiba. The novelty lies in visualization through clustering of georeferenced data, allowing the analysis of different regions of interest (neighborhoods, regionals or mathematical regions obtained with the K-means algorithm). We demonstrate the prototype through several scenarios and interviews with local citizens. Challenges related to the meaningful presentation of results are discussed from the perspective of visualization and analytics.


Introduction
The growth of urban centers brings challenges related to urban mobility that impact the well-being of the population. In this scenario, public administrations have sought to follow the Open Data trend, making city data available in Open Data Portals (such as Paris (opendata.paris.fr), New York (opendata.cityofnewyork.us) and Moscow (data.gov.ru)). In Brazil, the city of Curitiba followed this trend by making its data available through several sources (City Hall Portal, Instituto de Pesquisa e Planejamento Urbano (IPPUC), Urbanização de Curitiba (URBS), TransitFeeds and GTFS).
In Curitiba, each day vehicles carry over . . passengers (where % of them use smartcards) through different routes and bus terminals, resulting, on average, in . bus trips on . km per day over bus stops. If we consider one month of data from user smartcard pick-ups (Sep th - Oct th ) in Fig. , weekends (such as Sept th, Oct st, Oct th, Oct th) have a lower average than weekdays, with a total of . . smartcard uses in the period. The number of routes and vehicles is also affected by weekends.
From a formal perspective, a transport network has been defined by Añez et al. ( ) as a set of directed links of the form $N = (V, L)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is a set of vertices and $L$ is a set of links, each of the form $(v, w, Q_{vw})$ with $v, w \in V$, where $v$ and $w$ are the origin and destination nodes and $Q_{vw}$ is a set of attributes of the link, such as distance, capacity, number of passengers and vehicle speed. Since the origin and destination are node numbers, once all such calculations have been performed, the model can be transformed back into the original graph notation without any additional computational effort. The same model can be extended to multidimensional networks, to represent transit routes that share a physical link. Alternative methods include arc-vertex and forward star (Añez et al., ). If we consider the definition above along with the data from Curitiba, $L$ could be presented as shown in Table , where $v$ and $w$ could be a bus stop or a bus terminal, and $Q_{vw}$ is a set of parameters available from the local bus administration.
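The link-based definition above can be made concrete with a small data structure. The following is a minimal sketch, not the representation used by the prototype: the class names `Link` and `TransportNetwork`, the attribute keys and the stop identifiers are hypothetical and only illustrate how $N = (V, L)$ with links $(v, w, Q_{vw})$ could be held in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    """One directed link (v, w, Q_vw) of the transport network."""
    origin: str             # v: origin node (bus stop or terminal)
    destination: str        # w: destination node
    attributes: tuple = ()  # Q_vw stored as sorted (key, value) pairs


@dataclass
class TransportNetwork:
    """N = (V, L): a set of vertices and a list of directed links."""
    vertices: set = field(default_factory=set)
    links: list = field(default_factory=list)

    def add_link(self, origin, destination, **attributes):
        # Adding a link also registers both of its end nodes in V.
        self.vertices.update((origin, destination))
        link = Link(origin, destination, tuple(sorted(attributes.items())))
        self.links.append(link)
        return link

    def links_from(self, origin):
        # All links that leave the given origin node.
        return [link for link in self.links if link.origin == origin]


# Hypothetical example: stop identifiers and attribute values are made up.
network = TransportNetwork()
network.add_link("stop_A", "terminal_B", line="INTERBAIRROS IV", gender="F")
network.add_link("stop_A", "stop_C", line="BIGORRILHO", gender="F")
network.add_link("terminal_B", "stop_A", line="INTERBAIRROS IV", gender="M")
```

Keeping the attributes as a tuple of pairs makes each link hashable, so links could also be stored in a set if duplicates should be collapsed.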
From the GIS and data perspective, the public transport system has geographic and temporal components, as well as data on its dynamics, for example, about travel. A bus trip (where a user goes from an origin to a destination and comes back to an origin) can be characterized by dates, departures and arrivals (namely origin-destination data) associated with a bus line (predetermined route) at passenger pick-up and drop-off points. A trip can also have information such as name and line code, vehicle code, user smartcard id, gender, date of birth, and latitude and longitude of the bus (for example, every five minutes). From the GIS perspective, bus stops, terminals, bus trips, and passenger pick-ups and drop-offs are stored in low-level features such as points, lines and polygons, with known latitude and longitude. The analysis can be focused on places, tasks, phenomena, cause-effect, OD flows and simulations, among others. In order to understand the OD dynamics, in general, users and Public Transportation System (PTS) managers are interviewed and reports are issued about the information collected (Pichiliani, ). For manipulation and processing of the referred data it is possible to use tools such as ArcGIS, QGIS and Excel, where generally only part of the data is analyzed. More complex analyses require knowledge of specific languages (such as SQL queries within QGIS) and are time-consuming.
Data sources: City Hall Portal (www.curitiba.pr.gov.br/dadosabertos/), IPPUC (ippuc.org.br), URBS (urbs.curitiba.pr.gov.br/), TransitFeeds (https://transitfeeds.com/l/388-brazil/), GTFS (https://developers.google.com/transit/gtfs/reference).
In this direction, this paper presents a visual tool to support spatial-temporal queries over OD data, with the following requirements: 1) to be able to visualize queries associated with a set of bus trips (and related attributes) for non-experts (Ferreira et al., ), 2) to visualize details of the analysis of bus trips and services focusing on particularities of Curitiba (Diniz Jr., ), 3) to be able to scale the visualization using clustering of georeferenced data (Vila et al., ), and 4) to enhance the OD analysis from Pichiliani ( ), along with its particular characteristics (such as having an interface similar to Tableau) and variables (similar division for age, time, regions and week aggregation). The novelty lies in visualization through clustering of georeferenced data (Vila et al., ), allowing the analysis of different regions of interest (neighborhoods, regionals or mathematical regions obtained with the K-means algorithm). From the data perspective, the visualization can provide both overview and details, maintaining the spatial and temporal contexts. We demonstrate our tool through several scenarios motivated by a local study of Origin-Destination (Pichiliani, ) and interviews with local citizens. The remainder of this paper covers, in order, related work, the prototype, the case studies and our conclusion.

Related Work
Public transportation is one of the most critical areas from the city perspective. Mobility challenges have already gained the attention of the computer science community in Brazil (http://www.sbc.org.br/documentos-da-sbc/send/141-grandes-desafios/802-grandesdesafiosdacomputao/no-brasil). In particular, these challenges can be grouped into the following areas (Vila et al., ): (i) discovery of patterns, (ii) data statistics, (iii) data integration, (iv) location and tracking, (v) open and connected data, (vi) contextual information, (vii) security and privacy, (viii) energy and management, (ix) use of cloud resources, and (x) trajectories with semantic information, among others.

Table :
v | w | Q vw
(-. ;-. ) | (-. ;-. ) | ("Linenumber=INTERBAIRROS IV", "Vehicle=MC ", "Smartcard= ", "DateTime= --: : ", "Birthdate= --", "Gender=F")
(-. ;-. ) | (-. ;-. ) | ("Linenumber=INTERBAIRR II", "Vehicle=DR ", "Smartcard= ", "DateTime= --: : ", "Birthdate= --", "Gender=F")
... | ... | ...
(-. ;-. ) | (-. ;-. ) | ("Linenumber=BIGORRILHO", "Vehicle=BC ", "Smartcard= ", "DateTime= --: : ", "Birthdate= --", "Gender=F")

If we consider legislation, NBR ISO : ("Desenvolvimento sustentável de comunidades - Indicadores para serviços urbanos e qualidade de vida") can be cited as the Brazilian technical standard for sustainable cities. This standard is a translation and adaptation of the standard ISO : , "Sustainable development of communities - Indicators for city services and quality of life", elaborated by TC- (Technical Committee). NBR ISO : proposed a standardization of indicators related to the city (urban services offered and quality of life, among others). The objective was to provide opportunities for comparative analyses of different communities, favoring the exchange of experiences and good practices (Couto, ). From the information perspective, several efforts can be listed. Curitiba and New York, for example, were analyzed by Parcianello et al. ( ) in a comparative study based on open data.
This study identified similarities and contrasts between the existing systems and highlighted the importance of seeking to implement an inter- and multimodal public transport system. Different transportation telematic services were proposed in Diniz Jr. ( ) (such as location without bus routes, average bus crowding, alerts for different routes and speeding alerts), along with efficiency (Braz et al., ), using the same data used in this paper. A comparison of open data on road networks, demographics, territorial extension and other urban indicators from cities of the state of São Paulo was proposed in Spadon et al. ( ). The study also applied complex network concepts and clustering algorithms to classify such cities by similarity from different perspectives. The public transportation system of Rio de Janeiro was used in the study of Cruz et al. ( ), where anomalies were identified and classified (using open text and data mining) with the Apriori algorithm.
A comparative analysis of Chicago and Melbourne using complex network models was presented in Saberi et al. ( ), as a first step towards a better understanding of the structure, interactions, and evolution of travel demand networks in cities. They suggested that the underlying processes in travel demand, viewed as a network, are also driven by the interaction strength between places (or nodes).
Images are important in GIS, but in transportation, movement is crucial (examples in Andrienko and Andrienko ( )). As already mentioned in Ferreira et al. ( ), much of the work is based on trajectory data, where the location of moving entities is recorded. In contrast, multivariate OD data has only the start and end positions, together with attributes associated with the movement. In the same paper, taxi data visualization is studied, and a solution is proposed for non-experts, with different visualization approaches.
Standards cited above: NBR ISO (abntcatalogo.com.br/norma.aspx?ID=366389), ISO (iso.org/standard/62436.html), Technical Committee (iso.org/committee/656906.html).
Regarding data presentation, several projects can be mentioned, such as DataViz, VisualComplexity or CityGeographics. In addition, open data already comes with interactive solutions, such as the Manhattan Population Explorer or the Transit Accident Dashboard from IPPUC.
In Vila ( ), a web-mobile solution was proposed, allowing spatiotemporal use of georeferenced data. For the presentation of results, graphic resources such as markers, heat maps and marker clustering were used.
Studies involving OD can benefit from the adoption of different data visualization techniques (Andrienko et al., ; Guerra, ). According to Guerra ( ), OD studies aim to identify the number of displacements made and the profile of the citizens who travel over a period of time from a home zone to a destination zone. These zones can be defined from geographic divisions (based on neighborhoods and macroregions, for example) or via mathematical divisions (through data clustering techniques), among others. A general framework for using visual analytics techniques and workflows in place connectedness studies is presented by Andrienko et al. ( ), using place-centered tasks, link-centered tasks and intermediate-level tasks along with their costs. In Itoh et al. ( ), unusual phenomena (e.g. marathons) and their propagation (cause/effect) in spatio-temporal space are presented using visualizations such as heatmaps, AnimatedRibbon and TweetBubble. Palomo et al. ( ) present a visual exploration tool (composed of a trip explorer and a stop explorer), developed to identify, inspect and compare spatio-temporal patterns for planned and real transportation service. In Wood et al. ( ), OD vectors are mapped as cells rather than lines, using a hash grid spatial data structure to enhance scalability to large collections of vectors.
Project sites: DataViz (datavizproject.com), VisualComplexity (www.visualcomplexity.com), CityGeographics (citygeographics.org), Manhattan Population Explorer (manpopex.us), Transit Accident Dashboard (ippuc.org.br/mapasinterativos/AcidentesDeTransito/dashboard.html).
The main challenges in using several approaches for OD data include: 1) the performance of processing and visualizing a huge amount of data; 2) the integration of different technologies (not all of them are compatible); 3) the available software is often not free, and its source code is not easy to understand and integrate with other software, among others (free datasets, metadata). Several theoretical approaches for Public Transport Systems can be mentioned, such as graphs (Silva et al., ; Chapleau and Morency, ; Zhang et al., ), Marey's graphs (Palomo et al., ), matrices (Diniz Jr., ) and hash grids (Wood et al., ). Clustering algorithms can also be used to analyze PTS. According to Cassiano ( ), data clustering (or cluster analysis) is a multivariate data mining technique that aims to group the n database cases into k groups called clusters. Data clustering can also be defined as the process of grouping information, considering: (i) the existence of a strong similarity between the elements belonging to the same group; (ii) the existence of a weak similarity between elements belonging to different groups (Zaiane et al., ). In the literature, clustering is also called Cluster Analysis, Q-analysis, Typology, Classification Analysis or Numerical Taxonomy.
In particular, the K-means clustering algorithm (Silva et al., ; Osama et al., ) was used in this paper to partition the regions into K groups, using their geographic position (given by latitude and longitude coordinates). After the centroid coordinates of each group are set randomly, the algorithm alternates between two steps: the assignment step, where each region is assigned to its nearest group (considering the group centroid coordinates), and the update step, where the centroid coordinates of each group are updated according to its assigned regions. A K value of , for example, is used to present smaller regions of the city.
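The two alternating steps described above can be sketched in a few lines. This is an illustrative sketch only, not the code of the prototype: the function names, the choice of drawing the initial centroids from the input points, the fixed seed and the sample coordinates are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    # Plain Euclidean distance on (lat, lon) pairs: a simplification that
    # ignores the curvature of the Earth, used here only for illustration.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=100, seed=0):
    """Partition (lat, lon) points into k groups by alternating two steps."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points picked at random.
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)

    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed group
        assignment = new_assignment

        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment


# Hypothetical coordinates, not taken from the Curitiba datasets.
points = [(-25.40, -49.30), (-25.41, -49.31), (-25.42, -49.29),
          (-25.50, -49.20), (-25.51, -49.21), (-25.49, -49.19)]
centroids, assignment = kmeans(points, k=2)
```

Because the initial centroids are random, different seeds may lead to different partitions; running the algorithm several times and keeping the best result is a common precaution.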

Prototype

Requirements.
For the prototype, a questionnaire was submitted to users of the PTS in the academic community. This questionnaire was applied in April to , and it was composed of questions such as: a) whether the citizen had already had any contact with studies related to the use of the Curitiba PTS, b) whether it would be interesting to develop a solution that would allow visualizing, quantifying and exploring data from the Curitiba PTS, c) which search filters the solution should offer and d) what the possible uses of this solution would be. From the analysis of the answers obtained, we notice that: (a) all interviewed citizens had already had some contact with studies involving the use of the Curitiba PTS; (b) a solution that would allow them to view, quantify and explore data related to the use of public transport would be interesting; (c) among the requested filters (Fig. ), the most highly rated were: 1) trips by drop-in and drop-off regions, 2) trips by age group of users and 3) trips based on passenger gender. Users also indicated that such a solution would be applicable 1) to the study of public transport demand, 2) to user profile analysis and 3) to the study of the dynamics of urban space occupation. These answers informed the development of a prototype designed to allow users to easily perform OD analysis.

Figure : The main data filters requested by users.
The report on OD data in Curitiba from IPPUC (Pichiliani, ) was also used in order to understand which analyses were necessary and which variables and classifications were used, such as gender ("F" or "M"), day shift classifications ( : to : , : to : , : to : , : to : ), age classification ( to years, to years, to years, to , greater than years), along with the classification by neighborhoods and regionals. Note that similar characteristics will be used for visualization in this prototype, such as having an interface similar to Tableau, and variable aggregation (similar division for age, time, regions and week aggregation). In particular, users in the age ranges " to years" and "greater than years" do not pay any fee for using the buses in Curitiba. The age range " to years" is used for children, and " to " for adolescents, according to the Statute of Children and Adolescents in Brazil. The age range " to " is used for the other users of the PTS. Note that the traditional way of performing OD analysis is through interviews.
In particular, this proposal led to the following requirements: 1) to be able to visualize queries associated with a set of bus trips (and related attributes) for non-experts (Ferreira et al., ), 2) to visualize details of the analysis of bus trips and services for Curitiba mentioned in Diniz Jr. ( ), 3) to be able to scale the visualization using clustering of georeferenced data (Vila et al., ), with further options other than districts, and 4) to enhance the OD analysis from Pichiliani ( ), along with its particular characteristics (such as Tableau, providing a visualization of both overview and details) and variables.

Data
Initially, files (CSV and SHP formats) with a total size of . GB of data were used. The datasets used in this paper come from IPPUC and the Municipality of Curitiba. Fig. presents shapefile format.
Bus Terminals. The city has terminals (buses) and one terminal which also uses trains (data provided in shapefile format).
Neighborhoods and City Regionals. The city has distinct neighborhoods and regionals, provided in shapefile format.
User Cards. The anonymized data was provided as zipped CSV files, with , , tuples from October .
Bus Locations. The data was provided as zipped CSV files, with , , tuples from October .
Movement. The data from User Cards was combined with Bus Locations, providing the movement data or OD data. The final table presented , , tuples, using the same criteria for selecting OD data as in Diniz Jr. ( ). The resulting data from this table can also be presented as shown in Table .
ROIS. The possible regions of interest (ROIS) were defined as geometric definitions of neighborhoods, city regionals or K-means regions. The mathematical regions obtained with K-means (K equal to , , , , , , , , or ) were inspired by Silva et al. ( ) and Osama et al. ( ).
The overall objective of the prototype was to offer not only the traditional areas of the city (regionals, neighborhoods) but also regions based on a mathematical division of the city (which might be bigger or smaller than the traditional ones). Fig. presents all the possible regional divisions for the city of Curitiba. The best K parameter for the clustering was chosen with the Elbow Method (Tibshirani et al., ; Stolfi et al., ), using seven days of data. In addition, the clustering enhanced the visualization of the aggregation groups.
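The idea behind the Elbow Method can be illustrated with a short sketch. The heuristic below (picking the point of the cost curve farthest from the straight line that joins its two ends) is only one common way to locate the elbow and is not necessarily the procedure used for the prototype; the function name `elbow` and the cost values are hypothetical.

```python
def elbow(ks, costs):
    """Return the K whose (K, cost) point lies farthest from the straight
    line joining the first and last points of the cost curve."""
    (x1, y1), (x2, y2) = (ks[0], costs[0]), (ks[-1], costs[-1])

    def distance_to_line(x, y):
        # The unnormalized distance is enough, since only the argmax matters.
        return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)

    return max(zip(ks, costs), key=lambda kc: distance_to_line(*kc))[0]


# Hypothetical within-cluster sums of squares for K = 1..6.
ks = [1, 2, 3, 4, 5, 6]
costs = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
best_k = elbow(ks, costs)
```

With these made-up costs the curve flattens after K = 3, which is the value the heuristic selects. The result depends on the relative scale of the two axes, so in practice the curve is usually also inspected visually.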
In order to improve performance, B-tree indexes were used on non-GIS columns (gender, dates, smartcard number, among others), and GiST indexes were used on all GIS columns. The biggest table (Bus Locations) used table partitioning by day and indexes in order to increase performance. The movement table (the biggest one used by the interface, with the query in Fig. ) used three non-GIS indexes and GIS indexes (origin and destination). Tests selecting a date range of , and days showed a % better average response time after creating such specific indexes (the biggest test query, with , tuples, returned in seconds, compared to seconds without indexes). Note that the majority of the related work cited in this paper does not use GIS databases to store the OD data.
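The effect of an index on a date-range query can be observed with a small experiment. The sketch below does not reproduce the GIS database of the prototype: it swaps in Python's built-in `sqlite3` module (whose indexes are B-trees) purely for illustration, it does not cover GiST indexes or table partitioning, and the table name, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movement ("
    " smartcard TEXT, gender TEXT, boarded_at TEXT,"
    " origin_lat REAL, origin_lon REAL)"
)
# Hypothetical rows; dates and coordinates are made up.
rows = [
    ("c1", "F", "2000-01-01 07:10:00", -25.50, -49.33),
    ("c2", "M", "2000-01-01 18:45:00", -25.43, -49.27),
    ("c3", "F", "2000-01-02 08:05:00", -25.51, -49.34),
]
conn.executemany("INSERT INTO movement VALUES (?, ?, ?, ?, ?)", rows)

query = "SELECT smartcard FROM movement WHERE boarded_at BETWEEN ? AND ?"
params = ("2000-01-01 00:00:00", "2000-01-01 23:59:59")


def plan(sql, args):
    # The last column of EXPLAIN QUERY PLAN holds the readable plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, args)]


plan_before = plan(query, params)   # full table scan
conn.execute("CREATE INDEX idx_movement_boarded_at ON movement (boarded_at)")
plan_after = plan(query, params)    # search through the new index
result = conn.execute(query, params).fetchall()
```

Comparing `plan_before` with `plan_after` shows whether the planner switched from scanning the whole table to searching the index, which is the kind of change that lies behind the response-time improvement reported above.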

The prototype.
With the requirements checked, the first step was to build the filter panel (listed in modules in Fig. ): start and end date, the ROIS (regionals, neighborhoods or mathematical regions, via the K-Means algorithm), the origin and destination areas, terminals, bus stops, gender and age (as listed in Fig. -left). Internally, the selected filters are processed through specific indexes and queries (such as Fig. ) in the database. As defined in Ferreira et al. ( ), each query has a set of temporal, attribute and spatial constraints. Such constraints can also be mapped to Peuquet's Triad Framework (Peuquet, ) for spatiotemporal data: space (where), time (when) and objects (what).
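A query built from temporal, spatial and attribute constraints can be sketched as a simple in-memory filter. This is an illustration under stated assumptions, not the database query used by the prototype: the `Trip` fields, the `select_trips` function and the sample trips are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Trip:
    boarded_at: datetime      # when
    origin_region: str        # where (origin)
    destination_region: str   # where (destination)
    gender: str               # what (attribute)
    age: int                  # what (attribute)


def select_trips(trips, start, end, origin=None, destination=None,
                 gender=None, age_range=None):
    """Keep the trips that satisfy every constraint that was provided."""
    selected = []
    for trip in trips:
        if not (start <= trip.boarded_at <= end):
            continue  # temporal constraint
        if origin is not None and trip.origin_region != origin:
            continue  # spatial constraint on the origin
        if destination is not None and trip.destination_region != destination:
            continue  # spatial constraint on the destination
        if gender is not None and trip.gender != gender:
            continue  # attribute constraint
        if age_range is not None and not (age_range[0] <= trip.age <= age_range[1]):
            continue  # attribute constraint
        selected.append(trip)
    return selected


# Hypothetical trips; dates, regions and ages are made up.
trips = [
    Trip(datetime(2000, 1, 3, 7, 30), "CIC", "Downtown", "F", 34),
    Trip(datetime(2000, 1, 3, 18, 5), "Downtown", "CIC", "F", 34),
    Trip(datetime(2000, 1, 4, 8, 0), "CIC", "CIC", "M", 19),
    Trip(datetime(2000, 1, 9, 9, 0), "CIC", "Downtown", "F", 70),
]
result = select_trips(trips, datetime(2000, 1, 3), datetime(2000, 1, 5),
                      origin="CIC", gender="F")
```

Leaving a constraint as `None` means "no restriction", which mirrors a filter panel where only some of the fields are filled in.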
The results are presented in two panels: the OD visualization in the Data Visualization panel, followed by basic statistics in the Graphics panel. The result of Q (which initially should be data similar to Table ) changes the plots in the Data Visualization and Graphics panels each time the query is changed. The Data Visualization panel in Fig. (right side) shows the origin data in blue and the destination data in red. It is possible to execute intra-region queries by selecting the same region as origin and destination. The prototype also presents a Graphics panel (Fig. ), with the following information: general drop-in classification by age ("A"), general drop-in classification by gender ("B"), drop-in classification by gender through the selected days ("C"), drop-in classification by age through the selected days ("D"), and drop-in classification by age through the selected days ("E").
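The per-day classifications shown in the Graphics panel amount to counting drop-ins per day and per category. A minimal sketch of that aggregation follows; the record layout, the function name `drop_ins_per_day` and the age group labels are hypothetical.

```python
from collections import Counter
from datetime import date


def drop_ins_per_day(records, category):
    """Count drop-ins per (day, category value), e.g. category='gender'."""
    counts = Counter()
    for record in records:
        counts[(record["day"], record[category])] += 1
    return counts


# Hypothetical records; days and labels are made up.
records = [
    {"day": date(2000, 1, 3), "gender": "F", "age_group": "adult"},
    {"day": date(2000, 1, 3), "gender": "M", "age_group": "adult"},
    {"day": date(2000, 1, 3), "gender": "F", "age_group": "elderly"},
    {"day": date(2000, 1, 4), "gender": "F", "age_group": "adult"},
]
by_gender = drop_ins_per_day(records, "gender")
by_age = drop_ins_per_day(records, "age_group")
```

The same function serves both the gender and the age charts, since only the grouping key changes.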
Note that the technologies presented in Fig. use data stored in a remote GIS database and a web-based engine (OpenStreetMap), along with several libraries, in order to provide the results to a client.

Preliminary Evaluation.
In order to evaluate the usability of the prototype, an experiment was conducted on November th, , with five males and two females, from to years old. Three of them were not from computer science (Sociology, Mathematics and Geography). The participants had to perform three activities: 1) an "easy one", using a gender filter with data for one day (with an average execution time of , secs); 2) a "medium one", using age classification for ten days of data (with an average execution time of , secs); and 3) a "difficult one", using days of data (with an average execution time of secs).
The tests ran on home PCs, in parallel. All the participants were able to perform the three activities without any prior experience with the application.
A questionnaire (with four specific questions and one descriptive question) was submitted to them, with the following results: 1) the prototype was intuitive to use; 2) the filters were considered adequate; and 3) the results were easy to understand using the filters. In parallel, a specialist who used the city report (Pichiliani, ) stated that the prototype enhanced the overall analysis. Further details are available in Parcianello ( ).

Case Studies.
In this section we present case studies to illustrate the usability of the prototype and the basic patterns, exploring subjects such as clustering.

Classification of drop-ins across day shifts in a month.
If we analyze the drop-ins by shift, we note that the majority is concentrated in the morning shifts (as shown in Fig. ). Note that there is a higher demand on weekdays, and a decrease on weekends. Exceptions are noted on Saturday / and Thursday / (Children's Day holiday). Note also that the demand is higher during the morning shifts, followed by afternoon, night and dawn.
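Classifying drop-ins by day shift can be sketched as a lookup on the hour of each record. The shift boundaries below are assumptions chosen for the example (four six-hour shifts) and may differ from the ranges used in the city report; the function names and timestamps are hypothetical as well.

```python
from collections import Counter
from datetime import datetime

# Assumed shift boundaries: (name, start hour inclusive, end hour exclusive).
SHIFTS = (("dawn", 0, 6), ("morning", 6, 12),
          ("afternoon", 12, 18), ("night", 18, 24))


def shift_of(moment):
    """Return the name of the day shift that contains a timestamp."""
    for name, start_hour, end_hour in SHIFTS:
        if start_hour <= moment.hour < end_hour:
            return name
    raise ValueError("hour must be between 0 and 23")


def drop_ins_by_shift(timestamps):
    # One count per shift name, over all timestamps.
    return Counter(shift_of(t) for t in timestamps)


# Hypothetical drop-in timestamps.
timestamps = [datetime(2000, 1, 3, 7, 15), datetime(2000, 1, 3, 8, 40),
              datetime(2000, 1, 3, 13, 5), datetime(2000, 1, 3, 23, 50),
              datetime(2000, 1, 4, 2, 30)]
by_shift = drop_ins_by_shift(timestamps)
```

Keeping the boundaries in a single table makes it easy to swap in the official shift definitions once they are known.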

Neighborhood drop-ins.
In order to analyze the drop-ins (origin) aggregated by neighborhood, Eq. ( ) was used. Note that CIC, Curitiba's most populous neighborhood (more than , according to IBGE) and the one with the largest territorial area (about km ), is also the one with the highest IOco, indicating that there is considerably greater movement of people. In particular, if we analyze only the drop-ins in CIC, we note that these citizens have as drop-offs CIC itself, along with the neighborhoods in the region, as shown in Fig. . The prototype analysis of people who have their drop-ins in CIC during the morning shifts (Fig. ) for thirty days shows that the majority of the citizens are female, with ages concentrated between and years. The map also shows that bus stop drop-ins are evenly distributed in the CIC neighborhood.

IOco = Number of Events in the Neighborhood
.

Intra-Region Movements.
If we consider intra-region analysis (the movements where the origin neighborhood is the same as the destination one), we can also notice that the CIC neighborhood and the downtown area have the highest rates (Fig. -B). The prototype analysis using K-Means equal to (for data from Oct. th, - Fig. ) shows that CIC has some concentration of bus stops for drop-ins and drop-offs.
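Intra-region movements can be obtained by keeping only the trips whose origin and destination regions coincide. The sketch below illustrates this with hypothetical trips; the function name `intra_region_counts` is an assumption, and the regions could equally be neighborhoods, regionals or K-means groups.

```python
from collections import Counter


def intra_region_counts(trips):
    """Count the trips whose origin and destination regions coincide."""
    return Counter(origin for origin, destination in trips
                   if origin == destination)


# Hypothetical (origin region, destination region) pairs.
trips = [("CIC", "CIC"), ("CIC", "Downtown"), ("Downtown", "Downtown"),
         ("CIC", "CIC"), ("Downtown", "CIC"), ("CIC", "CIC")]
counts = intra_region_counts(trips)
busiest_region, busiest_count = counts.most_common(1)[0]
```

Dividing each count by the total number of trips that start in the region would turn these counts into rates, if a normalized comparison between regions is preferred.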

Regional Movements.
If we consider intra-region analysis (the movements where the origin neighborhood is the same as the destination one), we can also notice that the CIC neighborhood and the downtown area have the highest rates (Fig. ). The prototype analysis of the CIC and Downtown regionals (using them as drop-ins and drop-offs, with one month of data) shows that not only the main neighborhoods have the highest concentration of citizens, but also the districts around them. Here, the majority is still female, ranging from to years. Note that the Downtown regional has a higher concentration than the CIC regional.

K-Means for K=
In order to understand smaller regions (compared to neighborhoods and regionals), we included a test using K-means with K= . Note that this type of analysis is not available in Pichiliani ( ). The prototype analysis using one month of data, with drop-ins in CIC and drop-offs in the Downtown area (Fig. ) and K-means with K= , shows that the OD distribution is fairly even among the regions, with the highest concentration in the downtown area. That shows not only that the regions have a higher concentration of passengers, but also that the concentration is spread across their territories.

Lessons Learned.
Along with data presentation, several factors should be considered because they impact the design of OD prototypes: 1) the filters and their classification (which ones are relevant to the user, how easy they are to use and implement, and which results are better understood). Although this paper mainly focused on data-based filters (with twelve categories and subcategories) derived from the local reports, several others could be included (such as different visualization techniques, events, cause/effect), which may lead to too many filter options; 2) the final user and their restrictions and technology background: the objective was to start from an interface that was familiar to the end user, and enhance it, to maximize the probability of its use; 3) the overall data visualization and which basic statistics matter to the user (which general overview the final user is interested in); 4) movement and region variations (intra-region, regional, and smaller variations such as K-Means); 5) the analysis of the architecture, data, database structure and queries in order to improve performance (techniques such as query optimization, indexes, table partitions); 6) the testing and integration of different libraries (sometimes not freely available, hard to integrate, or not available for different operating systems). In the prototype, for example, some Leaflet plugins were not compatible with others that we intended to add to the project.

Figure : Movements having CIC as destination (left) and CIC as origin (right).

Figure : Prototype Analysis for drop-ins in CIC, in the morning shift, for one month of data.

Figure : Prototype Analysis for / / with K-means as , using the data from Oct. th, .
The prototype is available through video (https://youtu.be/KOzFHRc7lXA) and source code (https://github.com/yussefparcianello/OrigemDestinoStpCuritiba).

Figure : Prototype Analysis with K-means as , using one month of data, with drop-ins in CIC neighborhood and drop-offs in Downtown area.

Figure : Prototype Analysis for drop-ins in CIC and Downtown Regionals, for one month of data.

Conclusion
There are several challenges related to urban mobility, mainly faced by large urban centers. Public transport supply and demand, traffic jams, travel times, and the distribution and patterns of mobility are some of the problems that impact the development and planning of a city.
In this sense, this paper presented a prototype for visualization of OD data. The novelty lies in visualization through clustering of georeferenced data, allowing the analysis of different regions of interest (neighborhoods, regionals or mathematical regions obtained with the K-means algorithm). The prototype was based on: 1) queries associated with a set of trips (and related attributes) for non-experts (Ferreira et al., ), 2) analysis of trips and services for Curitiba mentioned in Diniz Jr. ( ), 3) scalability for visualization, using clustering of georeferenced data (Vila et al., ), and 4) a local study of Origin-Destination from the Public Transportation of Curitiba (Pichiliani, ), along with interviews with local citizens. The filter options offered were established based on the analysis of data collected via a questionnaire applied to the academic community. The data was stored in a GIS database, using indexes and table partitioning in order to improve performance. The results are presented in two panels: the OD visualization in the Data Visualization panel, followed by basic statistics in the Graphics panel.
A series of case studies was used in order to understand data patterns, using one month of data from Curitiba, with an average of . . smart card entries. A preliminary evaluation with users has shown that the filters are helpful for understanding the data, and that the results were easy to understand through the interface. Within the lessons learned, we listed issues which impact not only the interface, but also the final user, technologies, architecture and database optimization.
For future work, we can mention the inclusion of complex queries, the expansion of filters (vehicle occupancy rate, average travel speed, etc.), the automation of data insertion, along with other visualizations (flowlines, displacements, speed alerts, etc.) and data (velocity alerts, route deviation, weather, points of interest, etc.).