@inproceedings{Kla2023,
  Author = {Klaas Dählmann and Timo Wolters and Christian Lüpkes and Andreas Hein},
  Title = {Syntactic correction of social data for the evaluation of new treatment options},
  Booktitle = {68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS)},
  Year = {2023},
  Month = {9},
  Publisher = {German Medical Science GMS Publishing House},
  Organization = {Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie},
  Doi = {10.3205/23gmds003},
  Url = {https://dx.doi.org/10.3205/23gmds003},
  type = {inproceedings},
  Abstract = {Introduction: The evaluation of new medical treatment options requires comparing all available information on both the established and the newly proposed treatment options. In Germany, information on patients and their therapies, so-called social data (“Sozialdaten”), is collected and stored by health insurance companies (HICs) whenever events regarding a patient are reported. Social data may be used for quality control within HICs and is therefore important to the evaluation and improvement of different or new treatment options. Because it is reported to HICs by different medical providers and over potentially long periods of time, social data is prone to problems that can often be traced back to syntactic issues [1]. This impedes the evaluation of new treatment options, as it becomes nearly impossible to assess or quantify their impact. Therefore, when evaluating specific treatment options, syntactic issues in social data must be resolved without changing the semantics of the data. This contribution addresses the question of how a general concept as well as the actual implementation of a syntactic correction system for social data of health insurance companies can be designed and realized.
State of the art: Previous work mostly focuses on the resolution of semantic, but not syntactic, issues such as mismatches between gender, given name, and gender-specific diagnoses [2] or the validation of diagnoses by matching diagnosis, medication, and treatment period [3]. Works that do address syntactic issues only mention that syntactic correction may be required, but do not suggest or evaluate any actual methods or algorithms to resolve them [4], [5].
Concept: To develop a system for syntactic correction, we first organized the different data types and formats expected in social data into categories such as medical standards, technical conventions, and project-specific formats. The resulting taxonomy is used in the evaluation to examine the distribution and correction of syntactic issues by category. The proposed algorithm is based on deductive knowledge of the data types and formats. It uses methods such as regular expressions, check digits, or feature extraction for the syntactic correction.
Implementation: The syntactic correction system is implemented as an in-place algorithm to minimize runtime and to allow for future application within an on-line approach for syntactic correction. The experimental data used to evaluate the implementation comes from about 17,000 patients of twelve different HICs in the context of follow-up care after a stroke.
Lessons learned: Nearly 2,000,000 syntactic issues are identified during correction, about 87% of which are resolved by the algorithm. This emphasizes both the need for automatic correction of syntactic issues and the general applicability and benefit of the developed approach. Moreover, by using the proposed taxonomy, we examine the resolvability of issues within each of the individual categories and show that the issues in each category are either completely or almost completely resolvable with the developed algorithm, or almost unresolvable.
For future work, we therefore suggest expanding the algorithm with more contextual information and lookup tables to tackle those issues not resolvable with the currently implemented methods.}
}